Demystifying Language with AI
Dr. Satadisha Saha Bhowmick
Preceptor
Objectives
• Field Overview
• Background
• Building blocks of text
• Preparing Text for Automation
What is Natural Language Processing?
• Language is complex, varied, ambiguous, and unstructured.
• Over 7000 known languages
• The study of automatically processing language
• Intersection of Computer Science and Linguistics
• The tough job of finding algorithmic commonalities across the
cognitive diversity of the world's many known languages.
• Two concomitant lines of work
• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)
Natural Language Processing (NLP) is a field of study primarily
focused on developing algorithms that give computers the ability to
interpret, manipulate, and comprehend human language.
What is Natural Language Processing?
• Language is complex, varied and ambiguous.
• There can be a gap between the literal sense and the contextual
sense implied by a set of words in a particular language.
• NLP is an interdisciplinary field: Computer Science and Linguistics.
• As a field, NLP focuses on investigating two concomitant lines of work,
which coexist as well as enhance each other.
• NLU: core tasks that capture the grammar and syntax of
language-specific unstructured data.
• NLG: Generate relevant content while adhering to the structural
integrity and material correctness of a language.
• Example: Conversational AI like Siri/Alexa
  • These devices need to process the human languages in which
instructions are given to them
  • They must correctly generate responses in human-understandable
language or carry out the tasks the user originally intended
Many Tasks of NLP
• Natural Language Generation
  • Machine Translation
  • Conversational AI
  • Abstractive Text Summarization
  • Question Answering
• Natural Language Understanding
  • Part-of-Speech Tagging
  • Coreference Resolution
  • Information Extraction with Named Entities
  • Named Entity Recognition and Disambiguation
  • Text Classification
  • Sentiment Analysis
  • Spam Filtering
  • Extractive Text Summarization
What does an NLP pipeline look like?
• Massive amounts of unstructured text data are available for
training and evaluation
• Text Preprocessing and Feature Engineering
• Humans label this data for a specific downstream task.
• Supervised learning: a model is trained using the labelled data
  • Machine Learning
  • Deep Learning
  • Heuristics
• Model Finetuning and Evaluation
• Deployment
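As a concrete illustration, here is a minimal sketch of such a pipeline in Python with scikit-learn (listed under Prerequisites); the texts and labels are invented for illustration.

```python
# Minimal supervised NLP pipeline sketch: raw text -> feature engineering
# (bag of words) -> model training -> prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented labelled dataset (in practice, humans label far more data).
texts = ["free money, click now", "meeting at noon tomorrow",
         "win a prize, click here", "lunch with the team today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = Pipeline([
    ("features", CountVectorizer()),   # text preprocessing + feature engineering
    ("model", LogisticRegression()),   # supervised learning on labelled data
])
pipeline.fit(texts, labels)            # model training

# Deployment-time use: predict the label of unseen text.
print(pipeline.predict(["click here to win free money"]))
```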
Prerequisites
● Linear Algebra
● Statistics
● Software Engineering: Python
  ○ Pandas
  ○ NumPy
● Machine Learning
  ○ scikit-learn
● Deep Learning
  ○ PyTorch
  ○ TensorFlow
● Information Retrieval
Building Blocks of Text
• Corpora: NLP applications rely on digitized (computer-readable)
collections of text or speech for learning
• The popular Brown corpus is a collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction,
academic, etc.)
  • Assembled at Brown University in 1963–64
• Fundamental unit of text processing:
  • Tokens/Words: the Brown corpus is a 1M-word collection
  • Sometimes sentences, depending on the application
• Example excerpt: "He stepped out into the hall, was delighted to
encounter a water brother. They picnicked by the pool, then lay back
on the grass and looked at the stars."
• How many words in this excerpt? (One way to count is sketched below.)
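The answer depends on what we choose to count; a quick sketch in plain Python, with the excerpt typed in as a string:

```python
import re

excerpt = ("He stepped out into the hall, was delighted to encounter a "
           "water brother. They picnicked by the pool, then lay back on "
           "the grass and looked at the stars.")

print(len(re.findall(r"\w+", excerpt)))          # 29 words
print(len(re.findall(r"\w+|[^\w\s]", excerpt)))  # 33 if punctuation marks count
```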
Building Blocks of Text
• Units of text processing:
• Word Types – Number of unique words in the corpus
• Set of word types is called a Vocabulary
• Number of types is the Vocabulary size |𝑉|
• Word instances or Tokens are the total number 𝑁 of running words.
• Heaps' Law – |V| = kN^β ; 0 < β < 1 (see the sketch after this list)
• β depends on corpus size and genre; typically ≈ 0.67 to 0.75
• We still have decisions to make!
• Should text be treated capitalized or uncapitalized?
• They and they might be the same word type for the latter
• Do we care about the ordering of words?
• A word can be both a noun and a verb depending on context
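To make Heaps' Law concrete, here is a small sketch that tracks |V| against N over a corpus and estimates β from the checkpoints; `corpus.txt` is a placeholder path for any large plain-text file.

```python
import math
import re

# "corpus.txt" is a placeholder: substitute any large plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"\w+", f.read().lower())  # case-folded word tokens

vocab, points = set(), []
step = max(1, len(tokens) // 10)                   # ten evenly spaced checkpoints
for n, tok in enumerate(tokens, start=1):
    vocab.add(tok)
    if n % step == 0:
        points.append((n, len(vocab)))
        print(f"N={n}  |V|={len(vocab)}")

# log|V| = log k + β·log N, so a log-log plot of the checkpoints is roughly
# linear; estimate β from the first and last points.
(n1, v1), (n2, v2) = points[0], points[-1]
beta = (math.log(v2) - math.log(v1)) / (math.log(n2) - math.log(n1))
print(f"estimated beta ~ {beta:.2f}")              # typically ~0.67-0.75
```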
Building Blocks of Text
• Units of text processing:
  • Lemma – a set of lexical forms having the same stem, the same
part-of-speech, and the same word sense
  • Wordform – the full inflected or derived form of the word
  • Example: Seuss's cat in the hat is different from other cats!
Same lemma (cat), different wordforms (cat, cats).
  • Wordforms are sufficient for text processing in English, but
morphologically complex languages like Arabic require lemmatization!
• For most text processing applications, we do not use words as the unit
of computation.
  • Tokenize input strings into tokens!
  • Tokens could be words or only parts of words, i.e., subwords.
Text Preprocessing and Normalization
• Every NLP task requires preprocessing and normalization
• Tokenizing words
• Normalizing word formats
• Segmenting Sentences
• Tokenization
• Space-based tokenization - Segment off a token between instances of spaces
• Simple and effective for languages written in scripts that place spaces
between words: Arabic, Cyrillic, Greek, Latin, etc.
• Issues in Tokenization (see the sketch after this list)
  • We cannot blindly remove punctuation, so how do we deal with cases like these?
    • Ph.D., AT&T, $45.55, 01/02/1996
  • Should multiword expressions be single words?
    • New York, rock 'n' roll
  • Clitic: a word form that doesn't stand on its own
    • 're in we're (= are)
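A short sketch of space-based tokenization and where it breaks on the cases above; the sample sentence and regex patterns are illustrative, not a standard:

```python
import re

text = "Ph.D. students at AT&T paid $45.55; we're in New York."

# Space-based tokenization: simple, but punctuation stays attached ("$45.55;").
print(text.split())

# Blindly stripping punctuation mangles Ph.D., AT&T, and $45.55.
print(re.findall(r"[A-Za-z]+", text))

# Splitting off clitics: "we're" -> "we" + "'re"; but "Ph.D." still shatters,
# and the multiword expression "New York" remains two tokens.
print(re.findall(r"\w+|'\w+|[^\w\s]", text))
```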
Tokenization in languages
• Many languages like Chinese or Japanese do not use spaces to separate words: Linguistic knowledge matters!
• Chinese:
• Chinese words are composed of characters called "hanzi" (or sometimes just "zi")
• Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
• 姚明进入总决赛 : 姚明 || 进入 || 总决赛 : "Yao Ming reaches the finals"
• Other languages require even more complex segmentation
  • Neural sequence models trained by supervised machine learning are required.
  • The data tells us how to tokenize.
• Subword tokenization: tokens can be parts of words as well as whole words
  • Byte-Pair Encoding (see the sketch below)
  • Unigram Language Modeling
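A compact sketch of the core Byte-Pair Encoding training loop, learning merges from a toy word-frequency table (the corpus and merge count are invented for illustration):

```python
from collections import Counter

# Toy corpus as word frequencies; words are sequences of symbols, ending
# with an end-of-word marker "_".
vocab = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
         ("n", "e", "w", "e", "s", "t", "_"): 6, ("w", "i", "d", "e", "s", "t", "_"): 3}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(8):                  # learn 8 merges
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print(step, pair)                  # first merge here: ('e', 's')
```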
Word Normalization Issues
• Words/tokens need to be in a standard format
• Examples: U.S.A. or USA, or am/is/be/are
• Case Folding: Reduce all letters to lower case.
• Some Exceptions: General Motors, US vs us, SAIL vs sail
• Capitalization useful in some language-specific applications: Named Entity Recognition in English
• Lemmatization: Represent all words as their shared root or lemma (see the sketch after this list)
• Examples:
  • am/is/be/are → be
  • He is reading detective stories → He be read detective story
• Lemmatization is done by Morphological Parsing
• Morphemes are small meaningful units that make up words
• Stems: the meaning-bearing units; Affixes: prefixes/suffixes, often with grammatical functions
• Example: parse cats into two morphemes, cat and s
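A minimal lemmatization sketch using NLTK's WordNetLemmatizer (assumes `nltk` is installed and can download the WordNet data):

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Irregular verb forms collapse onto their lemma once we supply the
# part of speech ("v" for verb).
for form in ["am", "is", "are", "reading"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))  # be, be, be, read

# Nouns: plural -> singular.
print(lemmatizer.lemmatize("stories", pos="n"))             # story
```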
Word Normalization Issues
• Dealing with morphology can be complex in many languages.
• Stemming: Reduce terms to stems, chopping off affixes crudely
  • Produces unrecognizable words as tokens.
• Porter Stemmer: rule-based stemming (see the sketch after this list)
  • ATIONAL → ATE (e.g., relational → relate)
  • ING → ∅ if stem contains a vowel (e.g., motoring → motor)
  • SSES → SS (e.g., grasses → grass)
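The Porter rules are implemented in NLTK; a quick sketch showing both the rules firing and the unrecognizable-token caveat (the word list is invented):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "grasses", "university"]:
    print(word, "->", stemmer.stem(word))

# The ING and SSES rules fire as above (motoring -> motor, grasses -> grass),
# but stems are often not real words (e.g., "university" -> "univers").
```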
Sentence Segmentation
• Sentences are the contextual unit of text: context is important to
retain, and individual words don't carry it.
• ! and ? are mostly unambiguous, but the period "." is very ambiguous:
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Common algorithm: tokenize first, then use rules or ML to classify each
period as either (a) part of the word or (b) a sentence boundary (see the
sketch after this list).
  • An abbreviation dictionary can help.
• Sentence segmentation can then often be done by rules based on this
tokenization.
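A toy version of that rule-based approach; the abbreviation dictionary and test sentence are illustrative only:

```python
import re

# Toy abbreviation dictionary; a real system would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "ms.", "prof.", "inc.", "e.g.", "i.e."}

def segment(text):
    """Classify each final period: part of the token vs. a sentence boundary."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            is_abbrev = tok.lower() in ABBREVIATIONS
            is_number = re.fullmatch(r"[\d.,%$]+", tok) is not None  # 4.3, .02%
            if not (is_abbrev or is_number):
                sentences.append(" ".join(current))
                current = []
    if current:                         # trailing material without a boundary
        sentences.append(" ".join(current))
    return sentences

# Note: an abbreviation at the end of a sentence ("... Acme Inc.") still fools us.
print(segment("Dr. Smith lost .02% of $4.3 billion. She told Acme Inc. to sell."))
```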