Introduction To NLP

The document provides an overview of Natural Language Processing (NLP), highlighting its complexity and the intersection of computer science and linguistics. It discusses the two main components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG), along with various tasks and the NLP pipeline. Additionally, it covers essential prerequisites and the building blocks of text processing, including tokenization and normalization techniques.


Demystifying Language with AI
Dr. Satadisha Saha Bhowmick
Preceptor
Objectives
• Field Overview
• Background
• Building Blocks of Text
• Preparing Text for Automation
What is Natural Language Processing?
• Language is complex, varied, ambiguous, and unstructured.
• Over 7,000 known languages
• The study of automating language
• Intersection of Computer Science and Linguistics
• The tough job of finding algorithmic commonalities among the cognitive diversity of the many known languages of the world
• Two concomitant lines of work:
• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)

Natural Language Processing (NLP) is a field of study primarily focused on developing algorithms that give computers the ability to interpret, manipulate, and comprehend human language.
What is Natural Language Processing?
• Language is complex, varied, and ambiguous.
• There can be a gap between the literal sense and the contextual sense implied by a set of words in a particular language.
• NLP is an interdisciplinary field: Computer Science and Linguistics.
• NLP focuses on two concomitant lines of work that coexist and enhance each other:
• NLU: core tasks that capture the grammar and syntax of language-specific unstructured data.
• NLG: generating relevant content while adhering to the structural integrity and material correctness of a language.
• Example: conversational AI like Siri/Alexa
• Must process the human languages in which instructions are given to these devices
• Must correctly generate responses in human-understandable language, or carry out the tasks the user originally intended
Many Tasks of NLP
• Natural Language Generation
• Machine Translation
• Conversational AI
• Abstractive Text Summarization
• Question Answering
• Natural Language Understanding
• Part-of-Speech Tagging
• Coreference Resolution
• Information Extraction with Named Entities
• Named Entity Recognition and Disambiguation
• Text Classification
• Sentiment Analysis
• Spam Filtering
• Extractive Text Summarization
What does an NLP pipeline look like?
• Massive amounts of unstructured text data available for training and evaluation
• Text Preprocessing and Feature Engineering
• Humans label this data for a specific downstream task.
• Supervised learning: a model trained using labelled data
• Machine Learning
• Deep Learning
• Heuristics
• Model Fine-tuning and Evaluation
• Deployment
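The pipeline stages above can be sketched end to end with scikit-learn (listed under Prerequisites). The toy texts, labels, and model choices here are illustrative assumptions, not from the slides:

```python
# A minimal supervised NLP pipeline: feature engineering (TF-IDF),
# training on human-labelled data, and prediction on new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data (in practice: a large corpus labelled by humans)
texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# Preprocessing/featurization and the model chained in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# "Deployment": apply the trained pipeline to unseen text
preds = model.predict(["loved the acting", "boring movie"])
print(preds)
```

A real system would add a held-out evaluation split and fine-tuning before deployment; this sketch only shows how the stages compose.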
Prerequisites
● Linear Algebra
● Statistics
● Software Engineering: Python
○ Pandas
○ NumPy
● Machine Learning
○ scikit-learn
● Deep Learning
○ PyTorch
○ TensorFlow
● Information Retrieval
Building Blocks of Text
• Corpora: NLP applications rely on digitized (computer-readable) collections of text or speech for learning
• The popular Brown corpus is a collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.)
• Assembled at Brown University in 1963–64
• Fundamental unit of text processing:
• Tokens/Words: the Brown corpus is a 1M-word collection
• Sometimes sentences, depending on the application

Example excerpt: "He stepped out into the hall, was delighted to encounter a water brother. They picnicked by the pool, then lay back on the grass and looked at the stars." How many words in this excerpt?
Building Blocks of Text
• Units of text processing:
• Word Types – the number of unique words in the corpus
• The set of word types is called a Vocabulary
• The number of types is the vocabulary size |V|
• Word instances, or Tokens, are the total number N of running words.
• Heaps' Law – |V| = kN^β; 0 < β < 1
• β depends on corpus size and genre; typically ≈ 0.67 to 0.75
• We still have decisions to make!
• Should text be treated as capitalized or uncapitalized?
• "They" and "they" might be the same word type in the latter case
• Do we care about the ordering of words?
• A word can be both a noun and a verb depending on context
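The type/token distinction can be computed directly on the slide's excerpt; the Heaps' law constants k and β below are illustrative assumptions (they must be fitted per corpus):

```python
# Count word tokens (running words, N) and word types (|V|) in a text.
text = ("He stepped out into the hall, was delighted to encounter "
        "a water brother. They picnicked by the pool, then lay back "
        "on the grass and looked at the stars.")

tokens = [w.strip(".,").lower() for w in text.split()]  # crude tokenization
types = set(tokens)                                     # the vocabulary

N = len(tokens)   # number of tokens
V = len(types)    # vocabulary size |V|
print(N, V)       # 29 26  ("the" occurs four times)

# Heaps' law: |V| ≈ k * N**beta with 0 < beta < 1 (typically ~0.67–0.75).
# The law describes large corpora; a 29-token excerpt is far below that regime.
k, beta = 10, 0.7  # illustrative constants
print(round(k * N ** beta, 1))
```

Note that lowercasing already answers one of the "decisions" on the slide: here "He" and "he" would count as one type.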
Building Blocks of Text
• Units of text processing:
• Lemma – a set of lexical forms having the same stem, the same part-of-speech, and the same word sense
• Wordform – the full inflected or derived form of the word
• Wordforms are sufficient for text processing in English, but morphologically complex languages like Arabic require lemmatization!
• For most text processing applications, we do not use words as the unit of computation.
• Tokenize input strings into tokens!
• These could be words or only parts of words, i.e., subwords.

Example: "Seuss's cat in the hat is different from other cats!" – same lemma (cat), different wordforms (cat, cats).
Text Preprocessing and Normalization
• Every NLP task requires preprocessing and normalization
• Tokenizing words
• Normalizing word formats
• Segmenting sentences
• Tokenization
• Space-based tokenization – segment off a token between instances of spaces
• Simple and effective for languages written in scripts that put spaces between words: Arabic, Cyrillic, Greek, Latin, etc.
• Issues in Tokenization
• We cannot blindly remove punctuation, so how do we deal with cases like:
• Ph.D., AT&T, $45.55, 01/02/1996
• Should multiword expressions be single words?
• New York, rock 'n' roll
• Clitic: a word that doesn't stand on its own
• "'re" in "we're" (a contracted form of are)
Tokenization in languages
• Many languages, like Chinese or Japanese, do not use spaces to separate words: linguistic knowledge matters!
• Chinese:
• Chinese words are composed of characters called "hanzi" (or sometimes just "zi")
• Each character represents a meaning unit called a morpheme; each word has on average 2.4 of them.
• 姚明进⼊总决赛 : 姚明 || 进⼊ || 总决赛 : "Yao Ming reaches the finals"
• Other languages require more complex segmentation
• Neural sequence models trained by supervised machine learning are required.
• The data tells us how to tokenize.
• Subword tokenization: tokens can be parts of words as well as whole words
• Byte-Pair Encoding (BPE)
• Unigram Language Modeling
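The Byte-Pair Encoding idea, learning subwords by repeatedly merging the most frequent adjacent symbol pair, can be sketched as follows; the toy corpus and the number of merges are illustrative choices:

```python
from collections import Counter

# Toy BPE: words start as character sequences plus an end-of-word marker,
# and the most frequent adjacent pair is merged into one symbol each round.
corpus = ["low", "lower", "lowest", "newer", "wider"]
words = Counter(tuple(w) + ("</w>",) for w in corpus)

def most_frequent_pair(words):
    pairs = Counter()
    for seq, freq in words.items():
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(words, pair):
    merged = {}
    for seq, freq in words.items():
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1]); i += 2
            else:
                out.append(seq[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                 # 5 merges, chosen arbitrarily
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print("merged:", pair)

symbols = {s for seq in words for s in seq}
print(sorted(symbols))             # learned subword vocabulary
```

After a few merges, frequent fragments such as "low" and the suffix "er</w>" emerge as single subword tokens, which is exactly why BPE handles rare and unseen words gracefully.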
Word Normalization Issues
• Words/tokens need to be in a standard format
• Examples: U.S.A. or USA; am/is/be/are
• Case Folding: reduce all letters to lower case
• Some exceptions: General Motors, US vs. us, SAIL vs. sail
• Capitalization is useful in some language-specific applications: Named Entity Recognition in English
• Lemmatization: represent all words by their shared root or lemma
• Examples:
• am/is/be/are → be
• He is reading detective stories → He be read detective story
• Lemmatization is done by Morphological Parsing
• Morphemes are the smallest meaningful units that make up words
• Stems: the meaning-bearing unit; Affixes: prefixes/suffixes, often with grammatical functions
• Example: parse cats into two morphemes, cat and s
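A minimal sketch of case folding plus lookup-based lemmatization on the slide's example; the tiny lemma table is an illustrative assumption (real systems use a morphological parser or a lexical resource such as WordNet):

```python
# Case folding + dictionary lookup to map wordforms to lemmas.
LEMMAS = {"am": "be", "is": "be", "are": "be",    # irregular verb forms
          "reading": "read", "stories": "story"}  # inflected forms

def normalize(sentence):
    tokens = sentence.lower().split()             # case folding
    return [LEMMAS.get(t, t) for t in tokens]     # replace known wordforms

result = normalize("He is reading detective stories")
print(result)  # ['he', 'be', 'read', 'detective', 'story']
```

Note the case-folding caveat from the slide applies here too: lowercasing "US" or "SAIL" would destroy information a Named Entity Recognizer needs.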
Word Normalization Issues
• Dealing with morphology can be complex in many languages.
• Stemming: reduce terms to stems by crudely chopping off affixes
• Porter Stemmer: rule-based stemming
• ATIONAL → ATE (e.g., relational → relate)
• ING → ε if the stem contains a vowel (e.g., motoring → motor)
• SSES → SS (e.g., grasses → grass)
• Note: stemming can produce unrecognizable words as tokens.
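The three Porter-style rules on the slide can be written directly as string rewrites; this is a sketch of just those rules, not the full Porter stemmer (which has many more rules and conditions):

```python
import re

def stem(word):
    """Apply the three rewrite rules shown on the slide."""
    if word.endswith("ational"):
        return word[:-7] + "ate"          # ATIONAL -> ATE
    if word.endswith("sses"):
        return word[:-2]                  # SSES -> SS
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]                  # ING -> (deleted) if stem has a vowel
    return word

results = [stem(w) for w in ["relational", "motoring", "grasses", "sing"]]
print(results)  # ['relate', 'motor', 'grass', 'sing']
```

The last example shows why the vowel condition exists: "sing" keeps its "ing" because the remaining stem "s" has no vowel.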
Sentence Segmentation
• Sentences are the contextual unit of text: context is important to retain, and individual words don't carry it.
• "!" and "?" are mostly unambiguous, but the period "." is very ambiguous:
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Common algorithm: tokenize first, then use rules or ML to classify each period as either (a) part of the word or (b) a sentence boundary.
• An abbreviation dictionary can help.
• Sentence segmentation can then often be done by rules based on this tokenization.
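The rule-based approach with an abbreviation dictionary can be sketched as follows; the dictionary contents and the example text are illustrative, and real segmenters (rule-based or ML) handle many more cases:

```python
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Inc.", "U.S.A."}  # illustrative

def segment(text):
    """Tokenize on spaces, then treat a final period as a sentence
    boundary unless the token is a known abbreviation or a number."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if (token.endswith(".")
                and token not in ABBREVIATIONS
                and not re.fullmatch(r"[\d.%]+", token)):  # e.g. "4.3", ".02%"
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

out = segment("Dr. Smith joined Acme Inc. last year. "
              "Growth hit 4.3 points. Great.")
print(out)
```

Here the periods in "Dr." and "Inc." are classified as part of the word via the dictionary, while the period after "year" is classified as a boundary, mirroring the (a)/(b) decision described above.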