Demystifying Language with AI
Dr. Satadisha Saha Bhowmick
Preceptor
Objectives
• Field Overview
• Background
• Building blocks of text
• Preparing Text for Automation
What is Natural Language Processing?
• Language is complex, varied, ambiguous, and unstructured.
• Over 7000 known languages
• The study of automatically processing language
• Intersection of Computer Science and Linguistics
• The tough job of finding algorithmic commonalities across the
cognitive diversity of the world's many known languages.
• Two concomitant lines of work
• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)
Natural Language Processing (NLP) is a field of study primarily
focused on developing algorithms that give computers the ability to
interpret, manipulate, and comprehend human language.
What is Natural Language Processing?
• Language is complex, varied and ambiguous.
• There can be a gap between the literal sense and the contextual
sense implied by a set of words in a particular language.
• NLP is an interdisciplinary field: Computer Science and Linguistics.
• As a field, NLP focuses on investigating two concomitant lines of work,
which coexist as well as enhance each other.
• NLU: core tasks that capture the grammar and syntax of
language-specific unstructured data.
• NLG: Generate relevant content while adhering to the structural
integrity and material correctness of a language.
• Example: Conversational AI like Siri/Alexa
  • These devices need to process the human languages in which
instructions are given to them
  • They must correctly generate responses in human-understandable
language or carry out the tasks the user originally intended
Many Tasks of NLP
• Natural Language Generation
  • Machine Translation
  • Conversational AI
  • Abstractive Text Summarization
  • Question Answering
• Natural Language Understanding
  • Part-of-Speech Tagging
  • Coreference Resolution
  • Information Extraction with Named Entities
  • Named Entity Recognition and Disambiguation
  • Text Classification
  • Sentiment Analysis
  • Spam Filtering
  • Extractive Text Summarization
What does an NLP pipeline look like?
• Massive amounts of unstructured text data are available for
training and evaluation
• Text Preprocessing and Feature Engineering
• Humans label this data for a specific downstream task.
• Supervised learning: a model is trained using the labelled data
  • Machine Learning
  • Deep Learning
  • Heuristics
• Model Finetuning and Evaluation
• Deployment
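As a concrete illustration, here is a minimal sketch of such a pipeline in Python with scikit-learn (listed under Prerequisites); the texts and labels are invented for illustration.

```python
# Minimal supervised NLP pipeline sketch: raw text -> feature engineering
# (bag of words) -> model training -> prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented labelled dataset (in practice, humans label far more data).
texts = ["free money, click now", "meeting at noon tomorrow",
         "win a prize, click here", "lunch with the team today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = Pipeline([
    ("features", CountVectorizer()),   # text preprocessing + feature engineering
    ("model", LogisticRegression()),   # supervised learning on labelled data
])
pipeline.fit(texts, labels)            # model training

# Deployment-time use: predict the label of unseen text.
print(pipeline.predict(["click here to win free money"]))
```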
Prerequisites
● Linear Algebra
● Statistics
● Software Engineering: Python
  ○ Pandas
  ○ NumPy
● Machine Learning
  ○ scikit-learn
● Deep Learning
  ○ PyTorch
  ○ TensorFlow
● Information Retrieval
Building Blocks of Text
• Corpora: NLP applications rely on digitized (computer-readable)
collections of text or speech for learning
• The popular Brown corpus is a collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction,
academic, etc.)
  • Assembled at Brown University in 1963–64
• Fundamental unit of text processing:
  • Tokens/Words: the Brown corpus is a 1M-word collection
  • Sometimes sentences, depending on the application
• Example excerpt: "He stepped out into the hall, was delighted to
encounter a water brother. They picnicked by the pool, then lay back
on the grass and looked at the stars."
• How many words in this excerpt? (One way to count is sketched below.)
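The answer depends on what we choose to count; a quick sketch in plain Python, with the excerpt typed in as a string:

```python
import re

excerpt = ("He stepped out into the hall, was delighted to encounter a "
           "water brother. They picnicked by the pool, then lay back on "
           "the grass and looked at the stars.")

print(len(re.findall(r"\w+", excerpt)))          # 29 words
print(len(re.findall(r"\w+|[^\w\s]", excerpt)))  # 33 if punctuation marks count
```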
Building Blocks of Text
• Units of text processing:
• Word Types – Number of unique words in the corpus
• Set of word types is called a Vocabulary
• Number of types is the Vocabulary size |𝑉|
• Word instances or Tokens are the total number 𝑁 of running words.
• Heaps' Law – |V| = kN^β ; 0 < β < 1 (see the sketch after this list)
• β depends on corpus size and genre; typically ≈ 0.67 to 0.75
• We still have decisions to make!
• Should text be treated capitalized or uncapitalized?
• They and they might be the same word type for the latter
• Do we care about the ordering of words?
• A word can be both a noun and a verb depending on context
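To make Heaps' Law concrete, here is a small sketch that tracks |V| against N over a corpus and estimates β from the checkpoints; `corpus.txt` is a placeholder path for any large plain-text file.

```python
import math
import re

# "corpus.txt" is a placeholder: substitute any large plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"\w+", f.read().lower())  # case-folded word tokens

vocab, points = set(), []
step = max(1, len(tokens) // 10)                   # ten evenly spaced checkpoints
for n, tok in enumerate(tokens, start=1):
    vocab.add(tok)
    if n % step == 0:
        points.append((n, len(vocab)))
        print(f"N={n}  |V|={len(vocab)}")

# log|V| = log k + β·log N, so a log-log plot of the checkpoints is roughly
# linear; estimate β from the first and last points.
(n1, v1), (n2, v2) = points[0], points[-1]
beta = (math.log(v2) - math.log(v1)) / (math.log(n2) - math.log(n1))
print(f"estimated beta ~ {beta:.2f}")              # typically ~0.67-0.75
```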
Building Blocks of Text
• Units of text processing:
  • Lemma – a set of lexical forms having the same stem, the same
part-of-speech, and the same word sense
  • Wordform – the full inflected or derived form of the word
  • Example: Seuss's cat in the hat is different from other cats!
Same lemma (cat), different wordforms (cat, cats).
  • Wordforms are sufficient for text processing in English, but
morphologically complex languages like Arabic require lemmatization!
• For most text processing applications, we do not use words as the unit
of computation.
  • Tokenize input strings into tokens!
  • Tokens could be words or only parts of words, i.e., subwords.
Text Preprocessing and Normalization
• Every NLP task requires preprocessing and normalization
• Tokenizing words
• Normalizing word formats
• Segmenting Sentences
• Tokenization
• Space-based tokenization - Segment off a token between instances of spaces
• Simple and effective for languages written in scripts that place spaces
between words: Arabic, Cyrillic, Greek, Latin, etc.
• Issues in Tokenization (see the sketch after this list)
  • We cannot blindly remove punctuation, so how do we deal with cases like these?
    • Ph.D., AT&T, $45.55, 01/02/1996
  • Should multiword expressions be single words?
    • New York, rock 'n' roll
  • Clitic: a word form that doesn't stand on its own
    • 're in we're (= are)
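A short sketch of space-based tokenization and where it breaks on the cases above; the sample sentence and regex patterns are illustrative, not a standard:

```python
import re

text = "Ph.D. students at AT&T paid $45.55; we're in New York."

# Space-based tokenization: simple, but punctuation stays attached ("$45.55;").
print(text.split())

# Blindly stripping punctuation mangles Ph.D., AT&T, and $45.55.
print(re.findall(r"[A-Za-z]+", text))

# Splitting off clitics: "we're" -> "we" + "'re"; but "Ph.D." still shatters,
# and the multiword expression "New York" remains two tokens.
print(re.findall(r"\w+|'\w+|[^\w\s]", text))
```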
Tokenization in languages
• Many languages like Chinese or Japanese do not use spaces to separate words: Linguistic knowledge matters!
• Chinese:
• Chinese words are composed of characters called "hanzi" (or sometimes just "zi")
• Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
• 姚明进入总决赛 : 姚明 || 进入 || 总决赛 : "Yao Ming reaches the finals"
• Other languages require even more complex segmentation
  • Neural sequence models trained by supervised machine learning are required.
  • The data tells us how to tokenize.
• Subword tokenization: tokens can be parts of words as well as whole words
  • Byte-Pair Encoding (see the sketch below)
  • Unigram Language Modeling
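A compact sketch of the core Byte-Pair Encoding training loop, learning merges from a toy word-frequency table (the corpus and merge count are invented for illustration):

```python
from collections import Counter

# Toy corpus as word frequencies; words are sequences of symbols, ending
# with an end-of-word marker "_".
vocab = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
         ("n", "e", "w", "e", "s", "t", "_"): 6, ("w", "i", "d", "e", "s", "t", "_"): 3}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(8):                  # learn 8 merges
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print(step, pair)                  # first merge here: ('e', 's')
```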
Word Normalization Issues
• Words/tokens need to be in a standard format
• Examples: U.S.A. or USA, or am/is/be/are
• Case Folding: Reduce all letters to lower case.
• Some Exceptions: General Motors, US vs us, SAIL vs sail
• Capitalization useful in some language-specific applications: Named Entity Recognition in English
• Lemmatization: Represent all words as their shared root or lemma (see the sketch after this list)
• Examples:
  • am/is/be/are → be
  • He is reading detective stories → He be read detective story
• Lemmatization is done by Morphological Parsing
• Morphemes are small meaningful units that make up words
• Stems: the meaning-bearing units; Affixes: prefixes/suffixes, often with grammatical functions
• Example: parse cats into two morphemes, cat and s
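A minimal lemmatization sketch using NLTK's WordNetLemmatizer (assumes `nltk` is installed and can download the WordNet data):

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Irregular verb forms collapse onto their lemma once we supply the
# part of speech ("v" for verb).
for form in ["am", "is", "are", "reading"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))  # be, be, be, read

# Nouns: plural -> singular.
print(lemmatizer.lemmatize("stories", pos="n"))             # story
```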
Word Normalization Issues
• Dealing with morphology can be complex in many languages.
• Stemming: Reduce terms to stems, chopping off affixes crudely
  • Produces unrecognizable words as tokens.
• Porter Stemmer: rule-based stemming (see the sketch after this list)
  • ATIONAL → ATE (e.g., relational → relate)
  • ING → ∅ if stem contains a vowel (e.g., motoring → motor)
  • SSES → SS (e.g., grasses → grass)
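The Porter rules are implemented in NLTK; a quick sketch showing both the rules firing and the unrecognizable-token caveat (the word list is invented):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "grasses", "university"]:
    print(word, "->", stemmer.stem(word))

# The ING and SSES rules fire as above (motoring -> motor, grasses -> grass),
# but stems are often not real words (e.g., "university" -> "univers").
```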
Sentence Segmentation
• Sentences are the contextual unit of text: context is important to
retain, and individual words don't carry it.
• ! and ? are mostly unambiguous, but the period "." is very ambiguous:
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Common algorithm: tokenize first, then use rules or ML to classify each
period as either (a) part of the word or (b) a sentence boundary (see the
sketch after this list).
  • An abbreviation dictionary can help.
• Sentence segmentation can then often be done by rules based on this
tokenization.
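A toy version of that rule-based approach; the abbreviation dictionary and test sentence are illustrative only:

```python
import re

# Toy abbreviation dictionary; a real system would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "ms.", "prof.", "inc.", "e.g.", "i.e."}

def segment(text):
    """Classify each final period: part of the token vs. a sentence boundary."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            is_abbrev = tok.lower() in ABBREVIATIONS
            is_number = re.fullmatch(r"[\d.,%$]+", tok) is not None  # 4.3, .02%
            if not (is_abbrev or is_number):
                sentences.append(" ".join(current))
                current = []
    if current:                         # trailing material without a boundary
        sentences.append(" ".join(current))
    return sentences

# Note: an abbreviation at the end of a sentence ("... Acme Inc.") still fools us.
print(segment("Dr. Smith lost .02% of $4.3 billion. She told Acme Inc. to sell."))
```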