Natural Language Processing

CSE4022

Lecture-04: Text Processing - Module 2

Dr. Durgesh Kumar

Assistant Professor, SCOPE, VIT Vellore
Table of contents

1 Recap from previous Module

2 Text Processing Module outline

3 Character Encoding types

Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022

Recap from previous Module- Intro to NLP

Module 1 Summary

What is NLP?
Stages of NLP
Disambiguation in NLP
Challenges of NLP
Categories of models and algorithms in NLP
Real world Applications of NLP
Heroes of NLP and related online courses.


Text Processing Introduction

Text Processing
The task of converting a raw text file, essentially a sequence of digital
bits, into a well-defined sequence of linguistically meaningful units.

1 Grapheme: the smallest unit of writing that corresponds to a sound
(more accurately, a phoneme).
2 Phoneme: a unit of sound that can distinguish one word from
another in a particular language.
3 Word: consists of one or more characters.
4 Sentence: consists of one or more words.


Text Processing Introduction (Contd.)

At the lowest level, characters represent the individual graphemes of a
language’s writing system; words and sentences are built up from them.
Text pre-processing is an essential part of any NLP system.
The characters, words, and sentences identified at this stage are the
fundamental units passed to all further processing stages, such as
document analysis, named entity recognition (NER), part-of-speech (POS)
tagging, sentiment analysis, and information retrieval.


Progress of NLP (Text Processing)


Text Processing - Challenges & Need

Natural language contains inherent ambiguities, and writing systems
both amplify them and introduce new ones.
The explosion in corpus size and variety has necessitated techniques
for automatically harvesting and preparing text corpora for various
NLP tasks.


Outline of Text-processing

Figure: Text Processing types


Text Processing

Text pre-processing: converting a sequence of digital bits into a
well-defined sequence of linguistically meaningful units.
Text pre-processing can be divided into two stages:
1 Document triage: the process of converting a set of digital files into
well-defined text documents.
2 Text segmentation: the process of converting a well-defined text corpus
into its component words and sentences.


Text Processing - Document Triage
In order for any natural language document to be machine readable, its
characters must be represented in a character encoding, in which one or
more bytes in a file map to a known character.
1 Character encoding identification determines the character encoding (or
encodings) for any file and optionally converts between encodings.
2 Language identification determines the natural language of a
document; this step is closely linked to, but not uniquely determined
by, the character encoding.
3 Text sectioning identifies the actual content within a file while
discarding undesirable elements, such as images, tables, headers,
links, and HTML formatting.
The output of the document triage stage is a well-defined text corpus,
organized by language, suitable for text segmentation and further
analysis.
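
The triage steps above can be sketched with the Python standard library. This is a minimal illustration under assumptions, not a complete pipeline: the names `TextExtractor` and `triage`, the sample bytes, and the default encoding are hypothetical, and language identification is left out because it usually needs a trained model.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def triage(raw_bytes, encoding="utf-8"):
    # Character encoding step: turn bytes into characters.
    text = raw_bytes.decode(encoding, errors="replace")
    # Text sectioning step: keep the content, drop the markup.
    extractor = TextExtractor()
    extractor.feed(text)
    extractor.close()
    return " ".join(extractor.chunks)


raw = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Título</h1><p>Hola, mundo.</p></body></html>"
).encode("utf-8")
print(triage(raw))
```

The output is plain text that a segmentation stage could consume; a real system would also need to handle tables, navigation links, and documents whose encoding is not known in advance.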
Text Processing - Text segmentation
1 Word segmentation breaks up the sequence of characters in a text
by locating the word boundaries. In computational linguistics, words
are often referred to as tokens, and word segmentation as
tokenization.
2 Text normalization is a related step that involves merging different
written forms of a token into a canonical normalized form; e.g.
convert tokens “Mr.”, “Mr”, “mister”, and “Mister” to a single
normalized form.
3 Sentence segmentation is the process of determining the longer
processing units consisting of one or more words. This task involves
identifying sentence boundaries between words in different
sentences. It is also known as sentence boundary detection, sentence
boundary disambiguation, or sentence boundary recognition.
=⇒ Most written languages have punctuation marks that occur at sentence boundaries.
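
The three segmentation steps can be illustrated with a few regular expressions. This is a deliberately naive sketch: the function names, the `CANONICAL` table, and the sample sentence are assumptions made for illustration, and real tokenizers use far richer rules.

```python
import re

# Hypothetical normalization table: variant written forms -> canonical form.
CANONICAL = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}


def tokenize(text):
    # A token is the abbreviation "Mr.", a run of word characters,
    # or a single punctuation mark.
    return re.findall(r"Mr\.|\w+|[^\w\s]", text)


def normalize(token):
    # Map known variants to their canonical form; leave other tokens alone.
    return CANONICAL.get(token.lower(), token)


def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? when followed by
    # whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


text = "Mister Smith arrived. He met Mr Jones! Was it late?"
tokens = [normalize(t) for t in tokenize(text)]
sentences = split_sentences(text)
print(tokens)
print(sentences)
```

The sentence rule would wrongly split "He met Mr. Jones" after "Mr.", because the period there marks an abbreviation rather than a boundary. That ambiguity is why the task is also called sentence boundary disambiguation.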
Module 2 outline- from Syllabus

Character Encoding: in the linguistic analysis of a digital natural
language text, it is necessary to clearly define the characters, words,
and sentences in any document.
Word Segmentation
Sentence Segmentation
Intro to Corpora
Corpora Analysis


Challenges of Text Processing

Types of Writing systems

1 Logographic: individual symbols represent words; large number
(often thousands) of symbols. e.g.: Chinese.
2 Syllabic: individual symbols represent syllables. e.g.: Japanese kana.
3 Alphabetic: individual symbols represent sounds. e.g.: English.

Syllabic and alphabetic systems typically have fewer than 100 symbols.


Challenges of Text Processing

Syllables

A syllable is a unit of spoken language that is the next size up from a single speech sound.
It is a sequence of speech sounds (formed from vowels and consonants)
organized into a single unit.
It consists of one or more vowel sounds alone, or of a syllabic consonant
alone, or of either with one or more consonant sounds preceding or
following.
Syllables act as the building blocks of a spoken word, determining the pace
and rhythm of how the word is pronounced.


Challenges of Text Processing
Table: Syllable examples

Suffix         Word          Written syllables   Spoken syllables
“-ing” words   alarming      a·larm·ing          uh-lahr-ming
               asking        ask·ing
               baking        bak·ing
               eating        eat·ing
               learning      learn·ing
               thinking      think·ing
“-ize” words   atomize       at·om·ize
               customize     cus·tom·ize
               internalize   in·ter·nal·ize
               moisturize    mois·tur·ize

See here for details

Challenges of Text Processing

Different types of writing systems, e.g. logographic, syllabic,
alphabetic.
The majority of all written languages use an alphabetic or syllabic
system [Comrie et al. 1996].
Modern writing systems typically employ more than one type.
English predominantly uses the Roman script (an alphabetic writing
system), but also utilizes logographic symbols:
Arabic numerals (0-9)
Currency symbols: dollar ($), euro (€), rupee (₹), pound (£), yen
(¥)
Other symbols (%, &, #)
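
One way to see this mix of symbol types in running text is to look up each character's Unicode general category. The sketch below uses the standard `unicodedata` module; the helper name `symbol_kind` and its coarse labels are assumptions chosen for illustration.

```python
import unicodedata


def symbol_kind(ch):
    """Map a character's Unicode general category to a coarse label."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    if category == "Sc":
        return "currency symbol"
    return "other symbol or punctuation"


for ch in "A7$€₹£¥%&#":
    print(ch, unicodedata.category(ch), symbol_kind(ch))
```

A tokenizer can use such categories to decide, for example, that "$" in "$25" should be kept apart from the digits that follow it.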


Challenges of Text Processing

Challenges of text processing fall into four categories:

1 Character set dependence
(a) Character encoding types
(b) Character encoding identification & its implication on tokenization.

2 Language dependence
3 Corpus dependence
4 Application dependence


1(A). Character Encoding types

ASCII encoding (7-bit)

Extended ASCII encoding (8-bit)
ISCII encoding (8-bit)
16-bit encodings (for Chinese and Japanese)
Unicode 5.0
UTF-8: variable-length encoding


1(A). Character Encoding Types - ASCII (7 bit)

American Standard Code for Information Interchange.
Historically, digital text was encoded in 7-bit ASCII.
ASCII includes 128 characters (2^7).
It covers only the Roman (Latin) alphabet and the characters essential
for writing English, e.g. [0-9], [A-Z], [a-z].


1(A). Character Encoding Types- ASCII (Contd.)

Figure: ASCII codes and their meanings.


1(A). Character Encoding Types- ASCII (Contd.)

Figure: ASCII codes and their meanings in rectangular format.


1(A). Character Encoding Types- ASCII (Contd.)

ASCII codes can be broadly classified into two categories:

1 ASCII control characters (character codes 0-31, plus 127 for Delete),
e.g. Null character (0), Start of Heading (1), Backspace (8), Horizontal
Tab (9), Line Feed (10), Vertical Tab (11), Device Control 1 (17).
2 ASCII printable characters (character codes 32-126): represent
letters, digits, punctuation marks, and a few miscellaneous
symbols.
Digits [0-9]: 48-57
Capital letters [A-Z]: 65-90
Small letters [a-z]: 97-122
Punctuation: space (32), exclamation mark (33), double
quote (34), single quote (39), comma (44), hyphen (45), period, dot,
or full stop (46), colon (58), semicolon (59), question mark (63)
Special symbols: # (35), $ (36), % (37), & (38), < (60), =
(61), > (62), @ (64)
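
The code ranges listed above can be checked with Python's built-in `ord` and `chr`. The helper `ascii_class` is a hypothetical name used only for this sketch; it treats code 127 (Delete) as a control character.

```python
def ascii_class(code):
    """Classify a 7-bit ASCII code as 'control' or 'printable'."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    # Codes 0-31 and 127 (Delete) are control characters.
    return "control" if code < 32 or code == 127 else "printable"


for ch in ["\t", "\n", " ", "0", "A", "a", "@"]:
    print(repr(ch), ord(ch), ascii_class(ord(ch)))

# chr() is the inverse of ord(): code 65 is the capital letter A.
print(chr(65))
```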
1(A). Character Encoding Types - ASCII Limitations

Limitation of 7-bit ASCII code

Required “asciification” or “romanization” of characters not present
in the ASCII table, e.g. German über would be written as u”ber or
ueber,
and the French word déjà would be written as de‘ja’ or de1ja2.
Languages not written in a Roman script (e.g. Russian, Arabic, Chinese)
required a more complex asciification:
a phonetic mapping of the source characters to Roman characters.
The Pinyin transliteration of Chinese writing is one such example.
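
A common modern way to asciify accented Latin letters is Unicode decomposition followed by dropping the accents. Note that this is a different convention from the "ueber" style shown above, and the function name `asciify` is an assumption for this sketch.

```python
import unicodedata


def asciify(text):
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(asciify("über"))
print(asciify("déjà"))
```

The conversion is lossy: the accent information is gone, and characters from non-Roman scripts are simply deleted, which is why those languages need a phonetic mapping instead.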


1(A). Character Encoding Types - Extended ASCII

8 bits: can represent 256 characters (2^8).
The first 128 characters (0-127) are reserved for the original ASCII characters.
Eight-bit encodings exist for all common alphabetic and some syllabic
writing systems, e.g. ISO-8859.
Cons: the character sets overlap, i.e. the same byte values (128-255)
stand for different characters in different encodings.

ISO-8859

It contains encoding definitions for most European characters.
HTML syntax: <meta charset="ISO-8859-1">
ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe),
ISO-8859-3 (Southern Europe), ISO-8859-4 (Baltic), ISO-8859-5
(Cyrillic),
ISO-8859-6 (Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew),
ISO-8859-9 (Turkish), ISO-8859-15 (Latin 9)
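
The overlap between 8-bit character sets can be observed by decoding one and the same byte with several ISO-8859 parts. This is a small sketch using Python's built-in codecs; the byte value 0xE9 is an arbitrary choice for illustration.

```python
byte = bytes([0xE9])  # one byte from the upper half (128-255)

# Decode the same byte with three different ISO-8859 parts.
decoded = {
    name: byte.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}
for name, ch in decoded.items():
    print(name, repr(ch))

# The lower half (0-127) is plain ASCII in all of these encodings.
print(b"NLP".decode("iso8859_5") == b"NLP".decode("ascii"))
```

The byte yields a Western European letter, a Cyrillic letter, and a Greek letter respectively, so a file's bytes cannot be interpreted without knowing which part was used to write it.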
1(A). Character Encoding Types - ISCII encoding
(8-bit)

Indian Script Code for Information Interchange.

ISCII is an encoding scheme that represents the scripts of various
languages written and spoken in India.
ISCII was introduced in 1991 by the Bureau of Indian
Standards (BIS).
ISCII includes 256 characters (2^8).
The first 128 characters (0-127) are the same as in
ASCII.
The remaining characters (128-255) represent characters
from the Indian scripts.
Most Indian scripts derive from the ancient
Brahmi script and closely resemble one another because they share a
similar phonetic structure.
1(A). Character Encoding Types - ISCII encoding
(8-bit)

Advantage

The majority of languages spoken in India are represented in it.
The character set is simple and easy to understand.
Easy transliteration between languages is possible.
Supported scripts and languages: Devanagari, Punjabi, Gujarati, Oriya,
Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil
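
The transliteration advantage can be illustrated with Unicode rather than ISCII bytes, since the Unicode blocks for Indic scripts follow the ISCII layout: corresponding letters sit at the same offset inside each script's 128-code-point block. The function `transliterate` and the `BLOCK_START` table are hypothetical names for this sketch.

```python
import unicodedata

# First code point of each script's 128-code-point Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}


def transliterate(text, source, target):
    low = BLOCK_START[source]
    shift = BLOCK_START[target] - low
    out = []
    for ch in text:
        # Shift only characters inside the source block; copy the rest.
        inside = low <= ord(ch) < low + 0x80
        out.append(chr(ord(ch) + shift) if inside else ch)
    return "".join(out)


ka = "\u0915"  # DEVANAGARI LETTER KA
for target in ("bengali", "gujarati", "tamil"):
    result = transliterate(ka, "devanagari", target)
    print(target, result, unicodedata.name(result))
```

This works cleanly only for letters that exist in both scripts; some slots are unassigned in some blocks, so a practical transliterator needs per-script exception handling.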

Disadvantages

A special keyboard with ISCII character keys is needed.

Unicode includes the characters covered by ISCII, so ISCII has become
largely obsolete.


1(A). Character Encoding Types - A two byte
character set - for Japanese and Chinese

Chinese and Japanese, which have several thousand distinct
characters, require multiple bytes to encode a single character.
A two-byte (16-bit) character set can represent 65,536 (2^16)
distinct characters.
Single-byte letters, spaces, punctuation marks (e.g., periods,
quotation marks, and parentheses), and Arabic numerals (0–9) are
commonly interspersed with 2-byte Chinese and
Japanese characters.
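
The mix of one-byte and two-byte characters can be seen by encoding a short string with a legacy Chinese encoding. The sketch uses Python's built-in `gb2312` codec; the sample string is an assumption chosen for illustration.

```python
text = "NLP 中文 2022"

# Encode each character separately to see how many bytes it needs.
for ch in text:
    encoded = ch.encode("gb2312")
    print(repr(ch), len(encoded), encoded.hex())

total = len(text.encode("gb2312"))
print(len(text), "characters,", total, "bytes")
```

Because character count and byte count differ, a tokenizer must work on decoded characters; splitting the raw bytes at arbitrary positions could cut a two-byte character in half.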


1(A). Character Encoding Types - A two byte
character set- Challenges

Challenges

Code switching: characters from many different writing systems
occur within the same text, e.g. a sentence meaning "I am going to
school" in which all words except "school" are written in an Indian
language such as Hindi, Tamil, Telugu, or Bengali.
Multiple encodings: several encodings also exist for Chinese, e.g.
Big-5 for the complex-form (traditional) character set and GB for
the simple-form (simplified) set.
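
The multiple-encoding problem can be made concrete by encoding one Chinese character with both families. The sketch relies on Python's built-in `big5` and `gb2312` codecs; the chosen character is an assumption, picked because it exists in both character sets.

```python
ch = "中"  # present in both the traditional and the simplified set

big5_bytes = ch.encode("big5")
gb_bytes = ch.encode("gb2312")
print(big5_bytes.hex(), gb_bytes.hex())

# Decoding GB bytes with the Big-5 table does not recover the original:
# the result is a different character or a replacement mark.
wrong = gb_bytes.decode("big5", errors="replace")
print(repr(wrong))
```

The same character maps to different byte pairs, and the same byte pair maps to different characters, so the encoding must be identified before any segmentation is attempted.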


1(A). Character encoding - Unicode Encoding - UTF-8

The default character encoding for HTML5 is UTF-8.

To specify the encoding in HTML: <meta charset="UTF-8">

Figure: Character encoding in HTML

1(A). Character Encoding Types - Unicode encodings

Unicode seeks to eliminate this character set ambiguity by specifying a
Universal Character Set that includes over 100,000 distinct coded
characters derived from over 75 supported scripts representing all the
writing systems commonly used today.
It is most commonly implemented in the UTF-8 variable-length
character encoding, in which each character is represented by a 1 to
4 byte encoding.
UTF-8 allows for the encoding of all supported characters with no
overlap or confusion between conflicting byte ranges.
It is rapidly replacing older character encoding sets for multilingual
applications.
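
In Unicode every character has one code point regardless of script, which is what removes the ambiguity described above. The sketch below prints code points and official names with the standard `unicodedata` module; the helper name `code_point` and the sample characters are assumptions for illustration.

```python
import unicodedata


def code_point(ch):
    """Format a character's Unicode code point in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "é", "क", "中", "€"]:
    print(code_point(ch), ch, unicodedata.name(ch))
```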


1(A). Character encoding - Unicode Encoding - UTF-8

In the UTF-8 encoding:

ASCII characters require 1 byte.
Most other characters included in ISO-8859 and other alphabetic
systems require 2 bytes.
All other characters, including Chinese, Japanese, and Korean,
require 3 bytes (and very rarely 4 bytes).
import pandas as pd

# Read a CSV file, decoding its bytes as UTF-8
df = pd.read_csv('fname.csv', encoding='utf-8')
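
The byte counts per character class can be verified directly by encoding single characters. The sample characters and their labels below are assumptions chosen to cover each length.

```python
samples = {
    "A": "ASCII letter",
    "é": "Latin-1 letter",
    "я": "Cyrillic letter",
    "中": "Chinese character",
    "😀": "emoji",
}

# Encode each character as UTF-8 and report how many bytes it takes.
lengths = {}
for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(ch, label, len(encoded), encoded.hex())
```

The lengths grow from 1 byte for ASCII to 4 bytes for the emoji, which is the rare four-byte case mentioned above.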


1 (B). Character Encoding Identification and Its
Impact on Tokenization

The header of a digital document may contain information regarding
its character encoding.
This information is not always present, or even reliable,
in which case the encoding must be determined automatically.
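
Reading the encoding a document declares about itself might look like the sketch below, which checks for a byte order mark and then for an HTML `meta` charset declaration. The function name `declared_encoding`, the 1024-byte search window, and the simplified pattern are assumptions; UTF-32 marks and XML declarations are not handled.

```python
import codecs
import re

# Byte order marks and the Python codec name each one suggests.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Simplified pattern for <meta charset="..."> and the older
# <meta http-equiv=... content="...; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_encoding(raw):
    """Return the encoding a document declares about itself, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    match = META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


print(declared_encoding(b'<html><head><meta charset="UTF-8"></head></html>'))
print(declared_encoding(b"plain text without any declaration"))
```

A result of `None` corresponds to the case where the header gives no information and the encoding has to be guessed from the bytes themselves.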


1 (B). Character Encoding Identification and Its
Impact on Tokenization

Challenges of Character Encoding Identification.

Despite the popularity of Unicode (UTF-8), many document sources
still use other encoding schemes.
The same range of numeric values can represent different characters in
different encodings, e.g. English and Spanish are both normally stored
in the common 8-bit encoding Latin-1 (ISO-8859-1), while other 8-bit
encodings such as ISO-8859-5 (Cyrillic) assign different characters to
the same byte values.

Python libraries for character encoding detection

chardet
cchardet
Reference blog
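
Detection libraries such as the ones listed above rely on statistical models of byte patterns. As a much simpler illustration of the idea, the sketch below tries a fixed list of candidate encodings and returns the first that decodes without error; the function name `guess_encoding` and the candidate order are assumptions, and only the standard library is used.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"plain text"))
print(guess_encoding("déjà vu".encode("utf-8")))
print(guess_encoding("déjà vu".encode("latin-1")))
```

The last call reports `cp1252` even though the bytes were written as Latin-1, because both encodings accept those byte values. Trial decoding can rule encodings out but cannot tell compatible ones apart, which is why statistical detectors are needed in practice.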
