Natural Language Processing
CSE4022
Lecture-04: Text Processing - Module 2
Dr. Durgesh Kumar
Assistant Professor, SCOPE, VIT Vellore
Table of contents
1 Recap from previous Module
2 Text Processing Module outline
3 Character Encoding types
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 1 / 34
Recap from previous Module- Intro to NLP
Module 1 Summary
What is NLP?
Stages of NLP
Dis-ambiguity in NLP
Challenges of NLP.
Models and Algorithm categories in NLP.
Real world Applications of NLP
Heroes of NLP and related online courses.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 2 / 34
Text Processing Introduction
Text Processing
The task of converting a raw text file, essentially a sequence of digital
bits, into a well-defined sequence of linguistically meaningful units.
1 Grapheme:the smallest units of writing that correspond with sounds
(more accurately phonemes).
2 Phoneme:A unit of sound that can distinguish one word from
another in a particular language.
3 words: consists of one or more characters
4 sentence: consisting of one or more words.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 3 / 34
Text Processing Introduction (Contd.)
At the lowest level characters represents: individual grapheme, word,
sentence in a language’s written system.
Text pre-processing is an essential part of any NLP system.
The characters, words, and sentences identified at this stage are the
fundamental units passed to all further processing stages, from
document analysis, NER, POS, Sentiment Analysis, Information
Retrieval, etc.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 4 / 34
Text Processing Introduction (Contd.)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 5 / 34
Progress of NLP (Text Processing)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 6 / 34
Text Processing - Challenges & Need
NLP contains inherent ambiguities, and is further amplified and
generated by writing systems.
The explosion in corpus size and variety has necessities techniques
for automatically harvesting and preparing text corpora for various
NLP tasks.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 7 / 34
Outline of Text-processing
Figure: Text Processing types
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 8 / 34
Text Processing
Text pre-processing: converting sequence of digital bits, into a
well-defined sequence of linguistically meaningful units.
Text pre-processing can be divided into two stages:
1 document triage: process of converting a set of digital files into
well-defined text documents.
2 text segmentation: process of converting a well-defined text corpus
into its component words, and sentences.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 9 / 34
Text Processing - Document Triage
In order for any natural language document to be machine readable, its
characters must be represented in a character encoding, in which one or
more bytes in a file maps to a known character.
1 Character encoding determines the character encoding (or
encodings) for any file and optionally converts between encodings.
2 language identification: determines the natural language for a
document; this step is closely linked to, but not uniquely determined
by, the character encoding.
3 text sectioning identifies the actual content within a file while
discarding undesirable elements, such as images, tables, headers,
links, and HTML formatting.
The o/p the document triage stage is a well-defined text corpus,
organized by language, suitable for text segmentation, and further
analysis.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 10 / 34
Text Processing - Text segmentation
1 Word segmentation breaks up the sequence of characters in a text
by locating the word boundaries. In computation linguistics, words
are oaften referred to as tokens, and word segmentation as
tokenization.
2 Text normalization is a related step that involves merging different
written forms of a token into a canonical normalized form; e.g.
convert tokens “Mr.”, “Mr”, “mister”, and “Mister” to a single
normalized form.
3 Sentence segmentation is the process of determining the longer
processing units consisting of one or more words. This task involves
identifying sentence boundaries between words in different
sentences. sentence boundary detection, sentence boundary
disambiguation or sentence boundary recognition.
=⇒ Most written languages have punctuation marks.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 11 / 34
Module 2 outline- from Syllabus
Character Encoding : In the linguistic analysis of a digital natural
language text, it is necessary to clearly define the characters, words,
and sentences in any document.
Word Segmentation
Sentence Segmentation
Intro to Corpora
Corpora Analysis
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 12 / 34
Challenges of Text Processing
Types of Writing systems
1 Logo-graphic: individual symbol represent words; large number
(often thousands) of symbols. e.g.: Chinese.
2 Syllabic: individual symbol represents syllables. e.g.: Japanese kana
3 alphabetic: where individual symbol represent sounds. e.g.: English
Syllabic and alphabetic system have fewer than 100 symbols.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 13 / 34
Challenges of Text Processing
Syllables
a unit of spoken language that is next bigger than a speech sound
is a sequence of speech sounds (formed from vowels and consonants)
organized into a single unit.
consists of one or more vowel sounds alone or of a syllabic consonant
alone or of either with one or more consonant sounds preceding or
following
act as the building blocks of a spoken word, determining the pace
and rhythm of how the word is pronounced.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 14 / 34
Challenges of Text Processing
Table: Syllab Example
suffix Words Written Syllables Spoken Syllables
alarming a·larm·ing uh-lahr-ming
asking ask·ing
baking bak·ing
“-ing” words
eating eat·ing
learning learn·ing
thinking think·ing
atomize at·om·ize
customize cus·tom·ize
internalize in·ter·nal·ize
“-ize” words
moisturize mois·tur·ize
See here for details
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 15 / 34
Challenges of Text Processing
Different types of writing systems. e.g. logo-graphics, syllabic,
alphabetic.
Majority of all written languages uses an alphabetic or syllabic
system [Corrie et al. 1996].
Modern writing system employs more than one types.
English predominantly uses roman base script (an alphabetic writing
system), but also utilizes logo-graphic symbols.
Arabic (0-9)
Currency symbol: dollar($), Euro (€) , Rupee(|) , Pounds (£), Yen
(¥)
And other symbols (%, &, #)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 16 / 34
Challenges of Text Processing
Challenges of Text processing can be categorized into 4 categories:
1 Character set dependence
(a) Character encoding types
(b) Character encoding identification & its implication on tokenization.
2 Language dependence
3 Corpus dependence
4 Application dependence
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 17 / 34
1(A). Character Encoding types
ASCII encoding (7-bit)
Extended ASCII encoding (8-bit)
ISCII encoding (8-bit)
16 bit - ( for encoding Chinese, Japanese)
Unicode 5.0
UTF-8: Variable length Encoding
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 18 / 34
1(A). Character Encoding Types - ASCII (7 bit)
American Standard Code for Information Interchange.
Historically all digital text were encded in 7-bit ASCII code.
ASCII code include 128 characters (27 ).
It includes only Roman or Latin alphabets and essentials English
alphabets. e.g. [0-9], [A-Z], [a-z].
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 19 / 34
1(A). Character Encoding Types- ASCII (Contd.)
Figure: ASCII codes and it’s meaning.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 20 / 34
1(A). Character Encoding Types- ASCII (Contd.)
Figure: ASCII codes and it’s meaning in rectangular format.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 21 / 34
1(A). Character Encoding Types- ASCII (Contd.)
ASCII codes can be broadly classified into two categories:
1 ASCII control characters (character code 0-31): NULL Character
(0), Start of Heading (1), Backspace (8), Horizontal tab (9), Line
Feed (10), Vertical Tab (11), Device Control 1 (17).
2 ASCII printable characters (character code 32-127): represent
letters, digits, punctuation marks, and a few miscellaneous
symbols.
digits [0-9]: 48-57
Capital letters [A-Z]: 65-90
Small letters [a-z]: 97-122
Punctuation: Space (32), Exclamation mark (33), double
quote(34), Single quote (39), Comma (44), hypen (45), Period, dot
or full stop (46), colon (58), Semicolon (59), question mark (63)
Special Symbols: #: (35), $: (36), % (37), &: (38), <: (60), =
(61), >: (62), @: (64)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 22 / 34
1(A). Character Encoding Types - ASCII Limitations
Limitation of 7-bit ASCII code
Required “asciification” or “romanization” of characters not present
in ASCII table. e.g. German über would be written as u”ber or
ueber.
the French word déjà would be written as de‘ja’ or de1ja2.
complex ASCII-fication for non-roman based languages:
a phonetic mapping of the source characters to the
roman characters . e.g.: Russian, Arabic, Chinese.
The Pinyin transliteration of Chinese writing is one such example.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 23 / 34
1(A). Character Encoding Types - Extended ASCII
8 bits: It can represents 256 characters (28 ).
First 128 characters (0-128) reserved for original ASCII characters.
Eight-bit encoding exist for all common alphabetic and some syllabic
writing systems; e.g. ISO-8859.
Cons: Overlapping character sets in different languages.
ISO-8859
It contains encoding definitions for most European characters.
HTML syntax: < meta charset="ISO-8859-1">
ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe),
ISO-8859-3 (Southern Europe), ISO-8859-4 (Baltic) ISO-8859-5
(Cyrillic)
ISO-8859-6 (Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew),
ISO-8859-9 (Turkish), ISO-8859-15 (Latin 9)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 24 / 34
1(A). Character Encoding Types - ISCII encoding
(8-bit)
Indian Script Code for Information Interchange.
ISCII is an encoding scheme that represents various languages that
are written and spoken in India.
ISCII was introduced in the year 1991 by the Bureau of Indian
Standards(BIS).
ISCII code include 256 characters (28 ).
The first 128 characters, that is, from 0-127 are same as that for
ASCII.
The next characters, that are from 128-255 represent the characters
from the Indian scripts.
Most of the Indian language characters are taken from the ancient
Brahmi script and resemble close to each other due to having similar
phonetic structure.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 25 / 34
1(A). Character Encoding Types - ISCII encoding
(8-bit)
Advantage
Majority of languages that are spoken in India are represented in this.
Character set is simple and easy to understand.
Easy transliteration between languages is possible.
Supported Languages: Devanagari, Punjabi, Gujarati, Oriya,
Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil
Disadvantages
We need a special keyboard which contains ISCII character keys.
As the Unicode was invented later, and with Unicode having the
characters of ISCII, ISCII became obsolete.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 26 / 34
1(A). Character Encoding Types - A two byte
character set - for Japanese and Chinese
Chinese and Japanese, which have several thousand distinct
characters, require multiple bytes to encode a single character.
A two-byte character or 16 bits set can represent 65,536 (216 )
distinct characters.
single-byte letters, spaces, punctuation marks (e.g., periods,
quotation marks, and parentheses), and Arabic numerals (0–9) are
commonly interspersed (distributed) with 2-byte Chinese and
Japanese characters
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 27 / 34
1(A). Character Encoding Types - A two byte
character set- Challenges
Challenges
Code switching: Characters from many different writing systems
occur within the same text. e.g. i am going to School. (where all
words except School are written in Indian languages such as hindi,
Tamil, Telugu, Bengali etc.).
Multiple Encoding Multiple encodings also exist for chinese e.g.
Big-5 for the complex-form (traditional) character set and GB for
the simple-form (simplified) set.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 28 / 34
1(A). Character encoding - Unicode Encoding - UTF-8
The default character encoding for HTML5 is UTF-8.
To specify encoding in HTML : <meta charset="UTF-8">
Figure: Character encoding in HTML
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 29 / 34
1(A). Character Encoding Types - Unicode encodings
It seeks to eliminate this character set ambiguity by specifying a
Universal Character Set that includes over 100,000 distinct coded
characters derived from over 75 supported scripts representing all the
writing systems commonly used today.
most commonly implemented in the UTF-8 variable-length
character encoding, in which each character is represented by a 1 to
4 byte encoding.
UTF-8 allow for the encoding of all supported characters with no
overlap or confusion between conflicting byte ranges.
it is rapidly replacing older character encoding sets for multilingual
applications.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 30 / 34
1(A). Character encoding - Unicode Encoding - UTF-8
In the UTF-8 encoding
ASCII characters require 1 byte
other characters included in ISO-8859 and other alphabetic
characters require 2 bytes.
All other characters, including Chinese, Japanese, and Korean,
require 3 bytes (and very rarely 4 bytes).
1 import pandas as pd
2 df = pd . read_csv ( ’ fname . csv ’ , encoding = ’ utf8 ’)
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 31 / 34
1 (B). Character Encoding Identification and Its
Impact on Tokenization
the header of a digital document may contain information regarding
its character encoding.
this information is not always present or even reliable.
in which case the encoding must be determined automatically.
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 32 / 34
1 (B). Character Encoding Identification and Its
Impact on Tokenization
Challenges of Character Encoding Identification.
Despite popularity of Unicode encoding (UTF-8), still many sources
of documents uses different encoding schemes.
same range of numeric values can represent different characters in
different encodings. e.g. English or Spanish are both normally stored
in the common 8-bit encoding Latin-1 (or ISO-8859-1).
Python Librarry for character encoding detection
chardet
cchardet
Reference blog
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 33 / 34
Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 4, 2022August 4, 2022 34 / 34