
Natural Language Processing

The document discusses text processing in natural language processing. It begins with an overview of text processing, including defining graphemes, phonemes, words, and sentences. It then covers challenges in text processing, such as different writing systems, and outlines the modules to be covered, including character encoding, word segmentation, and sentence segmentation. The document focuses on character encoding types like ASCII and Unicode and discusses challenges related to character set dependence, language dependence, corpus dependence, and application dependence in text processing.


Natural Language Processing

CSE4022

Lecture-04: Text Processing - Module 2

Dr. Durgesh Kumar


Assistant Professor, SCOPE, VIT Vellore
Table of contents

1 Recap from previous Module

2 Text Processing Module outline

3 Character Encoding types

Dr. Durgesh Kumar | Lecture-02 | NLP | CSE4022 | August 4, 2022


Recap from previous Module- Intro to NLP

Module 1 Summary

What is NLP?
Stages of NLP
Disambiguation in NLP
Challenges of NLP
Models and algorithm categories in NLP
Real world Applications of NLP
Heroes of NLP and related online courses.



Text Processing Introduction

Text Processing
The task of converting a raw text file, essentially a sequence of digital
bits, into a well-defined sequence of linguistically meaningful units.

1 Grapheme: the smallest unit of writing that corresponds to a sound
(more accurately, to a phoneme).
2 Phoneme: a unit of sound that can distinguish one word from
another in a particular language.
3 Word: consists of one or more characters.
4 Sentence: consists of one or more words.



Text Processing Introduction (Contd.)

At the lowest level, characters represent the individual graphemes of a
language's writing system; characters combine into words, and words
into sentences.
Text pre-processing is an essential part of any NLP system.
The characters, words, and sentences identified at this stage are the
fundamental units passed to all further processing stages, such as
document analysis, named entity recognition (NER), part-of-speech (POS)
tagging, sentiment analysis, and information retrieval.



Text Processing Introduction (Contd.)



Progress of NLP (Text Processing)



Text Processing - Challenges & Need

Natural language contains inherent ambiguities, and writing systems
both amplify these ambiguities and generate new ones.
The explosion in corpus size and variety has necessitated techniques
for automatically harvesting and preparing text corpora for various
NLP tasks.



Outline of Text-processing

Figure: Text Processing types



Text Processing

Text pre-processing: converting a sequence of digital bits into a
well-defined sequence of linguistically meaningful units.
Text pre-processing can be divided into two stages:
1 Document triage: the process of converting a set of digital files into
well-defined text documents.
2 Text segmentation: the process of converting a well-defined text corpus
into its component words and sentences.



Text Processing - Document Triage
In order for any natural language document to be machine readable, its
characters must be represented in a character encoding, in which one or
more bytes in a file map to a known character.
1 Character encoding identification: determines the character encoding (or
encodings) for any file and optionally converts between encodings.
2 Language identification: determines the natural language of a
document; this step is closely linked to, but not uniquely determined
by, the character encoding.
3 Text sectioning: identifies the actual content within a file while
discarding undesirable elements, such as images, tables, headers,
links, and HTML formatting.
The output of the document triage stage is a well-defined text corpus,
organized by language, suitable for text segmentation and further
analysis.
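
The text sectioning step can be sketched with Python's standard html.parser module. This is a minimal illustration, not a complete triage pipeline: the class name TextExtractor and the helper extract_text are hypothetical, and the choice to skip only script and style elements is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores markup (hypothetical helper)."""

    SKIP = {"script", "style"}  # assumption: only these elements are dropped

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


print(extract_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

Real documents usually also need headers, navigation links, and tables removed, which this sketch does not attempt.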
Text Processing - Text segmentation
1 Word segmentation breaks up the sequence of characters in a text
by locating the word boundaries. In computational linguistics, words
are often referred to as tokens, and word segmentation as
tokenization.
2 Text normalization is a related step that involves merging different
written forms of a token into a canonical normalized form; e.g.
converting the tokens “Mr.”, “Mr”, “mister”, and “Mister” to a single
normalized form.
3 Sentence segmentation is the process of determining the longer
processing units consisting of one or more words. This task involves
identifying sentence boundaries between words in different
sentences. It is also known as sentence boundary detection, sentence
boundary disambiguation, or sentence boundary recognition.
⇒ Most written languages have punctuation marks that occur at
sentence boundaries.
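
The three segmentation steps can be sketched with regular expressions. This is a deliberately naive illustration: the function names and the CANONICAL table are hypothetical, and splitting on ".", "!" and "?" is an assumption that fails on abbreviations, which is why the task is called sentence boundary disambiguation.

```python
import re

# Hypothetical normalization table, for illustration only.
CANONICAL = {"mr.": "mister", "mr": "mister", "mister": "mister"}


def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # A token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


def normalize(token):
    # Map known written variants to one canonical form.
    return CANONICAL.get(token.lower(), token)


sentences = split_sentences("I like NLP. Do you? Yes!")
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
print({normalize(t) for t in ["Mr.", "Mr", "mister", "Mister"]})

# The naive rule wrongly splits after the abbreviation "Mr.":
print(split_sentences("Mr. Smith left."))
```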
Module 2 outline- from Syllabus

Character Encoding: In the linguistic analysis of a digital natural
language text, it is necessary to clearly define the characters, words,
and sentences in any document.
Word Segmentation
Sentence Segmentation
Intro to Corpora
Corpora Analysis



Challenges of Text Processing

Types of Writing systems

1 Logographic: individual symbols represent words; such systems have a
large number (often thousands) of symbols. e.g.: Chinese.
2 Syllabic: individual symbols represent syllables. e.g.: Japanese kana.
3 Alphabetic: individual symbols represent sounds. e.g.: English.

Syllabic and alphabetic systems typically have fewer than 100 symbols.



Challenges of Text Processing

Syllables

A syllable:
is a unit of spoken language that is the next larger unit than a speech sound;
is a sequence of speech sounds (formed from vowels and consonants)
organized into a single unit;
consists of one or more vowel sounds alone, or of a syllabic consonant
alone, or of either with one or more consonant sounds preceding or
following;
acts as a building block of a spoken word, determining the pace
and rhythm of how the word is pronounced.



Challenges of Text Processing
Table: Syllable examples

Suffix          Word          Written syllables    Spoken syllables
“-ing” words    alarming      a·larm·ing           uh-lahr-ming
                asking        ask·ing
                baking        bak·ing
                eating        eat·ing
                learning      learn·ing
                thinking      think·ing
“-ize” words    atomize       at·om·ize
                customize     cus·tom·ize
                internalize   in·ter·nal·ize
                moisturize    mois·tur·ize

Challenges of Text Processing

There are different types of writing systems, e.g. logographic, syllabic,
and alphabetic.
The majority of all written languages use an alphabetic or syllabic
system [Comrie et al. 1996].
Modern writing systems employ more than one type of symbol.
English predominantly uses the Roman script (an alphabetic writing
system), but also utilizes logographic symbols:
Arabic numerals (0-9)
Currency symbols: dollar ($), euro (€), rupee (₹), pound (£), yen (¥)
Other symbols (%, &, #)
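
One hedged way to see this mix of symbol types in ordinary English text is to look at Unicode general categories with the standard unicodedata module. The helper name describe is hypothetical, and a Unicode category is only a rough proxy: it separates letters, digits, and symbols, but it is not the same thing as the linguistic notion of a logographic symbol.

```python
import unicodedata


def describe(text):
    # Pair each non-space character with its Unicode general category:
    # Ll = lowercase letter, Nd = decimal digit,
    # Sc = currency symbol, Po = other punctuation.
    return [(ch, unicodedata.category(ch)) for ch in text if not ch.isspace()]


categories = dict(describe("a 5 $ \u20ac % & #"))
for ch, cat in categories.items():
    print(ch, cat)
```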



Challenges of Text Processing

Challenges of Text processing can be categorized into 4 categories:


1 Character set dependence
(a) Character encoding types
(b) Character encoding identification & its implication on tokenization.

2 Language dependence
3 Corpus dependence
4 Application dependence



1(A). Character Encoding types

ASCII encoding (7-bit)
Extended ASCII encoding (8-bit)
ISCII encoding (8-bit)
16-bit (two-byte) encodings, used for Chinese and Japanese
Unicode 5.0
UTF-8: variable-length encoding



1(A). Character Encoding Types - ASCII (7 bit)

American Standard Code for Information Interchange.
Historically, digital text was predominantly encoded in 7-bit ASCII.
ASCII includes 128 characters (2^7).
It covers only the Roman (Latin) alphabet and the other characters
essential for writing English, e.g. [0-9], [A-Z], [a-z].



1(A). Character Encoding Types- ASCII (Contd.)

Figure: ASCII codes and their meanings.



1(A). Character Encoding Types- ASCII (Contd.)

Figure: ASCII codes and their meanings in rectangular format.



1(A). Character Encoding Types- ASCII (Contd.)

ASCII codes can be broadly classified into two categories:
1 ASCII control characters (character codes 0-31 and 127): e.g. Null
character (0), Start of Heading (1), Backspace (8), Horizontal Tab (9),
Line Feed (10), Vertical Tab (11), Device Control 1 (17), Delete (127).
2 ASCII printable characters (character codes 32-126): represent
letters, digits, punctuation marks, and a few miscellaneous
symbols.
Digits [0-9]: 48-57
Capital letters [A-Z]: 65-90
Small letters [a-z]: 97-122
Punctuation: space (32), exclamation mark (33), double
quote (34), single quote (39), comma (44), hyphen (45), period, dot,
or full stop (46), colon (58), semicolon (59), question mark (63)
Special symbols: # (35), $ (36), % (37), & (38), < (60), =
(61), > (62), @ (64)
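
The code ranges above can be checked with Python's built-in ord and chr. The helper classify_ascii is a hypothetical name used only for this sketch; it treats code 127 (Delete) as a control character, as the ASCII standard does.

```python
def classify_ascii(code):
    # Codes 0-31 and 127 are control characters; 32-126 are printable.
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code < 32 or code == 127:
        return "control"
    return "printable"


print(ord("0"), ord("9"))  # first and last digit codes
print(ord("A"), ord("Z"))  # first and last capital letter codes
print(ord("a"), ord("z"))  # first and last small letter codes
print(chr(64), chr(35))    # characters for codes 64 and 35
print(classify_ascii(10), classify_ascii(32), classify_ascii(127))
```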
1(A). Character Encoding Types - ASCII Limitations

Limitations of 7-bit ASCII code

Text in other languages required “asciification” or “romanization” of
characters not present in the ASCII table, e.g. the German word über
would be written as u"ber or ueber,
and the French word déjà would be written as de‘ja’ or de1ja2.
Languages that do not use the Roman alphabet require a more complex
asciification: a phonetic mapping of the source characters to
Roman characters, e.g. for Russian, Arabic, and Chinese.
The Pinyin transliteration of Chinese writing is one such example.
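
A minimal sketch of asciification for Latin-script text, using the standard unicodedata module. The GERMAN table and the function name asciify are hypothetical. Characters without a table entry simply lose their accents here, which is an assumption of this sketch and is lossy; it does not reproduce conventions such as de1ja2.

```python
import unicodedata

# Hypothetical language-specific table (lowercase German only).
GERMAN = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def asciify(text, table=None):
    # 1. Apply language-specific replacements, if a table is given.
    if table:
        text = "".join(table.get(ch, ch) for ch in text)
    # 2. Decompose accented letters into base letter + combining mark,
    #    then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("\u00fcber", GERMAN))  # German word with u-umlaut
print(asciify("d\u00e9j\u00e0"))     # French word with two accented vowels
```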



1(A). Character Encoding Types - Extended ASCII

8 bits: it can represent 256 characters (2^8).
The first 128 characters (0-127) are reserved for the original ASCII characters.
Eight-bit encodings exist for all common alphabetic and some syllabic
writing systems, e.g. ISO-8859.
Cons: the character sets for different languages overlap, i.e. the same
byte values map to different characters in different encodings.
ISO-8859

It contains encoding definitions for most European characters.
HTML syntax: <meta charset="ISO-8859-1">
ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe),
ISO-8859-3 (Southern Europe), ISO-8859-4 (Baltic), ISO-8859-5
(Cyrillic),
ISO-8859-6 (Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew),
ISO-8859-9 (Turkish), ISO-8859-15 (Latin 9)
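
The overlap between 8-bit character sets can be seen by decoding one and the same byte under several ISO-8859 parts. The sketch below uses Python's built-in codecs; the byte value 0xE4 is an arbitrary choice for illustration.

```python
raw = bytes([0xE4])  # one byte from the upper half (128-255)

# The same byte is a different character in each 8-bit encoding.
decoded = {enc: raw.decode(enc)
           for enc in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
for enc, ch in decoded.items():
    print(enc, ch, hex(ord(ch)))

# Bytes in the lower half (0-127) stay plain ASCII in all of them.
ascii_views = {bytes([0x41]).decode(enc) for enc in decoded}
print(ascii_views)
```

Without knowing which encoding was used, the byte alone cannot tell a reader whether it is a Latin, Cyrillic, or Greek letter.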
1(A). Character Encoding Types - ISCII encoding
(8-bit)

Indian Script Code for Information Interchange.
ISCII is an encoding scheme that represents various languages that
are written and spoken in India.
ISCII was introduced in 1991 by the Bureau of Indian
Standards (BIS).
ISCII includes 256 characters (2^8).
The first 128 characters (0-127) are the same as in
ASCII.
The remaining characters (128-255) represent characters
from the Indian scripts.
Most Indian scripts are derived from the ancient
Brahmi script and closely resemble one another because of their similar
phonetic structure.
1(A). Character Encoding Types - ISCII encoding
(8-bit)

Advantages

The majority of the languages spoken in India are represented in it.
The character set is simple and easy to understand.
Easy transliteration between languages is possible.
Supported scripts and languages: Devanagari, Punjabi, Gujarati, Oriya,
Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil

Disadvantages

A special keyboard containing ISCII character keys is needed.
Unicode, which was introduced later, includes the
characters of ISCII, so ISCII has become obsolete.



1(A). Character Encoding Types - A two byte
character set - for Japanese and Chinese

Chinese and Japanese, which have several thousand distinct
characters, require multiple bytes to encode a single character.
A two-byte (16-bit) character set can represent 65,536 (2^16)
distinct characters.
Single-byte letters, spaces, punctuation marks (e.g., periods,
quotation marks, and parentheses), and Arabic numerals (0–9) are
commonly interspersed with 2-byte Chinese and
Japanese characters.



1(A). Character Encoding Types - A two byte
character set- Challenges

Challenges

Code switching: characters from many different writing systems
occur within the same text, e.g. the sentence “I am going to school”,
where all words except “school” are written in an Indian language such
as Hindi, Tamil, Telugu, or Bengali.
Multiple encodings: several encodings also exist for Chinese, e.g.
Big-5 for the complex-form (traditional) character set and GB for
the simple-form (simplified) set.
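
Both points, single-byte characters mixed with two-byte ones and multiple encodings for the same script, can be sketched with Python's built-in big5 and gb2312 codecs. The sample string is an arbitrary choice: "NLP" followed by two Chinese characters that exist in both character sets.

```python
text = "NLP\u4e2d\u6587"  # three ASCII letters + two Chinese characters

big5 = text.encode("big5")
gb = text.encode("gb2312")

# ASCII letters take one byte each, Chinese characters two bytes each.
print(len(big5), len(gb))

# The single-byte part is identical, the two-byte part is not.
print(big5[:3] == gb[:3], big5[3:] == gb[3:])

# Decoding with the wrong encoding yields different (wrong) text.
wrong = big5.decode("gb2312", errors="replace")
print(wrong == text)
```

This is why the encoding has to be identified correctly before any tokenization of the text is attempted.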



1(A). Character encoding - Unicode Encoding - UTF-8

The default character encoding for HTML5 is UTF-8.


To specify encoding in HTML : <meta charset="UTF-8">

Figure: Character encoding in HTML


1(A). Character Encoding Types - Unicode encodings

Unicode seeks to eliminate the character set ambiguity of the older
encodings by specifying a
Universal Character Set that includes over 100,000 distinct coded
characters derived from over 75 supported scripts representing all the
writing systems commonly used today.
It is most commonly implemented in the UTF-8 variable-length
character encoding, in which each character is represented by a 1- to
4-byte encoding.
UTF-8 allows for the encoding of all supported characters with no
overlap or confusion between conflicting byte ranges.
It is rapidly replacing older character encoding sets for multilingual
applications.



1(A). Character encoding - Unicode Encoding - UTF-8

In the UTF-8 encoding:
ASCII characters require 1 byte.
Most other characters included in ISO-8859, and most other alphabetic
characters, require 2 bytes.
All other characters, including Chinese, Japanese, and Korean,
require 3 bytes (and very rarely 4 bytes).
    import pandas as pd
    # Read a CSV file, decoding its bytes as UTF-8 ('fname.csv' is a placeholder name)
    df = pd.read_csv('fname.csv', encoding='utf-8')
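
The variable length of UTF-8 can be checked directly by encoding single characters and counting the bytes. The sample characters are arbitrary choices for this sketch, and the helper name utf8_length is hypothetical.

```python
def utf8_length(ch):
    # Number of bytes UTF-8 needs for one character.
    return len(ch.encode("utf-8"))


samples = {
    "A": "Latin capital letter (ASCII)",
    "\u00e9": "Latin small letter e with acute (in ISO-8859-1)",
    "\u044f": "Cyrillic small letter ya",
    "\u0939": "Devanagari letter ha",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

lengths = {ch: utf8_length(ch) for ch in samples}
for ch, description in samples.items():
    print(lengths[ch], "byte(s):", description)
```

The Devanagari letter needs 3 bytes, so the 2-byte rule for alphabetic characters holds for scripts such as Latin, Greek, and Cyrillic but not for Indian scripts.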



1 (B). Character Encoding Identification and Its
Impact on Tokenization

The header of a digital document may contain information regarding
its character encoding.
This information is not always present or even reliable,
in which case the encoding must be determined automatically.



1 (B). Character Encoding Identification and Its
Impact on Tokenization

Challenges of Character Encoding Identification.

Despite the popularity of the Unicode encoding (UTF-8), many sources
of documents still use different encoding schemes.
The same range of numeric values can represent different characters in
different encodings, and one encoding can serve several languages, e.g.
English and Spanish are both normally stored
in the common 8-bit encoding Latin-1 (ISO-8859-1).

Python libraries for character encoding detection

chardet
cchardet
Reference blog
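
To show the idea of automatic identification without any third-party dependency, here is a minimal standard-library sketch that tries candidate encodings in order and returns the first one that decodes without error. This is not how chardet works; the function name guess_encoding and the candidate list are assumptions made for illustration.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    # Try the strictest encodings first; return the first that decodes.
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


print(guess_encoding(b"hello"))
print(guess_encoding("d\u00e9j\u00e0".encode("utf-8")))
print(guess_encoding("d\u00e9j\u00e0".encode("iso-8859-1")))
```

Because ISO-8859-1 assigns a character to every byte value, it accepts any input and acts as a catch-all here. The sketch therefore cannot tell Latin-1 apart from other 8-bit encodings such as ISO-8859-5, which is the ambiguity that dedicated detection libraries try to resolve with additional evidence from the text.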


