Natural Language Processing Basics
Tobias Deußer, Dr. Rafet Sifa
tobias.deusser@iais.fraunhofer.de
10/11/2022
Advanced Methods for Text Mining
Agenda
1. What is Natural Language Processing
2. Preprocessing – Theory
3. Getting Started
Natural Language Processing – The Wikipedia Definition
From en.wikipedia.org/wiki/Natural_language_processing
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language
data. The goal is a computer capable of “understanding” the contents of documents, including
the contextual nuances of the language within them. The technology can then accurately extract
information and insights contained in the documents as well as categorize and organize the
documents themselves.
Modern NLP in a nutshell
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Downstream tasks
• Classification, e.g. sentiment or ratings
• Information extraction
  • Named Entity Recognition (see the sketch below)
  • Relation Extraction
• Natural Language Inference (NLI)
• Text generation
• Image generation
• …
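As a small taste of one of these tasks, here is a minimal, hedged sketch of Named Entity Recognition with spaCy; spaCy and its en_core_web_sm model are assumptions for illustration, not part of the lecture:

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Fraunhofer IAIS is located in Sankt Augustin near Bonn.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # predicted entity spans with labels such as ORG or GPE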
Downstream tasks – Sentiment Analysis
Figure 1: Classifying text into sentiment, figure taken from Socher et al. 2013
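In practice, such a sentiment classifier can be obtained from a pretrained model; a minimal sketch using the Hugging Face transformers pipeline (the default model it downloads is an assumption, not the model from the figure):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained English model
print(classifier("This movie was surprisingly good!"))
# expected form: [{'label': 'POSITIVE', 'score': 0.99...}]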
Downstream tasks – Information Extraction
Example from Deußer et al. 2022
“In 2021 and 2020 the total net revenue[kpi] was $100[cy] million and $80[py] million, respectively.”
Extracted pairs: (total net revenue[kpi] − 100[cy]), (total net revenue[kpi] − 80[py])
Downstream tasks – Natural Language Inference
Example from Pielka et al. 2021
“The man is wearing an orange and black polo shirt and is kneeling with his
lunch box in one hand while holding a banana in his other hand.”
“A man is wearing a green t shirt while holding a banana.”
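The task is to decide whether the second sentence follows from, contradicts, or is neutral with respect to the first. A hedged sketch with a publicly available MNLI-finetuned checkpoint; the model name roberta-large-mnli is an assumption, not the model used in Pielka et al. 2021:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = ("The man is wearing an orange and black polo shirt and is kneeling with his "
           "lunch box in one hand while holding a banana in his other hand.")
hypothesis = "A man is wearing a green t shirt while holding a banana."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
for idx, label in model.config.id2label.items():
    print(label, round(float(probs[idx]), 3))  # the contradiction class is expected to dominate here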
Downstream tasks – Text Generation
Figure 2: Generating text from a prompt by leveraging GPT-3 (Brown et al. 2020) on
https://beta.openai.com/playground
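GPT-3 itself is only accessible through the OpenAI API, but the same idea can be sketched locally with a small open model; a minimal example with GPT-2 via transformers (model choice and prompt are illustrative assumptions):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small open model, not GPT-3
print(generator("Advanced Methods for Text Mining is a course about", max_length=40))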
Downstream tasks – Image Generation
Figure 3: Generating images from a prompt by leveraging DALL·E mini (Dayma et al. 2021) on
https://www.craiyon.com/
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Generating text embeddings 1
Generating text embeddings 2
• Text, i.e. strings, cannot be used in ML models without vectorization.
• We need an embedding model to convert text to a numerical representation.
• Examples of such embedding models (see the tf-idf sketch below):
  • tf-idf
  • GloVe
  • Word2Vec
  • BERT
  • …
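A minimal vectorization sketch with tf-idf; using scikit-learn here is an assumption (the slides do not prescribe a library), and the toy corpus is made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Good muffins cost $3.88 in New York.",
    "The dish was not tasty at all.",
    "In 2021 the total net revenue was $100 million.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (documents, vocabulary size)
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])  # the learned vocabulary terms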
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Text preprocessing – Sentence Tokenization
• Sentence tokenization describes the process of splitting a text corpus into sentences.
• Obviously, the key character to look for is the period (.) character and other punctuation marks, i.e. ;:!?
• However, simply splitting whenever it occurs might lead to wrong results, because:
  • The decimal separator occurs frequently in written text, e.g.: 3.14.
  • Abbreviations are shortened with a period, e.g. abbr. for abbreviation.
  • Many more, see en.wikipedia.org/wiki/Full_stop
• A lot of implementations exist (see the sketch below):
  • Regex approach: github.com/mediacloud/sentence-splitter
  • Unsupervised approach: github.com/nltk/nltk
  • …
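A minimal sketch of the unsupervised NLTK approach; the example text is made up, and the pretrained Punkt model must be downloaded once:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # pretrained, unsupervised Punkt sentence tokenizer
text = "Dr. Smith paid $3.14 for a coffee in New York. He was not amused."
print(sent_tokenize(text))
# typically: ['Dr. Smith paid $3.14 for a coffee in New York.', 'He was not amused.']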
Text preprocessing – Word Tokenization
• Word tokenization describes the process of splitting a sentence (or any text) into words.
• Here, we mainly look for spaces and have to take care of punctuation marks.

from nltk.tokenize import word_tokenize  # requires the 'punkt' model: nltk.download('punkt')
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
Listing 1: Example from the nltk documentation
Text preprocessing – Subword Tokenization
• Modern NLP approaches (e.g. Transformers like BERT) tokenize the input a step further, splitting unknown (i.e., not in the vocabulary) words into meaningful subwords, but preserving frequently used words.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize('''Don't you love wasting GPU capacity by training transformer models for your Advanced Method in Text Mining course?''')
['don', "'", 't', 'you', 'love', 'wasting', 'gp', '##u', 'capacity', 'by', 'training', 'transform', '##er', 'models', 'for', 'your', 'advanced', 'method', 'in', 'text', 'mining', 'course', '?']
Listing 2: Example using the BertTokenizer from the transformers package
Text preprocessing – Part-Of-Speech Tagging
• Part-Of-Speech, or POS, tagging refers to the process of assigning “word classes” to each token.
• These “word classes” are the ones you probably learnt in elementary school: nouns, verbs, adjectives, adverbs, …
• Methods (see the sketch below):
  • Dictionary lookup, e.g. Penn Treebank tagset (Marcinkiewicz 1994)
  • Hidden Markov models
  • Supervised models
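A short sketch using NLTK's pretrained averaged-perceptron tagger (a supervised model) that emits Penn Treebank tags; the example sentence is made up:

import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # pretrained supervised tagger
tags = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(tags)  # list of (token, Penn Treebank tag) pairs, e.g. ('fox', 'NN'), ('lazy', 'JJ')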
Text preprocessing – Stop Words Removal 1
• Stop words are the most common words (like articles, prepositions, pronouns, conjunctions, etc.) and often do not add that much information to a sentence

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', ...]
Listing 3: Stop Words from the NLTK package
Text preprocessing – Stop Words Removal 2
• But there are issues with this … here, the negation “not” is itself a stop word, so removing stop words flips the apparent sentiment of the sentence.

from nltk.corpus import stopwords
sentence = 'The dish was not tasty at all'
[word for word in sentence.split() if word.lower() not in stopwords.words('english')]
['dish', 'tasty']
Listing 4: Removing “unimportant” words
Text preprocessing – Stemming 1
• Stemming normalizes words to their base stem
• Examples:
  • “liked” becomes “like”
  • “birds” becomes “bird”
  • “itemization” becomes “item”
• Thus, these words are treated similarly
• However, there are problems with this…
Text preprocessing – Stemming 2
• Overstemming – when the algorithm stems words to the same root even though they are unrelated
• Understemming – the opposite

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
words_to_be_stemmed = ['liked', 'birds', 'itemization', 'universal', 'university', 'universe', 'alumnus', 'alumni']
[stemmer.stem(word) for word in words_to_be_stemmed]
['like', 'bird', 'item', 'univers', 'univers', 'univers', 'alumnu', 'alumni']
Listing 5: Examples of correct and incorrect stemming
Text preprocessing – Lemmatization
• Lemmatization is very similar to stemming, but tries to incorporate context and aligns words with their lemma
• Often leverages POS tags
• Usually preferred to stemming (contextual analysis vs. hard-coded rules)
• Examples (reproduced in the sketch below):
  • “went” becomes “go”
  • “better” becomes “good”
  • “meeting” becomes either “meet” or “meeting”, based on the POS tag
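One way to reproduce these examples is NLTK's WordNetLemmatizer, which takes a simplified POS tag as a hint; the WordNet data must be downloaded once:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lexical database used for the lemma lookup

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))     # 'go'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("meeting", pos="v"))  # 'meet'
print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting'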
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Data Gathering & Text Parsing
• Data sources:
  • Web scraping
  • PDFs, Word files, Excel files, scanned documents, …
  • Public datasets
  • Your customers
  • …
• Parse the raw data into your preferred data structure (see the sketch below)
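A minimal sketch of gathering raw HTML and parsing it into plain text; the requests and beautifulsoup4 packages and the example URL are assumptions for illustration:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Natural_language_processing", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # preferred data structure: list of paragraphs
print(paragraphs[:3])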
Getting Started – A few things first
• We will use Python for examples & assignments
• You are free to either use Jupyter Notebooks or a proper IDE like PyCharm or VSCode
• b-it has some GPU pools that we can use for assignments
Getting Started – Jupyter Notebooks
• It is essential that your chosen provider offers GPU support
• There are a few “free” providers out there:
  • Google Colab
  • Kaggle
  • More on the web?
Getting Started – Your preferred IDE
• Implement your code with your own IDE
• Access GPUs from b-it or your own “sick gaming rig”?
• Example IDEs:
  • PyCharm Pro (free for students)
  • VSCode
• Check https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html for how to configure a remote interpreter with PyCharm Pro
Getting Started – Assignments
• Assignments should be delivered as a .pdf file containing the written answers and/or code
• Format your code in a dedicated environment, see here for an example: https://www.overleaf.com/learn/latex/Code_listing
• Email your PDF file to bit.am4tm@gmail.com
Contact
Try this E-Mail address first: bit.am4tm@gmail.com
Tobias Deußer
tobias.deusser@iais.fraunhofer.de
Rafet Sifa
rafet.sifa@iais.fraunhofer.de
References I
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning,
Andrew Y Ng, and Christopher Potts (2013). “Recursive deep models for
semantic compositionality over a sentiment treebank”. In: Proc. EMNLP,
pp. 1631–1642.
Deußer, Tobias, Syed Musharraf Ali, Lars Hillebrand, Desiana Nurchalifah,
Basil Jacob, Christian Bauckhage, and Rafet Sifa (2022). “KPI-EDGAR: A Novel
Dataset and Accompanying Metric for Relation Extraction from Financial
Documents”. In: Proc. ICMLA (to be published). doi:
10.48550/arXiv.2210.09163.
Pielka, Maren, Rafet Sifa, Lars Patrick Hillebrand, David Biesner,
Rajkumar Ramamurthy, Anna Ladi, and Christian Bauckhage (2021). “Tackling
contradiction detection in german using machine translation and end-to-end
recurrent neural networks”. In: Proc. ICPR, pp. 6696–6701.
References II
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. (2020). “Language models are few-shot learners”. In: Proc.
NIPS, pp. 1877–1901.
Dayma, Boris, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham,
Phúc Lê Khắc, Luke Melas, and Ritobrata Ghosh (July 2021). DALL·E Mini. url:
https://github.com/borisdayma/dalle-mini.
Marcinkiewicz, Mary Ann (1994). “Building a large annotated corpus of English: The
Penn Treebank”. In: Using Large Corpora 273.
