Natural Language Processing Basics
Tobias Deußer, Dr. Rafet Sifa
tobias.deusser@iais.fraunhofer.de
10/11/2022
Advanced Methods for Text Mining
Agenda
1. What is Natural Language Processing
2. Preprocessing – Theory
3. Getting Started
Natural Language Processing – The Wikipedia Definition
From en.wikipedia.org/wiki/Natural_language_processing
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language
data. The goal is a computer capable of “understanding” the contents of documents, including
the contextual nuances of the language within them. The technology can then accurately extract
information and insights contained in the documents as well as categorize and organize the
documents themselves.
Modern NLP in a nutshell
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Downstream tasks
• Classification, e.g. sentiment or ratings
• Information extraction
  • Named Entity Recognition (see the sketch below)
  • Relation Extraction
• Natural Language Inference (NLI)
• Text generation
• Image generation
• …
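As a small taste of one of these tasks, here is a minimal, hedged sketch of Named Entity Recognition with spaCy; spaCy and its en_core_web_sm model are assumptions for illustration, not part of the lecture:

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Fraunhofer IAIS is located in Sankt Augustin near Bonn.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # predicted entity spans with labels such as ORG or GPE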
Downstream tasks – Sentiment Analysis
Figure 1: Classifying text into sentiment, figure taken from Socher et al. 2013
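In practice, such a sentiment classifier can be obtained from a pretrained model; a minimal sketch using the Hugging Face transformers pipeline (the default model it downloads is an assumption, not the model from the figure):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained English model
print(classifier("This movie was surprisingly good!"))
# expected form: [{'label': 'POSITIVE', 'score': 0.99...}]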
Downstream tasks – Information Extraction
Example from Deußer et al. 2022
“In 2021 and 2020 the total net revenue[kpi] was $100[cy] million and $80[py] million, respectively.”
Extracted pairs: (total net revenue[kpi] − 100[cy]), (total net revenue[kpi] − 80[py])
Downstream tasks – Natural Language Inference
Example from Pielka et al. 2021
“The man is wearing an orange and black polo shirt and is kneeling with his
lunch box in one hand while holding a banana in his other hand.”
“A man is wearing a green t shirt while holding a banana.”
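The task is to decide whether the second sentence follows from, contradicts, or is neutral with respect to the first. A hedged sketch with a publicly available MNLI-finetuned checkpoint; the model name roberta-large-mnli is an assumption, not the model used in Pielka et al. 2021:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = ("The man is wearing an orange and black polo shirt and is kneeling with his "
           "lunch box in one hand while holding a banana in his other hand.")
hypothesis = "A man is wearing a green t shirt while holding a banana."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
for idx, label in model.config.id2label.items():
    print(label, round(float(probs[idx]), 3))  # the contradiction class is expected to dominate here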
Downstream tasks – Text Generation
Figure 2: Generating text from a prompt by leveraging GPT-3 (Brown et al. 2020) on
https://beta.openai.com/playground
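GPT-3 itself is only accessible through the OpenAI API, but the same idea can be sketched locally with a small open model; a minimal example with GPT-2 via transformers (model choice and prompt are illustrative assumptions):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small open model, not GPT-3
print(generator("Advanced Methods for Text Mining is a course about", max_length=40))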
Downstream tasks – Image Generation
Figure 3: Generating images from a prompt by leveraging DALL·E mini (Dayma et al. 2021) on
https://www.craiyon.com/
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Generating text embeddings 1
Generating text embeddings 2
• Text, i.e. strings, cannot be used in ML models without vectorization.
• We need an embedding model to convert text to a numerical representation.
• Examples of such embedding models (see the tf-idf sketch below):
  • tf-idf
  • GloVe
  • Word2Vec
  • BERT
  • …
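A minimal vectorization sketch with tf-idf; using scikit-learn here is an assumption (the slides do not prescribe a library), and the toy corpus is made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Good muffins cost $3.88 in New York.",
    "The dish was not tasty at all.",
    "In 2021 the total net revenue was $100 million.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (documents, vocabulary size)
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])  # the learned vocabulary terms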
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Text preprocessing – Sentence Tokenization
• Sentence tokenization describes the process of splitting a text corpus into sentences.
• Obviously, the key character to look for is the period (.) character and other punctuation marks, i.e. ;:!?
• However, simply splitting whenever it occurs might lead to wrong results, because:
  • The decimal separator occurs frequently in written text, e.g.: 3.14.
  • Abbreviations are shortened with a period, e.g. abbr. for abbreviation.
  • Many more, see en.wikipedia.org/wiki/Full_stop
• A lot of implementations exist (see the sketch below):
  • Regex approach: github.com/mediacloud/sentence-splitter
  • Unsupervised approach: github.com/nltk/nltk
  • …
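A minimal sketch of the unsupervised NLTK approach; the example text is made up, and the pretrained Punkt model must be downloaded once:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # pretrained, unsupervised Punkt sentence tokenizer
text = "Dr. Smith paid $3.14 for a coffee in New York. He was not amused."
print(sent_tokenize(text))
# typically: ['Dr. Smith paid $3.14 for a coffee in New York.', 'He was not amused.']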
Text preprocessing – Word Tokenization
• Word tokenization describes the process of splitting a sentence (or any text) into words.
• Here, we mainly look for spaces and have to take care of punctuation marks.

from nltk.tokenize import word_tokenize  # requires the 'punkt' model: nltk.download('punkt')
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
Listing 1: Example from the nltk documentation
Text preprocessing – Subword Tokenization
• Modern NLP approaches (e.g. Transformers like BERT) tokenize the input a step further, splitting unknown (i.e., not in the vocabulary) words into meaningful subwords, but preserving frequently used words.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize('''Don't you love wasting GPU capacity by training transformer models for your Advanced Method in Text Mining course?''')
['don', "'", 't', 'you', 'love', 'wasting', 'gp', '##u', 'capacity', 'by', 'training', 'transform', '##er', 'models', 'for', 'your', 'advanced', 'method', 'in', 'text', 'mining', 'course', '?']
Listing 2: Example using the BertTokenizer from the transformers package
Text preprocessing – Part-Of-Speech Tagging
• Part-Of-Speech, or POS, tagging refers to the process of assigning “word classes” to each token.
• These “word classes” are the ones you probably learnt in elementary school: nouns, verbs, adjectives, adverbs, …
• Methods (see the sketch below):
  • Dictionary lookup, e.g. Penn Treebank tagset (Marcinkiewicz 1994)
  • Hidden Markov models
  • Supervised models
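A short sketch using NLTK's pretrained averaged-perceptron tagger (a supervised model) that emits Penn Treebank tags; the example sentence is made up:

import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # pretrained supervised tagger
tags = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(tags)  # list of (token, Penn Treebank tag) pairs, e.g. ('fox', 'NN'), ('lazy', 'JJ')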
Text preprocessing – Stop Words Removal 1
• Stop words are the most common words (like articles, prepositions, pronouns, conjunctions, etc.) and often do not add that much information to a sentence

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', ...]
Listing 3: Stop Words from the NLTK package
Text preprocessing – Stop Words Removal 2
• But there are issues with this … here, the negation “not” is itself a stop word, so removing stop words flips the apparent sentiment of the sentence.

from nltk.corpus import stopwords
sentence = 'The dish was not tasty at all'
[word for word in sentence.split() if word.lower() not in stopwords.words('english')]
['dish', 'tasty']
Listing 4: Removing “unimportant” words
Text preprocessing – Stemming 1
• Stemming normalizes words to their base stem
• Examples:
  • “liked” becomes “like”
  • “birds” becomes “bird”
  • “itemization” becomes “item”
• Thus, these words are treated similarly
• However, there are problems with this…
Text preprocessing – Stemming 2
• Overstemming – when the algorithm stems words to the same root even though they are unrelated
• Understemming – the opposite

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
words_to_be_stemmed = ['liked', 'birds', 'itemization', 'universal', 'university', 'universe', 'alumnus', 'alumni']
[stemmer.stem(word) for word in words_to_be_stemmed]
['like', 'bird', 'item', 'univers', 'univers', 'univers', 'alumnu', 'alumni']
Listing 5: Examples of correct and incorrect stemming
Text preprocessing – Lemmatization
• Lemmatization is very similar to stemming, but tries to incorporate context and aligns words with their lemma
• Often leverages POS tags
• Usually preferred to stemming (contextual analysis vs. hard-coded rules)
• Examples (reproduced in the sketch below):
  • “went” becomes “go”
  • “better” becomes “good”
  • “meeting” becomes either “meet” or “meeting”, based on the POS tag
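One way to reproduce these examples is NLTK's WordNetLemmatizer, which takes a simplified POS tag as a hint; the WordNet data must be downloaded once:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lexical database used for the lemma lookup

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))     # 'go'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("meeting", pos="v"))  # 'meet'
print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting'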
The usual NLP workflow / pipeline
1. Data gathering
2. Text parsing
3. Text preprocessing
4. Vectorization / featurization / embedding
5. A downstream task
Data Gathering & Text Parsing
• Data sources:
  • Web scraping
  • PDFs, Word files, Excel files, scanned documents, …
  • Public datasets
  • Your customers
  • …
• Parse the raw data into your preferred data structure (see the sketch below)
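A minimal sketch of gathering raw HTML and parsing it into plain text; the requests and beautifulsoup4 packages and the example URL are assumptions for illustration:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Natural_language_processing", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # preferred data structure: list of paragraphs
print(paragraphs[:3])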
Getting Started – A few things first
• We will use Python for examples & assignments
• You are free to either use Jupyter Notebooks or a proper IDE like PyCharm or VSCode
• b-it has some GPU pools that we can use for assignments
Getting Started – Jupyter Notebooks
• It is essential that your chosen provider offers GPU support
• There are a few “free” providers out there:
  • Google Colab
  • Kaggle
  • More on the web?
Getting Started – Your preferred IDE
• Implement your code with your own IDE
• Access GPUs from b-it or your own “sick gaming rig”?
• Example IDEs:
  • PyCharm Pro (free for students)
  • VSCode
• Check https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html for how to configure a remote interpreter with PyCharm Pro
Getting Started – Assignments
• Assignments should be delivered as a .pdf file containing the written answers and/or code
• Format your code in a dedicated environment, see here for an example: https://www.overleaf.com/learn/latex/Code_listing
• Email your PDF file to bit.am4tm@gmail.com
Contact
Try this E-Mail address first: bit.am4tm@gmail.com
Tobias Deußer
tobias.deusser@iais.fraunhofer.de
Rafet Sifa
rafet.sifa@iais.fraunhofer.de
References I
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning,
Andrew Y Ng, and Christopher Potts (2013). “Recursive deep models for
semantic compositionality over a sentiment treebank”. In: Proc. EMNLP,
pp. 1631–1642.
Deußer, Tobias, Syed Musharraf Ali, Lars Hillebrand, Desiana Nurchalifah,
Basil Jacob, Christian Bauckhage, and Rafet Sifa (2022). “KPI-EDGAR: A Novel
Dataset and Accompanying Metric for Relation Extraction from Financial
Documents”. In: Proc. ICMLA (to be published). doi:
10.48550/arXiv.2210.09163.
Pielka, Maren, Rafet Sifa, Lars Patrick Hillebrand, David Biesner,
Rajkumar Ramamurthy, Anna Ladi, and Christian Bauckhage (2021). “Tackling
contradiction detection in german using machine translation and end-to-end
recurrent neural networks”. In: Proc. ICPR, pp. 6696–6701.
References II
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. (2020). “Language models are few-shot learners”. In: Proc.
NIPS, pp. 1877–1901.
Dayma, Boris, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham,
Phúc Lê Khắc, Luke Melas, and Ritobrata Ghosh (July 2021). DALL·E Mini. url:
https://github.com/borisdayma/dalle-mini.
Marcinkiewicz, Mary Ann (1994). “Building a large annotated corpus of English: The
Penn Treebank”. In: Using Large Corpora 273.
