N-grams in NLP

Overview
N-grams are contiguous sequences of words, symbols, or tokens in a document: the neighbouring sequences of items in a text. They are used most prominently in NLP (Natural Language Processing) tasks that deal with text data.

N-gram models are widely used in statistical natural language processing: in speech recognition over phonemes and sequences of phonemes, in machine translation and predictive text input, and in many other tasks whose modeling inputs are n-gram distributions.
What are n-grams?
N-grams are defined as contiguous sequences of n items extracted from a given sample of text or speech. Depending on the application, the items can be letters, words, or base pairs. N-grams are typically collected from a text or speech corpus (usually a large body of text).

N-grams can also be seen as sets of co-occurring words within a given window, computed by moving the window forward k words at a time (where k is 1 or more, most commonly 1).
The co-occurring words are called "n-grams," where "n" is the number of words considered in constructing each n-gram.
Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four words, 5-grams are five words, and so on.
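As a minimal sketch, this sliding-window extraction can be written in a few lines of plain Python (the function name extract_ngrams is our own, chosen for illustration):

    def extract_ngrams(tokens, n):
        # Slide a window of size n over the token list, one token at a time
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the valiant never taste of death".split()
    print(extract_ngrams(tokens, 2))
    # [('the', 'valiant'), ('valiant', 'never'), ('never', 'taste'),
    #  ('taste', 'of'), ('of', 'death')]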
Applications of N-grams
N-grams extracted from text are used extensively in the n-gram model in NLP, in text mining, and in other natural language processing tasks.
For example, when developing language models in natural language processing, n-grams are
used to develop not just unigram models but also bigram and trigram models.
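As a rough illustration of how bigram counts turn into such a model, here is a maximum-likelihood sketch on a toy corpus (the variable and function names below are ours):

    from collections import Counter

    corpus = "cowards die many times before their deaths".split()
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    # Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
    def bigram_prob(w1, w2):
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    print(bigram_prob("die", "many"))  # 1.0 on this tiny corpus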
Tech companies like Google and Microsoft have developed web-scale n-gram models that
can be used in a variety of NLP-related tasks such as spelling correction, word breaking, and
text summarization.
Another major use of n-grams is in developing features for supervised machine learning models such as SVMs, maximum-entropy (MaxEnt) models, and Naive Bayes. The main idea is to include tokens such as bigrams (and trigrams and higher-order n-grams) in the feature space instead of just unigrams.
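For instance, scikit-learn's CountVectorizer can build such a unigram-plus-bigram feature space directly; a small sketch with toy documents of our own:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["cowards die many times", "the valiant never taste of death"]
    # ngram_range=(1, 2) includes both unigrams and bigrams as features
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())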
How are n-grams Classified?
N-grams are classified into different types depending on the value that n takes. When n = 1, it is said to be a unigram. When n = 2, it is said to be a bigram. When n = 3, it is said to be a trigram. When n = 4, it is said to be a 4-gram, and so on.
Different types of n-grams suit different applications of the n-gram model in NLP, and we need to try different n-grams on the dataset to conclude confidently which one works best for the text corpus at hand.
Research also suggests that trigrams and 4-grams work best for spam filtering.
An Example of n-grams
Let us look at the example sentence "Cowards die many times before their deaths; the valiant never taste of death but once" and generate the n-grams associated with it.

Unigrams: These are simply the individual words of the sentence.
cowards, die, many, times, before, their, deaths, the, valiant, never, taste, of, death, but, once
Bigrams: These are the pairs of co-occurring words in the sentence, formed by sliding the window one word at a time in the forward direction to generate the next bigram.
cowards die, die many, many times, times before, before their, their deaths, deaths the, the valiant, valiant never, never taste, taste of, of death, death but, but once
Trigrams: These are the triples of co-occurring words in the sentence, again formed by sliding the window one word at a time in the forward direction to generate the next trigram.
cowards die many, die many times, many times before, times before their, before their deaths, their deaths the, deaths the valiant, the valiant never, valiant never taste, never taste of, taste of death, of death but, death but once
4-grams: Here the window covers combinations of four consecutive words.
cowards die many times, die many times before, many times before their, times before their deaths, before their deaths the, their deaths the valiant, deaths the valiant never, the valiant never taste, valiant never taste of, never taste of death, taste of death but, of death but once
Similarly, we can pick n > 4 and generate 5-grams, and so on.
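The lists above can also be reproduced with NLTK's ngrams utility; a quick sketch, assuming a simple whitespace tokenization with punctuation removed:

    from nltk.util import ngrams

    sentence = ("cowards die many times before their deaths "
                "the valiant never taste of death but once")
    tokens = sentence.split()
    for n in (1, 2, 3, 4):
        print(n, [" ".join(gram) for gram in ngrams(tokens, n)])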
