Python NLTK
NLTK
Alberts Pumpurs
90% of the world's data was generated over the last two years
The common Internet user creates
Visual Textual
Instagram
Flickr
Vscocam
Facebook
Tumblr
Blogger
Twitter
Facebook
Emails
Customer Reviews
Detecting hidden signals
The world is full of unstructured, text-rich data: everything from emails to customer tweets.
The information buried in all that text holds the potential to deliver valuable business insights.
Text analytics is the practice of using technology to gather, store and mine textual information for hidden signals that can be used to inform smarter business decisions.
An explosion of unstructured data
Many types of organizations are experiencing explosive growth in their unstructured enterprise data.
At the same time, they have access to external sources of data such as social media, blogs, and mobile data.
Until now, much of this information passed through the organization virtually unanalyzed. Today, new tools for handling large amounts of complex data make it easier to squeeze value from such unlikely sources.
Text processing use cases
sentiment analysis
spam filtering
text categorization
topic detection
keyword frequency
plagiarism detection
document similarity
phrase extraction
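As one illustration of the use cases above, keyword frequency can be sketched with NLTK's FreqDist. This is a minimal sketch assuming NLTK is installed; the sample sentence and the whitespace split are invented simplifications, not part of the original slides.

```python
from nltk.probability import FreqDist

# Made-up sample text; lowercase and split on whitespace for simplicity.
text = "Big data needs big tools and big ideas"
tokens = text.lower().split()

# FreqDist counts how often each token occurs.
freq = FreqDist(tokens)
top_word, top_count = freq.most_common(1)[0]
print(top_word, top_count)
```

A real pipeline would usually tokenize properly and drop stop words before counting.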
Natural Language Toolkit
A leading platform for building Python programs to work with human language data
NLTK Features
sentence and word tokenization
text classification
corpora
parsing
clustering
part-of-speech tagging
text stemming
and much more
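To make one of the listed features concrete, here is a minimal stemming sketch, assuming NLTK is installed. It uses the Porter stemmer, one of several stemmers NLTK ships; the word list is invented for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up sample words that share two underlying stems.
words = ["running", "runs", "connected", "connections"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems are not always dictionary words; they are only meant to group related word forms together.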
Sentence tokenization
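A minimal sketch of sentence tokenization, assuming NLTK is installed. The usual entry point, nltk.sent_tokenize, relies on a pretrained Punkt model fetched with nltk.download; to stay runnable offline, this sketch uses an untrained PunktSentenceTokenizer instead, which knows no abbreviations. The sample text is made up.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Made-up sample text with three plain sentences.
text = ("NLTK is a Python library. It works with human language data. "
        "Is it easy to use?")

# An untrained tokenizer falls back on default punctuation rules.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

Text with abbreviations such as "Dr." is likely to be split incorrectly without the pretrained model.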
Word tokenization
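A minimal sketch of word tokenization, assuming NLTK is installed. nltk.word_tokenize is the common wrapper, but by default it first splits sentences with the pretrained Punkt model; the TreebankWordTokenizer used here needs no downloaded data. The sample sentence is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

# Made-up sample sentence with a contraction and punctuation.
sentence = "NLTK doesn't need much setup, does it?"

# Treebank-style tokenization separates punctuation and splits contractions.
tokens = TreebankWordTokenizer().tokenize(sentence)
print(tokens)
```

Note that the contraction is split into two tokens, which matters later for part-of-speech tagging.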
Part-of-speech tagging
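A minimal sketch of part-of-speech tagging, assuming NLTK is installed. The pretrained nltk.pos_tag tagger needs a model download, so this sketch instead trains a tiny UnigramTagger on two hand-labelled sentences (invented for illustration) and backs off to a DefaultTagger that labels unknown words as nouns.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-labelled training set (hypothetical, Penn Treebank style tags).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Unknown words fall back to the default tag "NN".
default_tagger = DefaultTagger("NN")
tagger = UnigramTagger(train_sents, backoff=default_tagger)

tagged = tagger.tag(["the", "cat", "barks", "loudly"])
print(tagged)
```

The unseen word "loudly" receives the fallback tag, which shows why a tagger trained on a large corpus is preferable in practice.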
Part-of-speech tagging explanation
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential “there”
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
nltk.help.upenn_tagset()  # prints the full tag set
Chunking and NER
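A minimal chunking sketch, assuming NLTK is installed. It groups an already tagged sentence into noun phrases with RegexpParser; the grammar and the sample sentence are illustrative choices. Named entity recognition with nltk.ne_chunk follows the same tree-based idea but needs pretrained data, so it is not shown here.

```python
from nltk.chunk import RegexpParser

# A sentence that has already been part-of-speech tagged.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the text of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```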
Text classification
Algorithms in NLTK
Naive Bayes
Maximum Entropy
Decision Tree
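A minimal sketch of text classification with the first algorithm in the list, Naive Bayes, assuming NLTK is installed. It is framed as sentiment analysis on six invented review snippets; the bag-of-words feature function and the "pos"/"neg" labels are assumptions made for illustration, not the author's setup.

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    """Bag-of-words features: each word present maps to True."""
    return {word: True for word in text.lower().split()}

# Tiny invented training set of labelled reviews.
train = [
    ("great product love it", "pos"),
    ("excellent service very happy", "pos"),
    ("love the great quality", "pos"),
    ("terrible product hate it", "neg"),
    ("awful service very slow", "neg"),
    ("hate the poor quality", "neg"),
]
train_set = [(word_features(text), label) for text, label in train]

classifier = NaiveBayesClassifier.train(train_set)

positive_guess = classifier.classify(word_features("great service love it"))
negative_guess = classifier.classify(word_features("awful quality hate it"))
print(positive_guess, negative_guess)
```

A training set this small only demonstrates the API; useful accuracy needs far more labelled text.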
Text classification
Sentiment analysis
https://github.com/pumpurs/SentimentWordsLV/
Document similarity detection
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is
a weight often used in information retrieval and text mining. This weight is a statistical
measure used to evaluate how important a word is to a document in a collection or
corpus.
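The tf-idf weight described above can be sketched with NLTK's TextCollection, assuming NLTK is installed. The three toy documents and the cosine similarity helper are invented for illustration; they show one common way to compare documents, not necessarily the method used in the original talk.

```python
import math

from nltk.text import TextCollection

# Three toy documents, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cats and dogs are pets".split(),
]
collection = TextCollection(docs)
vocabulary = sorted({word for doc in docs for word in doc})

def tfidf_vector(doc):
    """One tf-idf weight per vocabulary term for the given document."""
    return [collection.tf_idf(term, doc) for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(doc) for doc in docs]
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

The word "the" appears in every document, so its weight is zero and it contributes nothing to the similarity score.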
Similarity and concordance
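A minimal sketch of concordance and similarity lookups on an nltk.Text object, assuming a reasonably recent NLTK release. The token list is invented; on a real corpus these views are far more informative.

```python
from nltk.text import Text

# Made-up token sequence standing in for a real corpus.
tokens = ("text analytics turns raw text into signals "
          "and raw data into decisions").split()
text = Text(tokens)

# Print every occurrence of a word together with its surrounding context.
text.concordance("raw")

# The same matches as a list, for programmatic use.
matches = text.concordance_list("raw")

# Print words that appear in contexts similar to those of the given word.
text.similar("text")
```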
Dispersion Plot
“Market and product research”
“Social CMS”
1.97 billion social network users
“Customer profiling / analytics”
70% of marketers used Facebook to gain
6.7 million people blog on blogging sites
pumpurs.alberts@gmail.com
Big Data, Startups, Text Analysis, Internet of Things, Web Development
