Python for Data Science - TDC 2015
PYTHON FOR
DATA SCIENCE
Gabriel Moreira
Machine Learning Engineer
@gspmoreira
2015
Why so much buzz?
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Big Data
ONLINE PERSONALIZATION
WHAT IS DATA SCIENCE
http://drewconway.com
WHAT IS A DATA SCIENTIST
A Data Scientist is someone with deliberate dual personality who can first
build a curious business case defined with a telescopic vision and can then dive
deep with microscopic lens to sift through DATA to reach the goal while
defining and executing all the intermittent tasks.
http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist
http://nirvacana.com/thoughts/becoming-a-data-scientist/
Data Science MetroMap Curriculum
TYPES OF ANALYTICS
Investigative Analytics - Consumers: Humans
Operational Analytics - Consumers: Machines
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/
[Hilary Mason, Data Scientist]
Inquire
Obtain
Scrub
Explore
Model
iNterpret
DATA SCIENCE IS IOSEMN
Inquire
Obtain
Scrub
Explore
Model
iNterpret
PYTHON IS IOSEMN
js
Outsider
ANALYTICS CASE

CORPORATE SOCIAL NETWORKS
Full Data Analysis demo available in IPython Notebook
bit.ly/python4ds_nb
Investigative Analytics
Consumers: Humans
INQUIRE
1. Which communities are more popular?
2. Is the user engagement increasing?
3. What is the distribution of publishing time?
4. What is the distribution of user interactions?
5. Is there a relationship between publishing hour and number of interactions?
OBTAIN
• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)
READING INTERACTIONS FROM CSV
READING POSTS FROM JSON LINES
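A minimal sketch of both reads, using only the standard library. The column names, field names, and sample records below are assumptions made for illustration; they are not taken from the talk's dataset.

```python
import csv
import io
import json

# Hypothetical sample data standing in for the real files.
INTERACTIONS_CSV = """post_id,user_id,interaction_type
1,alice,like
1,bob,comment
2,alice,like
"""

POSTS_JSONL = """{"post_id": 1, "community": "python", "published_at": "2015-03-02T09:15:00"}
{"post_id": 2, "community": "devops", "published_at": "2015-03-02T14:40:00"}
"""


def read_interactions(fileobj):
    """Read a CSV with a header row into a list of dicts (all values are strings)."""
    return list(csv.DictReader(fileobj))


def read_posts(fileobj):
    """Read JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in fileobj if line.strip()]


interactions = read_interactions(io.StringIO(INTERACTIONS_CSV))
posts = read_posts(io.StringIO(POSTS_JSONL))
print(len(interactions), "interactions,", len(posts), "posts")
```

With real files, the `io.StringIO(...)` objects would be replaced by `open(path, newline="")` for the CSV and `open(path)` for the JSON Lines file.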
SCRUB
Dealing with nulls
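One possible way to deal with nulls, sketched with plain dictionaries: drop rows that lack a required field and fill the remaining gaps with defaults. The field names, the default values, and the `scrub` helper itself are hypothetical.

```python
def scrub(rows, required=("post_id",), defaults=None):
    """Drop rows missing a required field; fill other nulls with defaults."""
    defaults = defaults or {}
    cleaned = []
    for row in rows:
        # Treat None and empty or whitespace-only strings as null.
        row = {
            key: (None if value is None or str(value).strip() == "" else value)
            for key, value in row.items()
        }
        if any(row.get(key) is None for key in required):
            continue  # cannot repair a row without its required fields
        for key, default in defaults.items():
            if row.get(key) is None:
                row[key] = default
        cleaned.append(row)
    return cleaned


# Hypothetical raw rows with an empty string, a None, and a missing id.
raw = [
    {"post_id": "1", "community": "python", "likes": "3"},
    {"post_id": "2", "community": "", "likes": None},
    {"post_id": None, "community": "devops", "likes": "1"},
]
clean = scrub(raw, required=("post_id",), defaults={"community": "unknown", "likes": "0"})
print(clean)
```

Whether to drop or fill a null depends on the question being asked; filling `likes` with `"0"` is only reasonable if a missing value really means "no likes".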
1 - WHICH COMMUNITIES ARE MORE POPULAR?
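A small sketch of how popularity could be measured, assuming "popular" means either the number of posts or the number of interactions per community. The records and field names are made up for illustration.

```python
from collections import Counter

# Hypothetical posts and interactions.
posts = [
    {"post_id": 1, "community": "python"},
    {"post_id": 2, "community": "devops"},
    {"post_id": 3, "community": "python"},
]
interactions = [
    {"post_id": 1}, {"post_id": 1}, {"post_id": 2},
    {"post_id": 3}, {"post_id": 3}, {"post_id": 3},
]

# Look up each interaction's community through its post.
community_of = {post["post_id"]: post["community"] for post in posts}

posts_per_community = Counter(post["community"] for post in posts)
interactions_per_community = Counter(
    community_of[interaction["post_id"]] for interaction in interactions
)

# most_common() returns (community, count) pairs, largest count first.
for community, count in interactions_per_community.most_common():
    print(community, count)
```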
2 - IS USER ENGAGEMENT INCREASING?
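One way to approach this question is to count interactions per calendar month and look at the month-over-month change. The timestamps below are invented, and the ISO 8601 format is an assumption about how the data is stored.

```python
from collections import Counter
from datetime import datetime

# Hypothetical interaction timestamps in ISO 8601 format.
timestamps = [
    "2015-01-10T09:00:00",
    "2015-01-20T15:30:00",
    "2015-02-03T11:00:00",
    "2015-02-14T08:45:00",
    "2015-02-27T17:10:00",
]


def interactions_per_month(stamps):
    """Return sorted (YYYY-MM, count) pairs."""
    counts = Counter(datetime.fromisoformat(s).strftime("%Y-%m") for s in stamps)
    return sorted(counts.items())


monthly = interactions_per_month(timestamps)

# Difference between each month and the previous one.
changes = [
    (month, count - previous_count)
    for (_, previous_count), (month, count) in zip(monthly, monthly[1:])
]
print(monthly)
print(changes)
```

Raw counts also grow when the user base grows, so dividing by the number of active users per month would give a fairer view of engagement.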
3 - WHAT IS THE DISTRIBUTION OF PUBLISHING TIME?
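A sketch of an hour-of-day histogram, assuming each post carries an ISO 8601 publishing timestamp. The sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical publishing timestamps.
published_at = [
    "2015-03-02T09:15:00",
    "2015-03-02T09:50:00",
    "2015-03-02T14:40:00",
    "2015-03-03T09:05:00",
    "2015-03-03T18:20:00",
]


def hour_histogram(stamps):
    """Return a list of 24 counts, one per hour of the day."""
    counts = Counter(datetime.fromisoformat(s).hour for s in stamps)
    return [counts.get(hour, 0) for hour in range(24)]


hist = hour_histogram(published_at)

# Text-only bar chart; hours without posts are skipped.
for hour, count in enumerate(hist):
    if count:
        print(f"{hour:02d}h {'#' * count}")
```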
4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
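A sketch that summarizes how interactions are spread across posts. The input is a hypothetical list with one post id per interaction; posts that received no interactions at all do not appear in it and would have to be added separately.

```python
import statistics
from collections import Counter

# Hypothetical data: the post id of each recorded interaction.
interaction_post_ids = [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]

# Number of interactions received by each post.
per_post = Counter(interaction_post_ids)
counts = sorted(per_post.values())

summary = {
    "posts": len(counts),
    "min": counts[0],
    "median": statistics.median(counts),
    "mean": statistics.mean(counts),
    "max": counts[-1],
}
print(summary)
```

A mean well above the median, as in this toy sample, is a hint of a skewed distribution in which a few posts attract most of the interactions.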
5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND
NUMBER OF INTERACTIONS?
http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/
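Two simple ways to probe the relationship between publishing hour and interactions, sketched with invented data: the average number of interactions per publishing hour, and a Pearson correlation coefficient. Pearson only captures linear trends, and the hour of day is cyclic, so the per-hour averages are usually the more informative of the two.

```python
import math
from collections import defaultdict

# Hypothetical (publishing hour, number of interactions) pairs, one per post.
posts = [(8, 5), (9, 9), (10, 7), (14, 4), (18, 2), (22, 1), (9, 11)]


def mean_by_hour(pairs):
    """Average number of interactions for each publishing hour."""
    buckets = defaultdict(list)
    for hour, interactions in pairs:
        buckets[hour].append(interactions)
    return {hour: sum(ns) / len(ns) for hour, ns in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


by_hour = mean_by_hour(posts)
r = pearson([hour for hour, _ in posts], [n for _, n in posts])
print(by_hour)
print(round(r, 2))
```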
Operational Analytics
Consumers: Machines
1. Discover the most relevant words in the posts
2. Find related posts, with similar content
Operational Analytics Tasks example
Find Related Posts
1 - RELEVANT WORDS IN A POST
TF-IDF - The most “relevant” terms in a document are those that are
frequent in that document and rare in other documents
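The TF-IDF idea can be sketched in a few lines of standard-library Python. This is the plain textbook variant (relative term frequency times the log of inverse document frequency); libraries such as scikit-learn apply smoothed and normalized variants, so their numbers will differ. The whitespace tokenizer and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical mini corpus.
docs = [
    "ruby jmeter ruby tests",
    "jmeter performance tests",
    "python data tests",
]
vectors = tf_idf(docs)
top_term = max(vectors[0], key=vectors[0].get)
print(top_term, vectors[0])
```

In the first document, "ruby" scores highest because it is frequent there and absent elsewhere, while "tests" scores zero because it appears in every document.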
BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]
2 - SIMILAR POSTS
Cosine Similarity
A measure of similarity between two vectors: the cosine of the angle between them.
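For two vectors a and b, the cosine of the angle between them is their dot product divided by the product of their lengths. A minimal sketch for sparse vectors stored as `{term: weight}` dictionaries (a representation chosen here for illustration):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term-weight vectors for three posts.
post_a = {"jmeter": 0.6, "ruby": 0.8}
post_b = {"jmeter": 0.5, "performance": 0.5}
post_c = {"python": 1.0}

print(cosine_similarity(post_a, post_b))
print(cosine_similarity(post_a, post_c))
```

With non-negative weights such as TF-IDF scores, the result lies between 0 (no shared terms) and 1 (same direction), regardless of document length.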
Original Post
Did you ever wonder how great it would be if you could write your jmeter
tests in ruby ?This projects aims to do so. If you use it on your project just
let me now. On the Architecture Academy you can read how jmeter can
be used to validate your Architecture. modulo 13 arch definition
architecture validation | academia de arquitetura



Most similar post (cosine similarity = 0.30)

Several how-tos related to performance testing have been made available on
the Enterprise Architecture site, in the performance Knowledge Base section.
Among them: how to define requirements (throughput, calculating threads for
JMeter, etc.), using JMeter, generating test data, and monitoring.
planning and executing performance testing |
enterprise architecture - how to identify performance acceptance criteria |
enterprise architecture - how to geracao de massa de dados | enterprise
architecture - how to jmeter | enterprise architecture - how to
monitoramento | enterprise architecture
SIMILAR PEOPLE!
INTERPRET
• Drawing conclusions from your data
• Evaluating what your results mean
• Communicating your results
DATA PRODUCTS
“If information has context and the context is
interactive, insights are not predictable."
[Agile Data Science, O’Reilly, 2014]
SENTIMENT ANALYSIS
bit.ly/eleicoes2014debatesbt
Analytical Dashboard
NETWORK ANALYSIS
https://linkedjazz.org/network/
What about Python for Big Data?
PYTHON ON HADOOP
Streaming
HADOOPY
Pig UDFs in Jython
HADOOP STREAMING
Hadoop Streaming - Lets you write MapReduce jobs as any
executable script, including Python
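With Hadoop Streaming, the mapper and the reducer are programs that read lines from standard input and write tab-separated key/value lines to standard output; the framework sorts the mapper output by key before the reducer sees it. A word-count sketch of that contract, written as two functions plus a hypothetical `run_local` helper that imitates the sort step so the logic can be tried without a cluster:

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Local stand-in for the job: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for output_line in run_local(["Python on Hadoop", "hadoop streaming with python"]):
        print(output_line)
```

In a real job, the two functions would live in separate scripts that iterate over `sys.stdin` and print their results, and those scripts would be passed to the streaming jar as the mapper and the reducer.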
http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/
K-Means with Python on MapReduce
140,000 lightning strikes on 28/02/2014, in 137 data files
Running on Amazon Elastic Map Reduce
•Instances: 10 m1.small
•Time (k=10): 10 iterations => 32 minutes
•Time (k=50): 50 iterations => 164 minutes
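The timings above work out to roughly 3.2 to 3.3 minutes per iteration (32 / 10 and 164 / 50), which fits a design where each k-means iteration is one MapReduce job. The job's actual code is not shown in the slides; the sketch below only illustrates how one iteration splits into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points of each centroid). The 2-D sample points and all function names are hypothetical.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(assignments, centroids):
    """Reduce: move each centroid to the mean of its assigned points."""
    groups = defaultdict(list)
    for centroid_id, point in assignments:
        groups[centroid_id].append(point)
    updated = list(centroids)  # centroids without points keep their position
    for centroid_id, members in groups.items():
        updated[centroid_id] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centroids, iterations):
    """Run a fixed number of map/reduce rounds."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


# Hypothetical 2-D coordinates and starting centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)], iterations=1)
print(centroids)
```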
IS DATA SCIENTIST THE
NEW WEBMASTER?
[Doing Data Science, O’Reilly, 2014]
DATA SCIENCE COURSES
• Introduction to Data Science (Univ. of Washington)
• Data Science specialization (Johns Hopkins)
• Intro to Hadoop and MapReduce (Cloudera)
• Machine Learning (Stanford)
• Statistical Learning (Stanford)
• Mining Massive Datasets (Stanford)
• Scalable Machine Learning (Berkeley)
http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/
BOOKS
Happy data geeking!
Gabriel Moreira
@gspmoreira
http://about.me/gspmoreira
Thank you!
2015
PYTHON FOR DATA SCIENCE
Slides: http://bit.ly/python4ds_tdc
