Python for Data Science
Data Science:
Getting Value out of Data
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you will be able to:
• Describe what modern data science is
• Explain why data science is the key to getting value out of data and where the growing interest in it comes from
• List a recommended set of skills for a data scientist
[Diagram: Big Data, Data Science, Insight, Action]
Insight Data Product
Analysis
Data Insight
Question
Book Recommendations
Data: customer demographic, previous purchases, book reviews
Question: What kind of books does this customer like?
Insight: book recommendations
Find Potential Audience for a Book
Model of customer’s book preferences
Who is likely to like this book?
New book information
Market a New Book
Insight: Who is likely to like this book?
Action: Market the book to the right audience
Actionable Information
Historical data Near real-time data
Prediction
Action
Prediction
Why the Increased Interest in
Data Science?
New era of
data science!
BIG DATA COMPUTING AT SCALE
Many dynamic data-driven applications
• Computer-Aided Drug Discovery
• Smart Cities
• Disaster Resilience and Response
• Smart Manufacturing
• Personalized Precision Medicine
• Smart Grid and Energy Management
How Much Data Is Big Data?
Image Source: http://www.marketwatch.com/story/one-chart-shows-everything-that-happens-on-the-internet-in-just-one-minute-2016-04-26
Every minute…
204 Million emails
200,000 photos
1.8 Million likes
2.78 Million video views
72 hours of video uploads
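To get a feel for these per-minute figures, they can be scaled up to a full day. The following is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative.

```python
# Per-minute internet activity figures quoted on the slide.
per_minute = {
    "emails": 204_000_000,
    "photos": 200_000,
    "likes": 1_800_000,
    "video views": 2_780_000,
    "hours of video uploaded": 72,
}

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes

# Scale each per-minute rate to a full day.
per_day = {name: count * MINUTES_PER_DAY for name, count in per_minute.items()}

for name, count in per_day.items():
    print(f"{name}: {count:,} per day")

# 72 hours of video arriving every minute means uploads outpace
# real-time playback by a factor of 72 * 60.
upload_speedup = per_minute["hours of video uploaded"] * 60
print(f"Video arrives {upload_speedup:,}x faster than it can be watched")
```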
Scientific Data Management and Analysis
• HPWREN: hpwren.ucsd.edu
• 30 TB of data annually
• MODIS: modis.gsfc.nasa.gov
• 219 TB of data annually
• Precision Medicine
• 4 EB (1 EB = 10^18 bytes) of data in 2016 (www.fastcompany.com)
• LIGO, Deep Space Network, Protein Data Bank, …
100 MB ~= a couple of volumes of an encyclopedia
A DVD ~= 5 GB
1 TB ~= 300 hours of good-quality video
LHC ~= 15 PB a year
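The rules of thumb above can be combined with a little arithmetic. This sketch assumes decimal (SI) units, which the slide does not state explicitly, and the variable names are illustrative.

```python
# Decimal (SI) byte units; an assumption, since the slide does not say
# whether decimal or binary prefixes are meant.
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

dvd = 5 * GB            # "A DVD ~= 5 GB"
lhc_per_year = 15 * PB  # "LHC ~= 15 PB a year"

# How many DVDs would one year of LHC data fill?
dvds_per_lhc_year = lhc_per_year // dvd
print(f"{dvds_per_lhc_year:,} DVDs")

# Hours of good-quality video equivalent to one year of LHC data,
# using the rule of thumb of 300 hours per TB.
video_hours = lhc_per_year // TB * 300
print(f"{video_hours:,} hours of video")
```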
Exponential data growth!
Data Deluge
“We are drowning in
information and
starving for knowledge”
– John Naisbitt
Source: Megatrends, 1982
Image Source: http://www.digitalzenway.com/2011/12/data-diet-a-resolution-you-can-stick-to/
How do we find the connections?
Modern Data Science Skills
• Programming in Python
• Statistics
• Machine Learning
• Scalable Big Data Analysis
Data Science
The sum is bigger than the parts!
Big Data → Actionable Insight
Modern Data Science Skills:
• Python Programming
• Statistical Analysis
• Machine Learning
• Scalable Big Data Analysis
Python for Data Science
The Role of Python
Programming in Data Science
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you will be able to:
• List some of the traits of modern data scientists
• Explain why Python is a good programming language for data science
• Recite four major Python modules that are useful for data analysis
Are data scientists unicorns?
Data science is a team sport!
Data scientists…
Have passion for data
Relate problems to analytics
Care about engineering solutions
Exhibit curiosity
Communicate with teammates
http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
Why Python for Data Science?
• Easy to read and learn
• Vibrant community
• Growing and evolving set of libraries
• Data management
• Analytical processing
• Visualization
• Applicable to each step in the data science process
• Notebooks
What to look forward to!
• Jupyter notebooks
• NumPy
• Pandas
• Matplotlib
• Scikit-Learn
• BeautifulSoup
Python for Data Science
Case Study:
Soccer Data Analysis
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you will be able to:
• Talk about the “Big Picture” of data science through a soccer case study
• Generate statistics about a soccer data set
• Summarize how data cleaning and correlations were applied to an existing dataset
• Recite the data visualization techniques employed in this study
• Explain how clustering similar groups and plotting these clusters helped the case study
• Recall what was used to draw conclusions based on the data analysis
Week 1 Case Study: Soccer Data Analysis
Dataset location: https://www.kaggle.com/hugomathien/soccer
• Form meaningful player groups
• Discover other players that are similar to your favorite athlete
• Form strong teams by using analytics
Understanding the Benefits
[Diagram: Data, Data Science, Insight, Action]
Ask yourself:
“What insights do I expect to get?”
INSIGHTS
• Better understanding and insights on
• player strengths
• enhancing performance
• critical attributes for a player’s performance
ACTIONS
• Coach can design programs that improve these areas in teams
Basic Steps in a Data Science Project
ACQUIRE • Import raw dataset into your analytics platform
PREPARE • Explore & Visualize
• Perform Data Cleaning
ANALYZE • Feature Selection
• Model Selection
• Analyze the results
REPORT • Present your findings
ACT • Use them
Data Collection from Diverse Sources
• Databases
• Relational
• Non-relational (NoSQL)
• Text files
• CSV files
• Text files
• Live feeds
• Sensors
• Online Platforms
• Twitter
• Live feeds of weather observations
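As a rough illustration of pulling data from two of these source types, the sketch below reads a CSV text source and a relational (SQLite) source using only the Python standard library, then joins them. The table, column names, and values are hypothetical stand-ins, not the case study's actual schema.

```python
import csv
import io
import sqlite3

# --- Text source: CSV (an in-memory string stands in for a file) ---
csv_text = "player,sprint_speed\nA,80\nB,65\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# --- Relational source: a SQLite database (in-memory for this sketch) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, shot_power INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?)", [("A", 70), ("B", 85)])
db_rows = conn.execute(
    "SELECT name, shot_power FROM player ORDER BY name"
).fetchall()
conn.close()

# Combine both sources into one record per player.
shot_power = dict(db_rows)
players = [
    {
        "player": row["player"],
        "sprint_speed": int(row["sprint_speed"]),  # CSV values arrive as text
        "shot_power": shot_power[row["player"]],
    }
    for row in csv_rows
]
print(players)
```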
Data Ingestion to Analytics Platform
Data Preparation: Explore using Statistics
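One possible way to explore a dataset with summary statistics and correlations is with pandas, sketched below. The column names and values are hypothetical, and the real dataset's columns may differ.

```python
import pandas as pd

# Hypothetical player attributes; the real dataset's columns may differ.
df = pd.DataFrame({
    "sprint_speed": [80, 65, 90, 72],
    "shot_power": [70, 85, 60, 77],
})

print(df.shape)           # (number of rows, number of columns)

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)

correlations = df.corr()  # pairwise Pearson correlation between columns
print(correlations)
```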
Data Cleaning
• Why do we need to clean data?
• Missing entries
• Garbage values
• NULLs
• How do we clean data?
• Remove the entries
• Impute (fill in) these entries with a substitute value
• E.g., the average value of the column
• E.g., assign 0, -1, etc.
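The cleaning options listed above can be sketched with pandas as follows. The data and column names are hypothetical, and which option is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries (NaN marks a missing value).
df = pd.DataFrame({
    "agility": [70.0, np.nan, 90.0],
    "reaction_time": [60.0, 80.0, np.nan],
})

# Option 1: remove the entries (rows) that contain missing values.
dropped = df.dropna()

# Option 2a: impute with the average value of each column.
mean_filled = df.fillna(df.mean())

# Option 2b: impute with a fixed sentinel such as 0 or -1.
sentinel_filled = df.fillna(-1)

print(dropped, mean_filled, sentinel_filled, sep="\n\n")
```

Dropping rows is the simplest option but discards data; imputing keeps every row at the cost of introducing values that were never observed.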
Data Visualization
A picture is worth a thousand words
Convey more in less space and time
Use Graphs when possible
http://seaborn.pydata.org/examples/
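As a small illustration of conveying data in a graph, here is a scatter-plot sketch. It uses Matplotlib rather than the seaborn gallery linked above, and the attribute values are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical attribute values for a handful of players.
sprint_speed = [80, 65, 90, 72, 85]
shot_power = [70, 85, 60, 77, 66]

fig, ax = plt.subplots()
ax.scatter(sprint_speed, shot_power)
ax.set_xlabel("Sprint speed")
ax.set_ylabel("Shot power")
ax.set_title("Shot power vs. sprint speed")

# Write the figure as PNG into an in-memory buffer instead of a file.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```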
Analysis and Modeling
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
Image Source: https://jixta.wordpress.com/2015/07/17/machine-learning-algorithms-mindmap/
scikit-learn for Machine Learning in Python
http://scikit-learn.org
Soccer Data Analysis: Feature Selection
• What are the intrinsic attributes on which ‘you’ would group players?
Intrinsic: Agility, Reaction Time, Shot Power, Sprint Speed
Not intrinsic: Hair Style, Movies the player likes
• You can also build complex features
f(shot power, reaction time)
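Feature selection and a composite feature might look like the sketch below in pandas. The table is hypothetical, and the simple average used for f(shot power, reaction time) is an arbitrary illustrative choice, not the function used in the case study.

```python
import pandas as pd

# Hypothetical player table mixing useful and irrelevant columns.
players = pd.DataFrame({
    "agility": [70, 82, 90],
    "reaction_time": [60, 80, 75],
    "shot_power": [85, 64, 72],
    "sprint_speed": [78, 88, 91],
    "hair_style": ["short", "long", "short"],
})

# Keep only the attributes chosen for grouping players.
selected = ["agility", "reaction_time", "shot_power", "sprint_speed"]
X = players[selected].copy()

# A composite feature f(shot power, reaction time); the plain average
# is an illustrative assumption.
X["power_reaction"] = (X["shot_power"] + X["reaction_time"]) / 2
print(X)
```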
Clustering in Python: sklearn.cluster
http://scikit-learn.org/stable/modules/clustering.html
K-Means clustering in Python
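from sklearn.cluster import KMeans
# X is the feature matrix (one row per player); random_state fixes the seed
# so the clustering is reproducible. The value 0 is an illustrative choice.
random_state = 0
Y = KMeans(n_clusters=3, random_state=random_state).fit_predict(X)
…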
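For a version that runs end to end, the sketch below applies K-Means to synthetic data standing in for the player feature matrix. The data, the scaling step, and the parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Synthetic stand-in for the player feature matrix: three well-separated
# groups of 20 points each in a 2-D feature space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Standardize features so no single attribute dominates the distances.
X_scaled = scale(X)

# n_init is set explicitly because its default differs across
# scikit-learn versions.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(np.bincount(labels))     # number of points assigned to each cluster
print(model.cluster_centers_)  # one row per cluster, in scaled units
```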
How to choose the right algorithm?
Image Source: http://4.bp.blogspot.com/-o0vLxYf6YZ4/UQVO9K2jxDI/AAAAAAAACt8/Z5w0bSgqkxw/s1600/machine_learning.png
Interpreting Clustering Results
• How many players per cluster?
• Too many in few clusters?
• Too few?
• Look at distribution of features in each cluster
• Investigate the values for each cluster
• If few clusters → plot for comparative analysis
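The first checks in this list can be sketched with pandas, assuming each player already has a cluster label. The labels and scaled attribute values below are hypothetical.

```python
import pandas as pd

# Hypothetical scaled attributes with a cluster label already assigned
# to each player (for example, by K-Means).
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "agility": [0.9, 1.1, -0.5, -0.4, -0.6, 0.0],
    "shot_power": [-1.0, -0.8, 0.7, 0.9, 0.8, 0.1],
})

# How many players per cluster? Very uneven counts are worth a closer look.
counts = df["cluster"].value_counts().sort_index()

# Distribution of features in each cluster: here, the per-cluster mean.
cluster_means = df.groupby("cluster").mean()

print(counts)
print(cluster_means)
```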
Presenting Data Science Outcomes
[Figure: Attribute Value (Scaled) vs. Player Attributes, by Cluster #]
Summary
ACQUIRE
PREPARE
ANALYZE
REPORT
ACT
INSIGHTS
• Better understanding and insights on
• player strengths
• enhancing performance
• critical attributes for a player’s performance
ACTIONS
• Coach can design programs that improve these areas in teams