Applied Machine Learning
CT046-3-M
Topic 1 Intro to Data Science &
Machine Learning
Outline
Why Data Science?
What is Data Science?
What are some prominent examples of
Data Science?
How to become a Data Scientist?
Who is hiring Data Scientists now?
CE52604-5-Object Oriented Methods
Module Introduction
Why Data Science?
The Dawn of Big Data
Google, Yahoo today
Web search and computational advertising
Google: 35,000 searches/sec
Yahoo! scale: 600 million users per month, 4 billion clicks per day, 25 terabytes of data collected every day
Netflix 2007
Movie recommendations, Netflix Prize
100 million ratings, 500,000 users, 18,000 movies
Amazon 2003
Product recommendations, reviews
29 million customers, millions of products
How Big is Your Data?
Kilobyte (1000 bytes)
Megabyte (1 000 000 bytes)
Gigabyte (1 000 000 000 bytes)
Terabyte (1 000 000 000 000 bytes)
Petabyte (1 000 000 000 000 000 bytes)
Exabyte (1 000 000 000 000 000 000 bytes)
Zettabyte (1 000 000 000 000 000 000 000 bytes)
Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
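As a rough illustration of the decimal units listed above, the following Python sketch expresses a raw byte count in the largest unit that fits. The function name `human_readable` and the sample value are assumptions made for this example; they do not come from the slides.

```python
# Decimal (SI) units as listed on the slide: each step is a factor of 1000.
UNITS = ["bytes", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    index = 0
    # Integer comparison avoids floating-point drift for very large counts.
    while index < len(UNITS) - 1 and num_bytes >= 1000 ** (index + 1):
        index += 1
    return f"{num_bytes / 1000 ** index:.1f} {UNITS[index]}"


# Example: 25 terabytes, the daily volume quoted for Yahoo! earlier in the deck.
print(human_readable(25 * 1000 ** 4))
```

The sketch uses powers of 1000 to match the slide; binary units (powers of 1024) would need a different base.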
5 Vs of Big Data
Raw Data: Volume
Change over time: Velocity
Data types: Variety
Data Quality: Veracity
Information for Decision Making: Value
Cloud Computing
The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer
-- Gartner IT Glossary
Cloud Computing is a new term for a long-held dream of computing as a utility
-- Above the Clouds, 2009
Cloud Computing = Cloud + SaaS
Cloud computing refers to both:
Cloud: the hardware and system software in the datacenters that provide those services.
Public Cloud (Utility Computing) vs. Private Cloud
SaaS: any cloud service where consumers can access software applications over the Internet (e.g., Facebook, Twitter).
Cloud Computing started around 2006
Big Data and Data Science (Big Data Analytics) started around 2011
Current Trends
Applications have bigger data and need more advanced analysis
Example: Web, corporate documents and emails
Natural Language Processing
Example: Social media
Network/Graph Analysis
IT infrastructure is moving to Cloud Computing
Data Science arises from this application pull and technology push
What is Data Science?
Data Science: A Definition
Data Science is the science that uses computer science, statistics and machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products.
Goal of Data Science
Turn data into data products.
Data -> Data Products
Transaction Databases -> Fraud Detection
Wireless Sensor Data -> Smart Home
Text Data, Social Media Data -> Product Review and Consumer Satisfaction
Software Log Data -> Automatic Troubleshooting
What are some prominent examples of Data Science?
Data Products: Google
Web Search
Google Ads
News Recommendation Engine
Google Maps
Currently one of the best, if not the best, IT companies to work for. (Google event on Jan 21/22)
Data Products: Netflix
Personalized Movie Ratings
Movie Recommendations
Similar Movies
Movie Categories (e.g., 80s movies with a strong female lead, Kung Fu movies)
Blockbuster is out of business
Data Products: LinkedIn/Facebook
People you may know
Applications you may like
Jobs/Events you might be interested in
Classifier for bad users and bad content
With high accuracy, Facebook can guess whether you are single or married
Data Products: Twitter
Text Analysis: Spam Filter/Similarity Search
User Sentiment/Satisfaction/Feedback
News Breakout
Trends and Topics
200 million users as of 2011, generating
Data Products: Splunk
Degradation and Failure Detection
Security Breach Identification
Event Monitoring
Troubleshooting Tools
Cross-platform Event Correlation
Splunk is an American multinational corporation headquartered
in San Francisco, California, which produces software for
searching, monitoring, and analyzing machine-generated big
data, via a web-style interface.
How to become a Data Scientist?
The Life of Data
[Figure: the life of data. Stages shown: Data Sources, Collect, Clean, Integrate, Analysis, Visualization, Interface, Users]
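The stages named on the Life of Data slide (collect, clean, integrate, analyze, visualize) can be sketched as a chain of small functions. This is a minimal illustration in plain Python, not a prescribed workflow: the two toy data sources, the record fields, and every function name are hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical raw data sources: two toy lists of sensor readings.
SOURCE_A = [{"sensor": "s1", "temp": "21.5"}, {"sensor": "s2", "temp": ""}]
SOURCE_B = [{"sensor": "s1", "temp": "22.5"}, {"sensor": "s3", "temp": "19.0"}]


def collect(*sources):
    """Collect: gather raw records from every source into one list."""
    return [record for source in sources for record in source]


def clean(records):
    """Clean: drop records with a missing reading and parse the rest as floats."""
    return [
        {"sensor": r["sensor"], "temp": float(r["temp"])}
        for r in records
        if r["temp"] != ""
    ]


def integrate(records):
    """Integrate: group readings from all sources by sensor id."""
    merged = {}
    for r in records:
        merged.setdefault(r["sensor"], []).append(r["temp"])
    return merged


def analyze(merged):
    """Analyze: compute the mean reading per sensor."""
    return {sensor: mean(temps) for sensor, temps in merged.items()}


def visualize(summary):
    """Visualize: render a plain-text report for the user."""
    return "\n".join(
        f"{sensor}: {value:.1f}" for sensor, value in sorted(summary.items())
    )


report = visualize(analyze(integrate(clean(collect(SOURCE_A, SOURCE_B)))))
print(report)
```

Each function takes the previous stage's output as its input, so a stage can be replaced (for example, a chart instead of a text report) without touching the others.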
Challenges in Data Science
Preparing Data (Noisy, Incomplete, Diverse, Streaming)
Analyzing Data (Scalable, Accurate, Real-time, Advanced Methods, Probabilities and Uncertainties ...)
Representing Analysis Results (i.e., the data product) (Story-telling, Interactive, Explainable)
Skill Set of a Data Scientist
Data Management
Data collection, storage, cleaning, filtering, integration
Large-scale Parallel Data Processing
Parallel computing
Statistics and Machine Learning
Data modeling, inference, prediction, pattern recognition
Interface and Data Visualization
HCI design, visualization, story-telling
Who is hiring Data Scientists now?
Sexy Job in the next 10 years
The sexy job in the next ten years will be statisticians. The ability to take data -- to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it -- that's going to be a hugely important skill.
-- Hal Varian, Google Chief Economist
Who's hiring Data Scientists?
IT companies: Google, Twitter, LexisNexis, Facebook
Media and financial sectors: Fox, CNN, NYT, Bloomberg
Research: Biology, Medicine, Physics, Psychology
Information offices in government and corporations
Law firms: e-discovery tools
Books on Data Science
Additional Reading Pointers
Data Science Summit (Strata) (http://www.datascientistsummit.com/)
Kaggle Competitions (http://www.kaggle.com/)
Data Science course at Berkeley & Coursera (http://datascienc.es/)
Summary
Why now: Dawn of Big Data, need for advanced analytics, and Cloud Computing
What is it: Data -> Data Product, with many examples incl. Google, Netflix, Splunk, LinkedIn
How to become: Data management, parallel computing and data processing, statistical machine learning, and visualization skills
Life of Data
Who is hiring: Data Scientists are in great demand, from industry to government to science.
Question & Answer Session
Q&A