EEL 6935 Data Analytics
LECTURE 1
Introduction to Data Science & Machine Learning
Jan. 9, 2018
Data Science
• Highly interdisciplinary
• No. 1 job on the market *
* McKinsey, Forbes, Harvard Business Review, Glassdoor, CareerCast
What is Data Science?
• Mathematics: Probability, Statistics, Optimization, Linear Algebra
• Computer Science: Machine Learning, Data Mining, Database, High-Performance Computing
• Engineering: Signal Processing, Pattern Recognition, Operations Research, Data Compression
• Library & Information Science: Information Retrieval, Information Management, Ontology, Knowledge Representation
[Diagram: Data Science]
• Data domains: Biological Sciences, Health Care, Physical Sciences, Social Sciences, Business, Finance, Internet, IoT, Sports, Cybersecurity
• Big Data: Volume, Variety, Velocity, Veracity
• Hardware, Software
What does a Data Scientist do?
• Understands the physical process (science) that generates the data
  • e.g., how a transmitted signal travels through the air (wireless communications), how people behave in the stock market (economics), how DNA is transcribed into RNA (genetics), how a planet moves along its orbit (astronomy)
• Models data using probability & statistics
• Develops algorithms that
  • learn from data
  • draw inferences about the data source (i.e., generalize the information contained in the data to the data source)
• Discovers patterns/regularities in data
What is Machine Learning?
• Through algorithms, discover patterns in data and use them to draw inferences about the data source, e.g.,
• Feature extraction: extract the meaningful part of each object/instance in the data
  • hand-designed for a specific application, or
  • learned from data in an unsupervised fashion
  • very important: an active research area and the hardest part of big data problems
• Supervised Learning: learn a model from training data with ground truth available, and use the learned model on new/test data
  • Classification: assign each object to a category, e.g., handwritten digit recognition, face recognition
  • Regression: estimate relationships between response and explanatory variables, e.g., prediction of travel times in traffic, estimation of class probabilities
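The supervised workflow above (learn from labeled training data, then apply the model to test data) can be sketched with a nearest-centroid classifier. This is only an illustration: the two-class Gaussian data, the 70/30 split, and the helper names `fit_centroids` and `predict` are assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (assumed for illustration): two 2-D Gaussian blobs.
n_per_class = 100
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n_per_class, 2)),
    rng.normal(loc=[2.0, 0.0], scale=1.0, size=(n_per_class, 2)),
])
y = np.repeat([0, 1], n_per_class)  # ground-truth labels

# Shuffle, then hold out 30% of the data as test data.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
n_train = int(0.7 * len(X))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]


def fit_centroids(X, y):
    """Learn one mean vector (centroid) per class from labeled data."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X, classes, centroids):
    """Assign each point to the class whose centroid is nearest."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]


classes, centroids = fit_centroids(X_train, y_train)
test_accuracy = np.mean(predict(X_test, classes, centroids) == y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

The model never sees the test labels during fitting, so the reported accuracy is an estimate of how it behaves on new data.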
• Unsupervised Learning: no ground truth in the training data
  • Clustering: group similar objects together
  • Density estimation: estimate the distribution of data within the space of possible values
• Semi-supervised Learning: labeled and unlabeled data used together in training
• Anomaly detection: detect instances that deviate significantly from standard patterns
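As one concrete instance of clustering, here is a minimal k-means sketch (Lloyd's algorithm). The lecture does not name a specific clustering method, so the choice of k-means, the synthetic blobs, and the function name `kmeans` are all assumptions made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Unlabeled synthetic data (assumed): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-5.0, -5.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(100, 2)),
])
labels, centers = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels))
```

No ground-truth labels are used anywhere; the grouping comes only from distances between points.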
• Objective: select a model that generalizes well to unseen data
[Figure: three panels]
• Poor fit & generalization: model too simple!
• Good fit & generalization: model good enough!
• Perfect fit, poor generalization: model too complex, fits noise! OVER-FITTING!
$$E_{\mathrm{RMS}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$$
$$y(x_n, \mathbf{w}) = \sum_{j=0}^{M} w_j x_n^j$$
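A minimal sketch of how model complexity affects fit and generalization, assuming synthetic data (noisy samples of sin(2πx), not taken from the lecture) and an ordinary least-squares fit. It fits polynomials of order M = 1, 3, 9 and reports the mean squared residual on the training points and on held-out points; all helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def design_matrix(x, M):
    """Columns x^0, x^1, ..., x^M, so that y(x, w) = design_matrix(x, M) @ w."""
    return np.vander(x, M + 1, increasing=True)


def fit_polynomial(x, t, M):
    """Least-squares estimate of the coefficients w of an order-M polynomial."""
    w, *_ = np.linalg.lstsq(design_matrix(x, M), t, rcond=None)
    return w


def mean_squared_error(x, t, w):
    """Mean of squared residuals {y(x_n, w) - t_n}^2 over the given points."""
    residuals = design_matrix(x, len(w) - 1) @ w - t
    return float(np.mean(residuals ** 2))


def noisy_sine(x):
    # Assumed data source: sin(2*pi*x) plus Gaussian noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)


x_train = np.linspace(0.0, 1.0, 10)        # N = 10 training inputs
t_train = noisy_sine(x_train)
x_test = rng.uniform(0.0, 1.0, size=200)   # unseen inputs
t_test = noisy_sine(x_test)

errors = {}
for M in (1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    errors[M] = (
        mean_squared_error(x_train, t_train, w),
        mean_squared_error(x_test, t_test, w),
    )
    print(f"M={M}: train error={errors[M][0]:.4g}, test error={errors[M][1]:.4g}")
```

With N = 10 points, the order-9 polynomial has enough coefficients to pass through every training point, so its training error is essentially zero. Whether a model generalizes has to be judged from the held-out error instead.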
• Regularization: avoid over-fitting by adding a penalty term to the error function that shrinks the coefficients (shrinkage)
$$E_{\mathrm{RMS}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
• Validation set: partition the available data into a training set and a validation set to optimize model complexity (M on the previous slide)
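The two ideas above can be combined in a short sketch: fit a regularized polynomial and pick the penalty weight on a validation set. It assumes the penalized objective (1/N) Σ {y(x_n, w) − t_n}² + (λ/2)‖w‖², whose minimizer solves (AᵀA + (Nλ/2) I) w = Aᵀt for design matrix A. The synthetic data, the candidate λ grid, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def design_matrix(x, M):
    """Columns x^0 ... x^M of the polynomial model."""
    return np.vander(x, M + 1, increasing=True)


def fit_ridge(x, t, M, lam):
    """Minimize (1/N) * sum (A w - t)^2 + (lam / 2) * ||w||^2 in closed form."""
    A = design_matrix(x, M)
    N = len(x)
    penalty = 0.5 * N * lam * np.eye(M + 1)
    return np.linalg.solve(A.T @ A + penalty, A.T @ t)


def mean_squared_error(x, t, w):
    return float(np.mean((design_matrix(x, len(w) - 1) @ w - t) ** 2))


# Assumed data source: noisy samples of sin(2*pi*x).
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

# Partition the available data into a training set and a validation set.
x_train, t_train = x[:20], t[:20]
x_val, t_val = x[20:], t[20:]

M = 9  # deliberately flexible model
lambdas = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
val_errors, w_norms = {}, []
for lam in lambdas:
    w = fit_ridge(x_train, t_train, M, lam)
    val_errors[lam] = mean_squared_error(x_val, t_val, w)
    w_norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:g}: ||w||={w_norms[-1]:.3g}, validation error={val_errors[lam]:.4g}")

best_lam = min(val_errors, key=val_errors.get)
print("selected lambda:", best_lam)
```

The printed norms illustrate shrinkage: a larger λ pulls the coefficient vector toward zero. The validation error, not the training error, decides which λ to keep.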
K-fold Cross Validation
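A minimal sketch of K-fold cross-validation, assuming it is used here to choose the polynomial order M: the data are split into K folds, each fold serves once as the validation set while the remaining folds are used for training, and the K validation errors are averaged. The synthetic data and the names `k_fold_splits` and `cross_val_error` are hypothetical.

```python
import numpy as np


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


def cross_val_error(x, t, M, k=5):
    """Average validation error of an order-M polynomial over k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_splits(len(x), k):
        A_train = np.vander(x[train_idx], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(A_train, t[train_idx], rcond=None)
        A_val = np.vander(x[val_idx], M + 1, increasing=True)
        fold_errors.append(np.mean((A_val @ w - t[val_idx]) ** 2))
    return float(np.mean(fold_errors))


# Assumed data source: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=40)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)

cv_errors = {M: cross_val_error(x, t, M) for M in range(10)}
best_M = min(cv_errors, key=cv_errors.get)
for M, err in cv_errors.items():
    print(f"M={M}: cross-validation error={err:.4g}")
print("selected M:", best_M)
```

Compared with a single training/validation split, every sample is used for both training and validation, which is helpful when data are scarce.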