This document provides an introduction to data science with Python. It discusses key concepts in data science including visualization, statistics, machine learning, deep learning, and big data. Various Python packages are introduced for working with data, including Jupyter, NumPy, SciPy, Matplotlib, Pandas, Scikit-learn and others. The document outlines the main steps in a data science analysis process, including defining assumptions, validating assumptions with data, and iterating. Specific techniques are covered like preprocessing, dimensionality reduction, statistical modeling, and machine learning modeling. The document emphasizes an iterative approach to learning through applying concepts to problems and data.
Introduces data science, its importance in extracting insights, and its components: visualization, statistics, machine learning, and big data.
Discusses the relationship between statistics and machine learning, various statistical concepts, and the relevance of machine learning and deep learning.
Introduces the concept of data as variables and dimensions, types of data, and confounding variables affecting analysis.
Sources for obtaining data include logs, datasets, Kaggle, and experiments.
Three essential steps for any analysis: define, validate, and assess assumptions, with an emphasis on rapid iteration.
The importance of visualization in data analysis and practical techniques for plotting and making data interpretable.
The necessity of preprocessing data to meet model requirements, including standardization and removing outliers to ensure data quality.
Discusses methods to reduce variables for effective analysis, such as PCA and feature extraction.
Introduces different statistical modeling approaches including hypothesis testing, regression techniques, and model evaluation methods.
Describes machine learning models for classification, regression, and clustering. Highlights evaluation metrics and methods.
The importance of lifelong learning in statistics, machine learning, and big data, alongside a recap of learning principles.
Data Science
➤ = Extract knowledge or insights from data.
➤ Data science includes:
➤ Visualization
➤ Statistics
➤ Machine learning
➤ Deep learning
➤ Big data
➤ And related methods
➤ ≈ Data mining
We will introduce.
➤ The roadmaps are kind of outdated, but still contain a lot of keywords:
➤ MrMimic/data-scientist-roadmap – GitHub
➤ Becoming a Data Scientist – Curriculum via Metromap
Statistics vs. Machine Learning
➤ Machine learning = statistics - checking of assumptions 😆
➤ But it does resolve more problems.
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
Machine Learning vs. Deep Learning
➤ Deep learning is the most renowned part of machine learning.
➤ A.k.a. “the AI”.
➤ Deep learning uses artificial neural networks (NNs).
➤ Which are especially good at:
➤ Computer vision (CV) 👀
➤ Natural language processing (NLP) 📖
➤ Machine translation
➤ Speech recognition
➤ But too costly for simple problems.
Big Data
➤ The “size” is constantly moving.
➤ As of 2012, it ranges from 10n TB to n PB, i.e., a 100× span.
➤ Has high “3Vs”:
➤ Volume: the amount of data.
➤ Velocity: the speed of data in and out.
➤ Variety: the range of data types and sources.
➤ A practical definition:
➤ Data that a single computer can't process in a reasonable time.
➤ Distributed computing is a big deal.
Today,
➤ “Models” are the math models.
➤ “Statistical models” emphasize inferences.
➤ “Machine learning models” emphasize predictions.
➤ “Deep learning” and “big data” are gigantic subfields.
➤ We won't introduce them here.
➤ But the learning resources are listed at the end.
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours on teaching Python.
➤ Owns the Python packages: ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
Common Jupyter Notebook Shortcuts
➤ Esc: edit mode → command mode.
➤ Ctrl-Enter: run the cell.
➤ B: insert a cell below.
➤ D, D: delete the current cell.
➤ M: to a Markdown cell.
➤ Cmd-/: comment the code.
➤ H: show the keyboard shortcuts.
➤ P: open the command palette.
Checkpoint: The Packages
➤ Open 00_preface_the_packages.ipynb up.
➤ Run it.
➤ The notebooks are available on https://github.com/moskytw/data-science-with-python.
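➤ A minimal sketch of what running the preface notebook amounts to: checking that the stack imports. The actual notebook contents may differ.

    # Not from the notebook itself: a quick sanity check that the
    # packages import and report their versions.
    import numpy, scipy, matplotlib, pandas, sklearn, statsmodels

    for m in (numpy, scipy, matplotlib, pandas, sklearn, statsmodels):
        print(m.__name__, m.__version__)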
Data in Different Types
➤ Discrete:
➤ Nominal: {male, female}
➤ Ordinal / Ranked: ↑ & can be ordered. {great > good > fair}
➤ Continuous:
➤ Interval: ↑ & distance is meaningful. E.g., temperatures.
➤ Ratio: ↑ & 0 is meaningful. E.g., weights.
Data in the X-Y Form

y | x
dependent variable | independent variable
response variable | explanatory variable
regressand | regressor
endogenous variable, endog | exogenous variable, exog
outcome | design
label | feature
➤ Confounding variables:
➤ May affect y, but are not x.
➤ May lead to erroneous conclusions: “garbage in, garbage out”.
➤ Controlling, e.g., fix the environment.
➤ Randomizing, e.g., choose by computer.
➤ Matching, e.g., order by gender and then assign groups.
➤ Statistical control, e.g., use BMI to remove the height effect.
➤ Double-blind, even triple-blind trials.
Get the Data
➤ Logs
➤ Existing datasets
➤ The Datasets Package – StatsModels
➤ Kaggle
➤ Experiments
The Three Steps
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
1. Define Assumption
➤ Specify a feasible objective.
➤ Not: “Use AI to get the moon!”
➤ Write a formal assumption.
➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!”
➤ Note the dangerous gaps.
➤ “All the items from the recommendation are free!”
➤ “Correlation does not imply causation.”
➤ Consider the next actions.
➤ “Release to 100% of users.” rather than “So great!”
2. Validate Assumption
➤ Collect potential data.
➤ List possible methods.
➤ A plot, a median, or even a mean may be good enough.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Evaluate the metrics of the methods with the data.
3. Validated Assumption?
➤ Yes → Congrats! Report fully and take the actions! 🎉
➤ No → Check:
➤ The hypotheses of the methods.
➤ The confounding variables in the data.
➤ The formality of the assumption.
➤ The feasibility of the objective.
Iterate Fast While the Industry Changes Rapidly
➤ Resolve the small problems first.
➤ Resolve the problems with a high impact/effort ratio first.
➤ One week to get a quick result and improve, rather than one year to get the may-be-the-best result.
➤ Fail fast!
Checkpoint: Pick up a Method
➤ Think of an interesting problem.
➤ E.g., the revenue is higher, but is it just random?
➤ Pick one method from the cheat sheets:
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Remember the three analysis steps.
Visualization
➤ Make Data Colorful – Plotting
➤ 01_1_visualization_plotting.ipynb
➤ In a Statistical Way – Descriptive Statistics
➤ 01_2_visualization_descriptive_statistics.ipynb
Checkpoint: Plot the Variables
➤ Star98
➤ star98_df = sm.datasets.star98.load_pandas().data
➤ Fair
➤ fair_df = sm.datasets.fair.load_pandas().data
➤ Howell1
➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')
➤ Or your own datasets.
➤ Plot the variables that interest you.
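➤ A minimal plotting sketch, assuming dataset_howell1.csv sits next to the notebook; 'height' and 'weight' are actual Howell1 columns.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the Howell1 dataset (semicolon-separated, as above).
    howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')

    # A scatter plot is often the quickest way to see a relationship.
    howell1_df.plot.scatter(x='height', y='weight')
    plt.show()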
Feed the Data That Models Like
➤ Preprocess data for:
➤ Hard requirements, e.g.,
➤ corpus → vectors (see the sketch below),
➤ as in “What kind of news will be voted down on PTT?”
➤ Soft requirements (hypotheses), e.g.,
➤ t-test: better when the samples are normally distributed.
➤ SVM: better when the features range from -1 to 1.
➤ More representative features, e.g., total price / units.
➤ Note that different models have different tastes.
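➤ A hedged sketch of the corpus → vectors requirement with scikit-learn's CountVectorizer; the toy corpus is made up.

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['the cat sat', 'the dog sat', 'the cat barked']

    # Each document becomes a row of word counts.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
    print(X.toarray())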
Preprocessing
➤ The Dishes – Containers
➤ 02_1_preprocessing_containers.ipynb
➤ A Cooking Method – Standardization
➤ 02_2_preprocessing_standardization.ipynb
➤ Watch Out for Poisonous Data Points – Removing Outliers
➤ 02_3_preprocessing_removing_outliers.ipynb
Checkpoint: Preprocess the Variables
➤ Try to standardize and compare.
➤ Try to trim the outliers.
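➤ A minimal sketch of both checkpoint tasks, assuming the Howell1 'height' column: standardize, then trim points beyond ±3 standard deviations.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')

    # Standardize: zero mean, unit variance.
    howell1_df['height_z'] = StandardScaler().fit_transform(
        howell1_df[['height']]).ravel()

    # Trim outliers: keep only |z| <= 3.
    trimmed_df = howell1_df[howell1_df['height_z'].abs() <= 3]
    print(len(howell1_df), '->', len(trimmed_df))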
The Model Sicks Up!
➤ Let's reduce the variables.
➤ Feed a subset → feature selection.
➤ Feature selection using SelectFromModel – Scikit-Learn
➤ Feed a transformation → feature extraction.
➤ PCA, FA, etc.
➤ Another definition: non-numbers → numbers.
Checkpoint: Reduce the Variables
➤ Try PCA(all variables) → the better components, or FA.
➤ Then plot the n-dimensional data onto a 2-dimensional plane.
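➤ A minimal sketch, assuming the Star98 data loaded earlier: standardize, project onto the first two principal components, and plot.

    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    star98_df = sm.datasets.star98.load_pandas().data

    # PCA is scale-sensitive, so standardize first.
    X = StandardScaler().fit_transform(star98_df)
    components = PCA(n_components=2).fit_transform(X)

    # n-dimensional data onto a 2-dimensional plane.
    plt.scatter(components[:, 0], components[:, 1], s=10)
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.show()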
More Regression Models
➤ If y is not linear:
➤ Logit or Poisson Regression | Generalized Linear Models, GLMs
➤ If y is correlated:
➤ Linear Mixed Models, LMMs | Generalized Estimating Equations, GEE
➤ If x has multicollinearity:
➤ Lasso or Ridge Regression
➤ If the error term is heteroscedastic:
➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS
➤ If x is a time series (predict x_t from x_{t-1}, not y from x):
➤ Autoregressive Integrated Moving Average, ARIMA
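➤ A hedged sketch of one model from the list above: a logit GLM on the Fair dataset loaded earlier; the chosen predictors are illustrative.

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    fair_df = sm.datasets.fair.load_pandas().data

    # Binarize the outcome: any affair vs. none.
    fair_df['affair'] = (fair_df['affairs'] > 0).astype(int)

    # A binomial-family GLM is logistic (logit) regression.
    model = smf.glm('affair ~ age + yrs_married + children',
                    data=fair_df, family=sm.families.Binomial())
    print(model.fit().summary())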
Checkpoint: Apply a Statistical Method
➤ Try to apply the analysis steps with a statistical method:
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
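➤ A minimal sketch of the three steps with Welch's t-test; the revenue numbers are simulated, and the 5% level is an assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # 1. Define assumption: "the new page raises revenue per user."
    revenue_a = rng.normal(100, 10, size=200)  # control
    revenue_b = rng.normal(103, 10, size=200)  # treatment

    # 2. Validate assumption: Welch's t-test, no equal-variance hypothesis.
    t, p = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)

    # 3. Validated assumption? Report and act, or revisit the hypotheses.
    print(f't = {t:.2f}, p = {p:.4f}, significant at 5%: {p < 0.05}')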
Machine Learning Models
➤ Apple or Orange? – Classification
➤ 05_1_machine_learning_models_classification.ipynb
➤ Without Labels – Clustering
➤ 05_2_machine_learning_models_clustering.ipynb
➤ Predict the Values – Regression
➤ Who Are the Best? – Model Selection
➤ sklearn.model_selection.GridSearchCV
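➤ A minimal model-selection sketch with GridSearchCV, on scikit-learn's toy iris dataset; the parameter grid is illustrative.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Exhaustively search the grid with 5-fold cross-validation.
    search = GridSearchCV(SVC(),
                          param_grid={'C': [0.1, 1, 10],
                                      'kernel': ['linear', 'rbf']},
                          cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)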
Confusion matrix, where A = C[0, 0]:

             | predicted - | predicted +
    actual - | A: true -   | B: false +
    actual + | C: false -  | D: true +
Common “Rates” in the Confusion Matrix
➤ precision = D / (B + D)
➤ recall = D / (C + D)
➤ sensitivity = D / (C + D) = recall = observed power
➤ specificity = A / (A + B) = observed confidence level
➤ false positive rate = B / (A + B) = observed α
➤ false negative rate = C / (C + D) = observed β
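➤ A minimal sketch of the table and rates above with scikit-learn; y_true and y_pred are made-up labels.

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

    # Rows are actual, columns are predicted: [[A, B], [C, D]].
    (a, b), (c, d) = confusion_matrix(y_true, y_pred)

    print('precision           =', d / (b + d))
    print('recall/sensitivity  =', d / (c + d))
    print('specificity         =', a / (a + b))
    print('false positive rate =', b / (a + b))
    print('false negative rate =', c / (c + d))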
Ensemble Models
➤ Bagging
➤ N independent models; average their outputs.
➤ E.g., the random forest models.
➤ Boosting
➤ N sequential models; the n-th model learns from the (n-1)-th's errors.
➤ E.g., gradient tree boosting.
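➤ A minimal sketch comparing a bagging-style and a boosting-style model on the toy iris dataset; hyperparameters are left at scikit-learn defaults.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    for model in (RandomForestClassifier(),       # bagging: independent trees
                  GradientBoostingClassifier()):  # boosting: sequential trees
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))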
Checkpoint: Apply a Machine Learning Method
➤ Try to apply the analysis steps with a machine learning method:
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Keep Learning
➤ Statistics
➤ Seeing Theory
➤ Biological Statistics
➤ scipy.stats + StatsModels
➤ Research Methods
➤ Machine Learning
➤ Scikit-learn Tutorials
➤ Stanford CS229
➤ Hsuan-Tien Lin
➤ Deep Learning
➤ TensorFlow | PyTorch
➤ Stanford CS231n
➤ Stanford CS224n
➤ Big Data
➤ Dask
➤ Hive
➤ Spark
➤ HBase
➤ AWS
The Facts
➤ ∵
➤ You can't learn everything in data science!
➤ ∴
➤ “Let's learn to do” ❌
➤ “Let's do to learn” ✅
The Learning Flow
1. Ask a question.
➤ “How to tell the differences confidently?”
2. Explore the references.
➤ “T-test, ANOVA, ...”
3. Digest into an answer.
➤ Explore in a breadth-first way.
➤ Write the code.
➤ Make it work, make it right, and finally make it fast.
Recap
➤ Let's do to learn, not learn to do.
➤ What is your objective?
➤ For the objective, what is your assumption?
➤ For the assumption, what method may validate it?
➤ For the method, how will you evaluate it with data?
➤ Q & A