Advanced Scikit Learn

This document summarizes key concepts in advanced machine learning using scikit-learn including supervised and unsupervised learning techniques, model evaluation, pipelines for feature extraction and selection, grid search for hyperparameter optimization, and randomized parameter search. It discusses classification, regression, clustering, dimensionality reduction, and evaluation metrics.


Advanced Scikit-Learn

Andreas Mueller (NYU Center for Data Science, scikit-learn)


Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
…

Overview
● Reminder: Basic sklearn concepts
● Model building and evaluation:
– Pipelines and Feature Unions
– Randomized Parameter Search
– Scoring Interface
● Out of Core learning
– Feature Hashing
– Kernel Approximation
● New stuff in 0.16.0
– Overview
– Calibration

Supervised Machine Learning
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)      # training data + training labels -> model
y_pred = clf.predict(X_test)   # test data -> prediction
clf.score(X_test, y_test)      # test data + test labels -> evaluation

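The fit / predict / score sequence above can be tried end to end on a small dataset. This is a minimal sketch, assuming a recent scikit-learn release (where `train_test_split` lives in `sklearn.model_selection`); the iris data, `n_estimators=50`, and the fixed `random_state` are illustrative choices, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: the slides do not say which dataset X, y come from.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)             # learn from training data and labels
y_pred = clf.predict(X_test)          # predict labels for unseen data
accuracy = clf.score(X_test, y_test)  # mean accuracy on the test set
print(y_pred[:5], accuracy)
```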
Unsupervised Transformations
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(X_train)                 # training data -> model
X_new = pca.transform(X_test)    # test data -> transformation

Basic API
estimator.fit(X, [y])

estimator.predict      estimator.transform
Classification         Preprocessing
Regression             Dimensionality reduction
Clustering             Feature selection
                       Feature extraction

Cross-Validation
from sklearn.cross_validation import (cross_val_score, ShuffleSplit,
                                      LeaveOneLabelOut)
from sklearn.svm import SVC

scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92 1. 1. 1. 1. ]

# 10 random splits, each holding out 30% of the samples
cv_ss = ShuffleSplit(len(X), test_size=.3, n_iter=10)
scores_shuffle_split = cross_val_score(SVC(), X, y, cv=cv_ss)

# hold out all samples sharing one label value at a time
cv_labels = LeaveOneLabelOut(labels)
scores_pout = cross_val_score(SVC(), X, y, cv=cv_labels)

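The slides use the 0.16 module `sklearn.cross_validation`. As a hedged sketch of the same three strategies on a recent scikit-learn release, the splitters live in `sklearn.model_selection`, `ShuffleSplit` takes `n_splits` instead of a sample count and `n_iter`, and `LeaveOneGroupOut` plays the role of `LeaveOneLabelOut`. The iris data and the `groups` array are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Plain 5-fold cross-validation (stratified for classifiers).
scores = cross_val_score(SVC(), X, y, cv=5)

# 10 random train/test splits, each holding out 30% of the samples.
cv_ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores_shuffle_split = cross_val_score(SVC(), X, y, cv=cv_ss)

# Hypothetical grouping: three arbitrary groups, one left out per split.
groups = np.arange(len(X)) % 3
scores_groups = cross_val_score(SVC(), X, y, groups=groups, cv=LeaveOneGroupOut())

print(scores, scores_shuffle_split.mean(), scores_groups)
```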
Cross-Validated Grid Search
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y)

param_grid = {'C': 10. ** np.arange(-3, 3),
              'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
grid.predict(X_test)
grid.score(X_test, y_test)

[Diagram, built up over several slides: training data and training labels feed a model; feature extraction, scaling, and feature selection are then inserted between the data and the model; the last builds add a cross-validation label.]

Pipelines
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)

Combining Pipelines and Grid Search
Proper cross-validation

# make_pipeline names each step after its lowercased class name, hence 'svc__'
param_grid = {'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}

scaler_pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(scaler_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

Combining Pipelines and Grid Search II
Searching over parameters of the preprocessing step

from sklearn.feature_selection import SelectKBest

param_grid = {'selectkbest__k': [1, 2, 3, 4],
              'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}

select_pipe = make_pipeline(SelectKBest(), SVC())
grid = GridSearchCV(select_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

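The `step__parameter` naming used in the two grids above can be checked directly. The following is a minimal sketch, assuming a recent scikit-learn release (`GridSearchCV` imported from `sklearn.model_selection`) and the iris data as a stand-in for the unspecified `X`, `y`. Because the scaler sits inside the pipeline, it is refit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # illustrative data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
step_names = [name for name, _ in pipe.steps]  # auto-generated step names

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": 10.0 ** np.arange(-3, 3),
              "svc__gamma": 10.0 ** np.arange(-3, 3)}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(step_names, grid.best_params_, grid.score(X_test, y_test))
```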
Feature Union
[Diagram: the training data passes through Feature Extraction I and Feature Extraction II in parallel; their outputs, together with the training labels, feed the model.]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

char_and_word = make_union(CountVectorizer(analyzer="char"),
                           CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))

param_grid = {'linearsvc__C': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(text_pipe, param_grid=param_grid, cv=5)

# also search over the n-gram ranges of both vectorizers
param_grid2 = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
               'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
               'linearsvc__C': 10. ** np.arange(-3, 3)}

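The long keys in `param_grid2` come from the automatic naming in `make_union` and `make_pipeline`: two steps of the same class get the suffixes `-1` and `-2`. A small sketch that lists the matching parameter names via `get_params()`, so they do not have to be typed from memory:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

char_and_word = make_union(CountVectorizer(analyzer="char"),
                           CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))

# Collect every nested parameter name that controls an n-gram range.
ngram_keys = sorted(k for k in text_pipe.get_params()
                    if k.endswith("__ngram_range"))
print(ngram_keys)
```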
Randomized Parameter Search
[Figure omitted. Source: Bergstra and Bengio]
● Step-size free for continuous parameters
● Decouples runtime from search-space size
● Robust against irrelevant parameters

Randomized Parameter Search
from scipy.stats import expon
from sklearn.grid_search import RandomizedSearchCV

params = {'featureunion__countvectorizer-1__ngram_range':
              [(1, 3), (1, 5), (2, 5)],
          'featureunion__countvectorizer-2__ngram_range':
              [(1, 1), (1, 2), (2, 2)],
          'linearsvc__C': expon()}   # sample C from a continuous distribution

rs = RandomizedSearchCV(text_pipe, param_distributions=params, n_iter=50)

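A runnable variant of the randomized search, sketched on numeric data so that it needs no text corpus. It assumes a recent scikit-learn release (`RandomizedSearchCV` in `sklearn.model_selection`); the scaler plus SVC pipeline, the `scale` values of the exponential distributions, and `n_iter=20` are illustrative assumptions. The point is that `n_iter` fixes the number of candidates regardless of how large the search space is.

```python
from scipy.stats import expon
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # illustrative data only
pipe = make_pipeline(StandardScaler(), SVC())

# Continuous parameters are sampled from distributions, not from a fixed grid.
param_distributions = {"svc__C": expon(scale=10),
                       "svc__gamma": expon(scale=0.1)}

rs = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                        n_iter=20, cv=5, random_state=0)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)
```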
Randomized Parameter Search
● Always use distributions for continuous variables.
● Don't use for low-dimensional spaces.
● Future: Bayesian optimization based search.

Generalized Cross-Validation and Path Algorithms

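from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# option 1: grid-search the number of features that RFE keeps
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
gridsearch.fit(X, y)

# option 2: RFECV cross-validates the number of features itself
rfecv = RFECV(LogisticRegression())
rfecv.fit(X, y)
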
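A short sketch of the `RFECV` route on synthetic data, to show what it exposes after fitting. The dataset shape, `max_iter=1000`, and `cv=3` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3)
rfecv.fit(X, y)

# n_features_ is the cross-validated choice; support_ marks the kept columns.
print(rfecv.n_features_, rfecv.support_)
```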
Linear Models                Feature Selection    Tree-Based models [possible]
LogisticRegressionCV [new]   RFECV                [DecisionTreeCV]
RidgeCV                                           [RandomForestClassifierCV]
RidgeClassifierCV                                 [GradientBoostingClassifierCV]
LarsCV
ElasticNetCV
...

Scoring Functions
GridSearchCV
RandomizedSearchCV
cross_val_score
...CV

Default:
Accuracy (classification)
R2 (regression)

Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])

# a baseline that always predicts the majority class reaches the same accuracy
cross_val_score(DummyClassifier(strategy="most_frequent"), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])

cross_val_score(SVC(), X_train, y_train, scoring="roc_auc")
>>> array([ 0.99961591, 0.99983498, 0.99966247])

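The effect can be reproduced on synthetic data. This is a sketch under stated assumptions: a recent scikit-learn release, and a made-up dataset with roughly a 90/10 class split from `make_classification`. The majority-class baseline gets high accuracy, but its ROC AUC sits at chance level because it ranks all samples identically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced problem: about 90% of samples in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

dummy = DummyClassifier(strategy="most_frequent")
dummy_acc = cross_val_score(dummy, X, y, cv=3)                     # looks good
dummy_auc = cross_val_score(dummy, X, y, cv=3, scoring="roc_auc")  # chance level

print(dummy_acc, dummy_auc)
```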
Available metrics
from sklearn.metrics.scorer import SCORERS
print(SCORERS.keys())
>> ['adjusted_rand_score',
    'f1',
    'mean_absolute_error',
    'r2',
    'recall',
    'median_absolute_error',
    'precision',
    'log_loss',
    'mean_squared_error',
    'roc_auc',
    'average_precision',
    'accuracy']

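Defining your own scoring
from sklearn.metrics.scorer import accuracy_scorer

# a scorer is any callable with the signature (estimator, X, y)
def my_super_scoring(est, X, y):
    return accuracy_scorer(est, X, y) - np.sum(est.coef_ != 0)
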
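A custom scorer like the one above can be passed straight to `scoring=`. The sketch below assumes a recent scikit-learn release, where the built-in accuracy scorer is obtained with `get_scorer("accuracy")`. The function name `sparse_accuracy`, the `penalty` weight of 0.01, the synthetic data, and the L1-penalized `LinearSVC` are all hypothetical choices; the slide's version subtracts the raw count of non-zero coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

accuracy_scorer = get_scorer("accuracy")

def sparse_accuracy(est, X, y, penalty=0.01):
    """Accuracy minus a small cost per non-zero coefficient (hypothetical weight)."""
    return accuracy_scorer(est, X, y) - penalty * np.sum(est.coef_ != 0)

X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           random_state=0)
# The L1 penalty drives some coefficients to exactly zero.
model = LinearSVC(penalty="l1", dual=False, C=0.1)
scores = cross_val_score(model, X, y, cv=3, scoring=sparse_accuracy)
print(scores)
```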
Out of Core Learning
Or: save ourselves the effort

Think twice!
● Old laptop: 4 GB RAM
● 1,073,741,824 float32 values
● Or 1 million data points with 1000 features
● EC2: 256 GB RAM
● 68,719,476,736 float32 values
● Or 68 million data points with 1000 features

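The numbers above follow from 4 bytes per float32 value, with a gigabyte counted as 2**30 bytes. A small worked check; the helper names `float32_capacity` and `max_rows` are made up for this sketch, and it ignores any memory needed by the operating system or the model itself.

```python
BYTES_PER_FLOAT32 = 4

def float32_capacity(ram_gib):
    """Number of float32 values that fit into ram_gib * 2**30 bytes."""
    return ram_gib * 2**30 // BYTES_PER_FLOAT32

def max_rows(ram_gib, n_features):
    """Number of complete rows of n_features float32 values that fit."""
    return float32_capacity(ram_gib) // n_features

laptop = float32_capacity(4)   # 1073741824
ec2 = float32_capacity(256)    # 68719476736
print(laptop, max_rows(4, 1000))
print(ec2, max_rows(256, 1000))
```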
Supported Algorithms
● All SGDClassifier derivatives
● Naive Bayes
● MiniBatchKMeans
● IncrementalPCA
● MiniBatchDictionaryLearning

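Out of Core Learning
import pickle
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()

for i in range(9):
    with open("batch_%02d" % i, "rb") as f:
        X_batch, y_batch = pickle.load(f)
    # classes is required on the first call: a batch may not contain every class
    sgd.partial_fit(X_batch, y_batch, classes=range(10))

Possibly go over the data multiple times.
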
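The batch loop can be tried without any files on disk. In this sketch the nine pickled batches are replaced by nine slices of a synthetic in-memory dataset, which is an assumption made purely so the example is self-contained; in a real out-of-core setting each batch would be loaded from storage as on the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Stand-in for data on disk: one synthetic dataset cut into 9 batches.
X, y = make_classification(n_samples=900, n_features=20, random_state=0)
classes = np.unique(y)  # the full set of labels, known up front

sgd = SGDClassifier(random_state=0)
for X_batch, y_batch in zip(np.array_split(X, 9), np.array_split(y, 9)):
    # Each call updates the existing model instead of refitting from scratch.
    sgd.partial_fit(X_batch, y_batch, classes=classes)

train_accuracy = sgd.score(X, y)
print(train_accuracy)
```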
Stateless Transformers
● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approx)

56
Text data and the hashing trick

Bag Of Word Representations
CountVectorizer / TfidfVectorizer

“You better call Kenny Loggins”
    ↓ tokenizer
['you', 'better', 'call', 'kenny', 'loggins']
    ↓ sparse matrix encoding
 aardvark      better    call         you      zyxst
[0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]

Hashing Trick
HashingVectorizer

“You better call Kenny Loggins”
    ↓ tokenizer
['you', 'better', 'call', 'kenny', 'loggins']
    ↓ hashing
[hash('you'), hash('better'), hash('call'), hash('kenny'), hash('loggins')]
= [832412, 223788, 366226, 81185, 835749]
    ↓ sparse matrix encoding
[0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]

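The idea on this slide can be sketched in a few lines of plain Python: hash each token to a column index, so no vocabulary has to be stored. This is an illustration of the principle only, not how `HashingVectorizer` is implemented; `zlib.crc32` is a stand-in hash chosen because it is deterministic across runs, and `hash_tokens` and `n_features` are names made up for this sketch.

```python
import zlib

def hash_tokens(text, n_features=2**20):
    """Map a text to {column index: count} without storing a vocabulary."""
    counts = {}
    for token in text.lower().split():
        # The hash value modulo n_features is the column of this token.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] = counts.get(index, 0) + 1
    return counts

row = hash_tokens("You better call Kenny Loggins")
print(row)
```

Two different tokens can land in the same column (a collision); a large `n_features` keeps that rare.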
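Out of Core Text Classification
import pickle
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
hashing_vectorizer = HashingVectorizer()

for i in range(9):
    with open("text_%02d" % i, "rb") as f:
        text_batch, y_batch = pickle.load(f)
    # HashingVectorizer is stateless, so transform needs no prior fit
    X_batch = hashing_vectorizer.transform(text_batch)
    sgd.partial_fit(X_batch, y_batch, classes=range(10))
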
Kernel Approximations

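Reminder: Kernel Trick
Classifier linear → need only inner products $\langle x, x' \rangle$

Linear: $k(x, x') = x^\top x'$
Polynomial: $k(x, x') = (\gamma\, x^\top x' + r)^d$
RBF: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$
Sigmoid: $k(x, x') = \tanh(\gamma\, x^\top x' + r)$
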
Complexity
● Solving kernelized SVM:
~O(n_samples ** 3)
● Solving linear (primal) SVM:
~O(n_samples * n_features)

n_samples large? Go primal!

68
Undoing the Kernel Trick
● Kernel approximation: $\hat{\phi}(x)^\top \hat{\phi}(x') \approx k(x, x')$
● $k$ = RBF: $\hat{\phi}$ = RBFSampler

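Usage
import pickle
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
kernel_approximation = RBFSampler(gamma=.001, n_components=400)

for i in range(9):
    with open("batch_%02d" % i, "rb") as f:
        X_batch, y_batch = pickle.load(f)
    if i == 0:
        # fit only needs the number of features, so the first batch is enough
        kernel_approximation.fit(X_batch)
    X_transformed = kernel_approximation.transform(X_batch)
    sgd.partial_fit(X_transformed, y_batch, classes=range(10))
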
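How well the explicit feature map stands in for the kernel can be checked numerically: the inner products of the transformed samples should be close to the exact RBF kernel matrix. A sketch on random data; the sample count, `gamma=0.1`, and `n_components=2000` are illustrative assumptions, and the error shrinks as `n_components` grows.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
gamma = 0.1

sampler = RBFSampler(gamma=gamma, n_components=2000, random_state=0)
Z = sampler.fit_transform(X)         # explicit (approximate) feature map

approx = Z @ Z.T                     # linear kernel in the new feature space
exact = rbf_kernel(X, gamma=gamma)   # exact RBF kernel matrix
max_error = np.abs(approx - exact).max()
print(max_error)
```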
Highlights from 0.16.0
● Multinomial Logistic Regression, LogisticRegressionCV.
● IncrementalPCA.
● Probability calibration of classifiers.
● Birch clustering.
● LSHForest.
● More robust integration with pandas.

72
Probability Calibration
SVC().decision_function()
→ CalibratedClassifierCV(SVC()).predict_proba()

RandomForestClassifier().predict_proba()
→ CalibratedClassifierCV(RandomForestClassifier()).predict_proba()
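
The first arrow above can be sketched as follows: `SVC` by default offers only `decision_function`, and wrapping it in `CalibratedClassifierCV` yields `predict_proba`. The synthetic data and `cv=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The wrapper fits SVC on internal folds and learns a mapping from
# decision_function values to probabilities on the held-out parts.
calibrated = CalibratedClassifierCV(SVC(), cv=3)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
print(proba[:3])
```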

73
74
CDS is hiring Research Engineers

75
Thank you for your attention.

@t3kcit

@amueller

t3kcit@gmail.com

76
Bias Variance Tradeoff
(why we do cross validation and grid searches)

77
Overfitting and Underfitting
[Figure: accuracy plotted against model complexity, with a training curve and a generalization curve. Low complexity is labeled underfitting, high complexity overfitting, and the sweet spot lies in between.]

Linear SVM
(RBF) Kernel SVM
Decision Trees
Random Forests

Know where you are on the bias-variance tradeoff

96
Validation Curves
from sklearn.learning_curve import validation_curve, learning_curve

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range)

Learning Curves
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, train_sizes=train_sizes)
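
A runnable sketch of the validation curve, assuming a recent scikit-learn release where `validation_curve` lives in `sklearn.model_selection`. The iris data and the `gamma` range are illustrative. Each row of the result corresponds to one parameter value and each column to one cross-validation fold, so the gap between the row means shows where the model moves from underfitting to overfitting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # illustrative data only
param_range = np.logspace(-3, 2, 6)  # gamma from 0.001 to 100

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)

# Average over folds: one training and one test score per gamma value.
for gamma, tr, te in zip(param_range, train_scores.mean(axis=1),
                         test_scores.mean(axis=1)):
    print(gamma, round(tr, 3), round(te, 3))
```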
