Scikit-Learn
Cheat Sheet
Simple tools for data mining, data
analysis, and machine learning
by
Numan
Scikit-Learn is a Python
library that provides
simple and efficient
tools for data mining,
data analysis, and
machine learning.
Data Preprocessing
sklearn.preprocessing.StandardScaler()
Standardizes features by removing the mean and
scaling to unit variance.
sklearn.preprocessing.MinMaxScaler()
Scales features to a given range, typically [0, 1].
sklearn.preprocessing.OneHotEncoder()
Converts categorical values into one-hot encoded
binary vectors.
sklearn.preprocessing.LabelEncoder()
Encodes labels with values between zero and the
number of classes minus one.
sklearn.impute.SimpleImputer()
Handles missing values by replacing them with specified
values (e.g., mean, median).
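A minimal usage sketch of these transformers (the toy array below is made up for illustration; fit on training data, then transform):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])  # toy data with a missing value

X_imputed = SimpleImputer(strategy='mean').fit_transform(X)   # fill missing values with column means
X_scaled = StandardScaler().fit_transform(X_imputed)          # zero mean, unit variance per feature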
Train-Test Split
Splits arrays or matrices into random train and
test subsets.
sklearn.model_selection.train_test_split(
    data,
    test_size=0.2,
    shuffle=True,
    random_state=42,
)
Don’t forget to specify the random state, so that
the results are reproducible!
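In the common case of separate features and labels, the split returns four arrays. A short sketch (the iris dataset here is just a stand-in):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # toy dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)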
Model Training
sklearn.linear_model.LinearRegression()
Fits a linear model with coefficients to minimize the
residual sum of squares.
sklearn.linear_model.LogisticRegression()
Applies logistic regression for binary or multiclass
classification tasks.
sklearn.tree.DecisionTreeClassifier()
A decision tree classifier that uses a tree structure to
make predictions.
sklearn.ensemble.RandomForestClassifier()
A meta-estimator that fits a number of decision trees
on various sub-samples of the dataset.
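All of these estimators share the same fit/predict interface. A minimal sketch (iris is a stand-in dataset; the hyperparameter values are illustrative, not recommendations):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)     # learn from the training data
y_pred = model.predict(X_test)  # predict labels for unseen data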
Model Evaluation (Classification)
sklearn.metrics.accuracy_score()
Calculates the accuracy classification score
(proportion of correct predictions).
sklearn.metrics.precision_score()
Measures precision; useful for binary classification to
assess the positive class.
sklearn.metrics.recall_score()
Measures recall, which is the ability of the classifier to
find all positive samples.
sklearn.metrics.f1_score()
Computes the F1 score, which balances precision and
recall.
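All four metrics take true and predicted labels. A quick sketch with made-up binary labels:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]  # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]  # toy predictions

print(accuracy_score(y_true, y_pred))   # share of correct predictions
print(precision_score(y_true, y_pred))  # of predicted positives, how many are correct
print(recall_score(y_true, y_pred))     # of actual positives, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall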
Model Evaluation (Regression)
sklearn.metrics.mean_absolute_error()
Computes the mean absolute error for regression
tasks.
sklearn.metrics.mean_squared_error()
Calculates the MSE regression loss, measuring how
close a regression line is to a set of data points.
sklearn.metrics.r2_score()
Calculates R squared - a regression performance
measure based on variance explained.
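The regression metrics follow the same pattern. A sketch with made-up targets:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]  # toy targets
y_pred = [2.5, 0.0, 2.0, 8.0]   # toy predictions

print(mean_absolute_error(y_true, y_pred))  # average absolute error
print(mean_squared_error(y_true, y_pred))   # average squared error
print(r2_score(y_true, y_pred))             # 1.0 means a perfect fit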
Cross-Validation
Evaluates a score by cross-validation on different
subsets of the data.
sklearn.model_selection.cross_val_score(
    estimator=model,
    X=X_train,
    y=y_train,
    cv=5,  # splitting strategy
    scoring='accuracy',
)
Learning the parameters of a prediction function
and testing it on the same data is a methodological
mistake: the score will be overly optimistic and
won't reveal overfitting.
Hyperparameter Tuning
Hyperparameter tuning is the process of
selecting the optimal values for a machine learning
model’s hyperparameters.
sklearn.model_selection.GridSearchCV()
Performs exhaustive search over specified
parameter values for an estimator.
Basically, brute-force search.
sklearn.model_selection.RandomizedSearchCV()
Randomly samples parameter settings. Uses
fewer resources than GridSearchCV.
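A minimal GridSearchCV sketch (iris is a stand-in dataset and the grid values are arbitrary examples):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}  # candidate values to try

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)  # evaluates every combination with cross-validation
print(search.best_params_, search.best_score_)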
Pipeline Creation
Use Pipeline to group multiple processing steps
together:
pipeline = sklearn.pipeline.Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
You can still use hyperparameter tuning on your
pipelines, just as if they were models (see the sketch below).
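A sketch that tunes the pipeline above (iris is a stand-in dataset, the grid is arbitrary); parameters of pipeline steps are addressed as <step name>__<parameter>:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

param_grid = {'classifier__n_estimators': [50, 100]}  # step name + '__' + parameter name

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)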
Kostya Numan