Machine Learning Cheat Sheet
Types of Machine Learning
Supervised Learning
Definition: Learning with labeled training data
Goal: Predict labels or values for new, unseen data
Types:
Classification: Predict discrete categories/classes
Regression: Predict continuous numerical values
Unsupervised Learning
Definition: Learning patterns in data without labels
Goal: Discover hidden structure in data
Types:
Clustering: Group similar data points
Dimensionality Reduction: Reduce feature space
Association Rules: Find relationships between variables
Reinforcement Learning
Definition: Learning through interaction with environment
Goal: Maximize cumulative reward
Components: Agent, Environment, Actions, Rewards, Policy
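A minimal tabular Q-learning sketch on a toy one-dimensional chain environment; the environment, states, and hyperparameters are invented here purely to illustrate the agent/environment/reward loop.
python
import numpy as np
# Toy chain: states 0..4, actions 0 = left, 1 = right; reaching state 4 gives reward 1 and ends the episode
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(42)
for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection (explore vs. exploit)
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        done = next_state == n_states - 1
        # Q-learning update: move Q toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
print(Q)  # learned action values; the "right" action should dominate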
Data Preprocessing
Data Cleaning
python
import pandas as pd
import numpy as np
# Handle missing values (these return new DataFrames; reassign or pass inplace=True)
df.dropna() # Remove rows with missing values
df.fillna(value) # Fill missing values with a given value
df.interpolate() # Interpolate missing values
# Handle duplicates
df.drop_duplicates() # Remove duplicate rows
Feature Scaling
python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Standardization (z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Min-Max scaling (0-1 range)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Robust scaling (using median and IQR)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
Feature Engineering
python
# One-hot encoding for categorical variables
pd.get_dummies(df, columns=['category'])
# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# Feature selection (chi2 requires non-negative feature values)
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)
Supervised Learning Algorithms
Linear Regression
python
from sklearn.linear_model import LinearRegression
# Simple linear regression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Equation: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Key Points:
Assumes linear relationship between features and target
Sensitive to outliers
Feature scaling becomes important when regularization (Ridge/Lasso) is added
Logistic Regression
python
from sklearn.linear_model import LogisticRegression
# Binary classification
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
# Equation: p = 1 / (1 + e^(-z)) where z = β₀ + β₁x₁ + ... + βₙxₙ
Key Points:
Uses sigmoid function for probability
Good for binary and multiclass classification
Provides probability estimates
Decision Trees
python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Classification
model = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Key Points:
Easy to interpret and visualize
Can handle both numerical and categorical features
Prone to overfitting; prune or limit tree depth (see the cost-complexity pruning sketch below)
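A minimal sketch of one pruning option, cost-complexity pruning via ccp_alpha; it assumes X_train, X_test, y_train, y_test come from an earlier train_test_split.
python
from sklearn.tree import DecisionTreeClassifier
# Compute the pruning path, then compare train/test accuracy for a few alphas
# (larger ccp_alpha = more pruning = simpler tree)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f} train={tree.score(X_train, y_train):.3f} test={tree.score(X_test, y_test):.3f}")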
Random Forest
python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Classification
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Feature importance
importances = model.feature_importances_
Key Points:
Ensemble of decision trees
Reduces overfitting compared to single trees
Provides feature importance scores
Support Vector Machines (SVM)
python
from sklearn.svm import SVC, SVR
# Classification
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Key Points:
Effective for high-dimensional data
Memory efficient
Versatile (different kernels: linear, polynomial, RBF)
K-Nearest Neighbors (KNN)
python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# Classification
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Key Points:
Simple, non-parametric algorithm
Sensitive to feature scaling
Computationally expensive for large datasets
Naive Bayes
python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
# Gaussian Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Key Points:
Assumes feature independence
Fast and efficient
Good for text classification
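A small end-to-end text-classification sketch with MultinomialNB; the toy texts and labels below are made up purely for illustration.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Toy corpus for illustration only; use a real labeled dataset in practice
texts = ["free prize click now", "meeting at noon", "win money now", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]
# Bag-of-words counts feed naturally into Multinomial Naive Bayes
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)
print(text_clf.predict(["claim your free money"]))  # likely ['spam']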
Unsupervised Learning Algorithms
K-Means Clustering
python
from sklearn.cluster import KMeans
# Clustering
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)
# Elbow method for optimal k
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
Key Points:
Partitions data into k clusters
Sensitive to initialization and outliers
Assumes spherical clusters
Hierarchical Clustering
python
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Agglomerative clustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = model.fit_predict(X)
# Dendrogram
import matplotlib.pyplot as plt
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()
DBSCAN
python
from sklearn.cluster import DBSCAN
# Density-based clustering
model = DBSCAN(eps=0.5, min_samples=5)
clusters = model.fit_predict(X)
Key Points:
Can find arbitrary shaped clusters
Handles noise and outliers
Doesn't require specifying number of clusters
Principal Component Analysis (PCA)
python
from sklearn.decomposition import PCA
# Dimensionality reduction
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Explained variance ratio
print(pca.explained_variance_ratio_)
Key Points:
Linear dimensionality reduction
Preserves maximum variance
Components are orthogonal
Model Evaluation
Classification Metrics
python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)
# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Comprehensive report
report = classification_report(y_true, y_pred)
Regression Metrics
python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Regression metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
Cross-Validation
python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
# K-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
# Stratified K-fold for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=skf)
Model Selection and Tuning
Train-Test Split
python
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
Grid Search
python
from sklearn.model_selection import GridSearchCV
# Parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10],
'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Best parameters
print(grid_search.best_params_)
print(grid_search.best_score_)
Random Search
python
from sklearn.model_selection import RandomizedSearchCV
# Random search
random_search = RandomizedSearchCV(
RandomForestClassifier(), param_grid, n_iter=10, cv=5
)
random_search.fit(X_train, y_train)
Overfitting and Underfitting
Bias-Variance Tradeoff
High Bias (Underfitting): Model is too simple to capture the underlying pattern; poor scores on both training and test data
High Variance (Overfitting): Model is too complex and fits noise in the training data; high training score, poor test score
Goal: Find the right balance
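One way to see the tradeoff is to compare training and validation scores as model complexity grows; the sketch below uses tree depth as the complexity knob and assumes X and y are already defined.
python
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
# Train vs. validation score as depth grows: a large gap signals high variance,
# uniformly low scores signal high bias
depths = list(range(1, 15))
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=depths, cv=5
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d} train={tr:.3f} val={va:.3f}")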
Regularization Techniques
python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso regression (L1 regularization)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# Elastic Net (L1 + L2 regularization)
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)
Early Stopping
python
from sklearn.neural_network import MLPClassifier
# Neural network with early stopping
model = MLPClassifier(
hidden_layer_sizes=(100, 50),
max_iter=1000,
early_stopping=True,
validation_fraction=0.1
)
model.fit(X_train, y_train)
Ensemble Methods
Bagging
python
from sklearn.ensemble import BaggingClassifier
# Bagging
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator in scikit-learn < 1.2
    n_estimators=10,
    random_state=42
)
model.fit(X_train, y_train)
Boosting
python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
# AdaBoost
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X_train, y_train)
# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)
# XGBoost
xgb = XGBClassifier(n_estimators=100)
xgb.fit(X_train, y_train)
Voting
python
from sklearn.ensemble import VotingClassifier
# Voting classifier
voting_clf = VotingClassifier(
estimators=[
('lr', LogisticRegression()),
('rf', RandomForestClassifier()),
('svm', SVC())
],
voting='hard' # or 'soft' for probability averaging
)
voting_clf.fit(X_train, y_train)
Deep Learning Basics
Neural Network Architecture
python
from sklearn.neural_network import MLPClassifier
# Multi-layer perceptron
model = MLPClassifier(
hidden_layer_sizes=(100, 50),
activation='relu',
solver='adam',
max_iter=1000
)
model.fit(X_train, y_train)
Key Concepts
Neurons: Basic processing units
Layers: Input, Hidden, Output
Activation Functions: ReLU, Sigmoid, Tanh
Backpropagation: Learning algorithm
Gradient Descent: Optimization method
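A bare-bones gradient descent sketch on a linear model, illustrating the update rule only (this is not how scikit-learn's MLPClassifier is implemented internally); the synthetic data and learning rate are chosen purely for illustration.
python
import numpy as np
# Synthetic data: y = 2*x1 - 1*x2 + 0.5
rng = np.random.default_rng(0)
X_gd = rng.normal(size=(100, 2))
y_gd = X_gd @ np.array([2.0, -1.0]) + 0.5
w, b, lr = np.zeros(2), 0.0, 0.1
for step in range(200):
    y_hat = X_gd @ w + b
    error = y_hat - y_gd
    # Gradients of mean squared error with respect to w and b
    grad_w = 2 * X_gd.T @ error / len(y_gd)
    grad_b = 2 * error.mean()
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)  # should approach [2, -1] and 0.5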
Feature Selection
Filter Methods
python
from sklearn.feature_selection import SelectKBest, chi2, f_classif
# Select k best features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get indices of the selected features (map to names via X.columns if X is a DataFrame)
selected_features = selector.get_support(indices=True)
Wrapper Methods
python
from sklearn.feature_selection import RFE
# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Embedded Methods
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Select features based on importance
selector = SelectFromModel(RandomForestClassifier())
X_selected = selector.fit_transform(X, y)
Common Pitfalls and Best Practices
Data Leakage
Time Series: Never let information from the future leak into the data used to train on the past
Cross-validation: Make sure splits respect time order and grouped samples
Feature Engineering: Fit scalers and encoders on the training split only, then apply them to the test split (see the pipeline sketch below)
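A minimal sketch of leakage-safe preprocessing: wrapping the scaler and model in a Pipeline means the scaler is re-fit on each training fold during cross-validation (assumes X and y are already defined).
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The scaler only ever sees the training fold, so no validation-fold statistics leak in
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(cv_scores.mean())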
Best Practices
1. Start Simple: Begin with simple models
2. Understand Your Data: Explore before modeling
3. Feature Engineering: Often more important than algorithm choice
4. Cross-Validation: Always validate your results
5. Monitor Performance: Track metrics on validation set
6. Document Everything: Keep track of experiments
Common Mistakes
Using test data for model selection
Ignoring class imbalance (see the class-weight sketch below)
Not scaling features for distance-based algorithms
Overfitting to validation set
Not checking for data leakage
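A minimal sketch for the class-imbalance point: a stratified split plus class_weight='balanced', which reweights classes inversely to their frequency (assumes X and y are already defined).
python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Stratified split keeps class proportions; balanced weights counter the majority class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))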
Performance Optimization
Computational Efficiency
python
# Use appropriate data types
df['category'] = df['category'].astype('category')
# Vectorized operations
np.sum(array) # Instead of for loops
# Parallel processing
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_jobs=-1) # Use all CPU cores
Memory Management
python
# Use generators for large datasets
def data_generator():
    for chunk in pd.read_csv('large_file.csv', chunksize=1000):
        yield chunk
# Reduce memory usage
# Reduce memory usage by downcasting integer columns
int_cols = df.select_dtypes(include=['int64']).columns
df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')
Model Deployment
Saving and Loading Models
python
import joblib
import pickle
# Save model
joblib.dump(model, 'model.pkl')
# Load model
model = joblib.load('model.pkl')
# With pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
Model Serving
python
# Simple Flask API example
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('model.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    prediction = model.predict([data['features']])
    # Convert the numpy result to a plain Python value for JSON serialization
    return jsonify({'prediction': prediction.tolist()[0]})

if __name__ == '__main__':
    app.run()
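A hypothetical client call against this endpoint, assuming the Flask app is running locally on the default port 5000; the example feature vector is illustrative and must match what the model was trained on.
python
import requests
# Send a feature vector to the /predict endpoint and print the JSON response
response = requests.post(
    'http://localhost:5000/predict',
    json={'features': [5.1, 3.5, 1.4, 0.2]}
)
print(response.json())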
Quick Reference
Algorithm Selection Guide
Linear Regression: Simple, interpretable, continuous target
Logistic Regression: Binary classification, probability estimates
Decision Trees: Interpretable, handles mixed data types
Random Forest: General purpose, less prone to overfitting than single trees
SVM: High-dimensional data, non-linear relationships
KNN: Simple, non-parametric, local patterns
Naive Bayes: Text classification, categorical features
K-Means: Spherical clusters, known number of clusters
DBSCAN: Arbitrary shaped clusters, unknown number of clusters
Python Libraries
scikit-learn: General machine learning
pandas: Data manipulation
numpy: Numerical computing
matplotlib/seaborn: Visualization
xgboost: Gradient boosting
tensorflow/pytorch: Deep learning
statsmodels: Statistical modeling