
Machine Learning Cheat Sheet

Types of Machine Learning

Supervised Learning
Definition: Learning with labeled training data
Goal: Predict labels for new, unseen data

Types:
Classification: Predict discrete categories/classes

Regression: Predict continuous numerical values

Unsupervised Learning
Definition: Learning patterns in data without labels
Goal: Discover hidden structure in data

Types:
Clustering: Group similar data points

Dimensionality Reduction: Reduce feature space


Association Rules: Find relationships between variables
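
A minimal, library-free sketch of the support and confidence behind an association rule (the toy basket data below is purely illustrative):

python

import pandas as pd

# Toy one-hot basket data: each row is a transaction
baskets = pd.DataFrame({
    'bread':  [1, 1, 0, 1, 1],
    'butter': [1, 1, 0, 0, 1],
    'milk':   [0, 1, 1, 1, 0],
})

# Rule "bread -> butter": support = P(bread & butter), confidence = P(butter | bread)
support = ((baskets['bread'] == 1) & (baskets['butter'] == 1)).mean()
confidence = support / (baskets['bread'] == 1).mean()
print(f"support={support:.2f}, confidence={confidence:.2f}")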

Reinforcement Learning
Definition: Learning through interaction with environment

Goal: Maximize cumulative reward


Components: Agent, Environment, Actions, Rewards, Policy
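
A toy sketch of the agent/environment loop (a simple epsilon-greedy bandit, not tied to any RL library):

python

import numpy as np

# Two-armed bandit: the environment returns a noisy reward for each action
rng = np.random.default_rng(42)
true_means = [0.2, 0.8]        # hidden reward means (unknown to the agent)
q_values = np.zeros(2)         # agent's value estimates
counts = np.zeros(2)
epsilon = 0.1                  # exploration rate

for step in range(1000):
    # Policy: explore with probability epsilon, otherwise act greedily
    if rng.random() < epsilon:
        action = rng.integers(2)
    else:
        action = int(np.argmax(q_values))
    reward = rng.normal(true_means[action], 0.1)   # environment feedback
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]  # incremental mean

print(q_values)  # estimates approach the true means; the agent prefers action 1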

Data Preprocessing

Data Cleaning

python
import pandas as pd
import numpy as np

# Handle missing values
df.dropna()           # Remove rows with missing values
df.fillna(value)      # Fill missing values
df.interpolate()      # Interpolate missing values

# Handle duplicates
df.drop_duplicates()  # Remove duplicate rows

Feature Scaling

python

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standardization (z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-Max scaling (0-1 range)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Robust scaling (using median and IQR)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

Feature Engineering

python

# One-hot encoding for categorical variables
pd.get_dummies(df, columns=['category'])

# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# Feature selection
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

Supervised Learning Algorithms

Linear Regression

python

from sklearn.linear_model import LinearRegression

# Simple linear regression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Equation: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Key Points:

Assumes linear relationship between features and target

Sensitive to outliers

Feature scaling matters for the regularized variants (Ridge/Lasso)

Logistic Regression

python

from sklearn.linear_model import LogisticRegression

# Binary classification
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

# Equation: p = 1 / (1 + e^(-z)) where z = β₀ + β₁x₁ + ... + βₙxₙ

Key Points:

Uses sigmoid function for probability

Good for binary and multiclass classification

Provides probability estimates

Decision Trees

python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification
model = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Points:

Easy to interpret and visualize

Can handle both numerical and categorical features

Prone to overfitting (use pruning)
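
A sketch of cost-complexity pruning to control overfitting, choosing ccp_alpha by cross-validation (variable names follow the earlier snippets):

python

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compute the cost-complexity pruning path, then pick alpha by cross-validation
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=42),
                          X_train, y_train, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
model = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)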

Random Forest

python

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Feature importance
importances = model.feature_importances_

Key Points:

Ensemble of decision trees

Reduces overfitting compared to single trees

Provides feature importance scores

Support Vector Machines (SVM)

python

from sklearn.svm import SVC, SVR

# Classification
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Points:
Effective for high-dimensional data

Memory efficient

Versatile (different kernels: linear, polynomial, RBF)
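
A quick sketch comparing kernels on the same split (scores depend entirely on the data; variable names follow the snippet above):

python

from sklearn.svm import SVC

# Try a few kernels and compare held-out accuracy
for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, C=1.0, gamma='scale')
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))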

K-Nearest Neighbors (KNN)

python

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Points:

Simple, non-parametric algorithm

Sensitive to feature scaling

Computationally expensive for large datasets
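
Because KNN is sensitive to feature scaling (noted above), a common pattern is to put the scaler and the classifier in one pipeline (a sketch reusing the variable names from the snippet above):

python

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale features inside the pipeline so distances are computed on comparable scales
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)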

Naive Bayes

python

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Gaussian Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Points:

Assumes feature independence

Fast and efficient

Good for text classification
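
A minimal text-classification sketch with MultinomialNB (the toy texts and labels are purely illustrative):

python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bag-of-words counts feed the multinomial model
texts = ["cheap pills now", "meeting at noon", "win cash now", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["win cash"]))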

Unsupervised Learning Algorithms

K-Means Clustering

python
from sklearn.cluster import KMeans

# Clustering
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)

# Elbow method for optimal k
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

Key Points:

Partitions data into k clusters

Sensitive to initialization and outliers


Assumes spherical clusters
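
Besides the elbow method, the silhouette score is a common way to compare choices of k (a sketch reusing the cluster labels from above):

python

from sklearn.metrics import silhouette_score

# Higher silhouette (closer to 1) means tighter, better-separated clusters
score = silhouette_score(X, clusters)
print(f"silhouette score: {score:.3f}")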

Hierarchical Clustering

python

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Agglomerative clustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = model.fit_predict(X)

# Dendrogram
import matplotlib.pyplot as plt
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()

DBSCAN

python

from sklearn.cluster import DBSCAN

# Density-based clustering
model = DBSCAN(eps=0.5, min_samples=5)
clusters = model.fit_predict(X)

Key Points:
Can find arbitrary shaped clusters
Handles noise and outliers

Doesn't require specifying number of clusters
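
A quick way to inspect the result (DBSCAN labels noise points as -1):

python

import numpy as np

# Count clusters (excluding noise) and noise points from the labels above
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = int(np.sum(clusters == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")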

Principal Component Analysis (PCA)

python

from sklearn.decomposition import PCA

# Dimensionality reduction
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Explained variance ratio
print(pca.explained_variance_ratio_)

Key Points:

Linear dimensionality reduction

Preserves maximum variance


Components are orthogonal
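
To keep enough components for a target amount of variance, n_components can be given as a fraction (a sketch):

python

from sklearn.decomposition import PCA

# Keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())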

Model Evaluation

Classification Metrics

python

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

# Basic metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Comprehensive report
report = classification_report(y_true, y_pred)

 

Regression Metrics

python

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Regression metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

Cross-Validation

python

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

# K-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Stratified K-fold for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=skf)

Model Selection and Tuning

Train-Test Split

python

from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Grid Search

python
from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters
print(grid_search.best_params_)
print(grid_search.best_score_)

Random Search

python

from sklearn.model_selection import RandomizedSearchCV

# Random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(), param_distributions=param_grid, n_iter=10, cv=5
)
random_search.fit(X_train, y_train)

Overfitting and Underfitting

Bias-Variance Tradeoff
High Bias (Underfitting): Model is too simple and misses the underlying pattern
High Variance (Overfitting): Model is too complex and fits noise in the training data

Goal: Find the right balance between the two
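
One way to see where a model sits on this tradeoff is a validation curve over a complexity parameter (a sketch using a decision tree's max_depth):

python

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Train vs. validation score as complexity grows:
# low depth -> both scores low (underfitting); high depth -> train high, validation drops (overfitting)
depths = np.arange(1, 15)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=depths, cv=5
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.3f}  validation={va:.3f}")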

Regularization Techniques

python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Lasso regression (L1 regularization)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)

# Elastic Net (L1 + L2 regularization)
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)

Early Stopping

python

from sklearn.neural_network import MLPClassifier

# Neural network with early stopping
model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.1
)
model.fit(X_train, y_train)

Ensemble Methods

Bagging

python

from sklearn.ensemble import BaggingClassifier

# Bagging
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator in scikit-learn < 1.2
    n_estimators=10,
    random_state=42
)
model.fit(X_train, y_train)

Boosting

python

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# AdaBoost
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)

# XGBoost
xgb = XGBClassifier(n_estimators=100)
xgb.fit(X_train, y_train)

Voting

python

from sklearn.ensemble import VotingClassifier

# Voting classifier
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier()),
        ('svm', SVC())
    ],
    voting='hard'  # or 'soft' for probability averaging
)
voting_clf.fit(X_train, y_train)

Deep Learning Basics

Neural Network Architecture

python
from sklearn.neural_network import MLPClassifier

# Multi-layer perceptron
model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=1000
)
model.fit(X_train, y_train)

Key Concepts
Neurons: Basic processing units

Layers: Input, Hidden, Output


Activation Functions: ReLU, Sigmoid, Tanh

Backpropagation: Learning algorithm


Gradient Descent: Optimization method
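
A tiny NumPy sketch of these pieces: activation functions and a plain gradient-descent update (illustrative only, not how MLPClassifier is implemented internally):

python

import numpy as np

# Common activation functions
def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradient descent on a toy loss f(w) = (w - 3)^2
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of the loss
    w -= lr * grad       # descent step
print(w)                 # converges toward the minimum at w = 3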

Feature Selection

Filter Methods

python

from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Select k best features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get indices of the selected features
selected_features = selector.get_support(indices=True)

Wrapper Methods

python

from sklearn.feature_selection import RFE

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)

Embedded Methods

python

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Select features based on importance
selector = SelectFromModel(RandomForestClassifier())
X_selected = selector.fit_transform(X, y)

Common Pitfalls and Best Practices

Data Leakage
Time Series: Don't use future information to predict the past (split chronologically)

Cross-validation: Fit preprocessing inside each fold, never on the full dataset

Feature Engineering: Fit transformations on the training data only, then apply them to the test data
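
A sketch of the standard way to avoid preprocessing leakage: put the transformer and the model in one Pipeline so the scaler is re-fit on each training fold during cross-validation:

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler sees only the training portion of each CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
cv_scores = cross_val_score(pipe, X, y, cv=5)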

Best Practices
1. Start Simple: Begin with simple models

2. Understand Your Data: Explore before modeling


3. Feature Engineering: Often more important than algorithm choice

4. Cross-Validation: Always validate your results


5. Monitor Performance: Track metrics on validation set

6. Document Everything: Keep track of experiments

Common Mistakes
Using test data for model selection
Ignoring class imbalance (see the sketch after this list)

Not scaling features for distance-based algorithms


Overfitting to validation set

Not checking for data leakage
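
One simple mitigation for class imbalance is class weighting (a sketch; resampling methods such as SMOTE are another option):

python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))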

Performance Optimization

Computational Efficiency

python
# Use appropriate data types
df['category'] = df['category'].astype('category')

# Vectorized operations
np.sum(array) # Instead of for loops

# Parallel processing
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_jobs=-1) # Use all CPU cores

Memory Management

python

# Use generators for large datasets
def data_generator():
    for chunk in pd.read_csv('large_file.csv', chunksize=1000):
        yield chunk

# Reduce memory usage by downcasting integer columns
int_cols = df.select_dtypes(include=['int64']).columns
df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')

Model Deployment

Saving and Loading Models

python

import joblib
import pickle

# Save model
joblib.dump(model, 'model.pkl')

# Load model
model = joblib.load('model.pkl')

# With pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

Model Serving

python
# Simple Flask API example
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    prediction = model.predict([data['features']])
    return jsonify({'prediction': prediction.tolist()[0]})

if __name__ == '__main__':
    app.run()
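
A hypothetical client call against this endpoint (assumes the app above is running locally on Flask's default port 5000; the feature values are illustrative):

python

import requests

# Send a feature vector and read back the prediction
response = requests.post('http://localhost:5000/predict',
                         json={'features': [5.1, 3.5, 1.4, 0.2]})
print(response.json())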

Quick Reference

Algorithm Selection Guide


Linear Regression: Simple, interpretable, continuous target

Logistic Regression: Binary classification, probability estimates


Decision Trees: Interpretable, handles mixed data types
Random Forest: General purpose, handles overfitting

SVM: High-dimensional data, non-linear relationships


KNN: Simple, non-parametric, local patterns

Naive Bayes: Text classification, categorical features


K-Means: Spherical clusters, known number of clusters

DBSCAN: Arbitrary shaped clusters, unknown number of clusters

Python Libraries
scikit-learn: General machine learning
pandas: Data manipulation

numpy: Numerical computing


matplotlib/seaborn: Visualization

xgboost: Gradient boosting


tensorflow/pytorch: Deep learning
statsmodels: Statistical modeling
