Logistic Regression
Logistic Regression is a statistical method used for binary classification—
predicting one of two possible outcomes based on one or more input features.
Despite the name, it's actually a classification algorithm, not a regression one.
When to Use Logistic Regression?
The dependent variable is categorical (often binary: 0/1, Yes/No,
True/False).
You want to estimate the probability of a class occurring, such as
whether a customer will buy a product or not.
Key Concepts
1. The Logistic Function (Sigmoid Function)
The core of logistic regression is the sigmoid function, which maps any real-valued number to the (0, 1) interval:

σ(z) = 1 / (1 + e^(−z))

Where:
z = β0 + β1x1 + β2x2 + … + βnxn
β0, β1, …, βn are the parameters (weights) learned from the data
The output σ(z) can be interpreted as the probability that the input belongs to class 1.
2. Model Prediction
The model predicts class 1 if the probability is greater than 0.5:

ŷ = 1 if σ(z) > 0.5, otherwise ŷ = 0
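As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the 0.5 decision rule (the weights and input values are made up for demonstration):

import numpy as np

def sigmoid(z):
    # Map any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights: [β0 (intercept), β1, β2]
beta = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 3.0])  # leading 1.0 multiplies the intercept

z = beta @ x                   # z = β0 + β1*x1 + β2*x2
prob = sigmoid(z)              # probability of class 1
prediction = int(prob > 0.5)   # predict class 1 if probability > 0.5
print(prob, prediction)        # ≈ 0.679, 1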
3. Loss Function
Instead of Mean Squared Error (used in linear regression), logistic regression uses Log Loss (also called Binary Cross-Entropy):

L(y, ŷ) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

Where:
y is the true label (0 or 1)
ŷ is the predicted probability
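Here is a minimal NumPy sketch of this loss, assuming y_true holds 0/1 labels and y_prob holds the model's predicted probabilities:

import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    # Clip probabilities so log(0) is never evaluated
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])
print(log_loss(y_true, y_prob))  # small, since predictions mostly agree with labels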
Advantages
Simple and easy to implement
Interpretable (coefficients show the effect of features)
Fast to train and predict
Works well for linearly separable data
Limitations
Only handles binary classification directly (multiclass requires multinomial extensions)
Assumes a linear relationship between the input variables and the log-odds (made explicit below)
Not suitable for complex relationships unless features are engineered well
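The linearity assumption can be made explicit: inverting the sigmoid (the logit transform) on the predicted probability p = σ(z) recovers the linear model, so the log-odds are linear in the features:

log(p / (1 − p)) = β0 + β1x1 + β2x2 + … + βnxn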
Use Cases
Email spam detection
Medical diagnosis (e.g., predicting if a tumor is malignant)
Customer churn prediction
Credit scoring
Binary Classification and the Sigmoid Function in Logistic Regression
Binary Classification
Binary classification is a type of supervised learning where the goal is to classify
input data into one of two classes (e.g., spam vs. not spam, malignant vs.
benign tumor).
Target variable:
y ∈ {0, 1}
Examples:
Email classification (spam or not)
Disease diagnosis (positive or negative)
Loan default prediction (default or not)
Sigmoid Function
At the heart of logistic regression is the sigmoid (logistic) function, which maps any real-valued number into the (0, 1) interval:

σ(z) = 1 / (1 + e^(−z))

Where:
z = wᵀx + b (linear combination of input features)
σ(z) is the predicted probability of class 1
Interpretation:
σ(z) → 1: strong prediction for class 1
σ(z) → 0: strong prediction for class 0
Prediction Rule
Once the sigmoid gives the probability, logistic regression classifies as:

ŷ = 1 if σ(z) > 0.5, otherwise ŷ = 0
Loss Function
To train logistic regression, we minimize the binary cross-entropy loss:

L(y, ŷ) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

Where:
y is the true label
ŷ = σ(z) is the predicted probability
Why Use Sigmoid?
Squashes outputs between 0 and 1 → perfect for probabilities.
Smooth and differentiable → great for gradient descent optimization.
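Because the sigmoid is differentiable, the cross-entropy loss has a simple gradient, so the weights can be learned by gradient descent. Below is a minimal from-scratch sketch (a toy illustration with made-up data, a fixed learning rate, and no regularization; not how scikit-learn fits the model internally):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature; class 1 roughly when x > 0
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])  # weight vector
b = 0.0                   # bias
lr = 0.1                  # learning rate

for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of cross-entropy w.r.t. w
    grad_b = np.mean(p - y)          # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int))  # should match y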
Model evaluation: Accuracy, Precision, Recall, F1-Score
1. Accuracy
The proportion of correct predictions out of all predictions made:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
Use Case: Best used when the classes are balanced.
Limitation: Misleading on imbalanced datasets (e.g., 95% of samples in one class).
2. Precision
The proportion of correctly predicted positive observations out of all predicted positives:

Precision = TP / (TP + FP)
Use Case: Important when the cost of false positives is high (e.g., spam detection).
Interpretation: "Of all the things we predicted as positive, how many actually were?"
3. Recall (Sensitivity or True Positive Rate)
The proportion of actual positives correctly identified:

Recall = TP / (TP + FN)
Use Case: Important when the cost of false negatives is high (e.g., cancer detection).
Interpretation: "Of all the actual positives, how many did we catch?"
4. F1-Score
The harmonic mean of precision and recall; it balances the two metrics:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Use Case: Useful when you need a balance between precision and recall, especially with imbalanced classes.
Metric | Good For | Bad For
Accuracy | Balanced datasets | Imbalanced datasets
Precision | Low tolerance for false positives | Low tolerance for false negatives
Recall | Low tolerance for false negatives | Low tolerance for false positives
F1-Score | Balance of precision and recall | Requires a tradeoff between the two
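To make the four formulas concrete, here is a small sketch that computes them from hypothetical confusion-matrix counts (the numbers are invented for illustration):

# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # 0.89
print(f"Recall:    {recall:.2f}")     # 0.80
print(f"F1-Score:  {f1:.2f}")         # 0.84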
Use Case: Predicting Product Purchase Based on Age and Estimated Salary
1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
2. Generate Sample Data (or load your own CSV)
# Simulate some data
np.random.seed(0)
n_samples = 200
age = np.random.randint(18, 60, size=n_samples)
salary = np.random.randint(20000, 100000, size=n_samples)
purchased = (salary > 50000).astype(int)  # Simplified rule for demonstration
# Create DataFrame
df = pd.DataFrame({'Age': age, 'EstimatedSalary': salary, 'Purchased': purchased})
3. Prepare the Data
# Features and target
X = df[['Age', 'EstimatedSalary']]
y = df['Purchased']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
4. Train Logistic Regression Model
# Model training
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
5. Evaluate the Model
# Predictions
y_pred = model.predict(X_test_scaled)
# Evaluation metrics
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
6. Make a New Prediction
# Predict for a new customer: Age 35, Salary $60,000
# Pass a DataFrame so the feature names match those the scaler was fitted with
new_customer = pd.DataFrame({'Age': [35], 'EstimatedSalary': [60000]})
new_data = scaler.transform(new_customer)
prediction = model.predict(new_data)
print("Will the customer buy the product?", "Yes" if prediction[0] == 1 else "No")
Complete Python Example: Training and Evaluating a Logistic Regression Model
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
# Step 2: Create or load the dataset (example: customer purchase)
# Simulated dataset
data = {
    'Age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
    'Salary': [15000, 29000, 48000, 60000, 52000, 80000, 82000, 90000, 95000, 99000],
    'Purchased': [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Step 3: Define features (X) and target (y)
X = df[['Age', 'Salary']]
y = df['Purchased']
# Step 4: Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Step 5: Train logistic regression model (max_iter raised since Salary is unscaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Step 6: Predict on test set
y_pred = model.predict(X_test)
# Step 7: Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))