Mathematical Approaches to Credit Card
Transactions Fraud Detection:
Logistic Regression
Supervisor: Prof. Navnit Jha
JARIFUL HASSAN
(South Asian University, New Delhi)
Outline
• Introduction to Logistic Regression
• Binary Classification Problem
• Probabilities of Positive and Negative Classes
• Sigmoid Function in Logistic Regression
• Estimation of Parameters using Maximum Likelihood
• Estimation of Parameters via Minimizing Negative Log-Likelihood
• First Gradient of Negative Log-Likelihood
• Second Gradient of Negative Log-Likelihood
• Convexity & Positive Definite Matrix of Negative Log-Likelihood Function
• Update Rule for Parameters (Newton-Raphson Method)
• Evaluation Metrics for Logistic Regression
• Project Overview: Fraud Detection in Credit Card Transactions
• Logistic Regression for Fraud Detection
• Random Forest Classifier for Fraud Detection
Logistic Regression
• Logistic Regression is a machine learning algorithm used for
binary classification tasks.
• It predicts the probability that a given input belongs to a
certain class.
• The output of logistic regression lies in the range (0, 1) as a
probability. To classify inputs, we use a threshold (commonly
0.5) to assign them to one of the binary classes.
Examples of Binary Classification
• Spam Detection: Classify emails as "spam" (1) or "not spam"
(-1).
• Fraud Detection: Identify credit card transactions as
"fraudulent" (1) or "legitimate" (-1).
• Purchase Prediction: Predict if a customer will buy a product
(1) or not (-1).
Binary Classification
Input Data
• $x = [x_1, x_2, \ldots, x_p]$, where $x_i \in \mathbb{R}$ and p is the number of
features.
Output (Target)
• The output y is binary, meaning it has two possible
values (classes), often represented as:
▶ $y \in \{-1, 1\}$ (e.g., −1 = Negative, 1 = Positive).
Binary Classification Problem: The goal is to derive a decision
rule that maps a feature vector $x \in \mathbb{R}^p$ to a binary response
$y \in \{-1, 1\}$.
Objective: Given a dataset with N observations,
$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, where $x_i \in \mathbb{R}^p$ is the i-th
feature vector and $y_i \in \{-1, 1\}$ is the corresponding binary
outcome, the objective is to find a function $f : \mathbb{R}^p \to \{-1, 1\}$ that
minimizes the classification error.
Consider a Health-Sector Problem
Problem: Predicting Disease Diagnosis. Consider a hospital that
is trying to predict whether a patient has a
disease based on several features:
• $x_1$: Age
• $x_2$: Blood pressure
• $x_3$: Cholesterol levels
• $x_4$: Exercise habits
• $x_5$: Smoking habits
Target:
• Yes (1) = Patient has the disease (label it as 1)
• No (-1) = Patient does not have the disease (label it as -1)
Objective: We have to find a map $f : \mathbb{R}^5 \to \{-1, 1\}$ that minimizes
the classification error.
Question: How can we classify the input using the available data?
Answer: We will use the sigmoid function.
Sigmoid Function
The sigmoid function takes any real-valued input and maps it to a
value strictly between 0 and 1, which makes it well suited to modeling
probabilities. The sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad z \in \mathbb{R}$$
so that $\sigma : \mathbb{R} \to (0, 1)$.
• The sigmoid curve has an "S" shape, which is why it is also
known as the S-shaped curve.
• It is a differentiable and monotonically increasing function.
• As $z \to \infty$, $\sigma(z) \to 1$.
• As $z \to -\infty$, $\sigma(z) \to 0$.
• The sigmoid function approaches 1 and 0 as limits but
never actually reaches them, so the lines y = 1 and y = 0 are horizontal asymptotes.
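The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sigmoid` and the rewriting in terms of `exp(-|z|)` (used only to avoid overflow for large negative inputs) are illustrative choices, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid 1 / (1 + exp(-z)), evaluated without overflow."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    e = np.exp(-np.abs(z))  # always in (0, 1], so exp never overflows
    # For z >= 0 use 1 / (1 + e^{-z}); for z < 0 use the equal form e^{z} / (1 + e^{z}).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.linspace(-30.0, 30.0, 121)
s = sigmoid(z)
print(s.min(), s.max())  # both strictly inside (0, 1) on this grid
```

On this grid the values increase with z and satisfy the symmetry σ(−z) = 1 − σ(z).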
Probability Modeling of y = 1 and y = −1
Suppose that $\beta_0 \in \mathbb{R}$ is the intercept and $\beta \in \mathbb{R}^p$ is the
vector of weights for the input $x \in \mathbb{R}^p$ (x is treated as a row vector, so $x\beta$ is an inner product). Assume that the
probability of y = 1 given x is modeled by the sigmoid function:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + x\beta)}}$$
Then the probability of y = −1 given x can be expressed as:
$$P(y = -1 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\beta_0 + x\beta}}$$
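For $\beta_0 \in \mathbb{R}$ and $\beta \in \mathbb{R}^p$, both cases can be written as a single expression for
$y \in \{-1, 1\}$:
$$P(y \mid x) = \frac{1}{1 + e^{-y(\beta_0 + x\beta)}}$$
We call this the Binary Probability Function.
In particular, the probability of y = 1 (i.e., the sigmoid function applied to $\beta_0 + x\beta$) is given
by:
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + x\beta)}}$$
f(x) takes values between 0 and 1.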
Figure: Sigmoid curve
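As a quick check of the Binary Probability Function, the sketch below evaluates it for both labels and confirms that the two probabilities sum to 1. The parameter values and the input vector are hypothetical, chosen only for illustration.

```python
import numpy as np

def binary_probability(y, x, beta0, beta):
    """P(y | x) = 1 / (1 + exp(-y * (beta0 + x beta))) for y in {-1, +1}."""
    z = beta0 + np.dot(x, beta)
    return 1.0 / (1.0 + np.exp(-y * z))

# Hypothetical parameters and input (not taken from the slides).
beta0 = -0.5
beta = np.array([0.8, -1.2])
x = np.array([1.0, 0.3])

p_pos = binary_probability(+1, x, beta0, beta)  # P(y = 1 | x)
p_neg = binary_probability(-1, x, beta0, beta)  # P(y = -1 | x)
print(p_pos, p_neg, p_pos + p_neg)
```

Here β0 + xβ = −0.06 is slightly negative, so P(y = 1 | x) comes out just below 0.5.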
Logistic Regression Algorithm
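• Logistic Regression uses the sigmoid function to predict the class
of an input $x = [x_1, x_2, \ldots, x_p]$.
• The sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$, with $\beta_i \in \mathbb{R}$ for
$i = 0, 1, 2, \ldots, p$.
• Find the $\beta_i$ that determine z for a given input x.
• Decision Boundary: The decision threshold is typically set at
0.5:
$$\hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \ge 0.5 \\ -1 & \text{if } \sigma(z) < 0.5 \end{cases}$$
Since $\sigma(z) \ge 0.5$ exactly when $z \ge 0$, the decision boundary is the hyperplane $z = 0$.
Using this rule, we can classify our data into two classes.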
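The decision rule can be sketched in a few lines. This is an illustrative example, assuming NumPy; the function name `predict` and the coefficient values are hypothetical.

```python
import numpy as np

def predict(X, beta0, beta, threshold=0.5):
    """Return labels in {-1, +1} by thresholding sigmoid(beta0 + X beta)."""
    z = beta0 + X @ beta
    prob = 1.0 / (1.0 + np.exp(-z))
    return np.where(prob >= threshold, 1, -1)

# Hypothetical coefficients and three inputs with z = 1, 0 and -4.
beta0 = 0.0
beta = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [-1.0, 3.0]])

labels = predict(X, beta0, beta)
print(labels)  # [ 1  1 -1]
```

The middle input has z = 0, so σ(z) is exactly 0.5 and it falls on the positive side of the rule.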
Problem: How to estimate βi to obtain z?
Estimation of z using Maximum Likelihood
Assume that we are given N correctly classified observations
(training data on which we can train our model):
$$(x_1, y_1), \ldots, (x_N, y_N) \in \mathbb{R}^p \times \{-1, 1\}$$
Treating the observations as independent, the likelihood function can be written as:
$$L(\beta_0, \beta) = \prod_{i=1}^{N} P(y_i \mid x_i) = \prod_{i=1}^{N} \frac{1}{1 + e^{-y_i(\beta_0 + x_i\beta)}}$$
To estimate the parameters $\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$, we maximize
$L(\beta_0, \beta)$.
Why does maximizing $L(\beta_0, \beta)$ help us find
$\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$?
For each training observation we know whether x belongs to the positive class (1) or the negative class (−1), so
we choose the $\beta_0$, $\beta$ that maximize the Binary
Probability Function of the observed label for the given x:
$$P(y \mid x) = \frac{1}{1 + e^{-y(\beta_0 + x\beta)}}$$
Explanation:
• Specifically, we want the probability of each observed $y_i$ to be as close
to 1 as possible, indicating that the model assigns high
confidence to the correct label.
Maximizing the overall likelihood for the dataset:
• Maximizing $L(\beta_0, \beta)$ gives the values of $\beta_0$ and $\beta$ under which
the observed labels are most probable, so the fitted model agrees with the training
data as closely as possible.
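To make the argument concrete, the sketch below evaluates the likelihood on a tiny hypothetical dataset for two parameter choices. The data, the function name `likelihood`, and both parameter settings are made up for illustration.

```python
import numpy as np

def likelihood(beta0, beta, X, y):
    """Product over i of 1 / (1 + exp(-y_i * (beta0 + x_i beta)))."""
    margins = y * (beta0 + X @ beta)
    return np.prod(1.0 / (1.0 + np.exp(-margins)))

# Hypothetical 1-D data: positive inputs are labelled +1, negative inputs -1.
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])

good = likelihood(0.0, np.array([1.0]), X, y)   # slope agrees with the labels
bad = likelihood(0.0, np.array([-1.0]), X, y)   # slope contradicts the labels
print(good, bad)
```

The parameters that agree with the labels give a much larger likelihood, which is the behaviour that maximization exploits.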
Estimation of z using Negative Log-Likelihood Function
Log-Likelihood Function: The log-likelihood is obtained by taking
the logarithm of the likelihood function:
$$\log L(\beta_0, \beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i)$$
Substituting the expression for $P(y_i \mid x_i)$:
$$\sum_{i=1}^{N} \log \frac{1}{1 + e^{-y_i(\beta_0 + x_i\beta)}} = -\sum_{i=1}^{N} \log\left(1 + e^{-y_i(\beta_0 + x_i\beta)}\right)$$
The negative log-likelihood is given by:
$$\ell(\beta_0, \beta) = \sum_{i=1}^{N} \log(1 + v_i), \qquad v_i = e^{-y_i(\beta_0 + x_i\beta)} \text{ for } i = 1, 2, \ldots, N.$$
We minimize $\ell(\beta_0, \beta)$ to estimate $\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$.
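A minimal sketch of the negative log-likelihood follows, assuming NumPy. It uses `np.logaddexp(0, t)`, which computes log(1 + e^t) without overflow; that is an implementation choice of this sketch, and the toy data are hypothetical.

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """Sum over i of log(1 + v_i), with v_i = exp(-y_i * (beta0 + x_i beta))."""
    margins = y * (beta0 + X @ beta)
    return np.sum(np.logaddexp(0.0, -margins))  # log(1 + exp(-margin)), overflow-safe

# Hypothetical toy data.
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])
beta = np.array([1.0])

nll = neg_log_likelihood(0.0, beta, X, y)
# Cross-check against minus the log of the likelihood product.
direct = -np.log(np.prod(1.0 / (1.0 + np.exp(-y * (X @ beta)))))
print(nll, direct)
```

Both routes give the same value on this small example, while the `logaddexp` form also stays finite for very large |β|.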
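Why Use the Negative Log-Likelihood When We Have the
Likelihood?
1. Avoiding Numerical Underflow
▶ The likelihood is a product of N probabilities, each less than 1, so it can become extremely
small, approaching zero, which makes direct maximization
numerically difficult. In particular, a single factor near zero drives the whole product to zero:
$$\text{if } P(y_i \mid x_i) \to 0 \text{ for some } i, \text{ then } \prod_{i=1}^{N} P(y_i \mid x_i) \to 0$$
▶ The log-likelihood turns the product into a sum, which avoids this issue and maintains better
numerical accuracy.
2. Minimizing the Negative Log-Likelihood
▶ Minimizing the negative log-likelihood is equivalent to
maximizing the likelihood, because the logarithm is strictly increasing:
$$\text{Maximizing } L(\beta_0, \beta) \equiv \text{Minimizing } \ell(\beta_0, \beta)$$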
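The numerical issue is easy to reproduce. In the sketch below, 2000 hypothetical observations each have probability 0.5; the product underflows to zero in double precision, while the sum of logarithms stays perfectly representable.

```python
import numpy as np

# Hypothetical setting: 2000 observations, each with P(y_i | x_i) = 0.5.
probs = np.full(2000, 0.5)

product = np.prod(probs)         # 0.5**2000 is below the smallest positive double
log_sum = np.sum(np.log(probs))  # equals 2000 * log(0.5)

print(product)  # 0.0
print(log_sum)
```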
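First Gradient of Neg-Log-Likelihood Function
We write $(\beta_0, \beta) \in \mathbb{R} \times \mathbb{R}^p$ as $\beta \in \mathbb{R}^{p+1}$.
The neg-log-likelihood function is:
$$\ell(\beta) = \sum_{i=1}^{N} \log(1 + v_i)$$
where:
$$v_i = e^{-y_i \beta x_i}, \qquad \beta = [\beta_0, \beta], \qquad x_i = \begin{bmatrix} 1 \\ x_i \end{bmatrix}$$
so that $\beta x_i$ equals $\beta_0 + x_i\beta$ in the earlier notation.
The gradient of the neg-log-likelihood function is:
$$\nabla \ell(\beta) = \sum_{i=1}^{N} \frac{-x_i \, y_i \, v_i}{1 + v_i} = -X^T u$$
where:
$$u = \begin{bmatrix} \dfrac{y_1 v_1}{1 + v_1} \\ \vdots \\ \dfrac{y_N v_N}{1 + v_N} \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}$$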
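The gradient formula $-X^T u$ can be verified against a central finite-difference approximation. The sketch below assumes NumPy; the random design matrix, labels, and parameter vector are hypothetical, and X already contains the leading column of ones.

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """l(beta) = sum_i log(1 + exp(-y_i * x_i beta)); X includes the column of ones."""
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

def gradient(beta, X, y):
    """Gradient -X^T u with u_i = y_i * v_i / (1 + v_i)."""
    v = np.exp(-y * (X @ beta))
    u = y * v / (1.0 + v)
    return -X.T @ u

# Hypothetical data: 6 observations, intercept column plus 2 features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = np.array([1, -1, 1, 1, -1, -1])
beta = np.array([0.1, -0.3, 0.5])

# Central differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (neg_log_likelihood(beta + eps * e, X, y)
     - neg_log_likelihood(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(beta, X, y)
print(analytic, numeric)
```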
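Second Derivative of Negative Log-Likelihood
$$\nabla^2 \ell(\beta) = \nabla \left( \sum_{i=1}^{N} \frac{-x_i \, y_i \, v_i}{1 + v_i} \right)$$
Let $u_i = \frac{v_i}{1 + v_i}$ (the ratio without the factor $y_i$). We use the quotient rule to differentiate with respect
to $\beta$:
$$\frac{d u_i}{d\beta} = \frac{(1 + v_i) \cdot \frac{d}{d\beta} v_i - v_i \cdot \frac{d}{d\beta}(1 + v_i)}{(1 + v_i)^2}$$
Since $\frac{d}{d\beta} v_i = -y_i v_i x_i$, differentiating $u_i$ with respect to $\beta$ gives:
$$\frac{d u_i}{d\beta} = -\frac{y_i \, v_i \, x_i}{(1 + v_i)^2}$$
$$\implies \nabla^2 \ell(\beta) = \sum_{i=1}^{N} \frac{x_i \, y_i^2 \, v_i \, x_i^T}{(1 + v_i)^2}$$
Since $y_i^2 = 1$:
$$\implies \nabla^2 \ell(\beta) = \sum_{i=1}^{N} \frac{x_i \, v_i \, x_i^T}{(1 + v_i)^2}$$
This can be expressed as:
$$\nabla^2 \ell(\beta) = X^T W X$$
where W is a diagonal matrix:
$$W = \mathrm{diag}\left( \frac{v_1}{(1 + v_1)^2}, \frac{v_2}{(1 + v_2)^2}, \ldots, \frac{v_N}{(1 + v_N)^2} \right)$$
where $v_i = e^{-y_i \beta x_i} > 0$ for $i = 1, 2, \ldots, N$
$\implies$ W is a positive definite matrix.
$\implies$ $z^T X^T W X z = (Xz)^T W (Xz) > 0$ for any $z \neq 0$, provided X has full column rank (so that $Xz \neq 0$).
$\implies$ $\nabla^2 \ell(\beta_0, \beta)$ is a positive definite matrix.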
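The positive definiteness claim can be checked numerically by forming $X^T W X$ and inspecting its eigenvalues. This is a sketch under stated assumptions: NumPy is available, and the random design matrix (which has full column rank here), labels, and parameter vector are hypothetical.

```python
import numpy as np

def hessian(beta, X, y):
    """Hessian X^T W X with W = diag(v_i / (1 + v_i)^2)."""
    v = np.exp(-y * (X @ beta))
    w = v / (1.0 + v) ** 2
    return X.T @ (w[:, None] * X)  # scales row i of X by w_i, then multiplies by X^T

# Hypothetical data: 8 observations, intercept column plus 2 features.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = np.array([1, -1, 1, 1, -1, -1, 1, -1])
beta = np.array([0.2, -0.4, 0.7])

H = hessian(beta, X, y)
eigenvalues = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix, ascending
print(eigenvalues)
```

All eigenvalues are positive for this X. If X had linearly dependent columns, the smallest eigenvalue would instead be zero and the Hessian only positive semidefinite.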
Convexity & Positive Definiteness of ℓ(β0 , β)
1. Convexity of Negative Log-Likelihood:
• ℓ(β0 , β) is strictly convex because its Hessian matrix
∇²ℓ(β0 , β) is positive definite (assuming X has full column rank).
2. Positive Definiteness and Its Implications:
• Positive definiteness implies that the function is
"bowl-shaped" and has at most one minimum.
• In this case, a solution of ∇ℓ(β0 , β) = 0, when one exists, is the unique
global minimum. (If the two classes are perfectly linearly separable, no finite minimizer exists.)
3. Unique Minimum, Not Maximum:
• A positive definite Hessian ensures that any point where
∇ℓ(β) = 0 is a local minimum.
• Since the function is convex, this local minimum is also the
global minimum. There are no saddle points or local maxima.
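Newton-Raphson Method for Maximum Likelihood Estimation
We aim to find $\beta_0 \in \mathbb{R}$ and $\beta \in \mathbb{R}^p$ such that:
$$\nabla \ell(\beta_0, \beta) = 0$$
The update rule is given by:
$$(\beta_0, \beta) \leftarrow (\beta_0, \beta) - \{\nabla^2 \ell(\beta_0, \beta)\}^{-1} \nabla \ell(\beta_0, \beta)$$
where:
$$\nabla \ell(\beta_0, \beta) = -X^T u, \qquad \nabla^2 \ell(\beta_0, \beta) = X^T W X$$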
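The update rule can be sketched as a short loop. This is an illustrative implementation, not the one used in the project notebook: it assumes NumPy, starts from β = 0, solves the linear system instead of forming the inverse, and uses a hypothetical one-dimensional dataset whose classes overlap (so a finite minimizer exists). The function name, iteration cap, and tolerance are assumptions.

```python
import numpy as np

def newton_logistic(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression with labels in {-1, +1}.

    X must already contain a leading column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        v = np.exp(-y * (X @ beta))
        u = y * v / (1.0 + v)
        w = v / (1.0 + v) ** 2
        grad = -X.T @ u                    # gradient  -X^T u
        hess = X.T @ (w[:, None] * X)      # Hessian    X^T W X
        step = np.linalg.solve(hess, grad) # solves hess @ step = grad
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical, non-separable 1-D data.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.0, -0.2])
y = np.array([-1, -1, 1, -1, 1, 1, 1, -1])
X = np.column_stack([np.ones_like(x), x])

beta_hat = newton_logistic(X, y)
print(beta_hat)
```

At the returned point the gradient is numerically zero, and the slope is positive because larger inputs tend to carry the label +1 in this toy data.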
Evaluation Metrics for Classification Algorithms
Common Evaluation Metrics:
• Accuracy
• Confusion Matrix
• AUC-ROC (Area Under ROC Curve)
• True Positive Rate (TPR)
• False Positive Rate (FPR)
• False Negative Rate (FNR)
Confusion Matrix
• A confusion matrix is used to evaluate the performance of a
classification model.
• It compares the predicted labels with the actual labels.
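Structure (rows: predicted class, columns: actual class):
                      Actual Positive   Actual Negative
Predicted Positive          TP                FP
Predicted Negative          FN                TN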
• TP (True Positive): Correctly predicted positive instances.
• FP (False Positive): Incorrectly predicted positive instances
(Type I error).
• FN (False Negative): Incorrectly predicted negative
instances (Type II error).
• TN (True Negative): Correctly predicted negative instances.
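The four counts can be computed directly from label vectors. A minimal sketch, assuming NumPy and labels coded as 1 and −1 as in the theory slides; the function name and the example labels are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels in {-1, +1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == -1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == -1)))
    tn = int(np.sum((y_true == -1) & (y_pred == -1)))
    return tp, fp, fn, tn

# Hypothetical labels for eight observations.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```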
Accuracy
• Definition: Measures the overall correctness of the model.
• Formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Example: If 90 out of 100 predictions are correct, the
accuracy is 90%.
True Positive Rate (TPR) and False Positive Rate (FPR)
True Positive Rate (TPR):
• Also known as Sensitivity or Recall.
• TPR measures the proportion of actual positives that are
correctly identified.
• Formula:
$$\text{TPR} = \frac{TP}{TP + FN}$$
False Positive Rate (FPR):
• FPR measures the proportion of actual negatives that are
incorrectly identified as positives.
• Formula:
$$\text{FPR} = \frac{FP}{FP + TN}$$
where FP = False Positives, TN = True Negatives.
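The rate formulas translate directly into code. In the sketch below the counts are hypothetical, and the False Negative Rate is computed as FN / (TP + FN), i.e. 1 − TPR, which is its standard definition.

```python
def classification_rates(tp, fp, fn, tn):
    """Accuracy, TPR, FPR and FNR from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": tp / (tp + fn),   # sensitivity / recall
        "fpr": fp / (fp + tn),
        "fnr": fn / (tp + fn),   # equals 1 - TPR
    }

# Hypothetical counts.
rates = classification_rates(tp=3, fp=1, fn=1, tn=3)
print(rates)  # {'accuracy': 0.75, 'tpr': 0.75, 'fpr': 0.25, 'fnr': 0.25}
```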
ROC Curve and AUC
Receiver Operating Characteristic (ROC) Curve
• The Receiver Operating Characteristic (ROC) curve is a
graphical representation used to evaluate the performance of a
binary classification model. It illustrates the trade-off between the
True Positive Rate (TPR) and the False Positive Rate (FPR) as the classification threshold varies.
• The closer the ROC curve lies to the top-left corner, the larger
the Area Under the Curve (AUC) and the better the classifier.
Figure: ROC Curve
• AUC = 1: Perfect classifier, AUC = 0.5: random guessing
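One way to trace a ROC curve is to sweep the decision threshold over the predicted scores and record (FPR, TPR) at each value. The sketch below assumes NumPy and labels in {−1, 1}; the function name, labels, and scores are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return arrays (fpr, tpr) obtained by sweeping the threshold downward."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == -1)
    fpr, tpr = [], []
    for t in thresholds:
        predicted_positive = scores >= t
        tpr.append(np.sum(predicted_positive & (y_true == 1)) / n_pos)
        fpr.append(np.sum(predicted_positive & (y_true == -1)) / n_neg)
    return np.array(fpr), np.array(tpr)

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(y_true, scores)
print(list(zip(fpr.round(3), tpr.round(3))))
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.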
AUC - Area Under the ROC Curve
AUC (Area Under the ROC Curve):
• AUC measures the model’s ability to distinguish between
classes.
• AUC = 1: Perfect classifier
• AUC = 0.5: Random guessing
• AUC < 0.5: Worse than random guessing
Interpretation:
• AUC = 0.9 to 1: Excellent model
• AUC = 0.7 to 0.9: Good model
• AUC = 0.5 to 0.7: Poor model (close to random guessing)
Formula for AUC:
$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d\,\text{FPR}$$
AUC - Area Under the ROC Curve using Trapezoidal Rule
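Trapezoidal Rule Formula:
$$\text{AUC} \approx \sum_{i=1}^{N-1} \frac{1}{2} \left(\text{TPR}_i + \text{TPR}_{i+1}\right) \left(\text{FPR}_{i+1} - \text{FPR}_i\right)$$
• $\text{TPR}_i$: True Positive Rate at the i-th point.
• $\text{FPR}_i$: False Positive Rate at the i-th point.
• N: number of points on the ROC curve, ordered by increasing FPR.
• This method assumes linear interpolation between consecutive
points.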
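The trapezoidal sum is a one-liner once the ROC points are ordered by increasing FPR. The sketch assumes NumPy; the seven ROC points are hypothetical (they correspond to three positives and three negatives, with eight of the nine positive-negative pairs ranked correctly).

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Sum of 0.5 * (TPR_i + TPR_{i+1}) * (FPR_{i+1} - FPR_i)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    return float(np.sum(0.5 * (tpr[:-1] + tpr[1:]) * np.diff(fpr)))

# Hypothetical ROC points, ordered by increasing FPR.
fpr = [0, 0, 0, 1/3, 1/3, 2/3, 1]
tpr = [0, 1/3, 2/3, 2/3, 1, 1, 1]

auc = auc_trapezoid(fpr, tpr)
print(auc)  # about 0.889, i.e. 8/9
```

Vertical segments (where FPR does not change) contribute nothing to the sum, so only the horizontal and slanted segments add area.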
Credit Card Transactions Fraud Detection
Practical Problem: Predicting Fraud in Credit Card
Transactions
To tackle the issue of credit card fraud, we will utilize a dataset
from Kaggle. Our goal is to develop a model that can accurately
predict fraudulent transactions.
Data Set:
• Kaggle Dataset Link:
https://www.kaggle.com/datasets/kartik2112/fraud-detection
Implementation:
• Jupyter Notebook Link:
https://github.com/JarifulHassan/Credit-Card-Transactions-Fraud-Detection/blob/main/test.ipynb
Let’s dive into the practical aspects of this project!
Dataset Overview
• Dataset: Credit Card Transactions Fraud Detection (Kaggle)
• Total Entries: 555,719
• Number of Columns: 23 (including the target is_fraud)
• Fraudulent Transactions (1): 2,145
• Non-Fraudulent Transactions (0): 553,574
• The dataset includes features related to transaction time,
amount, merchant details, and demographic information
about the cardholder.
• Column Names: Unnamed: 0, trans_date_trans_time,
cc_num, merchant, category, amt, first, last, gender, street,
city, state, zip, lat, long, city_pop, job, dob, trans_num,
unix_time, merch_lat, merch_long, is_fraud.
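The class counts above imply a strong class imbalance, which matters when reading accuracy figures. The short calculation below uses only the counts stated on this slide; the "always predict non-fraud" baseline is added here purely as a point of comparison.

```python
# Counts taken from the dataset overview.
n_fraud = 2_145
n_legit = 553_574
n_total = n_fraud + n_legit

fraud_rate = n_fraud / n_total
# Accuracy of a trivial rule that labels every transaction as non-fraudulent.
baseline_accuracy = n_legit / n_total

print(f"total = {n_total}")
print(f"fraud rate = {fraud_rate:.4%}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```

Fewer than 0.4% of the transactions are fraudulent, so a model can reach an accuracy above 0.996 without detecting a single fraud case.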
Data Preprocessing and Feature Engineering to
Apply the Algorithm
• Handled missing values and cleaned irrelevant data
• Converted date columns to DateTime format
• Engineered features like transaction hour, transaction day,
month, and calculated customer age
• Dropped unnecessary columns such as 'cc_num', 'first', 'last',
'street', 'city', 'state', etc.
• One-hot encoded categorical variables (e.g., 'category',
'gender', 'job', 'merchant')
• Split the dataset into training (60%) and testing (40%) sets.
• Standardized numerical features using StandardScaler.
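The steps above can be sketched with pandas on a tiny synthetic frame. This is not the project's notebook code: the three rows are made up, only a few of the dataset's columns are included, age is approximated by a difference of calendar years, standardization is written out by hand (zero mean, unit population standard deviation) rather than calling StandardScaler, and the train/test split is omitted.

```python
import pandas as pd

# Synthetic rows using a few of the dataset's column names (values are invented).
df = pd.DataFrame({
    "trans_date_trans_time": ["2020-06-21 12:14:25", "2020-06-21 23:59:01", "2020-07-02 03:05:00"],
    "dob": ["1968-03-19", "1990-01-17", "1955-07-06"],
    "amt": [2.86, 29.84, 41.28],
    "category": ["personal_care", "health_fitness", "personal_care"],
    "gender": ["M", "F", "F"],
    "cc_num": [1, 2, 3],
    "is_fraud": [0, 0, 1],
})

# Convert date columns to DateTime format.
df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
df["dob"] = pd.to_datetime(df["dob"])

# Engineer time features and an approximate customer age in years.
df["hour"] = df["trans_date_trans_time"].dt.hour
df["day"] = df["trans_date_trans_time"].dt.day
df["month"] = df["trans_date_trans_time"].dt.month
df["age"] = df["trans_date_trans_time"].dt.year - df["dob"].dt.year

# Drop identifier and raw date columns, then one-hot encode the categoricals.
df = df.drop(columns=["cc_num", "trans_date_trans_time", "dob"])
df = pd.get_dummies(df, columns=["category", "gender"])

# Standardize the numerical features column by column.
for col in ["amt", "hour", "day", "month", "age"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)

print(df.columns.tolist())
```

In a real pipeline the scaling statistics would normally be estimated on the training split only and then applied to the test split.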
Train Model using Logistic Regression
Evaluation metrics
• Accuracy: 1.00 (rounded; the confusion matrix below gives 221,440 correct out of 222,288, about 0.996)
• ROC
Figure: ROC with AUC: 0.90
• Confusion Matrix:
Figure: Confusion Matrix
True Negative = 221323, False Positive = 104,
False Negative = 744, True Positive = 117.
• 104 legitimate transactions were incorrectly flagged as
fraud by the model.
• 744 fraudulent transactions were not detected by the
model.
• Only 117 fraudulent transactions were correctly
identified as fraud.
Logistic Regression is not performing well.
• Missed Fraud Cases:
▶ The Logistic Regression model missed many fraud cases (744
of the 861 fraudulent transactions in the test set were not caught).
• False Alarms:
▶ 104 legitimate transactions were predicted to be fraud.
This can negatively affect customer
service.
What’s Next?
• To improve results, we can try using Random Forest
Classifier or other algorithms that are better at handling
more complex data.
• These models can help us catch more fraud cases and reduce
errors.
Train model with Random Forest Classifier
Evaluation metrics
• Accuracy: 1.00 (rounded; the confusion matrix below gives 221,847 correct out of 222,288, about 0.998)
• ROC
Figure: ROC with AUC: 0.99
• Confusion Matrix:
Figure: Confusion Matrix
True Negative = 221424, False Positive = 3,
False Negative = 438, True Positive = 423.
• 3 legitimate transactions were incorrectly flagged as
fraud by the model.
• 438 fraudulent transactions were not detected by the
model.
• 423 fraudulent transactions were correctly
identified as fraud.
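Because both models report an accuracy that rounds to 1.00, it helps to compare the rates derived from the two reported confusion matrices. The calculation below uses only the counts stated on the slides; the helper name `rates` is an illustrative choice.

```python
def rates(tn, fp, fn, tp):
    """Accuracy, TPR (recall), FPR and FNR from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (tp + fn),
    }

# Counts reported for the two models on the same test set.
logistic = rates(tn=221323, fp=104, fn=744, tp=117)
forest = rates(tn=221424, fp=3, fn=438, tp=423)

for name, r in [("Logistic Regression", logistic), ("Random Forest", forest)]:
    print(f"{name:20s} " + "  ".join(f"{k}={v:.4f}" for k, v in r.items()))
```

The accuracies differ only in the third decimal place (about 0.996 versus 0.998), while the True Positive Rate rises from about 0.136 for Logistic Regression to about 0.491 for the Random Forest.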
Conclusion
• Logistic Regression is a machine learning algorithm for binary
classification problems.
• It models the probability that an input belongs to a particular
class, giving outputs between 0 and 1.
• The sigmoid function makes it suitable for classification tasks
such as disease diagnosis and purchase prediction.
• The Maximum Likelihood Estimation (MLE) method
finds the optimal parameters, β0 and β, by maximizing the probability of
the observed class labels.
• Logistic Regression is widely used in various industries for
classification tasks.
• While Logistic Regression is highly effective for binary
classification, it can be extended to multiclass problems using
techniques like One-vs-All or Softmax regression.
References
Joe Suzuki. Statistical Learning. Graduate School of
Engineering Science, Osaka University, Toyonaka, Osaka,
Japan. ISBN 978-981-15-7876-2, ISBN 978-981-15-7877-9
(eBook). Available at:
https://doi.org/10.1007/978-981-15-7877-9. Springer
Nature Singapore Pte Ltd, 2021.
Data set for the project. Available at:
https://www.kaggle.com/datasets/kartik2112/fraud-detection.
Jariful Hassan. Credit Card Transactions Fraud Detection.
GitHub repository, 2024. Available at:
https://github.com/JarifulHassan/Credit-Card-Transactions-Fraud-Detection/blob/main/test.ipynb.
Thank You!
Questions?
Jariful Hassan
Enrollment No: SAU/AM(M)2023/07
MSc in Applied Mathematics
South Asian University, Delhi
Email: jariful0786@gmail.com
LinkedIn: jariful-hassan-69142424a