Advanced Analytics
Introduction to Analytics
• Definition: Analytics is the systematic computational analysis of data to discover patterns,
trends, and insights that support decision-making
• Key Concept: Analytics transforms raw data into actionable business intelligence
• Types: Descriptive (what happened), Predictive (what will happen), Prescriptive (what
should happen)
1. Discovery
• Purpose: Understanding business requirements and available data
• Key Activities: Problem definition, data source identification, stakeholder alignment
• Common MCQ Tip: Remember this is the first phase - focuses on “what” not “how”
2. Data Preparation
• Purpose: Clean, transform, and structure data for analysis
• Key Activities: Data cleaning, integration, transformation, quality checks
• Common MCQ Tip: This phase typically takes 60–80% of project time
3. Model Planning
• Purpose: Select appropriate analytical methods and tools
• Key Activities: Algorithm selection, variable selection, model design
• Explanation: Like choosing the right tool for a job before starting work
4. Model Building
• Purpose: Develop and train the analytical model on prepared data
• Key Activities: Model training, parameter tuning, iterative refinement
5. Quality Assurance
• Purpose: Validate model accuracy and reliability
• Key Activities: Testing, validation, performance measurement
• Key Concept: Ensures model meets business requirements and statistical standards
6. Documentation
• Purpose: Record methodology, findings, and procedures
• Key Activities: Technical documentation, user guides, process documentation
• Common MCQ Tip: Critical for reproducibility and knowledge transfer
7. Management Approval
• Purpose: Obtain stakeholder sign-off for deployment
• Key Activities: Presentation, business case validation, risk assessment
• Explanation: Business validation before technical implementation
8. Installation
• Purpose: Deploy model into production environment
• Key Activities: System integration, performance optimization, user training
• Key Concept: Moving from development to live operational use
Conditional Probability
• Definition: P( A | B) = Probability of A given that B has occurred
• Formula: P(A | B) = P(A ∩ B) / P(B), where P(B) ≠ 0
• Key Concept: Probability changes when we have additional information
• Common MCQ Tip: Look for “given that” or “|” symbol
Marginal Probability
• Definition: Probability of a single event, ignoring other variables
• Formula: P( A) = ∑ P( A ∩ Bi ) for all possible events Bi
• Explanation: Like finding totals in probability tables
Bayes’ Theorem
• Definition: Method to update probability based on new evidence
• Formula: P(A | B) = P(B | A) × P(A) / P(B)
• Key Components:
– P( A | B): Posterior probability
– P( B | A): Likelihood
– P( A): Prior probability
– P( B): Evidence
• Application: Medical diagnosis, spam filtering, machine learning
• Common MCQ Tip: Remember the formula structure - numerator has likelihood × prior
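A minimal sketch of the formula structure with hypothetical spam-filter numbers (the prior, likelihood, and false-positive rate below are made up for illustration):

```python
# Bayes' theorem sketch with hypothetical numbers:
# P(spam) = 0.2 (prior), P("free" | spam) = 0.6 (likelihood),
# P("free" | not spam) = 0.05.

def posterior(prior, likelihood, likelihood_complement):
    """P(A | B) = P(B | A) * P(A) / P(B), with the evidence P(B)
    expanded by the law of total probability over A and not-A."""
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

p_spam_given_free = posterior(prior=0.2, likelihood=0.6, likelihood_complement=0.05)
print(round(p_spam_given_free, 4))  # 0.12 / (0.12 + 0.04) = 0.75
```

Note how the numerator is likelihood × prior, exactly as the MCQ tip says.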
Quick Summary Checklist:
• ✓Sample space contains all possible outcomes
• ✓Joint probability uses AND logic
• ✓Conditional probability uses GIVEN information
• ✓Bayes’ theorem updates probabilities with evidence
• ✓Marginal probability ignores other variables
Random Variable
• Definition: A function that assigns numerical values to outcomes of random experiments
• Types:
– Discrete: Countable values (e.g., number of customers)
– Continuous: Any value in a range (e.g., height, weight)
• Key Concept: Bridge between probability theory and real-world measurements
Concepts of Correlation
• Definition: Statistical measure of linear relationship between two variables
3
• Range: −1 to +1
• Interpretation:
– +1: Perfect positive correlation
– 0: No linear correlation
– −1: Perfect negative correlation
• Formula: r = ∑(xi − x̄)(yi − ȳ) / √[∑(xi − x̄)² × ∑(yi − ȳ)²]
Covariance
• Definition: Measure of how two variables change together
• Formula: Cov( X, Y ) = E[( X − µ X )(Y − µY )]
• Key Concept:
– Positive covariance: Variables increase together
– Negative covariance: One increases as other decreases
– Zero covariance: No linear relationship
• Relationship: Correlation = Cov(X, Y) / (σX × σY)
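A small hand-computed sketch on made-up data, checking that correlation equals covariance divided by the product of the standard deviations:

```python
# Population covariance and correlation computed by hand on made-up data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # y = 2x, so r should be +1

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Cov(X, Y) = E[(X - µX)(Y - µY)]
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)

r = cov / (sx * sy)  # Correlation = Cov / (σX × σY)
print(round(r, 6))   # perfect positive correlation
```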
Outliers
• Definition: Data points significantly different from other observations
• Detection Methods:
– IQR Method: Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
– Z-score Method: | Z | > 2 or 3 (depending on threshold)
– Visual: Box plots, scatter plots
• Handling Techniques:
– Remove outliers
– Transform data
– Use robust statistics
– Cap/floor values
• Common MCQ Tip: Outliers can dramatically affect the mean but have little effect on the median
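The IQR method above can be sketched with the standard-library statistics module on a made-up sample (note how the outlier inflates the mean but barely moves the median):

```python
# IQR-method outlier detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import statistics

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # 102 is an obvious outlier

q1, _, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in data if v < lower or v > upper]
print(outliers)                                        # [102]
print(statistics.mean(data), statistics.median(data))  # mean pulled up, median robust
```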
Quick Summary Checklist:
• ✓Random variables map outcomes to numbers
• ✓Correlation measures linear relationship strength
• ✓Covariance shows direction of relationship
• ✓Outliers are extreme values that need special handling
• ✓IQR and Z-score are common outlier detection methods
Session 7 & 8: Probability Distributions
Continuous Distributions
Uniform Distribution
• Definition: All values in an interval are equally likely
• Parameters: a (minimum), b (maximum)
• PDF: f(x) = 1 / (b − a) for a ≤ x ≤ b
• Mean: (a + b) / 2
• Variance: (b − a)² / 12
• Example: Random number generation
Exponential Distribution
• Definition: Models time between events in Poisson process
• Parameter: λ (rate parameter)
• PDF: f(x) = λe^(−λx) for x ≥ 0
• Mean: 1/λ
• Variance: 1/λ²
• Application: Reliability analysis, waiting times
• Common MCQ Tip: Memoryless property - P( X > s + t | X > s) = P( X > t)
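The memoryless property can be checked numerically from the exponential survival function P(X > t) = e^(−λt); the λ, s, t values below are arbitrary illustrative choices:

```python
# Verify P(X > s + t | X > s) = P(X > t) for the exponential distribution.
import math

lam, s, t = 0.5, 2.0, 3.0  # arbitrary rate and times

def survival(x, lam):
    """P(X > x) = e^(-λx) for an exponential random variable."""
    return math.exp(-lam * x)

# Conditional probability = P(X > s+t) / P(X > s)
conditional = survival(s + t, lam) / survival(s, lam)
print(abs(conditional - survival(t, lam)) < 1e-12)  # memoryless: the two agree
```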
Normal Distribution
• Definition: Bell-shaped, symmetric distribution
• Parameters: µ (mean), σ (standard deviation)
• PDF: f(x) = (1 / (σ√(2π))) × e^(−(x − µ)² / (2σ²))
• Properties:
– 68% data within 1σ
– 95% data within 2σ
– 99.7% data within 3σ
• Standard Normal: µ = 0, σ = 1
• Common MCQ Tip: Central Limit Theorem makes normal distribution fundamental
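The 68-95-99.7 rule follows from P(|Z| < k) = erf(k / √2) for a standard normal, which can be checked directly:

```python
# Verify the 68-95-99.7 rule for the standard normal via the error function.
import math

for k in (1, 2, 3):
    prob = math.erf(k / math.sqrt(2))  # P(|Z| < k)
    print(k, round(prob, 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```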
Discrete Distributions
Binomial Distribution
• Definition: Number of successes in n independent trials
• Parameters: n (trials), p (success probability)
• PMF: P(X = k) = C(n, k) × p^k × (1 − p)^(n−k)
• Mean: np
• Variance: np(1 − p)
• Application: Quality control, survey analysis
Poisson Distribution
• Definition: Number of events in fixed time/space interval
• Parameter: λ (average rate)
• PMF: P(X = k) = (λ^k × e^(−λ)) / k!
• Mean: λ
• Variance: λ
• Application: Call center arrivals, defect counting
• Common MCQ Tip: Approximates binomial when n is large, p is small
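The MCQ tip can be demonstrated numerically: with a large n and small p (the hypothetical defect rate below), the binomial PMF is very close to a Poisson PMF with λ = np:

```python
# Binomial(n, p) vs Poisson(λ = np) for large n, small p.
import math

n, p = 1000, 0.002   # hypothetical defect rate
lam = n * p          # λ = 2

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

for k in range(4):
    print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```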
Geometric Distribution
• Definition: Number of trials until first success
• Parameter: p (success probability)
• PMF: P(X = k) = (1 − p)^(k−1) × p
• Mean: 1/p
• Variance: (1 − p)/p²
Mean
• Definition: Sum of all values divided by the number of values
• Formula: x̄ = ∑xi / n
• Properties: Sensitive to outliers, uses all data points
• Common MCQ Tip: Mean can be misleading with skewed data
Median
• Definition: Middle value when data is arranged in order
• Calculation:
– Odd n: Middle value
– Even n: Average of two middle values
• Properties: Robust to outliers, represents 50th percentile
• Key Concept: Better than mean for skewed distributions
Mode
• Definition: Most frequently occurring value
• Types:
– Unimodal: One mode
– Bimodal: Two modes
– Multimodal: Multiple modes
• Properties: Can be used for categorical data
• Common MCQ Tip: A dataset can have no mode, one mode, or multiple modes
Quartiles
• Definition: Values that divide data into four equal parts
• Q1 (25th percentile): 25% of data below this value
• Q2 (50th percentile): Median
• Q3 (75th percentile): 75% of data below this value
Percentiles
• Definition: Values below which a certain percentage of data falls
• Example: 90th percentile means 90% of data is below this value
• Application: Standardized test scores, growth charts
Standard Deviation
• Definition: Average distance of data points from mean
• Formula: σ = √(∑(xi − µ)² / N) (population), s = √(∑(xi − x̄)² / (n − 1)) (sample)
• Properties: Same units as original data, sensitive to outliers
• Key Concept: Measures typical deviation from average
Variance
• Definition: Average of squared deviations from mean
• Formula: σ² = ∑(xi − µ)² / N (population), s² = ∑(xi − x̄)² / (n − 1) (sample)
• Properties: Units are squared, always non-negative
• Relationship: Standard deviation = √Variance
Coefficient of Variation
• Definition: Relative measure of variability
• Formula: CV = (Standard Deviation / Mean) × 100%
• Purpose: Compare variability across different datasets or units
• Common MCQ Tip: Useful when comparing datasets with different scales
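These descriptive measures can all be computed with the standard-library statistics module; the data below is made up:

```python
# Central tendency, spread, and coefficient of variation on made-up data.
import statistics

data = [4, 8, 6, 5, 3, 4, 8, 9, 4]

mean = statistics.mean(data)      # sum / n
median = statistics.median(data)  # middle of sorted values
mode = statistics.mode(data)      # most frequent value
s = statistics.stdev(data)        # sample standard deviation (n - 1 divisor)
cv = s / mean * 100               # relative variability, in percent

print(mean, median, mode, round(s, 3), round(cv, 1))
```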
Quick Summary Checklist:
• ✓Central tendency: Mean, Median, Mode
• ✓Dispersion: Range, IQR, Standard deviation, Variance
• ✓Median is robust to outliers, mean is not
• ✓IQR measures middle 50% spread
• ✓Coefficient of variation allows relative comparison
• ✓Standard deviation has same units as data
Uni-variate and Bi-variate Sampling
Uni-variate Sampling
• Definition: Sampling involving one variable
• Purpose: Estimate population parameter for single characteristic
• Example: Average height of students
Bi-variate Sampling
• Definition: Sampling involving two variables simultaneously
• Purpose: Study relationship between two characteristics
• Example: Relationship between study hours and exam scores
Re-sampling
• Definition: Repeatedly drawing samples from original sample
• Types:
– Bootstrap: Sample with replacement
– Cross-validation: Sample without replacement
• Purpose: Estimate sampling distribution, validate models
• Application: Confidence intervals, model validation
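A bootstrap sketch on a made-up sample (seeded for reproducibility): resampling with replacement approximates the sampling distribution of the mean, from which a rough percentile confidence interval can be read off:

```python
# Bootstrap: resample the original sample with replacement many times.
import random
import statistics

random.seed(42)
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

boot_means = []
for _ in range(2000):
    resample = random.choices(sample, k=len(sample))  # sample WITH replacement
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]  # rough 95% percentile interval
print(round(lo, 2), round(hi, 2))
```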
Sampling Techniques
• Simple Random Sampling: Every item has equal selection probability
• Systematic Sampling: Select every kth item
• Stratified Sampling: Divide population into groups, sample from each
• Cluster Sampling: Select entire groups randomly
• Convenience Sampling: Select easily accessible items
Quick Summary Checklist:
• ✓Population vs Sample distinction is fundamental
• ✓Uni-variate: one variable, Bi-variate: two variables
• ✓Re-sampling helps estimate uncertainty
• ✓Central Limit Theorem enables normal approximation
• ✓Sample size ≥ 30 typically sufficient for CLT
• ✓Multiple sampling techniques available
Tails of Test
• One-tailed Test: Tests direction of difference (> or <)
• Two-tailed Test: Tests existence of difference (≠)
• Critical Region: Area where null hypothesis is rejected
• Key Concept: Choice depends on research question
Confidence Intervals
• Definition: Range of plausible values for population parameter
• Formula: Point estimate ± Margin of error
• Interpretation: 95% CI means 95% of such intervals contain true parameter
• Common MCQ Tip: Confidence level = 1 − α
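The point estimate ± margin of error formula, sketched with made-up numbers and the large-sample z value 1.96 for 95% confidence:

```python
# 95% confidence interval for a mean: point estimate ± margin of error.
import math

xbar, s, n = 50.0, 8.0, 64        # made-up sample mean, std dev, sample size
z = 1.96                          # critical value for 95% confidence (α = 0.05)
margin = z * s / math.sqrt(n)     # margin of error

print(xbar - margin, xbar + margin)  # interval of plausible values
```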
Hypothesis Testing
• Null Hypothesis (H0 ): Statement of no effect or difference
• Alternative Hypothesis (H1 ): Statement we want to prove
• p-value: Probability of observing data if H0 is true
• Decision Rule: If p-value < α, reject H0
Parametric Tests
ANOVA (Analysis of Variance)
• Purpose: Compare means of three or more groups
• Assumptions:
– Normal distribution
– Equal variances
– Independent observations
• Types:
– One-way ANOVA: One factor
– Two-way ANOVA: Two factors
• F-statistic: Ratio of between-group to within-group variance
• Common MCQ Tip: ANOVA tests equality of means, not individual differences
t-test
• Purpose: Compare means when population standard deviation is unknown
• Types:
– One-sample t-test: Compare sample mean to known value
– Two-sample t-test: Compare two group means
– Paired t-test: Compare paired observations
• Assumptions: Normal distribution, independent observations
• Formula: t = (x̄ − µ) / (s / √n)
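A one-sample t statistic computed by hand on made-up measurements (just the statistic; looking up the p-value is a separate step):

```python
# One-sample t statistic: t = (x̄ − µ) / (s / √n).
import math
import statistics

data = [14.2, 13.8, 15.1, 14.7, 13.9, 14.5]  # made-up measurements
mu0 = 14.0                                   # hypothesized population mean

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)               # sample std dev (n - 1 divisor)
t = (xbar - mu0) / (s / math.sqrt(n))

print(round(t, 3))
```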
Non-parametric Tests
Chi-Square Test
• Purpose: Test relationships in categorical data
• Types:
– Goodness of fit: Compare observed vs expected frequencies
– Test of independence: Test relationship between variables
• Formula: χ² = ∑ (Observed − Expected)² / Expected
• Assumptions: Expected frequency ≥ 5 in each cell
• Common MCQ Tip: Used when data doesn’t meet parametric assumptions
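The goodness-of-fit statistic in a few lines, for a hypothetical fair die rolled 60 times (expected count 10 per face):

```python
# Goodness of fit: χ² = Σ (Observed − Expected)² / Expected.
observed = [8, 12, 9, 11, 6, 14]  # hypothetical counts for faces 1..6
expected = [10] * 6               # fair die, 60 rolls

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # (4 + 4 + 1 + 1 + 16 + 16) / 10 = 4.2
```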
U-Test (Mann-Whitney)
• Purpose: Non-parametric alternative to two-sample t-test
• Application: Compare two groups when normality assumption violated
• Advantages: Robust to outliers, doesn’t require normal distribution
• Test statistic: Based on ranks of combined data
Quick Summary Checklist:
• ✓Type I error: False positive, Type II error: False negative
• ✓Confidence intervals provide range of plausible values
• ✓ANOVA compares multiple group means
• ✓t-tests used when population σ unknown
• ✓Chi-square for categorical data relationships
• ✓Non-parametric tests don’t assume normal distribution
Session 15 & 16: Predictive Modelling
Models
• Definition: Mathematical representations of real-world processes
• Types:
– Linear models: Linear relationships
– Tree models: Hierarchical decisions
– Ensemble models: Combine multiple models
• Purpose: Capture patterns in data for prediction
Supervised Segmentation
• Definition: Partitioning data using target variable information
• Goal: Create segments that are homogeneous in target variable
• Advantage: Segments directly relate to prediction objective
• Example: Decision trees create segments based on target purity
Visualizing Segmentations
• Purpose: Understand how model partitions data space
• Methods:
– Decision boundaries: Show classification regions
– Tree diagrams: Show hierarchical splits
– Scatter plots: Show segment separation
• Benefit: Interpretability and model validation
Probability Estimation
• Purpose: Provide confidence measures with predictions
• Methods:
– Class probabilities: P(class|features)
– Prediction intervals: Range of likely values
– Confidence scores: Model certainty measures
• Application: Risk assessment, decision making under uncertainty
• Common MCQ Tip: Probability estimation helps quantify prediction uncertainty
Quick Summary Checklist:
• ✓Predictive modeling uses historical data to predict future outcomes
• ✓Feature selection identifies most informative variables
• ✓Supervised segmentation uses target variable for partitioning
• ✓Trees can be converted to interpretable rules
• ✓Visualization helps understand model behavior
• ✓Probability estimation quantifies prediction uncertainty
Monte Carlo Simulation
• Definition: Computational method using random sampling to solve problems
• Process:
1. Define probability distributions for input variables
2. Generate random samples from distributions
3. Calculate outcomes for each sample
4. Analyze distribution of results
• Key Concept: Uses randomness to solve deterministic problems
• Applications:
– Finance: Portfolio risk, option pricing
– Engineering: Reliability analysis
– Project management: Schedule risk
• Common MCQ Tip: More simulations = more accurate results
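The four-step process above in miniature, using the classic π-estimation example (seeded so the run is reproducible): sample points uniformly in the unit square and count the fraction inside the quarter circle:

```python
# Monte Carlo: estimate π from random points in the unit square.
import random

random.seed(0)
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1  # inside quarter circle?
)

pi_estimate = 4 * inside / n
print(round(pi_estimate, 3))  # approaches π as n grows
```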
Optimization, Linear
• Definition: Finding best solution subject to constraints
• Linear Programming: Objective function and constraints are linear
• Components:
– Objective function: What to maximize/minimize
– Decision variables: What we can control
– Constraints: Limitations or requirements
• Methods:
– Graphical method: For 2-variable problems
– Simplex method: For multi-variable problems
• Applications: Resource allocation, production planning, transportation
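The idea behind the graphical method can be sketched for a made-up 2-variable problem: with linear constraints, an optimum lies at a corner of the feasible region, so it suffices to evaluate the objective at each vertex (this is a toy illustration, not the simplex algorithm):

```python
# Maximize 3x + 2y subject to x + y <= 4, x <= 2, x >= 0, y >= 0.
# Corner points of the feasible region, found graphically:
vertices = [(0, 0), (2, 0), (2, 2), (0, 4)]

def objective(point):
    x, y = point
    return 3 * x + 2 * y

best = max(vertices, key=objective)  # an optimum always occurs at a vertex
print(best, objective(best))         # (2, 2) with value 10
```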
Session 18 & 19: Decision Analytics
Decision Analytics
• Definition: Use of data and analytical techniques to support decision-making
• Goal: Optimize decisions by combining data insights with business objectives
Evaluating Classifiers
• Purpose: Assess how well classification models perform
• Key Metrics:
– Accuracy: (TP + TN) / (TP + TN + FP + FN)
– Precision: TP / (TP + FP) - How many predicted positives are correct
– Recall/Sensitivity: TP / (TP + FN) - How many actual positives found
– Specificity: TN / (TN + FP) - How many actual negatives found
– F1-score: 2 × (Precision × Recall) / (Precision + Recall)
• Confusion Matrix: Table showing actual vs predicted classifications
• Common MCQ Tip: High accuracy doesn’t always mean good model (class imbalance)
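The metric definitions applied to hypothetical confusion-matrix counts:

```python
# Classification metrics from made-up confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)       # predicted positives that are correct
recall = TP / (TP + FN)          # actual positives found (sensitivity)
specificity = TN / (TN + FP)     # actual negatives found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, round(f1, 3))
```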
Analytical Framework
• Definition: Structured approach to analytical decision-making
• Components:
1. Problem definition: What decision needs to be made?
2. Data collection: What information is available?
3. Analysis: What patterns exist in data?
4. Insights: What do patterns mean for business?
5. Action items: What should be done based on insights?
Evaluation
• Model evaluation: How well does model perform?
• Business evaluation: Does model create value?
• Methods:
– Cross-validation: Split data to test generalization
– Hold-out validation: Reserve data for final testing
– A/B testing: Compare model performance in practice
• Key Concept: Technical performance must align with business value
Baseline
• Definition: Simple benchmark for comparison
• Purpose: Establish minimum performance standard
• Examples:
– Random guessing: For classification
– Mean prediction: For regression
– Previous period: For time series
• Common MCQ Tip: Good models should significantly outperform the baseline
Bayes' Rule
• Components:
– P( H | E): Posterior probability (updated belief)
– P( E | H ): Likelihood (evidence given hypothesis)
– P( H ): Prior probability (initial belief)
– P( E): Evidence probability (normalization factor)
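Sequential updating in miniature: yesterday's posterior becomes today's prior. The diagnostic-test numbers below (prior, sensitivity, false-positive rate) are hypothetical:

```python
# Sequential Bayesian updating: P(H) = 0.01, P(E|H) = 0.9, P(E|¬H) = 0.05.

def update(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) = P(E | H) * P(H) / P(E)."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence

belief = 0.01                       # initial prior
for _ in range(2):                  # two positive test results in a row
    belief = update(belief, 0.9, 0.05)
    print(round(belief, 4))         # belief rises with each piece of evidence
```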
Probabilistic Reasoning
• Definition: Making logical inferences under uncertainty
• Principles:
– Uncertainty quantification: Express beliefs as probabilities
– Evidence integration: Combine multiple information sources
– Decision under risk: Choose actions based on expected outcomes
Evidence Quality
• Reliability: How trustworthy is the source?
• Relevance: How related is evidence to hypothesis?
• Independence: Are evidence sources correlated?
• Completeness: What evidence might be missing?
Applications
• Medical diagnosis: Combine symptoms, tests, patient history
• Legal reasoning: Evaluate evidence strength in court cases
• Intelligence analysis: Assess threat levels from multiple sources
• Quality control: Combine inspection results
Limitations
• Prior specification: How to set initial probabilities?
• Computational complexity: Many hypotheses and evidence types
• Independence assumptions: Often unrealistic
• Subjective probabilities: Different experts, different priors
Quick Summary Checklist:
• ✓Bayes’ rule updates probabilities with new evidence
• ✓Sequential updates allow continuous learning
• ✓Evidence quality affects reasoning accuracy
• ✓Independence assumption simplifies but may be unrealistic
• ✓Applications span medical, legal, and business domains
• ✓Prior specification remains challenging in practice
Business Strategy
• Definition: Long-term plan for achieving competitive advantage
• Goal: Create sustainable value for stakeholders
Sustainability Mechanisms
• Network effects: Value increases with more users
• Learning curves: Experience reduces costs over time
• Brand loyalty: Customer switching costs
• Regulatory barriers: Legal protection of advantages
• Continuous innovation: Constant improvement and development
Session 23: Factor Analysis and Directional Data Analytics
Factor Analysis
• Definition: Statistical technique to identify underlying factors that explain correlations
among variables
• Purpose: Reduce dimensionality while preserving information
• Goal: Find latent (hidden) variables that influence observed variables
Key Concepts
• Factors: Unobserved variables that influence multiple observed variables
• Factor loadings: Relationships between factors and observed variables
• Communality: Proportion of variable’s variance explained by factors
• Eigenvalues: Measure of variance explained by each factor
• Factor rotation: Method to make factors more interpretable
Extraction Methods
• Principal Component Method: Most common, maximizes variance explained
• Principal Axis Factoring: Focuses on shared variance only
• Maximum Likelihood: Assumes multivariate normal distribution
• Common MCQ Tip: Kaiser criterion (eigenvalue > 1) for factor selection
Rotation Methods
• Orthogonal rotation: Factors remain uncorrelated
– Varimax: Maximizes variance of loadings
– Quartimax: Simplifies variables
• Oblique rotation: Allows factor correlation
– Promax: Faster oblique method
– Direct oblimin: Flexible oblique method
Applications
• Psychology: Intelligence, personality factors
• Marketing: Customer attitude dimensions
• Finance: Risk factors in portfolios
• Quality control: Process variation sources
Circular Statistics
• Circular mean: Average direction accounting for circularity
• Circular variance: Measure of directional spread
• Von Mises distribution: Circular analog of normal distribution
• Rayleigh test: Test for uniformity of directions
Analysis Techniques
• Rose diagrams: Circular histograms
• Circular correlation: Relationships between circular variables
• Circular regression: Predicting circular outcomes
• Time series: Analyzing cyclical patterns over time
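Why ordinary averages fail on directions: the arithmetic mean of 350° and 10° is 180°, the opposite direction. The circular mean uses the mean resultant vector instead:

```python
# Circular mean of directions via atan2 of summed unit vectors.
import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, in (-180, 180]."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum))

# 350° and 10° straddle north; the circular mean is ~0°, not 180°.
print(round(circular_mean_deg([350, 10]), 6))
print(round(circular_mean_deg([30, 90]), 6))  # ~60, as intuition suggests
```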
Quick Summary Checklist:
• ✓Factor analysis identifies underlying latent variables
• ✓EFA discovers structure, CFA tests hypotheses
• ✓Eigenvalue > 1 rule for factor selection
• ✓Rotation makes factors more interpretable
• ✓Directional data requires specialized circular statistics
• ✓Applications span psychology, marketing, meteorology
• ✓Circular mean and variance account for periodicity
• Statistical measures calculation
• Sampling technique exploration
• Hypothesis testing on real data
• Predictive modeling practice
• Monte Carlo simulation
• Factor analysis implementation
Best of luck with your CCEE exam!
[End of Complete Study Notes]