DATA ANALYTICS
UNIT Ⅰ
What is Data Analytics?
Data Analytics is the process of examining, cleaning, transforming, and interpreting data to
extract useful information, patterns, and insights. The main goal of data analytics is to support
better decision-making by discovering meaningful trends within data.
In simple terms, data analytics helps us understand what the data is telling us. Whether for a
business, a healthcare system, or even a sports team, data analytics helps identify what is working
well and what needs improvement.
Why is Data Analytics Important?
Data analytics is important because it allows organizations to:
1. Make Informed Decisions:
o Businesses use data to understand customer behavior, track sales trends, and predict
future demands.
o Example: An e-commerce website analyzing customer purchase history to suggest
relevant products.
2. Improve Efficiency:
o Companies analyze data to identify inefficiencies and improve processes.
o Example: A delivery company tracking routes to find the fastest path for their
drivers.
3. Enhance Customer Experience:
o By understanding customer preferences, businesses can provide personalized
services.
o Example: Netflix recommending shows based on your watch history.
4. Detect Fraud and Risks:
o Banks and financial institutions use data analytics to detect suspicious transactions.
o Example: Credit card companies identifying unusual spending patterns to prevent
fraud.
5. Boost Innovation:
o Data insights often reveal new opportunities for product development or marketing
strategies.
o Example: Car manufacturers using data to improve vehicle performance and safety.
Which Sectors Use Data Analytics?
Data analytics is used in almost every industry today. Some major sectors that heavily rely on data
analytics include:
1. Healthcare:
o Predicting disease outbreaks, improving patient care, and managing hospital
resources.
o Example: Tracking COVID-19 spread and identifying high-risk areas.
2. Finance and Banking:
o Risk assessment, fraud detection, and investment analysis.
o Example: Analyzing stock market trends to guide investment decisions.
3. Retail and E-commerce:
o Analyzing customer buying behavior, improving inventory management, and
personalizing marketing strategies.
o Example: Amazon recommending products based on your browsing history.
4. Telecommunications:
o Optimizing network performance, reducing downtime, and improving customer
service.
o Example: Telecom companies predicting network failures before they occur.
5. Manufacturing:
o Predictive maintenance, quality control, and supply chain management.
o Example: Factories using data analytics to prevent machine breakdowns.
6. Education:
o Tracking student performance, improving learning strategies, and predicting
dropout rates.
o Example: Schools using data to tailor teaching methods for better student outcomes.
7. Sports:
o Analyzing player performance, predicting match outcomes, and developing
winning strategies.
o Example: Cricket teams analyzing players’ strengths and weaknesses for better
team selection.
8. Transportation and Logistics:
o Optimizing delivery routes, reducing fuel costs, and tracking vehicle maintenance.
o Example: Uber using data to predict rider demand and assign drivers efficiently.
Why Should You Study Data Analytics?
Learning data analytics offers several benefits, especially in today's data-driven world:
1. High Demand for Data Analytics Skills:
o Companies across industries are investing in data analytics, creating a strong
demand for skilled professionals.
2. Diverse Career Opportunities:
o Roles such as Data Analyst, Data Scientist, Business Analyst, and Machine
Learning Engineer are growing rapidly.
3. Better Decision-Making Skills:
o Understanding data enables you to make evidence-based decisions in both
professional and personal situations.
4. Improved Problem-Solving Abilities:
o Data analytics trains you to analyze complex problems, identify patterns, and
develop effective solutions.
5. Attractive Salary Packages:
o Experienced data analytics professionals often command competitive salaries in the tech industry.
6. Innovation and Creativity:
o Learning data analytics opens the door to creating unique solutions by interpreting
data trends creatively.
Conclusion
Data analytics is a powerful skill that is transforming industries by turning raw data into
meaningful insights. By studying data analytics, you gain valuable skills that can enhance your
career prospects, improve decision-making, and open opportunities in various sectors. Whether
you aim to work in business, healthcare, finance, or technology, mastering data analytics will
prepare you for success in today's data-driven world.
Introduction to Data Analytics Concepts
1. Sources and Nature of Data
Data can come from various sources and exists in different forms. Understanding the origin and
nature of data is essential for effective data analytics.
Sources of Data:
• Internal Sources: Data generated within an organization (e.g., sales records, employee
data).
• External Sources: Data from third-party providers, websites, or social media platforms.
• Machine-Generated Data: Data collected by sensors, logs, or automated systems.
• Human-Generated Data: Data created through user interactions, emails, or feedback.
Nature of Data:
• Quantitative Data: Numerical data that can be measured (e.g., sales figures, temperature).
• Qualitative Data: Descriptive data that is non-numeric (e.g., customer feedback, product
reviews).
2. Classification of Data
Data is classified into three types based on its structure:
1. Structured Data:
• Data organized in a tabular format with rows and columns.
• Stored in relational databases like MySQL, Oracle, or SQL Server.
• Example: Employee database with fields like ID, Name, Age, and Salary.
2. Semi-Structured Data:
• Data that is partially organized but does not follow a strict tabular structure.
• Stored in formats like JSON, XML, or CSV.
• Example: Emails, social media posts, or web pages.
3. Unstructured Data:
• Data that has no predefined structure and is difficult to organize.
• Requires specialized tools to manage and analyze.
• Example: Videos, images, audio files, and text documents.
3. Characteristics of Data
Data has several key characteristics that determine its value and complexity:
• Volume: The vast amount of data generated daily.
• Velocity: The speed at which data is created and processed.
• Variety: The diverse types and sources of data (structured, semi-structured, unstructured).
• Veracity: The reliability and accuracy of data.
• Value: The insights and benefits derived from data analysis.
4. Introduction to Big Data Platform
A Big Data Platform is an integrated system that combines various tools and frameworks to
manage large volumes of data efficiently.
Popular Big Data platforms include:
• Apache Hadoop: Used for distributed storage and processing.
• Apache Spark: Known for fast in-memory data processing.
• Amazon Web Services (AWS): Cloud-based services for data storage and analytics.
• Google Cloud Platform (GCP): Offers scalable data solutions.
Big Data platforms are essential for processing large-scale data that traditional systems cannot
handle.
5. Need for Data Analytics
Data analytics is crucial for several reasons:
✅ Identifying trends and patterns to support decision-making.
✅ Improving operational efficiency by analyzing business processes.
✅ Enhancing customer experience through personalized services.
✅ Predicting future outcomes using historical data.
✅ Reducing risks and preventing fraud by identifying anomalies.
6. Evolution of Analytic Scalability
As data volume increased over time, analytics methods evolved to manage this growth:
• Traditional Analytics: Limited to small datasets and performed using basic tools like
Excel or SQL.
• Big Data Analytics: Introduced frameworks like Hadoop and Spark to handle massive
datasets.
• Cloud-Based Analytics: Uses platforms like AWS, Azure, and GCP to provide scalable
and efficient data processing.
7. Analytic Process and Tools
The data analytics process involves multiple stages:
1. Data Collection: Gathering data from various sources.
2. Data Cleaning: Removing duplicates, errors, and irrelevant information.
3. Data Transformation: Converting data into a suitable format for analysis.
4. Data Analysis: Applying statistical methods, machine learning models, or algorithms to
extract insights.
5. Data Visualization: Presenting insights through graphs, charts, and dashboards.
6. Decision-Making: Using the insights to make informed business decisions.
Popular Tools for Analytics:
• Python (with Pandas, NumPy, and Matplotlib)
• R (for statistical analysis)
• SQL (for database querying)
• Tableau and Power BI (for visualization)
• Apache Spark (for large-scale data processing)
8. Analysis vs Reporting
Though both deal with data, they serve different purposes:
• Analysis: Focuses on exploring data to identify trends, patterns, and insights.
o Example: Identifying customer segments that spend the most.
• Reporting: Presents data in a structured format for review.
o Example: Generating monthly sales reports.
Key Difference: Analysis discovers insights, while reporting presents organized information.
9. Modern Data Analytics Tools
Modern tools provide powerful features for data handling, visualization, and analysis:
• Apache Spark: Fast data processing for large datasets.
• Google BigQuery: Serverless cloud data warehouse for fast SQL analytics on large datasets.
• Power BI: Business intelligence tool for interactive dashboards.
• D3.js: JavaScript library for creating dynamic visualizations.
• Jupyter Notebook: Interactive environment for coding and data exploration.
10. Applications of Data Analytics
Data analytics is widely used in various industries:
• Healthcare: Predicting disease outbreaks, improving treatment plans.
• Finance: Detecting fraud, managing risks, and optimizing investments.
• Retail: Personalized recommendations, sales trend prediction.
• Manufacturing: Predictive maintenance and improving production quality.
• Sports: Analyzing player performance and game strategies.
• Social Media: Identifying trends and user behavior patterns.
Data Analytics Lifecycle
The Data Analytics Lifecycle is a structured process that guides data professionals through the
steps required to extract insights from data. It ensures that data analytics projects are carried out
efficiently, leading to valuable business insights and successful outcomes.
Need for Data Analytics Lifecycle
Following a defined lifecycle is essential because:
✅ It provides a clear framework for managing complex data projects.
✅ Ensures each stage is carefully executed, reducing errors and improving results.
✅ Helps teams collaborate effectively by defining roles and responsibilities.
✅ Ensures data insights are actionable and aligned with business goals.
✅ Reduces project risks by following a step-by-step approach.
Key Roles in a Successful Data Analytics Project
A successful data analytics project requires collaboration between various roles, each
contributing unique skills:
1. Business Analyst:
o Understands business goals and translates them into project requirements.
o Ensures insights align with business objectives.
2. Data Engineer:
o Collects, organizes, and manages large volumes of data.
o Builds data pipelines and ensures data quality.
3. Data Scientist:
o Applies statistical methods, machine learning, and programming to extract
insights.
o Designs and builds predictive models.
4. Data Analyst:
o Analyzes data using tools like Excel, SQL, or Python.
o Creates visual reports and dashboards for stakeholders.
5. Project Manager:
o Manages timelines, resources, and ensures the project stays on track.
o Coordinates communication between team members.
6. Stakeholders:
o Business leaders, decision-makers, or clients who define the project’s objectives.
Phases of Data Analytics Lifecycle
The data analytics lifecycle consists of six key phases:
1. Discovery Phase
Purpose: Understand the project’s objectives, goals, and scope.
• Identify the business problem to solve.
• Define project goals, timelines, and key performance indicators (KPIs).
• Understand the data sources and their availability.
• Engage with stakeholders to gather insights.
Example: An e-commerce company wants to predict customer churn. The goal is to identify
customers likely to leave and take preventive actions.
2. Data Preparation Phase
Purpose: Collect, clean, and organize data for analysis.
• Gather data from multiple sources (databases, APIs, etc.).
• Clean data by handling missing values, duplicates, and errors.
• Perform data transformation (e.g., converting text to numerical values).
• Split data into training and testing sets for model building.
Example: In a sales prediction project, data may include customer demographics, past
purchases, and website activity.
3. Model Planning Phase
Purpose: Decide on the analytical techniques and models to apply.
• Select appropriate techniques like regression, clustering, or decision trees.
• Identify the tools and frameworks for modeling (e.g., Python, R, or Spark).
• Develop a strategy for feature selection and model evaluation.
Example: For predicting loan defaults, logistic regression or decision trees might be chosen
because both are well suited to binary classification and are relatively easy to interpret.
4. Model Building Phase
Purpose: Develop the chosen model and test its performance.
• Write code to train the model using sample data.
• Fine-tune model parameters for better accuracy.
• Test the model on unseen data (testing data) to evaluate its performance.
• Perform model validation using metrics like accuracy, precision, or recall.
Example: A credit scoring model may be trained on historical data to predict customer risk
levels.
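The validation metrics named in this phase can be computed directly from actual and predicted labels. The sketch below is illustrative only: the function name classification_metrics and the toy labels are assumptions made for this example, not part of any particular project.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels (made up): 1 = high-risk customer, 0 = low-risk customer.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(actual, predicted)
print(accuracy, precision, recall)
```

Accuracy counts all correct predictions, precision asks how many predicted positives were right, and recall asks how many actual positives were found.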
5. Communicating Results Phase
Purpose: Present the findings to stakeholders in a clear and actionable format.
• Visualize insights using tools like Tableau, Power BI, or Python’s Matplotlib.
• Create dashboards, charts, and reports to summarize the results.
• Provide actionable recommendations based on insights.
Example: Presenting a report that identifies the top customer segments contributing to sales
growth.
6. Operationalization Phase
Purpose: Deploy the model and integrate it into the organization’s workflow.
• Implement the model in production environments for real-time analysis.
• Automate data pipelines to ensure continuous data flow.
• Monitor model performance and retrain it as needed to improve accuracy.
Example: A fraud detection system integrated with a bank’s transaction system to flag
suspicious activity in real time.
Summary of Data Analytics Lifecycle Phases
Phase | Purpose | Key Activities
Discovery | Understand project goals | Define objectives, scope, and KPIs
Data Preparation | Clean and organize data | Data collection, cleaning, and transformation
Model Planning | Choose suitable models | Select algorithms, techniques, and tools
Model Building | Develop and test models | Train model, evaluate performance
Communicating Results | Present insights to stakeholders | Visualizations, dashboards, and reports
Operationalization | Deploy and monitor the model | Integrate model into business processes
UNIT Ⅱ
Data Analysis Techniques and Methods
Data analysis involves applying various mathematical, statistical, and machine learning
techniques to understand data, uncover patterns, and make predictions. Below are key methods
used in data analysis with simple explanations.
1. Regression Modeling
Regression modeling is a technique used to understand the relationship between one or more
independent variables (predictors) and a dependent variable (outcome).
Types of Regression Models:
• Linear Regression: Establishes a straight-line relationship between variables.
Example: Predicting house prices based on area size.
• Multiple Regression: Involves multiple independent variables.
Example: Predicting sales based on factors like price, advertising, and season.
• Logistic Regression: Used for classification problems where the outcome is binary (e.g.,
Yes/No, True/False).
Example: Predicting whether a customer will buy a product.
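To make the straight-line idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name fit_line and the area/price figures are made up for illustration; real projects would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: area in square metres, price in thousands.
# Prices are exactly 3 * area so the fitted line is easy to check by hand.
areas = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 300, 360]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept  # prediction for a 90 square metre house
print(slope, intercept, predicted_price)
```

Multiple regression extends the same idea to several predictors, and logistic regression passes a similar linear combination through a sigmoid to obtain a probability.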
2. Multivariate Analysis
Multivariate analysis examines multiple variables at the same time to understand their
relationships and patterns.
Common Multivariate Techniques:
• Principal Component Analysis (PCA): Reduces the number of features while retaining
key information.
• Cluster Analysis: Groups similar data points based on common characteristics.
• Factor Analysis: Identifies underlying variables (factors) that explain data patterns.
Example: In marketing, multivariate analysis can be used to segment customers based on
demographics, spending habits, and preferences.
3. Bayesian Modeling
Bayesian modeling is a statistical approach that incorporates prior knowledge (beliefs) along
with new data to improve decision-making.
Key Concepts in Bayesian Modeling:
• Bayes' Theorem: Used to update the probability of a hypothesis based on new evidence.
• Posterior Probability: The updated probability after considering new data.
Example: In spam email detection, Bayesian modeling updates the probability that an email is
spam based on keywords found in the email.
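As a small worked example of Bayes' theorem, P(spam | word) = P(word | spam) × P(spam) / P(word), the sketch below updates the probability that an email is spam after seeing one keyword. All probabilities and the function name posterior_spam are assumed values chosen for illustration.

```python
def posterior_spam(prior_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | word) using Bayes' theorem."""
    # Total probability of seeing the word in any email (spam or legitimate).
    evidence = (p_word_given_spam * prior_spam
                + p_word_given_ham * (1 - prior_spam))
    return p_word_given_spam * prior_spam / evidence


# Assumed numbers: 20% of mail is spam; the keyword appears in 60% of spam
# and in 5% of legitimate mail.
p = posterior_spam(prior_spam=0.2, p_word_given_spam=0.6, p_word_given_ham=0.05)
print(round(p, 2))
```

Here the prior of 0.2 rises to a posterior of about 0.75 once the keyword is observed, which is exactly the "update with new evidence" step described above.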
4. Inference and Bayesian Networks
Bayesian networks are graphical models that use probabilities to represent the relationships
between variables.
Key Features of Bayesian Networks:
• Represent uncertainty and dependencies in data.
• Useful in decision-making under uncertainty.
Example: In medical diagnosis, Bayesian networks can predict the likelihood of a disease based
on observed symptoms.
5. Support Vector Machines (SVM) and Kernel Methods
SVM is a powerful machine learning algorithm used for classification and regression tasks.
How SVM Works:
• It finds a hyperplane (decision boundary) that best separates data points into classes.
• Kernel Methods implicitly map non-linear data into a higher-dimensional space where it can
become linearly separable.
Example: SVM is widely used in image recognition, like identifying handwritten digits.
6. Analysis of Time Series Data
Time series analysis deals with data points collected over time, focusing on identifying trends,
seasonal patterns, and forecasting.
Key Techniques in Time Series Analysis:
• Linear Systems Analysis: Models that assume a stable and linear relationship over time.
• Nonlinear Dynamics: Used for complex data patterns that do not follow linear trends.
Example: Predicting stock prices, weather forecasting, or analyzing website traffic trends.
7. Rule Induction
Rule induction is a machine learning method that extracts decision rules from data.
Key Features:
• Produces human-readable rules for decision-making.
• Often used in data mining for discovering hidden patterns.
Example: In retail, rule induction may reveal that "customers who buy milk are likely to buy
bread."
8. Neural Networks: Learning and Generalization
Neural networks are loosely inspired by the structure of the human brain and are highly
effective at solving complex problems.
Key Concepts:
• Learning: Neural networks learn patterns from data through training.
• Generalization: Ensures the model performs well on unseen data.
Example: Image recognition systems that detect objects in photos.
9. Competitive Learning
Competitive learning is a type of unsupervised learning where neurons compete to respond to
input data.
Key Features:
• Each neuron adjusts itself to specialize in recognizing specific patterns.
• Often used in clustering algorithms.
Example: Identifying customer segments based on shopping behavior.
10. Principal Component Analysis (PCA) and Neural Networks
PCA is a dimensionality reduction technique that simplifies data by reducing the number of
variables while retaining key information.
PCA in Neural Networks:
• PCA reduces input dimensions, improving the efficiency of neural networks.
• Useful for visualizing high-dimensional data.
Example: PCA can reduce the number of features in a facial recognition model while
maintaining accuracy.
11. Fuzzy Logic and Fuzzy Models
Fuzzy logic is a method used for handling uncertainty and imprecision in data.
Key Concepts in Fuzzy Logic:
• Instead of strict "True" or "False" values, fuzzy logic allows partial truths (values
between 0 and 1).
• Useful for modeling complex systems where precise data may be lacking.
Example: In washing machines, fuzzy logic determines the wash cycle based on dirt level and
fabric type.
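The idea of partial truth can be sketched with triangular membership functions. The fuzzy sets below (low, medium, high dirt on an assumed 0 to 100 scale) and the function names are hypothetical; real appliances use their own tuned sets and rules.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def dirt_memberships(dirt):
    """Degree (0 to 1) to which a dirt reading belongs to each fuzzy set."""
    return {
        "low": triangular(dirt, -1, 0, 50),
        "medium": triangular(dirt, 0, 50, 100),
        "high": triangular(dirt, 50, 100, 101),
    }


m = dirt_memberships(75)
print(m)
```

A reading of 75 is partly "medium" and partly "high" at the same time, which a strict True/False rule could not express.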
Fuzzy Decision Trees:
• Combines decision trees with fuzzy logic for improved handling of uncertain data.
• Useful in medical diagnosis, finance, and control systems.
12. Stochastic Search Methods
Stochastic search methods are algorithms that use random sampling to find optimal solutions in
complex problems.
Common Techniques:
• Genetic Algorithms (GA): Inspired by natural evolution, GA improves a population of
candidate solutions through selection, crossover, and mutation.
• Simulated Annealing: Mimics the controlled heating and slow cooling of metals, occasionally
accepting worse solutions early on to escape local optima.
• Particle Swarm Optimization (PSO): Inspired by the behavior of birds in a flock.
Example: Stochastic methods are widely used in financial modeling, logistics optimization, and
game theory.
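As one illustration of a stochastic search, the following is a minimal simulated annealing sketch that minimizes a one-variable cost function. The linear cooling schedule, step size, seed, and function name are assumptions chosen for brevity rather than recommended settings.

```python
import math
import random


def simulated_annealing(cost, start, steps=5000, start_temp=10.0, seed=0):
    """Minimize cost(x) over real x with a simple annealing loop."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    current = best = start
    for step in range(steps):
        # Temperature falls linearly towards (almost) zero.
        temperature = start_temp * (1 - step / steps) + 1e-9
        candidate = current + rng.uniform(-1.0, 1.0)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temperature).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best


# Toy problem: the minimum of (x - 3)^2 is at x = 3.
best_x = simulated_annealing(lambda x: (x - 3) ** 2, start=0.0)
print(best_x)
```

Early on, the high temperature lets the search wander; as it cools, the search settles near the best region it has found.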
Summary of Techniques and Applications
Technique | Purpose | Example Applications
Regression Modeling | Predict continuous outcomes | Predicting house prices
Multivariate Analysis | Analyzing relationships between variables | Customer segmentation
Bayesian Modeling | Updating predictions using new data | Spam email detection
Bayesian Networks | Modeling uncertainty and dependencies | Medical diagnosis
Support Vector Machines (SVM) | Classification and pattern recognition | Handwriting recognition
Time Series Analysis | Forecasting future trends | Stock price prediction
Rule Induction | Discovering patterns in data | Market basket analysis
Neural Networks | Learning complex patterns | Facial recognition
Competitive Learning | Clustering and pattern recognition | Customer behavior analysis
PCA (Principal Component Analysis) | Dimensionality reduction | Data visualization
Fuzzy Logic | Handling uncertainty and imprecise data | Smart home systems
Stochastic Search Methods | Finding optimal solutions | Route optimization in logistics
UNIT Ⅲ
Mining Data Streams
Mining data streams is the process of analyzing continuous and fast-flowing data in real-time.
Unlike traditional data storage systems where data is static, data streams are dynamic, requiring
specialized techniques to extract useful insights.
1. Introduction to Stream Concepts
A data stream is a continuous flow of data that arrives in real-time or near-real-time. This data
is generated rapidly and in large volumes, making traditional batch processing methods
inefficient.
Key Features of Data Streams:
• Continuous Flow: Data arrives endlessly without stopping.
• Fast-Paced: Data must be processed quickly to gain real-time insights.
• Dynamic Nature: Data streams can change patterns frequently.
• Limited Memory: Only a portion of the data can be stored or analyzed at a time.
Examples of Data Streams:
• Social media feeds (Twitter, Facebook posts)
• Stock market data
• Sensor data from IoT devices
• Website clickstreams
• Financial transactions for fraud detection
2. Stream Data Model and Architecture
The stream data model describes how data is represented and processed in real-time systems.
Components of Stream Data Architecture:
1. Data Sources: Devices or platforms that generate continuous data (e.g., IoT sensors,
social media platforms).
2. Data Stream Management System (DSMS): Specialized systems like Apache Kafka,
Apache Flink, and Apache Storm that handle data streams.
3. Processing Engine: Performs data filtering, aggregation, and analysis (e.g., Apache
Spark Streaming).
4. Storage System: Stores processed data for future analysis (e.g., Amazon S3, Hadoop
HDFS).
5. Visualization Tools: Displays insights using dashboards like Grafana or Tableau.
3. Stream Computing
Stream computing refers to processing and analyzing data as it arrives rather than waiting to
collect all data first.
Key Features of Stream Computing:
✅ Real-time processing with minimal delay
✅ Capable of handling large-scale data
✅ Often uses distributed computing for scalability
Example: Detecting credit card fraud by identifying unusual transaction patterns in real-time.
4. Sampling Data in a Stream
Since it is usually impractical to store an entire data stream due to its continuous nature,
sampling techniques are used to select representative data points.
Common Sampling Methods:
• Random Sampling: Selects random data points to represent the stream.
• Reservoir Sampling: Maintains a fixed-size sample from a stream, ensuring all elements
have an equal chance of being selected.
Example: Sampling social media posts to analyze trending hashtags.
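Reservoir sampling can be sketched in a few lines. The version below is the classic single-pass approach (often called Algorithm R); the function name, the seed parameter, and the numeric stream standing in for posts are assumptions for this example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # The new item replaces a random slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


# A range stands in for a stream of post IDs; only k items are ever stored.
sample = reservoir_sample(range(10_000), k=5, seed=42)
print(sample)
```

Memory use depends only on k, not on how long the stream runs.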
5. Filtering Streams
Filtering is the process of extracting relevant data from a continuous stream based on specific
conditions.
Techniques for Filtering:
• Keyword Filtering: Extracting tweets that mention specific words (e.g., “AI,” “Big
Data”).
• Value-Based Filtering: Only accepting data points that meet certain criteria (e.g.,
temperatures above 30°C).
Example: Filtering Twitter data to analyze tweets related to elections.
6. Counting Distinct Elements in a Stream
Counting unique elements in a data stream can be challenging due to memory constraints.
Efficient Algorithms for Counting:
• HyperLogLog Algorithm: Estimates the number of unique elements with minimal
memory usage.
• Bloom Filters: A space-efficient data structure for checking whether an element has probably
been seen before. It can report false positives but never false negatives.
Example: Counting the number of unique IP addresses visiting a website.
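A Bloom filter answers "have I probably seen this element?" using a fixed bit array. The sketch below is a simplified illustration: the bit-array size, the number of hashes, and the trick of salting SHA-256 to get several hash functions are assumptions, not tuned choices.

```python
import hashlib


class BloomFilter:
    """Approximate set membership with a fixed-size bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    seen.add(ip)

print("10.0.0.1" in seen)
```

Incrementing a counter only when an element is not yet in the filter gives an approximate distinct count, which can undercount slightly because of false positives.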
7. Estimating Moments
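In data streams, moments are statistical properties that describe the data distribution.
Key Moments:
• First Moment (Mean): The average value of the data stream.
• Second Central Moment (Variance): Measures data spread or variation.
• Third and Fourth Moments: Capture the skewness and kurtosis (tail heaviness) of the data
distribution.
Note: In the stream-mining literature, the term usually refers to frequency moments. If element i
occurs m_i times, the k-th moment is the sum of (m_i)^k over all distinct elements, so the 0th
moment is the number of distinct elements and the 1st moment is the length of the stream.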
Example: Estimating the average temperature from IoT sensor data.
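The mean and variance of a stream can be maintained in a single pass without storing the readings. The sketch below uses Welford's online method; the class name RunningStats and the sample readings are made up for illustration.

```python
class RunningStats:
    """One-pass running mean and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the old and the new mean, which keeps the update numerically stable.
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; divide by (count - 1) for the sample variance.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical sensor readings
    stats.update(reading)

print(stats.mean, stats.variance)
```

Only three numbers are stored regardless of stream length, which suits the limited-memory setting of data streams.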
8. Counting Oneness in a Window
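Counting oneness (more commonly called counting ones) refers to counting how many times a
specific value, such as "1" in a binary stream, occurs within a particular time window.
Algorithms such as DGIM approximate this count without storing the whole window.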
Window Concepts in Streaming Data:
• Sliding Window: Moves continuously with time, maintaining recent data points.
• Tumbling Window: Divides the data into fixed-sized intervals without overlap.
Example: Counting the number of failed login attempts within the last 5 minutes.
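The failed-login example can be sketched with an exact sliding window that stores one timestamp per event. The class name and the timestamps (in seconds) are assumptions; for very large windows, approximate algorithms such as DGIM use far less memory than this exact version.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamp falls in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Timestamps are assumed to arrive in increasing order.
        self.timestamps.append(timestamp)

    def count(self, now):
        # Drop events that have slid out of the window (now - window, now].
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)


failed_logins = SlidingWindowCounter(window_seconds=300)  # 5 minutes
for t in [0, 100, 250, 400, 450]:
    failed_logins.add(t)

recent = failed_logins.count(now=500)
print(recent)
```

At time 500, only the events at 250, 400, and 450 fall inside the last 300 seconds, so the count is 3.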
9. Decaying Window
A decaying window assigns less importance to older data points and gives more weight to recent
data.
Purpose of Decaying Window:
• Ensures recent trends have a stronger impact on analysis.
• Helps prevent outdated information from influencing predictions.
Example: In stock trading, recent price fluctuations are given more importance than older prices.
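One common way to realize a decaying window is an exponentially decaying sum: on each arrival, the running total is multiplied by (1 - c) and the new value is added. The decay constant c and the class name below are assumptions for illustration.

```python
class DecayingSum:
    """Exponentially decaying sum: older values count for less and less."""

    def __init__(self, decay=0.1):
        self.decay = decay  # the constant c, with 0 < c < 1 (assumed value)
        self.total = 0.0

    def update(self, value):
        # Shrink everything seen so far, then add the newest value at full weight.
        self.total = self.total * (1 - self.decay) + value
        return self.total


score = DecayingSum(decay=0.5)
for value in [1, 1, 1]:
    latest = score.update(value)

print(latest)
```

With c = 0.5 the three values contribute 0.25, 0.5, and 1 respectively, giving 1.75, so the most recent value dominates.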
10. Real-time Analytics Platform (RTAP) Applications
RTAP platforms are designed to process and analyze streaming data in real-time.
Popular RTAP Tools:
• Apache Kafka: For building real-time data pipelines.
• Apache Flink: For large-scale stream data processing.
• Amazon Kinesis: For streaming data on AWS cloud.
• Apache Spark Streaming: For distributed stream processing.
RTAP Use Cases:
✅ Fraud detection in financial systems
✅ Real-time recommendation engines for e-commerce
✅ Monitoring server logs for security threats
✅ Tracking social media trends
11. Case Studies in Data Stream Mining
Case Study 1: Real-time Sentiment Analysis
Objective: Analyze social media data to understand public sentiment.
Process:
• Collect real-time tweets or posts.
• Filter tweets based on keywords or hashtags.
• Use natural language processing (NLP) techniques to classify sentiments as positive,
negative, or neutral.
• Visualize sentiment trends using dashboards.
Example: Identifying customer reactions to a new product launch on Twitter.
Case Study 2: Stock Market Predictions
Objective: Predict stock price changes based on real-time market data.
Process:
• Stream financial data such as stock prices, trade volumes, and economic news.
• Use time series analysis techniques like ARIMA or LSTM models to predict future
prices.
• Generate alerts for potential market trends or risks.
Example: Identifying early signs of a market crash based on unusual trading patterns.
Summary of Concepts and Applications
Concept | Description | Example Application
Stream Data Model | Defines how streaming data is structured | Website clickstream analysis
Stream Computing | Real-time data processing | Fraud detection during credit card transactions
Sampling Data in a Stream | Selecting key data points for analysis | Monitoring social media trends
Filtering Streams | Extracting relevant data from continuous flow | Identifying specific keywords in live tweets
Counting Distinct Elements | Tracking unique data points efficiently | Counting unique website visitors
Estimating Moments | Calculating statistical properties of data | Analyzing average sensor readings
Counting Oneness in a Window | Counting specific values within a defined window | Detecting frequent failed logins in 5 minutes
Decaying Window | Giving higher importance to recent data | Predicting stock prices based on recent trends
RTAP Applications | Real-time data platforms for fast processing | Real-time customer support response systems
UNIT Ⅳ
Frequent Itemset and Clustering
1. Mining Frequent Itemsets
Frequent itemsets are groups of items that appear together often in a dataset.
Mining these patterns is very important for understanding customer behavior, recommendations,
and decision-making.
Example:
In a supermarket, customers often buy bread and butter together. "Bread, Butter" would be a
frequent itemset.
2. Market-Based Modelling
Market-based modeling (better known as the market-basket model) uses frequent itemsets to find
relationships between products and customer buying habits.
This is the idea behind Market Basket Analysis.
• Goal: Find which products are often bought together to increase sales.
• Example: If a customer buys a printer, they are likely to buy ink too.
• Result: The store can bundle products or offer discounts to increase sales.
3. Apriori Algorithm
The Apriori Algorithm is a classic and popular method for mining frequent itemsets.
How Apriori Works:
1. Find all individual items that meet the minimum support (appear frequently enough).
2. Combine items to form pairs, triples, etc., and check if these combinations meet the
minimum support.
3. Repeat the process until no more combinations are frequent.
Important Concepts:
• Support: The fraction of all transactions that contain the itemset.
• Confidence: How likely item B is bought when A is bought, i.e. support(A, B) / support(A).
Simple Example:
Suppose that out of 100 transactions, 50 contain milk and 35 contain both milk and bread. Then:
• Support for (milk, bread) = 35 / 100 = 35%
• Confidence of buying bread after milk = 35 / 50 = 70%
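Support, confidence, and the level-wise search can be sketched on a toy dataset. The transactions, function names, and the 0.5 support threshold are made up for illustration, and this is a simplified Apriori-style search rather than an optimized implementation.

```python
from itertools import combinations

# Toy transactions (made up for this example).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build size-k candidates from frequent (k-1)-itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = list(level)
    size = 2
    while level:
        previous = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: every (size - 1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, size - 1))
                 and support(c, transactions) >= min_support]
        result.extend(level)
        size += 1
    return result


frequent = frequent_itemsets(transactions, min_support=0.5)
print(frequent)
print(support({"milk", "bread"}, transactions))
print(confidence({"milk"}, {"bread"}, transactions))
```

In this toy data, milk and bread appear together in 3 of 5 transactions (support 0.6), and 3 of the 4 milk transactions also contain bread (confidence about 0.75).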
4. Handling Large Datasets in Main Memory
Dealing with huge datasets that don’t fit into the computer’s memory is a big challenge.
Techniques:
• Data Partitioning: Divide data into smaller parts and process separately.
• Compression: Store data more compactly to save memory.
• Efficient Data Structures: Use trees like FP-Tree (Frequent Pattern Tree) to store data
efficiently.
5. Limited Pass Algorithm
When data is very large, we cannot scan it many times.
Limited pass algorithms try to find frequent itemsets with only 1 or 2 scans of the database.
Why Limited Pass?
• Saves time.
• Reduces memory usage.
• Important for real-time systems.
Example: The SON Algorithm for distributed data mining uses two passes.
6. Counting Frequent Itemsets in a Stream
In streaming data, items keep coming continuously (e.g., live transaction data).
Challenges:
• Cannot store the whole data.
• Must process data quickly.
Techniques Used:
• Approximate Counting: Allow small errors but save a lot of memory.
• Lossy Counting Algorithm: Keep only frequent itemsets and forget rare ones.
Clustering Techniques
Clustering is grouping similar data points into clusters so that points in the same group are more
similar to each other than to points in other groups.
1. Hierarchical Clustering
• Builds a tree (called dendrogram) of clusters.
• Can be:
o Agglomerative: Start with each data point as its own cluster and merge them step
by step.
o Divisive: Start with one big cluster and split it step by step.
Example:
In a company, employees can be grouped first by department, then by teams.
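Agglomerative clustering is available in base R through hclust(). The points below are made up, and "average" linkage is just one possible choice.

```r
# Five hypothetical 2-D points: two tight pairs and one point on its own.
points <- matrix(c(1.0, 1.0,
                   1.2, 0.9,
                   5.0, 5.0,
                   5.1, 4.8,
                   9.0, 1.0), ncol = 2, byrow = TRUE)

tree   <- hclust(dist(points), method = "average")  # merge closest clusters step by step
groups <- cutree(tree, k = 3)                       # cut the dendrogram into 3 clusters

# plot(tree) would draw the dendrogram.
```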
2. K-Means Clustering
• Most popular clustering algorithm.
• Steps:
1. Choose K (number of clusters).
2. Randomly select K points as centers.
3. Assign each data point to the nearest center.
4. Update centers by averaging points in each cluster.
5. Repeat until clusters don’t change much.
Example:
Grouping customers based on their spending patterns.
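The five steps can be written out by hand in base R as follows. The function name simple_kmeans and the sample data are assumptions for illustration. For real work, R's built-in kmeans() function is the usual choice.

```r
simple_kmeans <- function(points, k, max_iter = 100) {
  # Step 2: pick k distinct data points as the starting centers.
  centers <- points[sample(nrow(points), k), , drop = FALSE]
  cluster <- rep(0, nrow(points))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared distance from every point to every center (n x k matrix).
    distances <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centers[j, ])^2)
    })
    new_cluster <- max.col(-distances, ties.method = "first")
    # Step 5: stop once the assignments no longer change.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: move each center to the mean of its points.
    for (j in seq_len(k)) {
      members <- points[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(centers = centers, cluster = cluster)
}

# Hypothetical customers: (visits per month, spend in hundreds).
set.seed(42)
spending <- matrix(c(1, 1,    1.5, 2,    1, 0.5,
                     8, 8,    9, 9.5,    8.5, 9), ncol = 2, byrow = TRUE)
result <- simple_kmeans(spending, k = 2)   # Step 1: K = 2
result$cluster
```

The cluster numbers themselves depend on the random start, but the first three and the last three customers should end up in separate groups.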
3. Clustering High Dimensional Data
When data has hundreds or thousands of features (like in genetics or text data), clustering
becomes difficult.
Problems:
• Distances between points become nearly equal, so "nearest" loses its meaning (the curse of
dimensionality).
• Memory and processing requirements are high.
Solutions:
Use special methods like CLIQUE and ProCLUS.
4. CLIQUE and ProCLUS
CLIQUE (Clustering in Quest):
• Divides the data space into small grids.
• Finds dense areas in some dimensions.
• Suitable for very large and high-dimensional datasets.
ProCLUS (Projected Clustering):
• Clusters data based only on relevant dimensions (not all features).
• Reduces noise by ignoring irrelevant dimensions.
Example:
In customer data, only "income" and "spending" may be important, ignoring "zip code."
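The grid idea behind CLIQUE can be illustrated with its first step only: split each dimension into equal-width cells and keep the cells that hold enough points. The data, the number of bins, and the density threshold below are assumptions. The full algorithm goes on to combine dense cells across dimensions, which is not shown here.

```r
# Hypothetical data: "income" has two tight groups, "noise" is spread out evenly.
data <- data.frame(
  income = c(25, 26, 27, 28, 85, 86, 87, 88),
  noise  = c(5, 95, 35, 65, 15, 85, 45, 55)
)
n_bins <- 5
density_threshold <- 3   # minimum points for a cell to count as dense

# For each dimension, find the dense 1-D cells.
dense_units <- lapply(data, function(column) {
  cells  <- cut(column, breaks = seq(0, 100, length.out = n_bins + 1))
  counts <- table(cells)
  names(counts)[as.vector(counts) >= density_threshold]
})

dense_units$income   # "(20,40]" and "(80,100]"
dense_units$noise    # empty: no cell is dense, so this dimension is not relevant
```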
5. Frequent Pattern-Based Clustering Methods
These methods use frequent itemsets to cluster data.
Steps:
1. Find frequent patterns.
2. Group data points having similar patterns.
Example:
Clustering customers based on commonly bought product groups.
6. Clustering in Non-Euclidean Space
Normally, we use Euclidean distance (straight-line distance) to measure similarity.
But sometimes, it’s not the best choice.
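Other Distance and Similarity Measures:
• Manhattan Distance: Sum of the absolute differences along each dimension.
• Cosine Similarity: Based on the angle between two vectors, ignoring their lengths.
• Jaccard Similarity: Size of the overlap between two sets divided by the size of their union.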
Example:
In text documents, cosine similarity (based on angles) is better than simple distance.
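These three measures are short enough to write by hand in base R. The function names and the sample inputs are chosen here for illustration.

```r
manhattan <- function(a, b) sum(abs(a - b))

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# a and b are sets, e.g., the words that appear in two documents.
jaccard_similarity <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

manhattan(c(1, 2), c(4, 6))                                  # 3 + 4 = 7
cosine_similarity(c(1, 0), c(0, 1))                          # 0: vectors at right angles
jaccard_similarity(c("a", "b", "c"), c("b", "c", "d"))       # 2 shared of 4 total = 0.5
```

Cosine similarity of c(1, 2) and c(2, 4) is 1 (up to rounding) because the vectors point in the same direction, even though their lengths differ. This is why it suits text, where document length should not matter.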
7. Clustering for Streams and Parallelism
Clustering for streams means clustering data that arrives continuously.
Challenges:
• Cannot store all data.
• Need to update clusters in real-time.
Solution:
• Use micro-clusters that summarize the data.
• Merge or split clusters as new data arrives.
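One common way to summarize a micro-cluster is to store only the number of points, their sum, and their sum of squares. The sketch below assumes that representation; the function names and the sensor readings are hypothetical.

```r
new_micro_cluster <- function(dims) {
  list(n = 0, linear_sum = numeric(dims), square_sum = numeric(dims))
}

# Absorb one new point without storing the point itself.
add_point <- function(mc, x) {
  mc$n <- mc$n + 1
  mc$linear_sum <- mc$linear_sum + x
  mc$square_sum <- mc$square_sum + x^2
  mc
}

# Two summaries can be merged by adding their fields.
merge_clusters <- function(a, b) {
  list(n = a$n + b$n,
       linear_sum = a$linear_sum + b$linear_sum,
       square_sum = a$square_sum + b$square_sum)
}

centroid <- function(mc) mc$linear_sum / mc$n

# Two workers each summarize part of a hypothetical sensor stream.
worker_a <- new_micro_cluster(2)
for (reading in list(c(20, 50), c(22, 54))) worker_a <- add_point(worker_a, reading)
worker_b <- add_point(new_micro_cluster(2), c(24, 52))

combined <- merge_clusters(worker_a, worker_b)
centroid(combined)   # mean of all three readings: 22 52
```

Because summaries merge by simple addition, the same structure also works when the data is split across several processors.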
Parallelism:
To speed up clustering:
• Distribute the work across multiple processors/machines.
• Each processor works on a part of the data.
Example:
In sensor networks, cluster live data from different sensors in real-time using distributed
computing.
Summary Table
Topic | Meaning | Example
Mining Frequent Itemsets | Finding commonly occurring item groups | Bread and butter bought together
Market-Based Modelling | Using frequent patterns for marketing | Cross-selling related products
Apriori Algorithm | A step-by-step method to find frequent itemsets | Suggesting related products on Amazon
Handling Large Data | Managing huge data without overloading memory | Data partitioning, compression
Clustering | Grouping similar items together | Grouping similar customers
Hierarchical Clustering | Building a tree of clusters | Department- and team-wise employee grouping
K-Means Clustering | Dividing data into K clusters | Customer segmentation
High Dimensional Clustering | Special clustering when features are too many | Gene expression data analysis
Clustering in Streams | Real-time clustering of continuous data | Social media trend analysis
Parallel Clustering | Distributing clustering work across multiple machines | Big sensor network data clustering
UNIT Ⅴ
Frameworks and Visualization
Frameworks for Big Data Processing
Big data processing requires strong frameworks and tools. These frameworks help to store,
manage, and analyze huge amounts of data.
1. MapReduce
• MapReduce is a programming model for processing large datasets in parallel.
• It breaks down tasks into two main steps:
o Map: Process and filter data.
o Reduce: Aggregate results from the Map step.
Example: Counting the number of times each word appears in a large book.
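The word-count example can be imitated on a single machine in base R to show the shape of the two steps. This is only an illustration of the model, not Hadoop code; the input lines are made up.

```r
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Map: each line independently emits one (word, 1) pair per word.
mapped <- lapply(lines, function(line) {
  words <- strsplit(line, " ")[[1]]
  data.frame(word = words, count = 1)
})

# Shuffle: bring the pairs from all mappers together.
pairs <- do.call(rbind, mapped)

# Reduce: add up the counts for each word.
word_counts <- tapply(pairs$count, pairs$word, sum)
word_counts["the"]   # 3
```

In a real cluster, the map calls run on different machines and the framework handles the shuffle between them.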
2. Hadoop
• Hadoop is an open-source framework that uses MapReduce and provides a system to
store and process big data across many computers.
• Main components:
o HDFS (Hadoop Distributed File System): For storing big data.
o YARN (Yet Another Resource Negotiator): For managing resources and
scheduling tasks.
Example: Storing and processing Facebook’s user data.
3. Pig
• Pig is a high-level platform for creating MapReduce programs using a language called
Pig Latin.
• Easier than writing raw Java MapReduce code.
Example: Processing large logs to find error patterns.
4. Hive
• Hive allows querying large datasets using a language similar to SQL called HiveQL.
• It translates HiveQL queries into MapReduce jobs automatically.
Example: Running queries on petabytes of sales data.
5. HBase
• HBase is a NoSQL database that runs on top of HDFS.
• It stores huge amounts of structured data and supports real-time read/write access.
Example: Facebook has used HBase to store user messaging data.
6. MapR
• MapR provides a commercial distribution of Hadoop with improvements.
• Features:
o High performance
o Easy installation
o Better data protection
Example: Used by industries like healthcare, finance, and retail for big data analytics.
7. Sharding
• Sharding means splitting a large database into smaller, faster, easily manageable parts
called shards.
• Helps in managing large-scale databases efficiently.
Example: A large customer database split based on regions (Asia, Europe, America).
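Two common ways to decide which shard a row belongs to can be sketched with base R's split(). The customer table and the choice of three shards are assumptions for illustration.

```r
customers <- data.frame(
  id     = 1:6,
  region = c("Asia", "Europe", "America", "Asia", "Europe", "Asia")
)

# Key-based sharding: one shard per region, as in the example above.
shards_by_region <- split(customers, customers$region)

# Hash-based sharding: spread rows evenly using the id (3 shards assumed).
n_shards <- 3
shards_by_hash <- split(customers, customers$id %% n_shards)

# A query for one region only needs to look at one shard.
nrow(shards_by_region$Asia)   # 3
```

Region-based shards keep related rows together but can become uneven in size, while hash-based shards stay balanced but scatter each region across shards.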
8. NoSQL Databases
• NoSQL stands for "Not Only SQL."
• They handle unstructured and semi-structured data.
• Types of NoSQL databases:
o Document-based (e.g., MongoDB)
o Key-Value store (e.g., Redis)
o Column-based (e.g., Cassandra)
o Graph-based (e.g., Neo4j)
Example: Storing user profiles, social media data, IoT sensor readings.
9. S3 (Simple Storage Service)
• Amazon S3 is a cloud-based storage service.
• Stores and retrieves any amount of data from anywhere.
• Highly scalable, durable, and cost-effective.
Example: Netflix stores its movies and user data in Amazon S3.
10. Hadoop Distributed File System (HDFS)
• HDFS is the storage system of Hadoop.
• It splits large files into blocks and stores them across multiple machines.
• Provides fault tolerance: each block is replicated on several machines, so the data stays
available even if one machine fails.
Example: Storing large volumes of e-commerce transaction data.
Visualization
Data Visualization means showing data in a graphical or pictorial format to help people
understand it better.
1. Visual Data Analysis Techniques
• Represent data visually before performing detailed analysis.
• Examples:
o Scatter Plots
o Bar Charts
o Heatmaps
o Pie Charts
Purpose: Helps identify patterns, outliers, and trends quickly.
2. Interaction Techniques
• Techniques that allow users to interact with data visualizations.
• Examples:
o Zooming: See data details by zooming into a graph.
o Filtering: Show only relevant data by applying filters.
o Drill-down: Click on a chart to get more detailed views.
Example: Clicking on a country on a world map to see detailed sales figures.
3. Systems and Applications for Visualization
• Tools and software used for creating visualizations:
o Tableau
o Power BI
o QlikView
o D3.js (JavaScript Library)
Applications:
• Business Intelligence (BI) dashboards
• Financial reporting
• Healthcare data monitoring
Introduction to R
R is a popular programming language for statistics, data analysis, and visualization.
1. R Graphical User Interfaces (GUI)
• GUIs make R easier to use, without writing too much code.
• Examples of R GUIs:
o RStudio: Most popular IDE for R.
o Rattle: GUI for data mining.
2. Data Import and Export
• Import: Bringing data into R from CSV, Excel, databases, etc.
• Export: Saving results back into files.
Example Code:
data <- read.csv("data.csv")                      # Import: read a CSV file into a data frame
write.csv(data, "output.csv", row.names = FALSE)  # Export: save it without a row-number column
3. Attribute and Data Types
• Attributes describe features of data.
• Data Types in R:
o Numeric (e.g., 1.5, 3.14)
o Integer (e.g., 2L, 100L; a plain 2 is stored as numeric)
o Character (e.g., "Hello")
o Factor (categorical data)
o Logical (TRUE, FALSE)
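The class() function reports the type of any value, which is a quick way to check the list above. The sample values are chosen for illustration.

```r
class(1.5)                               # "numeric"
class(2)                                 # "numeric": a plain 2 is stored as a double
class(2L)                                # "integer": the L suffix requests an integer
class("Hello")                           # "character"
class(factor(c("low", "high", "low")))   # "factor"
class(TRUE)                              # "logical"
```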
4. Descriptive Statistics
• Summary of data using numbers like:
o Mean (Average)
o Median (Middle value)
o Mode (Most frequent value)
o Standard Deviation (Measure of spread)
Example Code:
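mean(data$Age, na.rm = TRUE)    # Mean of the Age column, ignoring missing values
median(data$Age, na.rm = TRUE)  # Middle value
sd(data$Age, na.rm = TRUE)      # Standard deviation
summary(data)                   # Min, quartiles, median, mean, and max for every column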
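R has no built-in function for the statistical mode (the built-in mode() reports how a value is stored, not the most frequent value). A small helper can be written with table(). The name stat_mode and the sample ages are assumptions for illustration.

```r
# Most frequent value(s) in x; returns all of them if there is a tie.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

ages <- c(25, 30, 30, 41, 30, 25)
stat_mode(ages)   # "30" (returned as text, because table() labels are strings)
```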
5. Exploratory Data Analysis (EDA)
• First step before detailed analysis.
• Involves:
o Checking missing data
o Finding distributions
o Spotting outliers
Example: Using histograms to see the age distribution of customers.
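A few of these checks can be done in base R. The customer ages below are made up, and the 1.5 × IQR rule used for outliers is one common convention rather than the only option.

```r
# Hypothetical customer data with one missing value and one suspicious age.
customers <- data.frame(Age = c(23, 35, NA, 41, 29, 95, 31))

# Checking missing data: number of NA values per column.
missing_per_column <- colSums(is.na(customers))

# Spotting outliers: values more than 1.5 * IQR outside the quartiles.
ages <- customers$Age[!is.na(customers$Age)]
q1 <- unname(quantile(ages, 0.25))
q3 <- unname(quantile(ages, 0.75))
fence <- 1.5 * (q3 - q1)
outliers <- ages[ages < q1 - fence | ages > q3 + fence]

# Finding distributions: hist(ages) would draw the age histogram.
```

Here the missing-value count for Age is 1, and the age 95 is flagged as an outlier.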
6. Visualization Before Analysis
• Plotting data helps in understanding its structure and spotting issues early.
• Common plots:
o Histograms
o Boxplots
o Scatterplots
Example Code:
hist(data$Salary)   # Histogram: shape of the salary distribution
boxplot(data$Age)   # Boxplot: median, quartiles, and outliers of age
7. Analytics for Unstructured Data
• Unstructured data = No fixed format (e.g., Text, Images, Videos).
• Techniques:
o Text Mining: Extract information from emails, tweets.
o Sentiment Analysis: Finding emotions in text (positive, negative).
Example: Analyzing customer reviews to find satisfaction levels.
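A very simple form of sentiment analysis counts positive and negative words. The word lists, the function name sentiment_score, and the reviews below are made up for illustration; real projects use much larger lexicons or trained models.

```r
# Tiny hypothetical word lists.
positive_words <- c("good", "great", "love", "excellent")
negative_words <- c("bad", "poor", "hate", "terrible")

# Score = number of positive words minus number of negative words.
sentiment_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]   # text mining step: split into words
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

reviews <- c("Great product, love it!",
             "Poor quality and bad support.",
             "It arrived on time.")
scores <- unname(sapply(reviews, sentiment_score))
scores   # 2 -2 0
```

A score above zero suggests a satisfied customer, below zero a dissatisfied one, and zero a neutral review.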
Summary Table
Topic | Meaning | Example
MapReduce | Process large data in parallel | Word count in documents
Hadoop | Framework for big data storage and processing | Facebook data management
Pig | Easy scripting for MapReduce | Log file analysis
Hive | SQL for Hadoop | Sales report generation
HBase | NoSQL database on Hadoop | Storing messages
MapR | Commercial Hadoop with improvements | Financial data processing
Sharding | Splitting databases | Large customer database management
NoSQL | Non-relational databases | Storing social media data
S3 | Cloud storage | Netflix storing movies
Visualization | Showing data graphically | Sales dashboard
R | Statistical computing language | Data analysis, graphs creation