Data Analytics Complete Notes

Data analytics involves examining and interpreting data to extract insights that support decision-making across various sectors, including healthcare, finance, and retail. It is essential for improving efficiency, enhancing customer experience, and detecting fraud, with a growing demand for skilled professionals in this field. The data analytics lifecycle includes phases such as discovery, data preparation, model planning, model building, communicating results, and operationalization, each critical for successful data projects.

DATA ANALYTICS

UNIT Ⅰ
What is Data Analytics?
Data Analytics is the process of examining, cleaning, transforming, and interpreting data to
extract useful information, patterns, and insights. The main goal of data analytics is to support
better decision-making by discovering meaningful trends within data.
In simple terms, data analytics helps us understand what the data is telling us. Whether it's a
business, healthcare system, or even sports team — data analytics helps identify what is working
well and what needs improvement.

Why is Data Analytics Important?


Data analytics is important because it allows organizations to:
1. Make Informed Decisions:
o Businesses use data to understand customer behavior, track sales trends, and predict
future demands.
o Example: An e-commerce website analyzing customer purchase history to suggest
relevant products.
2. Improve Efficiency:
o Companies analyze data to identify inefficiencies and improve processes.
o Example: A delivery company tracking routes to find the fastest path for their
drivers.
3. Enhance Customer Experience:
o By understanding customer preferences, businesses can provide personalized
services.
o Example: Netflix recommending shows based on your watch history.
4. Detect Fraud and Risks:
o Banks and financial institutions use data analytics to detect suspicious transactions.
o Example: Credit card companies identifying unusual spending patterns to prevent
fraud.
5. Boost Innovation:
o Data insights often reveal new opportunities for product development or marketing
strategies.
o Example: Car manufacturers using data to improve vehicle performance and safety.

Which Sectors Use Data Analytics?


Data analytics is used in almost every industry today. Some major sectors that heavily rely on data
analytics include:
1. Healthcare:
o Predicting disease outbreaks, improving patient care, and managing hospital
resources.
o Example: Tracking COVID-19 spread and identifying high-risk areas.
2. Finance and Banking:
o Risk assessment, fraud detection, and investment analysis.
o Example: Analyzing stock market trends to guide investment decisions.
3. Retail and E-commerce:
o Analyzing customer buying behavior, improving inventory management, and
personalizing marketing strategies.
o Example: Amazon recommending products based on your browsing history.
4. Telecommunications:
o Optimizing network performance, reducing downtime, and improving customer
service.
o Example: Telecom companies predicting network failures before they occur.
5. Manufacturing:
o Predictive maintenance, quality control, and supply chain management.
o Example: Factories using data analytics to prevent machine breakdowns.
6. Education:
o Tracking student performance, improving learning strategies, and predicting
dropout rates.
o Example: Schools using data to tailor teaching methods for better student outcomes.
7. Sports:
o Analyzing player performance, predicting match outcomes, and developing
winning strategies.
o Example: Cricket teams analyzing players’ strengths and weaknesses for better
team selection.
8. Transportation and Logistics:
o Optimizing delivery routes, reducing fuel costs, and tracking vehicle maintenance.
o Example: Uber using data to predict rider demand and assign drivers efficiently.

Why Should You Study Data Analytics?


Learning data analytics offers several benefits, especially in today's data-driven world:
1. High Demand for Data Analytics Skills:
o Companies across industries are investing in data analytics, creating a strong
demand for skilled professionals.
2. Diverse Career Opportunities:
o Roles such as Data Analyst, Data Scientist, Business Analyst, and Machine
Learning Engineer are growing rapidly.
3. Better Decision-Making Skills:
o Understanding data enables you to make evidence-based decisions in both
professional and personal situations.
4. Improved Problem-Solving Abilities:
o Data analytics trains you to analyze complex problems, identify patterns, and
develop effective solutions.
5. Attractive Salary Packages:
o Data analytics professionals are among the highest-paid in the tech industry.
6. Innovation and Creativity:
o Learning data analytics opens the door to creating unique solutions by interpreting
data trends creatively.

Conclusion

Data analytics is a powerful skill that is transforming industries by turning raw data into
meaningful insights. By studying data analytics, you gain valuable skills that can enhance your
career prospects, improve decision-making, and open opportunities in various sectors. Whether
you aim to work in business, healthcare, finance, or technology, mastering data analytics will
prepare you for success in today's data-driven world.
Introduction to Data Analytics Concepts

1. Sources and Nature of Data

Data can come from various sources and exists in different forms. Understanding the origin and
nature of data is essential for effective data analytics.

Sources of Data:

• Internal Sources: Data generated within an organization (e.g., sales records, employee
data).
• External Sources: Data from third-party providers, websites, or social media platforms.
• Machine-Generated Data: Data collected by sensors, logs, or automated systems.
• Human-Generated Data: Data created through user interactions, emails, or feedback.

Nature of Data:

• Quantitative Data: Numerical data that can be measured (e.g., sales figures, temperature).
• Qualitative Data: Descriptive data that is non-numeric (e.g., customer feedback, product
reviews).

2. Classification of Data

Data is classified into three types based on its structure:

1. Structured Data:

• Data organized in a tabular format with rows and columns.


• Stored in relational databases like MySQL, Oracle, or SQL Server.
• Example: Employee database with fields like ID, Name, Age, and Salary.

2. Semi-Structured Data:

• Data that is partially organized but does not follow a strict tabular structure.
• Stored in formats like JSON, XML, or CSV.
• Example: Emails, social media posts, or web pages.

3. Unstructured Data:

• Data that has no predefined structure and is difficult to organize.


• Requires specialized tools to manage and analyze.
• Example: Videos, images, audio files, and text documents.
3. Characteristics of Data

Data has several key characteristics that determine its value and complexity:

• Volume: The vast amount of data generated daily.


• Velocity: The speed at which data is created and processed.
• Variety: The diverse types and sources of data (structured, semi-structured, unstructured).
• Veracity: The reliability and accuracy of data.
• Value: The insights and benefits derived from data analysis.

4. Introduction to Big Data Platform

A Big Data Platform is an integrated system that combines various tools and frameworks to
manage large volumes of data efficiently.

Popular Big Data platforms include:

• Apache Hadoop: Used for distributed storage and processing.


• Apache Spark: Known for fast in-memory data processing.
• Amazon Web Services (AWS): Cloud-based services for data storage and analytics.
• Google Cloud Platform (GCP): Offers scalable data solutions.

Big Data platforms are essential for processing large-scale data that traditional systems cannot
handle.

5. Need for Data Analytics:

Data analytics is crucial for several reasons:

✅ Identifying trends and patterns to support decision-making.


✅ Improving operational efficiency by analyzing business processes.
✅ Enhancing customer experience through personalized services.
✅ Predicting future outcomes using historical data.
✅ Reducing risks and preventing fraud by identifying anomalies.

6. Evolution of Analytic Scalability

As data volume increased over time, analytics methods evolved to manage this growth:

• Traditional Analytics: Limited to small datasets and performed using basic tools like
Excel or SQL.
• Big Data Analytics: Introduced frameworks like Hadoop and Spark to handle massive
datasets.
• Cloud-Based Analytics: Uses platforms like AWS, Azure, and GCP to provide scalable
and efficient data processing.

7. Analytic Process and Tools

The data analytics process involves multiple stages:

1. Data Collection: Gathering data from various sources.


2. Data Cleaning: Removing duplicates, errors, and irrelevant information.
3. Data Transformation: Converting data into a suitable format for analysis.
4. Data Analysis: Applying statistical methods, machine learning models, or algorithms to
extract insights.
5. Data Visualization: Presenting insights through graphs, charts, and dashboards.
6. Decision-Making: Using the insights to make informed business decisions.

Popular Tools for Analytics:

• Python (with Pandas, NumPy, and Matplotlib)


• R (for statistical analysis)
• SQL (for database querying)
• Tableau and Power BI (for visualization)
• Apache Spark (for large-scale data processing)

8. Analysis vs Reporting

Though both deal with data, they serve different purposes:

• Analysis: Focuses on exploring data to identify trends, patterns, and insights.


o Example: Identifying customer segments that spend the most.
• Reporting: Presents data in a structured format for review.
o Example: Generating monthly sales reports.

Key Difference: Analysis discovers insights, while reporting presents organized information.

9. Modern Data Analytics Tools

Modern tools provide powerful features for data handling, visualization, and analysis:

• Apache Spark: Fast data processing for large datasets.


• Google BigQuery: Cloud-based platform for real-time analytics.
• Power BI: Business intelligence tool for interactive dashboards.
• D3.js: JavaScript library for creating dynamic visualizations.
• Jupyter Notebook: Interactive environment for coding and data exploration.

10. Applications of Data Analytics

Data analytics is widely used in various industries:

• Healthcare: Predicting disease outbreaks, improving treatment plans.


• Finance: Detecting fraud, managing risks, and optimizing investments.
• Retail: Personalized recommendations, sales trend prediction.
• Manufacturing: Predictive maintenance and improving production quality.
• Sports: Analyzing player performance and game strategies.
• Social Media: Identifying trends and user behavior patterns.

Data Analytics Lifecycle:

The Data Analytics Lifecycle is a structured process that guides data professionals through the
steps required to extract insights from data. It ensures that data analytics projects are carried out
efficiently, leading to valuable business insights and successful outcomes.

Need for Data Analytics Lifecycle

Following a defined lifecycle is essential because:

✅ It provides a clear framework for managing complex data projects.


✅ Ensures each stage is carefully executed, reducing errors and improving results.
✅ Helps teams collaborate effectively by defining roles and responsibilities.
✅ Ensures data insights are actionable and aligned with business goals.
✅ Reduces project risks by following a step-by-step approach.

Key Roles in a Successful Data Analytics Project

A successful data analytics project requires collaboration between various roles, each
contributing unique skills:

1. Business Analyst:
o Understands business goals and translates them into project requirements.
o Ensures insights align with business objectives.
2. Data Engineer:
o Collects, organizes, and manages large volumes of data.
o Builds data pipelines and ensures data quality.
3. Data Scientist:
o Applies statistical methods, machine learning, and programming to extract
insights.
o Designs and builds predictive models.
4. Data Analyst:
o Analyzes data using tools like Excel, SQL, or Python.
o Creates visual reports and dashboards for stakeholders.
5. Project Manager:
o Manages timelines, resources, and ensures the project stays on track.
o Coordinates communication between team members.
6. Stakeholders:
o Business leaders, decision-makers, or clients who define the project’s objectives.

Phases of Data Analytics Lifecycle

The data analytics lifecycle consists of six key phases:

1. Discovery Phase

Purpose: Understand the project’s objectives, goals, and scope.

• Identify the business problem to solve.
• Define project goals, timelines, and key performance indicators (KPIs).
• Understand the data sources and their availability.
• Engage with stakeholders to gather insights.

Example: An e-commerce company wants to predict customer churn. The goal is to identify
customers likely to leave and take preventive actions.

2. Data Preparation Phase

Purpose: Collect, clean, and organize data for analysis.

• Gather data from multiple sources (databases, APIs, etc.).
• Clean data by handling missing values, duplicates, and errors.
• Perform data transformation (e.g., converting text to numerical values).
• Split data into training and testing sets for model building.

Example: In a sales prediction project, data may include customer demographics, past
purchases, and website activity.

3. Model Planning Phase

Purpose: Decide on the analytical techniques and models to apply.

• Select appropriate techniques like regression, clustering, or decision trees.
• Identify the tools and frameworks for modeling (e.g., Python, R, or Spark).
• Develop a strategy for feature selection and model evaluation.

Example: For predicting loan defaults, logistic regression or decision trees might be chosen for
their accuracy in classification tasks.

4. Model Building Phase

Purpose: Develop the chosen model and test its performance.

• Write code to train the model using sample data.
• Fine-tune model parameters for better accuracy.
• Test the model on unseen data (testing data) to evaluate its performance.
• Perform model validation using metrics like accuracy, precision, or recall.

Example: A credit scoring model may be trained on historical data to predict customer risk
levels.

5. Communicating Results Phase

Purpose: Present the findings to stakeholders in a clear and actionable format.

• Visualize insights using tools like Tableau, Power BI, or Python’s Matplotlib.
• Create dashboards, charts, and reports to summarize the results.
• Provide actionable recommendations based on insights.

Example: Presenting a report that identifies the top customer segments contributing to sales
growth.

6. Operationalization Phase

Purpose: Deploy the model and integrate it into the organization’s workflow.

• Implement the model in production environments for real-time analysis.
• Automate data pipelines to ensure continuous data flow.
• Monitor model performance and retrain it as needed to improve accuracy.

Example: A fraud detection system integrated with a bank’s transaction system to alert
suspicious activities in real-time.

Summary of Data Analytics Lifecycle Phases

Phase | Purpose | Key Activities
Discovery | Understand project goals | Define objectives, scope, and KPIs
Data Preparation | Clean and organize data | Data collection, cleaning, and transformation
Model Planning | Choose suitable models | Select algorithms, techniques, and tools
Model Building | Develop and test models | Train model, evaluate performance
Communicating Results | Present insights to stakeholders | Visualizations, dashboards, and reports
Operationalization | Deploy and monitor the model | Integrate model into business processes
UNIT Ⅱ

Data Analysis Techniques and Methods

Data analysis involves applying various mathematical, statistical, and machine learning
techniques to understand data, uncover patterns, and make predictions. Below are key methods
used in data analysis with simple explanations.

1. Regression Modeling
Regression modeling is a technique used to understand the relationship between one or more
independent variables (predictors) and a dependent variable (outcome).

Types of Regression Models:

• Linear Regression: Establishes a straight-line relationship between variables.


Example: Predicting house prices based on area size.
• Multiple Regression: Involves multiple independent variables.
Example: Predicting sales based on factors like price, advertising, and season.
• Logistic Regression: Used for classification problems where the outcome is binary (e.g.,
Yes/No, True/False).
Example: Predicting whether a customer will buy a product.
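
As a rough illustration (a minimal sketch assuming scikit-learn is installed, with made-up numbers), the snippet below fits a linear regression for house prices and a logistic regression for a yes/no purchase decision:

# Minimal regression sketch using scikit-learn with illustrative, made-up data.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict house price (in lakhs) from area (sq. ft.)
areas = [[800], [1000], [1200], [1500], [1800]]          # independent variable
prices = [40, 50, 60, 75, 90]                            # dependent variable
lin_model = LinearRegression().fit(areas, prices)
print(lin_model.predict([[1300]]))                       # estimated price for 1300 sq. ft.

# Logistic regression: predict whether a customer buys (1) or not (0)
ad_spend = [[1], [2], [3], [4], [5], [6]]                # e.g., number of ads seen
bought = [0, 0, 0, 1, 1, 1]
log_model = LogisticRegression().fit(ad_spend, bought)
print(log_model.predict_proba([[4]]))                    # probability of not buying / buying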

2. Multivariate Analysis
Multivariate analysis examines multiple variables at the same time to understand their
relationships and patterns.

Common Multivariate Techniques:

• Principal Component Analysis (PCA): Reduces the number of features while retaining
key information.
• Cluster Analysis: Groups similar data points based on common characteristics.
• Factor Analysis: Identifies underlying variables (factors) that explain data patterns.

Example: In marketing, multivariate analysis can be used to segment customers based on demographics, spending habits, and preferences.

3. Bayesian Modeling
Bayesian modeling is a statistical approach that incorporates prior knowledge (beliefs) along
with new data to improve decision-making.

Key Concepts in Bayesian Modeling:

• Bayes' Theorem: Used to update the probability of a hypothesis based on new evidence.
• Posterior Probability: The updated probability after considering new data.

Example: In spam email detection, Bayesian modeling updates the probability that an email is
spam based on keywords found in the email.
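
As a small worked sketch (the probabilities below are invented purely for illustration), Bayes' theorem updates the prior probability that an email is spam once a keyword such as "free" is observed:

# Worked Bayes' theorem example with illustrative (made-up) probabilities.
p_spam = 0.2                     # prior: 20% of all emails are spam
p_word_given_spam = 0.6          # "free" appears in 60% of spam emails
p_word_given_ham = 0.05          # "free" appears in 5% of legitimate emails

# Total probability of seeing the word "free" in any email
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the email is spam given that it contains "free"
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # 0.75 -> the prior 0.2 is updated to 0.75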

4. Inference and Bayesian Networks


Bayesian networks are graphical models that use probabilities to represent the relationships
between variables.

Key Features of Bayesian Networks:

• Represent uncertainty and dependencies in data.


• Useful in decision-making under uncertainty.

Example: In medical diagnosis, Bayesian networks can predict the likelihood of a disease based
on observed symptoms.

5. Support Vector Machines (SVM) and Kernel Methods


SVM is a powerful machine learning algorithm used for classification and regression tasks.

How SVM Works:

• It finds a hyperplane (decision boundary) that best separates data points into classes.
• Kernel Methods transform non-linear data into a higher dimension where it becomes
linearly separable.

Example: SVM is widely used in image recognition, like identifying handwritten digits.
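
A minimal sketch (assuming scikit-learn, whose bundled digits dataset is used here) of an SVM with an RBF kernel classifying handwritten digits:

# Minimal SVM sketch on scikit-learn's built-in handwritten-digit dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# The RBF kernel maps the data into a higher-dimensional space where classes separate
model = SVC(kernel="rbf", gamma=0.001, C=10)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))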

6. Analysis of Time Series Data


Time series analysis deals with data points collected over time, focusing on identifying trends,
seasonal patterns, and forecasting.

Key Techniques in Time Series Analysis:

• Linear Systems Analysis: Models that assume a stable and linear relationship over time.
• Nonlinear Dynamics: Used for complex data patterns that do not follow linear trends.

Example: Predicting stock prices, weather forecasting, or analyzing website traffic trends.

7. Rule Induction
Rule induction is a machine learning method that extracts decision rules from data.

Key Features:

• Produces human-readable rules for decision-making.


• Often used in data mining for discovering hidden patterns.

Example: In retail, rule induction may reveal that "customers who buy milk are likely to buy
bread."

8. Neural Networks: Learning and Generalization


Neural networks are designed to mimic the human brain and are highly effective in solving
complex problems.

Key Concepts:

• Learning: Neural networks learn patterns from data through training.


• Generalization: Ensures the model performs well on unseen data.

Example: Image recognition systems that detect objects in photos.

9. Competitive Learning
Competitive learning is a type of unsupervised learning where neurons compete to respond to
input data.
Key Features:

• Each neuron adjusts itself to specialize in recognizing specific patterns.


• Often used in clustering algorithms.

Example: Identifying customer segments based on shopping behavior.

10. Principal Component Analysis (PCA) and Neural Networks

PCA is a dimensionality reduction technique that simplifies data by reducing the number of variables while retaining key information.

PCA in Neural Networks:

• PCA reduces input dimensions, improving the efficiency of neural networks.


• Useful for visualizing high-dimensional data.

Example: PCA can reduce the number of features in a facial recognition model while
maintaining accuracy.
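
A minimal sketch (assuming scikit-learn and its bundled Iris dataset) that reduces four input features to two principal components before they are plotted or fed to a model:

# Minimal PCA sketch: reduce 4 input features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # 150 samples x 2 components

print(X_reduced.shape)
print(pca.explained_variance_ratio_)      # share of variance kept by each component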

11. Fuzzy Logic and Fuzzy Models


Fuzzy logic is a method used for handling uncertainty and imprecision in data.

Key Concepts in Fuzzy Logic:

• Instead of strict "True" or "False" values, fuzzy logic allows partial truths (values
between 0 and 1).
• Useful for modeling complex systems where precise data may be lacking.

Example: In washing machines, fuzzy logic determines the wash cycle based on dirt level and
fabric type.
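
As a small illustration (a hand-rolled sketch with arbitrary thresholds, not a full fuzzy controller), a simple membership function can assign a partial truth value to "the load is dirty" instead of a strict yes/no:

# Minimal fuzzy-membership sketch: "dirtiness" as a value between 0 and 1.
def dirty_membership(dirt_level, low=20, high=80):
    """Ramp membership: 0 below `low`, 1 above `high`, linear in between."""
    if dirt_level <= low:
        return 0.0
    if dirt_level >= high:
        return 1.0
    return (dirt_level - low) / (high - low)

for level in (10, 35, 50, 90):
    print(level, "->", round(dirty_membership(level), 2))
# A washing machine could pick a longer cycle as the membership value rises.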

Fuzzy Decision Trees:

• Combines decision trees with fuzzy logic for improved handling of uncertain data.
• Useful in medical diagnosis, finance, and control systems.
12. Stochastic Search Methods
Stochastic search methods are algorithms that use random sampling to find optimal solutions in
complex problems.

Common Techniques:

• Genetic Algorithms (GA): Inspired by natural evolution, GA optimizes solutions through mutation and selection.
• Simulated Annealing: Mimics the heating and cooling process to explore possible
solutions.
• Particle Swarm Optimization (PSO): Inspired by the behavior of birds in a flock.

Example: Stochastic methods are widely used in financial modeling, logistics optimization, and
game theory.

Summary of Techniques and Applications


Technique | Purpose | Example Applications
Regression Modeling | Predict continuous outcomes | Predicting house prices
Multivariate Analysis | Analyzing relationships between variables | Customer segmentation
Bayesian Modeling | Updating predictions using new data | Spam email detection
Bayesian Networks | Modeling uncertainty and dependencies | Medical diagnosis
Support Vector Machines (SVM) | Classification and pattern recognition | Handwriting recognition
Time Series Analysis | Forecasting future trends | Stock price prediction
Rule Induction | Discovering patterns in data | Market basket analysis
Neural Networks | Learning complex patterns | Facial recognition
Competitive Learning | Clustering and pattern recognition | Customer behavior analysis
PCA (Principal Component Analysis) | Dimensionality reduction | Data visualization
Fuzzy Logic | Handling uncertainty and imprecise data | Smart home systems
Stochastic Search Methods | Finding optimal solutions | Route optimization in logistics
UNIT Ⅲ

Mining Data Streams


Mining data streams is the process of analyzing continuous and fast-flowing data in real-time.
Unlike traditional data storage systems where data is static, data streams are dynamic, requiring
specialized techniques to extract useful insights.

1. Introduction to Stream Concepts


A data stream is a continuous flow of data that arrives in real-time or near-real-time. This data
is generated rapidly and in large volumes, making traditional batch processing methods
inefficient.

Key Features of Data Streams:

• Continuous Flow: Data arrives endlessly without stopping.


• Fast-Paced: Data must be processed quickly to gain real-time insights.
• Dynamic Nature: Data streams can change patterns frequently.
• Limited Memory: Only a portion of the data can be stored or analyzed at a time.

Examples of Data Streams:

• Social media feeds (Twitter, Facebook posts)


• Stock market data
• Sensor data from IoT devices
• Website clickstreams
• Financial transactions for fraud detection

2. Stream Data Model and Architecture


The stream data model describes how data is represented and processed in real-time systems.

Components of Stream Data Architecture:

1. Data Sources: Devices or platforms that generate continuous data (e.g., IoT sensors,
social media platforms).
2. Data Stream Management System (DSMS): Specialized systems like Apache Kafka,
Apache Flink, and Apache Storm that handle data streams.
3. Processing Engine: Performs data filtering, aggregation, and analysis (e.g., Apache
Spark Streaming).
4. Storage System: Stores processed data for future analysis (e.g., Amazon S3, Hadoop
HDFS).
5. Visualization Tools: Displays insights using dashboards like Grafana or Tableau.

3. Stream Computing
Stream computing refers to processing and analyzing data as it arrives rather than waiting to
collect all data first.

Key Features of Stream Computing:

✅ Real-time processing with minimal delay


✅ Capable of handling large-scale data
✅ Often uses distributed computing for scalability

Example: Detecting credit card fraud by identifying unusual transaction patterns in real-time.

4. Sampling Data in a Stream


Since it's impossible to store an entire data stream due to its continuous nature, sampling
techniques are used to select representative data points.

Common Sampling Methods:

• Random Sampling: Selects random data points to represent the stream.


• Reservoir Sampling: Maintains a fixed-size sample from a stream, ensuring all elements
have an equal chance of being selected.

Example: Sampling social media posts to analyze trending hashtags.
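
A minimal sketch of reservoir sampling (the classic Algorithm R), which keeps a fixed-size sample in which every stream element has an equal chance of appearing:

# Reservoir sampling (Algorithm R): keep k items; each stream element
# ends up in the sample with equal probability k/n.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randint(0, i)          # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item           # replace with decreasing probability
    return reservoir

print(reservoir_sample(range(1, 100_001), k=5))   # 5 uniformly chosen elements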

5. Filtering Streams
Filtering is the process of extracting relevant data from a continuous stream based on specific
conditions.

Techniques for Filtering:


• Keyword Filtering: Extracting tweets that mention specific words (e.g., “AI,” “Big
Data”).
• Value-Based Filtering: Only accepting data points that meet certain criteria (e.g.,
temperatures above 30°C).

Example: Filtering Twitter data to analyze tweets related to elections.

6. Counting Distinct Elements in a Stream


Counting unique elements in a data stream can be challenging due to memory constraints.

Efficient Algorithms for Counting:

• HyperLogLog Algorithm: Estimates the number of unique elements with minimal memory usage.
• Bloom Filters: A space-efficient data structure for checking the existence of an element
in a set.

Example: Counting the number of unique IP addresses visiting a website.
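
A simplified Flajolet-Martin style sketch (the core idea that HyperLogLog refines; a single, noisy estimator shown only to illustrate how distinct counts can be estimated from hash patterns rather than by storing every element):

# Simplified Flajolet-Martin sketch: estimate distinct elements from hash patterns.
# HyperLogLog refines this idea by averaging many such registers.
import hashlib

def trailing_zeros(n):
    if n == 0:
        return 32
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def estimate_distinct(stream):
    max_zeros = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros          # rough estimate of the number of distinct items

ips = ["10.0.0.%d" % (i % 300) for i in range(10_000)]   # 300 distinct IPs repeated
print(estimate_distinct(ips))      # rough order-of-magnitude estimate of 300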

7. Estimating Moments
In data streams, moments are statistical properties that describe the data distribution.

Key Moments:

• First Moment (Mean): The average value of the data stream.


• Second Moment (Variance): Measures data spread or variation.
• Third and Fourth Moments: Capture the skewness and shape of the data distribution.

Example: Estimating the average temperature from IoT sensor data.
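
A minimal one-pass sketch (Welford's method, a common way to do this) that maintains the running mean and variance without storing the stream:

# One-pass (Welford) update of mean and variance for a data stream.
def running_moments(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n                 # first moment (mean)
        m2 += delta * (x - mean)          # accumulates squared deviations
    variance = m2 / n if n else 0.0       # second central moment
    return mean, variance

readings = [21.5, 22.0, 21.8, 23.1, 22.7]     # e.g., IoT temperature readings
print(running_moments(readings))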

8. Counting Oneness in a Window


Counting oneness (counting ones) refers to counting how many occurrences of a specific value (such as "1") appear within a particular time window of the stream.

Window Concepts in Streaming Data:

• Sliding Window: Moves continuously with time, maintaining recent data points.
• Tumbling Window: Divides the data into fixed-sized intervals without overlap.

Example: Counting the number of failed login attempts within the last 5 minutes.
9. Decaying Window
A decaying window assigns less importance to older data points and gives more weight to recent
data.

Purpose of Decaying Window:

• Ensures recent trends have a stronger impact on analysis.


• Helps prevent outdated information from influencing predictions.

Example: In stock trading, recent price fluctuations are given more importance than older prices.
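
A minimal sketch of an exponentially decaying average (one common way to implement a decaying window; the decay factor 0.9 is just illustrative):

# Exponentially decaying average: recent values dominate, old ones fade out.
def decaying_average(stream, decay=0.9):
    weighted_sum, weight_total = 0.0, 0.0
    for x in stream:
        weighted_sum = decay * weighted_sum + x     # old data shrinks by `decay`
        weight_total = decay * weight_total + 1.0
        yield weighted_sum / weight_total           # current decayed average

prices = [100, 101, 99, 120, 125, 130]              # the recent jump matters most
print([round(v, 2) for v in decaying_average(prices)])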

10. Real-time Analytics Platform (RTAP) Applications


RTAP platforms are designed to process and analyze streaming data in real-time.

Popular RTAP Tools:

• Apache Kafka: For building real-time data pipelines.


• Apache Flink: For large-scale stream data processing.
• Amazon Kinesis: For streaming data on AWS cloud.
• Apache Spark Streaming: For distributed stream processing.

RTAP Use Cases:

✅ Fraud detection in financial systems


✅ Real-time recommendation engines for e-commerce
✅ Monitoring server logs for security threats
✅ Tracking social media trends

11. Case Studies in Data Stream Mining


Case Study 1: Real-time Sentiment Analysis

Objective: Analyze social media data to understand public sentiment.

Process:

• Collect real-time tweets or posts.


• Filter tweets based on keywords or hashtags.
• Use natural language processing (NLP) techniques to classify sentiments as positive,
negative, or neutral.
• Visualize sentiment trends using dashboards.
Example: Identifying customer reactions to a new product launch on Twitter.

Case Study 2: Stock Market Predictions

Objective: Predict stock price changes based on real-time market data.

Process:

• Stream financial data such as stock prices, trade volumes, and economic news.
• Use time series analysis techniques like ARIMA or LSTM models to predict future
prices.
• Generate alerts for potential market trends or risks.

Example: Identifying early signs of a market crash based on unusual trading patterns.

Summary of Concepts and Applications


Concept | Description | Example Application
Stream Data Model | Defines how streaming data is structured | Website clickstream analysis
Stream Computing | Real-time data processing | Fraud detection during credit card transactions
Sampling Data in a Stream | Selecting key data points for analysis | Monitoring social media trends
Filtering Streams | Extracting relevant data from a continuous flow | Identifying specific keywords in live tweets
Counting Distinct Elements | Tracking unique data points efficiently | Counting unique website visitors
Estimating Moments | Calculating statistical properties of data | Analyzing average sensor readings
Counting Oneness in a Window | Counting specific values within a defined window | Detecting frequent failed logins in 5 minutes
Decaying Window | Giving higher importance to recent data | Predicting stock prices based on recent trends
RTAP Applications | Real-time data platforms for fast processing | Real-time customer support response systems
UNIT Ⅳ

Frequent Itemset and Clustering

1. Mining Frequent Itemset


Frequent itemsets are groups of items that appear together often in a dataset.
Mining these patterns is very important for understanding customer behavior, recommendations,
and decision-making.

Example:

In a supermarket, customers often buy bread and butter together. "Bread, Butter" would be a
frequent itemset.

2. Market-Based Modelling
Market-based modeling uses frequent itemsets to find relationships between products and
customer buying habits.
This is the idea behind Market Basket Analysis.

• Goal: Find which products are often bought together to increase sales.
• Example: If a customer buys a printer, they are likely to buy ink too.
• Result: The store can bundle products or offer discounts to increase sales.

3. Apriori Algorithm
The Apriori Algorithm is a classic and popular method for mining frequent itemsets.

How Apriori Works:

1. Find all individual items that meet the minimum support (appear frequently enough).
2. Combine items to form pairs, triples, etc., and check if these combinations meet the
minimum support.
3. Repeat the process until no more combinations are frequent.

Important Concepts:
• Support: How often an itemset appears.
• Confidence: How likely an item B is bought when A is bought.

Simple Example:

Suppose milk appears in 40% of all transactions, and 70% of the customers who buy milk also buy bread. Then:

• Confidence of the rule "milk → bread" = 70%
• Support for (milk, bread) = 40% × 70% = 28% of all transactions

(Support of a pair is measured against all transactions, while confidence is measured only against the transactions that contain the "if" item.) A short code sketch on a toy dataset follows below.
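
A tiny plain-Python sketch (toy transactions, invented for illustration) of the first Apriori steps: count supports for single items and pairs, then derive the confidence of the rule milk → bread:

# Toy Apriori-style pass: support of items/pairs and confidence of milk -> bread.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

item_counts = Counter()
pair_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(frozenset(p) for p in combinations(sorted(t), 2))

n = len(transactions)
both = pair_counts[frozenset({"milk", "bread"})]
print("support(milk, bread) =", both / n)                    # 3/5 = 0.6
print("confidence(milk -> bread) =", both / item_counts["milk"])  # 3/4 = 0.75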

4. Handling Large Datasets in Main Memory


Dealing with huge datasets that don’t fit into the computer’s memory is a big challenge.

Techniques:

• Data Partitioning: Divide data into smaller parts and process separately.
• Compression: Store data more compactly to save memory.
• Efficient Data Structures: Use trees like FP-Tree (Frequent Pattern Tree) to store data
efficiently.

5. Limited Pass Algorithm


When data is very large, we cannot scan it many times.
Limited pass algorithms try to find frequent itemsets with only 1 or 2 scans of the database.

Why Limited Pass?

• Saves time.
• Reduces memory usage.
• Important for real-time systems.

Example: The SON Algorithm for distributed data mining uses two passes.

6. Counting Frequent Itemsets in a Stream


In streaming data, items keep coming continuously (e.g., live transaction data).

Challenges:
• Cannot store the whole data.
• Must process data quickly.

Techniques Used:

• Approximate Counting: Allow small errors but save a lot of memory.


• Lossy Counting Algorithm: Keep only frequent itemsets and forget rare ones.

Clustering Techniques
Clustering is grouping similar data points into clusters so that points in the same group are more
similar to each other.

1. Hierarchical Clustering
• Builds a tree (called dendrogram) of clusters.
• Can be:
o Agglomerative: Start with each data point as its own cluster and merge them step
by step.
o Divisive: Start with one big cluster and split it step by step.

Example:
In a company, employees can be grouped first by department, then by teams.

2. K-Means Clustering
• Most popular clustering algorithm.
• Steps:
1. Choose K (number of clusters).
2. Randomly select K points as centers.
3. Assign each data point to the nearest center.
4. Update centers by averaging points in each cluster.
5. Repeat until clusters don’t change much.

Example:
Grouping customers based on their spending patterns.
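
A minimal sketch (assuming scikit-learn, with made-up income and spending-score values) that groups customers into three clusters:

# Minimal K-Means sketch: group customers by (income, spending score).
from sklearn.cluster import KMeans

customers = [
    [15, 80], [16, 75], [18, 85],     # low income, high spending
    [60, 50], [62, 48], [65, 55],     # medium income, medium spending
    [95, 10], [100, 15], [98, 8],     # high income, low spending
]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)                 # cluster assigned to each customer
print(kmeans.cluster_centers_)        # centre (mean) of each cluster
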
3. Clustering High Dimensional Data
When data has hundreds or thousands of features (like in genetics or text data), clustering
becomes difficult.

Problems:

• Distance between points becomes meaningless.


• Memory and processing requirements are high.

Solutions:
Use special methods like CLIQUE and ProCLUS.

4. CLIQUE and ProCLUS


CLIQUE (Clustering in Quest):

• Divides the data space into small grids.


• Finds dense areas in some dimensions.
• Suitable for very large and high-dimensional datasets.

ProCLUS (Projected Clustering):

• Clusters data based only on relevant dimensions (not all features).


• Reduces noise by ignoring irrelevant dimensions.

Example:
In customer data, only "income" and "spending" may be important, ignoring "zip code."

5. Frequent Pattern-Based Clustering Methods


These methods use frequent itemsets to cluster data.

Steps:

1. Find frequent patterns.


2. Group data points having similar patterns.

Example:
Clustering customers based on commonly bought product groups.
6. Clustering in Non-Euclidean Space
Normally, we use Euclidean distance (straight-line distance) to measure similarity.
But sometimes, it’s not the best choice.

Other Distance Measures:

• Manhattan Distance
• Cosine Similarity
• Jaccard Similarity

Example:
In text documents, cosine similarity (based on angles) is better than simple distance.
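
A small sketch comparing Euclidean distance with cosine similarity for two toy word-count vectors; the second document is ten times longer but has the same topic mix:

# Cosine similarity vs. Euclidean distance for two word-count vectors.
import math

doc_a = [3, 0, 2, 1]        # word counts in a short document
doc_b = [30, 0, 20, 10]     # ten times longer, but the same topic mix

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(doc_a, doc_b)))
dot = sum(a * b for a, b in zip(doc_a, doc_b))
norm_a = math.sqrt(sum(a * a for a in doc_a))
norm_b = math.sqrt(sum(b * b for b in doc_b))
cosine = dot / (norm_a * norm_b)

print("Euclidean distance:", round(euclidean, 2))   # large, despite similar content
print("Cosine similarity:", round(cosine, 3))       # 1.0 -> identical direction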

7. Clustering for Streams and Parallelism


Clustering for streams means clustering data that arrives continuously.

Challenges:

• Cannot store all data.


• Need to update clusters in real-time.

Solution:

• Use micro-clusters that summarize the data.


• Merge or split clusters as new data arrives.

Parallelism:
To speed up clustering:

• Distribute the work across multiple processors/machines.


• Each processor works on a part of the data.

Example:
In sensor networks, cluster live data from different sensors in real-time using distributed
computing.
Summary Table
Topic | Meaning | Example
Mining Frequent Itemsets | Finding commonly occurring item groups | Bread and butter bought together
Market-Based Modelling | Using frequent patterns for marketing | Cross-selling related products
Apriori Algorithm | A step-by-step method to find frequent itemsets | Suggesting related products on Amazon
Handling Large Data | Managing huge data without overloading memory | Data partitioning, compression
Clustering | Grouping similar items together | Grouping similar customers
Hierarchical Clustering | Building a tree of clusters | Department- and team-wise employee grouping
K-Means Clustering | Dividing data into K clusters | Customer segmentation
High Dimensional Clustering | Special clustering when features are too many | Gene expression data analysis
Clustering in Streams | Real-time clustering of continuous data | Social media trend analysis
Parallel Clustering | Distributing clustering work across multiple machines | Big sensor network data clustering
UNIT Ⅴ

Frameworks and Visualization

Frameworks for Big Data Processing


Big data processing requires strong frameworks and tools. These frameworks help to store,
manage, and analyze huge amounts of data.

1. MapReduce

• MapReduce is a programming model for processing large datasets in parallel.


• It breaks down tasks into two main steps:
o Map: Process and filter data.
o Reduce: Aggregate results from the Map step.

Example: Counting the number of times each word appears in a large book.
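
A minimal plain-Python sketch of the word-count idea (not actual Hadoop code): the Map step emits (word, 1) pairs and the Reduce step sums them per word:

# Word count in the MapReduce style: Map emits (word, 1), Reduce sums per word.
from collections import defaultdict

lines = ["big data is big", "data analytics uses big data"]

# Map step: one (word, 1) pair per word occurrence
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle step: group values by key (word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: aggregate the values for each key
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'big': 3, 'data': 3, 'is': 1, 'analytics': 1, 'uses': 1}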

2. Hadoop

• Hadoop is an open-source framework that uses MapReduce and provides a system to store and process big data across many computers.
• Main components:
o HDFS (Hadoop Distributed File System): For storing big data.
o YARN (Yet Another Resource Negotiator): For managing resources and
scheduling tasks.

Example: Storing and processing Facebook’s user data.

3. Pig

• Pig is a high-level platform for creating MapReduce programs using a language called
Pig Latin.
• Easier than writing raw Java MapReduce code.

Example: Processing large logs to find error patterns.


4. Hive

• Hive allows querying large datasets using a language similar to SQL called HiveQL.
• It translates HiveQL queries into MapReduce jobs automatically.

Example: Running queries on petabytes of sales data.

5. HBase

• HBase is a NoSQL database that runs on top of HDFS.


• It stores huge amounts of structured data and supports real-time read/write access.

Example: Facebook uses HBase to store user messaging data.

6. MapR

• MapR provides a commercial distribution of Hadoop with improvements.


• Features:
o High performance
o Easy installation
o Better data protection

Example: Used by industries like healthcare, finance, and retail for big data analytics.

7. Sharding

• Sharding means splitting a large database into smaller, faster, easily manageable parts
called shards.
• Helps in managing large-scale databases efficiently.

Example: A large customer database split based on regions (Asia, Europe, America).
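
A minimal sketch (hypothetical shard names, simple hash-based routing) showing how a customer ID can always be mapped to the same shard:

# Hash-based shard routing: the same customer ID always lands on the same shard.
import hashlib

SHARDS = ["shard_asia", "shard_europe", "shard_america"]   # hypothetical shard names

def shard_for(customer_id):
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for cid in ("C1001", "C1002", "C1003"):
    print(cid, "->", shard_for(cid))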

8. NoSQL Databases

• NoSQL stands for "Not Only SQL."


• They handle unstructured and semi-structured data.
• Types of NoSQL databases:
o Document-based: (MongoDB)
o Key-Value store: (Redis)
o Column-based: (Cassandra)
o Graph-based: (Neo4j)

Example: Storing user profiles, social media data, IoT sensor readings.

9. S3 (Simple Storage Service)

• Amazon S3 is a cloud-based storage service.


• Stores and retrieves any amount of data from anywhere.
• Highly scalable, durable, and cost-effective.

Example: Netflix stores their movies and user data in Amazon S3.

10. Hadoop Distributed File System (HDFS)

• HDFS is the storage system of Hadoop.


• It splits large files into blocks and stores them across multiple machines.
• Provides fault tolerance — even if one machine fails, data is safe.

Example: Storing large volumes of e-commerce transaction data.

Visualization
Data Visualization means showing data in a graphical or pictorial format to help people
understand it better.

1. Visual Data Analysis Techniques

• Represent data visually before performing detailed analysis.


• Examples:
o Scatter Plots
o Bar Charts
o Heatmaps
o Pie Charts
Purpose: Helps identify patterns, outliers, and trends quickly.

2. Interaction Techniques

• Techniques that allow users to interact with data visualizations.


• Examples:
o Zooming: See data details by zooming into a graph.
o Filtering: Show only relevant data by applying filters.
o Drill-down: Click on a chart to get more detailed views.

Example: Clicking on a country on a world map to see detailed sales figures.

3. Systems and Applications for Visualization

• Tools and software used for creating visualizations:


o Tableau
o Power BI
o QlikView
o D3.js (JavaScript Library)

Applications:

• Business Intelligence (BI) dashboards


• Financial reporting
• Healthcare data monitoring

Introduction to R
R is a popular programming language for statistics, data analysis, and visualization.

1. R Graphical User Interfaces (GUI)

• GUIs make R easier to use, without writing too much code.


• Examples of R GUIs:
o RStudio: Most popular IDE for R.
o Rattle: GUI for data mining.
2. Data Import and Export

• Import: Bringing data into R from CSV, Excel, databases, etc.


• Export: Saving results back into files.

Example Code:

data <- read.csv("data.csv")      # Import data from a CSV file into a data frame
write.csv(data, "output.csv")     # Export the data frame back to a CSV file

3. Attribute and Data Types

• Attributes describe features of data.


• Data Types in R:
o Numeric (e.g., 1.5, 3.14)
o Integer (e.g., 2, 100)
o Character (e.g., "Hello")
o Factor (categorical data)
o Logical (TRUE, FALSE)

4. Descriptive Statistics

• Summary of data using numbers like:


o Mean (Average)
o Median (Middle value)
o Mode (Most frequent value)
o Standard Deviation (Measure of spread)

Example Code:

mean(data$Age)     # Average of the Age column
summary(data)      # Min, max, quartiles, and mean for every column

5. Exploratory Data Analysis (EDA)

• First step before detailed analysis.


• Involves:
o Checking missing data
o Finding distributions
o Spotting outliers

Example: Using histograms to see the age distribution of customers.

6. Visualization Before Analysis

• Plotting data helps in understanding its structure and spotting issues early.
• Common plots:
o Histograms
o Boxplots
o Scatterplots

Example Code:

hist(data$Salary)    # Histogram showing the distribution of salaries
boxplot(data$Age)    # Boxplot highlighting the spread and outliers in Age

7. Analytics for Unstructured Data

• Unstructured data = No fixed format (e.g., Text, Images, Videos).


• Techniques:
o Text Mining: Extract information from emails, tweets.
o Sentiment Analysis: Finding emotions in text (positive, negative).

Example: Analyzing customer reviews to find satisfaction levels.

Summary Table
Topic | Meaning | Example
MapReduce | Process large data in parallel | Word count in documents
Hadoop | Framework for big data storage and processing | Facebook data management
Pig | Easy scripting for MapReduce | Log file analysis
Hive | SQL for Hadoop | Sales report generation
HBase | NoSQL database on Hadoop | Storing messages
MapR | Commercial Hadoop with improvements | Financial data processing
Sharding | Splitting databases | Large customer database management
NoSQL | Non-relational databases | Storing social media data
S3 | Cloud storage | Netflix storing movies
Visualization | Showing data graphically | Sales dashboard
R | Statistical computing language | Data analysis, graphs creation
