
Third Year of Computer Engineering

(2019 Course)
310251: Data Science and Big Data
Analytics
Teaching Scheme: TH: 04 Hours/Week; Credit: 04
Examination Scheme: Mid-Semester (TH): 30 Marks; End-Semester (TH): 70 Marks
• Prerequisites Courses: Discrete Mathematics (210241), Database Management
Systems (310341)
• Companion Course: Big Data Analytics Laboratory (310256)
• Course Outcomes: After completion of the course, students should be able to:
• CO1: Analyze needs and challenges for Data Science and Big Data Analytics
• CO2: Apply statistics for Big Data Analytics
• CO3: Apply the lifecycle of Big Data Analytics to real-world problems
• CO4: Implement Big Data Analytics using Python programming
• CO5: Implement data visualization using visualization tools in Python programming
• CO6: Design and implement Big Databases using the Hadoop ecosystem
Features of Python
1. Free and Open Source
2. Easy to code
3. Easy to Read
4. Object-Oriented Language
5. GUI Programming Support
6. High-Level Language
7. Large Community Support
8. Easy to Debug
9. Python is a Portable language
10. Python is an Integrated language
11. Interpreted Language
12. Large Standard Library
13. Dynamically Typed Language
14. Frontend and backend development
15. Allocating Memory Dynamically
• Perform complex tasks using a few lines of code.
• Runs equally well on different platforms such as Windows, Linux, Unix, Macintosh, etc.
• Provides a vast range of libraries for various fields such as machine learning, web development, and scripting.
Advantages of Python
• Ease of programming
• Minimizes the time to develop and maintain
code
• Modular and object-oriented
• Large community of users
• A large standard and user-contributed library
Disadvantages of Python
• Interpreted and therefore slower than compiled
languages
• Decentralized packaging ecosystem, which can complicate dependency management
Python Libraries for Data Science
● TensorFlow ● PyTorch
● NumPy ● Scrapy
● SciPy ● BeautifulSoup
● Pandas ● LightGBM
● Matplotlib ● ELI5
● Keras ● Theano
● SciKit-Learn ● NuPIC
● PyBrain ● Ramp
● Caffe2 ● Pipenv
● Chainer ● Bob
Pandas
● Free software library for data analysis and data handling.
● It was created as a community library project and was initially released
around 2008.
● Pandas provide various high-performance and easy-to-use data structures and
operations for manipulating data in the form of numerical tables and time
series.
● Pandas also has multiple tools for reading and writing data between in-
memory data structures and different file formats.
● It is perfect for quick and easy data manipulation, data aggregation, reading,
and writing the data and data visualization.
● Pandas can also take in data from different types of files such as CSV, Excel,
etc., or a SQL database and create a Python object known as a data frame.
● A data frame contains rows and columns and it can be used for data
manipulation with operations such as join, merge, groupby, concatenate, etc.
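As a brief sketch of these operations (the column names and values below are invented for illustration):

import pandas as pd

# Build small data frames from Python dictionaries
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 200, 150, 250],
})
targets = pd.DataFrame({"region": ["East", "West"], "target": [300, 400]})

# Aggregate with groupby, then combine the two tables with merge
totals = sales.groupby("region", as_index=False)["amount"].sum()
report = totals.merge(targets, on="region")
print(report)

# Reading from a CSV file follows the same pattern, e.g. pd.read_csv("file.csv")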
Numpy
● It is a free Python software library for numerical computing on data that can
be in the form of large arrays and multi-dimensional matrices.
● These multidimensional matrices are the main objects in NumPy where their
dimensions are called axes and the number of axes is called a rank.
● NumPy also provides various tools to work with these arrays and high-level
mathematical functions to manipulate this data with linear algebra, Fourier
transforms, random number generation, etc.
● Some of the basic array operations that can be performed using NumPy
include adding, slicing, multiplying, flattening, reshaping, and indexing the
arrays.
● Other advanced operations include stacking arrays, splitting them into sections, broadcasting arrays, etc.
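A minimal sketch of the basic array operations listed above (the array values are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])    # a rank-2 array (two axes)
print(a[0, 1:3])                         # slicing -> [2 3]
print(a * 2)                             # element-wise multiplication
print(a.flatten())                       # flattening to a 1-D array
print(a.reshape(3, 2))                   # reshaping
b = np.vstack([a, a])                    # stacking arrays
top, bottom = np.split(b, 2)             # splitting into sections
print(a + np.array([10, 20, 30]))        # broadcasting a row across the array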
SciPy
● SciPy is a free software library for scientific computing and technical
computing of data.
● It was created as a community library project and was initially released
around 2001.
● SciPy library is built on the NumPy array object and it is part of the NumPy
stack which also includes other scientific computing libraries and tools such
as Matplotlib, SymPy, pandas, etc.
● This NumPy stack has a user base similar to those of comparable applications such as MATLAB, GNU Octave, and Scilab.
● SciPy allows for various scientific computing tasks that handle data
optimization, data integration, data interpolation, and data modification using
linear algebra, Fourier transforms, random number generation, special
functions, etc. Just like NumPy, the multidimensional matrices are the main
objects in SciPy, which are provided by the NumPy module itself.
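A small sketch of two of the scientific computing tasks mentioned above, numerical integration and optimization (the function being integrated and minimized is just an example):

import numpy as np
from scipy import integrate, optimize

def f(x):
    return np.exp(-x ** 2)                  # example integrand

area, abs_error = integrate.quad(f, 0, 1)   # numerical integration of f over [0, 1]
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)   # one-dimensional minimization
print(area, result.x)                       # result.x should be close to 3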
Scikit-learn
● Scikit-learn is a free software library for Machine Learning, written primarily in the Python programming language.
● It was initially developed as a Google Summer of Code project by David
Cournapeau and was originally released in June 2007.
● Scikit-learn is built on top of other Python libraries like NumPy, SciPy,
Matplotlib, Pandas, etc. and so it provides full interoperability with these
libraries.
● While Scikit-learn is written mainly in Python, it has also used Cython to
write some core algorithms in order to improve performance.
● You can implement various supervised and unsupervised Machine Learning models with Scikit-learn, such as Classification, Regression, Support Vector Machines, Random Forests, Nearest Neighbors, Naive Bayes, Decision Trees, and Clustering.
Tensorflow
● TensorFlow is a free end-to-end open-source platform that has a wide
variety of tools, libraries, and resources for Artificial Intelligence.
● It was developed by the Google Brain team and was initially released on
November 9, 2015.
● You can easily build and train Machine Learning models with high-level
APIs such as Keras using TensorFlow.
● It also provides multiple levels of abstraction so you can choose the option
you need for your model. TensorFlow also allows you to deploy Machine
Learning models anywhere such as the cloud, browser, or your own device.
● You should use TensorFlow Extended (TFX) if you want the full experience,
TensorFlow Lite if you want usage on mobile devices, and TensorFlow.js if
you want to train and deploy models in JavaScript environments.
● TensorFlow is available for Python and C APIs and also for C++, Java,
JavaScript, Go, Swift, etc. but without an API backward compatibility
guarantee. Third-party packages are also available for MATLAB, C#, Julia,
Scala, R, Rust, etc.
Keras
● Keras is a free and open-source neural network library written in Python. It
was primarily created by François Chollet, a Google engineer, and initially
released on 27 March 2015.
● Keras was created to be user-friendly, extensible, and modular while being
supportive of experimentation in deep neural networks. Hence, it can be run
on top of other libraries and languages like TensorFlow, Theano, Microsoft
Cognitive Toolkit, R, etc.
● Keras has multiple tools that make it easier to work with different types of
image and textual data for coding in deep neural networks.
● It also has various implementations of the building blocks for neural
networks such as layers, optimizers, activation functions, objectives, etc.
● You can perform various actions using Keras such as creating custom
function layers, writing functions with repeating code blocks that are
multiple layers deep, etc.
Python Libraries for Data Visualization
1. Matplotlib
● Matplotlib is a data visualization library and 2-D plotting library of Python
● It was initially released in 2003 and it is the most popular and widely-used
plotting library in the Python community.
● It comes with an interactive environment across multiple platforms.
● Matplotlib can be used in Python scripts, the Python and IPython shells, the
Jupyter Notebook, web application servers, etc.
● It can be used to embed plots into applications using various GUI toolkits
like Tkinter, GTK+, wxPython, Qt, etc. So you can use Matplotlib to create
plots, bar charts, pie charts, histograms, scatterplots, error charts,
power spectra, stemplots, and whatever visualization charts you want! The
Pyplot module also provides a MATLAB-like interface that is just as
versatile and useful as MATLAB while being totally free and open source.
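A minimal Pyplot sketch producing a line plot and a histogram (the data are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")      # simple line plot
plt.legend()
plt.title("Line plot with pyplot")

plt.figure()
plt.hist(np.random.randn(1000), bins=30)    # histogram of random data
plt.title("Histogram")
plt.show()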
2. Seaborn
● Seaborn is among the best data visualization libraries for Python; it is based on Matplotlib and closely integrated with the NumPy and Pandas data structures.
● Seaborn has various dataset-oriented plotting functions that operate on data
frames and arrays that have whole datasets within them.
● Then it internally performs the necessary statistical aggregation and mapping
functions to create informative plots that the user desires.
● It is a high-level interface for creating beautiful and informative statistical
graphics that are integral to exploring and understanding data.
● The Seaborn data graphics can include bar charts, pie charts, histograms,
scatterplots, error charts, etc. Seaborn also has various tools for choosing
color palettes that can reveal patterns in the data.
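A short sketch of the dataset-oriented API, using the "tips" example data frame that ships with Seaborn (it is fetched on first use):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")             # built-in example data frame
# Pass the whole data frame and map columns to plot roles
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()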
3. Plotly
● Plotly is a free open-source graphing library that can be used to form data
visualizations.
● Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and
can be used to create web-based data visualizations that can be displayed in
Jupyter notebooks or web applications using Dash or saved as individual
HTML files.
● Plotly provides more than 40 unique chart types like scatter plots,
histograms, line charts, bar charts, pie charts, error bars, box plots, multiple
axes, sparklines, dendrograms, 3-D charts, etc.
● Plotly also provides contour plots, which are not that common in other data
visualization libraries. In addition to all this, Plotly can be used offline with
no internet connection.
4. Ggplot
● Ggplot is a Python data visualization library that is based on the
implementation of ggplot2 which is created for the programming language R.
● Ggplot can create data visualizations such as bar charts, pie charts,
histograms, scatterplots, error charts, etc. using high-level API. It also allows
you to add different types of data visualization components or layers in a
single visualization.
● Once ggplot has been told which variables to map to which aesthetics in the
plot, it does the rest of the work so that the user can focus on interpreting the
visualizations and take less time to create them.
● But this also means that it is not possible to create highly customized
graphics in ggplot. Ggplot is also deeply connected with pandas so it is best
to keep the data in DataFrames.
• TensorFlow: This library was developed by Google in collaboration with the Brain
Team. It is an open-source library used for high-level computations. It is also used
in machine learning and deep learning algorithms. It contains a large number of
tensor operations. Researchers also use this Python library to solve complex
computations in Mathematics and Physics.
• Matplotlib: This library is responsible for plotting numerical data, and that's why it is used in data analysis. It is also an open-source library and plots high-defined figures like pie charts, histograms, scatterplots, graphs, etc.
• Pandas: Pandas is an important library for data scientists. It is an open-source
machine learning library that provides flexible high-level data structures and a
variety of analysis tools. It eases data analysis, data manipulation, and cleaning of
data. Pandas support operations like Sorting, Re-indexing, Iteration, Concatenation,
Conversion of data, Visualizations, Aggregations, etc.
• Numpy: The name “Numpy” stands for “Numerical Python”. It is a commonly used library. It is a popular machine learning library that supports large matrices
and multi-dimensional data. It consists of in-built mathematical functions for easy
computations. Even libraries like TensorFlow use Numpy internally to perform
several operations on tensors. Array Interface is one of the key features of this
library.
• SciPy: The name “SciPy” stands for “Scientific Python”. It is an open-source
library used for high-level scientific computations. This library is built over an
extension of Numpy. It works with Numpy to handle complex computations. While Numpy provides sorting and indexing of array data, the higher-level numerical routines are provided by SciPy. It is also widely used by application developers and engineers.
• Scrapy: It is an open-source library that is used for extracting data from websites. It
provides very fast web crawling and high-level screen scraping. It can also be used
for data mining and automated testing of data.
• Scikit-learn: It is a famous Python library to work with complex data. Scikit-learn
is an open-source library that supports machine learning. It supports various supervised and unsupervised algorithms like linear regression, classification,
clustering, etc. This library works in association with Numpy and SciPy.
• PyGame: This library provides an easy interface to the Simple DirectMedia Layer (SDL) platform-independent graphics, audio, and input libraries. It is used
for developing video games using computer graphics and audio libraries along with
Python programming language.
• PyTorch: PyTorch is the largest machine learning library that optimizes tensor
computations. It has rich APIs to perform tensor computations with strong GPU
acceleration. It also helps to solve application issues related to neural networks.
• PyBrain: The name “PyBrain” stands for Python Based Reinforcement Learning,
Artificial Intelligence, and Neural Networks library. It is an open-source library
built for beginners in the field of Machine Learning. It provides fast and easy-to-use algorithms for machine learning tasks. It is flexible and easy to understand, which is why it is really helpful for developers who are new to research fields.
Install Scikit-Learn in Python
To represent data
Building the Model
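A minimal sketch of these steps: installing the library, representing the data as a feature matrix X and a label vector y, and building a model. The Iris dataset and the k-nearest-neighbors classifier are illustrative choices.

# Install (run in a shell): pip install scikit-learn

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                  # represent data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)        # build the model
model.fit(X_train, y_train)                        # train it on the training split
print(model.score(X_test, y_test))                 # accuracy on the test split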
• Data Preprocessing in Pandas
Data Preprocessing
● Real-world datasets are generally messy, raw, incomplete, inconsistent,
and unusable. It can contain manual entry errors, missing values,
inconsistent schema, etc.
● Data Preprocessing is the process of converting or mapping data from the
initial “raw” form into another in order to prepare for data analysis that is
understandable and usable.
● It is a crucial step in any Data Science project to carry out an efficient and
accurate analysis. It ensures that data quality is consistent before applying
any Machine Learning or Data Mining techniques
• Data preprocessing is a crucial step in the data analysis pipeline, involving various
techniques to clean, transform, and prepare data for further analysis.
1. Removing Duplicates: Duplicate rows can skew analysis results and should be removed to ensure accuracy. This can be done using functions like drop_duplicates() in Python pandas or SQL queries to filter out duplicate rows.
2. Transformation of Data using Function or Mapping: Data transformation
involves converting raw data into a more suitable format for analysis. This can
include scaling, normalization, or applying mathematical functions to the data.
Functions or mapping can be used for this purpose, such as the apply() function in
pandas or using lambda functions.
3. Replacing Values: Sometimes, data may contain missing or erroneous values that
need to be replaced. This can be done using techniques like imputation (replacing
missing values with a statistical measure like mean or median), or replacing specific
values based on domain knowledge. Functions like fillna() in pandas can be used
for imputation.
4. Handling Missing Data: Missing data is a common issue in datasets and needs to
be handled appropriately. Techniques for handling missing data include removing
rows with missing values, imputation (replacing missing values), or using advanced
algorithms like k-nearest neighbors to impute missing values based on similar
observations.
Removing Duplicates with drop_duplicates()
• By default, drop_duplicates() keeps the first occurrence of each duplicated row and removes subsequent duplicates.
• You can also specify which subset of columns to consider for identifying duplicates using the subset parameter.
• This removes duplicates based on values in column 'A' only, as in the sketch below.
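A short sketch with an invented data frame:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["x", "x", "y", "z"]})

print(df.drop_duplicates())                 # drops the second fully duplicated row
print(df.drop_duplicates(subset=["A"]))     # duplicates judged on column 'A' only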
Transformation of Data using Function or Mapping
• In pandas, you can transform data using functions or mapping by utilizing the apply() function or the map() function. Here's how you can use each:
Using apply() function:
• The apply() function in pandas can be used to apply a function
along an axis of the DataFrame or Series.
• This is particularly useful for applying custom functions to
each element, row, or column of the DataFrame.
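For example (the data frame below is hypothetical):

import pandas as pd

df = pd.DataFrame({"price": [100, 250, 80], "qty": [2, 1, 5]})

# Apply a function element-wise to one column
df["price_with_tax"] = df["price"].apply(lambda p: p * 1.18)

# Apply a function across each row (axis=1)
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)
print(df)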
• Using map() function:
• The map() function in pandas can be used to apply a mapping
or a dictionary to each element of a Series.
• It is useful for element-wise transformations or for replacing
values based on a mapping
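For example (the Series and mapping below are hypothetical):

import pandas as pd

s = pd.Series(["small", "medium", "large", "medium"])

# Replace values based on a dictionary mapping
sizes = s.map({"small": 1, "medium": 2, "large": 3})

# Element-wise transformation with a function
upper = s.map(str.upper)
print(sizes.tolist(), upper.tolist())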
Replacing missing value
• replacing missing values with a statistical
measure like mean or median
Handling erroneous value
Handling missing value using fillna()
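A short sketch covering mean imputation with fillna() and correcting an erroneous value with replace() (the data are invented):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, -1, 35]})

# Treat the impossible value -1 as erroneous and mark it as missing
df["age"] = df["age"].replace(-1, np.nan)

# Impute missing values with a statistical measure (mean here; median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())
print(df)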
Analytics Types
• Analytics types refer to different approaches or methodologies
used to analyze data and derive insights from it.
• BA (Business Analytics) is an iterative, methodical exploration of an organization's data that focuses on statistical analysis to make data-driven decisions.
• The three main types of analytics are:
1. Descriptive Analytics
2. Predictive Analytics
3. Prescriptive Analytics
Advantages of BA
• Conduct data mining.
• Complete statistical analysis and quantitative analysis to
explain why certain results occur
• Test previous decisions using A/B testing and multivariate
testing.
• Make use of predictive modeling and predictive analytics to
forecast future results.
Challenges with BA
• Executive ownership: requires buy–in from senior leadership and a
clear corporate strategy for predictive analysis.
• IT involvement: Infrastructure ,Tools
• Available production data vs Cleansed modeling data: know the
difference between historical data for model development and real-
time data in production.
• Project management office: The correct project management structure should be adopted for an agile approach.
• End-User involvement:
• Change Management: organization should be prepared for changes
that BA brings to current business and technology operations.

BA Process
1. Define Objectives: Clearly define the goals of your analysis. Whether it's
improving operational efficiency, increasing sales, reducing costs, or
understanding customer behavior, having well-defined objectives is
crucial.
2. Gather Data: Collect relevant data from various sources. This could
include internal sources such as sales records, customer databases, and
operational logs, as well as external sources like market research reports,
industry benchmarks, and social media analytics.
3. Data Cleaning and Preprocessing: Cleanse the data to remove errors,
duplicates, and inconsistencies. Preprocess the data by transforming,
aggregating, and structuring it in a format suitable for analysis. This step
is essential for ensuring data quality and reliability.
4. Exploratory Data Analysis (EDA): Explore the data to gain initial
insights and identify patterns, trends, and relationships. Techniques such
as summary statistics, data visualization, and correlation analysis can help
uncover valuable information hidden within the data.
5. Hypothesis Testing: Formulate hypotheses based on the insights gained from EDA and test them
using statistical methods. This step helps validate assumptions and determine the significance of
relationships between variables.
6. Predictive Modeling: Build predictive models to forecast future outcomes or trends. Techniques
such as regression analysis, time series forecasting, and machine learning algorithms can be used
to develop models that predict customer behavior, sales trends, or demand forecasting.
7. Data Interpretation: Interpret the results of your analysis in the context of your business objectives. What do the findings mean for your business? How can you leverage these insights to drive actionable decisions?
8. Actionable Insights and Recommendations: Based on the analysis results, formulate actionable insights and recommendations to address business challenges or capitalize on opportunities. These recommendations should be practical, data-driven, and aligned with the organization's goals and priorities.
9. Implementation and Monitoring: Implement the recommended strategies or changes in business processes. Monitor the outcomes to assess the effectiveness of the implemented solutions and iterate as necessary. Continuous monitoring ensures that the business remains agile and responsive to changing market conditions.
10. Feedback Loop: Establish a feedback loop to incorporate learnings from the analysis into future decision-making processes. This involves capturing lessons learned, refining analytical methodologies, and continuously improving the business analytics practice.
• Descriptive Analytics:
– Descriptive analytics focuses on understanding historical data to
describe what has happened in the past. It involves summarizing and
interpreting data to identify patterns, trends, and relationships.
Descriptive analytics answers questions such as "What happened?" and
"What are the key characteristics of the data?"
– Techniques used in descriptive analytics include summary statistics,
data visualization, dashboards, and exploratory data analysis (EDA).
These techniques help in gaining insights into historical performance
and understanding the current state of affairs.
• Characteristics: Descriptive analytics relies heavily on data visualization
techniques such as charts, graphs, and dashboards to present historical data
in a meaningful and easy-to-understand format.
• Example: Sales reports, customer segmentation analysis, and financial
performance summaries are common examples of descriptive analytics.
• Predictive Analytics:
– Predictive analytics involves using historical data and statistical
algorithms to forecast future outcomes or trends. It aims to answer
questions like "What is likely to happen?" or "What will be the future
behavior?"
– Predictive analytics models are built using techniques such as
regression analysis, machine learning algorithms (e.g., decision trees,
random forests, neural networks), time series analysis, and data mining.
These models use historical data to make predictions about future
events or behaviors.
• Characteristics: Predictive analytics involves the use of statistical
algorithms, machine learning models, and data mining techniques to
identify patterns and relationships in historical data and make predictions
about future events.
• Example: Demand forecasting, customer churn prediction, and sales
forecasting are examples of predictive analytics applications.
• Prescriptive Analytics:
– Prescriptive analytics goes beyond descriptive and predictive analytics by
recommending actions to optimize outcomes or achieve specific goals. It
aims to answer questions like "What should we do?" or "What actions
should be taken to achieve a desired outcome?"
– Prescriptive analytics combines historical data, predictive models,
optimization techniques, and business rules to generate actionable
recommendations. It considers various constraints and objectives to
provide decision-makers with insights on the best course of action to take.
• Characteristics: Prescriptive analytics combines insights from descriptive and
predictive analytics with optimization and simulation techniques to provide
actionable recommendations. It helps decision-makers understand the potential
impact of different courses of action and choose the best possible option.
• Example: Supply chain optimization, dynamic pricing strategies, and resource
allocation decisions are examples of prescriptive analytics applications.
• Each type of analytics has its own unique purpose and use
cases:
• Descriptive analytics helps in understanding past
performance and identifying patterns.
• Predictive analytics assists in making informed decisions by
forecasting future outcomes.
• Prescriptive analytics guides decision-making by
recommending optimal actions to achieve desired outcomes.
• Organizations often leverage a combination of these
analytics types to gain comprehensive insights into their
data and drive better decision-making across various
domains such as business, finance, healthcare, and
marketing.
Apriori algorithm
• The Apriori algorithm is a classic algorithm in data mining and
association rule learning.
• It is used to identify frequent item sets and generate
association rules based on these item sets.
• The algorithm is particularly useful for market basket
analysis, which involves discovering patterns in customer
purchase behavior.
Working of Apriori
• Support Counting: The algorithm starts by counting the support
(frequency) of each item in the dataset. Support refers to the number of
transactions that contain a particular item set.
• Generating Candidate Item Sets: Based on the minimum support
threshold set by the user, the algorithm generates candidate item sets. These
are sets of items that are likely to be frequent based on the support count.
• Pruning Infrequent Item Sets: Candidate item sets that do not meet the
minimum support threshold are pruned, as they are unlikely to be frequent
in the dataset.
• Generating Association Rules: Once frequent item sets are identified, the
algorithm generates association rules. These rules describe the relationships
between different items or item sets in the dataset. The rules are typically of
the form "if {A} then {B}", indicating that if item set A is present, item set
B is also likely to be present.
• Evaluating Rules: The generated rules are evaluated based on metrics such
as confidence and lift to determine their strength and relevance.
• Support:
– Support measures the frequency of occurrence of an item set in a
dataset.
– Mathematically, it is calculated as the number of transactions
containing the item set divided by the total number of transactions in
the dataset.
– A high support value indicates that the item set is frequently present in
the dataset, making it a potentially interesting pattern.
– For example, if "Item A" has a support of 0.2, it means that 20% of the
transactions in the dataset contain "Item A".
• Confidence:
– Confidence measures the reliability or strength of an association rule.
– Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
– A high confidence value indicates that the consequent is likely to be
present in transactions where the antecedent is present, making the rule
more reliable.(B is present where there is A)
– For example, if an association rule has a confidence of 0.8, it means
that 80% of the transactions containing the antecedent also contain the
consequent.
Example of the Apriori Algorithm
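A small stand-in sketch that counts the support of candidate item sets over five invented transactions and derives the confidence of one rule:

from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Candidate single items and pairs, pruned by a minimum support of 0.4
items = {i for t in transactions for i in t}
for size in (1, 2):
    for candidate in combinations(sorted(items), size):
        s = support(set(candidate))
        if s >= 0.4:
            print(candidate, round(s, 2))

# Confidence of the rule {bread} -> {milk} = Support({bread, milk}) / Support({bread})
print(support({"bread", "milk"}) / support({"bread"}))   # 0.6 / 0.8 = 0.75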
Limitations of the Apriori Algorithm
• Memory and Computationally Intensive
• High Number of Candidate Item Sets
• Fixed Minimum Support Threshold
• Doesn't Handle Noise or Sparse Data Well
• Inefficient for Large Datasets
• Doesn't Capture Item Order
Improving Apriori Algorithm
• Pruning Strategies:
– Apriori Property: Utilize the apriori property, which states that if an
itemset is infrequent, all its supersets are also infrequent. This reduces
the number of candidate item sets to be generated and counted.
– Hash-Based Pruning: Use hash-based techniques to prune infrequent item sets efficiently, especially for large datasets with a high number of items. A hash table stores item sets as keys and their number of appearances as values; counts are initialized to zero and incremented with each occurrence.
• Transaction Reduction:
– Remove transactions that do not contain any frequent items. This
reduces the size of the dataset and speeds up the support counting
process.
• Efficient Data Structures:
– Use efficient data structures like hash trees, bitmaps, or inverted
indexes to store and manipulate itemsets and transaction information.
These data structures can reduce memory usage and improve
computational efficiency.
• Dynamic Minimum Support Threshold:
– Adjust the minimum support threshold dynamically based on the size
of the dataset or the number of frequent item sets discovered. This can
help in focusing computational efforts on more relevant patterns.
• Parallelization:
– Implement parallel processing techniques to distribute the computation
of candidate item sets and support counting across multiple processors
or nodes. This can significantly reduce execution time for large
datasets.
• Sampling:
– Use sampling techniques to work with a subset of the dataset,
especially for very large datasets. Sampling can provide approximate
results while reducing computational overhead.
• Memory Management:
– Optimize memory usage by storing only necessary
information and discarding intermediate results or data
structures that are no longer needed.
• Alternative Algorithms:
– Consider using alternative algorithms like FP-Growth,
which addresses some of the limitations of Apriori,
such as multiple passes through the data and the
generation of a large number of candidate item sets.
FP Growth
• FP-Growth was introduced by Han, Pei & Yin in 2000 to eliminate candidate generation.
• FP-Growth (Frequent Pattern Growth) is an algorithm used in
data mining for finding frequent item sets in large databases.
• It is an alternative to the Apriori algorithm
• It uses a tree-like data structure and a divide-and-conquer strategy to find frequent item sets.
• It represents the database in a tree form called the FP-tree, which maintains the associations between the item sets.
• The FP-tree is a compact data structure.
FP- Growth
• FP-Tree Construction:
– FP-Growth constructs a special data structure called the FP-tree (Frequent Pattern tree)
from the transaction database. This tree structure efficiently stores the frequency of
itemsets and their relationships in a compact manner.
– The FP-tree is built by scanning the database once and counting the support of each item.
• Mining Frequent Item Sets:
– After constructing the FP-tree, frequent item sets are mined directly from the tree
without generating candidate item sets explicitly.
– The algorithm recursively explores the FP-tree to find frequent item sets using a
technique called "pattern growth" or "conditional pattern base".
• Conditional Pattern Base:
– The conditional pattern base is a set of conditional patterns associated with each item
in the FP-tree. It represents the projected transactions of an item after removing its prefix
path from the FP-tree.
– FP-Growth recursively constructs conditional FP-trees for each item in the FP-tree and
mines frequent item sets from these conditional FP-trees.
• Mining Association Rules:
– Once frequent item sets are discovered, association rules can be generated using standard
techniques such as calculating confidence.
Example of FP-Growth
Consider the following example with min-support = 3:
1. Find the support of each item.
2. Generate the frequent pattern set (L) containing the items whose frequency is at least the minimum support.
3. For each transaction, prepare the ordered item set (items sorted by descending support).
4. Construct the FP-tree.
5. Generate the frequent patterns.
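A sketch of the same workflow using the third-party mlxtend package (assuming its TransactionEncoder and fpgrowth helpers; the transactions are invented):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
    ["bread", "milk"],
]

# One-hot encode the transactions into a boolean data frame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# A minimum support of 3 out of 5 transactions corresponds to min_support=0.6
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent)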
Regression
• Regression is a statistical method used to examine the
relationship between one or more independent variables
(predictors) and a dependent variable (outcome).
• It is commonly used for predictive modeling and
understanding the impact of predictors on the target variable.
There are various types of regression models, including linear
regression, logistic regression, polynomial regression, ridge
regression, and lasso regression, etc.
• Each type of regression has its own assumptions, advantages,
and limitations, making it suitable for different types of data
and research questions
Linear Regression
• This is the most basic form of regression, where the
relationship between the independent variables and the
dependent variable is assumed to be linear.
• It is used when the target variable is continuous.
• Equation of Linear Regression: Y = mX + b
• Y is the dependent variable (the outcome being predicted)
• X is the independent variable (the predictor)
• m is the slope (how much Y changes for a unit change in X)
• b is the intercept (the value of Y when X = 0)
• For a least-squares fit, m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − m·x̄
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_boston   # note: load_boston was removed in scikit-learn 1.2; fetch_california_housing is an alternative
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset
boston_data = load_boston()
boston_df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
boston_df['PRICE'] = boston_data.target

# Split the data into training and testing sets
X = boston_df.drop('PRICE', axis=1)
y = boston_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Logistic regression
• Logistic Regression is a statistical method used for binary classification tasks (we classify each instance as either class 0 or class 1).
• The goal is to predict the probability that an instance belongs
to a particular class (e.g., positive or negative, spam or not
spam).
• For example, we train the model on "study hours" and, depending on the study hours, classify whether the student is pass [1] or fail [0].
• Study hours is the independent variable and pass/fail is the dependent variable.
• Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which
cannot go beyond this limit, so it forms a curve like the “S” form.
• The S-form curve is called the Sigmoid function or the logistic
function.
• In logistic regression, we use the concept of a threshold value, which separates the two classes: values above the threshold tend towards class 1, and values below the threshold tend towards class 0.
Logistic regression
Formula for the Sigmoid function: σ(z) = 1 / (1 + e^(−z))
Example of Logistic Regression
• z = −64 + 2 × hours
• Q1. Find the probability that a student will pass after studying for 33 hours.
• z = −64 + 2 × 33 = 2
• P = 1 / (1 + e^(−z)) = 1 / (1 + e^(−2)) ≈ 0.88
• i.e., the student passes with a probability of about 88%.
• Q2. How many hours does a student have to study to get a passing probability of 95%?
• P = 1 / (1 + e^(−z)); putting P = 0.95 gives z = ln(0.95 / 0.05) ≈ 2.94, so hours = (z + 64) / 2 ≈ 33.5.
• So the student has to study for about 33.5 hours to get a passing probability of 95%.
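A minimal scikit-learn sketch of the pass/fail example (the study-hours data below are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: study hours -> pass (1) / fail (0)
hours = np.array([[10], [20], [25], [30], [32], [35], [40], [45]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing for a student who studies 33 hours
print(model.predict_proba([[33]])[0, 1])
print(model.predict([[33]]))               # class label using the default 0.5 threshold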
classification
• A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes”.
• Types of Classification
• Binary Classification
• In binary classification, the goal is to classify the input into one of
two classes or categories. Example – On the basis of the given
health conditions of a person, we have to determine whether the
person has a certain disease or not.
• Multiclass Classification
• In multi-class classification, the goal is to classify the input into one of several classes or categories. For example – on the basis of data about different species of flowers, we have to determine which species our observation belongs to.
• There are various types of classifiers algorithms. Some of them are :
• Linear Classifiers
• Linear models create a linear decision boundary between classes. They are simple
and computationally efficient. Some of the linear classification models are as
follows:
• Logistic Regression
• Support Vector Machines having kernel = 'linear'
• Single-layer Perceptron
• Stochastic Gradient Descent (SGD) Classifier
• Non-linear Classifiers
• Non-linear models create a non-linear decision boundary between classes. They can
capture more complex relationships between the input features and the target
variable. Some of the non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Ensemble learning classifiers:
• Random Forests,
• AdaBoost,
• Bagging Classifier,
• Voting Classifier,
• ExtraTrees Classifier
• Multi-layer Artificial Neural Networks
Metrics for Classification
• Classification Accuracy: The proportion of correctly classified instances over the total
number of instances in the test set. It is a simple and intuitive metric but can be misleading in
imbalanced datasets where the majority class dominates the accuracy score.
• Confusion matrix: A table that shows the number of true positives, true negatives, false
positives, and false negatives for each class, which can be used to calculate various evaluation
metrics.
• Precision and Recall: Precision measures the proportion of true positives over the total
number of predicted positives, while recall measures the proportion of true positives over the
total number of actual positives. These metrics are useful in scenarios where one class is more
important than the other, or when there is a trade-off between false positives and false
negatives.
• F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) /
(precision + recall). It is a useful metric for imbalanced datasets where both precision and
recall are important.
• ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the
true positive rate (recall) against the false positive rate (1-specificity) for different threshold
values of the classifier's decision function. The Area Under the Curve (AUC) measures the
overall performance of the classifier, with values ranging from 0.5 (random guessing) to 1
(perfect classification).
• Cross-validation: A technique that divides the data into multiple folds and trains the model
on each fold while testing on the others, to obtain a more robust estimate of the model's
performance.
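A brief sketch computing these metrics with scikit-learn, assuming y_true and y_pred are the true and predicted labels of some classifier and y_score holds its predicted probabilities (the values below are illustrative):

from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # probabilities for class 1

print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))                  # area under the ROC curve

# Cross-validation of an estimator clf on features X and labels y (not defined here):
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(clf, X, y, cv=5)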
• Overfitting and Underfitting: Classification models are susceptible to overfitting and underfitting. Overfitting occurs when the model learns the training data too well and fails to generalize to new data; underfitting occurs when the model is too simple to capture the underlying patterns in the data.
Working Model
Naïve Bayes Classification
• Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes' theorem with an assumption of independence between features. Here's a breakdown of how it works:
• Types of Naive Bayes Classifiers:
– Gaussian Naive Bayes: Assumes that features follow a
Gaussian (normal) distribution.
– Multinomial Naive Bayes: Suitable for features with
discrete counts, commonly used in text classification with
word counts.
– Bernoulli Naive Bayes: Assumes features are binary (0 or
1), often used for text classification with binary features.
• Advantages of Naive Bayes:
– Fast and simple algorithm.
– Works well with high-dimensional data.
– Performs well with small training datasets.
– Can handle missing values.
• Limitations of Naive Bayes:
– Strong assumption of feature independence may not hold
in real-world data.
– Sensitivity to irrelevant features.
– Limited ability to capture complex relationships in data
compared to more advanced models like decision trees or
neural networks.
Example of Naïve Bayes classification
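As an illustrative stand-in, a Gaussian Naive Bayes classifier on the Iris dataset (the dataset choice is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian Naive Bayes: applies Bayes' theorem assuming each feature is
# normally distributed and independent of the others within each class
nb = GaussianNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))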
Decision Tree
• A decision tree is a supervised machine learning algorithm
used for both classification and regression tasks.
• Decision tree uses the tree representation to solve the
problem in which each leaf node corresponds to a class label
and attributes are represented on the internal node of the
tree
Terminology of decision tree
• Root Node: A decision tree's root node, which represents the original choice or feature from which the tree branches, is the highest node.
• Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by
the values of particular attributes. There are branches on these nodes that go to other
nodes.
• Leaf Nodes (Terminal Nodes): The endpoints of the branches, where the final decisions or predictions are made. There are no further branches below leaf nodes.
• Branches (Edges): Links between nodes that show how decisions are made in response
to particular circumstances.
• Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of data.
• Parent Node: A node that is split into child nodes. The original node from which a split
originates.
• Child Node: Nodes created as a result of a split from a parent node.
• Decision Criterion: The rule or condition used to determine how the data should be split
at a decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve its
generalisation and prevent overfitting.
• Decision Tree Algorithms:
– ID3 (Iterative Dichotomiser 3): Uses information gain as
the splitting criterion, suitable for categorical data.
– C4.5 (Successor of ID3): Extends ID3 by handling both
categorical and numerical data and incorporating pruning.
– CART (Classification and Regression Trees): Uses Gini
impurity for classification and mean squared error for
regression. It can handle both categorical and numerical
features.
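A brief scikit-learn sketch of a CART-style decision tree classifier (the Iris dataset and the chosen parameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" (default) or "entropy"; max_depth limits tree depth
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))                      # accuracy on the test split
print(export_text(tree, feature_names=list(load_iris().feature_names)))   # text view of the tree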
• Advantages of Decision Trees:
– Easy to understand and interpret.
– Can handle both numerical and categorical data.
– No need for feature scaling.
– Captures non-linear relationships in the data.
– Can be visualized for better insights.
• Limitations of Decision Trees:
– Prone to overfitting, especially with deep trees.
– Sensitive to small variations in the data.
– Biased towards features with more levels or categories.
– May not generalize well to unseen data if not properly
pruned or regularized.
• Entropy: Entropy is a measure of impurity or disorder in a set of data. In the context of decision trees, it quantifies the uncertainty of a random variable, which in this case is the class label of the data points.
Example
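For a set S with class proportions pᵢ, Entropy(S) = −Σ pᵢ log₂(pᵢ). A small sketch of the calculation (the label lists are invented):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

print(entropy(["yes", "yes", "no", "no"]))        # 1.0 -> maximum disorder for two classes
print(entropy(["yes", "yes", "yes", "yes"]))      # 0.0 -> a pure node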
