Fundamentals of Data Science                  Unit-1
    Data Mining
                         Akshatha B Rai
                           Asst Professor
                Dept of Computer Science
              St Philomena College Puttur
Topics Covered…….
   Data Mining History
   Data Mining Introduction
   Data Mining Definitions
   Pros & cons
   Knowledge Discovery in Databases(KDD)
   KDD Vs Data Mining
   DBMS Vs Data Mining
   Data Mining Techniques
   Problems, Issues and Challenges
   Data Mining Applications
Data Mining History
   The term "Data Mining" was introduced in the 1990s, but data
    mining evolved from a field with a long history.
   Early techniques for identifying patterns in data include Bayes'
    theorem (1700s) and regression analysis (1800s).
   The spread and growing power of computing have boosted data
    collection, storage, and manipulation as data sets have grown
    in size and complexity. Manual, hands-on data analysis has
    progressively been augmented with indirect, automatic data
    processing and with other computer science discoveries such as
    neural networks, clustering, genetic algorithms (1950s),
    decision trees (1960s), and support vector machines (1990s).
   1989 The term “Knowledge Discovery in Databases” (KDD) is
    coined by Gregory Piatetsky-Shapiro.
   Gregory Piatetsky-Shapiro coined the term
    "knowledge discovery in databases" for the first
    workshop on the same topic (KDD-1989) and this
    term became more popular in the AI and machine
    learning communities. However, the term "data
    mining" became more popular in the business and
    press communities.
   1990s: The term “data mining” appeared in the
    database community. Retail companies and the
    financial community began using data mining to
    analyze data and recognize trends in order to increase
    their customer base and to predict fluctuations in
    interest rates, stock prices, and customer demand.
Data Mining Introduction
   Data mining is the process of extracting useful
    information from large sets of data. It involves using
    various techniques from statistics, machine learning,
    and database systems to identify patterns,
    relationships, and trends in the data.
   “Data Mining” can be referred to as knowledge
    mining from data, knowledge extraction,
    data/pattern analysis, data archaeology, and
    data dredging.
   Data Mining emerged from the convergence of three
    Scientific Disciplines: Artificial Intelligence,
    Machine Learning, and Statistics.
Definitions..
   Data mining, or knowledge discovery in databases as it is also
    known, is the non-trivial extraction of implicit, previously
    unknown, and potentially useful information from data. This
    encompasses a number of technical approaches, such as
    clustering, data summarization, classification, finding
    dependency networks, analyzing changes, and detecting
    anomalies.
   Data mining is the process of discovering meaningful new
    correlations, patterns, and trends by sifting through large amounts
    of data stored in repositories, using pattern recognition
    techniques as well as statistical and mathematical techniques.
Steps for Data Mining
   Problem Definition
   Data Collection
   Data Cleaning
   Exploratory Data Analysis
   Model Building
   Model Evaluation
   Interpretation and deployment
Data mining Vs Data Science
Data Mining                             Data Science

Data mining is a process of             Data science refers to the process
extracting useful information,          of obtaining valuable insights from
patterns, and trends from huge          structured and unstructured data by
databases.                              using various tools and methods.

Data mining is a technique.             Data science is a field.

Primarily used for business purposes.   Primarily used for scientific
                                        purposes.

It is involved with the process.        It emphasizes the science of the
                                        data.

Data mining aims to make data more      The objective of data science is to
important and usable; it means          create a dominant data product.
extracting only useful information.

Data mining is a technique that is a    It is related to the field of study like
part of KDD (Knowledge discovery in     Mechanical engineering, Cloud
database process).                      architecture, etc.

It primarily deals with structured      It deals with any kind of data like
data.                                   structured, semi-structured, and
Real life example…..
   Market Basket Analysis: This technique studies the purchases
    made by customers in a supermarket in order to identify the
    items that are bought together. For example, if a person buys
    bread, what are the chances that he/she will also purchase
    butter? Companies use this analysis to design offers and
    deals, and data mining is what makes it possible.
   Protein Folding: Data mining is applied to biological data
    to predict protein interactions and functionality within
    biological cells. Applications of this research include
    determining causes and possible cures for
    Alzheimer’s, Parkinson’s, and cancers caused by
    protein misfolding.
   Fraud Detection: In this age of cell phones, we can
    use data mining to analyze cell phone activity and
    flag usage that looks suspicious. This can help us
    to detect calls made on cloned phones. Similarly,
    with credit cards, comparing new purchases with
    historical purchases can detect activity with
    stolen cards.
Advantages of Data Mining
   Improved decision-making: Data mining can provide
    valuable insights that can help organizations make
    better decisions by identifying patterns and trends in
    large data sets.
   Increased efficiency: Data mining can automate
    repetitive and time-consuming tasks, such as data
    cleaning and preparation, which can help organizations
    save time and resources.
   Enhanced competitiveness: Data mining can help
    organizations gain a competitive edge by uncovering
    new business opportunities and identifying areas for
    improvement.
   Improved customer service: Data mining
    can help organizations better understand
    their customers and tailor their products and
    services to meet their needs.
   Fraud detection: Data mining can be used
    to identify fraudulent activities by detecting
    unusual patterns and anomalies in data.
   Predictive modeling: Data mining can be used to
    build models that can predict future events and
    trends, which can be used to make proactive
    decisions.
    New product development: Data mining can be
    used to identify new product opportunities by
    analyzing customer purchase patterns and
    preferences.
    Risk management: Data mining can be used to
    identify potential risks by analyzing data on
    customer behavior, market conditions, and other
    factors.
Knowledge Discovery in
Database(KDD)
KDD (Knowledge Discovery in Databases) is a process
that involves the extraction of useful, previously
unknown, and potentially valuable information from
large datasets. The KDD process is iterative: it may
require multiple passes through its steps to extract
accurate knowledge from the data. The following steps
are included in the KDD process:
Different Stages(steps) of KDD
    Data Cleaning
  Data cleaning is defined as the removal of noisy and irrelevant data
from the collection.
1.   Cleaning in case of missing values.
2.   Cleaning noisy data, where noise is a random error or variance in
     a measured variable.
3.   Cleaning with data discrepancy detection and data
     transformation tools.
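The first two cleaning cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are marked with None and filled with the mean of the observed values, and noise is smoothed by replacing each value with the mean of its bin. The function names and sample values are hypothetical, and other fill or smoothing strategies are equally valid.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin (a simple noise-smoothing step)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = [25, None, 35, 30]            # hypothetical column with a gap
print(fill_missing(ages))            # [25, 30, 35, 30]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9, 9, 9, 22, 22, 22, 29, 29, 29]
```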
    Data Integration
       Data integration is defined as combining heterogeneous data from
multiple sources into a common store (a data warehouse). It is carried
out using data migration tools, data synchronization tools, and the
ETL (Extract, Transform, Load) process.
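As a rough illustration of integration, the sketch below combines records from two hypothetical sources (a CRM export and a billing export) into one record per customer. The source names, the field names, and the choice of customer_id as the shared key are all assumptions made for the example; real integration tools also handle schema conflicts and duplicates.

```python
def integrate(crm_rows, billing_rows, key="customer_id"):
    """Combine records from two sources into one record per key value."""
    merged = {}
    for row in crm_rows + billing_rows:
        # Create the combined record on first sight, then add the new fields.
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

crm = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
billing = [
    {"customer_id": 2, "balance": 120.0},
    {"customer_id": 3, "balance": 0.0},
]

for record in integrate(crm, billing):
    print(record)
```

Customer 2 appears in both sources, so its combined record carries both the name and the balance.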
    Data Selection
                 Data selection is defined as the process where data
relevant to the analysis is decided and retrieved from the data
collection. For this we can use Neural network, Decision Trees,
Naive bayes, Clustering, and Regression methods.
    Data Transformation
               Data transformation is defined as the process of
transforming data into the form required by the mining
procedure. Data transformation is a two-step process:
1.   Data mapping: Assigning elements from the source to the destination
     to capture transformations.
2.   Code generation: Creation of the actual transformation program.
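The two steps can be pictured with a small sketch: a mapping table that assigns each source field to a destination field together with a conversion, and a function that applies it. The field names (cust_nm, amt, dt) and the conversions are hypothetical, chosen only to show the idea.

```python
# Step 1, data mapping: source field -> (destination field, conversion).
FIELD_MAP = {
    "cust_nm": ("customer_name", str.strip),
    "amt": ("amount", float),
    "dt": ("order_date", lambda s: s.replace("/", "-")),
}

# Step 2, the transformation program that applies the mapping.
def transform(record, field_map=FIELD_MAP):
    """Return a new record in the destination layout."""
    out = {}
    for source_field, (dest_field, convert) in field_map.items():
        out[dest_field] = convert(record[source_field])
    return out

raw = {"cust_nm": " Asha ", "amt": "249.50", "dt": "2024/01/15"}
print(transform(raw))
# {'customer_name': 'Asha', 'amount': 249.5, 'order_date': '2024-01-15'}
```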
   Data Mining
                Data mining is defined as the application of techniques
to extract potentially useful patterns. It transforms task-relevant data
into patterns and decides the purpose of the model
using classification or characterization.
   Pattern Evaluation
                Pattern evaluation is defined as identifying the truly
interesting patterns that represent knowledge, based on given measures.
It finds an interestingness score for each pattern and
uses summarization and visualization to make the data understandable
to the user.
   Knowledge Representation
               This involves presenting the results in a way that is
Advantages of KDD
1.   Improves decision-making: KDD provides valuable insights and
     knowledge that can help organizations make better decisions.
2.   Increased efficiency: KDD automates repetitive and time-
     consuming tasks and makes the data ready for analysis, which saves
     time and money.
3.   Better customer service: KDD helps organizations gain a better
     understanding of their customers’ needs and preferences, which can
     help them provide better customer service.
4.   Fraud detection: KDD can be used to detect fraudulent activities by
     identifying patterns and anomalies in the data that may indicate
     fraud.
5.   Predictive modeling: KDD can be used to build predictive models
     that can forecast future trends and patterns.
Disadvantages of KDD
1.   Privacy concerns: KDD can raise privacy concerns as it involves collecting
     and analyzing large amounts of data, which can include sensitive information
     about individuals.
2.   Complexity: KDD can be a complex process that requires specialized skills
     and knowledge to implement and interpret the results.
3.   Unintended consequences: KDD can lead to unintended consequences,
     such as bias or discrimination, if the data or models are not properly
     understood or used.
4.   Data quality: The KDD process depends heavily on the quality of the data; if the
     data is not accurate or consistent, the results can be misleading.
5.   High cost: KDD can be an expensive process,             requiring   significant
     investments in hardware, software, and personnel.
6.   Overfitting: KDD process can lead to overfitting, which is a common problem
     in machine learning where a model learns the detail and noise in the training
     data to the extent that it negatively impacts the performance of the model on
     new unseen data.
Difference between KDD and Data Mining
   Parameter            KDD                                       Data Mining

   Definition           KDD refers to a process of identifying    Data mining refers to a process
                        valid, novel, potentially useful, and     of extracting useful and valuable
                        ultimately understandable patterns and    information or patterns from
                        relationships in data.                    large data sets.

   Objective            To find useful knowledge from data.       To extract useful information
                                                                  from data.

   Techniques Used      Data cleaning, data integration, data     Association rules,
                        selection, data transformation, data      classification, clustering,
                        mining, pattern evaluation, and           regression, decision trees,
                        knowledge representation and              neural networks, and
                        visualization.                            dimensionality reduction.

   Output               Structured information, such as rules     Patterns, associations, or
                        and models, that can be used to make      insights that can be used to
                        decisions or predictions.                 improve decision-making or
                                                                  understanding.

   Focus                Focus is on the discovery of useful       Data mining focus is on the
                        knowledge, rather than simply finding     discovery of patterns or
                        patterns in data.                         relationships in data.

   Role of domain       Domain expertise is important in KDD, as  Domain expertise is less
   expertise            it helps in defining the goals of the     critical in data mining, as the
                        process, choosing appropriate data, and   algorithms are designed to
                        interpreting the results.                 identify patterns without
                                                                  relying on prior knowledge.
Difference between DBMS and Data Mining
DBMS                                    Data Mining

It creates, stores, and maintains       It extracts useful and previously
data in a database.                     unknown information from raw data.

The database is the organized           Data mining analyzes data from
collection of data. Most of the         different sources to discover
time, this raw data is stored in        useful knowledge.
very large databases.

It supports query languages.            It searches the data automatically.

It can work without data mining         It may not work without a database.
techniques.

Basic elements: query languages,        Basic concepts: classification,
data store, and transaction             regression, clustering, and
mechanism.                              association.
Types of Sources of Data in Data Mining
   1. Data stored in the database
A database is managed by a database management system, or DBMS.
Every DBMS stores data that are related to each other in one way or
another. It also has a set of software programs that are used to manage
data and provide easy access to it. These software programs serve a lot
of purposes, including defining structure for database, making sure that
the stored information remains secured and consistent, and managing
different types of data access, such as shared, distributed, and
concurrent.
A relational database consists of tables, each with its own name and
attributes, that store rows (records) of large data sets. Every record
stored in a table has a unique key. An entity-relationship model is created
to provide a representation of a relational database that features entities
and the relationships that exist between them.
   Data warehouse
A data warehouse is a single data storage location that collects data
from multiple sources and stores it under a unified schema.
When data is stored in a data warehouse, it undergoes cleaning,
integration, loading, and refreshing. Data stored in a data warehouse is
organized in several parts.
   Transactional data
A transactional database stores records that are captured as transactions.
These transactions include flight bookings, customer purchases, clicks on a
website, and others. Every transaction record has a unique ID. It also
lists all the items that make up the transaction.
   Multimedia Databases
•   Multimedia databases consist of audio, video, image, and text media.
•   They can be stored in object-oriented databases.
•   They are used to store complex information in pre-specified formats.
•   Applications: digital libraries, video-on-demand, news-on-demand, music
    databases, etc.
   Time-series Databases
    Time-series databases contain data such as stock exchange data and logged user activity.
•   They handle arrays of numbers indexed by time, date, etc.
•   They often require real-time analysis.
•   Examples: eXtremeDB, Graphite, InfluxDB, etc.
   Cloud Data:
    This type of data is stored and processed in cloud computing
environments such as AWS, Azure, and GCP.
   Big Data:
    This type of data is characterized by its huge volume, high velocity,
and high variety, and can be stored and processed using big data
technologies such as Hadoop and Spark.
Types of Data in Data Mining……
   Structured Data:
 This type of data is organized into a specific format, such as a database
table or spreadsheet. Examples include transaction data, customer data,
and inventory data.
   Semi-Structured Data:
  This type of data has some structure, but not as much as structured
data. Examples include XML and JSON files, and email messages.
   Unstructured Data:
   This type of data does not have a specific format, and can include
text, images, audio, and video. Examples include social media posts,
customer reviews, and news articles.
Types of Data Mining
   1. Predictive Data Mining
       As the name signifies, Predictive Data-Mining analysis
works on the data that may help to know what may happen
later (or in the future) in business. Predictive Data-Mining can
also be further divided into four types that are listed below:
•   Classification Analysis
•   Regression Analysis
•   Time Series Analysis
•   Prediction Analysis
   2. Descriptive Data Mining
      The main goal of the Descriptive Data Mining
tasks is to summarize or turn given data into relevant
information. The Descriptive Data-Mining Tasks can
also be further divided into four types that are as
follows:
•   Clustering Analysis
•   Summarization Analysis
•   Association Rules Analysis
•   Sequence Discovery Analysis
Difference between predictive and descriptive data mining
 Descriptive data mining                 Predictive data mining

 Descriptive mining is usually used to   The term 'Predictive' means to
 provide correlation,                    predict something, so predictive
 cross-tabulation, frequency, etc.       data mining is the analysis done to
                                         predict the future event or other
                                         data or trends.

 It is based on the reactive approach.   It is based on the proactive approach.

 It specifies the characteristics of     It executes the induction over the
 the data in a target data set.          current and past data so that
                                         prediction can happen.

 It needs data aggregation and data      It needs statistics and data
 mining.                                 forecasting procedures.

 It provides precise data.               It produces outcomes without
Data Mining Techniques
   Classification
   Clustering
   Association Rule
   Regression
   Anomaly Detection
   Time series Analysis
   Outlier Detection
   Artificial Neural Networks classifier
   Decision Trees
   Text Mining
1. Classification
      Classification is a technique used to categorize data into
predefined classes or categories based on the features or
attributes of the data instances. It involves training a model on
labeled data and using it to predict the class labels of new, unseen
data instances.
•   Decision Tree
•   SVM(Support Vector Machine)
•   Generalized Linear Models
•   Bayesian classification:
•   Classification by Backpropagation
•   K-NN Classifier
•   Rule-Based Classification
•   Frequent-Pattern Based Classification
•   Rough set theory
•   Fuzzy Logic
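As an illustration of training on labeled data and predicting the class of a new instance, here is a minimal sketch of one listed method, the k-NN classifier. The training points, the labels "low" and "high", and the choice of k = 3 with Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: two features per instance.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.0), "high"),
]

print(knn_predict(train, (1.8, 1.6)))  # low
print(knn_predict(train, (6.2, 6.4)))  # high
```

k-NN has no separate training step: the labeled data itself is the model, which keeps the example short. The other listed methods build an explicit model first.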
2.Clustering
   Clustering is a technique used to group similar
    data instances together based on their intrinsic
    characteristics or similarities. It aims to discover
    natural patterns or structures in the data without
    any predefined classes or labels.
   This technique helps to recognize the differences
    and similarities between the data. Clustering
    resembles classification in that it groups data,
    but the groups are formed from the similarities
    among the instances rather than from known labels.
   Algorithms: K-means, hierarchical clustering,
    density-based clustering
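A minimal K-means sketch shows how groups emerge without labels. It assumes 2-D points, Euclidean distance, a fixed number of iterations, and the first k points as initial centroids; that last choice is a simplification for reproducibility, since practical implementations usually pick starting centroids randomly or with a seeding heuristic.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters; return (centroids, clusters)."""
    centroids = list(points[:k])  # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```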
3. Regression:
   Regression is employed to predict numeric or continuous
    values based on the relationship between input variables and
    a target variable.
   It establishes a relationship between a dependent variable and
    one or more independent variables.
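For the simplest case, one independent variable x and one dependent variable y, the relationship can be fitted with ordinary least squares. The sketch below is one common way to do it; the function name and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0
print(slope * 6 + intercept)   # predicted y for x = 6: 13.0
```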
4. Association Rules:
   Association rules help to discover a link between
    two or more items. They find hidden patterns in the
    data set.
   Association rules are if-then statements that help
    to show the probability of interactions between
    data items within large data sets in different
    types of databases. Association rule mining has
    several applications and is commonly used to find
    sales correlations in transactional data or
    patterns in medical data sets.
   For example, given a list of the grocery items that you
    have been buying for the last six months, it calculates
    the percentage of items being purchased together.
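•   Lift:
    This measure compares the confidence of the rule with how often
    item B is purchased overall. A lift above 1 suggests that A and B
    are bought together more often than would be expected if they
    were independent.
               (Confidence) / ((Item B) / (Entire dataset))
•   Support:
    This measure is how often items A and B are purchased together,
    compared with the overall dataset.
              (Item A + Item B) / (Entire dataset)
•   Confidence:
    This measure is how often item B is purchased when item A is
    purchased as well.
              (Item A + Item B) / (Item A)
    Here (Item A + Item B) is the number of transactions that contain
    both items, (Item A) and (Item B) count the transactions that contain
    each item, and (Entire dataset) is the total number of transactions.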
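The three measures above can be computed directly from a list of transactions. In this sketch each transaction is a Python set of items; the basket contents are invented for the example, and the rule being scored is "bread -> butter".

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often b is bought in the transactions that contain a."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    """Confidence of a -> b relative to how often b is bought overall."""
    return confidence(transactions, a, b) / support(transactions, b)

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

print(support(baskets, {"bread", "butter"}))       # 0.4 (2 of 5 baskets)
print(confidence(baskets, {"bread"}, {"butter"}))  # about 0.67 (2 of 3)
print(lift(baskets, {"bread"}, {"butter"}))        # about 1.11
```

Here the lift is slightly above 1, so in this toy data buying bread makes butter a little more likely than its overall purchase rate.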
5. Artificial Neural Network Classifier
   An artificial neural network (ANN), also known as a "Neural Network"
    (NN), is a computational model inspired by biological neurons. It is
    made up of a networked group of artificial neurons: a collection of
    connected input/output units with a weight assigned to each
    connection.
   In order to predict the class label of the input samples
    correctly, the network learns during the training
    phase by adjusting the weights. Because of the links between units,
    neural network learning is also known as connectionist learning.
   Neural networks require lengthy training periods, making them more
    suitable for applications where long training is feasible. They need a
    number of parameters, like the network topology or "structure," which
    are often best determined empirically.
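The idea of learning by adjusting weights can be shown with the smallest possible case: a single artificial neuron (a perceptron) trained on the logical AND function. This is a deliberately reduced sketch, not a multi-layer network with backpropagation; the step activation, the learning rate of 1, and the 20 epochs are assumptions chosen to keep the arithmetic exact.

```python
def step(value):
    """Threshold activation: fire (1) only for a positive weighted sum."""
    return 1 if value > 0 else 0

def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, epochs=20, learning_rate=1):
    """Adjust the weights whenever the predicted label is wrong."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Truth table of logical AND as (inputs, class label) pairs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```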
6.Outlier Detection
   A database may contain data objects that do not adhere to the overall
    behavior or model of the data. These data objects are
    outliers. Outlier mining is the process of analyzing outlier data.
   When distance measures are used, objects with only a small
    fraction of "near" neighbors in space are regarded as outliers.
    Statistical tests that assume a distribution or probability model for
    the data can also be used to identify outliers.
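The distance-based view can be sketched as follows: a point is flagged when too few of the other points lie within a given radius of it. The radius of 2.0, the minimum fraction of 0.5, and the sample points are assumptions for the example; in practice both thresholds have to be tuned to the data.

```python
import math

def distance_outliers(points, radius, min_fraction):
    """Return the points that have fewer than min_fraction of the
    other points within the given radius."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        near = sum(math.dist(p, q) <= radius for q in others)
        if near / len(others) < min_fraction:
            outliers.append(p)
    return outliers

points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(distance_outliers(points, radius=2.0, min_fraction=0.5))
# [(10, 10)]
```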
Issues and Challenges in Data
Mining
   Limited Information
   Noise or missing Data
   User Interaction and Prior Knowledge
   Data Complexity
   Size, updates and irrelevant fields.
   Data Privacy and Security
   Scalability
  Limited Information:
    Sometimes attributes that are essential for knowledge discovery in the
application domain are not present in the database. Thus, it may be
very difficult to discover significant knowledge about a given domain.
  Noise or missing Data(Data Quality)
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of
the results obtained. The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results. Moreover, the data may
be incomplete, meaning that some attributes or values are missing, making it
challenging to obtain a complete understanding of the data.
Data quality issues can arise for a variety of reasons, including data entry
errors, data storage issues, data integration problems, and data transmission
errors.
   User Interaction and Prior Knowledge
An analyst is usually not a KDD expert, but simply a person making use of the data by
means of the available KDD techniques. Since the KDD process is by definition interactive
and iterative, it is challenging to provide a high performance, rapid-response environment
that also assists the users in the proper selection and matching of the appropriate techniques
to achieve their goals.
   Data Complexity
Data complexity refers to the vast amounts of data generated by
various sources, such as sensors, social media, and the internet
of things (IoT). The complexity of the data may make it
challenging to process, analyze, and understand. In addition, the
data may be in different formats, making it challenging to
integrate into a single dataset.
   Size, updates and irrelevant fields
Databases tend to be large and dynamic, in that their contents keep changing as
information is added, modified or removed. The problem with this, from the perspective of
data mining, is how to ensure that the discovered rules are up-to-date and consistent with the most
current information.
   Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The
data may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules
on how data can be collected, used, and shared.
   Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations
also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed
in real-time.
   Ethical and Legal Considerations:
Data mining raises various ethical and legal considerations, including
consent, data ownership, intellectual property rights, and compliance
with regulations such as GDPR (General Data Protection Regulation) and
HIPAA (Health Insurance Portability and Accountability Act).
Data Mining Applications
Business Transactions:
       For businesses that struggle to survive in a highly
competitive world, using data effectively and within a reasonable
time frame for competitive decision-making is among the most
important problems to solve. Data mining helps to
analyze business transactions to identify marketing
approaches and support decision-making. Examples:
•   Direct mail targeting
•   Stock trading
•   Customer segmentation and retention
•   Churn prediction (Churn prediction is one of the most popular
    Big Data use cases in business)
Intrusion Detection
A network intrusion refers to any unauthorized activity on a
digital network. Network intrusions often involve stealing
valuable network resources. Data mining techniques play a vital
role in intrusion detection, that is, in finding network attacks and
anomalies. These techniques help in selecting and refining useful
and relevant information from large data sets and in classifying
the data that is relevant for an Intrusion Detection
System. An Intrusion Detection System generates alarms when
network traffic indicates an intrusion into the system. For
example:
•   Detect security violations
•   Misuse Detection
•   Anomaly Detection
Market Basket Analysis:
   Market Basket Analysis is a technique that studies the
    purchases made by customers in a supermarket. It identifies
    the items that customers frequently purchase together. This
    analysis helps companies to promote deals, offers, and sales,
    and data mining techniques are what make the analysis possible.
    Example:
•   Data mining concepts are used in sales and marketing to
    provide better customer service, to improve cross-selling
    opportunities, and to increase direct mail response rates.
Education:
              To analyze the education sector, data mining
uses the Educational Data Mining (EDM) method. This method
generates patterns that can be used by both learners and
educators. Using EDM, we can perform educational tasks
such as:
•   Predicting student admission to higher education
•   Student profiling
•   Predicting student performance
•   Evaluating teachers' teaching performance
•   Curriculum development
•   Predicting student placement opportunities
                             Research
        Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data in the
research area. In most technical research in data mining, we
create a training model and a testing model. The training/testing approach
is a strategy to measure the accuracy of the proposed model. It is
called Train/Test because we split the data set into two sets: a training
data set and a testing data set. The training data set is used to build the
model, whereas the testing data set is used to evaluate it.
Example:
•   Classification of uncertain data.
•   Information-based clustering.
•   Decision support system
•   Web Mining
•   IoT (Internet of Things)and Cybersecurity
•   Smart farming IoT(Internet of Things)
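The Train/Test split described in the Research paragraph can be sketched in a few lines. The 25% test fraction and the fixed random seed are assumptions for the example; the seed only makes the shuffle repeatable.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))            # stand-in for 20 data records
train, test = train_test_split(rows)
print(len(train), len(test))      # 15 5
```

The model is built on train only and then scored on test, so the reported accuracy reflects records the model has not seen.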
Healthcare and Insurance:
              A pharmaceutical company can examine its recent sales
force activity and its outcomes to improve the targeting of
high-value physicians and figure out which marketing activities
will have the best effect in the upcoming months.
In the insurance sector, data mining can help to predict
which customers will buy new policies, identify behavior patterns
of risky customers, and identify fraudulent behavior of customers.
•   Claims analysis, i.e., which medical procedures are claimed
    together.
•   Identify successful medical therapies for different illnesses.
•   Characterize patient behavior to predict office visits.
Financial/Banking Sector
            A credit card company can leverage its
vast warehouse of customer transaction data to
identify customers most likely to be interested in a
new credit product.
•   Credit card fraud detection.
•   Identify ‘Loyal’ customers.
•   Extraction of information related to customers.
•   Determine    credit   card    spending   by   customer
    groups.
Transportation:
      A diversified transportation company
with a large direct sales force can apply data
mining to identify the best prospects for its
services. A large consumer merchandise
organization can apply data mining to
improve its sales process to retailers.
•   Determine the distribution schedules among
    outlets.
•   Analyze loading patterns.