Data Mining Introduction

The document serves as an introduction to Data Mining, outlining its importance, methodologies, and applications. It covers the course structure, evaluation criteria, class rules, and recommended texts for students. Additionally, it discusses the evolution of database technology, the KDD process, and various functionalities of data mining such as classification, clustering, and outlier analysis.

Uploaded by

Obaid Amir

Data Mining
An Introduction

Instructor: Qurat-ul-Ain
quratulain.ssc@stmu.edu.pk

WELCOME TO THIS LOVELY AND JOYFUL SUBJECT


Recommended Text
 “Data Mining: Concepts and Techniques”, second edition or later, by Jiawei Han
 “Mining of Massive Datasets”, 3rd edition, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman
 “Data Science and Big Data Analytics”, by EMC Education Services
 Instructor’s Notes
 Lecture slides & notes
Student’s Performance Evaluation
Credit hours: 3
Prerequisite: Probability and Statistics
Quizzes: 10%
Assignment: 15%
Mid-term: 20%
Class Participation: 5%
Final-term: 50%
Grading Policy
 No makeup for any of the evaluation activities.
 Regular project related assignments.
 Strict submission deadlines.

In case of late submissions, 15% of the marks will be deducted per late day. No submissions are accepted more than 3 days after the due date.

 Strict penalty for any copied/plagiarized material.



An individual/group may be assigned a straight zero if the submitted assessed work (lab work, assignment, or quiz) is copied from another individual/group or from any other source (books, research papers, websites).

An individual/group may be penalized, by deducting marks from the assessed work, if a substantial amount of the submitted work falls under plagiarism.
Class Rules [1/2]
 No visitors are allowed
 Be Punctual
 Latecomers are not allowed
 75% attendance is compulsory
 Be Attentive
 Be Prompt
 Ready to learn
 Class participation
 Surprise quizzes
 Be Polite
 Soft-spoken
Class Rules [2/2]
 Be Honest
 With yourself
 Credit others
 No cheating
 No wastage of time
 Be Responsible
 SWITCH OFF your phone
 Penalty: treat for the whole class

 Ask the questions


Get Connected
 Contacts
 quratulain.ssc@stmu.edu.pk
 Link for study resources
 Google Drive: https://drive.google.com/drive/u/1/folders/1R8MOSt6MBC7Ke1-3GVF7zGSI4iyfpha7
Why Data Mining?
 The explosive growth of data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, …
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Alternative name

 Knowledge discovery in databases (KDD)

 Watch out: Is everything “data mining”?

 Query processing

 Expert systems or statistical programs

 Data mining (knowledge discovery from data)

 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
What Is Data Mining?
Let’s start data mining with an interesting statement.

 The statement, made by Donald Rumsfeld, Defense Secretary of the USA, in an interview, is as follows.

 As we know, there are known knowns. There are things we know that we know, like your name or your parents’ names. We also know there are known unknowns.

 That is to say, we know that there are some things we do not know, like what someone is thinking about you, what you will eat six days from now, what the result of a lottery will be, and so on.

 But there are also unknown unknowns, the ones we don't know that we don't know. Is it beneficial to know them? Or is it harmful not to know them?
What Is Data Mining?
There are also unknown knowns, things we'd like to know, but don't know, but know someone who can doctor them and pass them off as known knowns. To associate Rumsfeld’s above quotation with data mining, we identify four core phrases:
1. Known knowns
2. Known unknowns
3. Unknown unknowns
4. Unknown knowns
 Items 1, 3, and 4 deal with “knowns”. Data mining is relevant to the third item, unknown unknowns.
 It is the art of digging out exactly what we don’t know that we must know in our business.
 The methodology is to first convert “unknown unknowns” into “known unknowns” and then finally into “known knowns”.
What is Data Mining?: Slightly Informal

Tell me something that I should know. When you don’t know what you should know, how do you write SQL?

You can’t!

“Tell me something that I should know” means asking your DWH or data repository to tell you something that you don’t know but should. Since we don’t know what we actually don’t know, or what we need to know, we can’t write SQL queries to get answers the way we do in OLTP systems.

Data mining is an exploratory approach, where browsing through data using data mining techniques may reveal something of interest to the user that was previously unknown. Hence, in data mining we don’t know the results in advance.
Why Data Mining?—Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
 Risk analysis and management
 Forecasting, customer retention, quality control, competitive
analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis
Market Analysis and Management
 Where does the data come from?
 Credit card transactions, discount coupons, customer complaint calls
 Target marketing
 Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Cross-market analysis
 Associations/correlations between product sales, and prediction based on such associations
 Customer profiling
 What types of customers buy what products

 Customer requirement analysis


 Identifying the best products for different customers

 Predict what factors will attract new customers


Fraud Detection & Mining Unusual Patterns

 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service, telecomm.
 Medical insurance
 Professional patients, and ring of doctors
 Unnecessary or correlated screening tests
 Telecommunications:
 Phone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm
 Retail industry
 Analysts estimate that 38% of retail shrink is due to
dishonest employees
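
The idea of flagging "patterns that deviate from an expected norm" in the telecommunications example can be sketched with a simple z-score rule. This is only an illustration: the call durations, the function name `flag_outliers`, and the threshold of 2 standard deviations are assumptions made here, not part of the slides, and real fraud detection systems use far richer models.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` sample standard deviations (a z-score rule)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the norm.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 6, 4, 95]
suspicious = flag_outliers(durations)
print(suspicious)
```

A single extreme value inflates both the mean and the standard deviation, which is why robust alternatives (for example, rules based on the median) are often preferred in practice.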
Other Applications
 Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover
customer preference and behavior pages, analyzing
effectiveness of Web marketing, improving Web site
organization, etc.
Data Mining: A KDD Process
 Data mining: the core of the knowledge discovery process
[Figure: the KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
 Learning the application domain
 Relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
 Find useful features, dimensionality/variable reduction.
 Choosing functions of data mining
 Summarization, classification, regression, association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 Visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
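
The steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the purchase records, the function names (`clean`, `select`, `mine`, `evaluate`), and the minimum-count threshold are all hypothetical, and each function stands in for a step that is much more involved in a real project.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, product, amount).
# None marks a missing value.
raw = [
    (1, "milk", 2.5), (1, "bread", 1.0), (2, "milk", 2.5),
    (2, None, 3.0), (3, "bread", 1.0), (3, "milk", None),
    (4, "eggs", 4.0), (1, "milk", 2.5),
]

def clean(records):
    """Data cleaning: drop records with missing fields and exact duplicates."""
    seen, out = set(), []
    for r in records:
        if None in r or r in seen:
            continue
        seen.add(r)
        out.append(r)
    return out

def select(records):
    """Data selection and reduction: keep only the task-relevant attributes."""
    return [(cid, product) for cid, product, _ in records]

def mine(pairs):
    """Data mining: count how many distinct customers bought each product."""
    return Counter(product for _, product in set(pairs))

def evaluate(counts, min_customers=2):
    """Pattern evaluation: keep only patterns that meet a minimum count."""
    return {p: n for p, n in counts.items() if n >= min_customers}

patterns = evaluate(mine(select(clean(raw))))
print(patterns)
```

Keeping each step as its own function mirrors the slide: the output of one stage is the input of the next, and any stage can be revisited without touching the others.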
Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base is shown to the side.]
Claude Shannon's Info. Theory
More Volume
 Data mining evolved as a mechanism to address the limitations of OLTP systems in dealing with massive data sets with high dimensionality, new data types, multiple heterogeneous data resources, etc.

 Conventional systems couldn’t keep pace with ever-changing and ever-growing data sets.

 Data mining algorithms are built to deal with high-dimensional data, new data types (images, video, etc.), complex associations among data items, distributed data sources, and associated issues (security, etc.)
How Data Mining is different?

 Traditional Database (Transactions): querying data in well-defined processes; reliable storage
Data Mining: On What Kinds of Data?

 Relational database
 Data warehouse
 Transactional database
 Advanced database and information repository
 Spatial and temporal data
 Time-series data
 Stream data
 Multimedia database
 Text databases & WWW
Data Mining Functionalities
 Concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics
 Association (correlation and causality)
 Diaper → Beer [0.5%, 75%]
 Classification and Prediction
 Construct models (functions) that describe and distinguish classes or concepts for future prediction
 Presentation: decision-tree, classification rule, neural network
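
The bracketed pair in the diaper-and-beer association example above is conventionally read as [support, confidence]: support is the fraction of all transactions that contain both items, and confidence is the fraction of transactions containing the first item that also contain the second. A minimal sketch of how the two numbers could be computed follows; the baskets and the function name `support_confidence` are made up for illustration and do not reproduce the 0.5% and 75% figures from the slide.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Transactions that contain every item of the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain the antecedent and the consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n if n else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
    {"beer"},
    {"milk"},
    {"diaper", "beer", "eggs"},
]
s, c = support_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={s:.2f}, confidence={c:.2f}")
```

Here 4 of the 8 baskets contain both items, and 4 of the 5 baskets with diapers also contain beer.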
Data Mining Functionalities
 Cluster analysis
 Class label is unknown: Group data to form new classes,
e.g., cluster houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass
similarity
 Outlier analysis
 Outlier: a data object that does not comply with the
general behavior of the data
 Useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
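
Cluster analysis, as described above, groups unlabeled data so that members of a group are similar to each other and dissimilar to other groups. One common way to do this is k-means, which the slides do not name; it is used here only as an example technique. The sketch below assumes 2-D points (say, hypothetical house coordinates) and fixed starting centroids so that the run is repeatable.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's k-means on 2-D points, starting from the given centroids.
    Returns the final centroids and the clusters of the last assignment."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid
            # (squared Euclidean distance).
            idx = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(houses, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the number of clusters and on the starting centroids, both of which are choices made by the analyst rather than facts discovered in the data.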
Are All the “Discovered” Patterns Interesting?

 Data mining may generate thousands of patterns: not all of them are interesting
 Suggested approach: Human-centered, query-based, focused mining

 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to confirm

 Objective vs. subjective interestingness measures


 Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty.
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Data Mining: Classification Schemes

 Different views, different classifications

 Kinds of data to be mined

 Kinds of knowledge to be discovered

 Kinds of techniques utilized

 Kinds of applications adapted


Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, WWW

 Knowledge to be mined
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
Multi-Dimensional View of Data Mining
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.

 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data Warehousing

 Data mining systems, DBMS, Data warehouse systems coupling
 On-line analytical mining data
 Integration of mining and OLAP technologies

 Interactive mining multi-level knowledge


 Necessity of mining knowledge and patterns at different
levels of abstraction.

 Integration of multiple mining functions


 Characterized classification, first clustering and then
association
Data Mining is…
Data Mining
 A neural network is a series of algorithms that endeavors to
recognize underlying relationships in a set of data through a
process that mimics the way the human brain operates. In this
sense, neural networks refer to systems of neurons, either
organic or artificial in nature.

 Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
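
To make rule induction concrete, here is a sketch of OneR, one of the simplest rule learners. It is chosen here for illustration and is not mentioned in the slides. For each attribute it builds "value → most frequent class" rules and keeps the attribute whose rules make the fewest mistakes on the observations. The toy weather-style data and the function name `one_r` are assumptions.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (attribute, rules, errors) for the single attribute whose
    value -> majority-class rules make the fewest training errors."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class.
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        # Errors are the observations that are not in the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical observations.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
attribute, rules, errors = one_r(rows, target="play")
print(attribute, rules, errors)
```

The induced rules read directly as "if outlook is sunny then play is yes", which is the kind of human-readable local pattern the definition above refers to.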
Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing knowledge: knowledge fusion
Major Issues in Data Mining
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy
Summary

 Data mining: Discovering interesting patterns from large amounts of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
 Data mining systems and architectures
 Major issues in data mining
Tools Used for Data Mining
 Data Mining Tools
 Weka, RapidMiner, Minitab, etc.
 Data Warehouses
 A subject-oriented, integrated, time-variant, and non-volatile collection of data
 Developed to support management’s decision-making process
 Benefits of DWH: high returns on investment, substantial competitive advantage, increased productivity of corporate decision-makers
 Python / R language
Where to Find References?
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

 Data mining and KDD


 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.

 Journal: Data Mining and Knowledge Discovery, KDD Explorations

 Database systems
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA

 Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.

 AI & Machine Learning


 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.

 Journals: Machine Learning, Artificial Intelligence, etc.

 Statistics
 Conferences: Joint Stat. Meeting, etc.

 Journals: Annals of statistics, etc.

 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.

 Journals: IEEE Trans. visualization and computer graphics, etc.


Topics to Be Covered

 Introduction to Data Mining


 Data Reduction
 Clustering
 Classification
 Association Analysis
 Link analysis
 Outlier mining
 Sequence mining
 Text Mining
 Web mining
 Recommender System
