2 introductory slides

ITE2006 – Data Mining Techniques
By Dr. T. Ramkumar
ramkumar.thirunavukarasu@vit.ac.in
Module – 1 Introduction

Necessity Is the Mother of Invention
• Data explosion problem
– Automated data collection tools and mature database technology lead
to tremendous amounts of data accumulated and/or to be analyzed in
databases, data warehouses, and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution:
– Data warehousing and on-line analytical processing
– Mining interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases

Data Mining
• It is an interdisciplinary subfield of computer
science.
• It is the computational process of discovering
patterns in large data sets (from data warehouse)
involving methods at the intersection of artificial
intelligence, machine learning, statistics, and
database systems.
• Convergence of multiple disciplines

Data Mining Frameworks
• Knowledge Discovery in Databases (KDD) process model
• CRoss-Industry Standard Process for Data Mining (CRISP-DM)
• Sample, Explore, Modify, Model, and Assess (SEMMA)

Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
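
The web-log steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (user, timestamp, URL, status), the 30-minute session gap, and all function names are hypothetical and not part of the original slides.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 60, "/products", 200),
    ("u1", 90, "/missing", 404),
    ("u1", 120, "/cart", 200),
    ("u2", 30, "/home", 200),
    ("u2", 100, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5050, "/products", 200),
]

SESSION_GAP = 1800  # assumption: 30 minutes of inactivity starts a new session


def preprocess(records):
    """Preprocessing: remove error entries (status >= 400)."""
    return [r for r in records if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by user and time, then group visits into sessions."""
    by_user = defaultdict(list)
    for user, ts, url, _ in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(LOG))
pair_counts = count_pairs(sessions)

# Interpretation/evaluation: display the most frequently accessed sequences.
for pair, n in pair_counts.most_common(2):
    print(pair, n)
```

A frequent pair such as ("/home", "/products") is the kind of pattern that cache prediction or personalization could build on.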

Definition of Data Mining
• The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases.
– Fayyad et al. (1996)
• Keywords in this definition: Process, nontrivial,
valid, novel, potentially useful, understandable.
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging.

Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
• Layers, top to bottom
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
• Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

August 13, 2021
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Neural Networks
• Decision Tree Algorithms
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

Database Processing vs. Data Mining
Database Processing
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data Mining
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database

Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
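
The difference between the two kinds of question can be illustrated with the milk example. The sketch below is hypothetical throughout: the purchase data, the function names, and the co-occurrence threshold are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical purchase data: customer -> list of baskets (sets of items).
PURCHASES = {
    "alice": [{"milk", "bread"}, {"milk", "cereal"}],
    "bob": [{"beer", "chips"}],
    "carol": [{"milk", "bread", "eggs"}],
}


def customers_who_bought(item):
    """Database-style query: a precise answer that is a subset of the data."""
    return sorted(
        customer
        for customer, baskets in PURCHASES.items()
        if any(item in basket for basket in baskets)
    )


def items_bought_with(item, min_count=2):
    """Mining-style question: items co-occurring with `item` in at least
    `min_count` baskets (the threshold is an assumed parameter)."""
    counts = Counter()
    for baskets in PURCHASES.values():
        for basket in baskets:
            if item in basket:
                counts.update(basket - {item})
    return {other: n for other, n in counts.items() if n >= min_count}


milk_customers = customers_who_bought("milk")
milk_companions = items_bought_with("milk")
print(milk_customers)   # customers whose baskets contain milk
print(milk_companions)  # items frequently bought together with milk
```

The first function returns rows that already exist in the data; the second returns a derived pattern whose answer depends on what "frequently" is taken to mean.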

Data Mining Models and Tasks

Classification vs. Regression

Classification vs. Prediction

From the Perspective of Machine Learning

Supervised Learning vs. Unsupervised Learning

Data Mining Tasks
• Classification maps data into predefined groups or
classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real valued
prediction variable.
• Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning
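
The three tasks above differ mainly in what they output: a predefined class, a real value, or groups found without labels. The sketch below uses deliberately simple stand-ins (1-nearest neighbour, a least-squares line, and a gap-based grouping of sorted values); the data, names, and the choice of these particular methods are assumptions for illustration, not the methods prescribed by the slides.

```python
from statistics import mean

# Classification: map an item to one of a set of predefined classes.
# Hypothetical labelled data: (income in $1000s, class label).
LABELLED = [(20, "poor risk"), (25, "poor risk"), (60, "good risk"), (80, "good risk")]


def classify_1nn(x):
    """Supervised: return the label of the nearest labelled example."""
    return min(LABELLED, key=lambda pair: abs(pair[0] - x))[1]


# Regression: map an item to a real-valued prediction variable.
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


# Clustering: group similar items together without any labels.
def cluster_by_gap(values, gap):
    """Unsupervised: start a new cluster whenever two consecutive sorted
    values differ by more than `gap` (a simple grouping, not k-means)."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([])
        clusters[-1].append(v)
    return clusters


label = classify_1nn(30)
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
groups = cluster_by_gap([1, 2, 3, 10, 11, 25], gap=5)
print(label, a, b, groups)
```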

Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.

Multi-Dimensional View of Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views lead to different classifications
– Data view: Kinds of data to be mined
– Knowledge view: Kinds of knowledge to be discovered
– Method view: Kinds of techniques utilized
– Application view: Kinds of applications adapted

Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web

Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
• Frequent patterns, association, correlation vs. causality
– Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown or missing numerical values
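
The two bracketed numbers attached to an association rule such as the diaper and beer example are conventionally its support and confidence. A minimal sketch of how they are computed follows; the baskets and the function name are made up for illustration and do not reproduce the figures on the slide.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent transactions that also contain
                 the consequent
    """
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets.
BASKETS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

support, confidence = support_confidence(BASKETS, {"diaper"}, {"beer"})
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

High confidence only says the items co-occur; on its own it does not establish that one purchase causes the other.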

Data Mining Functionalities (Cont…)
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of
the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
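
Outlier analysis can be illustrated with one simple statistical test: flag values that lie far from the mean in units of standard deviation. The z-score method, the threshold of 2.0, and the transaction amounts below are assumptions chosen for the example; many other outlier definitions exist.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the general behaviour.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or an exception worth investigating (for instance possible fraud) is a judgment the method itself cannot make.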

Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy

Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
• Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.,
1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth, 1984.
– #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant
Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
– #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-
Verlag.
– #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In SIGMOD '00.

The 18 Identified Candidates (II)
• Link Mining
– #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale
hypertextual Web search engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked
environment. SODA, 1998.
• Clustering
– #11. K-Means: MacQueen, J. B., Some methods for classification and
analysis of multivariate observations, in Proc. 5th Berkeley Symp.
Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an
efficient data clustering method for very large databases. In SIGMOD
'96.
• Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-
theoretic generalization of on-line learning and an application to
boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

The 18 Identified Candidates (III)
• Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns:
Generalizations and Performance Improvements. In Proceedings of the 5th
International Conference on Extending Database Technology, 1996.
– #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and
M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-
Projected Pattern Growth. In ICDE '01.
• Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and
association rule mining. KDD-98.
• Rough Sets
– #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of
Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
• Graph Mining
– #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern
Mining. In ICDM '02.

Top-10 Algorithms Finally Selected
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)
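
To give a feel for one entry in the list, here is a minimal sketch of the K-Means idea: assign each point to its nearest centroid, recompute the centroids, and repeat until they stop changing. It is restricted to one-dimensional data with hand-picked starting centroids, both simplifications assumed for the example; real implementations handle many dimensions and choose starting points more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Very small 1-D k-means; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two obvious groups and assumed starting centroids.
final_centroids, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [1, 12])
print(final_centroids, final_clusters)
```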

A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data
Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD: CDROM)
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
– Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI,
etc.
• Web and IR
– Conferences: SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information Systems
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of statistics, etc.
• Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. visualization and computer graphics, etc.

Summary
• Data mining: Discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining
