ITE2006 – Data Mining
Techniques
By
Dr.T.Ramkumar
ramkumar.thirunavukarasu@vit.ac.in
Module – 1 Introduction
Necessity Is the Mother of Invention
• Data explosion problem
– Automated data collection tools and mature database technology lead
to tremendous amounts of data accumulated and/or to be analyzed in
databases, data warehouses, and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution:
– Data warehousing and on-line analytical processing
– Mining interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases
Data Mining
• It is an interdisciplinary subfield of computer
science.
• It is the computational process of discovering
patterns in large data sets (from data warehouse)
involving methods at the intersection of artificial
intelligence, machine learning, statistics, and
database systems.
• Convergence of multiple disciplines
Data Mining Frameworks
• Knowledge Discovery in Databases (KDD) process model
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• Sample, Explore, Modify, Model, and Assess (SEMMA)
[Figure: Knowledge Discovery in Databases (KDD) process model]
[Figure: Cross-Industry Standard Process for Data Mining (CRISP-DM)]
[Figure: Sample, Explore, Modify, Model, and Assess (SEMMA)]
Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
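The web-log steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the slides: the log records, the field layout (user, timestamp, URL, status), the 30-minute session gap, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 60, "/products", 200),
    ("u1", 120, "/cart", 200),
    ("u2", 30, "/home", 200),
    ("u2", 90, "/products", 200),
    ("u2", 100, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5060, "/products", 200),
]

SESSION_GAP = 1800  # assumed: 30 minutes of inactivity starts a new session


def preprocess(records):
    """Preprocessing: drop error entries (status >= 400)."""
    return [r for r in records if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by time, group by user, split on long gaps."""
    by_user = defaultdict(list)
    for user, ts, url, _ in sorted(records, key=lambda r: r[1]):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(LOG))
pair_counts = count_pairs(sessions)
# Interpretation/evaluation: show the most frequently accessed sequences first.
for pair, n in pair_counts.most_common(2):
    print(pair, n)
```

Counting only consecutive page pairs keeps the sketch short; a real sequence miner would also consider longer sequences and a minimum support threshold.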
Definition of Data Mining
• The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases.
– Fayyad et al. (1996)
• Keywords in this definition: Process, nontrivial,
valid, novel, potentially useful, understandable.
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging.
Data Mining and Business Intelligence
Potential to support business decisions increases from the bottom layer to the top. Layers, from top to bottom:
• Making Decisions
• Data Presentation
– Visualization Techniques
• Data Mining
– Information Discovery
• Data Exploration
– OLAP, MDA
– Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources
– Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
August 13, 2021
Data Mining: Confluence of Multiple Disciplines
• Database Technology
• Statistics
• Machine Learning
• Pattern Recognition
• Algorithm
• Visualization
• Other Disciplines
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Neural Networks
• Decision Tree Algorithms
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
Database Processing vs. Data Mining
Database processing:
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data mining:
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
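The difference between the two kinds of query can be made concrete with a small Python sketch. The purchase data, the function name `frequent_with`, and the 60% co-occurrence threshold are all assumptions made for illustration; the slides do not prescribe them.

```python
from collections import Counter

# Hypothetical purchase records: customer -> set of items bought.
PURCHASES = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"beer", "chips"},
    "dave": {"milk", "bread", "chips"},
}

# Database-style query: well defined, and the answer is a subset of the data.
milk_buyers = sorted(c for c, items in PURCHASES.items() if "milk" in items)


def frequent_with(item, purchases, min_fraction=0.6):
    """Mining-style question: which items co-occur with `item` often enough?

    `min_fraction` is an assumed threshold. The answer changes with it, which
    is why the output is a derived pattern rather than a stored row.
    """
    baskets = [items for items in purchases.values() if item in items]
    counts = Counter(o for items in baskets for o in items if o != item)
    return sorted(o for o, n in counts.items() if n / len(baskets) >= min_fraction)


print(milk_buyers)
print(frequent_with("milk", PURCHASES))
```

The first result lists rows that already exist in the data. The second depends on a threshold chosen by the analyst, which matches the "fuzzy, not a subset of database" output described for data mining.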
Data Mining Models and Tasks
Classification vs. Regression
Classification
Classification vs. Prediction
From the Perspective of Machine Learning
Supervised Learning vs. Unsupervised Learning
Data Mining Tasks
• Classification maps data into predefined groups or
classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real valued
prediction variable.
• Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning
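The three tasks above can be contrasted with one toy function each. This is a minimal sketch on one-dimensional data; the choice of nearest-neighbour classification, a least-squares line, and basic k-means is an assumption made for illustration, and the function names are hypothetical.

```python
def nearest_neighbor_class(x, labeled):
    """Classification: return the label of the closest labeled example."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(points):
    """Regression: least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kmeans_1d(values, centers, iterations=10):
    """Clustering: basic k-means on numbers, starting from given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Supervised: classes are predefined and come with the training data.
labeled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
print(nearest_neighbor_class(7.5, labeled))

# Regression: the predicted variable is a real number, not a class.
print(fit_line([(0, 1), (1, 3), (2, 5)]))

# Unsupervised: no labels are given, groups emerge from the data.
print(kmeans_1d([1, 2, 8, 9], [0.0, 10.0]))
```

Only the classifier receives labels as input, which is the supervised versus unsupervised distinction drawn in the list above.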
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
Multi-Dimensional View of Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views lead to different classifications
– Data view: Kinds of data to be mined
– Knowledge view: Kinds of knowledge to be discovered
– Method view: Kinds of techniques utilized
– Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
• Frequent patterns, association, correlation vs. causality
– Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown or missing numerical values
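The two bracketed numbers on an association rule such as Diaper → Beer are conventionally its support and confidence. The sketch below computes both on a hypothetical set of eight transactions; the data and the function name are assumptions, and the resulting values are not the ones quoted on the slide.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent transactions that also contain
                 the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"milk"},
    {"bread"},
    {"beer"},
    {"bread", "milk"},
]

print(support_confidence(BASKETS, {"diaper"}, {"beer"}))
```

High confidence only says the items co-occur; as the slide's question suggests, it does not establish that buying one causes buying the other.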
Data Mining Functionalities (Cont…)
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of
the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
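As one simple statistical reading of the outlier definition above, a value can be flagged when it lies far from the mean in units of standard deviation. This is a sketch under assumptions: the z-score rule, the threshold of 2.0, and the sample amounts are illustrative choices, not something the slides specify.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts with one unusual entry.
AMOUNTS = [10, 11, 9, 10, 12, 10, 50]
print(find_outliers(AMOUNTS))
```

Whether a flagged value is noise or a meaningful exception, such as a fraudulent transaction, still has to be judged in the application domain.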
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
• Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth, 1984.
– #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant
Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
– #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory.
Springer-Verlag.
– #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New
York.
• Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In SIGMOD '00.
The 18 Identified Candidates (II)
• Link Mining
– #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale
hypertextual Web search engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked
environment. SODA, 1998.
• Clustering
– #11. K-Means: MacQueen, J. B., Some methods for classification and
analysis of multivariate observations, in Proc. 5th Berkeley Symp.
Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an
efficient data clustering method for very large databases. In SIGMOD
'96.
• Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic
generalization of on-line learning and an application to
boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
The 18 Identified Candidates (III)
• Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns:
Generalizations and Performance Improvements. In Proceedings of the 5th
International Conference on Extending Database Technology, 1996.
– #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and
M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by
Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and
association rule mining. KDD-98.
• Rough Sets
– #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of
Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
• Graph Mining
– #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern
Mining. In ICDM '02.
Top-10 Algorithms Finally Selected
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data
Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining
(ICDM)
– Conf. on Principles and practices
of Knowledge Discovery and Data
Mining (PKDD)
– Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
• Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge
Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and
Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD: CDROM)
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
– Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI,
etc.
• Web and IR
– Conferences: SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information Systems
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of statistics, etc.
• Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. visualization and computer graphics, etc.
Summary
• Data mining: Discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining
