Prof. Heitor Silvério Lopes
Prof. Thiago H. Silva
Data Mining & Knowledge Discovery
Class 1a – Introduction & Overview
2025
Data mining → Knowledge discovery
The purpose of data mining is to find new, useful, and relevant knowledge hidden in
large amounts of data.
The Multidisciplinarity of Data Mining
● Data mining uses concepts and methods from many areas:
○ Machine Learning
○ Databases
○ Computational Intelligence (evolutionary computation, neural networks, fuzzy systems)
○ Mathematics / Statistics
○ Programming languages
Data x Information x Knowledge
● Data:
○ Instances (objects, people, timestamps, etc.)
○ Describe individual, not collective, properties, and they are:
■ Easy to collect
■ Available in large amounts and forms
■ Of little use for predictions or decision-making
● Information:
○ Classes (groups) of instances
○ Describes generic patterns, structures, principles, etc.
■ Hard to obtain
■ Not abundant
■ Allows generalizations and predictions
● Knowledge:
○ Concerns the comprehension of something (including facts, skills, and information)
○ Obtained by means of human perception or learning
"We are drowning in information, but starving for knowledge." John Naisbitt (1982)
Data x Information x Knowledge
[Figure: hierarchy with Data at the bottom, Information in the middle, and Knowledge at the top; complexity increases toward Knowledge]
Some important definitions of Data Mining
● Automatic/semi-automatic discovery of structural patterns in data (Witten et
al., 2000)
● Extraction of structured knowledge which is useful, previously unknown, non-
trivial, humanly comprehensible, from large amounts of data (Fayyad et al.,
1996)
● Desirable features of discovered knowledge:
○ Correctness
○ Generality
○ Utility
○ Comprehensibility
○ Novelty
Examples of rules discovered using data mining
● Case 1: consider a dataset of patient records from a maternity hospital.
A data-mining procedure found this rule:
IF (patient.age > 15) AND (patient.age < 50) AND
(sector = “surgical clinic”) AND (surgery.type = “cesarean”)
THEN (patient.sex = “female”)
Assessment: Correctness ☺ · Generality ☺ · Utility (none) · Comprehensibility ☺ · Novelty (none)
● Case 2: consider a dataset of pediatric oncological medical records*.
A data-mining procedure found this rule:
IF (histology.type = carcinoma) AND (patient.age < 3)
AND (oncological.stage = 1) AND (metastasis=“no”)
THEN (years.survival > 5)
Assessment: Correctness ☺ · Generality ☺ · Utility ☺ ☺ · Comprehensibility ☺ · Novelty ☺ ☺ ☺
* Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. A constrained-syntax genetic programming system for discovering
classification rules: application to medical data sets. Artificial Intelligence in Medicine, v. 30, n. 1, p. 27-48, 2004.
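Two of the desirable features, correctness and generality, can be estimated directly from data. The sketch below is one plausible way to do so, not the method of the cited paper: it encodes the Case 1 rule as two predicates and measures coverage (the fraction of records matching the IF part, a rough proxy for generality) and accuracy (the fraction of covered records that also satisfy the THEN part, a rough proxy for correctness). The records, field names, and the `evaluate_rule` helper are hypothetical and exist only for illustration.

```python
# Hypothetical toy records; the values are invented for illustration only.
records = [
    {"age": 28, "sector": "surgical clinic", "surgery_type": "cesarean", "sex": "female"},
    {"age": 34, "sector": "surgical clinic", "surgery_type": "cesarean", "sex": "female"},
    {"age": 41, "sector": "surgical clinic", "surgery_type": "hysterectomy", "sex": "female"},
    {"age": 0, "sector": "neonatal", "surgery_type": "none", "sex": "male"},
]


def antecedent(r):
    """IF part of the Case 1 rule."""
    return (
        15 < r["age"] < 50
        and r["sector"] == "surgical clinic"
        and r["surgery_type"] == "cesarean"
    )


def consequent(r):
    """THEN part of the Case 1 rule."""
    return r["sex"] == "female"


def evaluate_rule(rows, if_part, then_part):
    """Return (coverage, accuracy) of an IF-THEN rule on a list of records."""
    covered = [r for r in rows if if_part(r)]
    correct = [r for r in covered if then_part(r)]
    coverage = len(covered) / len(rows) if rows else 0.0
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


coverage, accuracy = evaluate_rule(records, antecedent, consequent)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Utility, comprehensibility, and novelty are not captured by these two numbers. The Case 1 rule scores well on both yet tells a domain expert nothing new.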
Life-cycle of Data Mining projects
1) Raw data
2) Pre-processing (hard work!): collection, formatting, selection, data cleaning, data integration, reduction
3) Data warehouse: filtered/cleaned data
4) Pattern discovery: data mining methods
5) Pattern analysis and interpretation
6) Knowledge!
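The life-cycle stages can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a real project pipeline: the raw rows, the field names, and the three helper functions (`preprocess`, `discover_patterns`, `interpret`) are all hypothetical, and "pattern discovery" is reduced to simple frequency counting.

```python
from collections import Counter

# Hypothetical raw data with typical defects: missing or invalid ages,
# inconsistent capitalization, stray whitespace.
raw = [
    {"age": "34", "diagnosis": "Flu "},
    {"age": "", "diagnosis": "flu"},
    {"age": "51", "diagnosis": "asthma"},
    {"age": "29", "diagnosis": "FLU"},
    {"age": "abc", "diagnosis": "asthma"},
]


def preprocess(rows):
    """Pre-processing: drop rows with unusable ages and normalize labels."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age.isdigit():
            continue  # data cleaning: discard records that cannot be repaired
        cleaned.append({"age": int(age), "diagnosis": row["diagnosis"].strip().lower()})
    return cleaned


def discover_patterns(rows):
    """Pattern discovery: here, just count how often each diagnosis occurs."""
    return Counter(r["diagnosis"] for r in rows)


def interpret(patterns, total):
    """Pattern analysis and interpretation: turn counts into a readable statement."""
    label, count = patterns.most_common(1)[0]
    return f"{label} accounts for {count}/{total} cleaned records"


cleaned = preprocess(raw)
patterns = discover_patterns(cleaned)
summary = interpret(patterns, len(cleaned))
print(summary)
```

Even in this toy case, the pre-processing function is the longest of the three, which matches the "hard work" note on that stage.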
Motivations for Data Mining
1) VERY LARGE amount of data freely available on the internet
○ E-mails and social networks
○ Business and bank transactions
○ Web page searches (web scraping!)
○ Medical and biological data
○ Scientific and astronomical data
2) Business/commercial interest ($$$)
Critical Dilemma in Data Mining
● The amount of data generated, created, stored, etc., grows exponentially
● The ability to mine, understand, and effectively use these data grows
linearly (best case!)
● Data mining may help us to understand large amounts of data by extracting useful
knowledge
* https://explodingtopics.com/blog/data-generated-per-day
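The gap between exponential and linear growth can be made concrete with a few lines of arithmetic. The growth factor and the yearly step below are hypothetical values chosen only to show the shape of the curves; they are not taken from the cited source.

```python
def data_volume(year, start=1.0, growth=1.5):
    """Exponential growth: the volume is multiplied by `growth` every year."""
    return start * growth ** year


def mining_capacity(year, start=1.0, step=0.5):
    """Linear growth: the capacity increases by a fixed `step` every year."""
    return start + step * year


# Ratio of data volume to mining capacity at years 0, 5, and 10.
gap = [(y, data_volume(y) / mining_capacity(y)) for y in range(0, 11, 5)]
for year, ratio in gap:
    print(f"year {year:2d}: data is {ratio:.1f}x the capacity to mine it")
```

Whatever constants are chosen, the ratio eventually grows without bound, which is the core of the dilemma.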
Tasks x Methods in Data Mining
Tasks → Methods
● Classification: decision trees (C4.5), classification rules, k-nearest neighbors, random forest, support vector machine, Bayesian classifier, neural network, AdaBoost
● Association rules: Apriori, FP-growth, Eclat, Zigzag
● Regression: linear regression, polynomial regression, logistic regression
● Feature selection & dimensionality reduction: principal component analysis (PCA), chi-square, entropy, information gain
● Clustering: k-means, Kohonen’s self-organizing map, density-based clustering (e.g. DBSCAN), hierarchical clustering, t-SNE
● Data visualization *: silhouette plot, scatter plot, heatmap, box plot, clusters, t-SNE
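As one concrete instance of a classification method from the list, the sketch below implements a bare-bones k-nearest-neighbors classifier using only the Python standard library. The training points, labels, and the `knn_predict` helper are hypothetical; in practice one of the tools presented later in this class would normally be used instead.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set with two numerical attributes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

pred_a = knn_predict(train, (1.1, 1.0))
pred_b = knn_predict(train, (4.0, 4.0))
print(pred_a, pred_b)
```

The method stores the training data and defers all work to prediction time, so its cost grows with the size of the dataset.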
Tasks x Methods in Data Mining
● Types of data:
○ Numerical
○ Categorical
○ Text
○ Image/video
○ Time-series/signals
● Some data types require different tasks, for instance:
○ Image, time-series/signals can be clustered or classified
○ Text can be classified, but may require other specific tasks (e.g. sentiment analysis)
Some open-source software for Data Mining
● Orange (Python): developed and maintained by the University of Ljubljana (Slovenia)
https://orangedatamining.com/
○ Easy-to-use windows interface (visual programming), add-ons for specific tasks, allows
integration with Python code.
● Weka (Java): created and maintained by the University of Waikato (New Zealand)
https://www.cs.waikato.ac.nz/ml/weka
○ Very large library of methods, community support
○ Not-so-user-friendly interface, poor documentation
● Knime (Java): developed and maintained by the University of Konstanz (Germany)
https://www.knime.com/
● Further information: https://www.datamation.com/big-data/open-source-data-mining-tools/