Data Mining and Business Intelligence
Overview
Introduction Technologies
Applications
By
Dr. Nora Shoaip
Lecture 1
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Introduction
• Why Data Mining
• What is Data Mining
• Data Mining Applications
• Categories of Mining Techniques
Why Data Mining?
The era of Explosive Growth of Data: in the petabytes!
Automated data collection and availability: tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, transactions, stocks, …
Science: Remote sensing, bioinformatics, …
Society and everyone: news, digital cameras, social feeds
The ability to economically store and manage petabytes of data online
The Internet and the computing Grid, which make all these archives universally accessible
Linear growth of data management tasks with data volumes
Massive data volumes, but still little insight!
Solution! Data mining—The automated analysis of massive data sets
What is Data Mining?
Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
o Data mining: a misnomer?
• Alternative names
o Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, information
harvesting, business intelligence, etc.
• Is everything “data mining”? No, the following are not:
o Simple search and query processing
o (Deductive) expert systems
Knowledge Discovery Process
Selection:
• Finding data relevant to the task
Preprocessing:
• Cleaning the data and putting it in a format suitable for mining
Transformation:
• Performing summaries, aggregations, or consolidation
Data Mining:
• Applying data mining algorithms to extract knowledge
Evaluation:
• Locating useful knowledge
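The five stages can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the records, the field names, the "above average spender" rule standing in for a mining algorithm, and the MIN_TOTAL threshold are all made up for the example.

```python
from collections import defaultdict

# Hypothetical raw records; every value here is invented for illustration.
raw = [
    {"customer": "c1", "item": "milk", "amount": 3.0, "store": "A"},
    {"customer": "c1", "item": "bread", "amount": 2.0, "store": "A"},
    {"customer": "c2", "item": "milk", "amount": None, "store": "B"},  # missing value
    {"customer": "c2", "item": "tea", "amount": 4.0, "store": "B"},
    {"customer": "c3", "item": "bread", "amount": 12.0, "store": "A"},
]

# 1. Selection: keep only the fields relevant to the task.
selected = [{"customer": r["customer"], "amount": r["amount"]} for r in raw]

# 2. Preprocessing: clean the data (here, drop records with a missing amount).
cleaned = [r for r in selected if r["amount"] is not None]

# 3. Transformation: aggregate to one total per customer.
totals = defaultdict(float)
for r in cleaned:
    totals[r["customer"]] += r["amount"]

# 4. Data mining: a stand-in "algorithm" that flags above-average spenders.
average = sum(totals.values()) / len(totals)
patterns = {c: t for c, t in totals.items() if t > average}

# 5. Evaluation: keep only patterns that pass an assumed usefulness threshold.
MIN_TOTAL = 10.0
useful = {c: t for c, t in patterns.items() if t >= MIN_TOTAL}

print(useful)
```

In a real project each stage is far more involved, and the stages are usually revisited several times rather than run once in sequence.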
What Kinds of Data Can Be Mined?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Applications
Understanding Customer Behavior
o Market basket analysis.
o Market segmentation.
o Targeted Advertisement.
o Market Forecasting.
o Recommender Systems
Social Networks Mining
o Community detection.
o Friends recommendation.
o Trend Analysis.
o Event detection.
o Personality prediction.
Web and Text Mining
o Web usage mining.
o Web structure mining.
o Search engines.
o Email categorization.
o Fact checking.
Categories of Mining Techniques
Descriptive Data Mining.
Predictive Data Mining.
Frequent Patterns Mining
Descriptive data mining technique.
Finds commonly occurring patterns in data.
What items are frequently purchased together in your supermarket basket?
Applied on:
Transactional data (Market Basket Analysis)
Sequential Data
Graph Data
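The supermarket question can be made concrete with a brute-force count of item pairs. This is a minimal sketch, not an efficient frequent-pattern algorithm: the baskets and the MIN_SUPPORT threshold are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (one set of items per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
    {"bread", "butter", "milk"},
]

MIN_SUPPORT = 0.6  # assumed threshold: pair must occur in at least 60% of baskets

# Count every unordered pair of items that occurs together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets that contain both items.
frequent_pairs = {
    pair: count / len(baskets)
    for pair, count in pair_counts.items()
    if count / len(baskets) >= MIN_SUPPORT
}

print(frequent_pairs)
```

Pairs involving eggs appear in only one basket out of five, so they fall below the threshold and are not reported.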
Clustering
Descriptive data mining.
Unsupervised learning.
Divide data into groups.
Applications:
Market segmentation
Community detection
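As an illustration of dividing data into groups without labels, here is a minimal sketch of k-means on one-dimensional data. The slides do not name a specific clustering algorithm; k-means, the spending values, and the starting centroids are all assumptions chosen for the example.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Plain k-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[k]
            )
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical yearly spending (in thousands) for six customers.
spending = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
labels, centers = k_means_1d(spending, centroids=[1.0, 2.0])
print(labels, centers)
```

The two resulting groups could be read as a crude market segmentation into low and high spenders; no class labels were supplied, which is what makes this unsupervised.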
Classification
Predictive data mining.
Supervised learning.
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
e.g., classify countries based on (climate),
or classify cars based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification, …
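To illustrate supervised learning on the gas mileage example, here is a minimal sketch of a one-split decision tree (a "decision stump"). The training data, the class names "low" and "high", and the function names are hypothetical; real decision tree learners choose splits with measures such as information gain rather than the raw error count used here.

```python
def train_stump(samples):
    """Fit a one-split tree: predict 'high' if value > threshold, else 'low'."""
    values = sorted(v for v, _ in samples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        # Number of training examples the threshold misclassifies.
        return sum(
            ("high" if v > threshold else "low") != label for v, label in samples
        )

    return min(candidates, key=errors)


def predict(threshold, value):
    return "high" if value > threshold else "low"


# Hypothetical training examples: (gas mileage in mpg, known class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (40, "high")]

threshold = train_stump(training)
print(threshold, predict(threshold, 28), predict(threshold, 20))
```

The model is built from labeled training examples and then used to predict the unknown class label of new cars, which is the pattern all the listed methods share.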
Know Your Data
• Data Objects & Attribute Types
• Basic Statistical Descriptions of Data
Objects and Attributes
A data object represents an entity
Also called a sample, example, instance, data point, or object (in a database: a data tuple)
e.g. customers, students, patients, books
An attribute is a data field representing a characteristic or feature of a data object
Also called an attribute (databases and data mining), dimension (data warehouses), feature (machine learning), or variable (statistics)
e.g. name, age, salary, gender, grade, …
Attribute (feature) vector: a set of attributes that describes an object
Attribute Types: Nominal Attributes
Symbols or names of things
Each value represents a category, code, or state
Also referred to as categorical
e.g. hair color, marital status, customer ID
Can be represented as numbers (coding), but the numbers carry no quantitative meaning
Qualitative
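A short sketch of coding a nominal attribute as numbers. The hair color values are made up, and the integer codes are arbitrary labels assigned in alphabetical order purely for convenience.

```python
# Hypothetical values of a nominal attribute.
hair_colors = ["black", "brown", "blond", "brown", "black"]

# Assign an arbitrary integer code to each distinct category.
codes = {color: i for i, color in enumerate(sorted(set(hair_colors)))}
encoded = [codes[c] for c in hair_colors]

print(codes)
print(encoded)
```

Because the codes are only names in numeric form, arithmetic on them (a mean hair color, for instance) is meaningless; counting how often each code occurs is still valid.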
Attribute Types: Binary Attributes
Nominal with only two values representing two states or
categories: 0 or 1 (absent or present)
Also Boolean (true or false)
Qualitative
Symmetric: both states are equally valuable and have the same
weight
e.g. gender
Asymmetric: states are not equally important
e.g. medical test outcomes
Attribute Types: Ordinal Attributes
Qualitative
Values have a meaningful order or ranking, but the magnitude of the difference between successive values is not known
e.g. professional rank, grade, customer satisfaction
Useful for data reduction: numeric attributes can be discretized into ordered categories
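A minimal sketch of turning a numeric attribute into an ordinal one by discretization. The category names and the cut-off ages (30 and 60) are assumptions chosen for illustration.

```python
LEVELS = ["young", "middle-aged", "senior"]  # ordered from lowest to highest


def age_group(age):
    """Map a numeric age to an ordinal category (cut-offs are assumptions)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [23, 27, 39, 41, 58, 61]
groups = [age_group(a) for a in ages]
ranks = [LEVELS.index(g) for g in groups]  # position in the ordering

print(groups)
print(ranks)
```

The ranks preserve order (senior > middle-aged > young) but not distance: the gap between rank 0 and rank 1 need not equal the gap between rank 1 and rank 2.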
Attribute Types: Numeric Attributes
Quantitative
Interval-scaled: measured on a scale of equal-size units
e.g. temperature in °C or °F, calendar year
No true zero point
Values cannot be expressed as multiples of one another
Ratio-scaled: has a true zero point
A value can be expressed as a multiple of another
e.g. years of experience, weight, salary
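A small sketch of why ratios are not meaningful on an interval scale. The two temperatures are arbitrary example values; the Kelvin scale is used here as a ratio scale because its zero point is a true zero.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius (interval scale) to kelvins (ratio scale)."""
    return c + 273.15


a_c, b_c = 10.0, 20.0  # two example temperatures in degrees Celsius

difference = b_c - a_c   # differences are meaningful on an interval scale
naive_ratio = b_c / a_c  # suggests "twice as hot", which is misleading
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(difference, naive_ratio, round(true_ratio, 3))
```

On the Celsius scale 20 looks like twice 10, but measured from a true zero the second temperature is only a few percent higher than the first.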
Discrete vs. Continuous Attributes
Discrete attribute: has a finite or countably infinite set of values, which may or may not be integers
e.g. hair color, smoker, medical test, Customer_ID
Customer_ID is countably infinite: its set of possible values is unbounded, but it can be put in one-to-one correspondence with the natural numbers
Continuous attribute: any attribute that is not discrete
e.g. height, weight, age
Outline
Data Objects & Attribute Types
• What is an Object?
• What is an Attribute?
• Attribute Types
• Continuous vs. Discrete
Basic Statistical Descriptions of Data
• Measuring central tendency
• Measuring Data dispersion
• Basic Graphic displays
Measuring Data similarity & dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for Nominal and Binary Attributes
• Dissimilarity of Numerical Data
Measuring Central Tendency
Mean: sum of all values divided by the number of values N
Median: middle value in a set of ordered values
If N is odd, the median is the middle value of the ordered set
If N is even, the median is not unique; by convention it is the average of the two middlemost values
Expensive to compute for a large number of observations
Mode: value that occurs most frequently in the attribute values
Works for both qualitative and quantitative attributes
Data can be unimodal, bimodal, or trimodal; if every value occurs only once, there is no mode
Example
Salary (in thousands of dollars), shown in increasing
order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Mean = ?
Median = ?
Mode = ?
• Mean = 696 / 12 = 58, i.e. $58,000
• Median = (52 + 56) / 2 = 54, i.e. $54,000
• Mode = 52 and 70, i.e. $52,000 and $70,000 (bimodal)
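The same three measures can be checked with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)
median = statistics.median(salaries)     # averages the two middle values when N is even
modes = statistics.multimode(salaries)   # every value tied for the highest frequency

print(mean, median, modes)
```

Note how the single large salary of 110 pulls the mean above the median, which is why the median is often preferred for skewed data.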
Measuring dispersion of Data
Five-Number Summary:
Minimum, Q1, median (Q2), Q3, and maximum, written in that order
Boxplots: visualization technique for the five-number summary
Whiskers terminate at the minimum and maximum, OR at the most extreme observations within 1.5 × IQR of the quartiles (IQR = Q3 - Q1), with the remaining points (outliers) plotted individually
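A minimal sketch of the five-number summary and the 1.5 × IQR rule, applied to the salary data from the central tendency example. The quartile convention used here (Q1 and Q3 as the medians of the lower and upper halves) is an assumption; other conventions exist and give slightly different quartiles. The function assumes at least two values.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # leaves out the middle value when len(data) is odd
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

minimum, q1, median, q3, maximum = five_number_summary(salaries)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points beyond the fences are outliers; whiskers stop at the most extreme
# observations that are still inside the fences.
outliers = [x for x in salaries if x < low_fence or x > high_fence]
inside = [x for x in salaries if low_fence <= x <= high_fence]
whiskers = (min(inside), max(inside))

print((minimum, q1, median, q3, maximum), iqr, outliers, whiskers)
```

Here the salary of 110 lies above the upper fence, so a boxplot would draw it as an individual point and end the upper whisker at 70.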
Exercise:
Suppose that a hospital recorded the age and body fat percentage of 18 randomly selected adults, with the following results:
Age  23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
• Calculate the mean, median, and standard deviation of age and %fat.
• Draw the boxplots for age and %fat.
• Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute their covariance.
Solution
Solution: boxplots for age and %fat
For Age
Q1 = 39, median = 51, Q3 = 57, min = 23, max = 61
IQR = 57 - 39 = 18, 1.5 × IQR = 27
newMin = 39 - 27 = 12, newMax = 57 + 27 = 84
All ages lie between newMin and newMax, so there are no outliers
For %fat
Q1 = 26.5, median = 30.7, Q3 = 34.1, min = 7.8, max = 42.5
IQR = 34.1 - 26.5 = 7.6, 1.5 × IQR = 11.4
newMin = 26.5 - 11.4 = 15.1, newMax = 34.1 + 11.4 = 45.5
7.8 and 9.5 fall below newMin, so they are plotted as outliers
%fat in increasing order: 7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2, 31.2, 31.4, 32.9, 33.4, 34.1, 34.6, 35.7, 41.2, 42.5
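The remaining parts of the exercise (mean, median, standard deviation, covariance, correlation) can be checked with a short script. This is a sketch under one stated assumption: standard deviation and covariance are computed in their population form (dividing by N); the sample form (dividing by N - 1) gives slightly larger values. The correlation coefficient is the same under either choice.

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


for name, values in (("age", age), ("%fat", fat)):
    print(name,
          round(statistics.fmean(values), 2),
          statistics.median(values),
          round(statistics.pstdev(values), 2))

cov = covariance(age, fat)
r = correlation(age, fat)
print(round(cov, 2), round(r, 2))
```

Both the covariance and the correlation coefficient come out positive, with the coefficient at about 0.82, so age and %fat are positively correlated in this sample.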
Visual Representations of Data Distributions
Histograms
Scatter plots: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
X and Y are correlated if one attribute implies the other
Correlation can be positive, negative, or null (uncorrelated)
For more than two attributes, we use a scatter plot matrix
[Figure: uncorrelated data]