Data Mining - Lecture 1

Uploaded by

hendymostafa256

Data Mining and Business Intelligence

Overview

Introduction Technologies

Applications

By Dr. Nora Shoaip
Lecture 1

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Introduction
• Why Data Mining
• What is Data Mining
• Data Mining Applications
• Categories of Mining Techniques
Why Data Mining?
• The era of explosive data growth: data volumes are now measured in petabytes
• Automated data collection and availability: tools, database systems, the Web, a computerized society
• Major sources of abundant data
o Business: Web, transactions, stocks, …
o Science: remote sensing, bioinformatics, …
o Society and everyone: news, digital cameras, social feeds
• The ability to economically store and manage petabytes of data online
• The Internet and the computing grid make all these archives universally accessible
• Data management tasks grow linearly with data volumes
• Massive data volumes, but still little insight
• The solution: data mining, the automated analysis of massive data sets

What is Data Mining?
• Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
o Data mining: a misnomer? What is mined is knowledge, not the data itself
• Alternative names
o Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
• Is everything "data mining"? No. The following are not data mining:
o Simple search and query processing
o (Deductive) expert systems

Knowledge Discovery Process
Selection
• Finding data relevant to the task
Preprocessing
• Cleaning the data and putting it in a format suitable for mining
Transformation
• Performing summaries, aggregations, or consolidation
Data Mining
• Applying data mining algorithms to extract knowledge
Evaluation
• Locating useful knowledge
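
The five steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a prescribed implementation: the purchase records, the field names (customer, item, qty, store) and the function names are all hypothetical, and the "mining" step is just a count of how many customers bought each item.

```python
from collections import Counter, defaultdict

# Hypothetical raw purchase records; None marks a missing quantity.
raw = [
    {"customer": "ann", "item": "milk", "qty": 2, "store": "A"},
    {"customer": "ann", "item": "bread", "qty": 1, "store": "A"},
    {"customer": "bob", "item": "milk", "qty": None, "store": "A"},
    {"customer": "bob", "item": "eggs", "qty": 12, "store": "A"},
    {"customer": "cat", "item": "milk", "qty": 1, "store": "A"},
    {"customer": "cat", "item": "bread", "qty": 3, "store": "A"},
    {"customer": "dan", "item": "milk", "qty": 1, "store": "B"},
]

def select(records, store):
    """Selection: keep only the records relevant to the task (one store)."""
    return [r for r in records if r["store"] == store]

def preprocess(records):
    """Preprocessing: clean the data by dropping records with missing values."""
    return [r for r in records if r["qty"] is not None]

def transform(records):
    """Transformation: consolidate records into one item set per customer."""
    baskets = defaultdict(set)
    for r in records:
        baskets[r["customer"]].add(r["item"])
    return dict(baskets)

def mine(baskets):
    """Data mining: count how many customers bought each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_customers):
    """Evaluation: keep only the patterns that pass a usefulness threshold."""
    return {item: n for item, n in counts.items() if n >= min_customers}

result = evaluate(mine(transform(preprocess(select(raw, "A")))), min_customers=2)
print(result)
```

In practice each step is far more involved, and the process is usually iterative: evaluation results often send the analyst back to an earlier step.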

What Kinds of Data Can Be Mined?
• Database-oriented data sets and applications
o Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
o Data streams and sensor data
o Time-series data, temporal data, sequence data (incl. bio-sequences)
o Structured data, graphs, social networks, and multi-linked data
o Object-relational databases
o Heterogeneous databases and legacy databases
o Spatial data and spatiotemporal data
o Multimedia databases
o Text databases
o The World-Wide Web

Data Mining Applications
• Understanding Customer Behavior
o Market basket analysis
o Market segmentation
o Targeted advertisement
o Market forecasting
o Recommender systems

Data Mining Applications cont…
• Social Networks Mining
o Community detection
o Friend recommendation
o Trend analysis
o Event detection
o Personality prediction

Data Mining Applications cont…
• Web and Text Mining
o Web usage mining
o Web structure mining
o Search engines
o Email categorization
o Fact checking

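Categories of Mining Techniques
• Descriptive data mining: characterizes properties of the existing data (e.g. frequent patterns, clustering)
• Predictive data mining: builds models from existing data to predict unknown or future values (e.g. classification)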

Frequent Patterns Mining
• Descriptive data mining technique
• Finds commonly occurring patterns in data
o e.g. which items are frequently purchased together in your supermarket basket?
• Applied to:
o Transactional data (market basket analysis)
o Sequential data
o Graph data
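
As a rough illustration of market basket analysis, the sketch below counts how often each pair of items appears together and keeps the pairs that reach a minimum support count. The baskets and the threshold are made up for the example, and the brute-force pair counting stands in for a real frequent-pattern algorithm such as Apriori, which is not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket transactions, one set of items per basket.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs", "butter"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # Sort the items so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

With these baskets, bread and milk are bought together most often, which is the kind of pattern a retailer might use for shelf placement or promotions.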

Clustering
• Descriptive data mining
• Unsupervised learning
• Divides data into groups (clusters) of similar objects
• Applications:
o Market segmentation
o Community detection
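
To make "divide data into groups" concrete, here is a minimal sketch of k-means on one-dimensional data. The lecture does not name a specific clustering algorithm, so k-means is an assumption chosen for brevity; the spending values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd-style k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spending of eight customers.
spend = [12, 15, 14, 80, 85, 90, 13, 88]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers, clusters)
```

No class labels are supplied; the two customer segments (low and high spenders) emerge from the data alone, which is what makes clustering unsupervised.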

Classification
• Predictive data mining
• Supervised learning
• Constructs models (functions) based on some training examples
• Describes and distinguishes classes or concepts for future prediction
o e.g. classify countries based on climate,
o or classify cars based on gas mileage
• Predicts some unknown class labels

Typical methods
• Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, …
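
The "classify cars based on gas mileage" example can be sketched as a one-level decision tree (a decision stump): the model is a single threshold learned from labeled training examples. The training data, the class names "economy" and "guzzler", and the function names are all hypothetical.

```python
def train_stump(examples):
    """Pick the mileage threshold that misclassifies the fewest training cars.

    examples: list of (mpg, label) pairs, label is "economy" or "guzzler".
    Assumes at least two distinct mileage values.
    """
    values = sorted({mpg for mpg, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct mileages.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, mpg) != label for mpg, label in examples)

    return min(candidates, key=errors)

def predict(threshold, mpg):
    """Apply the learned model to a car with an unknown class label."""
    return "economy" if mpg >= threshold else "guzzler"

# Hypothetical labeled training examples (miles per gallon, class).
training = [
    (15, "guzzler"), (18, "guzzler"), (22, "guzzler"),
    (30, "economy"), (35, "economy"), (41, "economy"),
]
threshold = train_stump(training)
print(threshold, predict(threshold, 28), predict(threshold, 17))
```

Unlike clustering, the class labels are given during training, and the learned model is then used to predict labels for unseen cars.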

Know Your Data
• Data Objects & Attribute Types
• Basic Statistical Descriptions of Data
Objects and Attributes
• A data object represents an entity
o Also called a sample, example, instance, data point, or object (in a database: a data tuple)
o e.g. customers, students, patients, books
• An attribute is a data field representing a characteristic or feature of a data object
o The nouns attribute, dimension, feature, and variable are used interchangeably: attribute (databases and data mining), dimension (data warehouses), feature (machine learning), variable (statistics)
o e.g. name, age, salary, gender, grade, …
• Attribute (feature) vector: a set of attributes that describes an object
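
In code, a data object and its attribute vector might look like the following sketch. The Customer class, its fields, and the sample values are hypothetical and only mirror the example attributes listed above.

```python
from dataclasses import dataclass, astuple

@dataclass
class Customer:
    """A data object: each field is one attribute of the entity."""
    name: str      # nominal attribute
    age: int       # numeric attribute
    salary: float  # numeric attribute
    gender: str    # binary attribute

# One hypothetical data object (a tuple, in database terms).
obj = Customer(name="Mona", age=34, salary=52000.0, gender="F")

# The attribute (feature) vector is the ordered collection of its values.
vector = astuple(obj)
print(vector)
```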

Attribute Types: Nominal Attributes
• Symbols or names of things
o Each value represents a category, code, or state
o Also referred to as categorical
o e.g. hair color, marital status, customer ID
• Can be represented as numbers (coding), but the numbers carry no order or arithmetic meaning
• Qualitative

Attribute Types: Binary Attributes
• Nominal with only two values representing two states or categories: 0 or 1 (absent or present)
o Also called Boolean (true or false)
• Qualitative
• Symmetric: both states are equally valuable and carry the same weight
o e.g. gender
• Asymmetric: the states are not equally important
o e.g. medical test outcomes, where the rarer and more important outcome (such as a positive result) is conventionally coded as 1

Attribute Types: Ordinal Attributes
• Qualitative
• Values have a meaningful order or ranking, but the magnitude between successive values is not known
o e.g. professional rank, grade, customer satisfaction
• Useful for data reduction of numerical attributes: a numeric range can be discretized into a few ordered categories

Attribute Types: Numeric Attributes
• Quantitative
• Interval-scaled: measured on a scale of equal-size units
o e.g. temperature in °C or °F, calendar year
o Do not have a true zero point
o Values cannot be expressed as multiples of one another
• Ratio-scaled: have a true zero point
o A value can be expressed as a multiple of another
o e.g. years of experience, weight, salary
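
A quick way to see why interval-scaled values cannot be expressed as multiples is to change the unit and watch the ratio change. The sketch below compares a temperature pair with a weight pair; the specific values are arbitrary, and the kilogram-to-pound factor is rounded.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate factor)."""
    return kg * 2.20462

# Interval-scaled: "20 °C is twice 10 °C" does not survive a unit change,
# because neither scale has a true zero point.
ratio_c = 20 / 10
ratio_f = c_to_f(20) / c_to_f(10)

# Ratio-scaled: 80 kg is twice 40 kg in any unit, because zero means "none".
ratio_kg = 80 / 40
ratio_lb = kg_to_lb(80) / kg_to_lb(40)

print(ratio_c, ratio_f)    # the two temperature ratios differ
print(ratio_kg, ratio_lb)  # the two weight ratios agree
```

Differences, on the other hand, are meaningful on both kinds of scale: a 10-degree gap in Celsius is always an 18-degree gap in Fahrenheit.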

Discrete vs. Continuous Attributes
• Discrete attribute: has a finite or countably infinite set of values, integers or otherwise
o e.g. hair color, smoker, medical test, Customer_ID
o Customer_ID is countably infinite: infinitely many possible values, but in one-to-one correspondence with the natural numbers
• If an attribute is not discrete, it is continuous
o e.g. height, weight, age

Outline
• Data Objects & Attribute Types
o What is an object?
o What is an attribute?
o Attribute types
o Continuous vs. discrete
• Basic Statistical Descriptions of Data
o Measuring central tendency
o Measuring data dispersion
o Basic graphic displays
• Measuring Data Similarity & Dissimilarity
o Data matrix & dissimilarity matrix
o Proximity measures for nominal and binary attributes
o Dissimilarity of numerical data

Measuring Central Tendency
• Mean: the sum of all values divided by the number of values N; sensitive to extreme values
• Median: the middle value in a set of ordered values
o N is odd: the median is the middle value of the ordered set
o N is even: the median is not unique; by convention, the average of the two middlemost values is used
o Expensive to compute for a large number of observations
• Mode: the value that occurs most frequently among the attribute values
o Works for both qualitative and quantitative attributes
o Data can be unimodal, bimodal, or trimodal; if every value occurs only once, there is no mode

Measuring Central Tendency
Example
• Salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
o Mean = ?
o Median = ?
o Mode = ?

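Measuring Central Tendency
Example (solution)
Salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110

• Mean = 696 / 12 = 58, i.e. $58,000
• Median = (52 + 56) / 2 = 54, i.e. $54,000
• Mode = 52 and 70 (each occurs twice), i.e. $52,000 and $70,000: the data is bimodal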
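
These three values can be checked with Python's standard `statistics` module. This is just a verification sketch; `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_mean = mean(salaries)         # sum of values divided by N
salary_median = median(salaries)     # N is even: average of the two middle values
salary_modes = multimode(salaries)   # every value tied for the highest frequency

print(salary_mean, salary_median, salary_modes)
```

Note how the single large salary of 110 pulls the mean (58) above the median (54), which illustrates the mean's sensitivity to extreme values.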

Measuring Dispersion of Data
• Five-number summary:
o Minimum, Q1, median (Q2), Q3, and maximum, in that order
• Boxplots: a visualization technique for the five-number summary
o Whiskers terminate at the minimum and maximum, OR at the most extreme observations within 1.5 × IQR of the quartiles (IQR = Q3 - Q1), with the remaining points (outliers) plotted individually
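
The numbers a boxplot needs can be computed as in the sketch below. Quartile conventions vary between textbooks and libraries; this version assumes the "median of each half" rule (the middle value is excluded from both halves when N is odd), and the sample data is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For odd N the middle value belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

def boxplot_stats(values, k=1.5):
    """Compute IQR, fences, whisker ends, and outliers for a boxplot."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical sample with one extreme value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = five_number_summary(sample)
stats = boxplot_stats(sample)
print(summary, stats)
```

Here the upper whisker stops at 9, the largest observation inside the upper fence, and 30 would be drawn as an individual outlier point.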
Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults, with the following results:

Age   23   23   27   27   39   41   47   49   50   52   54   54   56   57   58   58   60   61
%fat  9.5  26.5 7.8  17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

• Calculate the mean, median, and standard deviation of age and %fat.
• Draw the boxplots for age and %fat.
• Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute their covariance.

Solution
Draw the boxplots for age and %fat.

For age (sorted): 23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61
• Q1 = 39, median = 51, Q3 = 57, min = 23, max = 61
• IQR = 57 - 39 = 18, 1.5 × IQR = 27
• Lower fence (newMin) = 39 - 27 = 12, upper fence (newMax) = 57 + 27 = 84
• All ages lie between 12 and 84, so there are no outliers and the whiskers end at 23 and 61

For %fat (sorted): 7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2, 31.2, 31.4, 32.9, 33.4, 34.1, 34.6, 35.7, 41.2, 42.5
• Q1 = 26.5, median = 30.7, Q3 = 34.1, min = 7.8, max = 42.5
• IQR = 34.1 - 26.5 = 7.6, 1.5 × IQR = 11.4
• Lower fence (newMin) = 26.5 - 11.4 = 15.1, upper fence (newMax) = 34.1 + 11.4 = 45.5
• 7.8 and 9.5 fall below 15.1 and are plotted as outliers; the whiskers end at 17.8 and 42.5
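
The first part of the exercise (mean, median, and standard deviation) is not worked out above. The sketch below computes it with Python's `statistics` module. Whether the exercise intends the population (divide by N) or the sample (divide by N - 1) standard deviation is not stated, so both are shown; the `describe` helper is a hypothetical name.

```python
from statistics import mean, median, pstdev, stdev

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def describe(values):
    """Return the basic central-tendency and dispersion measures."""
    return {
        "mean": mean(values),
        "median": median(values),
        "population_std": pstdev(values),  # divides by N
        "sample_std": stdev(values),       # divides by N - 1
    }

age_stats = describe(age)
fat_stats = describe(fat)
for name, stats in (("age", age_stats), ("%fat", fat_stats)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The medians agree with the boxplot values above (51 for age, 30.7 for %fat).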
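Visual Representations of Data Distributions
• Histograms: bars show how many values fall into each range (bucket) of an attribute
• Scatter plots: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
o X and Y are correlated if one attribute implies the other, i.e. knowing X tells us something about the likely value of Y
o The correlation can be positive, negative, or null (uncorrelated)
o For more than two attributes, we use a scatter plot matrix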
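
What a scatter plot shows visually can be quantified with the covariance and the Pearson correlation coefficient. The sketch below applies both to the age and %fat data from the earlier exercise; it assumes the population form of the covariance (dividing by N), and the function names are illustrative.

```python
from math import sqrt

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

cov = covariance(age, fat)
r = correlation(age, fat)
print(round(cov, 2), round(r, 2))
```

A positive coefficient means the points in the scatter plot trend upward (body fat tends to increase with age in this sample); a value near zero would correspond to the uncorrelated case.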
Visual Representations of Data Distributions
[Figure: scatter plots of uncorrelated data]
