Introduction to Data Mining
Why use Data Mining?
Lecturer: Abdullahi Ahamad Shehu
(M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
2.
Contents
• Data vs. Information
• Data mining
• Methodology
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
3.
Data Banks
• Nowadays we collect vast amounts of data, e.g.
• Shopping lists
• Bank transactions
• Medical records
• Web logs
• Drilling information (bottom hole pressure, mud flow, porosity, permeability …)
• Pandemic data (positive cases, hospitalisations, deaths, countries, population …)
• Weather data
• Raw data is not very useful
• Huge volume of data makes it difficult to handle.
4.
Getting Information from Data
• Information is required in order to solve problems.
• Data can be a superb source of information.
• This may be difficult to extract due to the volume of data.
• BUT once extracted, we can get an understanding of the
problem domain
• E.g.
• Customer profiles vs. what they buy
• Credit card transactions vs. fraud
• Drilling data vs. potential problem with drill (or scale or hydrate formation)
5.
Information
• Information is required in order to solve problems.
• E.g. discover fraudulent credit card use
• Input: various data regarding the current transaction.
• Output: whether the current transaction is fraudulent or not.
• Information: how to determine fraudulent transactions, extracted from records of past transactions (including whether they were fraudulent or not).
6.
Contents
• Data vs. Information
• Data mining
• Methodology
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
7.
Data mining
• Data mining is the process of extracting information which is implicitly stored in collections of data.
• Used to:
• Solve new problems (e.g. detect credit card fraud)
• Understand problems and their solutions (e.g. understand what
situations may lead to fraud).
• Main challenges:
• Work with large volumes of data
• Distinguish between interesting and uninteresting information
• Work with inaccurate and incomplete sets of data.
8.
…
• Aim: find strong patterns in data
• Pattern strength is related to prediction strength
BUT
• Most patterns contained in data are not interesting
• Patterns may be
• Not always true (inexact)
• The result of chance (spurious)
• Missing data
• Inaccurate or erroneous data
9.
Example
• Shopping
• Strong pattern – people who buy bread also buy milk
• But this is not interesting!
• Weaker pattern – men who buy nappies on a Friday also buy beer
• More interesting
• Weaker – some men buy only nappies …
• Missing data – the gender of the shopper is unknown for some
transactions
• Inaccurate data – the gender of the shopper might have been
entered incorrectly
10.
Data Mining Requirements
• Data: a (large) set of past data [including outcome].
• Machine learning: one or more programs which extract relationships (patterns) between data, i.e. information.
• Evaluation: is the output always/mostly/rarely correct? What kind of errors?
11.
Machine Learning
• Used in data mining to obtain relationships (patterns) between data
• Learning
• Capable of changing behaviour in order to perform better
• Learning from examples
• Training data: examples used for learning
• [Validation data: examples used for tuning parameters]
• Test data: examples used to test learnt knowledge (a minimal train/test split is sketched below).
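Not from the slides: a minimal sketch of holding some examples back for testing, assuming scikit-learn is available; the toy transaction features and labels are invented for illustration.

```python
# A minimal sketch (assumes scikit-learn is installed); the toy "transaction"
# features and fraud labels are illustrative assumptions, not lecture data.
from sklearn.model_selection import train_test_split

# each row: [amount, hour_of_day]; label: 1 = fraudulent, 0 = genuine
X = [[120.0, 14], [9.5, 3], [4500.0, 2], [60.0, 18], [2300.0, 1], [15.0, 12]]
y = [0, 0, 1, 0, 1, 0]

# hold back a test set for evaluating the learnt knowledge;
# a further split of the training part could serve as validation data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

print(len(X_train), "training examples,", len(X_test), "test examples")
```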
Types of Data Mining
• SUPERVISED (prediction)
• Classification: predicts class for new problem. E.g.
• Fraudulent transaction or not
• Fault diagnosis
• Regression: predicts numeric solution for new problem. E.g.
• House price
• Others
• Time Series: regression where measurements are taken over time (classification and regression are sketched below).
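Not part of the lecture: a minimal sketch of the two supervised tasks, assuming scikit-learn is available; the tiny fraud-style and house-price datasets are invented for illustration.

```python
# A minimal sketch of the two supervised tasks (assumes scikit-learn);
# the tiny datasets are made-up illustrations, not lecture data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a class (e.g. fraudulent transaction or not)
X_cls = [[4500.0, 2], [12.0, 14], [3800.0, 3], [25.0, 11]]
y_cls = ["fraud", "ok", "fraud", "ok"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[4000.0, 1]]))        # -> a class label

# Regression: predict a number (e.g. house price, arbitrary units)
X_reg = [[50], [80], [120], [200]]       # floor area
y_reg = [100, 150, 220, 380]             # price
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))              # -> a numeric prediction
```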
14.
Types of Data Mining
• UNSUPERVISED (knowledge discovery)
• Association Rules: find patterns in data
• Purchasing habits in supermarkets
• Clustering: groups data into clusters of similar cases (see the sketch below)
• Text Mining: extracts useful concepts from text data
• Others
• Summarisation: find compact definitions of data
• Deviation Detection: detects changes from norm.
• Database Segmentation: divides large DB into smaller databases which can
solve sub-problems.
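A minimal clustering sketch, again assuming scikit-learn; the customer feature values are invented for illustration.

```python
# A minimal clustering sketch (assumes scikit-learn); the two-feature
# "customer" points are invented for illustration.
from sklearn.cluster import KMeans

# e.g. customers described by [visits per month, average basket value]
customers = [[2, 15], [3, 18], [2, 20], [20, 150], [22, 160], [19, 140]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # cluster assignment for each customer
```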
15.
Supervised Data Mining
[Diagram: the supervised data mining workflow — 1. Input: training data (attribute columns A–E plus a Y/N outcome) is fed to the learner; 2. Output: a model (e.g. IF … AND … THEN … rules) selected from the concept space; 3. Evaluate: the model is run on held-out test data; 4. Evaluation result of the model; 5. Make prediction: the model is applied to new data whose outcome is unknown.]
Contents
• Data vs. Information
• Data mining
• Methodology
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
18.
Methodology
• A popular methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM)
• An agile methodology organised as a cycle, where
• There is no strict sequence between stages
• Movement between stages is forwards as well as backwards
[Diagram: the CRISP-DM cycle — Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment, all revolving around the Data.]
19.
Contents
• Data vs. Information
• Data mining
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
20.
Simple example: contact lenses

Age             Spec. prescription   Astigmatism   Tear production rate   Lenses
Pre-presbyopic  Hypermetrope         No            Reduced                None
Young           Myope                No            Reduced                None
Young           Hypermetrope         No            Normal                 Soft
Presbyopic      Myope                Yes           Normal                 Hard
…

Target: decide whether somebody needs contact lenses depending on their age,
spectacle prescription, astigmatism and tear production rate
21.
Information re. Lenses
• Sample rule (transcribed into code below)
If tear production rate = reduced
then lenses = none
else
if age = young
and astigmatism = no
then lenses = soft
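The sample rule above, transcribed directly into a small Python function; the function name and the None result for cases the rule does not cover are my additions.

```python
# The sample rule from the slide, written as a small Python function.
def lenses(age, astigmatism, tear_production_rate):
    """Return the recommended lenses for one person."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatism == "no":
        return "soft"
    return None  # the rule on the slide does not cover the remaining cases

print(lenses("young", "no", "normal"))         # -> soft
print(lenses("presbyopic", "yes", "reduced"))  # -> none
```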
22.
Example: Shall we play?

Outlook   Temp   Humidity   Wind   Play?
Sunny     Hot    High       No     No
Sunny     Hot    High       Yes    No
Cloudy    Hot    High       No     Yes
Rainy     Mild   Normal     No     Yes

Assuming 3 possible values for outlook, 3 for temperature, 2 for humidity and
2 for wind, there are 3 * 3 * 2 * 2 = 36 possible combinations
23.
Shall we play? Decision list
• If outlook = sunny
and humidity = high
then play = no
• If outlook = rainy
and wind = yes
then play = no
• If outlook = cloudy
then play = yes
• If humidity = normal
then play = yes
• If none of the above rules applies
then play = yes (this decision list is transcribed into code below)
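The decision list above, transcribed into Python; rules are tried in order and the first rule that applies decides.

```python
# The decision list from the slide: rules are tried in order and the first
# one that applies determines the answer.
def play(outlook, humidity, wind):
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and wind == "yes":
        return "no"
    if outlook == "cloudy":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default when none of the above rules applies

print(play("sunny", "high", "no"))    # -> no
print(play("rainy", "normal", "no"))  # -> yes
```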
24.
Shall we play? Numeric values

Outlook   Temp   Humidity   Wind   Play?
Sunny     85     85         No     No
Sunny     80     90         Yes    No
Cloudy    83     86         No     Yes
Rainy     70     96         No     Yes

Requires inequalities to deal with numeric values, e.g. (sketched below):
if outlook = sunny
and humidity > 83
then play = no
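The same idea with a numeric attribute: the humidity threshold from the slide, with an assumed default of "yes" for the cases the rule does not cover.

```python
# The slide's numeric rule: an inequality test on humidity.
def play_numeric(outlook, humidity):
    if outlook == "sunny" and humidity > 83:
        return "no"
    return "yes"  # illustrative default; the slide shows only the one rule

print(play_numeric("sunny", 85))  # -> no
print(play_numeric("sunny", 80))  # -> yes
```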
25.
Information presented
• May be
• Complete, i.e. covers all possibilities
• Incomplete
• Accuracy may be
• 100%, i.e. works all the time
• < 100%
26.
Type of information
• Classification rule: predicts the value of a particular attribute
• Association rule: predicts the value of a single or a combination of
attributes. Unlike with classification, there is no target attribute to learn
• E.g. if temperature = cool
then humidity = normal
if humidity = normal
and wind = no
then play = yes
27.
Predicting CPU performance
• Computer configurations

Cycle time   Min mem   Max mem   Cache   Min channels   Max channels   Performance
125          256       6000      256     16             128            198
29           8000      32000     32      8              32             269
…

Target: calculate performance using the other attributes
28.
Linear Regression
• Rules include a weighted function
• E.g. (applied to the table's rows in the sketch below):
• Performance =
- 55.9
+ 0.0489 cycle time
+ 0.0153 min memory
+ 0.0056 max memory
+ 0.641 cache
- 0.27 min channels
+ 1.48 max channels
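The weighted function above, applied to the two configurations from the CPU performance table. Linear regression fits the data overall, so the predictions only approximate the actual performance values (198 and 269).

```python
# The slide's weighted function applied to the two example configurations
# from the CPU performance table.
def predicted_performance(cycle_time, min_mem, max_mem, cache,
                          min_channels, max_channels):
    return (-55.9
            + 0.0489 * cycle_time
            + 0.0153 * min_mem
            + 0.0056 * max_mem
            + 0.641 * cache
            - 0.27 * min_channels
            + 1.48 * max_channels)

print(predicted_performance(125, 256, 6000, 256, 16, 128))  # actual: 198
print(predicted_performance(29, 8000, 32000, 32, 8, 32))    # actual: 269
```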
29.
Labour negotiations
Attribute                Type                            1      2      3      …   40
Duration                 Number                          1      2      3      …   2
1st wage incr.           %                               2      4      4.3    …   4.5
2nd wage incr.           %                               ?      5      4.4    …   4
3rd wage incr.           %                               ?      ?      ?      …   ?
Cost of living adjust.   {none, tcf, tc}                 none   tcf    ?      …   none
Hours/week               Number                          28     35     38     …   40
Pension                  {none, ret-allw, emp-contr.}    none   ?      ?      …   ?
Standby pay              %                               ?      13     ?      …   ?
Statutory holidays       Number                          11     15     12     …   12
…                        …                               …      …      …      …   …
Acceptable               {bad, good}                     good   good   good   …   good
30.
Labour negotiations
Decision tree: an approximation. Not always right.
[Diagram: a simple decision tree — test wage increase in 1st year: <= 2.5 → bad; > 2.5 → test statutory holidays: > 10 → good; <= 10 → test wage increase in 1st year: <= 4 → bad; > 4 → good.]
31.
Labour Negotiations
Accurate decision tree for the examples, but it overfits the data.
[Diagram: a larger decision tree — test wage increase in 1st year: <= 2.5 → test hours/week: <= 36 → bad; > 36 → test health plan: none → bad, half → good, full → bad; > 2.5 → test statutory holidays: > 10 → good; <= 10 → test wage increase in 1st year: <= 4 → bad; > 4 → good.]
32.
Market basket data
Customer   Beer   Nappies   Bread   ...
1          yes    no        yes
2          yes    yes       no
3          no     yes       yes
4          no     no        no
Other application examples include:
Amazon buying habits
Word usage in email or text communication
Unsupervised task => associate co-occurrences
33.
Output: Association Rules
• Association rules
• If beer = yes and crisps = no then nappy = yes
• If beer = yes then nappy = yes and bread = no
Different from
• If outlook = sunny and windy = no then play = yes
• Predicted attribute changes [not always play]
• Like classification rules BUT
• used to infer the value of any attribute (not just the class)
• or a combination of attributes (a support/confidence sketch follows below)
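A minimal sketch measuring the "beer ⇒ nappies" association on the market-basket table shown earlier; support and confidence are the standard association-rule measures, although the slide itself does not define them.

```python
# Measuring one association ("beer => nappies") on the market-basket table.
transactions = [
    {"beer": "yes", "nappies": "no",  "bread": "yes"},  # customer 1
    {"beer": "yes", "nappies": "yes", "bread": "no"},   # customer 2
    {"beer": "no",  "nappies": "yes", "bread": "yes"},  # customer 3
    {"beer": "no",  "nappies": "no",  "bread": "no"},   # customer 4
]

both = sum(t["beer"] == "yes" and t["nappies"] == "yes" for t in transactions)
beer = sum(t["beer"] == "yes" for t in transactions)

support = both / len(transactions)   # how often the whole pattern occurs
confidence = both / beer             # how often nappies follow, given beer
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```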
34.
Clustering Collection of Documents

Doc. ID   Keywords in Title                        Keywords in Body text
125       {Explosive Data Growth}                  {Software got better, and open-source movements and also the science of analysis became better...}
29        {Data Mining Community's Top Resource}   {special techniques are used to find patterns in data}
…
Unsupervised task: form clusters
• Generally the bag of words will need to be pre-processed and converted into a vector form to enable a data mining algorithm to work with it (see the sketch below)
• Other examples: clustering of images, clustering of customers by similar buying habits, etc.
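A minimal pre-processing sketch, assuming a recent scikit-learn: the documents' words are turned into count vectors that a clustering algorithm can then work with. The two short strings only paraphrase the table above.

```python
# Turn a small bag of words into count vectors (assumes scikit-learn >= 1.0).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "explosive data growth software open-source analysis",
    "data mining community top resource techniques patterns data",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # documents x vocabulary count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```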
Contents
• Data vs. Information
• Data mining
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
37.
Applications
• Automatic estimation of organisms in zooplankton samples
• Maintenance schedules of heavy machinery.
• Autoclave layout for aircraft parts
• Automated completion of repetitive forms
• Loan decision-making
• Image screening
• …etc
38.
Should an applicant get a loan?
• Statistical model deals with 90% cases
• 10% cases referred to loan officers
• 50% of referred cases are bad
• BUT referred customers generate money!!!
• Expert gets 50% of referred cases right
• Solution: use data mining to aid decision of borderline cases
39.
Should an applicant get a loan?
• 1000 training examples
• 20 attributes
• Extracted rules accurately predict 70% of referred cases
• Much better than human expert!
• Rules could be used to explain to customers the reasons for the
company’s decision.
40.
Detecting Oil Spills from Images
• Data: radar satellite images
• Oil spills: dark regions with changing size and shape
• BUT weather conditions can also cause this effect!!!
• So spill detection is a specialised job.
• Problems:
• very few training examples
• data is not balanced (most dark areas are NOT spills)
41.
Detecting Oil Spills from Images
• Normalised image used for extraction of dark regions
• 7 attributes used: size, shape, area, intensity, sharpness and
jaggedness of boundaries, proximity to other regions, info about
background in vicinity of region.
• Batch: regions from a specific image
• Adjustable false alarm rate required (see the threshold sketch below)
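Not from the slides: a minimal sketch of an adjustable false alarm rate, in which the detector's scores are compared with a movable threshold; the scores and ground-truth labels are invented for illustration.

```python
# Adjustable false-alarm rate: moving the threshold on the detector's scores
# trades missed spills against false alarms. Scores/labels are invented.
scores = [0.92, 0.15, 0.40, 0.81, 0.05, 0.33, 0.70, 0.22]  # one per dark region
is_spill = [1,   0,    0,    1,    0,    0,    0,    0]     # ground truth

def detections(threshold):
    flagged = [s >= threshold for s in scores]
    hits = sum(f and y for f, y in zip(flagged, is_spill))
    false_alarms = sum(f and not y for f, y in zip(flagged, is_spill))
    return hits, false_alarms

# a lower threshold finds more spills but raises more false alarms
for t in (0.3, 0.5, 0.8):
    print(f"threshold {t}: {detections(t)} (hits, false alarms)")
```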
42.
Contents
• Data vs. Information
• Data mining
• Methodology
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
43.
Generalisation as search
• Construct the space of all possible concept (target to learn) descriptions: the concept space.
• Search through the space for a description that fits the data.
[Diagram: two descriptions that fit the data.]
44.
Concept space
• The set of possible concept descriptions may be enormous.
• E.g. deciding whether to play or not (the weather problem):
• 4 possibilities for outlook: sunny, overcast, rainy or not in the rule.
• 4 for temperature, 3 for humidity, 3 for wind and 2 for play (the outcome, so it has to be in the rule).
• 4 * 4 * 3 * 3 * 2 = 288 possibilities for each rule.
• Assumption: the rule set is no bigger than the data set (14 examples).
• Approx. 2.7 * 10^34 different rule sets (see the calculation sketch below)!
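A small calculation that reproduces the slide's order-of-magnitude estimate; it assumes a rule set is an ordered list of at most 14 rules, each chosen from the 288 possible rules (the exact counting convention is an assumption, not from the lecture).

```python
# Each rule has 4 * 4 * 3 * 3 * 2 = 288 possible forms; a rule set is treated
# here as a list of between 1 and 14 such rules.
per_rule = 4 * 4 * 3 * 3 * 2
rule_sets = sum(per_rule ** k for k in range(1, 15))

print(per_rule)             # 288
print(f"{rule_sets:.1e}")   # roughly 2.7e+34, as on the slide
```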
45.
Enumerating concept space
• There are techniques to make enumeration more feasible.
• But
• It is rare to find only ONE acceptable description
• We may find several (lots): which is best?
• Or find none (the description language is not expressive enough, or the data is noisy)
• Machine learning techniques use heuristics to narrow down the
search
• Heuristic: a rule of thumb, a “trick” which usually works but is not guaranteed to find an (optimal) solution.
46.
Bias
• Machine learning techniques bias the search by
• Choosing a concept description language: language bias
• Selecting the order in which space is searched: search bias
• Avoiding overfitting: overfitting-avoidance bias
47.
Language bias
• Does the language restrict the concepts which can be learnt?
• Concept: divides data into sets of examples - one for each class (solution, outcome) value.
• Universal language: can express all possible subsets of examples.
• Domain knowledge: redundant or impossible combinations of attribute values are not
considered.
• Reduction of the search space
• Disjunction (or): ensures language can represent any subset when using rules.
• Can be expressed using a separate rule for each option.
• If a or b then c → if a then c
if b then c
48.
Search bias
• Many concept descriptions fit the data
• Find best
• Simplest?
• Fit: statistically agrees with the data
• So there may be some cases where it doesn’t agree with the data.
• Best description: use heuristic to search
• it may not be optimal
• E.g. finding best rule at each stage may not give best combination of rules.
• Type of search
• Start with a general description and specialise (sketched below)
• Start with specific description and generalise
• Overfitting avoidance bias: bias towards simple concept descriptions
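Not part of the lecture: a minimal sketch of general-to-specific search on the weather data, starting from the most general rule and greedily adding the condition that most improves accuracy; the stopping rule and the target class ("play = yes") are my assumptions.

```python
# General-to-specific search: start from the most general rule (no conditions)
# and greedily add the attribute=value test that best improves accuracy.
weather = [
    {"outlook": "sunny",  "humidity": "high",   "wind": "no",  "play": "no"},
    {"outlook": "sunny",  "humidity": "high",   "wind": "yes", "play": "no"},
    {"outlook": "cloudy", "humidity": "high",   "wind": "no",  "play": "yes"},
    {"outlook": "rainy",  "humidity": "normal", "wind": "no",  "play": "yes"},
]

def accuracy(conditions, data, target=("play", "yes")):
    """Fraction of covered examples that have the target value."""
    covered = [row for row in data
               if all(row[a] == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(row[target[0]] == target[1] for row in covered) / len(covered)

rule = []  # the most general rule: covers everything
for _ in range(2):  # specialise at most twice (heuristic cut-off)
    candidates = [(a, v) for row in weather for a, v in row.items()
                  if a != "play" and (a, v) not in rule]
    best = max(candidates, key=lambda c: accuracy(rule + [c], weather))
    if accuracy(rule + [best], weather) <= accuracy(rule, weather):
        break  # no candidate improves the rule, stop specialising
    rule.append(best)

print("IF", " AND ".join(f"{a} = {v}" for a, v in rule), "THEN play = yes")
```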
49.
Contents
• Data vs. Information
• Data mining
• Methodology
• Examples: input and output
• Applications
• Generalisation as search
• Ethical and professional issues
• Summary
50.
Ethical and professional issues
• GDPR
• The UK Government Data Ethics framework
• The BCS code of conduct
51.
Data Protection
• GDPR describes how (personal) data should be used by organisations, businesses, the government and the general public. See ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/2018-reform-eu-data-protection-rules_en [accessed 17/09/2019]
• It includes
• Data processing
• Data movement
52.
Ethical Issues
• How are ethical issues dealt with?
• E.g. use applicant’s sex, religion or race in order to decide whether to give a
loan - unethical
• BUT these same attributes are OK when used in a medical application
• The use of data for certain applications may pose problems
• E.g. postcode may be a strong indicator of an individual’s race.
• Data collected for a particular reason should not be used (using data mining) for a
completely different purpose without appropriate consent.
• Information mined may be surprising: red car owners are more likely to have
problems paying their car loans in France.
53.
Ethical issues
• Anonymisation of data
• Does NOT guarantee data is “anonymous”
• E.g. Staff satisfaction questionnaire which asks for race and position
• There may be only one person of that race with that position
• E.g. 85% of Americans can be identified by postcode, birth date and gender
• In the UK, postcode and car model may be enough to identify a person even if car model is
“common”.
54.
Ethical issues
• Output from data mining must be carefully considered
• Arguments purely based on statistics are not sufficient
• Caveats should be put on conclusions
55.
The data ethics framework
• See
• https://www.gov.uk/government/publications/data-ethics-framework/data-ethics-framework [accessed 25/09/2020]
• Main principles
1. Start with clear user need and public benefit
2. Be aware of relevant legislation and codes of practice
3. Use data that is proportionate to the user need
4. Understand the limitations of the data
5. Ensure robust practices and work within your skillset
6. Make your work transparent and be accountable
7. Embed data use responsibly
56.
The data ethics workbook
• “Should be completed collectively by practitioners, data governance or
information assurance specialists, and subject matter experts like
service staff or policy professionals”
• Also decide how often to reassess the project with respect to the
framework principles.
• See questions to be answered at
• https://www.gov.uk/government/publications/data-ethics-workbook/data-ethics-workbook [accessed 25/09/2020]
57.
BCS professional conduct
• The British Computer Society has a professional code of conduct
available at
• https://www.bcs.org/membership/become-a-member/bcs-code-of-conduct/ [accessed 25/09/2020]
• Principles
• Make IT for everyone
• Show what you know, learn what you don’t
• Respect the organisation or the individual you work for
• Keep IT real, keep IT professional, pass IT on.
58.
Contents
• Data vs. Information
• Data mining
• Examples: input and output
• Data mining and machine learning
• Applications
• Generalisation as search
• Ethical issues
• Summary
59.
Summary
• Very valuable information can be extracted from data
• Relies on a large set of examples and machine learning techniques.
• Methodology is often agile, e.g. CRISP-DM
• Format of input and output constrain what can be learnt.
• Wide range of applications.
• Ethical issues restrict use of data for certain purposes.