Data Mining: Core Concepts & Introduction
Data Mining
By Engr. Ahsan Shah
Example | Explanation
🛒 Market Basket Rule: "If a customer buys bread, they also buy butter." | Association pattern
📈 Sales increase during holidays. | Trend pattern
👩‍⚕️ Certain symptoms often appear together in patients. | Correlation pattern
💳 An unusually high transaction may signal fraud. | Anomaly pattern
What is Data Mining?
Data Mining is the process of discovering useful patterns, relationships, and insights from large sets of data using statistical, mathematical, and computational techniques.
In Data Mining, a pattern means a useful, meaningful, and valid relationship or structure found in data.
These patterns help understand behavior, predict outcomes, and make business decisions.
Transaction ID | Items Bought
T1 | Milk, Bread, Butter
T2 | Milk, Bread
T3 | Bread, Butter
T4 | Milk, Bread, Butter, Eggs
Association Patterns (Dependency Rules)
Show relationships between items or variables.
Common in market basket analysis.
If {Milk, Bread} → {Butter} (Support = 30%, Confidence = 80%):
30% of transactions include milk, bread, and butter;
80% of customers who buy milk and bread also buy butter.
Techniques Used
Apriori Algorithm
FP-Growth Algorithm
What is the Apriori Algorithm?
The Apriori Algorithm is a classic data mining algorithm used to find frequent itemsets and generate association rules from large transactional databases.
It finds patterns like:
"If a customer buys Milk and Bread, they also buy Butter."
Before understanding how Apriori works, you must know these important terms:
1. Itemset
A set of items bought together in a transaction.
Example: {Milk, Bread}
2. Support
Measures how frequently an itemset appears in the dataset.
Support(A → B) = Transactions containing (A ∪ B) / Total transactions
If 2 out of 4 transactions contain {Milk, Bread},
Support = 2/4 = 0.5 = 50%
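As a quick sketch in plain Python (using the four-transaction table above), support is just the fraction of transactions that contain every item in the itemset:

```python
# Sample transactions from the table above
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

print(support({"Milk", "Bread", "Butter"}, transactions))  # 2 of 4 → 0.5
print(support({"Milk", "Bread"}, transactions))            # 3 of 4 → 0.75
```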
What does the formula mean?
It is used to measure the confidence of the rule A ⇒ B, which asks:
"How likely is B to be true when A is true?"
Support(A): The proportion of transactions in the dataset that contain itemset A.
Support(A ∪ B): The proportion of transactions that contain both A and B together.
Confidence(A → B): The likelihood that a transaction containing A also contains B.
It is calculated by:
Confidence(A → B) = Transactions with both A and B / Transactions with A
Suppose you have 100 transactions from a store:
40 contain milk (A)
30 contain both milk and bread (A ∪ B)
Support(A) = 40/100 = 0.40
Support(A ∪ B) = 30/100 = 0.30
Confidence(A → B) = 0.30 / 0.40 = 0.75
So, 75% of the customers who bought milk also bought bread.
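The store example above works out the same way in plain Python:

```python
# 100 transactions: 40 contain milk, 30 contain both milk and bread
support_A = 40 / 100        # Support(Milk)
support_AB = 30 / 100       # Support(Milk ∪ Bread)

# Confidence(A → B) = Support(A ∪ B) / Support(A)
confidence = support_AB / support_A
print(confidence)  # 0.75 → 75% of milk buyers also bought bread
```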
3. Confidence
The confidence formula helps quantify the strength of a rule in association mining. A higher confidence means a stronger rule: if A happens, B is more likely to happen.
Candidate 2-itemsets:
Candidate | Support (%)
{Milk, Bread} | 75%
{Milk, Butter} | 50%
{Bread, Butter} | 75%

Candidate 3-itemsets:
Candidate | Support (%)
{Milk, Bread, Butter} | 50%

Rule | Support | Confidence | Lift
Milk → Bread | 75% | 100% | 1
Bread → Butter | 75% | 75% | 1
Milk, Bread → Butter | 50% | 66.7% | ≈0.89 (<1)
4. Lift
Lift(A → B) = Confidence(A → B) / Support(B)
✅ If Lift > 1 → positive relationship
✅ If Lift = 1 → no relationship
✅ If Lift < 1 → negative relationship
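As a quick check in plain Python (support values taken from the transaction table above; Support(Bread) = 100%, Support(Butter) = 75%), lift is confidence divided by the consequent's support:

```python
def lift(support_ab, support_a, support_b):
    """Lift(A → B) = Confidence(A → B) / Support(B)
                   = Support(A ∪ B) / (Support(A) * Support(B))."""
    return support_ab / (support_a * support_b)

# Milk → Bread: Support(A ∪ B) = 0.75, Support(Milk) = 0.75, Support(Bread) = 1.0
print(lift(0.75, 0.75, 1.0))   # 1.0 → no relationship

# {Milk, Bread} → Butter: Support(A ∪ B) = 0.50, Support(Milk ∪ Bread) = 0.75,
# Support(Butter) = 0.75
print(lift(0.50, 0.75, 0.75))  # ≈0.89 → slightly negative relationship
```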
Steps in the Apriori Algorithm
Step 1: Set Minimum Support and Confidence
Decide thresholds (e.g., support ≥ 50%, confidence ≥ 60%).
Step 2: Find All Frequent 1-Itemsets
Count support for each single item.
Keep only items meeting the minimum support threshold.
Step 3: Generate Candidate 2-Itemsets
Combine frequent 1-itemsets into pairs; keep those meeting minimum support.
Step 4: Generate Candidate 3-Itemsets
Combine frequent 2-itemsets.
Step 5: Generate Association Rules
From the frequent itemsets, keep the rules that meet the minimum confidence threshold.
Step | Description | Example Output
1 | Choose min support/confidence | Support ≥ 50%, Confidence ≥ 60%
2 | Generate 1-itemsets | {Milk}, {Bread}, {Butter}
3 | Generate 2-itemsets | {Milk, Bread}, {Bread, Butter}
4 | Generate 3-itemsets | {Milk, Bread, Butter}
5 | Generate rules | Milk & Bread → Butter
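The level-wise search above can be sketched in plain Python. This is a minimal version; a full Apriori implementation would also prune candidates whose subsets are infrequent before counting:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: grow frequent k-itemsets level by level."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    k = 1
    current = [frozenset([i]) for i in items]  # level-1 candidates
    while current:
        # Count support for each candidate; keep those above the threshold
        level = {}
        for cand in current:
            s = sum(1 for t in transactions if cand <= t) / n
            if s >= min_support:
                level[cand] = s
        freq.update(level)
        # Level k+1 candidates: unions of frequent k-itemsets
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return freq

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Eggs"},
]
result = apriori(transactions, min_support=0.5)
print(result[frozenset({"Milk", "Bread", "Butter"})])  # 0.5
```

With min_support = 0.5, {Eggs} (support 25%) is dropped at level 1, so no candidate containing Eggs is ever counted; that is the Apriori pruning idea.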
Real-Life Applications of Apriori
Domain | Application
🛍️ Retail | Market Basket Analysis (find products bought together)
🏦 Banking | Detecting services used together
🌐 Web Usage Mining | Finding patterns in website navigation
🧬 Healthcare | Identifying symptom-disease relationships
📧 Marketing | Cross-selling and recommendation systems

Advantages and Limitations
Advantages:
Simple and easy to implement
Works well for small to medium data
Provides clear rules

Limitations:
High computation cost for large datasets
Generates too many candidate sets
Doesn't handle numeric or continuous data easily
Transaction ID | Items Bought
T1 | Milk, Bread, Butter
T2 | Bread, Butter
T3 | Milk, Bread
T4 | Milk, Bread, Butter
What is the FP-Growth Algorithm?
FP-Growth (Frequent Pattern Growth) is a data mining algorithm used to find frequent itemsets in large datasets, just like the Apriori algorithm, but faster and more efficient.
Instead of generating candidate itemsets one by one (like Apriori), FP-Growth uses a compact data structure called the FP-Tree (Frequent Pattern Tree).
Step 1: Build the FP-Tree
Scan the transaction database once to find frequent items.
Sort items in each transaction by their frequency.
Build a tree structure that stores items and their occurrence counts.
Step 2: Extract Frequent Itemsets
Starting from the bottom of the tree, recursively find prefix paths (patterns).
Generate conditional FP-Trees for each item.
Combine them to form frequent itemsets.
Limitation | Description
🧩 Complex Tree Structure | Can be hard to understand and implement
🧮 Memory Usage | May grow large for sparse data (many distinct items)
🧠 Not Easy for Dynamic Data | Tree must be rebuilt if the data changes
Root
└── Bread (4)
    ├── Milk (3)
    │   └── Butter (2)
    └── Butter (1)
Then, frequent patterns are extracted such as:
{Bread}
{Bread, Milk}
{Bread, Butter}
{Bread, Milk, Butter}
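Step 1 (tree building) can be sketched in plain Python. Note that this sketch breaks frequency ties alphabetically, so equally frequent items (Milk and Butter both appear 3 times) may be arranged differently than in the tree drawn above:

```python
from collections import Counter

class Node:
    """One FP-Tree node: an item, its count, and its children."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count=1):
    """Step 1 of FP-Growth: order items by global frequency, then insert
    each transaction as a path, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in t)
    root = Node(None)
    for t in transactions:
        # Keep frequent items, sorted by frequency (desc); ties alphabetical
        path = sorted((i for i in t if counts[i] >= min_count),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
]
tree = build_fp_tree(transactions)
print(tree.children["Bread"].count)  # Bread heads every path: count 4
```

Because every transaction contains Bread, all four paths share the Bread prefix; that shared-prefix compression is why the FP-Tree is compact.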
Step 2: KDD Process (Knowledge Discovery in Databases)
Data Cleaning – Remove noise and missing values
Data Integration – Combine data from multiple sources
Data Selection – Choose relevant data
Data Transformation – Convert into suitable format
Data Mining – Apply algorithms and find patterns
Pattern Evaluation – Identify meaningful results
Knowledge Presentation – Visualize results for users
Before mining, raw data must be cleaned, organized, and transformed. This process is called Data Preprocessing.
Step 3: Preparing the Data (Data Preprocessing)
Steps in Data Preparation:
1. Data Cleaning
Remove duplicate, missing, or inconsistent data.
Example: Replacing blank values with the average of the column.
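That cleaning step can be sketched in plain Python (a library such as pandas would do the same with `fillna`); the column values here are illustrative:

```python
from statistics import mean

# Illustrative price column with missing entries (None)
prices = [10.0, None, 12.0, None, 14.0]

# Data cleaning: replace blanks with the column average
col_mean = mean(v for v in prices if v is not None)
cleaned = [v if v is not None else col_mean for v in prices]
print(cleaned)  # [10.0, 12.0, 12.0, 12.0, 14.0]
```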
2. Data Integration
Combine data from multiple sources (e.g., sales + customer + location data).
Helps create a unified dataset.
3. Data Selection
Choose only relevant attributes for mining.
Example: For predicting sales, choose "price," "discount," "region" but not "employee age."
4. Data Transformation
Convert data into a suitable format for analysis.
Includes:
Normalization: Scale values into a fixed range (e.g., 0–1)
Aggregation: Summarize data (e.g., weekly → monthly sales)
Encoding: Convert text data into numbers for algorithms.
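As a small illustration, min-max normalization scales each value into the 0–1 range (this sketch assumes the values are not all equal, and the sales figures are made up):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into a fixed range (default 0–1).
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

sales = [200, 400, 600, 800]
norm = min_max_normalize(sales)
print(norm)  # smallest value maps to 0.0, largest to 1.0
```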
The goal is to make the dataset accurate, consistent, and ready for mining algorithms.
Poor data quality = poor results, no matter how good the algorithm is.
