Duplicate Removal in Web Databases

This document proposes a method for removing duplicate records from multiple web databases using unlabeled learning and two classifiers that collaborate iteratively. The first classifier, called WCSS, calculates the similarity between records by comparing the weighted components and sums the similarities. The second classifier, Naive Bayes, is used to set a threshold for classification by comparing the ratio of probabilities for a record belonging to each class. The method was able to remove 75% of duplicate records from multiple web databases without requiring predefined rules or representative training data. Future work could aim to remove more than 75% of duplicates using other classifiers.


DUPLICATE REMOVAL IN MULTIPLE WEB DATABASES

By R. Pravin Kumar and R. Aravindan, B.E. CSE, Oxford College of Engineering, Trichy-9.

Existing Problem
Duplicate removal from multiple web databases.
[Diagram: a query is sent through a query interface on a web server, which forwards it to Database 1, Database 2, and Database 3.]

Example
[Diagram: a query through the query interface and web server returns the same Samsung Galaxy phone from three stores: Universal at Rs.30,000, Poorviga at Rs.25,000, and CellMall at Rs.20,000. The combined results contain duplicate records.]

Existing Method
Hand-coding or offline-learning approaches have drawbacks:
The full data set is not available beforehand, so good representative data for training are hard to obtain.
Even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set.
E.g., price weights differ for each new query.

Proposed Method
Unlabeled learning: employing two classifiers that collaborate in an iterative manner to eliminate the duplicates.

Two Classifiers
Weighted Component Similarity Summing (WCSS) classifier
Naive Bayes classifier for setting the threshold
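The iterative collaboration between the two classifiers can be sketched as follows. This is a minimal illustration, not the authors' implementation: `wcss_score` and `nb_threshold` are hypothetical stand-ins for the two classifiers, and the starting threshold of 0.9 is an assumed value.

```python
def iterative_dedup(pairs, wcss_score, nb_threshold, max_iters=10):
    """Iteratively grow a set of duplicate pairs from unlabeled data.

    pairs: list of (record_a, record_b) tuples, all initially unlabeled.
    wcss_score: function mapping a pair to a similarity in (0, 1).
    nb_threshold: function that, given the duplicates and non-duplicates
        found so far, returns the score threshold for the next round.
    """
    duplicates, remaining = [], list(pairs)
    threshold = 0.9  # conservative starting threshold (assumed)
    for _ in range(max_iters):
        # First classifier: pick out pairs the similarity score trusts.
        newly_found = [p for p in remaining if wcss_score(p) >= threshold]
        if not newly_found:
            break  # no new duplicates; the iteration has converged
        duplicates.extend(newly_found)
        remaining = [p for p in remaining if wcss_score(p) < threshold]
        # Second classifier: re-estimate the threshold from the
        # examples identified so far.
        threshold = nb_threshold(duplicates, remaining)
    return duplicates, remaining
```

The key property this sketch captures is that neither classifier needs labeled training data: each round's confident decisions become the next round's training examples.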

WCSS Classifier
Weights the record fields when calculating the similarity between two records.
The similarity between two records lies in (0, 1): it should be close to 1 for two duplicate records and close to 0 for two non-duplicate records. The sum of all component weights is equal to 1.
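A minimal sketch of such a weighted score, assuming each record is a dict of field name to string value and that a simple token-overlap (Jaccard) similarity is used per field. The field names and weights in the example are illustrative only; the method adjusts the weights per query.

```python
def field_similarity(a, b):
    """Jaccard similarity between the token sets of two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty fields count as identical
    return len(ta & tb) / len(ta | tb)

def wcss(record_a, record_b, weights):
    """Weighted sum of per-field similarities.

    Because each field similarity is in [0, 1] and the weights sum to 1,
    the overall score also stays in [0, 1], matching the properties above.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * field_similarity(record_a[f], record_b[f])
               for f, w in weights.items())
```

For example, two listings with the same title but different prices, under weights {"title": 0.7, "price": 0.3}, score 0.7: high, but short of the 1.0 an exact duplicate would get.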

Naive Bayes for Threshold
P(i|x) is defined as the probability that an object with measurement vector x = (x1, ..., xp) belongs to class i; any monotonic function of P(i|x) makes a suitable score. Classification is then achieved by comparing the ratio

P(1|x) / P(0|x) = f(x|1) P(1) / (f(x|0) P(0))

with a threshold t.
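The ratio test can be sketched for a single similarity score x, assuming Gaussian class-conditional densities f(x|i) with illustrative parameters (means of 0.9 for duplicates and 0.2 for non-duplicates are assumptions, not values from the method):

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as f(x|i)."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))

def classify(x, prior1=0.5, prior0=0.5, t=1.0):
    """Return 1 (duplicate) if P(1|x)/P(0|x) exceeds the threshold t.

    By Bayes' rule the posterior ratio equals
    f(x|1)P(1) / (f(x|0)P(0)), so the posteriors themselves
    never need to be computed.
    """
    ratio = ((gaussian_pdf(x, 0.9, 0.1) * prior1)
             / (gaussian_pdf(x, 0.2, 0.1) * prior0))
    return 1 if ratio > t else 0
```

Raising t trades recall for precision: a stricter threshold accepts fewer pairs as duplicates.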

Advantages
No predefined rules; focuses on techniques for adjusting the weights of the record fields.
No partial learning.
Employs two classifiers.
Works on web databases from various domains.

Conclusion
Two classifiers are used to iteratively identify the duplicate pairs from all potential duplicate pairs. This provides a way to detect duplicates over the query results of multiple web databases. By implementing this method, 75% of duplicate records were removed from multiple web databases.

Future Enhancement
It Is possible to implementation of morethan 75% of unwanted records removed from multiple web database by using some other classifiers. Easy to handle the non-duplicate records from multiple web database

References
1. S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. ACM SIGKDD, pp. 269-278, 2002.
2. S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st IEEE Int'l Conf. Data Eng., pp. 865-876, 2005.
3. P. Christen and K. Goiser, "Quality and Complexity Measures for Data Linkage and Deduplication," Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp. 127-151, Springer, 2007.
4. W. Su, J. Wang, and F.H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 578-589, April 2010.

THANK YOU
