DUPLICATE REMOVAL FROM MULTIPLE WEB DATABASES
By R. Pravin Kumar and R. Aravindan, B.E-CSE, Oxford College of Engineering, Trichy-9.
Existing Problem
Duplicate removal from multiple web databases
[Figure: Query, Query Interface, Web Server, connected to Database 1, Database 2, and Database 3]
Example
[Figure: the query "Samsung Galaxy" passes through the Query Interface and Web Server to three stores (Universal, Poorviga, CellMall); the results show prices of Rs.30000, Rs.25000, and Rs.20000]
Existing Method
Hand-coding or offline-learning approaches
The full data set is not available beforehand.
Good representative data for training are hard to obtain.
Even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set.
E.g., price weights differ for each new query.
Proposed Method
Unlabeled Learning
Employing two classifiers that collaborate in an iterative manner to eliminate the duplicates.
Two Classifiers
Weighted component similarity summing (WCSS) classifier
Naive Bayes for threshold
WCSS Classifier
Uses the weights of the record fields when calculating the similarity between two records.
The similarity between two records lies between 0 and 1.
The similarity between two duplicate records should be close to 1.
The similarity between two non-duplicate records should be close to 0.
The sum of all component weights is equal to 1.
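The properties above can be illustrated with a minimal Python sketch of a weighted sum of per-field similarities. This is only an illustration, not the authors' implementation: the field names, the weight values, the second phone record, and the use of `difflib.SequenceMatcher` as the per-field similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1].

    SequenceMatcher is a stand-in measure chosen for this sketch (an assumption).
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def wcss_similarity(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities; the weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in weights.items())


# Hypothetical weights and records, for illustration only.
weights = {"title": 0.7, "price": 0.3}
a = {"title": "Samsung Galaxy", "price": "Rs.30000"}
b = {"title": "Samsung Galaxy", "price": "Rs.30000"}
c = {"title": "Nokia Lumia", "price": "Rs.12000"}

sim_duplicate = wcss_similarity(a, b, weights)      # identical records
sim_non_duplicate = wcss_similarity(a, c, weights)  # different records
```

Because every per-field similarity lies in [0, 1] and the weights sum to 1, the result also stays in [0, 1]; identical records score 1, and dissimilar records score lower.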
Naive Bayes for threshold
If P(i|x) is defined as the probability that an object with measurement vector x = (x1, ..., xp) belongs to class i, then any monotonic function of P(i|x) would make a suitable score.
One such score is the ratio P(1|x) / P(0|x), which can be calculated as P(1|x) / P(0|x) = [f(x|1) P(1)] / [f(x|0) P(0)], where f(x|i) is the distribution of x within class i and P(i) is the prior probability of class i.
Classification is then achieved by comparing this ratio with a threshold, t.
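The ratio test can be sketched in Python as follows. Several things here are assumptions rather than part of the slides: class 1 is taken to mean "duplicate" and class 0 "non-duplicate", x is taken to be a vector of field similarities for a record pair, each class-conditional density is modeled as a product of independent Gaussians (the naive Bayes independence assumption), and all parameter values are hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Normal density, used here as the per-component f(x_j | i) (an assumption)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_ratio(x, params1, params0, prior1):
    """Return P(1|x) / P(0|x) = f(x|1) P(1) / (f(x|0) P(0)).

    params1 and params0 hold one (mean, std) pair per component of x.
    Naive Bayes: f(x|i) is the product of the per-component densities.
    """
    f1 = math.prod(gaussian_pdf(v, m, s) for v, (m, s) in zip(x, params1))
    f0 = math.prod(gaussian_pdf(v, m, s) for v, (m, s) in zip(x, params0))
    return (f1 * prior1) / (f0 * (1.0 - prior1))


def classify(x, params1, params0, prior1, t=1.0):
    """Assign class 1 when the ratio exceeds the threshold t, otherwise class 0."""
    return 1 if posterior_ratio(x, params1, params0, prior1) > t else 0


# Hypothetical parameters: duplicates have high field similarities, non-duplicates low.
dup_params = [(0.9, 0.1), (0.9, 0.1)]
non_params = [(0.2, 0.2), (0.2, 0.2)]

label_high = classify((0.95, 0.90), dup_params, non_params, prior1=0.5)
label_low = classify((0.10, 0.20), dup_params, non_params, prior1=0.5)
```

Raising t makes the classifier stricter about declaring class 1; with t = 1 and equal priors, the pair is assigned to whichever class gives x the higher density.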
Advantages
No predefined rules
Focuses on techniques for adjusting the weights of the record fields
No partial learning
Employs two classifiers
Focuses on Web databases from various domains
Conclusion
Two classifiers are used iteratively to identify the duplicate pairs among all potential duplicate pairs. The method supports detecting duplicates over the query results of multiple Web databases. Moreover, 75% of duplicate records were removed from multiple Web databases by implementing this method.
Future Enhancement
It may be possible to remove more than 75% of unwanted records from multiple Web databases by using other classifiers. Non-duplicate records from multiple Web databases could also be handled more easily.
References
1. S. Sarawagi and A. Bhamidipaty (2002), Interactive Deduplication Using Active Learning, Proc. ACM SIGKDD, pp. 269-278.
2. S. Chaudhuri, V. Ganti, and R. Motwani (2005), Robust Identification of Fuzzy Duplicates, Proc. 21st IEEE Int'l Conf. Data Eng., pp. 865-876.
3. P. Christen and K. Goiser (2007), Quality and Complexity Measures for Data Linkage and Deduplication, Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp. 127-151, Springer.
4. Weifeng Su, Jiying Wang, and Frederick H. Lochovsky (April 2010), IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 578-589.
THANK YOU