Crowd-based deduplication: An adaptive approach

Sibo Wang, Xiaokui Xiao, Chun Hee Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

59 Scopus citations

Abstract

Data deduplication stands as a building block for data integration and data cleaning. The state-of-the-art techniques focus on how to exploit crowdsourcing to improve the accuracy of deduplication. However, they either incur significant overheads on the crowd or offer inferior accuracy. This paper presents ACD, a new crowd-based algorithm for data deduplication. The basic idea of ACD is to adopt correlation clustering (which is a classic machine-based algorithm for data deduplication) under a crowd-based setting. We propose non-trivial techniques to reduce the time required in performing correlation clustering with the crowd, and devise methods to postprocess the results of correlation clustering for better accuracy of deduplication. With extensive experiments on Amazon Mechanical Turk, we demonstrate that ACD outperforms the state of the art by offering a high precision of deduplication while incurring moderate crowdsourcing overheads.
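
The abstract does not describe ACD's internals, so the following is not the paper's algorithm. As a rough illustration of the idea it builds on, here is a minimal sketch of the randomized pivot heuristic for correlation clustering, with each pairwise "same entity?" decision delegated to an oracle. The names `pivot_correlation_clustering`, `is_duplicate`, and `toy_oracle` are hypothetical, and the oracle is a stand-in for a crowd answer; a real crowd would be noisy and slow, which is the cost the paper sets out to reduce.

```python
import random
from typing import Callable, Hashable, List, Sequence


def pivot_correlation_clustering(
    records: Sequence[Hashable],
    is_duplicate: Callable[[Hashable, Hashable], bool],
    seed: int = 0,
) -> List[List[Hashable]]:
    """Cluster records with the randomized pivot heuristic.

    `is_duplicate` is a hypothetical stand-in for one crowd answer on a pair.
    """
    rng = random.Random(seed)
    remaining = list(records)
    clusters: List[List[Hashable]] = []
    while remaining:
        # Pick a random pivot among the records not yet clustered.
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster = [pivot]
        rest = []
        for record in remaining:
            # One pairwise question per (pivot, record) pair.
            if is_duplicate(pivot, record):
                cluster.append(record)
            else:
                rest.append(record)
        remaining = rest
        clusters.append(cluster)
    return clusters


def toy_oracle(a: str, b: str) -> bool:
    """Assumed oracle: strings match when equal after lowercasing and removing spaces."""
    def normalize(s: str) -> str:
        return s.lower().replace(" ", "")
    return normalize(a) == normalize(b)


records = ["iPad 2", "ipad2", "Kindle", "kindle ", "Nexus 7"]
clusters = pivot_correlation_clustering(records, toy_oracle)
```

Because the toy oracle is a consistent equivalence relation, the grouping does not depend on which pivots are drawn. With inconsistent answers the result would vary with pivot order, which is one reason a postprocessing step such as the one the abstract mentions can help.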

Original language: English
Title of host publication: SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1263-1277
Number of pages: 15
ISBN (Electronic): 9781450327589
State: Published - 27 May 2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31 May 2015 to 4 Jun 2015

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume: 2015-May
ISSN (Print): 0730-8078

Conference

Conference: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country/Territory: Australia
City: Melbourne
Period: 31/05/15 to 4/06/15

Keywords

  • Correlation clustering
  • Crowdsourcing
  • Data deduplication
