Dedupe me

Project Info

Project Description

Machine learning based entity resolution to the rescue

Data Story

Databases somehow always end up with duplicate entries, but we can solve that using machine learning based entity resolution (a.k.a. record linkage, fuzzy matching, etc.).

Entity resolution typically requires:
1) Deduplication (removal of exact copies of records)
2) Record Linkage (identifying records that may refer to the same business)
3) Canonicalization (ensuring data with more than one representation are in a standardised form)
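
The three steps above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how csvdedupe works internally (it learns its matching rules from training data). The sample records, the "name" field, the helper names and the 0.85 similarity threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical sample records; the "name" field is an assumption.
records = [
    {"name": "Acme Pty Ltd"},
    {"name": "Acme Pty Ltd"},     # exact copy of the first record
    {"name": "ACME Pty. Ltd."},   # same business, different spelling
    {"name": "Bobs Bakery"},
]


def normalise(name):
    """Lowercase, drop punctuation and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def dedupe_exact(rows):
    """Step 1: drop exact copies, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def link_records(rows, threshold=0.85):
    """Step 2: group records whose normalised names are similar enough."""
    clusters = []
    for row in rows:
        name = normalise(row["name"])
        for cluster in clusters:
            representative = normalise(cluster[0]["name"])
            if SequenceMatcher(None, name, representative).ratio() >= threshold:
                cluster.append(row)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([row])
    return clusters


def canonicalise(cluster):
    """Step 3: use the first record's name as the standard form."""
    return cluster[0]["name"]


unique_rows = dedupe_exact(records)
clusters = link_records(unique_rows)
canonical_names = [canonicalise(cluster) for cluster in clusters]
```

Comparing every record against every cluster is fine for a toy example but would be slow on tens of thousands of records, which is one reason to reach for a dedicated tool.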

Only steps 1 and 2 were addressed during this challenge: out of 47404 records, 1920 unique businesses were identified using csvdedupe (https://github.com/dedupeio/csvdedupe).

Perhaps the same approach could even be used during form filling and validation to stop further duplicates from being entered.
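
As a rough idea of what such a validation check might look like, the sketch below compares a newly typed business name against names already on file and returns likely duplicates. It uses Python's standard difflib rather than csvdedupe; the existing_names list, the possible_duplicates helper and the 0.85 cutoff are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical names already stored, lowercased and whitespace-normalised.
existing_names = ["acme pty ltd", "bobs bakery"]


def possible_duplicates(new_name, known, cutoff=0.85):
    """Return up to three stored names that closely match the new entry."""
    cleaned = " ".join(new_name.lower().split())
    return get_close_matches(cleaned, known, n=3, cutoff=cutoff)


# A form could warn the user when this list is not empty.
matches = possible_duplicates("Acme  Pty Ltd", existing_names)
```

A form handler could show these matches to the user and ask for confirmation before saving a new record.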

NB. Step 1 was done in Excel and step 2 with csvdedupe, which is simply a CLI program, so the only evidence of work is the training data generated by the program.

Evidence of Work

Video

Homepage