Dedupe me

Project Info

TinyIdeas 💡

Project Description

Machine-learning-based entity resolution to the rescue

Data Story

Databases somehow always end up with duplicate entries, but we can solve that using machine-learning-based entity resolution (a.k.a. record linkage, fuzzy matching, etc.).

Entity resolution typically requires:
1) Deduplication (removal of exact copies of records)
2) Record linkage (identifying records that may refer to the same business)
3) Canonicalization (ensuring data with more than one representation is in a standardised form)
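
The three steps above can be sketched roughly as follows. This is only a minimal illustration using plain string similarity from Python's standard-library difflib, not the trained model that csvdedupe builds. The business names, the suffix map, and the 0.85 threshold are all made up for the example. Names are canonicalized before they are compared, so that differences in case and punctuation do not hide a match.

```python
import re
from difflib import SequenceMatcher


def canonicalize(name):
    """Step 3: lower-case, strip punctuation, standardise common suffixes."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    # Hypothetical suffix map, for illustration only.
    suffixes = {"limited": "ltd", "proprietary": "pty"}
    return " ".join(suffixes.get(tok, tok) for tok in name.split())


def drop_exact_duplicates(records):
    """Step 1: remove exact copies, keeping first-seen order."""
    return list(dict.fromkeys(records))


def link_records(records, threshold=0.85):
    """Step 2: group records whose canonical names are similar enough.

    Each record is compared with the first member of every existing
    cluster; it joins the first cluster that meets the threshold,
    otherwise it starts a new cluster.
    """
    clusters = []
    for rec in records:
        key = canonicalize(rec)
        for cluster in clusters:
            rep = canonicalize(cluster[0])
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Invented sample records, not taken from the challenge data.
records = [
    "Acme Pty Ltd",
    "Acme Pty Ltd",
    "ACME PTY. LIMITED",
    "Bright Sparks Pty Ltd",
]
unique = drop_exact_duplicates(records)
clusters = link_records(unique)
for cluster in clusters:
    print(cluster)
```

A fixed threshold like this is the simplest possible choice. A tool such as csvdedupe instead learns what counts as a match from labelled training pairs.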

Only steps 1 and 2 were addressed during this challenge: out of 47404 records, 1920 unique businesses were identified using csvdedupe.

Perhaps the same matching could even be used during form filling and validation to prevent further duplicates.
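
As a rough idea of what such a form-time check might look like, the sketch below suggests existing entries that closely resemble a newly typed business name. It relies on `difflib.get_close_matches` from the Python standard library rather than on csvdedupe. The function name `possible_duplicates`, the sample names, and the 0.8 cutoff are assumptions made for illustration.

```python
from difflib import get_close_matches


def possible_duplicates(new_name, existing_names, cutoff=0.8):
    """Return existing names that look like the newly entered one."""
    # Compare on a lightly normalised form, but return the stored spelling.
    lookup = {name.lower().strip(): name for name in existing_names}
    hits = get_close_matches(
        new_name.lower().strip(), list(lookup), n=5, cutoff=cutoff
    )
    return [lookup[h] for h in hits]


# Invented sample data, not taken from the challenge datasets.
existing = ["Acme Pty Ltd", "Bright Sparks Pty Ltd"]
suggestions = possible_duplicates("Acme Pty Ltd.", existing)
no_suggestions = possible_duplicates("Zebra Cafe", existing)
print(suggestions)
print(no_suggestions)
```

A form could show the suggestions to the applicant ("did you mean ...?") before a new record is created, instead of rejecting the input outright.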

NB: Step 1 was done in Excel and step 2 with csvdedupe, which is simply a CLI program, so the only evidence of work is the training data generated by the program.

Evidence of Work



Team DataSets

ABR lookup

Data Set

IP Australia Govhack 2018 sample data

Data Set

Challenge Entries

Matching Applicants

How could the same business applicant be identified across multiple datasets, and over time? How could we do this in new, or interesting ways?

8 teams have entered this challenge.

Bounty: Finding all the like needles in the haystack

We are looking for your best and brightest ideas to help us identify the same business applicant across datasets and over time

5 teams have entered this challenge.