2017 © Pedro Peláez
 

library deduper-laravel

Deduplicate your tables using Artisan

image

anthonyvipond/deduper-laravel

Deduplicate your tables using Artisan

  • Monday, December 1, 2014
  • by AnthonyVipond
  • Repository
  • 0 Watchers
  • 0 Stars
  • 3 Installations
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 1 Versions
  • 0 % Grown

The README.md

MySQL DB Dedupe and Remap Tool

Purpose

  • Removes duplicate records from your database and remaps foreign keys in other tables, (*1)

  • You can easily define the uniqueness of a row using one or more columns, (*2)

  • It works well on large tables (10M+ rows) as well., (*3)

  • Designed to run directly on production tables, (*4)

Installation

The recommended way is through composer., (*5)

Edit composer.json and add:, (*6)

{
    "require": {
        "anthonyvipond/deduper": "dev-master"
    }
}

And install dependencies:, (*7)

    composer install

Copy config/database.php.sample to config/database.php and fill it out, (*8)

You should now be able to use the program from the command line (where dlr file is stored), (*9)

php dlr

Purpose

Deduplicating Tables

Suppose you have the following people table:, (*10)

id name
2 Mary
3 Joseph
5 Mary
6 mary
7 Joseph
php dlr dedupe tableName columnName

i.e., (*11)

php dlr dedupe people name

Your original table will not be touched, and you will get this table people_uniques, (*12)

id name
2 Mary
3 Joseph

You will also get this table people_removes, (*13)

id name new_id
5 Mary null
6 Joseph null
7 Joseph null

But what if you have a table where the uniqueness of defined over three columns? No problem., (*14)

id firstname lastname birthday
2 Mary Smith 1991-01-01
3 Joseph Parker 1984-02-02
5 Mary Kate 1981-08-08
6 mary kate 2001-03-03
7 Joseph Parker 1984-02-02

Seperate the columns with a : in the second argument:, (*15)

php dlr dedupe people firstname:lastname:birthday

You will get a new table people_uniques, (*16)

id firstname lastname birthday
2 Mary Smith 1991-01-01
3 Joseph Parker 1984-02-02
5 Mary Kate 1981-08-08
6 mary kate 2001-03-03

And another table people_removes, (*17)

id firstname lastname birthday new_id
7 Joseph Parker 1984-02-02 null

You can continue to deduplicate on different columns., (*18)

Your uniques table will get smaller, and your removes table will get bigger., (*19)

Take another look at the last stage our tables were in., (*20)

Let's keep deduplicating further on new rules..., (*21)

php dlr dedupe tableName firstname:lastname

Now people_uniques is like this:, (*22)

id firstname lastname birthday
2 Mary Smith 1991-01-01
3 Joseph Parker 1984-02-02
5 Mary Kate 1981-08-08

And people_removes is like this:, (*23)

id firstname lastname birthday new_id
7 Joseph Parker 1984-02-02 null
6 mary kate 2001-03-03 null

Linking

The next stage is add the new ids to the removes table, (*24)

php dlr link uniquesTable removesTable col1:col2:col3

ie, (*25)

php dlr link people_uniques people_removes firstname:lastname:birthday

When doing linking, you should pass in the same columns as you did when deduping, (*26)

If you ran the dedupe command several times with different combinations, you want to link from the least specific to the most., (*27)

ie, (*28)

php dlr link people_uniques people_removes lastname:placeOfBirth

php dlr link people_uniques people_removes firstname:lastname:birthday

This way the more specific and higher quality groupings overwrite the lower quality ones, (*29)

If you ran the dedupe command multiple times on different rules, you may end up with a small percentage of records that weren't linked after running the link command, (*30)

You can pass the --fillerMode option to fill the rest with of the new_id with ids, (*31)

Check how many new_id have not been remapped after each run:, (*32)

SELECT count(1) FROM table_removes WHERE new_id IS NULL;

ie, (*33)

php dlr link people_uniques people_removes firstname:lastname --fillerMode=true

Remapping

After you run dedupe you will have table_uniques and table_removes, as well as your original table., (*34)

The removes table needs to be present for remapping to work., (*35)

It won't be written to but needs to be read from., (*36)

Suppose you have this teams table:, (*37)

id team
2 Knicks
3 Knicks
4 Lakers
5 Knicks

And a champions table that needs remapping, (*38)

id team_id
2 3
3 2
4 5
5 2

And the teams_uniques table (remember, you deduped already), (*39)

id team
2 Knicks
4 Lakers

And you also have this teams_removes table which is used for remapping:, (*40)

id team new_id
3 Knicks 2
5 Knicks 2

You can now remap the foreign keys on other tables pointing to teams.id, (*41)

php dlr remap remapTable removesTable foreignKey

i.e., (*42)

php dlr remap champions teams_removes team_id

The champions table now looks like this:, (*43)

id team_id
2 2
3 2
4 2
5 2

You should backup your remap table prior to running the remap command., (*44)

If your remapping doesn't finish the first time, just run it again. It won't hurt anything., (*45)

Swapping in the new table

Going back to the people table example..., (*46)

Finish remapping foreign keys for all tables that point to people.id, (*47)

Now for the final coup-de-grace!, (*48)

RENAME TABLE table TO table_bak;
RENAME TABLE table_uniques TO table;
DROP TABLE table_bak -- optional

Congrats! You've deduped and remapped your table., (*49)

Contribution Guidelines

  • Post an issue!
  • Fork and pull.

Notes

  • For the time being, your original table with duplicates must have an id column
  • Only works on MySQL, but I'm open to adding more support

The Versions

01/12 2014

dev-master

9999999-dev

Deduplicate your tables using Artisan

  Sources   Download

MIT

The Requires

 

The Development Requires

by Anthony Vipond

laravel artisan deduplicate database remove duplicates from database