dev-master
9999999-devDeduplicate your tables using Artisan
MIT
The Requires
- php >=5.4.0
- illuminate/support 4.2.*
The Development Requires
by Anthony Vipond
laravel artisan deduplicate database remove duplicates from database
Deduplicate your tables using Artisan
Removes duplicate records from your database and remaps foreign keys in other tables, (*1)
You can easily define the uniqueness of a row using one or more columns, (*2)
It works well on large tables (10M+ rows) as well., (*3)
Designed to run directly on production tables, (*4)
The recommended way is through composer., (*5)
Edit composer.json
and add:, (*6)
{ "require": { "anthonyvipond/deduper": "dev-master" } }
And install dependencies:, (*7)
composer install
Copy config/database.php.sample
to config/database.php
and fill it out, (*8)
You should now be able to use the program from the command line (where dlr file is stored), (*9)
php dlr
Suppose you have the following people
table:, (*10)
id | name |
---|---|
2 | Mary |
3 | Joseph |
5 | Mary |
6 | mary |
7 | Joseph |
php dlr dedupe tableName columnName
i.e., (*11)
php dlr dedupe people name
Your original table will not be touched, and you will get this table people_uniques
, (*12)
id | name |
---|---|
2 | Mary |
3 | Joseph |
You will also get this table people_removes
, (*13)
id | name | new_id |
---|---|---|
5 | Mary | null |
6 | Joseph | null |
7 | Joseph | null |
But what if you have a table where the uniqueness of defined over three columns? No problem., (*14)
id | firstname | lastname | birthday |
---|---|---|---|
2 | Mary | Smith | 1991-01-01 |
3 | Joseph | Parker | 1984-02-02 |
5 | Mary | Kate | 1981-08-08 |
6 | mary | kate | 2001-03-03 |
7 | Joseph | Parker | 1984-02-02 |
Seperate the columns with a :
in the second argument:, (*15)
php dlr dedupe people firstname:lastname:birthday
You will get a new table people_uniques
, (*16)
id | firstname | lastname | birthday |
---|---|---|---|
2 | Mary | Smith | 1991-01-01 |
3 | Joseph | Parker | 1984-02-02 |
5 | Mary | Kate | 1981-08-08 |
6 | mary | kate | 2001-03-03 |
And another table people_removes
, (*17)
id | firstname | lastname | birthday | new_id |
---|---|---|---|---|
7 | Joseph | Parker | 1984-02-02 | null |
You can continue to deduplicate on different columns., (*18)
Your uniques
table will get smaller, and your removes
table will get bigger., (*19)
Take another look at the last stage our tables were in., (*20)
Let's keep deduplicating further on new rules..., (*21)
php dlr dedupe tableName firstname:lastname
Now people_uniques
is like this:, (*22)
id | firstname | lastname | birthday |
---|---|---|---|
2 | Mary | Smith | 1991-01-01 |
3 | Joseph | Parker | 1984-02-02 |
5 | Mary | Kate | 1981-08-08 |
And people_removes
is like this:, (*23)
id | firstname | lastname | birthday | new_id |
---|---|---|---|---|
7 | Joseph | Parker | 1984-02-02 | null |
6 | mary | kate | 2001-03-03 | null |
The next stage is add the new ids to the removes
table, (*24)
php dlr link uniquesTable removesTable col1:col2:col3
ie, (*25)
php dlr link people_uniques people_removes firstname:lastname:birthday
When doing linking, you should pass in the same columns as you did when deduping, (*26)
If you ran the dedupe command several times with different combinations, you want to link from the least specific to the most., (*27)
ie, (*28)
php dlr link people_uniques people_removes lastname:placeOfBirth php dlr link people_uniques people_removes firstname:lastname:birthday
This way the more specific and higher quality groupings overwrite the lower quality ones, (*29)
If you ran the dedupe
command multiple times on different rules, you may end up with a small percentage of records that weren't linked after running the link
command, (*30)
You can pass the --fillerMode
option to fill the rest with of the new_id
with ids, (*31)
Check how many new_id
have not been remapped after each run:, (*32)
SELECT count(1) FROM table_removes WHERE new_id IS NULL;
ie, (*33)
php dlr link people_uniques people_removes firstname:lastname --fillerMode=true
After you run dedupe
you will have table_uniques and table_removes, as well as your original table., (*34)
The removes table needs to be present for remapping to work., (*35)
It won't be written to but needs to be read from., (*36)
Suppose you have this teams
table:, (*37)
id | team |
---|---|
2 | Knicks |
3 | Knicks |
4 | Lakers |
5 | Knicks |
And a champions
table that needs remapping, (*38)
id | team_id |
---|---|
2 | 3 |
3 | 2 |
4 | 5 |
5 | 2 |
And the teams_uniques
table (remember, you deduped already), (*39)
id | team |
---|---|
2 | Knicks |
4 | Lakers |
And you also have this teams_removes
table which is used for remapping:, (*40)
id | team | new_id |
---|---|---|
3 | Knicks | 2 |
5 | Knicks | 2 |
You can now remap the foreign keys on other tables pointing to teams.id
, (*41)
php dlr remap remapTable removesTable foreignKey
i.e., (*42)
php dlr remap champions teams_removes team_id
The champions
table now looks like this:, (*43)
id | team_id |
---|---|
2 | 2 |
3 | 2 |
4 | 2 |
5 | 2 |
You should backup your remap table prior to running the remap
command., (*44)
If your remapping doesn't finish the first time, just run it again. It won't hurt anything., (*45)
Going back to the people
table example..., (*46)
Finish remapping foreign keys for all tables that point to people.id
, (*47)
Now for the final coup-de-grace!, (*48)
RENAME TABLE table TO table_bak; RENAME TABLE table_uniques TO table; DROP TABLE table_bak -- optional
Congrats! You've deduped and remapped your table., (*49)
id
columnDeduplicate your tables using Artisan
MIT
laravel artisan deduplicate database remove duplicates from database