dev-master
9999999-devDeduplicate your tables using Artisan
MIT
The Requires
- php >=5.4.0
- illuminate/support 4.2.*
The Development Requires
by Anthony Vipond
laravel artisan deduplicate database remove duplicates from database
Wallogit.com
2017 © Pedro Peláez
Deduplicate your tables using Artisan
Removes duplicate records from your database and remaps foreign keys in other tables, (*1)
You can easily define the uniqueness of a row using one or more columns, (*2)
It works well on large tables (10M+ rows) as well., (*3)
Designed to run directly on production tables, (*4)
The recommended way is through composer., (*5)
Edit composer.json and add:, (*6)
{
"require": {
"anthonyvipond/deduper": "dev-master"
}
}
And install dependencies:, (*7)
composer install
Copy config/database.php.sample to config/database.php and fill it out, (*8)
You should now be able to use the program from the command line (where dlr file is stored), (*9)
php dlr
Suppose you have the following people table:, (*10)
| id | name |
|---|---|
| 2 | Mary |
| 3 | Joseph |
| 5 | Mary |
| 6 | mary |
| 7 | Joseph |
php dlr dedupe tableName columnName
i.e., (*11)
php dlr dedupe people name
Your original table will not be touched, and you will get this table people_uniques, (*12)
| id | name |
|---|---|
| 2 | Mary |
| 3 | Joseph |
You will also get this table people_removes, (*13)
| id | name | new_id |
|---|---|---|
| 5 | Mary | null |
| 6 | Joseph | null |
| 7 | Joseph | null |
But what if you have a table where the uniqueness of defined over three columns? No problem., (*14)
| id | firstname | lastname | birthday |
|---|---|---|---|
| 2 | Mary | Smith | 1991-01-01 |
| 3 | Joseph | Parker | 1984-02-02 |
| 5 | Mary | Kate | 1981-08-08 |
| 6 | mary | kate | 2001-03-03 |
| 7 | Joseph | Parker | 1984-02-02 |
Seperate the columns with a : in the second argument:, (*15)
php dlr dedupe people firstname:lastname:birthday
You will get a new table people_uniques, (*16)
| id | firstname | lastname | birthday |
|---|---|---|---|
| 2 | Mary | Smith | 1991-01-01 |
| 3 | Joseph | Parker | 1984-02-02 |
| 5 | Mary | Kate | 1981-08-08 |
| 6 | mary | kate | 2001-03-03 |
And another table people_removes, (*17)
| id | firstname | lastname | birthday | new_id |
|---|---|---|---|---|
| 7 | Joseph | Parker | 1984-02-02 | null |
You can continue to deduplicate on different columns., (*18)
Your uniques table will get smaller, and your removes table will get bigger., (*19)
Take another look at the last stage our tables were in., (*20)
Let's keep deduplicating further on new rules..., (*21)
php dlr dedupe tableName firstname:lastname
Now people_uniques is like this:, (*22)
| id | firstname | lastname | birthday |
|---|---|---|---|
| 2 | Mary | Smith | 1991-01-01 |
| 3 | Joseph | Parker | 1984-02-02 |
| 5 | Mary | Kate | 1981-08-08 |
And people_removes is like this:, (*23)
| id | firstname | lastname | birthday | new_id |
|---|---|---|---|---|
| 7 | Joseph | Parker | 1984-02-02 | null |
| 6 | mary | kate | 2001-03-03 | null |
The next stage is add the new ids to the removes table, (*24)
php dlr link uniquesTable removesTable col1:col2:col3
ie, (*25)
php dlr link people_uniques people_removes firstname:lastname:birthday
When doing linking, you should pass in the same columns as you did when deduping, (*26)
If you ran the dedupe command several times with different combinations, you want to link from the least specific to the most., (*27)
ie, (*28)
php dlr link people_uniques people_removes lastname:placeOfBirth php dlr link people_uniques people_removes firstname:lastname:birthday
This way the more specific and higher quality groupings overwrite the lower quality ones, (*29)
If you ran the dedupe command multiple times on different rules, you may end up with a small percentage of records that weren't linked after running the link command, (*30)
You can pass the --fillerMode option to fill the rest with of the new_id with ids, (*31)
Check how many new_id have not been remapped after each run:, (*32)
SELECT count(1) FROM table_removes WHERE new_id IS NULL;
ie, (*33)
php dlr link people_uniques people_removes firstname:lastname --fillerMode=true
After you run dedupe you will have table_uniques and table_removes, as well as your original table., (*34)
The removes table needs to be present for remapping to work., (*35)
It won't be written to but needs to be read from., (*36)
Suppose you have this teams table:, (*37)
| id | team |
|---|---|
| 2 | Knicks |
| 3 | Knicks |
| 4 | Lakers |
| 5 | Knicks |
And a champions table that needs remapping, (*38)
| id | team_id |
|---|---|
| 2 | 3 |
| 3 | 2 |
| 4 | 5 |
| 5 | 2 |
And the teams_uniques table (remember, you deduped already), (*39)
| id | team |
|---|---|
| 2 | Knicks |
| 4 | Lakers |
And you also have this teams_removes table which is used for remapping:, (*40)
| id | team | new_id |
|---|---|---|
| 3 | Knicks | 2 |
| 5 | Knicks | 2 |
You can now remap the foreign keys on other tables pointing to teams.id, (*41)
php dlr remap remapTable removesTable foreignKey
i.e., (*42)
php dlr remap champions teams_removes team_id
The champions table now looks like this:, (*43)
| id | team_id |
|---|---|
| 2 | 2 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
You should backup your remap table prior to running the remap command., (*44)
If your remapping doesn't finish the first time, just run it again. It won't hurt anything., (*45)
Going back to the people table example..., (*46)
Finish remapping foreign keys for all tables that point to people.id, (*47)
Now for the final coup-de-grace!, (*48)
RENAME TABLE table TO table_bak; RENAME TABLE table_uniques TO table; DROP TABLE table_bak -- optional
Congrats! You've deduped and remapped your table., (*49)
id columnDeduplicate your tables using Artisan
MIT
laravel artisan deduplicate database remove duplicates from database