MySQL bulk insert, how to filter out duplicate data?
The reason for working overtime is to go online, to solve the problem of duplicate data in the online database, and to find a bug in the program, which is easy to solve. The problem is to correct the duplicate data online.
There are 6 tables in the online library with duplicate data, 2 of which are relatively large, one is 960,000+ and the other is 300,000+. Because the same problem has been dealt with before, I directly brought the last Python deduplication script. It is very simple, that is, connect to the database, find out the duplicate data, and delete it cyclically.
Emmmm, but this efficiency is really too low, one per second, about 20,000 duplicate data, and the estimated time is about 8 hours. . .
There is a problem with blindly relying on the things of the predecessors without thinking for yourself! Always think about how it was possible before and why it is not possible now, this is also a problem! I found that I was really in a bad state recently, and I lost the desire to explore and seek knowledge. Today is a wake-up call, and I feel quite lost.
Closer to home, the following describes the deduplication steps in detail.
SELECT name,count( 1 )
FROM
student
GROUP BY
NAME
HAVING
count( 1 ) > 1;
output:
name count(1) cat 2 dog 2
The data whose names are cat and dog are duplicated, and there are two duplicate data for each;
DELETE
FROM
student
WHERE
NAME IN (
SELECT NAME
FROM
student
GROUP BY
NAME
HAVING
count( 1 ) > 1)
Error:
1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s
The reason is: this table is queried while updating this table, and this table is updated while querying this table, which can be understood as a deadlock. MySQL does not support this kind of operation of updating and querying the same table
Solution: Query the columns of data to be updated as a third-party table, and then filter the updates.
SELECT
*
FROM
student
WHERE
id NOT IN (
SELECT
t.id
FROM
( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t
)
What does it mean? First, group by name to find out the data with the smallest id. These data are the tinder we want to leave, and then we can find out that the id is not in it, which is the duplicate data we want to delete.
It is recommended to do the actual combat project of Spring Cloud:
https://github.com/YunaiV/onemall
Start deduplication, leaving only one
Very simple, just replace the select with delete
DELETE
FROM
student
WHERE
id NOT IN (
SELECT
t.id
FROM
( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t
)
900,000+ tables execute super fast.
0 Comments