Remove duplicates using only a MySQL query?

S

8

18

I have a table with the following columns:

URL_ID    
URL_ADDR    
URL_Time

I want to remove duplicates on the URL_ADDR column using a MySQL query.

Is it possible to do such a thing without using any programming?

Shut answered 1/8, 2010 at 21:55 Comment(2)

SQL is programming... – Nurseryman 1/8, 2010 at 22:4

How does the URLs in URL_ADDR looks like? Do they all have the same format? i.e: with or without www. – Votyak 1/8, 2010 at 22:4

D

38

Consider the following test case:

CREATE TABLE mytb (url_id int, url_addr varchar(100));

INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');

Where our test table now contains:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
5 rows in set (0.00 sec)

Then we can use the multiple-table DELETE syntax as follows:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);

... which will delete duplicate entries, leaving only the first url based on url_id:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
3 rows in set (0.00 sec)

UPDATE - Further to new comments above:

If the duplicate URLs will not have the same format, you may want to apply the REPLACE() function to remove www. or http:// parts. For example:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') = 
                   REPLACE(t1.url_addr, 'www.', '') AND 
                   t2.url_id > t1.url_id);

Dever answered 1/8, 2010 at 22:4 Comment(3)

So, a JOIN won't trigger the 1093 error? Granted, a JOIN isn't a subquery but being self-referential is the real issue.... – Nurseryman 1/8, 2010 at 22:9

@OMG: It looks like it doesn't :) – Dever 1/8, 2010 at 22:13

@OMG Poines: RE the 1093 error, I also tried an UPDATE using a self-referential JOIN and another using a subquery that selects from the same table. The JOIN test succeeds while the subquery one fails with the 1093 error. The same seems to happen for DELETE. – Dever 1/8, 2010 at 22:29

N

9

You may want to try the method mentioned at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/.

ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);

Nehemiah answered 1/8, 2010 at 22:4 Comment(1)

ALTER IGNORE is no longer supported as of MySQL 5.7.4: https://mcmap.net/q/246918/-mysql-alter-ignore-table-add-unique-what-will-be-truncated – Canton 10/6, 2017 at 14:38

I

5

This will leave the ones with the highest URL_ID for a particular URL_ADDR

DELETE FROM table
WHERE URL_ID NOT IN 
    (SELECT ID FROM 
       (SELECT MAX(URL_ID) AS ID 
        FROM table 
        WHERE URL_ID IS NOT NULL
        GROUP BY URL_ADDR ) X)   /*Sounds like you would need to GROUP BY a 
                                   calculated form - e.g. using REPLACE to 
                                  strip out www see Daniel's answer*/

(The derived table 'X' is to avoid the error "You can't specify target table 'tablename' for update in FROM clause")

Ibby answered 1/8, 2010 at 22:2 Comment(1)

@Vilx: That's why you bury it in a another subquery. – Nurseryman 1/8, 2010 at 22:7

S

4

Well, you could always:

create a temporary table;
INSERT INTO ... SELECT DISTINCT into the temp table from original table;
clear original table
INSERT INTO ... SELECT into the original table from the temp table
drop temp table.

It's clumsy and awkward, and requires several queries (not to mention privileges), but it will do the trick if you don't find another solution.

Sustenance answered 1/8, 2010 at 22:3 Comment(0)

S

2

You can group by on the URL_ADDR which will effectively give you only distinct values in the URL_ADDR field.

select 
 URL_ID
 URL_ADDR
 URL_Time
from
 some_table
group by
 URL_ADDR

Enjoy!

Soughtafter answered 1/8, 2010 at 22:1 Comment(1)

This is exactly what I was looking for! – Crownwork 28/1, 2021 at 9:55

A

1

Daniel Vassallo How to for multiple column?

DELETE t2 FROM directory1 t1 JOIN directory1 t2 ON (t2.page = t1.page, t2.parentTopic = t1.parentTopic, t2.title = t1.title, t2.description = t1.description, t2.linktype = t1.linktype, t2.priority = t1.priority AND t2.linkID > t1.linkID);

maybe like this?

Amparoampelopsis answered 23/10, 2012 at 4:15 Comment(0)

S

1

 DELETE FROM `your_table`
 WHERE 
    `unique_id` IN (
    SELECT 
        `unique_id`
    FROM (
        SELECT 
            `unique_id_in_table`,
            ROW_NUMBER() OVER (
                PARTITION BY `group_by_column`
                ORDER BY `group_by_column`) AS row_num
        FROM 
            `your_table`
    ) t
     WHERE row_num > 1
 );

Where:

your_table is table name
group_by_column is column you want to be unique
unique_id_in_table is id which should be unique for each row

Sherlynsherm answered 16/9, 2022 at 12:1 Comment(0)

G

0

This will work provided that your URL_ID column is unique.

DELETE FROM url WHERE URL_ID IN (
SELECT URL_ID
FROM url a INNER JOIN (
    SELECT URL_ADDR, MAX(URL_ID) MaxURLId 
    FROM url
    GROUP BY URL_ADDR
    HAVING COUNT(*) > 1) b ON a.URL_ID <> b.MaxURLId AND a.URL_ADDR = b.URL_ADDR
)

Giulio answered 1/8, 2010 at 22:7 Comment(0)

Recommended topics

Hot tags