Delete Duplicate rows in Vertica database
Asked Answered
B

6

7

Vertica allows duplicates to be inserted into the tables. I can view those using the 'analyze_constraints' function. How to delete duplicate rows from Vertica tables?

Blade answered 19/6, 2013 at 20:0 Comment(0)
F
9

You should try to avoid/limit using DELETE with a large number of records. The following approach should be more effective:

Step 1 Create a new table with the same structure / projections as the one containing duplicates:

create table mytable_new like mytable including projections ;

Step 2 Insert into this new table de-duplicated rows:

insert /* +direct */ into mytable_new select <column list> from (
    select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;

Step 3 rename the original table (the one containing dups):

alter table mytable rename to mytable_orig ;

Step 4 rename the new table:

alter table mytable_new rename to mytable ;

That's all.

Furmenty answered 28/12, 2015 at 11:57 Comment(2)
This is very slow. I have 43 lakh rows in my table.The query keep on running. @FurmentyPorshaport
@kushagramittal In the Vertica blogs a quicker method is described using the DISTINCT function. Please find it at vertica.com/blog/…Bricklayer
G
4

The answer of Mauro is correct, but there is an error in the sql of step 2. So, the complete way of working by avoiding DELETE should then be as follows:

Step 1 Create a new table with the same structure / projections as the one containing duplicates:

create table mytable_new like mytable including projections ;

Step 2 Insert into this new table de-duplicated rows:

insert /* +direct */ into mytable_new select <column list> from (
            select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
    ) a where a.rownum = 1 ;

Step 3 rename the original table (the one containing dups):

alter table mytable rename to mytable_orig ;

Step 4 rename the new table:

alter table mytable_new rename to mytable ;
Gnomic answered 29/11, 2016 at 8:26 Comment(0)
C
3

Off the top of my head, and not a great answer so let's let this be the final word, you can delete both and insert one back in.

Coccidioidomycosis answered 20/6, 2013 at 18:11 Comment(0)
C
2

You can delete duplicates by Vertica tables by creating a temporary table and generating pseudo row_ids. Here are few steps, especially if you are removing duplicates from very large and wide tables. In the example below, i assume, k1 and k2 rows have more than 1 duplicates. For more info see here.

-- Find the duplicates
select keys, count(1) from large-table-1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc  ;

-- Step 2:  Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2');     -- where, say, k1 has n and k2 has m exact dups

-- Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.dim_line_items_dups;    
--  Sanity test.  Should have 1 row each of k1 & k2 rows above

-- Step 4: Delete all duplicates from main table...
delete from large-table-1
where key in ('k1', 'k2');

-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;
Cle answered 17/3, 2015 at 22:11 Comment(0)
C
1

Step1: Create a intermediate table to port/load the data from original table along with row number. Here in below sample, porting data from Table1 to Table2 along with row_num column

select * into Table2 from (select *, ROW_NUMBER() OVER(PARTITION BY A,B order by C)as row_num from Table1 ) A;

Step2: Delete data from Table1 using earlier created Table2 in above step

DELETE FROM Table1 WHERE  EXISTS (SELECT NULL FROM Table2  
                                  where Table2.A=Table1.A 
                                  and Table2.B=Table1.B 
                                  and row_num > 1);

Step3: Drop table create in first step1 i.e Table2

Drop Table Table2;  
Counselor answered 20/8, 2021 at 17:15 Comment(0)
N
-2

You should have a look at this answer from the PostgreSQL wiki which also works for Vertica:

DELETE
FROM
    tablename
WHERE
    id IN(
        SELECT
            id
        FROM
            (
                SELECT
                    id,
                    ROW_NUMBER() OVER(
                        partition BY column1,
                        column2,
                        column3
                    ORDER BY
                        id
                    ) AS rnum
                FROM
                    tablename
            ) t
        WHERE
            t.rnum > 1
    );

It deletes all duplicate entries but the one with the lowest id.

Neonate answered 26/11, 2015 at 10:4 Comment(1)
It will delete all recs if ids are sameArmidaarmiger

© 2022 - 2024 — McMap. All rights reserved.