Copy data from Amazon S3 to Redshift and avoid duplicate rows

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. I don't have any unique constraints on my Redshift table. Is there a way to implement this with the COPY command?

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

I tried adding a unique constraint and setting a column as the primary key, with no luck. Redshift does not seem to enforce unique/primary key constraints.

Polysemy asked 29/3, 2013 at 10:23 Comment(0)

My solution is to run a DELETE command on the table before running COPY. In my use case, each load copies the records of a daily snapshot into the Redshift table, so I can use the following DELETE to remove any duplicated records for that day, then run the COPY command.

DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
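
A minimal sketch of the full sequence (the date, S3 path, and IAM role here are hypothetical), wrapped in a transaction so readers never see a half-reloaded day:

begin;
-- remove the day's records if a previous load already inserted them
delete from t_data where snapshot_day = '2013-08-12';
-- re-load the day's snapshot from S3
copy t_data from 's3://mybucket/snapshots/2013-08-12/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';
end;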

Maceio answered 12/8, 2013 at 15:58 Comment(2)
I have used a similar solution.Polysemy
I don’t get it.Cene

As user1045047 mentioned, Amazon Redshift doesn't support unique constraints, so I had been looking for a way to delete duplicate records from a table with a DELETE statement. Finally, I found a reasonable way.

Amazon Redshift supports creating an IDENTITY column that stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The following SQL is for PostgreSQL: it deletes duplicated records using OID, which is a built-in unique column. You can use the same SQL on Redshift by replacing OID with the identity column.

DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.column1 = d2.column1
    AND duplicated_table.column2 = d2.column2
);

Here is an example that I tested on my Amazon Redshift cluster.

create table auto_id_table (auto_id int IDENTITY, name varchar, age int);

insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);  
insert into auto_id_table (name, age) values('Matt', 24); 

select * from auto_id_table order by auto_id; 
 auto_id | name | age 
---------+------+-----
       1 | John |  18
       2 | John |  18
       3 | John |  18
       4 | John |  18
       5 | John |  18
       6 | Bob  |  20
       7 | Bob  |  20
       8 | Matt |  24    
(8 rows) 

delete from auto_id_table where auto_id > (
  select min(auto_id) from auto_id_table d
    where auto_id_table.name = d.name
    and auto_id_table.age = d.age
);

select * from auto_id_table order by auto_id;
 auto_id | name | age 
---------+------+-----
       1 | John |  18
       6 | Bob  |  20
       8 | Matt |  24
(3 rows)

It also works with the COPY command, like this.

  • auto_id_table.csv

    John,18
    Bob,20
    Matt,24
    
  • copy sql

    copy auto_id_table (name, age) from '[s3-path]/auto_id_table.csv' CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]' delimiter ',';
    

The advantage of this approach is that you don't need to run DDL statements. However, it doesn't work with existing tables that have no identity column, because an identity column cannot be added to an existing table. For those tables, the only way to delete duplicated records is to migrate all the distinct records to a new table, like this (same as user1045047's answer):

insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;
Souvenir answered 11/7, 2013 at 7:38 Comment(4)
It's not clear to me how your copy sql indicates that it handles duplicates properly. For the data set that you're loading, auto_id_table.csv, it simply has 3 unique rows, no?Roti
The point is that you can delete duplicated records easily. Even if you run copy auto_id_table three times and get three duplicated records, those duplicated records will be deleted with the above delete from auto_id_table.... query.Souvenir
With this approach, do you need to run the VACUUM command? I read here: If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column definedRoti
Running VACUUM and ANALYZE after this is better, but not strictly necessary, because VACUUM has some cost. In my case, since the number of duplicate records is small, I schedule VACUUM once a day rather than running it with this query.Souvenir

Mmm..

What about just never loading data into your master table directly?

Steps to avoid duplication (sketched in SQL after the list):

  1. begin transaction
  2. bulk load into a temp staging table
  3. delete from master table where rows = staging table rows
  4. insert into master table from staging table (merge)
  5. drop staging table
  6. end transaction.
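
A minimal SQL sketch of those steps (the table name master_table, its key column id, the S3 path, and the IAM role are all hypothetical assumptions):

begin transaction;

-- 2. bulk load into a temp staging table shaped like the master table
create temp table staging (like master_table);
copy staging from 's3://mybucket/incoming/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';

-- 3. delete master rows that are about to be replaced
delete from master_table using staging
where master_table.id = staging.id;

-- 4. insert (merge) the staged rows into the master table
insert into master_table select * from staging;

-- 5. drop the staging table
drop table staging;

-- 6. end the transaction
end transaction;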

This is also super fast, and it's the approach recommended by the Redshift docs.

Clemmy answered 26/11, 2014 at 15:25 Comment(4)
Can you please elaborate on the "this is also super fast" part of this answer? We have a table that is many billions of rows, and we use this method as recommended by the Amazon docs, but unless we are doing something wrong, it is certainly not super fast. This is by far the bottleneck of our ETL process.Gallicanism
I guess super fast is quite subjective. I have only worked in the millions of rows, not many billions...but there isn't a faster process than this to ensure no duplicate records that I have heard of yet. (Interested to hear though if you have something)Clemmy
Problem with this is when staging table has duplicate records.Aryan
Yeah, if staging table gets duplicate rows you'll insert duplicates. What we do is write our merge sql with that in mind...more or less group by the composite key and do any necessary aggregates or first_value to avoid duplicates when inserting.Clemmy

Currently there is no way to remove duplicates in Redshift. Redshift doesn't support primary key/unique key constraints, and removing duplicates using a row number (deleting rows with a row number greater than 1) is not an option, as the DELETE operation on Redshift doesn't allow complex statements (and the concept of a row number is not present in Redshift anyway).

The best way to remove duplicates is to write a cron/Quartz job that selects all the distinct rows into a separate table and then renames that table to your original table:

insert into temp_originalTable (select distinct * from originalTable);

drop table originalTable;

alter table temp_originalTable rename to originalTable;

Biafra answered 12/6, 2013 at 18:21 Comment(2)
We cannot implement this while we are adding new records to the table. We check the table before inserting; if the record already exists, we delete the row prior to inserting.Polysemy
Have you considered using a transaction?Okwu

There's another solution that really avoids data duplication, although it's not as straightforward as removing duplicated data once inserted. The COPY command has a manifest option to specify which files you want to copy:

copy customer
from 's3://mybucket/cust.manifest' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest; 

You can build a Lambda that generates a new manifest file each time before you run the COPY command. That Lambda compares the files already copied with the newly arrived files and creates a manifest containing only the new files, so you will never ingest the same file twice.
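
For reference, a manifest is just a JSON file listing the S3 objects to load; a minimal example (the bucket and file names here are hypothetical) looks like this:

{
  "entries": [
    {"url": "s3://mybucket/2021-12-20/part-0000.csv", "mandatory": true},
    {"url": "s3://mybucket/2021-12-21/part-0000.csv", "mandatory": true}
  ]
}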

Daladier answered 21/12, 2021 at 11:58 Comment(0)

We remove duplicates weekly, but you could also do this during the load transaction, as mentioned by @Clemmy. Note that this does require the existence of an auto-generated ID column as the eventual target of the delete:

DELETE FROM <your table> WHERE ID NOT IN (
   SELECT ID FROM (
      SELECT *, ROW_NUMBER() OVER
         ( PARTITION BY <your constraint columns> ORDER BY ID ASC ) DUPLICATES
      FROM <your table>
   ) t WHERE DUPLICATES = 1
); COMMIT;

for example:

CREATE TABLE IF NOT EXISTS public.requests
(
    id BIGINT NOT NULL DEFAULT "identity"(1, 0, '1,1'::text) ENCODE delta
    ,kaid VARCHAR(50)   NOT NULL
    ,eid VARCHAR(50)   NOT NULL ENCODE text32k
    ,aid VARCHAR(100)  NOT NULL ENCODE text32k
    ,sid VARCHAR(100)  NOT NULL ENCODE zstd
    ,rid VARCHAR(100)  NOT NULL ENCODE zstd
    ,"ts" TIMESTAMP WITHOUT TIME ZONE NOT NULL  ENCODE delta32k
    ,rtype VARCHAR(50) NOT NULL ENCODE bytedict
    ,stype VARCHAR(25)          ENCODE bytedict
    ,sver VARCHAR(50)  NOT NULL ENCODE text255
    ,dmacd INTEGER              ENCODE delta32k
    ,reqnum INTEGER    NOT NULL ENCODE delta32k
    ,did VARCHAR(255)           ENCODE zstd
    ,"region" VARCHAR(10)       ENCODE lzo
)
DISTSTYLE EVEN
SORTKEY (kaid, eid, aid, "ts")
;

. . . 

DELETE FROM REQUESTS WHERE ID NOT IN (
   SELECT ID FROM (
      SELECT *, ROW_NUMBER() OVER
         ( PARTITION BY DID,RID,RTYPE,TS ORDER BY ID ASC ) DUPLICATES
      FROM REQUESTS
   ) t WHERE DUPLICATES = 1
); COMMIT;
Didier answered 7/3, 2021 at 1:26 Comment(0)
