Vertica: Data validation of duplicate/primary key
Asked Answered
H

3

7

I'm trying to create a validation procedure during a load that checks to make sure data isn't duplicated. Vertica doesn't support this natively:

Vertica checks for constraint violations when queries are run, not when data is loaded. To detect constraint violations as part of the load process, use a COPY (page 667) statement with the NO COMMIT option. By loading data without committing it, you can run a post-load check of your data using the ANALYZE_CONSTRAINTS function. If the function finds constraint violations, you can roll back the load because you have not committed it.

The problem is that I can't figure out how to do this programmatically. I suspect that I need a stored procedure, but I'm not familiar with the stored procedure syntax/limitations for vertica. Can you help? Here's what I have:

-- Create a new table. "id" is auto-incremented and "name" must be unique
CREATE TABLE IF NOT EXISTS my_table (
id IDENTITY
, name varchar(50) UNIQUE NOT NULL
, type varchar(20)
, description varchar(200)
);

--Insert a record
begin; 
copy my_table from stdin
abort on error
NO COMMIT; -- this begins the load
name1|type1|description1 --this is the load
\. -- this closes the load
commit;

-- insert the duplicate record
begin; 
copy my_table from stdin
abort on error
NO COMMIT; -- this begins the load
name1|type1|description1 --this is the load
\. -- this closes the load
commit; -- Surprisingly, the load executes successfully! What's going on?!?!

-- Check constraints. We see that there is a failed constraints:
select analyze_constraints('my_table');

My thinking is to do some conditional logic. Psudo-code is below. Can you help me prepare it for Vertica?

Begin
load data
if (select count(*) from (select analyze_constraints('my_table')) sub) == 0:
commit
else rollback
Housemaster answered 28/9, 2012 at 23:10 Comment(0)
C
4
-- Start by Setting Vertica up to rollback and return an error code 
-- if an error is encountered.

\set ON_ERROR_STOP on

-- Load Data here (code omitted since you already have this)


-- Raise an Error condition by selecting 1/0 if any rows were rejected
-- during the load
SELECT
         GET_NUM_REJECTED_ROWS() AS NumRejectedRows
        ,GET_NUM_ACCEPTED_ROWS() AS NumAcceptedRows
;

SELECT 1 / (1-SIGN(GET_NUM_REJECTED_ROWS()));


-- Raise an Error condition if there are duplicates in my_table
SELECT 1 / ( 1 - SIGN( COUNT(*) ) )
FROM ( SELECT  name1,type1,description1
         FROM MY_TABLE
       GROUP BY 1,2,3
       HAVING COUNT(*) > 1 ) AS T1 ;

-- Raise an Error if primary key constraint is violated.
SELECT 1 / ( 1 - SIGN( COUNT(*) ) )
FROM (SELECT  ANALYZE_CONSTRAINTS ('my_table')) AS T1;

COMMIT;    
Cyndy answered 29/9, 2012 at 12:45 Comment(2)
Doug, thanks for this -- I was totally unaware of the GET_NUM_REJECTED_ROWS function until now (or the load_streams table, for that matter). Thanks!Housemaster
\set ON_ERROR_STOP on doesn't work in vertica 5.1: Error: ERROR: syntax error at or near "\" and if I do following: SET ON_ERROR_STOP TO 'on' I get Error: ERROR: Unsupported SET option ON_ERROR_STOPLutestring
A
1

Vertica doesn't have stored procedures. You'll need to do it programmatically in some fashion outside of Vertica.

The psuedo logic you have is good; just implement it in something (JAVA, C++, ect). You won't need to execute a 'rollback'. When loading the data (with NO COMMIT), it's not going to be committed until you execute a COMMIT statement.

Amianthus answered 29/9, 2012 at 1:24 Comment(0)
K
0

What about using ansi merge?

-Upload the data using the fast bulk loader to a temp table -Merge from the temp table into the base table

Kriemhild answered 30/9, 2012 at 14:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.