Bulk insert from csv file - Ignore rows with errors - SQL Server
I am trying to import data from a CSV file into SQL Server. The file has thousands of entries, and many of the rows contain incorrect data.

Some of the rows in the CSV file are:

"ID"|"EmpID"|"FName"|"LName"|"Gender"|"DateOfBirth"
"1"|"90043041961"|"ABCD"|"TEST"|"F"|"1848-05-05 00:00:00.000"
"1"|"10010161961"|"XYZ"|"TEST"|"F"|"1888-12-12 00:00:00.000"
...
"4"|"75101141821PPKKLL"|"LLKK"|"F"|"1925-09-09 00:00:00.000"|""
"4"|"32041401961UUYYTT"|"PPLL"|"M"|"1920-01-01 00:00:00.000"|""
...
"25"|"00468132034"|"FGTT"|"OOOO"|"F"|"1922-11-11 00:00:00.000"
"25"|"00468132034"|"KKKK"|"PPPP"|"F"|"1922-11-11 00:00:00.000"

Creating the TestTable and trying to insert data (from csv file) into it:

create table TestTable
(
     ID varchar(5),
     EmpID varchar(25),
     FName varchar(25),
     LName varchar(25),
     Gender varchar(5),
     DateOfirthB varchar(30)
);

I am using the following script to import data from csv file to the TestTable in SQL Server:

BULK INSERT TestTable
FROM 'C:\TestData.csv'
WITH
   (FIRSTROW = 2,
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = '"|"',
    ROWTERMINATOR = '\n',
    ERRORFILE = 'C:\ImportErrors.csv',
    MAXERRORS = 0,
    TABLOCK
   );

Errors:

Msg 4863, Level 16, State 1, Line 1
Bulk load data conversion error (truncation) for row 32763, column 5 (Gender).

Msg 4863, Level 16, State 1, Line 1
Bulk load data conversion error (truncation) for row 32764, column 5 (Gender).

Is there any way to ignore the rows in the CSV file which cannot be added for some reason or other, and insert the ones which have the correct syntax?

Thanks

PS: I cannot use SSIS; I am only allowed to use SQL.

Biotite answered 15/11, 2016 at 22:55 Comment(7)
I know this doesn't answer your question, but I honestly think you'd be better off trying to use SSIS to do this rather than raw SQL – Slam
@Slam -- I wish I could use SSIS but cannot. I can only use SQL for this... Thanks for the advice though – Biotite
The usual process is to first import your data into a more forgiving staging table (i.e. all fields are NVARCHAR(4000)). Then use SQL to clean and move the data into the final table. Personally I prefer this method to SSIS myself – Revolutionist
@Nick.McDermaid -- Ok.. I use NVARCHAR(4000) and import the data to a staging table and then clean it... Sounds good.. However there are rows in the CSV file like "4"|"75101141821PPKKLL"|"LLKK"|"F"|"1905-05-05 00:00:00.000"|"" where I have an extra column entry.. how do I ignore those? – Biotite
Do you mean to say you have an extra pipe so it looks like there are too many columns? You import the entire row into just one column and again use SQL to clean it and split the columns out. Sounds dumb but it works. In fact if you were to use SSIS you would have the same problem - if the text driver can't identify columns correctly you need to import the entire row and split it manually – Revolutionist
Oh I see it isn't an extra pipe, so you still have six columns but the data is in the wrong spot. You just need to think of some logic to identify it and clean it. You could use a LIKE pattern to identify in each row which column the date is in. Column5 LIKE '____-__-__ 00:00:00.000' shows all rows with a date in column 5 – Revolutionist
@Nick.McDermaid -- Sounds good.. Let me try that.. Thanks again – Biotite
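The staging-and-split approach described in these comments can be sketched roughly as follows (an illustration only; the #Staging table name and the Col1..Col6 column names are assumptions, not part of the original post):

```sql
-- Sketch of the comment thread's suggestion (assumed names):
-- 1) load every raw line into a single wide column,
-- 2) split it on '|' afterwards with SQL,
-- 3) use a LIKE pattern to find which column really holds the date.
CREATE TABLE #Staging (RawLine NVARCHAR(4000));

BULK INSERT #Staging
FROM 'C:\TestData.csv'
WITH (FIRSTROW = 2, ROWTERMINATOR = '\n');

-- After splitting RawLine into Col1..Col6, rows whose fifth field
-- is a date can be identified with the pattern from the comment:
-- SELECT * FROM #Split WHERE Col5 LIKE '____-__-__ 00:00:00.000';
```

Because the whole row lands in one NVARCHAR column, nothing is rejected at load time; all validation happens later in plain SQL, where misplaced columns can be detected and repaired row by row.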
I deal with different CSV files that I receive from different sources on a weekly basis; some of the data is nice and clean, and other files are a nightmare. This is how I handle the CSV files I receive, and I hope it helps you. You will still need to add some data validation to handle malformed data.

SET NOCOUNT ON
GO

-- Create Staging Table
    IF OBJECT_ID(N'TempDB..#ImportData', N'U') IS NOT NULL
        DROP TABLE #ImportData

    CREATE TABLE #ImportData(CSV NVARCHAR(MAX))

-- Insert the CSV Data
    BULK INSERT #ImportData
        FROM 'C:\TestData.csv' 

-- Add Control Columns
    ALTER TABLE #ImportData 
        ADD ID INT IDENTITY(1, 1)

    ALTER TABLE #ImportData 
        ADD Malformed BIT DEFAULT(0)

-- Declare Variables
    DECLARE @Deliminator NVARCHAR(5) = '|', @ID INT = 0, @DDL NVARCHAR(MAX)
    DECLARE @NumberCols INT = (SELECT LEN(CSV) - LEN(REPLACE(CSV, @Deliminator, '')) FROM  #ImportData WHERE ID = 1)

-- Flag Malformed Rows
    UPDATE #ImportData
        SET Malformed = CASE WHEN LEN(CSV) - LEN(REPLACE(CSV, @Deliminator, '')) != @NumberCols THEN 1 ELSE 0 END

-- Create Second Staging Table
    IF OBJECT_ID(N'TestTable', N'U') IS NOT NULL
        DROP TABLE TestTable

    CREATE TABLE TestTable
        (ID varchar(4000),
        EmpID varchar(4000),
        FName varchar(4000),
        LName varchar(4000),
        Gender varchar(4000),
        DateOfirthB varchar(4000));

-- Insert CSV Rows
    WHILE(1 = 1)
        BEGIN
            SELECT TOP 1
                @ID = ID
                ,@DDL = 'INSERT INTO TestTable(ID, EmpID, FName, LName, Gender, DateOfirthB)' + CHAR(13) + CHAR(10) + REPLICATE(CHAR(9), 1)
                            + 'VALUES' -- + CHAR(13) + CHAR(10) + REPLICATE(CHAR(9), 2)
                                    + '(' + DDL + ')'
            FROM
                (
                    SELECT 
                        ID
                        ,DDL = '''' + REPLACE(REPLACE(REPLACE(CSV, '''', ''''''), @Deliminator, ''','''), '"', '')  + ''''
                    FROM  
                        #ImportData 
                    WHERE 
                        ID > 1
                        AND Malformed = 0) D
            WHERE
                ID > @ID
            ORDER BY
                ID

            IF @@ROWCOUNT = 0 BREAK

            EXEC sp_executesql @DDL
        END

-- Clean Up
    IF OBJECT_ID(N'TempDB..#ImportData', N'U') IS NOT NULL
        DROP TABLE #ImportData

-- View Results
    SELECT * FROM dbo.TestTable
Hohenstaufen answered 15/11, 2016 at 23:30 Comment(3)
psst: it's not @Deliminator, it's @Delimiter. But awesome script regardless. – Revolutionist
I purposely misspell words in code that I post, this way if I see it posted on other sites I know where it comes from ;-), thanks! – Hohenstaufen
@MarkKram -- Thanks much – Biotite
Since the OP stated "[...] insert the ones which have the correct syntax", I wonder why nobody suggested modifying the MAXERRORS clause. Although not all errors can be masked this way, it works well for conversion errors. Therefore, my suggestion is to use MAXERRORS = 999 in place of MAXERRORS = 0 (as in the original example).
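As a concrete illustration, the original BULK INSERT from the question could be adjusted like this (same options as in the question, with only MAXERRORS changed; rows that fail conversion are then written to the ERRORFILE instead of aborting the whole load):

```sql
BULK INSERT TestTable
FROM 'C:\TestData.csv'
WITH
   (FIRSTROW = 2,
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = '"|"',
    ROWTERMINATOR = '\n',
    ERRORFILE = 'C:\ImportErrors.csv',
    MAXERRORS = 999,  -- tolerate up to 999 bad rows instead of failing on the first
    TABLOCK
   );
```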

Underpay answered 23/11, 2020 at 16:18 Comment(1)
This should be a comment, not an answer; please update your post with a detailed solution. – Statuette