linux awk comparing two csv files and creating a new file with a flag

Asked 2/3, 2012 at 4:58 Answered 2/3, 2012 at 6:26

I have 2 CSV files that i need to compare and get the difference to a newly formatted file. The samples are given below.

OLD file

DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq

NEW file

DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee

Output file

I want to compare the old and new CSV files and to find the changes that has effected in the new file and UPDATE a FLAG to denote these changes

U - if the new file record is UPDATED D - if a record existing in the old file is deleted in the new file N - if a record existing in the new file is not available in the old file

the sample output file is this.

DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N

I used diff command but it will repeat the UPDATED record too which is not I want.

 DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
 DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
 DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
  ---
 DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
 DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
 5a5
 DTL,77777777,7777777777777777,77777777777,N,N,ee,ee

I used an AWK single line command to filter out my records as well

 awk 'NR==FNR{A[$1];next}!($1 in A)' FS=: old.csv new.csv

the problem with this is is doesnt get me the records only belonging to the OLD file. which is

DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd

I initiated an driven bash script as well to ahieve this but didnt find much help with a good example.

 myscript.awk

BEGIN { 
        FS = ","    # input field seperator 
        OFS = ","   # output field seperator
}

NR > 1 {
    #flag 
    # N - new record  D- Deleted U - Updated

id = $1
    name = $2
    flag = 'N'

   # This prints the columns in the new order. The commas tell Awk to use the     character set in OFS
    print id,name,flag
}

 >> awk -f  myscript.awk  old.csv new.csv > formatted.csv

Errecart answered 2/3, 2012 at 4:58 Comment(1)

the simple right way is described here and works: https://mcmap.net/q/186310/-how-to-get-the-difference-only-additions-between-two-files-in-linux – Surgery 25/2, 2016 at 13:25

This might work for you:

diff  -W999 --side-by-side OLD NEW |
sed '/^[^\t]*\t\s*|\t\(.*\)/{s//\1 U/;b};/^\([^\t]*\)\t*\s*<$/{s//\1 D/;b};/^.*>\t\(.*\)/{s//\1 N/;b};d'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N

an awk solution along the same lines:

diff -W999 --side-by-side OLD NEW |
awk '/[|][\t]/{split($0,a,"[|][\t]");print a[2]" U"};/[\t] *<$/{split($0,a,"[\t]* *<$");print a[1]" D"};/>[\t]/{split($0,a,">[\t]");print a[2]" N"}'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N

Lakenyalaker answered 2/3, 2012 at 6:26 Comment(7)

Thanks guys for the very valuable replies .. When I used my data set with the above mentioned values, the awk solution worked but with the following data set the script failed. – Errecart 4/3, 2012 at 7:4

DTL,11111112,1111111111111110,11111111112,Y,N,abcdefghijklmnopqrstuvwxyz123457,abcdefghijklmnopqrstuvwxyz654322 DTL,11111113,1111111111111111,11111111113,Y,N,abcdefghijklmnopqrstuvwxyz123458,abcdefghijklmnopqrstuvwxyz654323 DTL,11111114,1111111111111112,11111111114,Y,N,abcdefghijklmnopqrstuvwxyz123459,abcdefghijklmnopqrstuvwxyz654324 DTL,11111115,1111111111111123,11111111115,Y,N,abcdefghijklmnopqrstuvwxyz123460,abcdefghijklmnopqrstuvwxyz654325 – Errecart 4/3, 2012 at 7:5

as I see it has the same column numbers but doesnt show the desired results> I am kind of new to bash scripting..so pls let me know what I am missing in this script. the diff in the 2 data sets is only the last 2 data columns which represents key fields with alphanumeric values. – Errecart 4/3, 2012 at 7:6

The problem was the column width needed to be increased (-W999) and the deletion regexp needed tweaking. I've amended both solutions. If the csv record is longer then 999 characters, amend the -W switch accordingly. – Lakenyalaker 4/3, 2012 at 10:56

Many thanks potong your answer really helped me. now I have created a .sh script file to obtain my csv files based on dates, and pass the csv files to process it using diff and awk> I have testing this with my dummy data sets and everything works fine. When I have csv files with 10000 records in each, the script acts strangely. I added 10001 st record for a csv and executed the script and expected to get 1 diff but I got 10001 diffs. When I reduced the record count to 9999 in one csv, I got the diff as 1 as expected. – Errecart 5/3, 2012 at 5:17

I repeated this without any issue for records numbering 5, 3223,7887,9998 etc and it was working fine. Is this may be an issue with memory or any limitation on awk ..pls give your thoughts..thanking you for your valuable response – Errecart 5/3, 2012 at 5:18

I was able to perform the task with this script with files less than 10000 records.but if the 2 file's f ordering is not identical , this will not work.. anyway thanks for all the feedback – Errecart 6/3, 2012 at 21:55

A good starting point would probably be:

 diff -e OLD NEW

This outputs:

 5a
 DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
 .
 1,3c
 DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
 DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc

Meaning that it Added a record on line 5 (5a) and changed the records on lines 1 and 3 (1,3c).

If you can't use this format as-is (which would be good to use a standard) then you would need to write a script which converts it to the format that you describe.

Cementum answered 2/3, 2012 at 5:33 Comment(0)

Recommended topics

Hot tags