extracting unique values between 2 sets/files
R

9

43

Working in a Linux/shell environment, how can I accomplish the following:

text file 1 contains:

1
2
3
4
5

text file 2 contains:

6
7
1
2
3
4

I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.

How do I do this from the command line?

many thanks!

Radiative answered 17/1, 2011 at 19:56 Comment(3)
Is it homework? If so, please tag it as such.Spicule
what is the separator of values?Forego
good catch! each value is on its own line; so newline sep.Radiative
A
73
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given
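
For the reverse question (the lines of file1 that are missing from file2), the same command works with the file order swapped:

$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file2 file1
5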
Ammonite answered 17/1, 2011 at 20:14 Comment(11)
sweet! this works great but what if the values are each on a separate line, not separated by space as in my example (i actually had them on a new line but SO formatted them on the same line)?Radiative
@Radiative my code will work for both cases, but if each number is on a separate line, you can completely remove the RS="[ \n]" to make the code shorter. Also, welcome to SO.Ammonite
@Radiative by the way, to prevent SO from formatting your code, highlight it then either press CTRL+K or click the { } icon. I edited your question already with this change.Ammonite
@SiegeX, consider adding some spaces into the code sample, please. The snippet is hard to understand for lack of them. One may think that you are playing code golf or that your spacebar is broken)Heptamerous
@Heptamerous that's what the Edit button is for, if you think you have a better way to represent it that others can benefit from feel free to make the edit and put it up to a voteAmmonite
Just for the record, this technique, though it works, will create an a[$0] mem position for every second-file entry that does not exist in the first file. If the second file has a million lines, this solution is not the best. The $0 in a method, on the other hand, checks for existence but does NOT create an extra mem position for second-file entries.Hord
@GeorgeVasiliou The main reason I don't use $0 in a is because order is not guaranteed to be preserved and that is important more often than not.Ammonite
When using this in scripts, make sure to first check that file1 is not empty (0 bytes long), because if it is, you will get an empty result instead of the expected contents of file2. (Cause: FNR==NR will apply to file2 then.)Ikey
@Ammonite Why is order not guaranteed to be preserved with $0 in a?Rarebit
@Rarebit you're right, it doesn't. I must have been thinking about the for(i in a) construct back in '17. Thanks for keeping me honest, I will update the answer.Ammonite
@Ammonite Thanks! And for(i in a) not preserving order just sounds like an accident waiting to happen...Rarebit
N
30

Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

Note that comm compares duplicate lines one-for-one: if '3' appears once in file1 but twice in file2, one '3' will still be output. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
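
Alternatively, sort -u collapses duplicates in a single step, so the pipe through uniq is not needed:

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted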

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.

Natalee answered 17/1, 2011 at 20:33 Comment(3)
Good call on these utilities. You can combine this to a much simpler form and remove the need for temp files: comm -13 <(sort file1) <(sort file2) I still prefer awk, because it runs a single process instead of 3 and doesn't require sorted files. This can make a big difference on large files.Ammonite
join can also be used for this.Felicita
@Ammonite - I personally prefer the version with 3 commands - that way if I need to tweak one of the commands (or, for example, get an updated file1) I don't need to re-run the WHOLE thing, which can be a benefit for really large files. Also, the syntax you provided sounds like bash; it may not work on other shells (/bin/sh or csh derivatives)Providing
P
15

I was wondering which of the following solutions was the "fastest" for "larger" files:

awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2          # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2

Results of my benchmarks in short:

  • Do not use grep -Fxf; it's much slower (2-4 times in my tests).
  • comm is slightly faster than join.
  • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The commands as written above do not assume sorted files, since the sort is built into the pipeline.)
  • awk1 + awk2 presumably use more RAM and less CPU. Real run times are lower for comm, probably because its sort processes run in parallel; CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
  321599   321599  8098710 file1
  321603   321603  8098794 file2

Typical results of the fastest runs:

awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004
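
For anyone who wants to repeat the comparison, a minimal harness along these lines should work; the seq/shuf test data is my own stand-in, not the original input files:

$ export LC_ALL=C
$ seq 1 321599 | shuf > file1    # stand-in test data
$ seq 5 321603 | shuf > file2
$ time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null
$ time comm -13 <(sort file1) <(sort file2) > /dev/null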

BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
Plastered answered 29/9, 2013 at 17:34 Comment(1)
Excellent benchmarking, results, and optimizations. Thank you!Whirl
P
10

How about:

diff file_1 file_2 | grep '^>' | cut -c 3-

This would print the entries in file_2 which are not in file_1. For the opposite result, one just has to replace '>' with '<', as shown below. 'cut' removes the first two characters added by 'diff', which are not part of the original content.

The files don't even need to be sorted.
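
For instance, the opposite direction (entries in file_1 which are not in file_2) would be:

diff file_1 file_2 | grep '^<' | cut -c 3-

One caveat: diff matches lines by position, so with unsorted input a value that appears in both files at distant positions can occasionally be reported anyway.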

Prior answered 5/5, 2014 at 18:14 Comment(0)
F
7

With grep:

grep -F -x -v -f file_1 file_2

Here -f file_1 takes the patterns from file_1, -F treats them as fixed strings rather than regular expressions, -x matches whole lines only, and -v inverts the match, printing the lines of file_2 that match none of the patterns.
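
Applied to the sample files from the question, this should print:

$ grep -F -x -v -f file_1 file_2
6
7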
Feathers answered 17/1, 2011 at 20:48 Comment(2)
This leads to wrong results, as can be shown if a . (dot) is added to file_1. grep -F -x -v -f file_1 file_2 is indeed correct.Plastered
@xebeche: Thx! Corrected the code line according to your suggestion.Feathers
C
3

Here's another awk solution:

$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7
Changeful answered 17/1, 2011 at 23:58 Comment(5)
What are the rules where you can use () in lieu of {}? I'm assuming this isn't a gawk'ism because you tend to use gawk when that's the case.Ammonite
as you know, awk syntax consists of /pattern/{action}. ((!$0 in a)) is the "pattern" part. {action} is printing by default. Its like you can do NR==1 (for example).Changeful
I guess I'm more curious about the double set of parens, why isn't (!$0 in a) sufficient? Btw, if you prefix your comment with @username then username actually gets a notification that there is a comment to them pending, otherwise they won't. The @username prefix isn't necessary only if you are the person who wrote the question and/or answer people are commenting on. So technically I didn't need to do it for this comment to you.Ammonite
@SiegeX, no the double parenthesis doesn't matter in this case. Its a habit of mine. the double parenthesis is needed though, if there are more conditions.Changeful
This solution is the only one that checks existence in the array without creating an extra mem position.Hord
B
1
$ cat file1 file1 file2 | sort | uniq -u
6
7

uniq -- report or filter out repeated lines in a file

... Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.

-u      Only output lines that are not repeated in the input.

Print file1 twice to make sure all entries from file1 are skipped by uniq -u. (Note that a value repeated within file2 itself will also be suppressed.)

Brigitta answered 25/11, 2021 at 6:55 Comment(0)
T
0
cat file1 file2 | sort -u > unique
Telluric answered 28/7, 2020 at 17:23 Comment(1)
sort -u will keep one occurrence of each duplicate. In the question example, 1, 2, 3, and 4 would still be printed in addition to the desired values of 6 and 7.Brigitta
E
-1

If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that.

However, I'd encourage you to use Perl or Python for this. In Python, the flow of the program would be:

def find_unique_values(file1, file2):
    # read each file into a list of values, one per line
    contents1 = [line.rstrip('\n') for line in open(file1)]
    contents2 = [line.rstrip('\n') for line in open(file2)]
    # for every value in file2, scan all of file1 for a match
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)

find_unique_values('file1', 'file2')

This isn't the most elegant way of doing this, since it has O(n^2) time complexity, but it will do the job.

Endrin answered 17/1, 2011 at 20:14 Comment(0)
