Remove only exact number of repeat matches between two files

Asked 20/10, 2023 at 11:13 Answered 21/10, 2023 at 20:43

I want to get the remaining difference between two files that have redundant entries.

File1.txt:

Data1
Data1
Data2
Data2
Data3
Data3
Data3
Data3
Data4
Data5
Data6
Data6

and

File2.txt:

Data1
Data2
Data2
Data3
Data3
Data4
Data5
Data6

The desired Finalfile.txt:

Data1
Data3
Data3
Data6

In other words: if an entry shows up n times in File1.txt and m times in File2.txt, then the final file should contain that entry n-m times. For example, there are four entries of Data3 in File1.txt and only two in File2.txt, so Finalfile.txt has 2 occurrences of Data3.

I've tried:

grep -v -f File1.txt File2.txt > Finalfile.txt

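but it gives the absolute difference: every entry that also occurs in the other file is dropped completely, no matter how many times it appears.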
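To make the problem with the grep attempt concrete, here is a minimal sketch. The temp directory and the file names a.txt and b.txt are made up for illustration, and the arguments are ordered so that lines of a.txt are filtered by patterns from b.txt. The -x and -F options are an addition, not part of the original command; they force whole-line, fixed-string matching.

```shell
#!/bin/sh
# Hypothetical demo files; these names are placeholders, not from the question.
tmp=$(mktemp -d)
printf '%s\n' Data1 Data1 Data2 > "$tmp/a.txt"
printf '%s\n' Data1 > "$tmp/b.txt"

# Each line of b.txt is a pattern, and -v drops EVERY line of a.txt that
# matches one of them. Counts are never considered, so both copies of
# Data1 are removed even though b.txt contains it only once.
result=$(grep -vxF -f "$tmp/b.txt" "$tmp/a.txt")
echo "$result"   # prints Data2; the n-m rule would want Data1 and Data2

rm -r "$tmp"
```

This is a plain set difference, which is why a counting approach is needed.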

Peduncle answered 20/10, 2023 at 11:13 Comment(7)

Why is Data4 shown in the final output when both files have exactly one line of Data4? – Fifth 20/10, 2023 at 11:28

Are the files sorted (as your examples suggest)? If not, would it matter if the output is? – Mchenry 20/10, 2023 at 11:28

comm -23 File{1,2}.txt or, if your files are not sorted, comm -23 <(sort File1.txt) <(sort File2.txt). See my answer for the details. – Browne 20/10, 2023 at 14:15

You described what to output if "foo" appears more times in file1 than in file2 but what should be output if "foo" appears more times in file2 than in file1? – Jurisprudent 20/10, 2023 at 14:58

@anubhava, you're right, I've edited it now. – Peduncle 23/10, 2023 at 8:23

@Mchenry not sorted and sorting does not matter. – Peduncle 23/10, 2023 at 8:23

@EdMorton in my particular situation, File2 will always have fewer occurrences of an entry than File1. – Peduncle 23/10, 2023 at 8:24

You may use this awk solution, which counts entries up while reading the first file and down while reading the second:

awk '
NR == FNR {
   ++fq[$1]
   next
}
{
   --fq[$1]
}
END {
   for (s in fq)
      for (i = 1; i <= fq[s]; ++i)
         print s
}' file1 file2

Data1
Data3
Data3
Data6

Fifth answered 20/10, 2023 at 11:29 Comment(0)

Another minimalist awk solution.

The algorithm is a multiset difference: file1 \ file2. It doesn't require sorted input.

$ awk 'NR==FNR{a[$1]++; next} --a[$1]<0' file2 file1

Data1
Data3
Data3
Data6

Note that in your desired output Data4 should not be there!
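For readers who find the one-liner terse, the same idea can be spelled out with comments. This is only a sketch: the function name multiset_diff and the array name count are made up here, and like the one-liner it keys on the first field ($1) rather than the whole line ($0).

```shell
#!/bin/sh
# multiset_diff MINUEND SUBTRAHEND   (hypothetical helper name)
# Prints every line of MINUEND that is left over after cancelling one
# occurrence for each matching line of SUBTRAHEND. Output keeps the
# order of MINUEND; neither file has to be sorted.
multiset_diff() {
    awk '
        # awk reads the subtrahend first (NR == FNR holds only there,
        # assuming that file is not empty): count each key.
        NR == FNR { count[$1]++; next }

        # Then the minuend: each line uses up one counted occurrence.
        # Once the count drops below zero there is no partner left in
        # the subtrahend, so the pattern is true and the line is printed.
        --count[$1] < 0
    ' "$2" "$1"
}
```

Calling multiset_diff File1.txt File2.txt corresponds to the command above, with the argument swap hidden inside the function.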

Hypothesis answered 21/10, 2023 at 20:43 Comment(1)

clean and nice ++ – Selfacting 22/10, 2023 at 9:20

If your files are sorted you can try:

$ comm -23 File1.txt File2.txt

Data1
Data3
Data3
Data6

By default comm prints 3 columns:

lines unique to FILE1,
lines unique to FILE2,
lines that appear in both files.

The -23 option suppresses the lines unique to FILE2 and the lines that appear in both files. See man comm for the details. Note that if a line appears more times in File2.txt than in File1.txt it will not be printed.

If your files are not sorted yet you can try:

$ sort File1.txt | comm -23 - <(sort File2.txt)

The first comm input file is - (the standard input), that is, the standard output of sort File1.txt. The second is the <(sort File2.txt) process substitution, which presents the standard output of sort File2.txt as a named file.
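Process substitution is a bash/ksh/zsh feature. Where only a plain sh is available, a temporary file can stand in for it. The sketch below assumes mktemp exists on the system (it is common but not part of POSIX); the function name sorted_multiset_diff is made up for illustration. LC_ALL=C is set so that sort and comm agree on the collation order.

```shell
#!/bin/sh
# sorted_multiset_diff FILE1 FILE2   (hypothetical helper name)
# Prints the lines of FILE1 that remain after cancelling them one-for-one
# against FILE2. The output is sorted, not in FILE1's original order.
sorted_multiset_diff() {
    tmp=$(mktemp) || return 1
    # comm expects both inputs sorted the same way, so pin the locale.
    LC_ALL=C sort "$2" > "$tmp"
    # "-" makes comm read the sorted FILE1 from standard input;
    # -23 hides column 2 (only in FILE2) and column 3 (in both).
    LC_ALL=C sort "$1" | LC_ALL=C comm -23 - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would be sorted_multiset_diff File1.txt File2.txt > Finalfile.txt.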

Browne answered 20/10, 2023 at 12:48 Comment(1)

nice solution specially comm thnx ++ – Selfacting 22/10, 2023 at 9:20

Use a combination of sort and diff with grep:

diff <(sort test1.txt) <(sort test2.txt) | grep -Po '^< \K.*'

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.

^ : Beginning of the line.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.


Griseofulvin answered 20/10, 2023 at 13:13 Comment(2)

If you use sed -n 's/^< //p' instead of grep -Po '^< \K.*' then it'll work on any Unix box rather than just those that have GNU grep. – Jurisprudent 20/10, 2023 at 15:32

reinventing comm... – Hypothesis 23/10, 2023 at 12:33
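A portable variant of the diff approach, using the sed filter suggested in the comments, might look like the sketch below. The function name diff_multiset_diff is made up, mktemp is assumed to be available, and temporary files replace the process substitutions. One assumption worth flagging: the result is only a true multiset difference if diff reports a minimal set of changes, and GNU diff only tries harder for that when given -d/--minimal, so treat this as suitable for small inputs.

```shell
#!/bin/sh
# diff_multiset_diff FILE1 FILE2   (hypothetical helper name)
# Sorts both files, diffs them, and keeps the lines that diff marks as
# present only on the left-hand side.
diff_multiset_diff() {
    left=$(mktemp) || return 1
    right=$(mktemp) || { rm -f "$left"; return 1; }
    LC_ALL=C sort "$1" > "$left"
    LC_ALL=C sort "$2" > "$right"
    # In diff's default output format, lines deleted from the left file
    # are prefixed with "< ". sed -n ... p prints only those lines, with
    # the prefix stripped.
    diff "$left" "$right" | sed -n 's/^< //p'
    rm -f "$left" "$right"
}
```

Called as diff_multiset_diff File1.txt File2.txt, it prints the leftover entries in sorted order.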

With your shown samples only, please try the following awk code.

awk '
FNR==NR{
  file1Count[$1]++
  next
}
{
  file2Count[$1]++
}
END{
  for(i in file1Count){
     diff=file1Count[i]-file2Count[i]
     if(diff>0){
        while(++j<=diff){
           print i
        }
        j=0
     }
  }
}
' file1.txt file2.txt

Capsulize answered 20/10, 2023 at 13:32 Comment(0)

Here is a Ruby one-liner to do that:

ruby -lne 'BEGIN{cnt1,cnt2=[0,1].map{|_| Hash.new {|h,k| h[k] = 0} } }
if $<.file.lineno == $<.lineno then
    cnt1[$_]+=1
else
    cnt2[$_]+=1
end

END{
puts cnt1.select{|k,v| v>cnt2[k]}.
    map{|k,v| ([k]*(cnt1[k]-cnt2[k])).join("\n")}  
}' file1.txt file2.txt

Prints:

Data1
Data3
Data3
Data6

Seesaw answered 20/10, 2023 at 15:26 Comment(0)

You can also solve this with Perl:

perl -lane '
         BEGIN{%fq=()};
         $ARGV eq "File1.txt" ? ++$fq{$F[0]} : --$fq{$F[0]};
         END{foreach $s (sort keys %fq)
         {for($i=1;$i<=$fq{$s};++$i){print $s}}}' File1.txt File2.txt

Output

Data1
Data3
Data3
Data6

Selfacting answered 20/10, 2023 at 19:57 Comment(0)
