Find common lines to multiple files

I have nearly 200 files and I want to find the lines that are common to all 200 of them. The lines look like this:

HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1
HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/1
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/2
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/1
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/2

Is there a way to do this in a batch?

Peaceful answered 13/2, 2020 at 14:15 Comment(2)
On SO we encourage users to show the effort they have put into solving their own problem, so please add that to your question and let us know.Ubangi
Also, please mention the format of the files you want to traverse and check.Ubangi

I don't think there is a single Unix command that does this task out of the box, but you can build a small shell script around the comm and grep commands, as shown in the following example:

#!/bin/bash    

# Prepare 200 (small) test files
rm -f data-*.txt
for i in {1..200} ; do
    echo "${i}" >> "data-${i}.txt"
    # common line
    echo "foo common line" >> "data-${i}.txt"
done

# Get the common lines between file1 and file2.
# file1 and file2 can be any two files out of the set,
# ideally the smallest ones (note: comm requires sorted input)
comm -12 data-1.txt data-2.txt > common_lines

# Now grep through the remaining files for those lines
for file in data-{3..200}.txt ; do
    # For each remaining file reduce the common_lines to those
    # which are found in that file
    grep -Fxf common_lines "${file}" > tmp_common_lines \
        && mv tmp_common_lines common_lines
done

# Print the common lines
cat common_lines

The same approach can be used for bigger files. It will take longer, but memory consumption stays modest: grep only ever has to hold the current set of candidate common lines, and that set can only shrink from file to file.
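For the actual use case, a minimal sketch of the same idea as a reusable script is shown below; the script name, the argument handling, and the temporary file names are my assumptions, not part of the original example.

#!/bin/bash
# common_lines.sh (hypothetical name): print the lines common to all files
# passed as arguments, e.g.  ./common_lines.sh data-*.txt

# Seed the candidate set from the first two files; comm needs sorted input.
sort "$1" > .sorted_a
sort "$2" > .sorted_b
comm -12 .sorted_a .sorted_b | sort -u > common_lines
shift 2

# Every remaining file can only shrink the candidate set.
for file in "$@"; do
    grep -Fxf common_lines "$file" | sort -u > tmp_common_lines
    mv tmp_common_lines common_lines
done

cat common_lines
rm -f .sorted_a .sorted_b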

Gannes answered 13/2, 2020 at 14:35 Comment(0)
awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' file1 file2 file3 ... file200

This method processes each file line by line. The idea is to keep track of which lines have been seen in the current file using an associative array a[line]: 1 means the line has been seen in the current file, 0 means it has not (yet) been seen.

  1. (NR==FNR){a[$0]=1;next}: store every line of the first file in an array indexed by the line and mark it as seen. (NR==FNR) is the condition that is only true while the first file is being read.
  2. (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }: on the first line of each subsequent file, check which lines were seen in the previous file. Lines that were not seen are deleted from the array; lines that were seen are reset to not-seen (0). This frees memory and also handles duplicate lines within a single file.
  3. ($0 in a) { a[$0]=1 }: for every line, check whether it is a member of the array; if it is, mark it as seen (1).
  4. END{for (i in a) if (a[i]) print i}: once all lines have been processed, print the lines that were still seen in the last file (and therefore occur in every file).
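As a usage note, and assuming the files match a glob such as reads-*.txt (the name is my assumption, not from the question), the shell can expand the glob so you do not have to type all 200 file names:

# Hypothetical invocation: let the shell expand the file names and
# collect the common lines in a result file.
awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' reads-*.txt > common_lines.txt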
Forego answered 13/2, 2020 at 14:36 Comment(6)
OK, restored. For the OP: I recommend using awk as shown here, but the grep and comm approach may still be interesting for educational purposes.Gannes
Thanks, I tried something similar. Could you explain what all of the parts after awk mean? Also, do I have to list all 200 of my files?Peaceful
Yes, thanks a lot hek2mgl, I do not use this style much, so it is definitely very useful.Peaceful
@Peaceful You can use shell expansion to avoid listing out all 200 files manually. E.g. if the files are named file1.dat file2.dat, you can do awk '<CODE>' file*.dat. The shell will expand the file names before awk is invoked.Extravagancy
@Peaceful sorry there was a tiny bug in the code. This is now fixed.Forego
@Peaceful I have modified the code to handle duplicate lines.Forego

Could you please try the following. Fair warning: this will consume memory, since all of the data is stored in an array.

awk '
FNR==1{                # first line of each input file: bump the file counter
  file++
}
{
  a[$0]++              # count how many times each line has been seen overall
}
END{
 for(i in a){
   if(a[i]==file){     # seen as many times as there are files => common to all of them
     print "Line " i " is found in all " file " files."
   }
 }
}' file1 file2 ... file200
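One caveat: if the same line can occur more than once inside a single file, its count in a can reach the number of files without the line actually being present in all of them. A minimal sketch of a guard against that (the seen array is my own addition, not part of this answer) counts each distinct line only once per file:

awk '
FNR==1 { file++ }                   # new input file: bump the file counter
!seen[FILENAME, $0]++ { a[$0]++ }   # count each distinct line only once per file
END{
  for(i in a)
    if(a[i]==file)
      print i
}' file1 file2 ... file200

This keeps one entry per (file, line) pair in seen, so it uses correspondingly more memory.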
Ubangi answered 13/2, 2020 at 14:56 Comment(5)
I checked and this is working; probably my files don't have any lines in common... :/Peaceful
This could be problematic. Imagine you have 200 files of 1GB each, but not a single line in common. You will attempt to store 200GB of data in your array a.Forego
@kvantour, I added a warning about it. If memory is sufficient then this should be the simplest one IMHO.Ubangi
Stumbled on this solution and it worked best. I had smaller files, but it was a pain to find the common URLs in all of them. Worked like a charm, thank you for the code.Geometric
@nav33n, You're welcome, cheers and happy learning.Ubangi

My approach would be to generate a super-file that has, at the start of each row, columns for the filename and line number, followed by the corresponding line of content, and then to sort this file on the content column.

grep could generate the first part of this, especially if you can exclude some parts of the files; see the sketch below.
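A minimal sketch of that idea, assuming GNU grep, file names without colons, and a data-*.txt glob (all of which are my assumptions, not part of the answer):

# Build the super-file: every line of every file, prefixed with "filename:lineno:".
# grep -n with an empty pattern matches every line and adds exactly that prefix.
# Sorting from the third ':'-separated field onward sorts by the content column.
grep -n '' data-*.txt | sort -t: -k3 > superfile.txt

A final pass over superfile.txt (for example with awk or uniq) could then count how many distinct files contribute each content value.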

Twitch answered 13/2, 2020 at 20:6 Comment(0)

This one-liner prints every line that occurs more than once across the concatenation of the files (so, assuming no file contains internal duplicates, every line shared by at least two files):

sort files/* | uniq -d > onlycommonlines.txt
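To keep only the lines present in all of the files, a possible variant (the file count of 200 and the assumption that no single file contains duplicate lines are mine, not the answer's) counts the occurrences instead:

# Keep only lines whose occurrence count equals the number of files,
# stripping the count that uniq -c prepends.
sort files/* | uniq -c | awk '$1 == 200 { sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print }' > onlycommonlines.txt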
Jacks answered 24/6 at 1:53 Comment(0)
