Find common lines to multiple files

I have nearly 200 files and I want to find the lines that are common to all 200 of them. The lines look like this:

HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1
HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/1
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/2
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/1
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/2

Is there a way to do this in a batch?

Peaceful answered 13/2, 2020 at 14:15 Comment(2)
On SO we encourage users to show the effort they have put into solving their own problem, so please add that to your question and let us know.Ubangi
Also, please mention the format of the files you want to traverse and check.Ubangi

I don't think there is a single Unix command that does this task out of the box, but you can build a small shell script around the comm and grep commands, as shown in the following example:

#!/bin/bash    

# Prepare 200 (small) test files
rm -f data-*.txt
for i in {1..200} ; do
    echo "${i}" >> "data-${i}.txt"
    # common line
    echo "foo common line" >> "data-${i}.txt"
done

# Get the common lines between file1 and file2.
# file1 and file2 can be any two files out of the set,
# ideally the smallest ones (note: comm requires sorted input)
comm -12 data-1.txt data-2.txt > common_lines

# Now grep through the remaining files for those lines
for file in data-{3..200}.txt ; do
    # For each remaining file reduce the common_lines to those
    # which are found in that file
    grep -Fxf common_lines "${file}" > tmp_common_lines \
        && mv tmp_common_lines common_lines
done

# Print the common lines
cat common_lines

The same approach can be used for bigger files. It will take longer, but memory consumption stays modest: grep only ever has to hold the current set of candidate common lines, and that set can only shrink from file to file.
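For the actual use case, a minimal sketch of the same idea as a reusable script is shown below; the script name, the argument handling, and the temporary file names are my assumptions, not part of the original example.

#!/bin/bash
# common_lines.sh (hypothetical name): print the lines common to all files
# passed as arguments, e.g.  ./common_lines.sh data-*.txt

# Seed the candidate set from the first two files; comm needs sorted input.
sort "$1" > .sorted_a
sort "$2" > .sorted_b
comm -12 .sorted_a .sorted_b | sort -u > common_lines
shift 2

# Every remaining file can only shrink the candidate set.
for file in "$@"; do
    grep -Fxf common_lines "$file" | sort -u > tmp_common_lines
    mv tmp_common_lines common_lines
done

cat common_lines
rm -f .sorted_a .sorted_b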

Gannes answered 13/2, 2020 at 14:35 Comment(0)
awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' file1 file2 file3 ... file200

This method processes each file line by line. The idea is to keep track of which lines have been seen in the current file using an associative array a[line]: 1 means the line has been seen in the current file, 0 means it has not (yet) been seen.

  1. (NR==FNR){a[$0]=1;next}: store every line of the first file in an array indexed by the line and mark it as seen. (NR==FNR) is the condition that is only true while the first file is being read.
  2. (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }: on the first line of each subsequent file, check which lines were seen in the previous file. Lines that were not seen are deleted from the array; lines that were seen are reset to not-seen (0). This frees memory and also handles duplicate lines within a single file.
  3. ($0 in a) { a[$0]=1 }: for every line, check whether it is a member of the array; if it is, mark it as seen (1).
  4. END{for (i in a) if (a[i]) print i}: once all lines have been processed, print the lines that were still seen in the last file (and therefore occur in every file).
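As a usage note, and assuming the files match a glob such as reads-*.txt (the name is my assumption, not from the question), the shell can expand the glob so you do not have to type all 200 file names:

# Hypothetical invocation: let the shell expand the file names and
# collect the common lines in a result file.
awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' reads-*.txt > common_lines.txt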
Forego answered 13/2, 2020 at 14:36 Comment(6)
OK, restored. For the OP: I recommend using awk as shown here, but the grep and comm approach may still be interesting for educational purposes.Gannes
Thanks, I tried something similar. Could you explain what all of the parts after awk mean? Also, do I have to list all 200 of my files?Peaceful
Yes, thanks a lot hek2mgl, I do not use this style much, so it is definitely very useful.Peaceful
@Peaceful You can use shell expansion to avoid listing out all 200 files manually. E.g. if the files are named file1.dat file2.dat, you can do awk '<CODE>' file*.dat. The shell will expand the file names before awk is invoked.Extravagancy
@Peaceful sorry there was a tiny bug in the code. This is now fixed.Forego
@Peaceful I have modified the code to handle duplicate lines.Forego

Could you please try the following. Fair warning: this will consume memory, since all of the data is stored in an array.

awk '
FNR==1{                # first line of each input file: bump the file counter
  file++
}
{
  a[$0]++              # count how many times each line has been seen overall
}
END{
 for(i in a){
   if(a[i]==file){     # seen as many times as there are files => common to all of them
     print "Line " i " is found in all " file " files."
   }
 }
}' file1 file2 ... file200
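One caveat: if the same line can occur more than once inside a single file, its count in a can reach the number of files without the line actually being present in all of them. A minimal sketch of a guard against that (the seen array is my own addition, not part of this answer) counts each distinct line only once per file:

awk '
FNR==1 { file++ }                   # new input file: bump the file counter
!seen[FILENAME, $0]++ { a[$0]++ }   # count each distinct line only once per file
END{
  for(i in a)
    if(a[i]==file)
      print i
}' file1 file2 ... file200

This keeps one entry per (file, line) pair in seen, so it uses correspondingly more memory.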
Ubangi answered 13/2, 2020 at 14:56 Comment(5)
I checked and this is working; probably my files don't have any lines in common... :/Peaceful
This could be problematic. Imagine you have 200 files of 1GB each, but not a single line in common. You will attempt to store 200GB of data in your array a.Forego
@kvantour, I added a warning about it. If memory is sufficient then this should be the simplest one IMHO.Ubangi
Stumbled on this solution and it worked best. I had smaller files, but it was a pain to find the common URLs in all of them. Worked like a charm, thank you for the code.Geometric
@nav33n, You're welcome, cheers and happy learning.Ubangi

My approach would be to generate a super-file that has, at the start of each row, columns for the filename and line number, followed by the corresponding line of content, and then to sort this file on the content column.

grep could generate the first part of this, especially if you can exclude some parts of the files; see the sketch below.
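A minimal sketch of that idea, assuming GNU grep, file names without colons, and a data-*.txt glob (all of which are my assumptions, not part of the answer):

# Build the super-file: every line of every file, prefixed with "filename:lineno:".
# grep -n with an empty pattern matches every line and adds exactly that prefix.
# Sorting from the third ':'-separated field onward sorts by the content column.
grep -n '' data-*.txt | sort -t: -k3 > superfile.txt

A final pass over superfile.txt (for example with awk or uniq) could then count how many distinct files contribute each content value.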

Twitch answered 13/2, 2020 at 20:6 Comment(0)

This one-liner prints every line that occurs more than once across the concatenation of the files (so, assuming no file contains internal duplicates, every line shared by at least two files):

sort files/* | uniq -d > onlycommonlines.txt
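To keep only the lines present in all of the files, a possible variant (the file count of 200 and the assumption that no single file contains duplicate lines are mine, not the answer's) counts the occurrences instead:

# Keep only lines whose occurrence count equals the number of files,
# stripping the count that uniq -c prepends.
sort files/* | uniq -c | awk '$1 == 200 { sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print }' > onlycommonlines.txt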
Jacks answered 24/6 at 1:53 Comment(0)
