Fastest way to find lines from a large file in another file

I am using grep in a while loop to find lines from one file in another file and saving the output to a new file. My file is quite large (226 million lines) and the script is taking forever (12 days and counting). Do you have a suggestion to speed it up? Perhaps there is a better way than grep?

(I also need the preceding line for the output, therefore grep -B 1.)

Here is my code:

#!/bin/bash

while IFS= read -r line; do
  grep -B 1 $line K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33;
done <21mercounts.bf.trimmedreads.diff.kmers 

Update:

The input file with the lines to look for is 4.7 GB and 226 million lines, and looks like this:

AAAGAAAAAAAAAGCTAAAAT
ATCTCGACGCTCATCTCAGCA
GTTCGTCGGAGAGGAGAGAAC
GAGGACTATAAAATTGTCGCA
GGCTTCAATAATTTGTATAAC
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
AAAAAACTTACCTTAAAAAGT
TTAGTACACAATATCTCCCAA

The file to look in is 26 GB and 2 billion lines and looks like this:

>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

The expected output would be this:

>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
Grijalva answered 2/1, 2023 at 14:12 Comment(5)
Yes of course, I have updated the question with more info on the files.Grijalva
Faster might be to read a chunk (say, 1GB) and look for all lines in that. Then the next chunk, until done.Serin
Do the sequences in the FASTA file span multiple lines? If so then you'll have to switch to a tool that can rebuild the complete sequences before searching for a match.Corry
@ElsaSverris : my suggestion - create a small regex, say, just for first 3-4 letters that might match your "needles", then pipe those down to another awk to regex scan for just letters 5-8, for instance (or do it in the same awk instance). After that, you've narrowed the search window significantly - and NOW you can do your sort + grep routine….. don't pre-sort all the hay in the haystack for no reasonPsychoneurosis
You could also dump the data into SQLite3 and let the indexes do the heavy lifting.Issacissachar
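
One way the SQLite idea from the last comment could look (a rough sketch only; the database, table, and column names here are made up, and the on-disk database plus index will take noticeably more space than the raw files):

# Pair up header and sequence lines, then let an index do the lookups.
paste -d'\t' - - < K33.21mercounts.bf.trimmedreads.dumps.fa > haystack.tsv

sqlite3 kmers.db <<'SQL'
CREATE TABLE haystack(header TEXT, seq TEXT);
CREATE TABLE needles(seq TEXT);
.mode tabs
.import haystack.tsv haystack
.import 21mercounts.bf.trimmedreads.diff.kmers needles
CREATE INDEX haystack_seq_idx ON haystack(seq);
.output 21mercounts.bf.trimmedreads.diff.kmers.K33
-- assumes each k-mer is unique in both files, as in the example data
SELECT h.header || char(10) || h.seq
FROM needles n JOIN haystack h ON h.seq = n.seq;
SQL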

You can try this grep -f command without a shell loop, using a fixed-string search:

grep -B1 -Ff 21mercounts.bf.trimmedreads.diff.kmers \
 K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33
Baseboard answered 2/1, 2023 at 14:20 Comment(3)
This will load 226 million lines into grep's memory as a lookup table, effectively much like Sundeep's awk solution (with similar memory caveats), but with a little more work scanning for the start of the key string in each letter of the target line. I suspect adding a -x would speed it up if it can work, but I've also seen grep really lose efficiency with large search files on some systems. YMMV.Eleanoreleanora
fwiw, a 200MB *.kmers file (9 million lines @ 21 chars per line) grep -f requires ~1,200 MB of memory (per /usr/bin/time -v) for a factor of ~6; for OP's 4.7GB file this equates to ~28GB of RAMRahm
Thank you so much for that suggestion, it solved my problem, and only took 155 minutes.Grijalva
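
If the pattern list is too big for RAM (see the memory estimate in the comments), a variation on the command above is to split the needles into chunks and re-read the haystack once per chunk; -x (whole-line matching) is also worth trying, since every needle is a complete line. A rough sketch, with a guessed chunk size:

# Split the needle list into pieces that fit in memory, then run the
# fixed-string, whole-line search once per piece.
split -l 50000000 21mercounts.bf.trimmedreads.diff.kmers needles.part.
for part in needles.part.*; do
  grep -B1 -Fxf "$part" K33.21mercounts.bf.trimmedreads.dumps.fa
done > 21mercounts.bf.trimmedreads.diff.kmers.K33

With GNU grep, --no-group-separator suppresses the "--" lines that -B1 inserts between non-adjacent matches, if those get in the way.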

There are quite a few tools (e.g. ripgrep) and options (-f, -F, and -x) to speed up your basic approach. But all of them are basically the same slow approach as you are using now, "only" sped up by a huge but still constant factor. For your problem and input sizes, I'd recommend changing the approach altogether. There are many different ways to tackle your problem. First, let's define some variables to estimate the speedup of those approaches:

Problem

A 26 GB haystack file with h = 1 billion entries (description, sequence) = 2 billion lines, e.g.

>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
...

A 4.7 GB needles file with n = 226 million lines, each of length m = 21, e.g.

GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
...

For all needles, we want to extract the corresponding entries in the haystack (if they exist).

Solutions

We assume n < h and a constant m. Therefore O(n+h) = O(h), O(m)=O(1) and so on.
Our goal is to minimize the number of times we have to iterate the biggest file (= the haystack).

Naive – O(h·n) time

Currently, you are using the naive approach. For each needle, the entire haystack is searched once.

Put needles into data structure; search haystack once – O(h) time

Store all needles in a data structure which has a fast contains() operation. Then iterate the haystack and call needles.contains(haystackEntry) for each entry, to decide whether it is something you are searching for.

Currently, your "data structure" is a list, which takes O(1) time to "build" (because it is already in that form) , but O(n) time to query once!
Below data structures take O(n) time to populate and O(1) time to query once, resulting in O(n + h·1) = O(h) time in total.

  • Tries (= prefix trees) can be expressed as regexes, so you can stick with grep. E.g. the needles ABC, ABX, and XBC can be stored in the Trie regex ^(AB(C|X)|XBC); a toy illustration follows this list. But converting the list of needles to such a Trie regex is a bit complicated in bash.
  • Hash maps are available in awk, see sundeep's answer. But putting 4.7 GB of raw data into such a structure in memory is probably not very efficient, if even possible (it depends on your available memory; the hash map needs to be many times bigger than the raw data).
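
As a toy illustration of the trie-as-regex idea, using just the three example needles above (generating such a regex automatically for 226 million needles is the hard part):

printf '%s\n' ABC ABX XBC XYZ CBA > toy.txt
grep -E '^(AB(C|X)|XBC)' toy.txt   # prints ABC, ABX and XBC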

Either way, data structures and bash don't mix very well. And even if we switched to a better language, we would have to re-build or store/load the structure each time the program runs. Therefore it is easier and nearly as efficient to ...

Sort everything; search haystack once – O( h·log(h) + h ) time

First sort the haystack and the needles, then iterate the haystack only once.

Take the first needle and search the haystack from the beginning. When reaching a haystack entry that would have to be sorted behind the current needle, take the next needle and continue the search from your current location.

This can be done easily in bash. Here we use GNU coreutils to make processing a bit easier, faster, and safer:

export LC_ALL=C  # speeds up sorting
mem=66%    # Max. memory to be used while sorting. More is better.
sep=$'\f'  # A character not appearing in your data.

paste -d"$sep" - -  < haystack > haystack2

sort -S"$mem" -o needles2 needles
sort -S"$mem" -t"$sep" -k2,2 -o haystack2 haystack2

# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -22 -o2.1,2.2 --nocheck-order needles2 haystack2 |
tr "$sep" \\n

This changes the order of the output. If you need the output in the original order, use a Schwartzian transform (= decorate-sort-undecorate): Before sorting the needles/haystack, store their line numbers. Drag those along through the entire process. At the end, sort the found entries by those line numbers. Finally, remove the line numbers and print the result.

export LC_ALL=C  # speeds up sorting
mem=66%    # Max. memory to be used while sorting. More is better.
sep=$'\f'  # A character not appearing in your data.

nl -ba -d '' -s"$sep" needles > needles2
paste -d"$sep" - -  < haystack | nl -ba -d '' -s"$sep" > haystack2

sort -t"$sep" -k2,2 -S"$mem" -o needles2 needles2
sort -t"$sep" -k3,3 -S"$mem" -o haystack2 haystack2

# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -12 -23 -o1.1,2.1,2.2,2.3 --nocheck-order needles2 haystack2 > result
sort -t"$sep" -k1,2n -S"$mem" -o result result
cut -d"$sep" -f3- result | tr "$sep" \\n
Hessler answered 2/1, 2023 at 17:36 Comment(0)

If preserving the original order is not required, using GNU uniq and GNU sed:

{ cat 21mercounts.bf.trimmedreads.diff.kmers
  sed -n 'x;n;G;s/\n//p' K33.21mercounts.bf.trimmedreads.dumps.fa
} | LC_ALL=C sort | uniq -w21 -D |
sed -n 's/\(.*\)>\(.*\)/>\2\n\1/p' > 21mercounts.bf.trimmedreads.diff.kmers.K33
Pianissimo answered 2/1, 2023 at 15:41 Comment(2)
Nice idea! Took me a while to understand, so here are hints for the reader: sed -n 'x;n;G;s/\n//p' processes pairs of lines from the big file. For each pair, the lines are swapped and written in a single line, e.g. the result for OP's example is AAAAA…AA>264638, AAAGA…AAT>1, …. After that we basically do a join, but under the assumption that each sequence (e.g. ATCT...) is exactly 21 characters long and unique within each file. Therefore, we use uniq -D (print duplicate lines) instead of join. Not sure if uniq is faster, but it certainly is easier to write.Hessler
The original order can be restored by decorating/sorting+undecorating the lines with/by their original line numbers before/after the actual processing. See end of this answer for an example.Hessler
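
As an aside, the sed half of the answer above can be sanity-checked on the first two records of the example haystack; it prints each sequence with its ">" header glued on, which is what the later sort | uniq -w21 -D step compares:

sed -n 'x;n;G;s/\n//p' <<'EOF'
>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
EOF
# AAAAAAAAAAAAAAAAAAAAA>264638
# AAAGAAAAAAAAAGCTAAAAT>1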

Here's a solution using awk. Not sure if it will be faster than grep or ripgrep, but it could be, thanks to hash-based lookup. This assumes your RAM is big enough to load the first file (4.7 GB and 226 million lines).

$ awk 'NR==FNR{a[$1]; next} $0 in a{print p; print} {p=$0}' f1 f2
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

mawk is usually the fastest option, but I have come across examples where gawk is faster, especially for arrays like in this command. If you can install frawk, that can give you even faster results. The command needs to be slightly modified:

frawk 'NR==FNR{a[$1]; next} $0 in a{print p; print $0} {p=$0}' f1 f2
Hulsey answered 2/1, 2023 at 15:0 Comment(3)
fwiw, in my environment (GNU awk v 5.1.1) I'm seeing a[] array memory usage running ~8.5 times the size of the input file; for a 200MB file (900K lines; 21 chars per line like OP's file) the a[] array requires ~1,700 MB of RAM; for OP's input file of 4.7GB this means the a[] array will require ~40GB of RAMRahm
If you have that kind of resources available, this is probably the fastest AND simplest solution - reads each file only once, doesn't require sorting, etc - and I have worked places where this was just what you do, so the machines could handle it. Generally, though, I've had to find more mechanical solutions.Eleanoreleanora
typo re: my previous comment ... the 200MB file is 9 million lines @ 21 chars (+\n) per lineRahm

Any time I deal with files this big, I almost always end up sorting them. Sorts are slow, but take a lot less time than your while read loop that is scanning 2 billion lines 226 million times.

sort 4GB>4gb.srt

and

sed '/>/{N;s/\n/ /}' 26GB |sort -t' ' -k2 >25gb.srt

which will produce a file like this:

>264638 AAAAAAAAAAAAAAAAAAAAA
>1 AAAGAAAAAAAAAGCTAAAAT
>13 AATCATTTTCCGCTGGAGAGA
>1 ATCTCGACGCTCATCTCAGCA
>38 ATTCAATAAATAATAAATTAA
>2 GAGGACTATAAAATTGTCGCA
>1 GGCTTCAATAATTTGTATAAC
>1 GTTCGTCGGAGAGGAGAGAAC
>28 TCTTTTCAGGAGTAATAACAA

Now you only have to read through each file once.

$ cat tst
awk 'BEGIN{ getline key < "4gb.srt"; }
 $2  < key { next; }
 $2  > key { while ($2 > key){ getline key < "4gb.srt"; } }
 $2 == key {  $0=gensub(/ /,"\n",1); print }' 25gb.srt

$ ./tst
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
>1
GTTCGTCGGAGAGGAGAGAAC

The ordering is different from yours, but otherwise does that work?

(Try some tests with smaller files first...)

addendum

Please see Socowi's better implementation, but I was asked to explain the awk, so -

First, see above where I parsed the larger "haystraw" file to single lines sorted on the key field, which will be $2 in awk, and the smaller "needles" file in the same order. Making a few (not necessarily safe) assumptions, I ran through both once.

BEGIN{ getline key < "4gb.srt"; }

This just initializes the first "needle" into a variable called key by reading from the appropriate file.

Then as awk reads each line of the "haystraw" file, it automatically parses it into the fields - since we stacked them, the first field is the previous line of the original haystack, and the second field is the value to check, so we do our comparisons between key and $2.

 $2  < key { next; } # skip ahead to next key/needle

If the current straw is less than the needle, throw it away and grab the next one.

 $2  > key { while ($2 > key){ getline key < "4gb.srt"; } }

If the current straw is greater than the needle, then the needle wasn't in the file. The next one might not be either, so we grab needles in sequential order and compare them until they catch up.

There's actually a potential bug here - it's not confirming that something was read and could hang in an endless loop when the needles run out. This section should have been something like -

 $2  > key { while ($2 > key) { if (0 == getline key < "4gb.srt") key = "ZZZZZZZZZZZZZZZZZZZZZZ"; } }

Finally,

 $2 == key {  $0=gensub(/ /,"\n",1); print }' 25gb.srt

If they match, reinsert the newline between the previous record and the matching line, and print them both.

There really should have been an END{ close("4gb.srt") } as well.

Eleanoreleanora answered 2/1, 2023 at 17:14 Comment(4)
One more thing -- do NOT include the sorting in your code. Do that in advance, ONCE, so that if anything goes wrong you don't have to wait for it to do it again. Sorting that many records isn't going to be super fast, so fire it off and go work on your test scripts with small samples while it runs. You can delete the files when you no longer need them, but you don't want to have to go through that sort over and over if ANYTHING goes wrong.Eleanoreleanora
any chance you can explain the tst code?Undefined
@Undefined Looks like the awk script in ./tst is a (slow) re-implementation of the join command. I explained this approach here. Quote: Take the first needle and search the haystack from the beginning. When reaching a haystack entry that would have to be sorted behind the the current needle, take the next needle and continue the search from the current location.Hessler
Yep. Less familiar with join so I use the multitool pliers instead of the appropriately sized wrench. Gonna go sort my toolbox and do some shopping. :)Eleanoreleanora
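
For reference, a sketch of that join-based equivalent using the sorted files from this answer (it assumes the sequence is the second, blank-separated field of 25gb.srt, as produced by the sed|sort step above, and that sort and join run under the same locale):

# Join the sorted needle list against field 2 of the flattened haystack,
# then put header and sequence back on separate lines.
join -1 1 -2 2 -o 2.1,2.2 4gb.srt 25gb.srt | tr ' ' '\n'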

grep can search for many patterns (given in a separate file) simultaneously, so reading K33.21mercounts.bf.trimmedreads.dumps.fa will only be done once. Something like the following might work:

#!/bin/bash

grep -f 21mercounts.bf.trimmedreads.diff.kmers -B 1 K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33

However, it probably requires lots of RAM.

Spavined answered 2/1, 2023 at 15:52 Comment(0)
