There are quite a few tools (e.g. ripgrep
) and options (-f
, -F
, and -x
) to speed up your basic approach. But all of them are basically the same slow approach as you are using now, "only" sped up by a huge but still constant factor. For your problem and input sizes, I'd recommend to change the approach altogether. There are many different ways to tackle your problem. First, lets define some variables to estimate the speedup of those approaches:
Problem
A 26 GB haystack file with h = 1 million entries (description, sequence) = 2 billion lines, e.g.
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
...
A 4.7 GB needles file with n = 226 million lines, each of length m = 21, e.g.
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
...
For all needles, we want to extract the corresponding entries in the haystack (if they exist).
Solutions
We assume n < h and a constant m. Therefore O(n+h) = O(h), O(m)=O(1) and so on.
Our goal is to minimize the number of times we have to iterate the the biggest file (= the haystack).
Naive – O(h·n) time
Currently, you are using the naive approach. For each needle, the entire haystack is searched once.
Put needles into data structure; search haystack once – O(h) time
Store all needles in a data structure which has a fast contains()
operation.
Then iterate the haystack and call needles.contains(haystackEntry)
for each entry, to decide whether it is something you are searching for.
Currently, your "data structure" is a list, which takes O(1) time to "build" (because it is already in that form) , but O(n) time to query once!
Below data structures take O(n) time to populate and O(1) time to query once, resulting in O(n + h·1) = O(h) time in total.
- Tries (= prefix trees) can be expressed as a regexes, so you can stick with
grep
. E.g. the needles ABC
, ABX
, and XBC
can be stored in the Trie regex ^(AB(C|X)|XBC)
. But converting the list of needles to such a Trie regex is a bit complicated in bash.
- Hash maps are available in
awk
, see sundeep's answer. But putting 4.7 GB of raw data in such a structure in memory is probably not very efficient if even possible (depends on your available memory. The hash map needs to be many times bigger than the raw data).
Either way, data structures and bash don't mix very well. And even if we switched to a better language, we would have to re-build or store/load the structure each time the program runs. Therefore it is easier and nearly as efficient to ...
Sort everything; search haystack once – O( h·log(h) + h ) time
First sort the haystack and the needles, then iterate the haystack only once.
Take the first needle and search the haystack from the beginning. When reaching a haystack entry that would have to be sorted behind the the current needle, take the next needle and continue the search from your current location.
This can be done easily in bash. Here we use GNU coreutils to make processing a bit easier, faster, and safer:
export LC_ALL=C # speeds up sorting
mem=66% # Max. memory to be used while sorting. More is better.
sep=$'\f' # A character not appearing in your data.
paste -d"$sep" - - < haystack > haystack2
sort -S"$mem" -o needles2 needles
sort -S"$mem" -t"$sep" -k2,2 -o haystack2 haystack2
# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -22 -o2.1,2.2 --nocheck-order needles2 haystack2 |
tr "$sep" \\n
This changes the order of the output. If you need the output in the original order, use a Schwartzian transform (= decorate-sort-undecorate): Before sorting the needles/haystack, store their line numbers. Drag those along through the entire process. At the end, sort the found entries by those line numbers. Finally, remove the line numbers and print the result.
export LC_ALL=C # speeds up sorting
mem=66% # Max. memory to be used while sorting. More is better.
sep=$'\f' # A character not appearing in your data.
nl -ba -d '' -s"$sep" needles > needles2
paste -d"$sep" - - < haystack | nl -ba -d '' -s"$sep" > haystack2
sort -t"$sep" -k2,2 -S"$mem" -o needles2 needles2
sort -t"$sep" -k3,3 -S"$mem" -o haystack2 haystack2
# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -12 -23 -o1.1,2.1,2.2,2.3 --nocheck-order needles2 haystack2 > result
sort -t"$sep" -k1,2n -S"$mem" -o result result
cut -d"$sep" -f3- result | tr "$sep" \\n
regex
, say, just for first 3-4 letters that might match your "needles", then pipe those down to anotherawk
toregex
scan for just letters 5-8, for instance (or do it in the sameawk
instance). After that, you've narrowed the search window significantly - and NOW you can do yoursort + grep
routine….. don't pre-sort all the hay in the haystack for no reason – Psychoneurosis