Well okay, if we've switched from comments to answers... ;-)
Here's an awk one-liner that does the same as DavidO's Perl one-liner. Awk is smaller and possibly leaner than Perl, but there are several different implementations of awk, and I have no idea whether yours will perform better than the others, or than Perl. You'll need to benchmark.
awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2
What does (should) this do?
The first part of the awk script matches only lines in file1 (NR==FNR holds only while the overall record number still equals the record number within the current file, i.e. while reading the first file), and populates the array with each pattern. The second part (which runs on the subsequent file) steps through each item in the array and sees whether it can be used as a regexp to match the current input line.
The second block of code consists of just the bare pattern n, which was set to either 0 or 1 in the previous block. In awk, 1 evaluates as true, and a pattern with no curly-bracket action block is equivalent to {print}, so if the previous block found a match, this one prints the current line.
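For instance, with two small illustrative input files (the file names and contents here are made up for the demo):

```shell
# Two patterns: a plain string and a regexp (sample data, not from the question).
printf 'foo\nba.\n' > file1
# The haystack to search.
printf 'foobar\nbaz\nquux\n' > file2

# Print every line of file2 that matches at least one pattern from file1.
awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2
# → foobar
#   baz
```

Here "foobar" matches foo (and the regexp ba. matches "bar"), "baz" matches ba., and "quux" matches nothing, so it is dropped.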
If file1 contains fixed strings instead of regexps, you can make this run faster by replacing the regexp comparison $0~i with if(index($0,i)), which does a plain substring search instead of invoking the regexp engine.
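Spelled out, that fixed-string variant would look like this (a sketch using the same made-up sample files; note that a pattern like ba. is now taken literally):

```shell
printf 'foo\nba.\n' > file1
printf 'foobar\nbaz\nquux\n' > file2

# index() does a plain substring search, so "ba." is a literal
# three-character string here and no longer matches "baz" or "bar".
awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if(index($0,i)){n=1}}} n' file1 file2
# → foobar
```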
Use with caution. Your mileage may vary. Created in a facility that may contain nuts.
How does grep fail? And are the patterns in file1 actually regular expressions, or are they just strings to match? – Chanterelle

grep is actually generally pretty efficient... – Phoney

Ack is NOT required. It's just that I'm looking for a faster way than grep, since dealing with millions of records is really a pain using grep. File1 can be either regexes or strings. I just want it to be fast. Do you happen to know better tools? Thanks! – Ligature

fgrep will be faster than regular grep since it won't invoke the regex engine. – Phoney

How long is grep taking you? I wasn't saying I know a secret to making it go faster; I was saying that usually if grep is slow when you're only parsing a single haystack file, you're likely not going to find a faster option. – Phoney

fgrep is identical to grep -F on most platforms. That's why I asked that, above. If you can restrict your file1 to strings rather than regexps, grep may be the most efficient tool you can find, without writing one from scratch yourself. – Chanterelle

You could try an awk script that loads file1 into the index of an array, then matches against array lookups. No idea whether that would be faster than grep, but it's something you could benchmark for comparison to grep using a subset of your data. – Chanterelle

Ack is advocated to be faster than grep. Check out betterthangrep.com – Ligature
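To make the last few suggestions concrete (sample files are made up; note the awk array-lookup variant only matches exact whole lines, unlike grep's substring matching):

```shell
printf 'foo\nbaz\n' > file1
printf 'foobar\nbaz\nquux\n' > file2

# Fixed-string grep: -F treats each pattern as a literal string,
# -f reads the pattern list from file1.
grep -F -f file1 file2
# → foobar
#   baz

# Array-lookup awk: each line of file1 becomes an array key, and a line
# of file2 prints only on an exact whole-line hash hit.
awk 'NR==FNR{a[$0];next} $0 in a' file1 file2
# → baz
```

The hash lookup avoids looping over every pattern per input line, which is why it may be worth benchmarking, but it trades away grep's substring semantics.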