Match patterns listed in one file against another file: ack, awk, or a better way than grep?

Is there a way to match the patterns listed in one file against another file using ack, the way grep's -f option does? I see ack has a -f option too, but it does something different from grep's -f.
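For reference, the grep invocation I mean, which is what I'm hoping to beat, looks like this:

grep -f file1 file2       # patterns read from file1, one per line
grep -F -f file1 file2    # same, but treating them as fixed strings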

Perhaps an example will give you a better idea. Suppose I have file1:

file1:
a
c
e

And file2:

file2:
a  1
b  2
c  3
d  4
e  5

And I want to pull out every line of file2 that matches a pattern in file1, giving:

a  1
c  3
e  5

Can ack do this? If not, is there a better way to handle the job (such as awk, or using a hash)? I have millions of records in both files and really need an efficient way to get through them. Thanks!

Ligature asked 30/3, 2012 at 4:18 Comment(15)
Is using Ack for this an absolute requirement, or are other tools available as well? How does grep fail? And are the patterns in file1 actually regular expressions, or are they just strings to match?Chanterelle
grep is actually generally pretty efficient...Phoney
Hi Chanterelle, thanks for asking. Actually Ack is NOT required; I'm just looking for a faster way than grep, since dealing with millions of records is really a pain with grep. File1 can contain either regexes or plain strings. I just want it to be fast. Do you happen to know better tools? Thanks!Ligature
Dealing with millions of records is going to be a pain with most tools. You can only stream and parse data so fast. Do note, however, that if the things you're looking for are fixed strings (rather than actual regexes), fgrep will be faster than regular grep since it won't invoke the regex engine.Phoney
@Phoney Can you give me more of a hint? How do I stream and parse fast?Ligature
How long is your grep taking you? I wasn't saying I know a secret to making it go faster, I was saying that usually if grep is slow when you're only parsing a single haystack file, you're likely not going to find a faster option.Phoney
But fgrep is identical to grep -F on most platforms. That's why I asked that, above. If you can restrict your file1 to strings rather than regexps, grep may be the most efficient tool you can find, without writing one from scratch yourself.Chanterelle
Heck, if you have enough RAM, you could write a small awk script that would load file1 into the index of an array, then match against array lookups. No idea if that would be faster than grep, but it's something you could benchmark for comparison to grep using a subset of your data.Chanterelle
Where do you suppose grep -F is wasting cycles? (Hint: blocking for input from your data source.) You're not CPU bound.Exarch
I'm trying %hash now and hopefully it'll be more "eco".Ligature
So this question is about awk, right, not some new language called "ack"?Ay
No. Ack is advertised as being faster than grep. Check out betterthangrep.comLigature
I believe it's considered faster at searching source trees because it ignores VCS directories. You should benchmark its performance in your case, because I suspect anything written in C (like grep) will tend to be faster than the same thing written in an interpreted language like Perl.Chanterelle
ack's speedup is not only from ignoring VCS directories, but also from ignoring files that are not source code. The C/Perl speed difference is minimal because Perl's regexes are highly optimized, and because you're mostly I/O bound anyway.Dilatation
It's a few years later, and "the next big thing" in searching seems to be The Silver Searcher. Check out geoff.greer.fm/ag and betterthanack.com for details.Chanterelle

Here's a Perl one-liner that uses a hash to hold the set of wanted keys from file1, giving O(1) (amortized) lookups per line of file2. So it runs in O(m+n) time, where m is the number of lines in your key set and n is the number of lines in the file you're testing.

perl -ne'BEGIN{open K,shift@ARGV;chomp(@a=<K>);@hash{@a}=()}m/^(\p{alpha}+)\s/&&exists$hash{$1}&&print' tkeys file2

The key set will be held in memory while file2 is tested line by line against the keys.

Here's the same thing using Perl's -a command line option:

perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' tkeys file2

The second version is probably a little easier on the eyes. ;)
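If the golfed form still reads like line noise, here's a longhand sketch of the same approach, written for clarity rather than brevity (an untested expansion, not verbatim what the one-liners do internally; the script name is arbitrary):

#!/usr/bin/perl
# Longhand sketch of the hash-lookup approach used above.
# Usage: perl lookup.pl keyfile datafile
use strict;
use warnings;

my ($key_file, $data_file) = @ARGV;

open my $kfh, '<', $key_file or die "Can't open $key_file: $!";
chomp(my @keys = <$kfh>);
close $kfh;

my %wanted;
@wanted{@keys} = ();                 # hash slice: existence is all we need

open my $dfh, '<', $data_file or die "Can't open $data_file: $!";
while (my $line = <$dfh>) {
    my ($first) = split ' ', $line;  # first whitespace-separated field
    print $line if defined $first && exists $wanted{$first};
}
close $dfh;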

One thing you have to remember here is that you're more likely IO bound than processor bound, so the goal should be to minimize IO. Holding the entire lookup key set in a hash gives O(1) amortized lookups. The advantage this solution may have over others is that some (slower) solutions have to run through your key file (file1) once for each line of file2; that sort of solution is O(m*n), where m is the size of your key file and n is the size of file2. The hash approach, by contrast, takes O(m+n) time, which is an algorithmic difference rather than a constant factor. It wins by eliminating linear searches through the key set, and wins again by reading the keys from disk only once.
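And for what it's worth, a run against the question's sample data should look like this (with file1 standing in for my tkeys file):

perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' file1 file2
a  1
c  3
e  5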

Exarch answered 30/3, 2012 at 4:49 Comment(2)
Compared the Perl hash approach with awk. Must say the hash is just super fast, much faster than awk and even grep.Ligature
Only the second version worked for me, but it worked, and it was much faster than any other solution I tried.Intubate

Well okay, if we've switched from comments to answers... ;-)

Here's an awk one-liner that does the same as Exarch's Perl one-liner, but in awk. Awk is smaller and possibly leaner than Perl, but there are several different implementations of awk, and I have no idea whether yours will perform better than the others, or better than Perl. You'll need to benchmark.

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2

What does (should) this do?

The first part of the awk script runs only for lines of file1 (while NR, the overall record number, still equals FNR, the record number within the current file) and populates the array. The second part (which runs on the subsequent file) steps through each item in the array and checks whether it matches the current input line when used as a regexp.

The script ends with a bare n, which was set to either 0 or 1 in the previous block. In awk, a nonzero value evaluates as true, and a pattern with no action block is equivalent to {print}, so if the previous block found a match, this prints the current line.
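Spelled out with comments, the one-liner is equivalent to this (the same code, just reformatted):

awk '
    NR==FNR {              # true only while reading the first file
        a[$0] = 1          # remember each pattern line
        next               # skip the blocks below for file1 lines
    }
    {                      # runs for every line of the second file
        n = 0
        for (i in a)
            if ($0 ~ i)    # use the stored line as a regexp
                n = 1
    }
    n                      # nonzero n: implicit {print}
' file1 file2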

If file1 contains strings instead of regexps, then you can change this to make it run faster by replacing the first comparison with if(index($0,i))....
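That fixed-string variant would look like this (an untested sketch; index() returns 0 when the substring is absent, so it doubles as the truth test):

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if(index($0,i)){n=1}}} n' file1 file2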

Use with caution. Your mileage may vary. Created in a facility that may contain nuts.

Chanterelle answered 30/3, 2012 at 5:1 Comment(4)
Thanks. But this prints the first line of file2 three times with the example above: a 1 a 1 a 1Ligature
My bad, it was comparing against the value of the array instead of the index. I've corrected the code and refactored it a little and even tested it. And it's shorter now!Chanterelle
You can change the second part of your awk script from {n=0;for(i in a){if($0~i){n=1}}} n to {for (i in a) if ($0 ~ i) {print; break}} -- use the break to stop the for loop once you've found a match, and explicitly use print for readability.Selenaselenate
@glennjackman - good call, that might be a noticeable optimization if the array is really large. It eliminates the n variable too, which I like.Chanterelle
nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4

(Here FNR==NR is true only while the first file is being read, so a[$0] records each key; for the second file, the bare pattern ($1 in a) is true whenever the first field is one of those keys, and matching lines print by default.)

Tested:

pearl.384> cat file3
a
c
e
pearl.385> cat file4
a  1 
b  2 
c  3 
d  4 
e  5
pearl.386> nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4
a  1 
c  3 
e  5
pearl.387>
Virgate answered 30/3, 2012 at 5:41 Comment(3)
If this works at all, it only matches if file1 and file2 have identical lines, not if file1 contains substrings or regexps to be matched, as in the OP's sample data. Did you test this?Chanterelle
Yeah, I didn't read the question properly. Fixed it.Virgate
I see your fix. While it works with the example data, I'm not sure how well it would handle general cases. Anyway, downvote removed.Chanterelle

TXR may be another option for handling your requirements. I'm too new to it to write what you need, but the author is a frequent contributor to StackOverflow. While I'm certain you can do what you need with TXR, I'm not certain it would perform better. You'd need to test.

Worth a look, if you're interested in an entire language devoted to pattern matching. :)

Chanterelle answered 30/3, 2012 at 6:13 Comment(6)
Why don't you add this to your earlier answer? It's strange for the same person to give two answers to a single question! Also, this is not an answer; it should have been a comment in the first place.Virgate
It's not part of the earlier answer. That was awk. This is a reference to TXR. Completely different. To be sure, without actual code it doesn't deserve an upvote, but what misinformation do you think it provides?Chanterelle
It does not provide any misinformation, but there's no harm in adding it to your earlier answer; just not as a second answer! You could always add it as a second option within the first answer.Virgate
Thanks for the suggestion. It's a potential solution anyhow. Gave an upvote to ease the quarrel.Ligature
@peter, no harm in answering twice if the answers are differentSelenaselenate
@peter - if the OP decides to use TXR to solve his problem, do you think it would be appropriate for him to pick as the "best answer" one that has most of its focus on AWK? No. Different solutions should be posted as different answers.Interchange

You can convert the file into a regex for ack with tr. I used sed to remove the trailing pipe character.

ack "`tr '\n' '|' < patts | sed 's/.$//'`"

Note that this needs a couple of extra processes, so the awk solution is probably more efficient, but this one is quite easy to remember.
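To see what the command substitution actually builds, using the question's file1 as the pattern file (my patts above):

$ tr '\n' '|' < file1 | sed 's/.$//'
a|c|e
$ ack 'a|c|e' file2
a  1
c  3
e  5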

Irreverent answered 20/6, 2013 at 11:1 Comment(0)
