What is the most efficient case-insensitive grep usage?

My objective is to match email addresses that belong to the Yahoo! family of domains. On *nix systems (I will be using Ubuntu), what are the benefits and drawbacks of each of these methods for matching the pattern?

And if there is another, more elegant solution that I haven't thought of, please share.

Here they are:

  • Use grep with option -i:

grep -Ei "@(yahoo|(y|rocket)mail|geocities)\.com" /path/to/file.txt

  • Translate characters to all upper case or lower case then grep:

tr '[:upper:]' '[:lower:]' < /path/to/file.txt | grep -E "@(yahoo|(y|rocket)mail|geocities)\.com"

  • Include a character set for each character in the pattern (the below would of course not match something like "@rOcketmail.com", but you get the idea of what it would become if I checked each character for case):

grep -E "@([yY]ahoo|([yY]|[rR]ocket)[mM]ail|[gG]eo[cC]ities)\.[cC][oO][mM]" /path/to/file.txt
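One practical difference between the first two methods can be seen on a small sample. This is a minimal sketch: the temporary file and the addresses in it are made up for illustration, and the pattern is the one from the question.

```shell
# Hypothetical sample data; these addresses are invented for the example.
sample=$(mktemp)
printf '%s\n' 'Alice@Yahoo.com' 'bob@rOcketmail.COM' \
              'carol@example.org' 'dave@ymail.com' > "$sample"

pattern='@(yahoo|(y|rocket)mail|geocities)\.com'

# Method 1: case-insensitive match; matching lines keep their original case.
m1=$(grep -Ei "$pattern" "$sample")

# Method 2: lowercase the stream first; matching lines come out lowercased.
m2=$(tr '[:upper:]' '[:lower:]' < "$sample" | grep -E "$pattern")

printf 'grep -i:\n%s\n\ntr | grep:\n%s\n' "$m1" "$m2"
rm -f "$sample"
```

Both methods select the same three lines here, but the tr pipeline prints them in lowercase, which matters if the addresses are needed exactly as they appear in the file.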

Inextinguishable answered 7/4, 2014 at 22:49 Comment(3)
This wouldn't be difficult to test. Have you tried it? - Hitchcock
Did you try benchmarking? I suspect that your first sample will be fastest. I expect that this problem is more likely to be throttled by file I/O than processing speed... since it's linear in the size of the input. Beware of micro-optimization. - Softcover
One thing you might want to keep in mind is that capturing groups can be expensive. If you don't need to return the grouped values, consider using non-capturing groups, (?:), instead (this is PCRE syntax, so it needs grep -P; grep -E does not support it). - Thracian

grep -i turned out to be significantly slower than translating to lowercase before grepping, so I ended up using a variation of #2.

Thanks @mike-w for reminding me that a simple test goes a long way.
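The timings from that test were not recorded, so here is only a minimal sketch of how such a comparison could be run. It assumes bash (whose `time` keyword times a whole pipeline) and a generated throwaway file; the data and file are hypothetical. The `LC_ALL=C` variant is an extra assumption that was not part of the test described above: forcing the byte-wise C locale is often reported to speed up grep in multibyte locales, but results depend on the grep version and the data.

```shell
# Build a throwaway input file (hypothetical data): 200,000 mixed-case lines.
f=$(mktemp)
awk 'BEGIN {
    for (i = 0; i < 50000; i++) {
        print "user" i "@Yahoo.com"
        print "user" i "@example.org"
        print "USER" i "@RocketMail.COM"
        print "user" i "@gmail.com"
    }
}' > "$f"

pattern='@(yahoo|(y|rocket)mail|geocities)\.com'

# Sanity check first: the variants should agree on the number of matches.
n1=$(grep -Eic "$pattern" "$f")
n2=$(tr '[:upper:]' '[:lower:]' < "$f" | grep -Ec "$pattern")
n3=$(LC_ALL=C grep -Eic "$pattern" "$f")
echo "matches: $n1 $n2 $n3"

# Time each variant, discarding the matched lines themselves.
time grep -Ei "$pattern" "$f" > /dev/null
time tr '[:upper:]' '[:lower:]' < "$f" | grep -E "$pattern" > /dev/null
time env LC_ALL=C grep -Ei "$pattern" "$f" > /dev/null

rm -f "$f"
```

Running each timed command several times and on a file of realistic size gives a fairer picture, since the first run also pays for reading the file from disk.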

Inextinguishable answered 21/4, 2014 at 21:36 Comment(3)
And thank you for sharing the results of your tests with us all! - Priming
Would you define 'significant'? If one way took 10 seconds and the other took 30, knowing that would let us make our own judgment call, based on server load, directory traversal, time to create the regex, etc., about which method to try. - Japha
I'm not going to revisit the test at this point, but you make a valid point, and it would have been nice to quantify the difference. - Inextinguishable
