After seeing some of these great answers, @arantius's comment (regarding timing $x
vs x^
vs (?!x)x
) on the currently accepted answer made me want to time some of the solutions given so far.
Using @arantius's 275k line standard, I ran the following tests in Python (v3.5.2, IPython 6.2.1).
TL;DR: 'x^'
and 'x\by'
are the fastest by a factor of at least ~16, and contrary to @arantius's finding, (?!x)x
was among the slowest (~37 times slower). So the speed question is certainly implementation dependent. Test it yourself on your intended system before committing if speed is important to you.
UPDATE: There is apparently a large discrepancy between timing 'x^'
and 'a^'
. Please see this question for more info, and the previous edit for the slower timings with a
instead of x
.
In [1]: import re
In [2]: with open('/tmp/longfile.txt') as f:
...: longfile = f.read()
...:
In [3]: len(re.findall('\n',longfile))
Out[3]: 275000
In [4]: len(longfile)
Out[4]: 24733175
In [5]: for regex in ('x^','.^','$x','$.','$x^','$.^','$^','(?!x)x','(?!)','(?=x)y','(?=x)(?!x)',r'x\by',r'x\bx',r'^\b$'
...: ,r'\B\b',r'\ZNEVERMATCH\A',r'\Z\A'):
...: print('-'*72)
...: print(regex)
...: %timeit re.search(regex,longfile)
...:
------------------------------------------------------------------------
x^
6.98 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
.^
155 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x
111 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.
111 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x^
112 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.^
113 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$^
111 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?!x)x
257 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?!)
203 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?=x)y
204 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?=x)(?!x)
210 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
x\by
7.41 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
x\bx
7.42 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
^\b$
108 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\B\b
387 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
\ZNEVERMATCH\A
112 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\Z\A
112 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The first time I ran this, I forgot to r
aw the last 3 expressions, so '\b'
was interpreted as '\x08'
, the backspace character. However, to my surprise, 'a\x08c'
was faster than the previous fastest result! To be fair, it will still match that text, but I thought it was still worth noting because I'm not sure why it's faster.
In [6]: for regex in ('x\by','x\bx','^\b$','\B\b'):
...: print('-'*72)
...: print(regex, repr(regex))
...: %timeit re.search(regex,longfile)
...: print(re.search(regex,longfile))
...:
------------------------------------------------------------------------
y 'x\x08y'
5.32 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
x 'x\x08x'
5.34 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
$ '^\x08$'
122 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
None
------------------------------------------------------------------------
\ '\\B\x08'
300 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
None
My test file was created using a formula for " ...Readable Contents And No Duplicate Lines" (on Ubuntu 16.04):
$ ruby -e 'a=STDIN.readlines;275000.times do;b=[];rand(20).times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > /tmp/longfile.txt
$ head -n5 /tmp/longfile.txt
unavailable speedometer's garbling Zambia subcontracted fullbacks Belmont mantra's
pizzicatos carotids bitch Hernandez renovate leopard Knuth coarsen
Ramada flu occupies drippings peaces siroccos Bartók upside twiggier configurable perpetuates tapering pint paralyzed
vibraphone stoppered weirdest dispute clergy's getup perusal fork
nighties resurgence chafe
complexity
tag? I cannot see how it applies here. – Padauks/(?(?{ defined $ENV{FOO} })foo|(*F))/bar/g
, "substitute bar for foo if $FOO, otherwise do nothing" – Corot{min,max}
and{min,}
regex quantifiers. – Remake