At the request of Karl Knechtel, who started a bounty with the suggestion of speed testing an approach using str.partition
and str.rpartition
against solutions offered by existing answers, I'm conducting a benchmark test as follows.
Note that @papirrin's answer, which does not return a string, and @positivetypical's answer, a duplicate to @jon's, are not included in the test:
def AshwiniChaudhary_split_rsplit(s):
s = s.split('\n', 1)[-1]
if s.find('\n') == -1:
return ''
return s.rsplit('\n', 1)[0]
def BenjaminSpiegl_find_rfind(s):
return s[s.find('\n')+1:s.rfind('\n')]
def jon_split_slice_join(s):
return '\n'.join(s.split('\n')[1:-1])
def Knechtel_partition_rpartition(s):
return s.partition('\n')[2].rpartition('\n')[0]
funcs = [
AshwiniChaudhary_split_rsplit,
BenjaminSpiegl_find_rfind,
jon_split_slice_join,
Knechtel_partition_rpartition
]
s = '\n'.join(['x' * 80] * (10_000_000 // 80))
# Correctness
for n in range(1, 15):
expect = None
for f in funcs:
result = f(s)
if expect is None:
expect = result
else:
assert result == expect, (n, f.__name__)
# Speed
from time import perf_counter_ns
from statistics import mean, stdev
ts = {f: [] for f in funcs}
for _ in range(10):
for f in funcs:
t0 = perf_counter_ns()
f(s)
ts[f].append(perf_counter_ns() - t0)
for f in funcs:
print(f'{f.__name__} {mean(ts[f]) / 1000:.0f}µs ± {stdev(ts[f]) / 1000:.0f}µs')
This outputs, on ATO, the following result:
AshwiniChaudhary_split_rsplit 4304µs ± 764µs
BenjaminSpiegl_find_rfind 1862µs ± 178µs
jon_split_slice_join 31340µs ± 1827µs
Knechtel_partition_rpartition 4270µs ± 166µs
The result shows that:
- The performance of
str.partition
and str.rpartition
is about equivalent to that of str.split
with maxsplit=1
, but the approach using str.partition
and str.rpartition
has a slight edge because they guarantee the number of items in the returning sequence and does not require an if
statement testing for the edge case of a single line input needed by the approach using str.split
- Slicing the string at indices indentified by
str.find
and str.rfind
is more than twice as fast as the two approaches above because it copies the large string only once during the slice, while the other two approaches copy the bulk of the string twice, on top of having to create additional sequence and string objects
- Splitting the string into a large list of small strings is extremely costly due to the number of objects that need to be created
It's worth noting that for small inputs, the approach using str.partition
and str.rpartition
is actually the fastest due to its smaller overhead.
Here's the result of the same benchmark with a 1000-character input instead:
AshwiniChaudhary_split_rsplit 1145ns ± 431ns
BenjaminSpiegl_find_rfind 788ns ± 199ns
jon_split_slice_join 2079ns ± 2504ns
Knechtel_partition_rpartition 697ns ± 123ns
my_string = file_obj.read()
to retrieve the string? Also, do you need all lines present in memory at a time, or is just one line at a time okay? – Leopold