Fastest way to remove first and last lines from a Python string

I have a Python script that, for various reasons, has a variable holding a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Because of the string's size, the faster the operation the better; the emphasis is on speed. The program should return a slightly smaller string, minus the first and last lines.

"\n".join(s.split("\n")[1:-1]) is the easiest way to do this, but it's extremely slow because the split() function copies the object in memory, and the join() copies it again.

Example string:

*** START OF DATA ***
data
data
data
*** END OF DATA ***

Extra credit: have this not choke if there is no data in between; this is optional, since in my case there shouldn't be a string with no data in between.
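
For what it's worth, the slow baseline already returns an empty string for the no-data case (a quick REPL check of the expression above):

>>> s = '*** START OF DATA ***\n*** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''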

Barquentine asked 25/1, 2015 at 7:45 Comment(1)
Do you have control over how the string enters your program, e.g. are you doing my_string = file_obj.read() to retrieve the string? Also, do you need all lines present in memory at a time, or is just one line at a time okay? (Leopold)
17

First split at '\n' once and keep everything after the first line; then check whether the remainder contains another '\n'. If it doesn't, there are no middle lines, so return an empty string; otherwise str.rsplit at '\n' once and pick the item at index 0:

def solve(s):
    s = s.split('\n', 1)[-1]        # drop the first line
    if s.find('\n') == -1:          # nothing left: input had no middle lines
        return ''
    return s.rsplit('\n', 1)[0]     # drop the last line
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from either end and slice the string:

def solve_fast(s):
    ind1 = s.find('\n')      # end of the first line
    ind2 = s.rfind('\n')     # start of the last line
    return s[ind1+1:ind2]    # a single slice: one copy
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
Must answered 25/1, 2015 at 7:57 Comment(0)
16

Consider a string s that is something like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

It is thus just about the shortest code for removing the first and last line of a string. The .find and .rfind methods do nothing but search for the given substring, so no intermediate copies are made. Try out the speed!
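
One caveat worth noting (an added observation, not a claim from the answer itself): if the string contains no newline at all, both searches return -1 and the slice silently drops the last character:

>>> s = 'single line'
>>> s[s.find('\n')+1:s.rfind('\n')]
'single lin'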

Landsturm answered 7/1, 2016 at 10:48 Comment(0)
2

At the request of Karl Knechtel, who opened a bounty suggesting that an approach using str.partition and str.rpartition be speed-tested against the solutions offered by existing answers, I ran the benchmark below.

Note that @papirrin's answer, which does not return a string, and @positivetypical's answer, which duplicates @jon's, are not included in the test:

def AshwiniChaudhary_split_rsplit(s):
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]

def BenjaminSpiegl_find_rfind(s):
    return s[s.find('\n')+1:s.rfind('\n')]

def jon_split_slice_join(s):
    return '\n'.join(s.split('\n')[1:-1])

def Knechtel_partition_rpartition(s):
    return s.partition('\n')[2].rpartition('\n')[0]

funcs = [
    AshwiniChaudhary_split_rsplit,
    BenjaminSpiegl_find_rfind,
    jon_split_slice_join,
    Knechtel_partition_rpartition
]

s = '\n'.join(['x' * 80] * (10_000_000 // 80))

# Correctness
expect = None
for f in funcs:
    result = f(s)
    if expect is None:
        expect = result
    else:
        assert result == expect, f.__name__

# Speed
from time import perf_counter_ns
from statistics import mean, stdev

ts = {f: [] for f in funcs}
for _ in range(10):
    for f in funcs:
        t0 = perf_counter_ns()
        f(s)
        ts[f].append(perf_counter_ns() - t0)
for f in funcs:
    print(f'{f.__name__} {mean(ts[f]) / 1000:.0f}µs ± {stdev(ts[f]) / 1000:.0f}µs')

This outputs, on ATO (Attempt This Online), the following result:

AshwiniChaudhary_split_rsplit 4304µs ± 764µs
BenjaminSpiegl_find_rfind 1862µs ± 178µs
jon_split_slice_join 31340µs ± 1827µs
Knechtel_partition_rpartition 4270µs ± 166µs

The results show that:

  • The performance of str.partition and str.rpartition is about equivalent to that of str.split with maxsplit=1, but the partition-based approach has a slight edge: both methods guarantee the number of items in the returned tuple, so it needs no if statement for the edge case of a single-line input, unlike the str.split approach
  • Slicing the string at indices identified by str.find and str.rfind is more than twice as fast as the two approaches above, because it copies the large string only once, during the slice, while the other two copy the bulk of the string twice, on top of creating additional sequence and string objects
  • Splitting the string into a large list of small strings is extremely costly due to the sheer number of objects that have to be created

It's worth noting that for small inputs, the approach using str.partition and str.rpartition is actually the fastest due to its smaller overhead.
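
To illustrate that guarantee (a quick REPL sketch, separate from the benchmark): on a single-line input, str.partition still returns a 3-tuple, so the chained expression degrades gracefully to an empty string with no conditional needed:

>>> 'single line'.partition('\n')
('single line', '', '')
>>> 'single line'.partition('\n')[2].rpartition('\n')[0]
''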

Here's the result of the same benchmark with a 1000-character input instead:

AshwiniChaudhary_split_rsplit 1145ns ± 431ns
BenjaminSpiegl_find_rfind 788ns ± 199ns
jon_split_slice_join 2079ns ± 2504ns
Knechtel_partition_rpartition 697ns ± 123ns
Ungovernable answered 21/5 at 2:49 Comment(7)
Interesting. I know that partition can slightly outperform a single split in some other circumstances, so it seemed worth looking at. (Maybe it just avoids some O(1) overhead that becomes irrelevant when slicing a long string.) I actually hadn't noticed there was a pure find-based solution offered. (Britannia)
Yeah, very true that str.partition has a much lower O(1) overhead. I've updated my answer with such a note. (Ungovernable)
I'm actually impressed. I didn't expect it to outcompete the lower-level approach that minimizes string creation. But partition/rpartition do benefit from being able to scan and slice with the same C loop, I guess. Anyway, I'll award the bounty when the system allows me to. Feel free to mark comments NLN. (Britannia)
The answer seems to produce similar results to mine. Not sure why you don't use stdlib timeit; it does the loop and the statistics for you. (Orts)
@Orts timeit doesn't offer a good way to programmatically produce its user-friendly CLI output (including units and stdev) via its API, while I'd like the code I post here to be directly executable with a Python interpreter. (Ungovernable)
@Ungovernable You can actually just call timeit.main() from a script, using inspect.getsource(f) as the "--setup" arg; it works fine. But I see no problem with using the IPython magics (those can also be written directly in a script, provided you execute the script with an IPython interpreter). (Orts)
@Orts Yes, I know about timeit.main(), but I find the usage ugly. And not everyone has IPython installed (I don't). I'll just wait for the CLI functionality to be made available from the API with github.com/python/cpython/issues/119164. (Ungovernable)
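
For readers who do want a plain-stdlib timing loop, here is a minimal sketch of my own (assuming solve_fast and s are defined as in the answers above; timeit.repeat accepts a callable directly):

import timeit

# Best of 5 repeats of 100 calls each; report the per-call time in µs.
per_call = min(timeit.repeat(lambda: solve_fast(s), repeat=5, number=100)) / 100
print(f'{per_call * 1e6:.0f}µs per call')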
1

Another method is to split the data at newlines and then rejoin everything but the first and last line:

>>> s = '*** START OF DATA *** \n\
... data\n\
... data\n\
... data\n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
'data\ndata\ndata'

This works fine with no data:

>>> s = '*** START OF DATA *** \n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''
Amicable answered 9/6, 2016 at 21:36 Comment(1)
As noted by the OP, this will be very slow on large data. (Logic)
1

This answer addresses the bounty from Karl Knechtel. I reproduce the bounty text here because it will disappear once the bounty period ends:

I suspect, but do not know, that an approach using .partition and .rpartition methods would be faster than anything offered so far (and avoid conditional logic). I'm looking for an answer that implements and explains this, and/or an answer showing timing results for the various approaches suggested.

Setup:

import random
import string

size = 10 * 1024 * 1024  # 10 mb
population = string.ascii_lowercase + "\n"
middle = "".join(random.choices(population, k=size))

# sample data
s = f"""\
*** START OF DATA ***
{middle}
*** END OF DATA ***"""


def naive_split_join(s):
    """https://mcmap.net/q/762076/-fastest-way-to-remove-first-and-last-lines-from-a-python-string"""
    return "\n".join(s.split("\n")[1:-1])


def solve(s):
    """https://stackoverflow.com/a/28134394"""
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]


def solve_fast(s):
    """https://stackoverflow.com/a/28134394"""
    ind1 = s.find("\n")
    ind2 = s.rfind("\n")
    return s[ind1 + 1 : ind2]


def partition(s):
    head, sep, s = s.partition("\n")
    s, sep, tail = s.rpartition("\n")
    return s


# correctness test
for f in naive_split_join, solve, solve_fast, partition:
    assert f(s) == middle, f"{f} is incorrect"

Timing:

I'm using Python 3.12.3 from python.org on an M1 2020 MacBook Air, profiling with the timeit magic in an IPython 8.24.0 REPL:

>>> timeit naive_split_join(s)
26.1 ms ± 85.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> timeit solve(s)
931 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> timeit solve_fast(s)
578 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> timeit partition(s)
925 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The approach from the top-voted answer is faster than this partition-based approach by a significant margin.

Orts answered 21/5 at 2:9 Comment(0)
0

Depending on how your use case consumes the string, the fastest way to remove the lines may be not to remove them at all.

If you plan to access the lines sequentially, you can build a generator that skips the first and last line, yielding each line as it is consumed, instead of building a whole new copy of all the lines up front.

An ad-hoc way to avoid the first and last line without making unnecessary copies is to track the positions of the most recent line breaks and only yield a line once the break that follows it has been seen; the iteration then concludes before ever emitting the last line, without needing to know the position of the last line break in advance.

The following function should give you the desired output:

def split_generator(s):
    # Positions of the three most recent newlines seen so far
    # (-1 means "not seen yet").
    nl3 = nl2 = nl1 = -1
    nr_lines = 0

    for idx, c in enumerate(s):
        if c == '\n':
            nr_lines += 1
            nl3, nl2, nl1 = nl2, nl1, idx
            # Once a third newline is seen, the line between the two
            # previous ones can no longer be the last line: yield it.
            if nr_lines >= 3:
                yield s[nl3 + 1:nl2]

    # Handle the case where the input does not end with '\n': the text
    # after the last newline is the final line, so the line before it
    # still has to be emitted.
    if s and s[-1] != '\n' and nr_lines >= 2:
        yield s[nl2 + 1:nl1]

You can test it with:

print("1st example")
for filtered_strs in split_generator('first\nsecond\nthird'):
  print(filtered_strs)

print("2nd example")
for filtered_strs in split_generator('first\nsecond\nthird\n'):
  print(filtered_strs)

print("3rd example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
  print(filtered_strs)

print("4th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
  print(filtered_strs)

print("5th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
  print(filtered_strs)

This generates the output:

1st example
second
2nd example
second
3rd example
second
third
4th example
second
third
5th example
second
third
fourth

Note that the biggest advantage of this approach is that it creates only one new line at a time and takes virtually no time to produce the first line of output (rather than waiting for all the lines to be found before proceeding), but again, whether that is useful depends on your use case.
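
Along the same lines, a shorter lazy variant is possible (a sketch of my own, not part of the answer above; the name middle_lines is made up). It leans on io.StringIO, whose iterator yields one line at a time, plus a one-line lookahead buffer; note that StringIO keeps its own copy of the string, so this trades one upfront copy for simpler code:

import io

def middle_lines(s):
    it = io.StringIO(s)    # file-like object: iterating yields one line at a time
    next(it, None)         # skip the first line
    prev = next(it, None)  # lookahead buffer: one line ahead of what we yield
    for line in it:
        yield prev.rstrip('\n')  # prev is now guaranteed not to be the last line
        prev = line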

Archaeology answered 26/1, 2015 at 9:9 Comment(0)
0

For the sake of completeness, a regex solution:

import re

# [^\n]* matches the first and last lines; the original \w+ would not
# match header lines containing spaces or asterisks, such as
# '*** START OF DATA ***'. re.DOTALL lets the captured middle span newlines.
regex = re.compile(r'[^\n]*\n(.*)\n[^\n]*', re.DOTALL)

def my_func(s):
    return regex.fullmatch(s).group(1)

It looks fast enough on @blhsing's ATO page.
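
For instance, checked against the question's sample data (with the pattern above):

>>> my_func('*** START OF DATA ***\ndata\ndata\ndata\n*** END OF DATA ***')
'data\ndata\ndata'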

Capsular answered 22/5 at 10:48 Comment(0)
