Why is string creation so slow in Julia?

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.

This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?

This is some reproducible code:

function chop(s; nc=80)
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

s = "A"^500000

chop(s)

It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]

Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.

I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl

Can someone help me understand why this code above is so slow nonetheless?

Weirdly, the below is much faster, just by adding s = collect(s):

function chop(s; nc=80)
    s = collect(s) #this line is new
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

Leningrad answered 18/5, 2022 at 15:1 Comment(2)

"maybe quadratic.... It seems that this row is where most of the time is spent:" Well, how do you expect nr to grow in terms of the length of the string? How long do you expect s[l(i):r(i)] to take - what's the time complexity of the functions, and how long do you expect the slice to be in big-O terms? Can you see how to test those assumptions? Profiling is how you find the bottleneck, but more analysis is required to understand it. – Eleaseeleatic 19/5, 2022 at 0:37

"Time seems to increase more than linearly, maybe quadratic." One good way to get a sense of the actual time complexity is to try the function at varying problem sizes (you will likely want a geometric sequence) and graph the timing results (log-linear and log-log plots are sometimes helpful). Unfortunately I don't know Julia so I can't really tell you any more. I assume l(i) = 1+(nc*(i-1)) is some kind of lambda-like syntax - rather elegant, if it works like I imagine it does. – Eleaseeleatic 19/5, 2022 at 0:39

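The measurement suggested in the comment above could be sketched as follows. This is a minimal sketch under stated assumptions: chop_slow is a hypothetical copy of the question's first function (renamed so it does not shadow Base.chop), and the sizes are arbitrary choices small enough to finish quickly.

```julia
# Hypothetical copy of the question's first function, renamed for this sketch.
function chop_slow(s; nc=80)
    nr   = ceil(Int, length(s) / nc)
    l(i) = 1 + nc * (i - 1)
    r(i) = min(nc * i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows, '\n')
end

chop_slow("A"^100)  # warm-up call so that compilation is not timed

sizes = [20_000 * 2^k for k in 0:3]                   # geometric sequence of lengths
times = [@elapsed(chop_slow("A"^n)) for n in sizes]   # seconds per call

for (n, t) in zip(sizes, times)
    println(rpad(n, 10), round(t; digits=5), " s")
end

# Ratio of successive timings: near 2 suggests linear growth, near 4 quadratic.
println(round.(times[2:end] ./ times[1:end-1]; digits=2))
```

Single timings are noisy, so repeating each measurement (or using a benchmarking package) gives a more reliable picture.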

My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity, not speed):

chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.

How the solution works

findall(Regex(".{1,$nc}"), s) returns a vector of ranges, each greedily matching up to nc characters;
next, a SubString(s, r) is created for each returned range r, which avoids copying the string data;
finally, everything is joined with \n as the separator.
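
The three steps above can be followed on a short string. This is a minimal sketch; the sample strings and the variable names are made up for illustration.

```julia
nc = 4
s = "abcdefghij"

# Step 1: index ranges of consecutive matches, each up to nc characters long
ranges = findall(Regex(".{1,$nc}"), s)

# Step 2: views into s, so no string data is copied
parts = [SubString(s, r) for r in ranges]

# Step 3: join the pieces with a newline separator
joined = join(parts, '\n')
println(joined)

# Non-ASCII text works too, because the regex matches characters, not bytes
t = "αβγδε"
greek = join((SubString(t, r) for r in findall(Regex(".{1,2}"), t)), '\n')
println(greek)
```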

What is wrong in the OP solutions

First attempt:

the function name chop is not recommended, as it shadows the function of the same name from Base Julia;
length(s) is called many times (once per row, inside r(i)), and for a String it is an expensive O(n) function because the characters have to be counted; since the number of rows also grows with n, the total run time is quadratic; it should be called only once and stored in a variable;
in general, using length to compute indices is incorrect, as Julia strings use byte indexing, not character indexing (see here for an explanation);
String(s[l(i):r(i)]) is inefficient, as s[l(i):r(i)] already allocates a new String (so the outer String call is not needed).
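
The difference between character counts and byte indices can be seen on a short non-ASCII string. This is a minimal sketch with a made-up sample string and variable names:

```julia
s = "αβγ"                     # three characters, two bytes each in UTF-8

nchars = length(s)            # 3: counts characters by scanning the string
nbytes = sizeof(s)            # 6: number of bytes, known without scanning
valid2 = isvalid(s, 2)        # false: byte 2 lies in the middle of 'α'
idx    = collect(eachindex(s))  # valid character start indices: [1, 3, 5]
third  = nextind(s, 1, 2)     # 5: the index two characters after index 1

# Indexing with a range that ends inside a character throws an error
err = try
    s[1:2]
catch e
    e
end
println(typeof(err))          # StringIndexError
```

With ASCII-only input such as "A"^500000 the two notions coincide, which is why the code in the question does not fail, but it still pays for every length call.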

Second attempt:

doing s = collect(s) resolves both the repeated calls to length and the incorrect use of byte indexing, but it is inefficient, as it unnecessarily allocates a Vector{Char}; it also makes your code type-unstable (you assign to the variable s a value of a different type than it originally stored);
doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and then allocates a String.
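
The type-stability point can be addressed by binding the collected characters to a new name. A minimal sketch, assuming the Vector{Char} approach is kept; chop_chars is a hypothetical name, and the sketch still pays for the extra allocations described above:

```julia
function chop_chars(s::AbstractString; nc::Integer=80)
    chars = collect(s)   # Vector{Char}; a separate name keeps `s` a string
    n = length(chars)    # O(1) for a vector, and computed only once
    # each row copies a slice of chars and converts it to a String
    rows = [String(chars[lo:min(lo + nc - 1, n)]) for lo in 1:nc:n]
    return join(rows, '\n')
end

println(chop_chars("abcdefghij"; nc=4))
```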

What would be a fast solution

If you want something that is both correct and faster than the regex approach, you can use this code:

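function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)              # number of bytes in s
    cu = codeunits(s)           # byte view of s
    buf_sz = sz + div(sz, nc)   # upper bound: all bytes plus the newlines
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1                   # byte index where the current chunk starts
    buf_loc = 1                 # next write position in buf
    while true
        # byte index where the next chunk starts, capped at sz + 1
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)   # trim the unused tail of the buffer
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end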
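
The question mentions IOBuffer as the usual way to build strings. For comparison, here is a minimal sketch of that approach. chop_iobuffer is a hypothetical name that does not come from the answer, and the sketch has not been benchmarked against chop4.

```julia
function chop_iobuffer(s::AbstractString; nc::Integer=80)
    nc > 0 || throw(ArgumentError("nc must be positive"))
    # sizehint is only a capacity hint: all bytes plus room for the newlines
    io = IOBuffer(sizehint = sizeof(s) + div(sizeof(s), nc))
    for (k, c) in enumerate(s)      # iterates characters, so Unicode is handled
        if k > 1 && (k - 1) % nc == 0
            write(io, '\n')         # separator before each new chunk
        end
        write(io, c)
    end
    return String(take!(io))
end

println(chop_iobuffer("abcdefghij"; nc=4))
```

Iterating over characters never touches byte indices, so the indexing issue described above does not arise.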

Harman answered 18/5, 2022 at 18:13 Comment(8)

Can you maybe also explain the titular question "Why is string creation so slow in Julia?" and explain why OPs approach is so slow and why yours is faster? – Andres 19/5, 2022 at 9:51

It was already commented above. The crucial problem is String(s[l(i):r(i)]) in OP code. It causes two allocations per loop iteration. One allocation is s[l(i):r(i)] which creates a new Vector{Char} and the second is a call to String which allocates a new string. In my approach findall creates ranges (which do not allocate) and then SubString is a view of the original string. – Duo 19/5, 2022 at 10:3

Comments are ephemeral, they should not contain important information, that information should be part of an actual answer. Your explanation of why your code works should also be made part of the answer. – Andres 19/5, 2022 at 13:31

I considered my answer as a side information to already given longer answers that is too long for a comment. I will expand the answer to cover all the aspects involved given you would find it useful. – Duo 19/5, 2022 at 14:2

@Polygnome: I think they're referring to jling's answer which explains the copying. So not ephemeral, but yes, sort order can change over time with voting. An answer that just wants to assume readers have already seen other answers should explicitly say so, like "As [@jling explained](link), Julia strings are immutable so they alloc & copy." Credit to the other user for answering that part of the question, and link their answer for more detail if you don't want to at least briefly explain it in your own words. – Richmal 19/5, 2022 at 14:28

TL;DR: I agree with @Polygnome: SO answers shouldn't totally ignore parts of the question without at least saying "see other answers for the xyz part". – Richmal 19/5, 2022 at 14:29

In the first sentence of my original answer I gave the reference to the answer I was adding information to; with the benefit of hindsight I agree that indeed I could have added a link as you suggest and use a better wording for this. Now I have expanded my answer per your suggestions. – Duo 19/5, 2022 at 14:40

Your answer is spot on regarding what's wrong with the proposed solutions in the question. length of a string is O(n) because of Unicode, and surprisingly, the result is not memoized. – Leningrad 19/5, 2022 at 22:32

String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.

Dibromide answered 18/5, 2022 at 15:8 Comment(0)
