Why is string creation so slow in Julia?
Asked Answered
L

5

16

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.

This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?

This is some reproducible code:

function chop(s; nc=80)
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

s = "A"^500000

chop(s)

It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]

Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.

I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl

Can someone help me understand why this code above is so slow nonetheless?

Weirdly, the below is much faster, just by adding s = collect(s):

function chop(s; nc=80)
    s = collect(s) #this line is new
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end
Leningrad answered 18/5, 2022 at 15:1 Comment(2)
"maybe quadratic.... It seems that this row is where most of the time is spent:" Well, how do you expect nr to grow in terms of the length of the string? How long do you expect s[l(i):r(i)] to take - what's the time complexity of the functions, and how long do you expect the slice to be in big-O terms? Can you see how to test those assumptions? Profiling is how you find the bottleneck, but more analysis is required to understand it.Eleaseeleatic
"Time seems to increase more than linearly, maybe quadratic." One good way to get a sense of the actual time complexity is to try the function at varying problem sizes (you will likely want a geometric sequence) and graph the timing results (log-linear and log-log plots are sometimes helpful). Unfortunately I don't know Julia so I can't really tell you any more. I assume l(i) = 1+(nc*(i-1)) is some kind of lambda-like syntax - rather elegant, if it works like I imagine it does.Eleaseeleatic
H
16

My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):

chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.

How the solution works

How does the given solution work:

  • findall(Regex(".{1,$nc}") returns a vector of ranges eagerly matching up to nc characters;
  • next I create a SubString(s, r) which avoids allocation, using the returned ranges that are iterated by r.
  • finally all is joined with \n as separator.

What is wrong in the OP solutions

First attempt:

  • the function name you choose chop is not recommended to be used as it overshadows the function from Base Julia with the same name;
  • length(s) is called many times and it is an expensive function; it should be called only once and stored as a variable;
  • in general using length is incorrect as Julia uses byte indexing not character indexing (see here for an explanation)
  • String(s[l(i):r(i)]) is inefficient as it allocates String twice (actually the outer String is not needed)

Second attempt:

  • doing s = collect(s) resolves the issue of calling length many times and incorrect use of byte indexing, but is inefficient as it unnecessarily allocates Vector{Char} and also it makes your code type-unstable (as you assign to variable s value of different type than it originally stored);
  • doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and next allocates String

What would be a fast solution

If you want something faster than regex and correct you can use this code:

function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
Harman answered 18/5, 2022 at 18:13 Comment(8)
Can you maybe also explain the titular question "Why is string creation so slow in Julia?" and explain why OPs approach is so slow and why yours is faster?Andres
It was already commented above. The crucial problem is String(s[l(i):r(i)]) in OP code. It causes two allocations per loop iteration. One allocation is s[l(i):r(i)] which creates a new Vector{Char} and the second is a call to String which allocates a new string. In my approach findall creates ranges (which do not allocate) and then SubString is a view of the original string.Duo
Comments are ephemeral, they should not contain important information, that information should be part of an actual answer. Your explanation of why your code works should also be made part of the answer.Andres
I considered my answer as a side information to already given longer answers that is too long for a comment. I will expand the answer to cover all the aspects involved given you would find it useful.Duo
@Polygnome: I think they're referring to jling's answer which explains the copying. So not ephemeral, but yes, sort order can change over time with voting. An answer that just wants to assume readers have already seen other answers should explicitly say so, like "As [@jling explained](link), Julia strings are immutable so they alloc & copy." Credit to the other user for answering that part of the question, and link their answer for more detail if you don't want to at least briefly explain it in your own words.Richmal
TL:DR: I agree with @Polygnome: SO answers shouldn't totally ignore parts of the question without at least saying "see other answers for the xyz part".Richmal
In the first sentence of my original answer I gave the reference to the answer I was adding information to; with the benefit of hindsight I agree that indeed I could have added a link as you suggest and use a better wording for this. Now I have expanded my answer per your suggestions.Duo
Your answer is spot on regarding what's wrong with the proposed solutions in the question. length of string is o(n) because of unicode, and surprisingly, the result is not memoized.Leningrad
D
16

String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.

Dibromide answered 18/5, 2022 at 15:8 Comment(0)
H
16

My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):

chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.

How the solution works

How does the given solution work:

  • findall(Regex(".{1,$nc}") returns a vector of ranges eagerly matching up to nc characters;
  • next I create a SubString(s, r) which avoids allocation, using the returned ranges that are iterated by r.
  • finally all is joined with \n as separator.

What is wrong in the OP solutions

First attempt:

  • the function name you choose chop is not recommended to be used as it overshadows the function from Base Julia with the same name;
  • length(s) is called many times and it is an expensive function; it should be called only once and stored as a variable;
  • in general using length is incorrect as Julia uses byte indexing not character indexing (see here for an explanation)
  • String(s[l(i):r(i)]) is inefficient as it allocates String twice (actually the outer String is not needed)

Second attempt:

  • doing s = collect(s) resolves the issue of calling length many times and incorrect use of byte indexing, but is inefficient as it unnecessarily allocates Vector{Char} and also it makes your code type-unstable (as you assign to variable s value of different type than it originally stored);
  • doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and next allocates String

What would be a fast solution

If you want something faster than regex and correct you can use this code:

function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
Harman answered 18/5, 2022 at 18:13 Comment(8)
Can you maybe also explain the titular question "Why is string creation so slow in Julia?" and explain why OPs approach is so slow and why yours is faster?Andres
It was already commented above. The crucial problem is String(s[l(i):r(i)]) in OP code. It causes two allocations per loop iteration. One allocation is s[l(i):r(i)] which creates a new Vector{Char} and the second is a call to String which allocates a new string. In my approach findall creates ranges (which do not allocate) and then SubString is a view of the original string.Duo
Comments are ephemeral, they should not contain important information, that information should be part of an actual answer. Your explanation of why your code works should also be made part of the answer.Andres
I considered my answer as a side information to already given longer answers that is too long for a comment. I will expand the answer to cover all the aspects involved given you would find it useful.Duo
@Polygnome: I think they're referring to jling's answer which explains the copying. So not ephemeral, but yes, sort order can change over time with voting. An answer that just wants to assume readers have already seen other answers should explicitly say so, like "As [@jling explained](link), Julia strings are immutable so they alloc & copy." Credit to the other user for answering that part of the question, and link their answer for more detail if you don't want to at least briefly explain it in your own words.Richmal
TL:DR: I agree with @Polygnome: SO answers shouldn't totally ignore parts of the question without at least saying "see other answers for the xyz part".Richmal
In the first sentence of my original answer I gave the reference to the answer I was adding information to; with the benefit of hindsight I agree that indeed I could have added a link as you suggest and use a better wording for this. Now I have expanded my answer per your suggestions.Duo
Your answer is spot on regarding what's wrong with the proposed solutions in the question. length of string is o(n) because of unicode, and surprisingly, the result is not memoized.Leningrad
J
9

You could operate on bytes

function chop2(s; nc=80)
    b = transcode(UInt8, s)
    nr   = ceil(Int64, length(b)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(b))
    dat = UInt8[]
    for i in 1:nr
        append!(dat, @view(b[l(i):r(i)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    String(dat)
end

and the benchmarks (around 5000x faster):

 @btime chop($s);
  1.531 s (6267 allocations: 1.28 MiB)

julia> @btime chop2($s);
  334.100 μs (13 allocations: 1.57 MiB)

Notes:

  • this code could be still made slightly faster by pre-allocating dat but I tried to bi similar to the original.
  • when having unicode characters neither yours nor this approach will not work as you cannot cut a unicode character in the middle
Jadwiga answered 18/5, 2022 at 16:14 Comment(1)
Perhaps use cld(length(b), nc) instead of doing float division with ceil(Int64, length(b)/nc).Riccio
L
5

With the help of a colleage we figured out the main reason that makes the provided implementation so slow.

It turns out length(::String) has time complexity O(n) in Julia, and the results are not cached, so the longer the string, the more calls to length which itself takes longer the longer the input. See this Reddit post for a good discussion of the phenomenon:

Collecting the string into a vector resolves the bottleneck, because length of a vector is O(1) instead of O(n).

This is of course by no means the best way to solve the general problem, but it's a one line change that speeds up the code as provided.

Leningrad answered 19/5, 2022 at 10:1 Comment(1)
Use of length in your original code is in general incorrect (it is correct only for ASCII string). Therefore in my answer I was referring to your corrected code (using s = collect(s)). If you knew your string is ASCII only (which can be checked by isascii function) then use sizeof instead of length which will be much faster in your original code. See bkamins.github.io/julialang/2020/08/13/strings.html for a discussion of string indexing and complexity of various functions operating on strings.Duo
M
4

This has similar performance to the version by @PrzemyslawSzufel, but is much simpler.

function chop3(s; nc=80)
    L = length(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end

I didn't choose firstindex(s), lastindex(s) as strings may not have arbitrary indices, but it makes no difference anyway.

@btime chop3(s) setup=(s=randstring(10^6))  # 1.625 ms (18 allocations: 1.13 MiB)
@btime chop2(s) setup=(s=randstring(10^6))  # 1.599 ms (14 allocations: 3.19 MiB)

Update: Based on suggestions by @BogumiłKamiński, working with ASCII strings, this version with sizeof is even 60% faster.

function chop3(s; nc=80)
    L = sizeof(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end
Milne answered 19/5, 2022 at 0:28 Comment(4)
As a side note - this function is only correct for ASCII strings - just as Przemysław noted for his function.Duo
Yes, same as the OP code. I believe your solution strikes a balance between generality and speed. But it requires some Regex knowledge.Milne
No, OP code (corrected) is correct as it uses s = collect(s) which ensures Char are iterated.Duo
Agree, I was referring to the first code.Milne

© 2022 - 2024 — McMap. All rights reserved.