Julia: How to read in and output characters with diacritics?

Asked 31/8, 2022 at 9:0 Answered 31/8, 2022 at 13:2

Processing characters outside the ASCII range (0-127) can easily crash Julia.

mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
    print(i,":::")
    print(Int(mystring[i]),"::" )
    println(  mystring[i]       )
end

gives me

1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
8:::ERROR: LoadError: StringIndexError("A-Za-zÀ-ÿŽž", 8)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:220
 [3] getindex(::String, ::Int64) at .\strings\string.jl:213
 [4] top-level scope at R:\_LV\STZ\Web_admin\Languages\Action\Returning\chars.jl:5
 [5] include(::String) at .\client.jl:457
 [6] top-level scope at REPL[18]:1

It crashes after outputting the first character outside the ASCII range, rather than while outputting it, as mentioned in the answer to String Index Error (Julia).
If declaring the values in Julia one should declare them as Unicode, but I have these characters in my input.
The manual says that Julia looks at the locale, but is there an "everywhere" locale?

Is there some way to handle input and output of these characters in Julia?

I am working on Windows10, but I can switch to Linux if that works better for this.

Refugiorefulgence answered 31/8, 2022 at 9:0 Comment(4)

The answers given below are great. This issue is indeed confusing, and I feel the choice made for handling string indexing is not the best. At the time, it might have been hard to make any other choice (due to young language), but changing this in, say, Julia 2.x might be a good big change. – Limbourg 31/8, 2022 at 9:35

I believe string indexing was very thoroughly considered, and that it is not generally seen as a mistake in hindsight. What kind of changes do you want to see? Going to 'naive' indexing seems highly unlikely. – Buie 31/8, 2022 at 9:51

@Buie This isn't a debate for this comment thread. There are pros and cons to the choices made about strings. The decisions made were correct, I agree. There is room to revisit these choices and perhaps add another String type that gives a better interface and protects against some thorny issues. The 'unexpected' exception raised in this question is one of these issues. Some devs wouldn't mind being oblivious to these (including me most of the time). – Limbourg 31/8, 2022 at 10:41

So the drop-in answer is that instead of the expected mystring[i] one can use mystring[nextind(mystring, 0, i)]. If that could be aliased by something like mystring[[i]] with a suitable warning about performance, then it would be easier for some of us to understand, although I don't know whether that fits with Julia philosophy. – Refugiorefulgence 31/8, 2022 at 13:11

Use eachindex to get a list of valid indices in your string:

julia> mystring = "A-Za-zÀ-ÿŽž"
"A-Za-zÀ-ÿŽž"

julia> for i in eachindex(mystring)
           print(i, ":::")
           print(Int(mystring[i]), "::")
           println(mystring[i])
       end
1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
9:::45::-
10:::255::ÿ
12:::381::Ž
14:::382::ž

Your issue is related to the fact that Julia uses byte-indexing of strings, as is explained in the Julia Manual.

For example, the character À takes two bytes. Since it starts at index 7, the next valid index is 9, not 8.

In UTF-8 encoding, which Julia uses by default, only ASCII characters take one byte; all other characters take 2, 3 or 4 bytes, see https://en.wikipedia.org/wiki/UTF-8#Encoding.

For example, À is encoded as two bytes:

julia> codeunits("À")
2-element Base.CodeUnits{UInt8, String}:
 0xc3
 0x80
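
To make the byte-versus-character distinction concrete, here is a small sketch using the string from the question. It only relies on `length` and `ncodeunits` from Base; the printed layout is just for illustration.

```julia
# Compare the character count with the byte count of the question's string.
mystring = "A-Za-zÀ-ÿŽž"

println(length(mystring))      # number of characters: 11
println(ncodeunits(mystring))  # number of bytes (UTF-8 code units): 15

# Bytes needed per character: 1 for ASCII, 2 for the accented letters used here.
for c in mystring
    println(c, " => ", ncodeunits(c), " byte(s)")
end
```

The gap between 11 and 15 is exactly why `1:length(mystring)` does not line up with the valid indices of the string.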

I have also written a post at https://bkamins.github.io/julialang/2020/08/13/strings.html that tries to explain how byte-indexing vs character-indexing works in Julia.

If you have additional questions please comment.

Billfold answered 31/8, 2022 at 9:7 Comment(0)

String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown.
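
As a minimal sketch of what "invalid byte index" means in practice, the snippet below checks indices 7 and 8 of the question's string with `isvalid` and `thisind` from Base, and catches the error that plain indexing raises.

```julia
mystring = "A-Za-zÀ-ÿŽž"

# isvalid(s, i) is true only when byte index i is the start of a character.
println(isvalid(mystring, 7))  # true: start of 'À'
println(isvalid(mystring, 8))  # false: second byte of 'À'

# thisind(s, i) steps back to the start of the character that contains byte i.
println(thisind(mystring, 8))  # 7

# Indexing at a continuation byte throws StringIndexError, which can be caught.
try
    mystring[8]
catch err
    println(err isa StringIndexError)  # true
end
```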

You can use enumerate to get each character together with its iteration number.

mystring = "A-Za-zÀ-ÿŽž"

for (i, x) in enumerate(mystring)
    print(i,":::")
    print(Int(x),"::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#8:::45::-
#9:::255::ÿ
#10:::381::Ž
#11:::382::ž

If you need each character together with its byte index in the string, you can use pairs.

for (i, x) in pairs(mystring)
    print(i,":::")
    print(Int(x),"::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#9:::45::-
#10:::255::ÿ
#12:::381::Ž
#14:::382::ž

Donatist answered 31/8, 2022 at 9:22 Comment(2)

What is the difference between the two? Is the second one just a shorthand for the first, or is there more going on? – Refugiorefulgence 31/8, 2022 at 12:58

The first one gives you the iteration number, the second one the byte index into the string. I have edited the answer to point out the difference. – Donatist 31/8, 2022 at 17:5

In preparation for de-minimising my MCVE for what I actually want to do, which involves advancing the string position outside a plain for-all loop, I used the information in the post written by Bogumił Kamiński to come up with this:

mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
    print(i,":::")
    mychar = mystring[nextind(mystring, 0, i)]
    print(Int(mychar), "::")
    println(  mychar )
end
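
Note that `nextind(mystring, 0, i)` has to scan from the start of the string on every call. When the position only ever moves forward, a possible alternative is to keep the byte index and step it with the two-argument `nextind`. This is only a sketch: the helper name `walk` is made up for illustration, and `collect` is shown as an option for when character-position indexing is really needed.

```julia
mystring = "A-Za-zÀ-ÿŽž"

# Hypothetical helper: walk the string by byte index, one character at a time.
function walk(s::AbstractString)
    seen = Tuple{Int,Char}[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(seen, (i, s[i]))
        i = nextind(s, i)   # byte index of the following character
    end
    return seen
end

for (i, c) in walk(mystring)
    println(i, ":::", Int(c), "::", c)
end

# If random access by character position matters more than memory,
# collecting into a Vector{Char} allows plain 1:length indexing.
chars = collect(mystring)
println(chars[8])   # the second '-', the 8th character
```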

Refugiorefulgence answered 31/8, 2022 at 13:2 Comment(0)
