String Index Error (Julia)

I'm new to Julia. While testing out the language, I ran into this error.

First, I define a String b as "he§y".

Julia seems to behave strangely when a String contains "special" characters...

When I try to get the third character of b (which should be '§'), everything is OK.

However, when I try to get the fourth character of b (which should be 'y'), a StringIndexError is thrown.

Costplus answered 23/8, 2018 at 2:4 Comment(6)
Next time, please post an MCVE to demonstrate your problem. It makes everyone's life easier. - Virago
I have included all the information you need to diagnose my problem... - Costplus
A few lines of code that reproduce your problem are much clearer than "providing all the information" in prose. We can't even see how you iterate over the string. Also, in your other comment you said my answer is unrelated, but from your description it is precisely the problem: for the string "he§y", s[1] gives you h, s[3] should give you §, and s[4] gives you a StringIndexError. Without seeing your code, we can't tell more. It could be that you simply didn't declare the string correctly. You could also use \u escapes for non-ASCII characters in the string literal. - Virago
It seems you are not familiar with how StackOverflow works: you can edit your question (by clicking "edit" at the end of the question) to provide extra information, preferably a concise code snippet. - Virago
Have a look at this answer to indexing a UTF8 string. - Lach
Why don't you post the exact code to reproduce the error instead of WASTING SO MUCH TIME yakking about it??? - Rybinsk

I don't believe the compiler would give you this error. Do you mean a runtime error?

I know nothing about the Julia language, but the symptoms suggest that string indexing is based not on code points but on the units of some encoding.

The Julia documentation seems to support my hypothesis:

https://docs.julialang.org/en/stable/manual/strings/

The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)

...

Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
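To make the "partial function" idea concrete, here is a minimal sketch using the string from the question. It assumes Julia 1.x, and the variable name b is simply borrowed from the question; nothing here is the asker's actual code.

```julia
b = "he§y"

# Only some integer indices are the start of a character.
# eachindex(b) yields exactly those indices.
valid = collect(eachindex(b))

# isvalid(s, i) reports whether byte index i is the start of a character.
# Here it is true for 1, 2, 3 and 5, and false for 4.
flags = [isvalid(b, i) for i in 1:ncodeunits(b)]

println(valid)   # [1, 2, 3, 5]
```

Index 4 is missing from the valid indices because it points at the second byte of '§'.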


Edit: The following example, quoted from the Julia documentation, demonstrates exactly the "problem" you are facing.

julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"

Whether these Unicode characters are displayed as escapes or shown as special characters depends on your terminal's locale settings and its support for Unicode. String literals are encoded using the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes. In UTF-8, ASCII characters – i.e. those with code points less than 0x80 (128) – are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes – up to four per character. This means that not every byte index into a UTF-8 string is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:

julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]

julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]

julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
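The same behaviour can probably be reproduced with the string from the question. The sketch below assumes Julia 1.x; safe_getindex is a hypothetical helper written for this illustration, not a function from Base.

```julia
b = "he§y"

# Return the character at byte index i, or nothing if i falls inside a character.
# (safe_getindex is a hypothetical helper, not part of Base.)
function safe_getindex(s::AbstractString, i::Integer)
    try
        return s[i]
    catch err
        err isa StringIndexError || rethrow()
        return nothing
    end
end

third  = safe_getindex(b, 3)   # '§' starts at byte 3
fourth = safe_getindex(b, 4)   # nothing: byte 4 is the second byte of '§'
fifth  = safe_getindex(b, 5)   # 'y'

# To index by character position instead, materialize the characters first.
chars = collect(b)             # Vector{Char} with 4 elements
fourth_char = chars[4]         # 'y'
```

Collecting into a Vector{Char} costs an allocation, so it is mainly convenient for short strings; iterating with `for c in b` avoids indexing altogether.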
Virago answered 23/8, 2018 at 2:12 Comment(2)
Additionally, Julia 1.0 has the nextind and prevind functions, which let you get an index that is the start of a character. There are several options there, so it is best to refer to the Julia manual. - Limewater
@BogumiłKamiński Exactly. And that is actually mentioned in the quoted documentation too. - Virago
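As a rough illustration of the functions mentioned in the comment above, applied to the string from the question (a sketch assuming Julia 1.x; the variable names are made up for this example):

```julia
b = "he§y"

i = nextind(b, 3)        # 5: the character after '§' starts at byte 5
j = prevind(b, 5)        # 3: the character before 'y' starts at byte 3
k = thisind(b, 4)        # 3: byte 4 belongs to the character starting at 3

# nextind(s, 0, n) gives the byte index of the n-th character.
idx4 = nextind(b, 0, 4)  # 5
c4 = b[idx4]             # 'y'
```

This walks the string without allocating a separate array of characters, at the cost of a linear scan to find the n-th character.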

The '§' character takes up 2 bytes, so s[4] is not a valid index and the next valid index is s[5].

The characters of "he§y" are laid out in memory as:
s[1]: h
s[2]: e
s[3], s[4]: § (bytes 3 and 4 together encode a single character that starts at index 3, so there is no character at index 4)
s[5]: y
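The layout above can be checked by looking at the raw UTF-8 bytes. This is a small sketch assuming Julia 1.x; the variable names are chosen for illustration only.

```julia
b = "he§y"

bytes  = collect(codeunits(b))   # raw UTF-8 bytes of the string
nbytes = ncodeunits(b)           # 5 bytes
nchars = length(b)               # 4 characters

# Print each valid index with its character and encoded size in bytes.
for i in eachindex(b)
    println(i, " => ", b[i], " (", ncodeunits(string(b[i])), " byte(s))")
end
```

The byte count exceeds the character count by one because '§' (U+00A7) is encoded as the two bytes 0xc2 0xa7.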

Cullan answered 27/1, 2024 at 17:1 Comment(0)
