Possible problems with String reversing using charAt method
R

8

15

I saw a comment here claiming that all solutions based on charAt are wrong. I could not quite understand it, and I could not find anything about this problem with charAt on the internet. Looking at the source code, charAt just returns an element from the char array. So my question is: is there any problem or issue with using charAt?

The comment reads:

Strictly speaking, all the solutions based on charAt are wrong, as charAt doesn't give you the "character at", but the "code unit at", and there are code units that are not characters and characters that need multiple code units.

Repine answered 14/3, 2016 at 13:25 Comment(2)
It is referring to the fact that some symbols can't be represented by a single char - only ones with a codepoint less than 65536. Characters like Emoji are outside that range. – Fictile
Relevant: #14151030 – Thaddeus
O
14

In UTF-16, different characters are encoded with different numbers of 16-bit code units. For example, the "A" character takes a single code unit, which is represented as follows (BE):

00000000 01000001

So far so good.

But if you have a character like 𝔴, you'll have a problem. Its UTF-16 representation (BE) is:

11011000 00110101 11011101 00110100

That is two code units, so charAt can return just one half of the character, for example the second code unit on its own.
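A minimal sketch of what that looks like, assuming a string that holds only 𝔴 (U+1D534). The class name `CharAtDemo` is hypothetical; `charAt` and `codePointAt` are standard `String` methods.

```java
class CharAtDemo {
    // U+1D534 written with escapes so the source file encoding does not matter.
    static final String W = "\uD835\uDD34";

    public static void main(String[] args) {
        System.out.println(W.length());                            // 2 (code units, not characters)
        System.out.println(Integer.toHexString(W.charAt(0)));      // d835 (high surrogate)
        System.out.println(Integer.toHexString(W.charAt(1)));      // dd34 (low surrogate)
        System.out.println(Integer.toHexString(W.codePointAt(0))); // 1d534 (the whole code point)
    }
}
```

Neither value returned by `charAt` is a character on its own; only `codePointAt(0)` yields the code point the two units encode together.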

See the JDK 7 implementation of String#charAt:

public char charAt(int index) {
    if ((index < 0) || (index >= count)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return value[index + offset];
}
Oleson answered 14/3, 2016 at 13:36 Comment(0)
C
11

In Java, a String is essentially an array of char, and a char is a UTF-16 code unit (not a full code point).

There are two problems with this:

  1. Not all characters can be expressed with a single code unit in UTF-16.
  2. Unicode supports combining characters.

Reordering characters that are part of either of these situations will result in a String that is incorrect.

StringBuilder's reverse takes the first situation into account, but I'm not aware of anything that takes the second into account.
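A small sketch of both situations, using only `StringBuilder.reverse()`. The class and method names are hypothetical, and the sample strings are assumptions chosen for illustration.

```java
class ReverseDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // Situation 1: a surrogate pair (U+1D534) between 'a' and 'b' stays intact.
        String pair = "a\uD835\uDD34b";
        System.out.println(reverse(pair).equals("b\uD835\uDD34a")); // true

        // Situation 2: 'e' + combining acute accent (U+0301), then 'x'.
        // After reversal the accent follows 'x', so it attaches to the wrong letter.
        String combining = "e\u0301x";
        System.out.println(reverse(combining).equals("x\u0301e")); // true
    }
}
```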

Cyler answered 14/3, 2016 at 13:35 Comment(0)
J
6

What is said above is true: some characters (code points) require two chars (code units) to be represented. Because Java uses 16-bit chars, this is encountered infrequently; but strictly speaking, any code that uses charAt(...) without considering whether the accessed char is part of a two-char surrogate pair is exposing itself to character-processing issues.

To test whether you are working with a surrogate pair, check whether the value returned by charAt(...) is in the range 0xD800 to 0xDBFF; that range marks a high surrogate, the first half of a pair. The second half, the low surrogate, falls in the range 0xDC00 to 0xDFFF.
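Rather than comparing against numeric ranges by hand, the JDK's `Character.isHighSurrogate` (0xD800 to 0xDBFF) and `Character.isLowSurrogate` (0xDC00 to 0xDFFF) can do the check. A minimal sketch follows; the class and the helper `startsPair` are hypothetical names.

```java
class SurrogateCheck {
    // True if the char at index i is a high surrogate that is
    // immediately followed by a low surrogate, i.e. a complete pair starts here.
    static boolean startsPair(String s, int i) {
        return Character.isHighSurrogate(s.charAt(i))
                && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1));
    }

    public static void main(String[] args) {
        String s = "A\uD835\uDD34"; // 'A' followed by U+1D534
        System.out.println(startsPair(s, 0)); // false: 'A' is an ordinary char
        System.out.println(startsPair(s, 1)); // true: a pair starts here
        System.out.println(startsPair(s, 2)); // false: this is the second half
    }
}
```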

Julio answered 14/3, 2016 at 13:33 Comment(0)
W
6

As other answers point out, some characters can take multiple code units, and you will get invalid characters if you try to interpret either of these code units by itself, or in combination with other code units.

One other thing to keep in mind is that having a 2-code-unit character in your string will shift all the subsequent indices by one, so e.g. the tenth character will be charAt(10) instead of charAt(9) - so even if you're not hit by encoding issues with the character itself, you could find yourself extracting the wrong character by index later in the string.
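One way to index by code point instead of by char is `String.offsetByCodePoints`, which converts a code point position into a char index. This is only a sketch: the class and the helper `codePointAtPosition` are hypothetical names, and the sample string is an assumption.

```java
class CodePointIndex {
    // Returns the code point at position n, where n counts code points rather than chars.
    static int codePointAtPosition(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "\uD835\uDD34abc"; // U+1D534 followed by "abc"
        System.out.println(s.length());                      // 5 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points
        // charAt(1) lands on the low surrogate; the second code point is 'a'.
        System.out.println((char) codePointAtPosition(s, 1)); // a
    }
}
```

Note that `offsetByCodePoints` has to scan the string, so this lookup is linear in `n` rather than constant-time like `charAt`.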

Wanitawanneeickel answered 14/3, 2016 at 13:48 Comment(0)
D
5

Strictly speaking, yes, there is a problem, for the reason given in the comment you quoted. Some characters need more than one char to represent. If you reverse the string using String.charAt, the two chars that make up such a character end up in swapped order, and the result is either an invalid surrogate sequence or, when two such characters are adjacent, a different character altogether.

But again, this is strictly speaking.
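A minimal sketch of a naive charAt-based reversal going wrong, assuming an input that contains one character outside the Basic Multilingual Plane. The class and method names are hypothetical.

```java
class NaiveReverse {
    // Reverses code unit by code unit, with no regard for surrogate pairs.
    static String reverseByCharAt(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            out.append(s.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD34"; // 'a' followed by U+1D534
        String r = reverseByCharAt(s);
        // The pair now reads low-then-high, which is not a valid UTF-16 sequence.
        System.out.println(r.equals("\uDD34\uD835a"));               // true
        System.out.println(Character.isLowSurrogate(r.charAt(0)));  // true
    }
}
```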

Decrescent answered 14/3, 2016 at 13:31 Comment(3)
Can you add an example? I am hearing this for the first time. Please. – Assemble
@MathewsMathai any character in the supplementary planes (en.wikipedia.org/wiki/Plane_(Unicode)), i.e. anything > 65535 (0xFFFF). – Allocate
Is there a better method for referring to a character at a particular index then? – Assemble
S
5

There are numerous common, fatally broken assumptions about text, especially once you leave the niche of "just one Western country", which you do when using Unicode.
To start, here are some relevant points, specifically when dealing with UTF-16:

  • A codepoint might be multiple codeunits.
  • A character might be multiple codepoints.
  • A codepoint might be multiple characters.
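The three points can be illustrated with standard JDK classes. This is a sketch under assumptions: `BreakIterator.getCharacterInstance()` is used as an approximation of user-perceived characters, the ligature ﬁ (U+FB01) is one possible reading of "a codepoint might be multiple characters", and the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

class TextUnits {
    // Counts user-perceived characters as approximated by BreakIterator.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Expands compatibility characters such as ligatures (NFKC normalization).
    static String expandCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String w = "\uD835\uDD34";                           // one code point, two code units
        System.out.println(w.codePointCount(0, w.length())); // 1

        String e = "e\u0301";                                // 'e' + combining acute accent
        System.out.println(e.codePointCount(0, e.length())); // 2 code points
        System.out.println(graphemeCount(e));                // 1 perceived character

        System.out.println(expandCompat("\uFB01"));          // fi (one code point, two letters)
    }
}
```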

Of additional relevance when reversing text are LTR and RTL overrides, which need special handling.

I suggest you read the accepted answer to "Why does modern Perl avoid UTF-8 by default?", specifically the section "Assume brokenness"; that part is programming-language-agnostic.

Sifuentes answered 14/3, 2016 at 13:54 Comment(0)
F
3

The String.charAt method is safe (for some definition of "safe"), but it can be used unsafely, if your string contains characters outside the Basic Multilingual Plane, which has codepoints in the range 0 to 65535.

You can implement string reversal using String.charAt - AbstractStringBuilder uses the char[] directly, but this is logically the same as using String.charAt(). It basically implements two passes:

  • The first reverses the characters, but also checks for any surrogate pairs
  • The second re-reverses the surrogate pairs.
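The two passes can be sketched with String.charAt as follows. This is an illustration of the idea described above, not the JDK source; the class name `TwoPassReverse` is hypothetical.

```java
class TwoPassReverse {
    static String reverse(String s) {
        int n = s.length();
        char[] out = new char[n];
        boolean sawSurrogate = false;

        // Pass 1: plain reversal by code unit, noting whether any surrogate appears.
        for (int i = 0; i < n; i++) {
            char c = s.charAt(n - 1 - i);
            if (Character.isSurrogate(c)) {
                sawSurrogate = true;
            }
            out[i] = c;
        }

        // Pass 2: reversed pairs now read low-then-high; swap them back.
        if (sawSurrogate) {
            for (int i = 0; i < n - 1; i++) {
                if (Character.isLowSurrogate(out[i]) && Character.isHighSurrogate(out[i + 1])) {
                    char tmp = out[i];
                    out[i] = out[i + 1];
                    out[i + 1] = tmp;
                    i++; // skip the pair that was just repaired
                }
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD34b"; // 'a', U+1D534, 'b'
        System.out.println(reverse(s).equals("b\uD835\uDD34a")); // true
    }
}
```

For well-formed input this should agree with `new StringBuilder(s).reverse().toString()`. Like that method, it does nothing about combining characters.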
Fictile answered 14/3, 2016 at 13:39 Comment(0)
N
2

The simplest example for your question is a non-ASCII character like ñ.

charAt() returns ASCII characters reliably, because each one is always a single char. Note that charAt() indexes UTF-16 code units, not bytes, so the number of bytes a character takes in UTF-8 does not matter here. A precomposed ñ (U+00F1) is also a single char, but the same letter written as n followed by a combining tilde (U+0303) takes two chars, and characters outside the Basic Multilingual Plane take two as well, so you may get unexpected output.

Many languages use letters and symbols outside ASCII, so if your application handles locale-specific text you may meet such characters, and charAt()-based code can then give wrong results.

Nataline answered 14/3, 2016 at 13:39 Comment(0)
