Different results reading file with Files.newBufferedReader() and constructing readers directly

It seems that Files.newBufferedReader() is more strict about UTF-8 than the naive alternative.

If I create a file containing the single byte 128 (which is not a valid UTF-8 sequence), it is happily read when I construct a BufferedReader on an InputStreamReader on the result of Files.newInputStream(), but Files.newBufferedReader() throws an exception.

This code

try (
    InputStream in = Files.newInputStream(path);
    Reader isReader = new InputStreamReader(in, "UTF-8");
    Reader reader = new BufferedReader(isReader);
) {
    System.out.println((char) reader.read());
}

try (
    Reader reader = Files.newBufferedReader(path);
) {
    System.out.println((char) reader.read());
}

has this result:

�
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read(BufferedReader.java:182)
    at TestUtf8.main(TestUtf8.java:28)

Is this documented? And is it possible to get the lenient behavior with Files.newBufferedReader()?

Spiritual answered 19/1, 2016 at 20:25 Comment(2)
Wild stab in the dark, but have you tried specifying the charset in the newBufferedReader call? - Grice
@Grice He shouldn't have to. That method is documented as using UTF-8. - Bullpup

The difference is in how the CharsetDecoder used to decode the UTF-8 is constructed in the two cases.

For new InputStreamReader(in, "UTF-8") the decoder is constructed using:

Charset cs = Charset.forName("UTF-8");

CharsetDecoder decoder = cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPLACE)
          .onUnmappableCharacter(CodingErrorAction.REPLACE);

This explicitly specifies that malformed or unmappable input is replaced with the decoder's replacement character (U+FFFD) instead of raising an error.

Files.newBufferedReader(path) uses:

Charset cs = StandardCharsets.UTF_8;

CharsetDecoder decoder = cs.newDecoder();

In this case onMalformedInput and onUnmappableCharacter are never called, so the decoder keeps its default action, CodingErrorAction.REPORT, which produces the MalformedInputException you are seeing.
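
To see the two decoder configurations side by side without involving a file, here is a minimal sketch. The class and method names (DecoderActionDemo, decodeLenient, decodeStrict) are made up for illustration; only the CharsetDecoder calls come from the JDK. It decodes the same single byte 128 with each configuration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class DecoderActionDemo {

    // Decoder configured with REPLACE, like the one described for InputStreamReader.
    static String decodeLenient(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Decoder left at its defaults (CodingErrorAction.REPORT for both cases).
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] invalid = { (byte) 128 };

        // Prints 65533, the code point of the replacement character U+FFFD.
        System.out.println((int) decodeLenient(invalid).charAt(0));

        try {
            decodeStrict(invalid);
        } catch (CharacterCodingException e) {
            // The strict decoder reports the malformed byte as an exception.
            System.out.println(e);
        }
    }
}
```

The strict variant fails with a MalformedInputException, which is a subclass of CharacterCodingException, matching the stack trace in the question.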

There does not seem to be a way to change what Files.newBufferedReader does. I didn't see anything documenting this while looking through the code.

Asset answered 19/1, 2016 at 21:0 Comment(0)

From what I can tell, it is not documented anywhere, and it is not possible to get newBufferedReader to behave leniently.

It should be documented, though. In fact, the lack of documentation on it is a valid Java bug, in my opinion, even if the amended documentation ends up saying "invalid charset sequences result in undefined behavior."

Moreover, since there is no documentation on the subject, I don't think you can safely rely on the behavior you're observing. It's entirely possible that a future version of InputStreamReader will default to using an internal CharsetDecoder that is strict.

So, to guarantee lenient behavior, I would take your code a step further:

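// CharsetDecoder is not AutoCloseable, so it must be created outside the
// try-with-resources header.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE);

try (
    InputStream in = Files.newInputStream(path);
    Reader isReader = new InputStreamReader(in, decoder);
    Reader reader = new BufferedReader(isReader);
) {
    System.out.println((char) reader.read());
}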
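
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: LenientReaders and newLenientBufferedReader are hypothetical names, not part of java.nio.file.Files, and the temporary file exists only to reproduce the single-byte-128 case from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical utility class, not part of the JDK.
public class LenientReaders {

    // Opens a UTF-8 reader whose decoder replaces invalid input with U+FFFD
    // instead of throwing. Closing the returned reader closes the stream.
    static BufferedReader newLenientBufferedReader(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder));
    }

    public static void main(String[] args) throws IOException {
        // Reproduce the question's input: a file holding the single byte 128.
        Path path = Files.createTempFile("invalid-utf8", ".txt");
        try {
            Files.write(path, new byte[] { (byte) 128 });
            try (BufferedReader reader = newLenientBufferedReader(path)) {
                // Prints true: the invalid byte is read as U+FFFD.
                System.out.println(reader.read() == 0xFFFD);
            }
        } finally {
            Files.delete(path);
        }
    }
}
```

Because the decoder is configured explicitly, the result does not depend on whatever defaults InputStreamReader happens to use internally.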
Bullpup answered 19/1, 2016 at 21:5 Comment(0)
