MalformedInputException with Files.readAllLines()

I was iterating over some files, 5328 to be precise. They are ordinary XML files of 60 to 200 lines at most. They are first filtered by a simple method, isXmlTestSourceFile, that checks the path.

    Files.walk(Paths.get("/home/me/development/projects/myproject"), FileVisitOption.FOLLOW_LINKS)
            .filter(V3TestsGenerator::isXmlTestSourceFile)
            .filter(V3TestsGenerator::fileContainsXmlTag)

My question is about the second filter, the method fileContainsXmlTag. For each file I wanted to detect whether a pattern occurs at least once in its lines:

private static boolean fileContainsXmlTag(Path path) {
    try {
        return Files.readAllLines(path).stream().anyMatch(line -> PATTERN.matcher(line).find());
    } catch (IOException e) {
        e.printStackTrace();
    }
    return false;
}

For some files I then get this exception:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at java.nio.file.Files.readAllLines(Files.java:3205)
at java.nio.file.Files.readAllLines(Files.java:3242)

But when I use FileUtils.readLines() instead of Files.readAllLines(), everything works fine.

This is a question out of curiosity, so if someone has a clue about what's going on, I'd be glad to hear it.

Thanks

Yeomanry answered 8/8, 2016 at 12:8 Comment(0)

The single-argument method Files.readAllLines(Path) assumes that the file you are reading is encoded in UTF-8.

If you get this exception, then the file you are reading most likely contains bytes that are not valid UTF-8, which usually means it was written in a different character encoding.

Find out which character encoding is used, and call the other readAllLines overload, which lets you specify the character encoding.

For example, if the files are encoded in ISO-8859-1:

return Files.readAllLines(path, StandardCharsets.ISO_8859_1).stream()... // etc.

The method FileUtils.readLines() (where does that come from?) probably assumes something else: most likely the default character encoding of your system, which is evidently not UTF-8.
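To help with "find out what character encoding is used", here is a minimal sketch that checks whether a file is valid UTF-8 at all. The class name Utf8Check and its method names are made up for illustration. It only tells you that a file is not UTF-8, not which encoding it actually uses.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK or of the original question.
class Utf8Check {

    /** Returns true if the bytes decode as strict UTF-8 without any error. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT is already the default; set explicitly for clarity.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return false;
        }
    }

    /** Convenience overload that loads the whole file into memory first. */
    static boolean isValidUtf8(Path path) throws IOException {
        return isValidUtf8(Files.readAllBytes(path));
    }
}
```

Running isValidUtf8 over the files that fail should narrow things down: any file for which it returns false needs an explicit charset such as StandardCharsets.ISO_8859_1 when it is read.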

Lactoflavin answered 8/8, 2016 at 12:18 Comment(4)
I have the same problem - I believe this is related to CR LF. FileUtils.readLines(new File(filename), StandardCharsets.UTF_8) worked fine for me where Files.readAllLines(Paths.get(filename), StandardCharsets.UTF_8) did not. - Roundhead
@Lactoflavin Thanks a lot for the answer! - Yeomanry
It's odd, I got this message with a file that was UTF-8 encoded according to Notepad++. So I don't agree with this answer. I suspect @Roundhead may have the answer. I am probably reading a text file created on a Unix platform on a Windows machine and getting this issue. In my case though I have this syntax: Files.readAllLines(Paths.get(fileName)); - Groce
ISO-8859-1 did fix it for me though. Seems to neutralise that issue. - Groce

FileUtils.readLines from Apache Commons IO uses java.io.InputStreamReader.

InputStreamReader uses a CharsetDecoder to read bytes, but it changes onMalformedInput and onUnmappableCharacter from the default REPORT (meaning throw an exception) to REPLACE (meaning insert some replacement character).

Files.readAllLines however leaves the default behavior of the CharsetDecoder and therefore throws an exception if a malformed or unmappable byte sequence is read. There is a hint to this hidden in the documentation of readAllLines:

 * @throws  IOException
 *          if an I/O error occurs reading from the file or a malformed or
 *          unmappable byte sequence is read
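The difference between the two behaviors can be reproduced without any file. The following is a small sketch; the class and method names are invented for illustration. It feeds the same invalid UTF-8 bytes to an InputStreamReader and to a freshly created CharsetDecoder.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, names chosen for this example only.
class DecodeModes {

    /** Reader built from a Charset: malformed bytes are replaced with U+FFFD. */
    static String readWithInputStreamReader(byte[] bytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fresh decoder with its default actions (REPORT): malformed bytes throw. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

With the input bytes 'a', 0xE9, 'b' (0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8), readWithInputStreamReader returns the three characters 'a', U+FFFD, 'b', while decodeStrict throws a MalformedInputException.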

Files.readAllLines is part of "NIO", Java's newer approach to I/O. NIO quite consistently throws an exception on invalid input, while the "old" classes replace invalid input. However, you can change NIO's behavior by using the methods that accept a CharsetDecoder and passing a CharsetDecoder that you have configured according to your needs.
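As one possible way to do this, here is a sketch that reads all lines through Channels.newReader with a decoder switched to REPLACE. The class LenientLines and its methods are hypothetical names for this example. It hard-codes UTF-8, so adjust the charset if your files use something else.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, names chosen for this example only.
class LenientLines {

    /** A UTF-8 decoder that substitutes U+FFFD instead of throwing. */
    static CharsetDecoder lenientUtf8() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    /** Reads every line from the channel; closing the reader closes the channel. */
    static List<String> readAllLines(ReadableByteChannel channel) throws IOException {
        // -1 asks for the implementation's default buffer capacity.
        try (BufferedReader reader =
                     new BufferedReader(Channels.newReader(channel, lenientUtf8(), -1))) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    /** Lenient counterpart to Files.readAllLines(path). */
    static List<String> readAllLines(Path path) throws IOException {
        return readAllLines(Files.newByteChannel(path));
    }
}
```

In the question's fileContainsXmlTag, LenientLines.readAllLines(path) could stand in for Files.readAllLines(path). Keep in mind that replacement silently hides encoding problems, so specifying the correct charset is preferable whenever it is known.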

Canica answered 4/12, 2023 at 10:57 Comment(0)
