All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?
I'm creating a simple wordcount program in Java that reads through a directory's text-based files.

However, I keep on getting the error:

java.nio.charset.MalformedInputException: Input length = 1

from this line of code:

BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));

I suspect I get this because the Charset I used doesn't cover some of the characters in the text files, several of which contain text in other languages. But I want to include those characters.

I later learned from the JavaDocs that the Charset is optional and only used for more efficient reading of the files, so I changed the code to:

BufferedReader reader = Files.newBufferedReader(file);

But some files still throw the MalformedInputException. I don't know why.

I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?

Thanks.

Tucket answered 8/10, 2014 at 23:41 Comment(0)
You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.
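
A minimal sketch of that loop, assuming a fixed candidate list (adjust it to the sources you expect). Files.readAllLines surfaces decoding failures as MalformedInputException, which is the signal to move on:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class EncodingFallback {
    // Candidate encodings, tried in order; adjust to the sources you expect.
    private static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,
            StandardCharsets.ISO_8859_1);

    static List<String> readWithFallback(Path file) throws IOException {
        for (Charset cs : CANDIDATES) {
            try {
                // Throws MalformedInputException if 'cs' cannot decode the bytes.
                return Files.readAllLines(file, cs);
            } catch (MalformedInputException e) {
                // Decoding failed; fall through and try the next charset.
            }
        }
        throw new IOException("No candidate charset could decode " + file);
    }
}
```

Putting ISO-8859-1 last makes the loop total: it accepts any byte sequence, so the method only throws if you leave it off the list.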

Boice answered 8/10, 2014 at 23:53 Comment(6)
I tried ISO-8859-1 and it works well. I think it's for European characters, which is fine. I still don't know why UTF-16 doesn't work, though.Tucket
If you have Notepad++, you can try opening the text file; it will show the file's encoding in the Encoding menu. You can then adapt the code accordingly if you always get files from the same source.Glomerule
@JonathanLam Well, because if it's encoded with ISO-8859-1, then it's not UTF-16. These encodings are completely different. A file can't be both.Boice
@DawoodsaysreinstateMonica I believe I meant I was surprised UTF-16 didn't work as well as a catch-all for European characters like ISO-8859-1 seems to do. But thanks for the info (even if six years later) :PTucket
Sure. UTF-16 has all the European characters in it. But they're represented differently from ISO-8859-1. In ISO-8859-1, all characters are represented with only 8 bits, so you're limited to 256 possible characters. In UTF-16, most characters are represented with 16 bits, and some characters are represented with 32 bits. So there are a lot more possible characters in UTF-16, but an ISO-8859-1 file will only require half as much space as the same data would use in UTF-16.Boice
I didn't realise that: I was opening the BufferedReader with US_ASCII but my file was in UTF-8. When I changed it to UTF-8, it worked perfectly.Modification
Creating a BufferedReader with Files.newBufferedReader

Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);

may throw the following exception when the application runs:

java.nio.charset.MalformedInputException: Input length = 1

But

new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));

works well.

The difference is that the former uses the CharsetDecoder's default action.

The default action for malformed-input and unmappable-character errors is to report them.

while the latter uses the REPLACE action.

cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
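
If you want the lenient REPLACE behavior while still going through java.nio, one option (a sketch; the path is a placeholder) is to configure the CharsetDecoder yourself and hand it to InputStreamReader:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientReader {
    // Opens a UTF-8 reader that substitutes U+FFFD for undecodable bytes
    // instead of throwing MalformedInputException.
    static BufferedReader open(String path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), decoder));
    }
}
```
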
Unpractical answered 17/4, 2017 at 7:2 Comment(1)
The following also works, in case you wonder whether the cause is using a charset name instead of a predefined constant from StandardCharsets: new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"), StandardCharsets.UTF_8));Frankish
ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException. So it's good for debugging, even if your input is not actually in this charset. So:

req.setCharacterEncoding("ISO-8859-1");

I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
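
The guarantee holds because ISO-8859-1 maps every byte value 0x00–0xFF directly to the character with the same code point, so no byte sequence is ever malformed. A quick sketch:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        byte[] everyByte = new byte[256];
        for (int i = 0; i < 256; i++) {
            everyByte[i] = (byte) i;
        }
        // Decoding cannot fail: each byte 0xNN becomes the char U+00NN.
        String s = new String(everyByte, StandardCharsets.ISO_8859_1);
        System.out.println(s.length());           // 256
        System.out.println((int) s.charAt(0xE9)); // 233
    }
}
```

The flip side is that if the file was really UTF-8, multi-byte sequences decode to the wrong characters (mojibake) instead of failing, so treat it as a debugging aid rather than a silent default.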

Babbette answered 29/5, 2017 at 2:40 Comment(0)
I also encountered this exception with error message,

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)

and found that a strange bug occurs when trying to use

BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));

to write a String "orazg 54" cast from a generic type in a class.

//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");

This String is of length 9 containing chars with the following code points:

111 114 97 122 103 9 53 52 10

However, if the BufferedWriter in the class is replaced with:

FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));

it can successfully write this String without exceptions. In addition, if I write the same String created from the characters, it still works OK.

String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();

Previously I had never encountered any exception when using the first BufferedWriter to write Strings. It's a strange bug that occurs with BufferedWriter instances created by java.nio.file.Files.newBufferedWriter(path, options).
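
The two error actions can be demonstrated directly. This sketch (file names are placeholders) writes a string containing an unpaired surrogate, which no charset can encode: the Files.newBufferedWriter writer (REPORT) throws MalformedInputException when its buffer is flushed, while the OutputStreamWriter one (REPLACE) silently substitutes a replacement byte:

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.MalformedInputException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterActions {
    // REPORT (Files.newBufferedWriter): returns true if the write fails.
    static boolean reportThrows(String s, Path p) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(p)) {
            w.write(s); // the exception may also surface at the implicit close()
        } catch (MalformedInputException e) {
            return true;
        }
        return false;
    }

    // REPLACE (OutputStreamWriter over a raw stream): never throws for bad chars.
    static void replaceWrites(String s, Path p) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(p.toFile())))) {
            w.write(s);
        }
    }

    public static void main(String[] args) throws IOException {
        String bad = "orazg \uD800 54"; // \uD800 is an unpaired surrogate
        System.out.println(reportThrows(bad, Path.of("report.txt")));
        replaceWrites(bad, Path.of("replace.txt"));
    }
}
```

This matches the earlier answer about decoders: newBufferedWriter keeps the charset's default REPORT action, whereas OutputStreamWriter is configured with REPLACE, so the difference is documented behavior rather than a bug.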

Lakes answered 6/2, 2016 at 5:48 Comment(2)
This is somewhat off-topic, as the OP was talking about reading, rather than writing. I had a similar issue due to BufferedWriter.write(int) - which treats that int as a character and writes it directly to the stream. The workaround is to manually convert it to string and then write.Fowl
This is a sadly under voted answer, Really nice work Tom. I'm wondering if this has been resolved in later versions of Java.Sinter
Try this. I had the same issue, and the implementation below worked for me:

Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);

Then use the Reader wherever you want. For example:

CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}
Commensurable answered 29/5, 2018 at 13:32 Comment(1)
thanks this solved the problem, would have been hard to findBirkle
ISO_8859_1 worked for me! I was reading a text file with comma-separated values.

Diandrous answered 26/10, 2018 at 9:50 Comment(0)
I wrote the following to print a list of results to standard out based on the available charsets. It also tells you which line fails (zero-based), in case you are troubleshooting which character is causing issues.

public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName), charsets.get(k))) {
            while (b.ready()) {
                b.readLine();
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k + " failed on line " + line);
        }
        if (success)
            System.out.println("*************************  Success " + k);
    }
}
Doornail answered 25/4, 2017 at 19:55 Comment(0)
Well, the problem is that Files.newBufferedReader(Path path) is implemented like this:

public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}

so basically there is no point in specifying UTF-8 unless you want to be descriptive in your code. If you want to try a "broader" charset you could try with StandardCharsets.UTF_16, but you can't be 100% sure to get every possible character anyway.

Kaiak answered 22/2, 2016 at 16:21 Comment(0)
UTF-8 works for me with Polish characters

Oahu answered 31/7, 2019 at 10:24 Comment(0)
Adding an additional answer for the Quarkus mailer and Qute templates, as this is always the first result in Google no matter which parts of the stack trace I searched for:

If you're using the Quarkus mailer with a Qute template and get this MalformedInputException, check whether your templates folder contains files other than templates. In my case I had a .png file that I wanted to include in the mail, and it was automatically read as a template, which caused this encoding issue.

Dominic answered 13/12, 2022 at 14:3 Comment(0)
I tried with UTF-8, since it's about Vietnamese data, but it was wrong.

SOLUTION: Check the actual encoding of the file with Notepad++; in my case it was UTF-16 LE with BOM.

So I need to apply the same encoding in the code:

private static List<String[]> readCsvLinesFromFile(String filePath) {
    List<String[]> lines = new ArrayList<>();
    try (InputStreamReader isr = new InputStreamReader(
            new FileInputStream(filePath), StandardCharsets.UTF_16LE)) {
        // do your work
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}
Anderton answered 18/4 at 3:52 Comment(0)
You can try something like this, or just copy and paste the piece below.

boolean exception = true;
Charset charset = Charset.defaultCharset(); // Try the default one first.
Charset[] available = Charset.availableCharsets().values().toArray(new Charset[0]);
int index = 0;

while (exception) {
    try {
        lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword))
                values.add(line);
        }
        // No exception, we are done.
        exception = false;
    } catch (IOException e) {
        // Give up once every charset has been tried, instead of looping forever.
        if (index >= available.length)
            throw e;
        // Try the next charset.
        charset = available[index];
        index++;
    }
}
Cuvette answered 17/3, 2017 at 6:40 Comment(1)
The exception handler can potentially make the while(exception) loop forever if it never finds a working charset in the array. The exception handler should rethrow if the end of the array is reached and no working charset is found. Also, as of time of writing this answer had "-2" votes. I have upvoted it to "-1". I think the reason it got negative votes is because there is insufficient explanation. While I understand what the code does, other people may not. So a comment like "you can try something like this" may not be appreciated by some people.Mixup
