How to convert a Reader to InputStream and a Writer to OutputStream?
H

13

101

Is there an easy way to avoid dealing with text encoding problems?

Heavierthanair answered 15/9, 2008 at 11:51 Comment(0)
R
51

You can't really avoid dealing with the text encoding issues, but there are existing solutions in Apache Commons IO: ReaderInputStream (Reader to InputStream) and WriterOutputStream (Writer to OutputStream).

You just need to pick the encoding of your choice.

Reisfield answered 15/9, 2008 at 12:1 Comment(3)
FYI: the ReaderInputStream code has a bug in the way it reads bytes (it will not work for all encodings). Proof: illegalargumentexception.blogspot.com/2009/05/… There is an open bug: issues.apache.org/bugzilla/show_bug.cgi?id=40455 - Battaglia
You can find the classes in Apache's commons-io library: commons.apache.org/proper/commons-io - Helio
@McDowell, the bug you mentioned is in Apache Ant's implementation, not in commons-io's, so it's not relevant to this answer. - Oversubtlety
A
96

If you are starting off with a String you can also do the following:

new ByteArrayInputStream(inputString.getBytes("UTF-8"))
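On Java 7 and later, a variant of the line above can pass a `Charset` constant instead of a charset name, which avoids the checked `UnsupportedEncodingException`. The following is only a minimal sketch; the class and method names (`StringStreams`, `toInputStream`) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
class StringStreams {
    // Encodes the whole string up front, so this suits small inputs
    // (test fixtures, short messages) rather than very large documents.
    static InputStream toInputStream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

In a unit test, something like `System.setIn(StringStreams.toInputStream("42\n"))` would feed that text to code that reads standard input.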
Amu answered 13/7, 2010 at 9:35 Comment(2)
A good ReaderInputStream implementation would require less memory -- there should be no need to store all the bytes in an array at once. - Antimasque
I like this solution because it works when you need to unit test code that accepts input on (e.g.) standard input. - Tahoe
O
48

Well, a Reader deals with characters and an InputStream deals with bytes. The encoding specifies how you wish to represent your characters as bytes, so you can't really ignore the issue. As for avoiding problems, my opinion is: pick one charset (e.g. "UTF-8") and stick with it.
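To make the character/byte distinction concrete, here is a small sketch that encodes the same one-character string with three charsets. The class and method names (`CharsetBytesDemo`, `encode`) are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative demo class; the name is not from any library.
class CharsetBytesDemo {
    // Thin wrapper around String.getBytes, kept so the demo reads clearly.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // the single character 'é'
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_16BE)));   // [0, -23]
    }
}
```

One character becomes one or two bytes depending on the charset, which is why a Reader cannot be turned into an InputStream without choosing one.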

Regarding how to actually do it, as has been pointed out, "the obvious names for these classes are ReaderInputStream and WriterOutputStream." Surprisingly, "these are not included in the Java library" even though the 'opposite' classes, InputStreamReader and OutputStreamWriter are included.

So, lots of people have come up with their own implementations, including Apache Commons IO. Depending on licensing issues, you will probably be able to include the commons-io library in your project, or even copy a portion of the source code (which is downloadable here).

As you can see, both classes' documentation states that "all charset encodings supported by the JRE are handled correctly".

N.B. A comment on one of the other answers here mentions this bug. But that affects the Apache Ant ReaderInputStream class (here), not the Apache Commons IO ReaderInputStream class.

Ostraw answered 17/10, 2012 at 18:20 Comment(0)
Q
19

Also note that, if you're starting off with a String, you can skip creating a StringReader and create an InputStream in one step using org.apache.commons.io.IOUtils from Commons IO like so:

InputStream myInputStream = IOUtils.toInputStream(reportContents, "UTF-8");

Of course you still need to think about the text encoding, but at least the conversion is happening in one step.

Quail answered 3/3, 2010 at 18:21 Comment(1)
This method basically does new ByteArrayInputStream(report.toString().getBytes("utf-8")), which involves allocating two additional copies of the report in memory. If the report is large, that is bad. See my answer. - Consultative
C
11

Use:

new CharSequenceInputStream(html, StandardCharsets.UTF_8);

This approach does not require an upfront conversion to String and then to byte[], which allocates a lot more heap memory when the report is large. It converts to bytes on the fly as the stream is read, right from the StringBuffer.

It uses CharSequenceInputStream from Apache Commons IO project.

Consultative answered 19/12, 2014 at 11:56 Comment(0)
T
7

commons-io 2.0 has WriterOutputStream
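For readers who cannot add the library, the general idea of such a class can be sketched with the JDK's `CharsetDecoder`. This is a hypothetical, simplified class (`DecodingOutputStream` is a made-up name) and not the Commons IO implementation; it assumes that replacing undecodable bytes with U+FFFD is acceptable.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical JDK-only sketch; this is not the Commons IO class.
class DecodingOutputStream extends OutputStream {
    private final Writer writer;
    private final CharsetDecoder decoder;
    private final ByteBuffer bytes = ByteBuffer.allocate(128);
    private final CharBuffer chars = CharBuffer.allocate(128);
    private boolean closed = false;

    DecodingOutputStream(Writer writer, Charset charset) {
        this.writer = writer;
        // Assumption: undecodable bytes are replaced rather than reported.
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    @Override
    public void write(int b) throws IOException {
        if (closed) throw new IOException("stream closed");
        bytes.put((byte) b);
        if (!bytes.hasRemaining()) decode(false);
    }

    @Override
    public void flush() throws IOException {
        if (closed) return;
        decode(false);
        writer.flush();
    }

    @Override
    public void close() throws IOException {
        if (closed) return;
        closed = true;
        decode(true);         // signal that no more bytes will arrive
        decoder.flush(chars); // let the decoder emit any final characters
        drainChars();
        writer.close();
    }

    private void decode(boolean endOfInput) throws IOException {
        bytes.flip();
        CoderResult result;
        do {
            result = decoder.decode(bytes, chars, endOfInput);
            drainChars();
        } while (result.isOverflow());
        bytes.compact(); // keep an incomplete multi-byte sequence for later
    }

    private void drainChars() throws IOException {
        writer.write(chars.array(), 0, chars.position());
        chars.clear();
    }
}
```

Wrapping a `StringWriter` in `new DecodingOutputStream(sink, StandardCharsets.UTF_8)` and writing UTF-8 bytes to it should leave the decoded text in the writer once the stream is closed.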

Tynan answered 24/11, 2010 at 15:41 Comment(0)
D
5

You can't avoid text encoding issues, but Apache commons-io has ReaderInputStream and WriterOutputStream.

Note these are the libraries referred to in Peter's answer of koders.com, just links to the library instead of source code.

Dipsomania answered 15/9, 2008 at 11:52 Comment(0)
B
5

The obvious names for these classes are ReaderInputStream and WriterOutputStream. Unfortunately these are not included in the Java library. However, google is your friend.

I'm not sure that it is going to get around all text encoding problems, which are nightmarish.

There is an RFE, but it's Closed, will not fix.

Byrle answered 15/9, 2008 at 12:0 Comment(1)
bugs.openjdk.java.net/browse/JDK-4103785 contains the comment "we have a public API for character-set coding ... no compelling reason to add these classes" -- so how does one do this in Java 7, without additional libraries, twelve years down the road? - Antimasque
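One possible library-free approach builds on the public character-set coding API that the RFE comment refers to. The sketch below encodes lazily with `CharsetEncoder`; the class name `EncodingInputStream`, the buffer sizes, and the choice to replace unencodable characters are all assumptions made for illustration, not the Commons IO implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical JDK-only sketch of a Reader-to-InputStream adapter.
class EncodingInputStream extends InputStream {
    private final Reader reader;
    private final CharsetEncoder encoder;
    private final CharBuffer chars = CharBuffer.allocate(1024);
    private final ByteBuffer bytes = ByteBuffer.allocate(4096);
    private boolean endOfInput = false;
    private boolean flushed = false;

    EncodingInputStream(Reader reader, Charset charset) {
        this.reader = reader;
        // Assumption: unencodable characters are replaced, not reported.
        this.encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        chars.flip(); // nothing to encode yet
        bytes.flip(); // nothing to hand out yet
    }

    @Override
    public int read() throws IOException {
        while (!bytes.hasRemaining()) {
            if (flushed) return -1;
            fill();
        }
        return bytes.get() & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Reads more characters (if any) and encodes them into the byte buffer.
    private void fill() throws IOException {
        bytes.clear();
        if (!endOfInput) {
            chars.compact(); // keep leftovers, e.g. half of a surrogate pair
            if (reader.read(chars) == -1) endOfInput = true;
            chars.flip();
        }
        CoderResult result = encoder.encode(chars, bytes, endOfInput);
        if (endOfInput && result.isUnderflow()) {
            // All characters are consumed; let the encoder finish up.
            if (encoder.flush(bytes).isUnderflow()) flushed = true;
        }
        bytes.flip();
    }
}
```

Only the two fixed-size buffers are held in memory at a time, so a large Reader is never materialised as one byte array. A production version would also override `read(byte[], int, int)` for speed.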
S
4

Are you trying to write the contents of a Reader to an OutputStream? If so, you'll have an easier time wrapping the OutputStream in an OutputStreamWriter and writing the chars from the Reader to the Writer, instead of trying to convert the Reader to an InputStream:

final Writer writer = new BufferedWriter(new OutputStreamWriter( urlConnection.getOutputStream(), "UTF-8" ) );
int charsRead;
char[] cbuf = new char[1024];
while ((charsRead = data.read(cbuf)) != -1) {
    writer.write(cbuf, 0, charsRead);
}
writer.flush();
// don't forget to close the writer in a finally {} block
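The snippet above depends on `data` and `urlConnection` from its surrounding code. A self-contained variant of the same idea might look like the sketch below, where the class and method names (`ReaderCopy`, `copy`) are hypothetical and the caller is assumed to own and close both streams.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper; names are for illustration only.
class ReaderCopy {
    // Copies all characters from the reader to the stream, encoding them
    // with the given charset. Neither argument is closed here.
    static void copy(Reader in, OutputStream out, Charset charset) throws IOException {
        Writer writer = new OutputStreamWriter(out, charset);
        char[] buffer = new char[1024];
        int charsRead;
        while ((charsRead = in.read(buffer)) != -1) {
            writer.write(buffer, 0, charsRead);
        }
        writer.flush(); // push any bytes still buffered in the writer to the stream
    }
}
```

For example, `ReaderCopy.copy(new StringReader("text"), outputStream, StandardCharsets.UTF_8)` writes the UTF-8 bytes of the string to `outputStream`.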
Sanction answered 1/9, 2009 at 4:3 Comment(0)
S
2

You can use Cactoos (no static methods, only objects):

You can convert the other way around too:

Sealer answered 6/8, 2017 at 18:7 Comment(0)
W
1

A warning when using WriterOutputStream: it doesn't always handle writing binary data to a file properly, or in the same way as a regular output stream. I had an issue with this that took me a while to track down.

If you can, I'd recommend using an output stream as your base, and if you need to write strings, use an OutputStreamWriter wrapper around the stream to do it. It is far more reliable to convert text to bytes than the other way around, which is likely why WriterOutputStream is not a part of the standard Java library.
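The asymmetry can be illustrated with a short sketch: well-formed text survives a trip through bytes, but arbitrary bytes do not necessarily survive a trip through text. The class and method names here (`RoundTripDemo`, `survivesEncoding`, `survivesDecoding`) are made up, and UTF-8 is assumed as the charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative demo class; the names are not from any library.
class RoundTripDemo {
    // text -> bytes -> text: lossless for well-formed text in UTF-8.
    static boolean survivesEncoding(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        return text.equals(new String(encoded, StandardCharsets.UTF_8));
    }

    // bytes -> text -> bytes: malformed sequences are replaced with U+FFFD
    // during decoding, so the original bytes may be lost.
    static boolean survivesDecoding(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        System.out.println(survivesEncoding("h\u00e9llo")); // true
        System.out.println(survivesDecoding(binary));       // false
    }
}
```

The bytes `0xFF` and `0xFE` are never valid in UTF-8, so decoding replaces them and the original binary data cannot be recovered from the resulting characters.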

Wellesz answered 5/7, 2013 at 16:14 Comment(0)
S
0

This is the source code for a simple UTF-8 based encoding WriterOutputStream and ReaderInputStream. Tested at the end.

    // https://www.woolha.com/tutorials/deno-utf-8-encoding-decoding-examples
    public class WriterOutputStream extends OutputStream {
        final Writer    writer;

        int             count       = 0;
        int             codepoint   = 0;

        public WriterOutputStream(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void write(int b) throws IOException {
            b &= 0xFF;
            switch (b >> 4) {
            case 0b0000:
            case 0b0001:
            case 0b0010:
            case 0b0011:
            case 0b0100:
            case 0b0101:
            case 0b0110:
            case 0b0111:
                count = 1;
                codepoint = b;
                break;

            case 0b1000:
            case 0b1001:
            case 0b1010:
            case 0b1011:
                codepoint <<= 6;
                codepoint |= b & 0b0011_1111;
                break;

            case 0b1100:
            case 0b1101:
                count = 2;
                codepoint = b & 0b0001_1111;
                break;

            case 0b1110:
                count = 3;
                codepoint = b & 0b0000_1111;
                break;

            case 0b1111:
                count = 4;
                codepoint = b & 0b0000_0111;
                break;
            }
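            if (--count == 0) {
                // write(int) keeps only the low 16 bits, so code points above
                // U+FFFF must be written as a surrogate pair.
                if (Character.isSupplementaryCodePoint(codepoint))
                    writer.write(Character.toChars(codepoint));
                else
                    writer.write(codepoint);
            }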
        }
    }

    public class ReaderInputStream extends InputStream {
        final Reader    reader;
        int             count   = 0;
        int             codepoint;

        public ReaderInputStream(Reader reader) {
            this.reader = reader;
        }

        @Override
        public int read() throws IOException {
            if (count-- > 0) {
                int r = codepoint >> (count * 6);
                r &= 0b0011_1111;
                r |= 0b1000_0000;
                return r;
            }

            codepoint = reader.read();
            if (codepoint < 0)
                return -1;
            if (codepoint > 0xFFFF)
                return 0;

            if (codepoint < 0x80)
                return codepoint;

            if (codepoint < 0x800) {
                count = 1;
                int v = (codepoint >> 6) | 0b1100_0000;
                return v;
            }
            count = 2;
            int v = (codepoint >> 12) | 0b1110_0000;
            return v;
        }
    }

And here is the test case. It verifies that nearly all of the 65536 char values (the loop stops before 0xFFFF) are properly encoded and decoded, and that the result matches Java's own encoding. The comparison against Java's encoding is skipped for surrogates (the two-char encoding), since Java handles those itself.

    @Test
    public void testAll() throws IOException {
        for (char i = 0; i < 0xFFFF; i++) {
            CharArrayReader car = new CharArrayReader(new char[] { i });
            ReaderInputStream rtoi = new ReaderInputStream(car);
            byte[] data = IO.read(rtoi);

            CharArrayWriter caw = new CharArrayWriter();
            try (WriterOutputStream wtoo = new WriterOutputStream(caw)) {
                wtoo.write(data);
                char[] translated = caw.toCharArray();
                assertThat(translated.length).isEqualTo(1);
                assertThat((int) translated[0]).isEqualTo(i);

                if (!Character.isSurrogate((char) i)) {
                    try (InputStream stream = new ByteArrayInputStream(data)) {
                        caw = new CharArrayWriter();
                        IO.copy(data, caw);
                        translated = caw.toCharArray();
                        assertThat(translated.length).isEqualTo(1);
                        assertThat((int) translated[0]).isEqualTo(i);
                    }
                }
            }
        }
    }

Selfemployed answered 15/2, 2021 at 11:1 Comment(0)
L
-1

To read a string as a stream using just what Java supplies:

InputStream s = new BufferedInputStream( new ReaderInputStream( new StringReader("a string")));
Lodestone answered 7/1, 2015 at 14:21 Comment(1)
ReaderInputStream is in Apache Commons IO. - Freak
