InputStreamReader buffering issue
C

6

11

I am reading data from a file that has, unfortunately, two types of character encoding.

There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content/length.

The file may also be quite large, so I need to avoid bringing the entire content into memory.

So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.

Unfortunately it appears (and the Javadoc confirms this) that InputStreamReader may read ahead for efficiency purposes. So reading the header chews up some or all of the body.
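To make the read-ahead visible, here is a minimal sketch. It assumes a made-up header of the form "charset=UTF-8\n" and uses a ByteArrayInputStream in place of the file; the class and method names are only for illustration. It reads a single character through an InputStreamReader and then checks how many bytes have left the underlying stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Illustration only: class and method names are hypothetical.
class ReadAheadDemo {
    /** Reads one char through an InputStreamReader and returns how many bytes it took from the stream. */
    static int bytesConsumedForOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read(); // ask for exactly one character
        return data.length - in.available();
    }

    public static void main(String[] args) throws IOException {
        // Assumed header format, followed by a few body bytes.
        byte[] data = "charset=UTF-8\nbody".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bytes consumed for one char: "
                + bytesConsumedForOneChar(data) + " of " + data.length);
    }
}
```

On the JDKs I know of, the reader fills an internal byte buffer in one go, so the count is typically far more than one byte. The exact amount is an implementation detail and should not be relied on.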

Does anyone have any suggestions for working round this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?

Thanks in advance.

EDIT: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without chewing part of the body. Although reading one byte at a time is not terribly efficient, I wrap the raw InputStream with a BufferedInputStream so it won't be an issue.

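import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// An InputStreamReader that only consumes as many bytes as necessary.
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader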
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 1 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException
    {
        boolean middleOfReading = false;

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                if ( middleOfReading )
                    throw new IOException( "Unexpected end of stream, byte truncated" );

                return -1;
            }

            byteBuffer.clear();
            byteBuffer.put( (byte)b );
            byteBuffer.flip();

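            // decode(ByteBuffer) is a complete decoding operation, so a lone lead byte of a
            // multi-byte character throws MalformedInputException; single-byte input such as
            // the ASCII header decodes fine
            CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );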

            // although this is theoretically possible this would violate the unbuffered nature
            // of this class so we throw an exception
            if ( charBuffer.length() > 1 )
                throw new IOException( "Decoded multiple characters from one byte!" );

            if ( charBuffer.length() == 1 )
                return charBuffer.get();

            middleOfReading = true;
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            // honour the caller's offset into the destination array
            cbuf[ off + i ] = (char)ch;
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}
Coarsegrained answered 13/4, 2010 at 16:53 Comment(4)
Maybe I'm wrong, but until now I thought a file could have only one encoding at a time.Crossindex
@Roman: You can do anything you want with files; they're just sequences of bytes. So you can write out a bunch of bytes that are meant to be interpreted as ASCII, then write out a bunch more bytes meant to be interpreted as UTF-16, and even more bytes meant to be interpreted as UTF-32. I'm not saying it's a good idea, although the OP's use case is certainly reasonable (you have to have some way of indicating what encoding a file uses, after all).Glyceryl
@Mike Q - Good idea the InputStreamReaderUnbuffered. I suggest a separate answer - it deserves the attention :)Microsporangium
Regarding InputStreamReaderUnbuffered solution: If the byte buffer is of size 1, how do you consume 2 bytes that are part of a single character?Microsporangium
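On the multi-byte question: one possible approach, sketched here under assumptions and not taken from the original post, is to keep the undecoded bytes around and call the three-argument CharsetDecoder.decode with endOfInput set to false, so an incomplete sequence is reported as underflow and not as an error. OneCharDecoder is a hypothetical name, and the 8-byte pending buffer is an assumed upper bound on the bytes of one character.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes one character at a time and never pulls more
// bytes from the stream than that character needs.
class OneCharDecoder {
    private final InputStream in;
    private final CharsetDecoder decoder;
    // Bytes of the character in progress; 8 is an assumed upper bound.
    private final ByteBuffer pending = ByteBuffer.allocate(8);
    // Room for a surrogate pair, which one 4-byte UTF-8 sequence produces.
    private final CharBuffer out = CharBuffer.allocate(2);
    private int lowSurrogate = -1;

    OneCharDecoder(InputStream in, Charset charset) {
        this.in = in;
        this.decoder = charset.newDecoder();
        pending.flip(); // start empty, ready to be drained
    }

    /** Returns the next char, or -1 at end of stream. */
    int read() throws IOException {
        if (lowSurrogate != -1) { // second half of a pair decoded earlier
            int c = lowSurrogate;
            lowSurrogate = -1;
            return c;
        }
        out.clear();
        while (true) {
            int b = in.read();
            if (b == -1) {
                if (pending.hasRemaining())
                    throw new EOFException("Stream ended inside a character");
                return -1;
            }
            pending.compact(); // keep undecoded bytes, switch to filling
            pending.put((byte) b);
            pending.flip();
            // endOfInput=false: an incomplete sequence is underflow, not an error
            CoderResult result = decoder.decode(pending, out, false);
            if (result.isError())
                result.throwException();
            if (out.position() > 0) {
                out.flip();
                char first = out.get();
                if (out.hasRemaining())
                    lowSurrogate = out.get();
                return first;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Two-byte, three-byte and one-byte UTF-8 characters.
        byte[] data = "\u00e9\u20acA".getBytes(StandardCharsets.UTF_8);
        ByteArrayInputStream src = new ByteArrayInputStream(data);
        OneCharDecoder d = new OneCharDecoder(src, StandardCharsets.UTF_8);
        int c;
        while ((c = d.read()) != -1)
            System.out.println(Integer.toHexString(c) + ", bytes left: " + src.available());
    }
}
```

The buffer only ever holds the bytes of the character being assembled, so a second reader created on the same stream afterwards starts exactly at the next character.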
S
3

Why don't you use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.
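A rough sketch of the two-stream idea, assuming the data lives in a file that can be opened twice and assuming, purely for illustration, a header that is a single ASCII line "charset=NAME\n". Because the header is ASCII, the number of characters the parser consumed equals the number of bytes to skip, so read-ahead on the first stream does no harm. TwoStreamBody and openBody are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the header format is an assumption for this sketch.
class TwoStreamBody {
    /** Returns a Reader positioned at the first character of the body. */
    static Reader openBody(Path file) throws IOException {
        String headerLine;
        // First stream: parse the header. Read-ahead is harmless because
        // this stream is thrown away afterwards.
        try (Reader header = new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.US_ASCII)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = header.read()) != '\n') {
                if (c == -1)
                    throw new EOFException("No header terminator found");
                sb.append((char) c);
            }
            headerLine = sb.toString();
        }
        // ASCII: one byte per character, plus the terminating '\n'.
        long headerBytes = headerLine.length() + 1;
        Charset bodyCharset =
                Charset.forName(headerLine.substring("charset=".length()));

        // Second stream: skip the header bytes, then decode the body.
        InputStream in = Files.newInputStream(file);
        long skipped = 0;
        while (skipped < headerBytes) {
            long n = in.skip(headerBytes - skipped);
            if (n <= 0) {
                in.close();
                throw new EOFException("File shorter than its header");
            }
            skipped += n;
        }
        return new InputStreamReader(in, bodyCharset);
    }
}
```

The caller is responsible for closing the returned Reader, which also closes the second stream.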

Sassaby answered 13/4, 2010 at 17:2 Comment(2)
Thanks I think I'll have to do this.Coarsegrained
How do you know what to skip? You need to read the header in order to know where it ends. Once you start reading the header with an InputStreamReader, it can chew on bytes from the body.Microsporangium
C
3

Here is the pseudo code.

  1. Use InputStream, but do not wrap a Reader around it.
  2. Read bytes containing header and store them into ByteArrayOutputStream.
  3. Create ByteArrayInputStream from ByteArrayOutputStream and decode header, this time wrap ByteArrayInputStream into Reader with ASCII charset.
  4. Compute the length of non-ascii input, and read that number of bytes into another ByteArrayOutputStream.
  5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it with Reader with charset from the header.
Contraption answered 13/4, 2010 at 17:6 Comment(1)
Thanks for your suggestion. Unfortunately the header is not fixed length, either in binary or character terms, so I do need to parse it through a Charset decoder to figure out its structure and therefore its length. I also need to avoid reading the entire content into an internal buffer.Coarsegrained
G
1

My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before giving the stream to the new InputStreamReader.

If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (since you can't assume you can reset the position with reset, it may not be supported).

Glyceryl answered 13/4, 2010 at 17:3 Comment(2)
If you create multiple FileInputStreams with the same FileDescriptor, then they will behave as if they are the same stream.Jacquelyn
@Tom: Yeah, I was assuming he would use them in series, not in parallel, and that he would reset the position between using one and using the other. But you can't assume you can reset the position... (I don't think they'll behave like the same stream, I think it would be worse than that; they'd just share actual file position. Data caching within the individual instances could in theory make that really, really messy if you tried to use them in parallel.)Glyceryl
J
1

I suggest rereading the stream from the start with a new InputStreamReader. Perhaps assume that InputStream.mark is supported.
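One way this could look, as a sketch under assumptions: the raw stream is wrapped in a BufferedInputStream so that mark/reset is available, the header is again assumed to be a single ASCII line "charset=NAME\n", and the caller picks a markLimit large enough to cover the header plus whatever the header reader reads ahead. MarkResetBody and openBody are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the header format is an assumption for this sketch.
class MarkResetBody {
    /** Returns a Reader positioned at the first character of the body. */
    static Reader openBody(InputStream raw, int markLimit) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(markLimit);

        // Do not close this reader: that would close the shared stream.
        Reader header = new InputStreamReader(in, StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = header.read()) != '\n') {
            if (c == -1)
                throw new EOFException("No header terminator found");
            sb.append((char) c);
        }

        // Rewind to the start; this fails with an IOException if the mark
        // was invalidated because too many bytes were read.
        in.reset();

        // ASCII: one byte per character, plus the terminating '\n'.
        long toSkip = sb.length() + 1;
        while (toSkip > 0) {
            long n = in.skip(toSkip);
            if (n <= 0)
                throw new EOFException("Stream shorter than its header");
            toSkip -= n;
        }
        String name = sb.substring("charset=".length());
        return new InputStreamReader(in, Charset.forName(name));
    }

    public static void main(String[] args) throws IOException {
        byte[] head = "charset=UTF-16BE\n".getBytes(StandardCharsets.US_ASCII);
        byte[] body = "body text".getBytes(StandardCharsets.UTF_16BE);
        byte[] all = new byte[head.length + body.length];
        System.arraycopy(head, 0, all, 0, head.length);
        System.arraycopy(body, 0, all, head.length, body.length);

        Reader r = openBody(new ByteArrayInputStream(all), 1 << 16);
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = r.read()) != -1)
            out.append((char) c);
        System.out.println(out);
    }
}
```

Only the header region is held in the buffer, so the body is still streamed. The weak point is choosing markLimit, since the amount of read-ahead is not specified.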

Jacquelyn answered 13/4, 2010 at 17:6 Comment(0)
C
1

It's even easier:

As you said, your header is always in ASCII. So read the header bytes directly from the InputStream, and when you're done with it, create the Reader with the correct encoding and read from it:

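private Reader reader;
private InputStream stream;
private Charset encoding;        // set once the header has been parsed
private boolean headerFullyRead; // set by the header parsing logic

public void read() throws IOException {
    int c;
    // The header is ASCII, so each byte read from the raw stream is one character
    while ((c = stream.read()) != -1) {
        // Read encoding: parse the header byte here, then set
        // encoding and headerFullyRead once the header ends
        if ( headerFullyRead ) {
            reader = new InputStreamReader( stream, encoding );
            break;
        }
    }
    if ( reader == null ) {
        throw new EOFException( "Stream ended before the header was complete" );
    }
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}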
Cl answered 29/6, 2010 at 8:43 Comment(1)
Thanks. Eventually I went with another solution which was to write an InputStreamReaderUnbuffered which does exactly the same as InputStreamReader but has no internal buffer so you never read too much. See my edit.Coarsegrained
K
1

If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside of InputStreamReader.

This way we don't have to rewrite the InputStreamReader logic.

import java.io.IOException;
import java.io.InputStream;

public class OneByteReadInputStream extends InputStream
{
    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream)
    {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException
    {
        return inputStream.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
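        // Never hand back more than one byte per call; a zero-length request stays zero-length
        return super.read(b, off, Math.min(len, 1));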
    }
}

To construct:

new InputStreamReader(new OneByteReadInputStream(inputStream), StandardCharsets.US_ASCII);
Kailakaile answered 25/2, 2015 at 18:23 Comment(0)
