Read both text and binary data from InputStream [duplicate]

I am trying to read data from a binary stream, portions of which should be parsed as UTF-8.

Using the InputStream directly for the binary data and an InputStreamReader on top of it for the UTF-8 text does not work as the reader will read ahead and mess up the subsequent binary data even if it is told to read a maximum of n characters.

I recognize this question is very similar to Read from InputStream in multiple formats, but the solution proposed there is specific to HTTP streams, which does not help me.

I thought of just reading everything as binary data and converting the relevant pieces to text afterwards. But I only have the length information of the character data in characters, not in bytes. Thus, I need the thing which reads characters from the stream to be aware of the encoding.

Is there a way to tell InputStreamReader not to read ahead further than is needed for reading the given number of characters? Or is there a reader that supports both binary data and text with an encoding and can be switched between these modes on the fly?

Wangle answered 30/6, 2011 at 7:2 Comment(0)

You need to read the binary portions first. Where you recognise a run of bytes that needs UTF-8 decoding, extract those bytes and decode them.

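// 'in' is the underlying InputStream (name assumed; the original snippet left the right-hand side out)
DataInputStream dis = new DataInputStream(in);
// read a binary type.
int num = dis.readInt();
// length of the UTF-8 portion, in bytes
int len = dis.readUnsignedShort();
// read a UTF-8 portion.
byte[] bytes = new byte[len];
dis.readFully(bytes);
// StandardCharsets (java.nio.charset, Java 7+) avoids the checked exception of the "UTF-8" string form
String text = new String(bytes, StandardCharsets.UTF_8);
// read some binary
double d = dis.readDouble();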

Wirework answered 30/6, 2011 at 7:31 Comment(5)

The problem is, with UTF8, the number of bytes can be different from the number of characters. So I would need to find out the number of multi-byte characters in the string, read more bytes and convert again and do this over and over until the numbers match. – Wangle 30/6, 2011 at 8:17

I would say your format isn't very easy to decode, and I would fix it if you can. However, you can parse the UTF-8 yourself if you know the number of characters. (But sending the actual number of bytes would be much simpler.) – Wirework 30/6, 2011 at 8:28
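Parsing the UTF-8 by hand, as suggested in the comment above, could look roughly like the sketch below. It relies on the fact that the first byte of every UTF-8 sequence tells you how many continuation bytes follow, so you never have to read past the text. The class and method names are made up for illustration, and the sketch assumes the length field counts Unicode code points; if your format counts Java chars (UTF-16 code units), a 4-byte sequence would have to count as two.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Utf8CharReader {

    /**
     * Reads exactly charCount UTF-8 encoded code points from the stream
     * and consumes no bytes beyond the last one.
     */
    static String readUtf8Chars(DataInputStream in, int charCount) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < charCount; i++) {
            int lead = in.readUnsignedByte();
            int extra; // number of continuation bytes announced by the lead byte
            if ((lead & 0x80) == 0x00) {
                extra = 0; // 0xxxxxxx
            } else if ((lead & 0xE0) == 0xC0) {
                extra = 1; // 110xxxxx
            } else if ((lead & 0xF0) == 0xE0) {
                extra = 2; // 1110xxxx
            } else if ((lead & 0xF8) == 0xF0) {
                extra = 3; // 11110xxx
            } else {
                throw new IOException("Invalid UTF-8 lead byte: " + lead);
            }
            buf.write(lead);
            for (int j = 0; j < extra; j++) {
                int b = in.readUnsignedByte();
                if ((b & 0xC0) != 0x80) { // continuation bytes look like 10xxxxxx
                    throw new IOException("Invalid UTF-8 continuation byte: " + b);
                }
                buf.write(b);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

After `readUtf8Chars(dis, n)` returns, the same `DataInputStream` is positioned at the first byte after the text, so you can carry on with `readInt()`, `readDouble()` and so on.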

Another approach is to read more data than needed: call mark(), read ahead and decode, take the expected number of characters (e.g. with substring()), and re-encode them as UTF-8 to determine their length in bytes. Then reset() and read exactly that many bytes. (This only works if the re-encoding matches the original bytes exactly :| For example, the NUL character \0 can be encoded in two different ways, as can other characters.) – Wirework 30/6, 2011 at 8:34
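A rough sketch of the mark()/reset() idea from the comment above, in case it helps. The names are hypothetical. It assumes the stream supports mark/reset (e.g. a BufferedInputStream), that the count is in Java chars (UTF-16 code units, as String.length() would report), and that the text was written as standard UTF-8, so that re-encoding reproduces the original byte length.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class MarkResetTextReader {

    /** Reads charCount chars of UTF-8 text and leaves the stream right after them. */
    static String readChars(InputStream in, int charCount) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        // A Java char never needs more than 3 UTF-8 bytes
        // (a 4-byte sequence decodes to two chars).
        int maxBytes = charCount * 3;
        in.mark(maxBytes);

        // Read ahead, possibly into the binary data that follows the text.
        byte[] buf = new byte[maxBytes];
        int n = 0;
        while (n < maxBytes) {
            int r = in.read(buf, n, maxBytes - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }

        // Decode everything, keep only the characters we were promised.
        String decoded = new String(buf, 0, n, StandardCharsets.UTF_8);
        if (decoded.length() < charCount) {
            throw new EOFException("stream ended inside the text");
        }
        String text = decoded.substring(0, charCount);

        // Re-encode to learn how many bytes the text really occupied.
        int byteLen = text.getBytes(StandardCharsets.UTF_8).length;

        // Rewind and consume exactly those bytes.
        in.reset();
        int skipped = 0;
        while (skipped < byteLen) {
            long s = in.skip(byteLen - skipped);
            if (s <= 0) {
                throw new EOFException("could not skip the text bytes");
            }
            skipped += (int) s;
        }
        return text;
    }
}
```

This decodes and re-encodes the text, so it does more work than walking the lead bytes directly, and it silently gives wrong positions if the writer used a non-standard encoding such as the two-byte NUL mentioned in the comment.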

A rule of thumb is that if you need to make the encoding or decoding harder, make the encoding harder and the decoding easier. – Wirework 30/6, 2011 at 8:37

Ok, I decided to change the format, as that indeed seemed the easiest way. – Wangle 30/6, 2011 at 8:46

I think you simply should not use an InputStreamReader. Readers deal with text, but you are dealing with text and binary data together.

There is no way to make a reader do both. You have to read binary buffers and interpret your format yourself, i.e. find the position of the text, extract the bytes and convert them to a String.

To simplify this task I'd recommend creating your own class (let's say ProtocolRecord). It should be Serializable and contain all your fields. Now you have 2 options:

(1) The simple one: use the Java serialization mechanism. In this case you just have to wrap your stream with an ObjectInputStream for reading and an ObjectOutputStream for writing, and then read/write your objects. The disadvantage of this approach is that you cannot control your protocol.

(2) Implement the methods readObject() and writeObject() yourself. The object streams passed to them implement DataInput and DataOutput, so inside these methods you can read and write fields the same way you would with DataInputStream and DataOutputStream. In this case you do have to implement the serialization protocol, but at least it is encapsulated in your class.

I think that DataInputStream is what you need.
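To make the "encapsulate the protocol in your own class" idea a bit more concrete, here is a minimal sketch that skips Java serialization entirely and works directly on DataInputStream/DataOutputStream. The field names, the field layout (int, length-prefixed UTF-8 text, double) and the writeTo/readFrom method names are assumptions for illustration only. It also assumes you are free to store the text length in bytes rather than characters, which is what makes reading trivial.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical record type; the layout is an example, not a real protocol.
class ProtocolRecord {
    final int num;
    final String text;
    final double value;

    ProtocolRecord(int num, String text, double value) {
        this.num = num;
        this.text = text;
        this.value = value;
    }

    /** Writes the record; the text is prefixed with its length in BYTES. */
    void writeTo(DataOutputStream out) throws IOException {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 0xFFFF) {
            throw new IOException("text too long for a 16-bit length prefix");
        }
        out.writeInt(num);
        out.writeShort(utf8.length); // unsigned 16-bit byte count
        out.write(utf8);
        out.writeDouble(value);
    }

    /** Reads a record written by writeTo, consuming exactly its bytes. */
    static ProtocolRecord readFrom(DataInputStream in) throws IOException {
        int num = in.readInt();
        int len = in.readUnsignedShort();
        byte[] utf8 = new byte[len];
        in.readFully(utf8); // blocks until all len bytes have arrived
        String text = new String(utf8, StandardCharsets.UTF_8);
        double value = in.readDouble();
        return new ProtocolRecord(num, text, value);
    }
}
```

The writer does the slightly harder job (encoding first, then measuring), and the reader never has to guess where the text ends.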

Townsville answered 30/6, 2011 at 7:18 Comment(0)
