Stream decoding of Base64 data

I have some large Base64-encoded data (stored in Snappy files in the Hadoop filesystem). This data was originally gzipped text. I need to read chunks of the encoded data, decode them, and flush the result to a GZIPOutputStream.

Any ideas on how I could do this without loading the whole Base64 data into an array and calling Base64.decodeBase64(byte[])?

Would it be correct to read the characters up to the '\r\n' delimiter and decode the data line by line? For example:

for (int i = 0; i < byteData.length; i++) {
    if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
       if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
            i += 2;
       else 
            i += 1;

       byteBuffer.put(Base64.decodeBase64(record));

       byteCounter = 0;
       record = new byte[8192];
    } else {
        record[byteCounter++] = byteData[i];
    }
}

Sadly, this approach doesn't give any human readable output. Ideally, I would like to stream read, decode, and stream out the data.

Right now, I'm wrapping the decoded bytes in an InputStream and copying that to the GZIPOutputStream:

byteBuffer.get(bufferBytes);

InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);

This throws java.io.IOException: Corrupt GZIP trailer

Composed answered 14/11, 2013 at 14:31 Comment(2)
byteBuffer.put(Base64.decodeBase64(record)); Shouldn't that be byteBuffer.put(Base64.encodeBase64(record)); - Carp
The 'record' is Base64 encoded. I'm trying to get the decoded data and add it to the ByteBuffer. - Composed

Let's go step by step:

  1. You need a GZIPInputStream to read gzipped data (not a GZIPOutputStream, which is used to compress data). Reading from this stream gives you the uncompressed, original data. Its constructor requires an InputStream.

  2. You need an input stream capable of reading the Base64-encoded data. I suggest the handy Base64InputStream from apache-commons-codec. Its constructor takes a doEncode flag (set it to false to decode) plus a line length and a line separator. This stream in turn requires another input stream: the raw, Base64-encoded data.

  3. This last stream depends on how you get your data. Ideally the data is already available as an InputStream, and the problem is solved. If not, you can wrap it yourself: use a ByteArrayInputStream for a byte array, or for a String, a ByteArrayInputStream over its bytes (StringBufferInputStream is deprecated).

Roughly this logic is:

InputStream fromHadoop = ...;                                  // step 3
Base64InputStream b64is =                                      // step 2
    new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is);              // step 1

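A note on the Base64InputStream arguments (line length and end-of-line byte array): according to the Commons Codec Javadoc, both are used only when encoding (doEncode=true) and are ignored when decoding, so the exact values passed here should not matter.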
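For readers without Commons Codec on the classpath, the same three-layer pipeline can be sketched with JDK classes only. This is a minimal sketch, not the exact method from the answer: it swaps Base64InputStream for java.util.Base64.getMimeDecoder().wrap(...) (Java 8+), which ignores line separators while decoding. The class name StreamingBase64Gunzip and the helper gzipThenBase64 are hypothetical, and the in-memory sample stands in for the real stream from Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamingBase64Gunzip {

    // Reads Base64 text from 'encoded', decodes and gunzips it on the fly,
    // and writes the original bytes to 'out' through a small fixed buffer.
    static void decodeAndGunzip(InputStream encoded, OutputStream out) throws IOException {
        try (InputStream b64 = Base64.getMimeDecoder().wrap(encoded);   // step 2
             InputStream gunzipped = new GZIPInputStream(b64)) {        // step 1
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gunzipped.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    // Hypothetical helper that builds sample input: gzip the text, then
    // Base64-encode it with CRLF line breaks every 76 characters.
    static byte[] gzipThenBase64(String text) throws IOException {
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(gz)) {
            g.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getMimeEncoder().encode(gz.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        String original = sb.toString();

        // Stand-in for the stream obtained from Hadoop (step 3).
        InputStream fromSource = new ByteArrayInputStream(gzipThenBase64(original));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        decodeAndGunzip(fromSource, out);

        String restored = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + original.equals(restored));
    }
}
```

Only the 8 KB buffer is held in memory at any time, so the size of the encoded input does not matter. If the goal is to store the data gzipped again, the decoded Base64 stream can be copied to the destination directly, without the GZIPInputStream layer.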

Pyxie answered 14/11, 2013 at 15:10 Comment(1)
Thanks a lot, Nikos. The Base64InputStream class helped. - Composed

Thanks to Nikos for pointing me in the right direction. Specifically this is what I did:

private static final byte NEWLINE = (byte) '\n';
private static final byte CARRIAGE_RETURN = (byte) '\r';

byte[] lineSeparators = new byte[] {CARRIAGE_RETURN, NEWLINE};      
Base64InputStream b64is = new Base64InputStream(inputStream, false, 76, lineSeparators);

GZIPInputStream zis = new GZIPInputStream(b64is);

Isn't 76 the length of the Base64 line? I didn't try with 80, though.

Composed answered 15/11, 2013 at 8:41 Comment(1)
If the line length were fixed at 76, they wouldn't have included the constructor argument. Also think about data URIs, where the whole thing is one line. - Kalman
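The line-length question can be checked empirically. The sketch below uses the JDK's java.util.Base64 rather than Commons Codec, so it illustrates the general point and not Base64InputStream itself: a decoder that skips line separators recovers the same bytes whether the text was wrapped at 76 columns, 80 columns, or not at all. The class and method names (LineLengthDemo, encodeWrapped, decodesBack) are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

public class LineLengthDemo {

    // Encodes 'data' as Base64 wrapped at 'lineLength' characters with CRLF,
    // or as a single unbroken line when lineLength <= 0.
    static byte[] encodeWrapped(byte[] data, int lineLength) {
        byte[] crlf = {'\r', '\n'};
        return Base64.getMimeEncoder(lineLength, crlf).encode(data);
    }

    // Returns true if the MIME decoder recovers 'data' from the wrapped text.
    static boolean decodesBack(byte[] data, int lineLength) {
        byte[] decoded = Base64.getMimeDecoder().decode(encodeWrapped(data, lineLength));
        return Arrays.equals(data, decoded);
    }

    public static void main(String[] args) {
        byte[] data = new byte[500];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        // 76 and 80 are the values discussed above; 0 means one long line.
        for (int lineLength : new int[] {76, 80, 64, 0}) {
            System.out.println("lineLength " + lineLength + ": " + decodesBack(data, lineLength));
        }
    }
}
```

Note that the basic decoder, Base64.getDecoder(), rejects input containing line breaks, which is why the MIME variant is used here.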
