Most efficient way (time and space wise) to send binary data in response

My setup is a Flask-based server. A bird-view of the project would be: the Flask-based server fetches binary data from AWS S3 based on some algorithmic calculations (like figuring out the filenames to fetch from S3), and serves the data to an HTML+JavaScript client.

At first, I thought a JSON object would be the best response type. I created a JSON response with the following format:

{
  "payload": [
    {
      "symbol": "sym",
      "exchange": "exch",
      "headerfile": {
        "name": "#name",
        "content": "#binarycontent"
      },
      "datafiles": [
        {
          "name": "#name",
          "content": "#binarycontent"
        },
        {
          "name": "#name",
          "content": "#binarycontent"
        }
      ]
    }
  ],
  "errors": []
}

After structuring this JSON, I came to know that JSON doesn't natively support binary data, so I wouldn't be able to embed the binary content directly as values in the JSON.

I realize that I can always convert the bytes into a base64-encoded string and use that string as the value in the JSON. But the resulting string is around 33% larger: 4010 bytes of data encoded into 5348 bytes. That is insignificant for a single binary chunk, but it adds up when a lot of such chunks are embedded in one JSON response, and the extra size means the response takes longer to reach the client, which is a crucial concern for my client's application.
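For reference, the overhead is easy to reproduce (a minimal sketch; the 4010-byte chunk here is just random data standing in for one of the S3 chunks):

import base64
import os

# base64 emits 4 output characters for every 3 input bytes (plus padding),
# so the encoded form is roughly a third larger than the raw bytes.
raw = os.urandom(4010)           # stand-in for one binary chunk
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))    # 4010 5348
print(len(encoded) / len(raw))   # ~1.33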

Another option I considered was to stream the binary chunks to the client with an application/octet-stream Content-Type. But I am not sure whether that is any better than the base64 approach. Furthermore, I haven't been able to figure out how to relate the binary chunks to their names in such a setup.
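For a single chunk, one common way to carry the name alongside an octet-stream body is a Content-Disposition header. A minimal sketch of that idea (fetch_chunk_from_s3 is a hypothetical placeholder for my S3 logic), though it still doesn't answer how to bundle several named chunks into one response:

from flask import Flask, Response

app = Flask(__name__)

@app.route('/chunk/<name>')
def serve_chunk(name):
    # fetch_chunk_from_s3() is hypothetical: it returns the raw bytes for one chunk.
    content = fetch_chunk_from_s3(name)
    return Response(
        content,
        mimetype='application/octet-stream',
        headers={'Content-Disposition': 'attachment; filename="{}"'.format(name)},
    )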

Is there a solution better than 'convert binary to text and embed into JSON'?

Shift answered 19/3, 2014 at 19:58 Comment(8)
Are these video files that you are trying to extract? Because AWS has a good service called the Elastic Transcoder.Famine
@kiran.koduru: No, those are non-media binary files/chunks, thus keeping me on the hook.Shift
Check out BSON. en.wikipedia.org/wiki/BSON Haemostat
It sounds like Google's protocol buffers could work for you. There's an official Python implementation and several 3rd party JavaScript implementations.Unicorn
Could you just return the URLs for these files in your JSON response and then let the client get the binary content directly? With your solution that data goes through two hops, do you really need your server to be in the middle?Trypsin
@dstromberg: BSON seems promising. The only caveat is that I work in Python 3, and the independent BSON module (separate from MongoDB) on PyPI seems compatible only with Python 2; pip3 hit installation errors.Shift
@LukasGraf: I will look into the protocol buffers. A brief look says they are indeed a good option. I might need to ask my client for a higher fee, though!Shift
@Miguel: I offered that option to the client, pointing out that the two-hop trip through my server surely takes more time. But he isn't willing to expose the S3 interface directly to the client frontend. As for the server, it does some calculations to figure out the filenames and such, and the client doesn't want to expose that logic either.Shift

I solved the problem, and will write down the solution hoping it could save someone else's time.

Thank you, @dstromberg and @LukasGraf, for your advice. I checked out BSON first and found it sufficient for my needs, so I never went into the details of Protocol Buffers.

BSON on PyPI is available in two packages. In pymongo, it comes as a supplement to MongoDB. In bson, it is a standalone package, which is what my needs call for. However, that package supports only Python 2. So, before rolling out my own port, I looked around for a Python 3 implementation and found another implementation of the BSON spec on bsonspec.org: Link to the module.

The simplest usage of that module goes like this:

>>> import bson
warning: module typecheck.py cannot be imported, type checking is skipped
>>> encoded = bson.serialize_to_bytes({'name': 'chunkfile', 'content': b'\xad\x03\xae\x03\xac\x03\xac\x03\xd4\x13'})
>>> print(encoded)
b'1\x00\x00\x00\x02name\x00\n\x00\x00\x00chunkfile\x00\x05content\x00\n\x00\x00\x00\x00\xad\x03\xae\x03\xac\x03\xac\x03\xd4\x13\x00'
>>> decoded = bson.parse_bytes(encoded)
>>> print(decoded)
OrderedDict([('name', 'chunkfile'), ('content', b'\xad\x03\xae\x03\xac\x03\xac\x03\xd4\x13')])

As you can see, it can accommodate binary data as well. I sent the data from Flask as mimetype=application/bson, and it was accurately parsed by the receiving JavaScript using the standalone BSON library provided by the MongoDB team.
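For completeness, the Flask side boiled down to something like this (a rough sketch; build_payload is a hypothetical placeholder for the filename calculations and the S3 fetching):

import bson                      # the Python 3 module linked above
from flask import Flask, Response

app = Flask(__name__)

@app.route('/payload/<symbol>')
def payload(symbol):
    # build_payload() is hypothetical: it returns a dict whose values include raw bytes,
    # e.g. {'name': 'chunkfile', 'content': b'...'} entries like the ones shown above.
    data = build_payload(symbol)
    return Response(bson.serialize_to_bytes(data), mimetype='application/bson')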

Shift answered 21/3, 2014 at 7:2 Comment(1)
Wonderful, your answer helped me a lot!Woothen
