Encoding and decoding binary data for inclusion into JSON with Python 3

I need to decide on a schema for including binary elements in a message object so that it can be decoded again on the receiving end (in my situation, a consumer on a RabbitMQ / AMQP queue).

I decided against multipart MIME encoding over JSON mostly because it seems like using Thor's hammer to push in a thumb tack. I decided against manually joining parts (binary and JSON concatenated together) mostly because every time a new requirement arises it is a whole re-design. JSON with the binary encoded in one of the fields seems like an elegant solution.

My seemingly working (confirmed by comparing MD5-sum of sent and received data) solution is doing the following:

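import base64
import json


def json_serialiser(byte_obj):
    if isinstance(byte_obj, (bytes, bytearray)):
        # File bytes -> Base64 bytes -> str (Base64 output is pure ASCII)
        return base64.b64encode(byte_obj).decode('utf-8')
    # json.dumps expects the default hook to raise TypeError for unsupported types
    raise TypeError('No encoding handler for data type ' + type(byte_obj).__name__)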


def make_msg(filename, filedata):
    d = {"filename": filename,
         "datalen": len(filedata),
         "data": filedata}
    return json.dumps(d, default=json_serialiser)

On the receiving end I simply do:

def parse_json(msg):
    d = json.loads(msg)
    data = d.pop('data')
    return base64.b64decode(data), d


def file_callback(ch, method, properties, body):
    filedata, fileinfo = parse_json(body)
    print('File Name:', fileinfo.get("filename"))
    print('Received File Size', len(filedata))

My google-fu left me unable to confirm whether what I am doing is in fact valid. In particular, I am concerned whether the line that produces the string from the binary data for inclusion in the JSON is correct, i.e. the line return base64.b64encode(byte_obj).decode('utf-8')

It also seems that I am able to take a shortcut when decoding back to binary data, as base64.b64decode() handles the UTF-8 string as if it were ASCII, which is what one would expect for the output of base64.b64encode(). But is this a valid assumption in all cases?

Mostly I'm surprised at not being able to find any examples of this online. Perhaps my google patience is still on holiday!

Ebonyeboracum answered 27/12, 2018 at 9:47 Comment(1)
Another option is to decode the bytes directly as 'latin1', as described e.g. here, instead of using base64. For example byte_obj.decode('latin1') - Bignonia
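
For comparison, here is a minimal sketch of the latin1 alternative mentioned in the comment. The names payload, as_text, message and restored are made up for illustration. It relies on latin-1 mapping every byte value 0..255 to the code point with the same number, so the decode never fails and encoding with the same codec restores the original bytes.

```python
import json

payload = bytes(range(256))  # hypothetical test data: every possible byte value

# latin-1 maps byte N to code point N, so any byte string decodes cleanly.
as_text = payload.decode('latin1')
message = json.dumps({"data": as_text})

# The receiver reverses the step with the same codec.
restored = json.loads(message)["data"].encode('latin1')
assert restored == payload
```

With the default ensure_ascii=True, json.dumps escapes control characters and everything above 127, so for mostly binary data the resulting message can end up larger than the base64 version.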

The docs confirm that your approach is OK.

base64.b64encode(byte_obj).decode('utf-8') is correct: base64.b64encode takes a bytes-like object and returns bytes, so the result has to be decoded to a str before it can go into JSON:

Encode the bytes-like object s using Base64 and return the encoded bytes.

However, base64.b64decode accepts either bytes or an ASCII string, so no explicit encoding step is needed on the receiving end:

Decode the Base64 encoded bytes-like object or ASCII string s and return the decoded bytes.
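
A quick way to check both statements is a small round trip like the one below. This is only a sketch with made-up sample data (raw); it shows that b64decode returns the same bytes whether it is handed the encoded bytes or the decoded str, and that a str containing non-ASCII characters is rejected.

```python
import base64

raw = b'\x00\xff binary payload'              # hypothetical sample data

encoded_bytes = base64.b64encode(raw)         # b64encode returns bytes
encoded_str = encoded_bytes.decode('utf-8')   # str, suitable for a JSON field

# b64decode accepts both forms and yields the original bytes.
assert base64.b64decode(encoded_bytes) == raw
assert base64.b64decode(encoded_str) == raw

# A str with non-ASCII characters is refused with a ValueError.
try:
    base64.b64decode('\u00e9')
    rejected = False
except ValueError:
    rejected = True
assert rejected
```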

Wetzel answered 27/12, 2018 at 10:47 Comment(2)
Thank you. But when I use e.g. ...decode('latin-1') I still get the same result. So assuming everything else I do is correct, the question that remains is whether using decode('utf-8') is the correct approach for serialising the base64 encoded bytes to "str". - Ebonyeboracum
It doesn't matter whether you use 'utf-8' or 'latin-1' because both encodings map ASCII characters to the same values, and base64 only uses ASCII. So either is fine. - Wetzel
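
The point made in the last comment can be verified directly. A small sketch, using every possible byte value as arbitrary input: the base64 output contains only bytes below 128, so the 'ascii', 'utf-8' and 'latin-1' codecs all produce the same str.

```python
import base64

encoded = base64.b64encode(bytes(range(256)))

# Every byte of the base64 output is in the ASCII range ...
assert all(b < 128 for b in encoded)

# ... so all three codecs decode it to an identical string.
as_ascii = encoded.decode('ascii')
as_utf8 = encoded.decode('utf-8')
as_latin1 = encoded.decode('latin-1')
assert as_ascii == as_utf8 == as_latin1
```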
