msgpack unserialising dict key strings to bytes

I am having issues with msgpack in Python. When a dict with string (str) keys is serialised, the keys are not unserialised properly, which causes KeyError exceptions to be raised.

Example:

>>> import msgpack
>>> d = dict()
>>> value = 1234
>>> d['key'] = value
>>> binary = msgpack.dumps(d)
>>> new_d = msgpack.loads(binary)
>>> new_d['key']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'key'

This is because the keys are no longer strings after calling loads(); they are unserialised to bytes objects.

>>> d.keys()
dict_keys(['key'])
>>> new_d.keys()
dict_keys([b'key'])

It seems this is related to an unimplemented feature, as mentioned on GitHub.

My question is: is there a way to fix this issue, or a workaround to ensure that the same keys can be used upon deserialisation?

I would like to use msgpack, but if I cannot build a dict object with str keys and expect to be able to use the same keys upon deserialisation, it becomes useless.

Marinna answered 18/1, 2018 at 11:13 Comment(2)
Also see github.com/msgpack/msgpack/issues/121#issuecomment-13815058. I suggest that you encode all strings to UTF-8 before passing them to msgpack and then decode upon unpacking. Doing that will also prevent Unicode outside the ASCII / Latin1 range from getting mangled. - Taintless
Thanks! This was the issue, but the source is related to encoding problems with msgpack. You helped track down the problem. Answer to follow. - Marinna
M
10

A default encoding is set when calling dumps or packb:

:param str encoding:
 |      Convert unicode to bytes with this encoding. (default: 'utf-8')

but no encoding is set by default when calling loads or unpackb, as seen in:

Help on built-in function unpackb in module msgpack._unpacker:

unpackb(...)
    unpackb(... encoding=None, ... )

Therefore, setting the encoding when deserialising fixes the issue, for example:

>>> d['key'] = 1234
>>> binary = msgpack.dumps(d)
>>> msgpack.loads(binary, encoding="utf-8")
{'key': 1234}
>>> msgpack.loads(binary, encoding="utf-8") == d
True
Marinna answered 18/1, 2018 at 12:50 Comment(1)
encoding is deprecated; raw=True|False is the new method. - Marianomaribel
H
3

Passing the raw=False flag worked for me on your example:

msgpack.unpackb(binary, raw=False)
# or
msgpack.loads(binary, raw=False)

See https://msgpack-python.readthedocs.io/en/latest/api.html#msgpack.Unpacker:

raw (bool) – If true, unpack msgpack raw to Python bytes. Otherwise, unpack to Python str by decoding with UTF-8 encoding (default).
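
As a rough illustration of the flag quoted above, the following sketch round-trips a dict that holds both a str key and a genuinely binary value. It assumes an installed msgpack release that accepts the raw and use_bin_type parameters; the sample dict and variable names are made up for the example.

```python
import msgpack

# Sample data (hypothetical): one integer value and one truly binary value.
original = {"key": 1234, "blob": b"\x00\x01"}

# use_bin_type=True writes str as the msgpack str type and bytes as the bin type,
# so the two can be told apart again when unpacking.
packed = msgpack.packb(original, use_bin_type=True)

# raw=False decodes msgpack str to Python str; bin values stay as bytes.
restored = msgpack.unpackb(packed, raw=False)

print(restored["key"])
print(type(restored["blob"]))
```

With both flags set, the str keys come back as str while the binary value is left untouched, so no manual decoding step is needed.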

Haveman answered 25/3, 2020 at 16:1 Comment(0)
R
0

Try the following:

import msgpack

def c_msgpackloads(data):
    # Decode bytes keys and values to str; leave every other type untouched.
    def to_str(obj):
        return obj.decode('utf-8') if isinstance(obj, bytes) else obj
    return {to_str(key): to_str(value) for key, value in msgpack.loads(data).items()}

It's a custom loading function that loads the dict and automatically decodes bytes keys and values to str, assuming they are UTF-8 encoded.
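
The comprehension above only converts the top level of the dict. If the unpacked data contains nested dicts or lists, a recursive variant might look like the sketch below. The helper name decode_bytes is hypothetical, and the sample input is hand-written to mimic what an unpack with bytes keys would return, so the snippet runs without msgpack.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively convert bytes to str inside dicts, lists and tuples.

    Hypothetical helper for illustration; not part of msgpack.
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {
            decode_bytes(key, encoding): decode_bytes(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type with decoded items.
        return type(obj)(decode_bytes(item, encoding) for item in obj)
    return obj  # ints, floats, None, str and so on pass through unchanged


# Hand-written stand-in for data unpacked with bytes keys and values.
unpacked = {b"key": 1234, b"tags": [b"a", b"b"], b"nested": {b"inner": b"value"}}
decoded = decode_bytes(unpacked)
print(decoded["nested"]["inner"])
```

Note that this decodes every bytes object it finds, so a value that is genuinely binary (not UTF-8 text) would raise UnicodeDecodeError.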

Redstone answered 18/1, 2018 at 11:21 Comment(1)
Thanks, Pedro. Nice hack, but it is just an encoding issue. Will post an answer. - Marinna
