What is the best way to compress JSON for storage in a memory-based store like Redis or Memcached?

Requirement: Python objects with 2-3 levels of nesting containing basic data types like integers, strings, lists, and dicts (no dates etc.), which need to be stored as JSON in Redis against a key. What are the best methods available for compressing the JSON string for a low memory footprint? The target objects are not very large, having 1000 small elements on average, or about 15000 characters when converted to JSON.

e.g.:

>>> my_dict
{'details': {'1': {'age': 13, 'name': 'dhruv'}, '2': {'age': 15, 'name': 'Matt'}}, 'members': ['1', '2']}
>>> json.dumps(my_dict)
'{"details": {"1": {"age": 13, "name": "dhruv"}, "2": {"age": 15, "name": "Matt"}}, "members": ["1", "2"]}'
### SOME BASIC COMPACTION ###
>>> json.dumps(my_dict, separators=(',',':'))
'{"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}'

1/ Are there any other, better ways to compress JSON to save memory in Redis (while also ensuring lightweight decoding afterwards)?

2/ How good a candidate would msgpack [http://msgpack.org/] be?

3/ Should I consider options like pickle as well?

Lief answered 20/3, 2013 at 14:6 Comment(1)
What are the requirements of your application? Do you need performance? Reliability, consistency, etc.? Would you consider alternatives to Redis? – Bestow

We just use gzip as a compressor.

import gzip
import cStringIO  # Python 2 only; on Python 3, use io.BytesIO (see the adapted answer below)

def decompressStringToFile(value, outputFile):
  """
  decompress the given string value (which must be valid compressed gzip
  data) and write the result in the given open file.
  """
  stream = cStringIO.StringIO(value)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      outputFile.close()
      return
    outputFile.write(chunk)

def compressFileToString(inputFile):
  """
  read the given open file, compress the data and return it as string.
  """
  stream = cStringIO.StringIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = inputFile.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

In our use case we store the result as files, as you can imagine. To work with in-memory strings only, you can pass a cStringIO.StringIO() object in place of the file as well.
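
For an all-in-memory round trip on Python 3, a minimal sketch using the standard library's gzip.compress and gzip.decompress is enough (compress_json/decompress_json are illustrative names, not part of the answer above):

import gzip
import json

def compress_json(obj):
    # Dump with compact separators, then gzip the UTF-8 bytes.
    return gzip.compress(json.dumps(obj, separators=(",", ":")).encode("utf-8"))

def decompress_json(blob):
    # Reverse the steps: gunzip, decode, parse.
    return json.loads(gzip.decompress(blob).decode("utf-8"))

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}
assert decompress_json(compress_json(my_dict)) == my_dict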

Toxoplasmosis answered 20/3, 2013 at 16:34 Comment(1)
Is it better to use with gzip.GzipFile(fileobj=stream, mode='w') as compressor:? As with the usual Python open function, that would ensure the file is properly closed if the loop exits early. – Lancet

Based on @Alfe's answer above, here is a version that keeps the contents in memory (for network I/O tasks). I also made a few changes to support Python 3.

import gzip
from io import BytesIO

def decompressBytesToString(inputBytes):
  """
  decompress the given byte array (which must be valid 
  compressed gzip data) and return the decoded text (utf-8).
  """
  bio = BytesIO()
  stream = BytesIO(inputBytes)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      bio.seek(0)
      return bio.read().decode("utf-8")
    bio.write(chunk)

def compressStringToBytes(inputString):
  """
  read the given string, encode it in utf-8,
  compress the data and return it as a byte array.
  """
  bio = BytesIO()
  bio.write(inputString.encode("utf-8"))
  bio.seek(0)
  stream = BytesIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = bio.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

To test the compression, try:

inputString="asdf" * 1000
len(inputString)
len(compressStringToBytes(inputString))
decompressBytesToString(compressStringToBytes(inputString))
Serpentiform answered 26/8, 2018 at 21:31 Comment(0)

I did some extensive comparisons between different binary formats (MessagePack, BSON, Ion, Smile, CBOR) and compression algorithms (Brotli, Gzip, XZ, Zstandard, bzip2).

For the JSON data I used for testing, keeping the data as JSON and using Brotli compression was the best solution. Brotli has different compression levels, so if you are persisting the data for a long period of time, then using a high level of compression can be worth it. If you are not persisting for very long, then a lower level of compression or using Zstandard might be most effective.

Gzip is easy, but there are almost certainly alternatives that are quicker, compress better, or both.
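
For illustration, a minimal sketch assuming the third-party brotli package (pip install brotli; the quality level ranges from 0 to 11):

import json

import brotli  # third-party: pip install brotli

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}
raw = json.dumps(my_dict, separators=(",", ":")).encode("utf-8")

# Higher quality yields smaller output but takes longer to compress.
blob = brotli.compress(raw, quality=11)
assert json.loads(brotli.decompress(blob)) == my_dict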

You can read the full details of our investigation here: Blog Post

Gutenberg answered 9/12, 2019 at 19:24 Comment(0)

One easy "post-process" approach is to build a "short key name" map and run the generated JSON through it before storage, and again (reversed) before deserializing back to an object. For example:

Before: {"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}
Map: details:d, age:a, name:n, members:m
Result: {"d":{"1":{"a":13,"n":"dhruv"},"2":{"a":15,"n":"Matt"}},"m":["1","2"]}

Just go through the JSON and replace key->value on the way to the database, and value->key on the way back to the application.
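
A minimal sketch of that idea (KEY_MAP and rename_keys are illustrative names, not from the answer):

KEY_MAP = {"details": "d", "age": "a", "name": "n", "members": "m"}
REVERSE_MAP = {v: k for k, v in KEY_MAP.items()}

def rename_keys(obj, mapping):
    # Recursively rename dict keys; non-dict values (like the "1"/"2" ids) pass through.
    if isinstance(obj, dict):
        return {mapping.get(k, k): rename_keys(v, mapping) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rename_keys(v, mapping) for v in obj]
    return obj

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}
short = rename_keys(my_dict, KEY_MAP)               # apply before json.dumps + store
assert rename_keys(short, REVERSE_MAP) == my_dict   # reverse after loading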

You can also gzip for extra goodness (won't be a string after that though).

Bluh answered 21/3, 2013 at 9:50 Comment(0)

If you want it to be fast, try lz4. If you want it to compress better, go for lzma.
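
For example, a sketch assuming the third-party lz4 package (pip install lz4); lzma is in the standard library:

import json
import lzma

import lz4.frame  # third-party: pip install lz4

raw = json.dumps({"members": ["1", "2"]}, separators=(",", ":")).encode("utf-8")

fast = lz4.frame.compress(raw)   # very fast, moderate ratio
small = lzma.compress(raw)       # slower, usually better ratio

assert lz4.frame.decompress(fast) == raw
assert lzma.decompress(small) == raw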

Are there any other, better ways to compress JSON to save memory in Redis (while also ensuring lightweight decoding afterwards)?

How good a candidate would msgpack [http://msgpack.org/] be?

Msgpack is relatively fast and has a smaller memory footprint. But ujson is generally faster for me. You should compare them on your own data: measure the encoding and decoding speed as well as the resulting size.
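
A quick round-trip sketch, assuming the third-party msgpack and ujson packages (with msgpack-python 1.0+ defaults, strings survive the round trip unchanged):

import json

import msgpack  # third-party: pip install msgpack
import ujson    # third-party: pip install ujson

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}

packed = msgpack.packb(my_dict)   # binary, usually smaller than JSON text
assert msgpack.unpackb(packed) == my_dict

# ujson mirrors the json API and is typically faster.
assert ujson.loads(ujson.dumps(my_dict)) == my_dict
print(len(json.dumps(my_dict)), len(packed))  # compare sizes on your own data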

Shall I consider options like pickle as well?

Consider both pickle (cPickle in particular, on Python 2) and marshal. They are fast. But remember that they are not secure or portable (unpickling untrusted data can execute arbitrary code, and marshal's format is not stable across Python versions), so you pay for the speed with added responsibility.
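
For example (a sketch; only unpickle data you trust):

import pickle

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}

# The highest protocol gives the most compact and fastest encoding.
blob = pickle.dumps(my_dict, protocol=pickle.HIGHEST_PROTOCOL)
assert pickle.loads(blob) == my_dict  # never do this with untrusted input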

Sanders answered 6/7, 2017 at 11:23 Comment(0)

Another possibility would be to use MongoDB's storage format, BSON.

You can find two Python implementations on the implementations page of that site.

Edit: why not just save the dictionary, and convert to JSON on retrieval?
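
A round-trip sketch, assuming the bson module that ships with PyMongo (bson.encode and bson.decode exist as of PyMongo 3.9):

import bson  # ships with PyMongo: pip install pymongo

my_dict = {"details": {"1": {"age": 13, "name": "dhruv"}}, "members": ["1"]}

blob = bson.encode(my_dict)   # bytes; Redis values are binary safe
assert bson.decode(blob) == my_dict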

Allocution answered 20/3, 2013 at 14:10 Comment(4)
I do not think BSON can be added as a value for a key in Redis. – Lief
@Lief Sure it can, why wouldn't it? Redis has no opinion on what you store in a key. – Bluh
@JonatanHedborg Thanks for the correction. I did not pay attention to the point that Redis strings are binary safe. – Lief
However, BSON isn't really more compact than JSON (as stated on their site), so it's not really an option. – Bluh
