Fastest way to store a numpy array in redis

I'm using Redis on an AI project.

The idea is to have multiple environment simulators running policies on a lot of CPU cores. The simulators write experience (a list of state/action/reward tuples) to a Redis server (the replay buffer). A training process then reads the experience as a dataset to generate a new policy. The new policy is deployed to the simulators, the data from the previous run is deleted, and the process continues.

The bulk of the experience is captured in the "state", which is normally represented as a large numpy array of dimension, say, 80 x 80. The simulators generate these as fast as the CPU will allow.

To this end, does anyone have good ideas or experience of the best/fastest/simplest way to write a lot of numpy arrays to Redis? This is all on the same machine for now, but could later run on a set of cloud servers. Code samples welcome!

Alti answered 23/3, 2019 at 6:58 Comment(1)
Use pypi.org/project/direct-redis without any hassle. – Ravo

I don't know if it is fastest, but you could try something like this...

Storing a Numpy array to Redis goes like this - see function toRedis():

  • get shape of Numpy array and encode
  • append the Numpy array as bytes to the shape
  • store the encoded array under supplied key

Retrieving a Numpy array goes like this - see function fromRedis():

  • retrieve from Redis the encoded string corresponding to supplied key
  • extract the shape of the Numpy array from the string
  • extract data and repopulate Numpy array, reshape to original shape

#!/usr/bin/env python3

import struct
import redis
import numpy as np

def toRedis(r, a, n):
    """Store given Numpy array 'a' in Redis under key 'n'"""
    h, w = a.shape
    shape = struct.pack('>II', h, w)
    encoded = shape + a.tobytes()

    # Store encoded data in Redis
    r.set(n, encoded)
    return

def fromRedis(r, n):
    """Retrieve Numpy array from Redis key 'n'"""
    encoded = r.get(n)
    h, w = struct.unpack('>II', encoded[:8])
    # Pass the dtype explicitly: np.frombuffer() defaults to float64,
    # which would make the reshape fail for uint16 data
    a = np.frombuffer(encoded[8:], dtype=np.uint16).reshape(h, w)
    return a

# Create 80x80 numpy array to store
a0 = np.arange(6400,dtype=np.uint16).reshape(80,80) 

# Redis connection
r = redis.Redis(host='localhost', port=6379, db=0)

# Store array a0 in Redis under name 'a0array'
toRedis(r,a0,'a0array')

# Retrieve from Redis
a1 = fromRedis(r,'a0array')

np.testing.assert_array_equal(a0,a1)

You could add more flexibility by encoding the dtype of the Numpy array along with the shape. I didn't do that because you may already know that all your arrays are of one specific type, in which case the code would just be bigger and harder to read for no reason. If you do need it, a dtype-aware variant is sketched after the benchmark below.

Rough benchmark on modern iMac:

80x80 Numpy array of np.uint16   => 58 microseconds to write
200x200 Numpy array of np.uint16 => 88 microseconds to write
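
Here is that dtype-aware sketch (the single-byte dtype header and the function names are my own illustration, not part of the original answer):

def toRedisTyped(r, a, n):
    """Store 2-D array 'a' under key 'n' with dtype and shape in a header"""
    dt = str(a.dtype).encode('ascii')
    h, w = a.shape
    # Header: 1 byte of dtype-string length, the dtype string, then height/width
    header = struct.pack('>B', len(dt)) + dt + struct.pack('>II', h, w)
    r.set(n, header + a.tobytes())

def fromRedisTyped(r, n):
    """Retrieve an array stored by toRedisTyped()"""
    encoded = r.get(n)
    dtlen = struct.unpack('>B', encoded[:1])[0]
    dt = np.dtype(encoded[1:1 + dtlen].decode('ascii'))
    h, w = struct.unpack('>II', encoded[1 + dtlen:9 + dtlen])
    return np.frombuffer(encoded[9 + dtlen:], dtype=dt).reshape(h, w)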

Keywords: Python, Numpy, Redis, array, serialise, serialize, key, incr, unique

Dinka answered 23/3, 2019 at 11:38 Comment(11)
Thanks Mark, really helpful. – Alti
Definitely will. Currently implementing this in a multi-process simulation/training setup as described. Once I've finished testing, I'll accept, assuming performance is good. – Alti
It worked out, although I ended up switching to ray github.com/ray-project/ray for this use case. – Alti
Good! Thank you for the link to ray and good luck with your project. – Dinka
Turns out, after 2 weeks of messing around with ray, I came back to this answer! Ray is good, but still needs more work I think. (It's only version 0.7 at the time I write this.) – Alti
@Duane, what are your latency numbers using Redis? Is it fast for you? – Alcmene
@MarkSetchell This script throws an error now: ValueError: cannot reshape array of size 1600 into shape (80,80) – Scull
@AshwinNair An 80x80 array will need 6400 elements, so that is correct: it cannot be reshaped from an array of 1600 elements. – Dinka
Yes, but you've declared a0 to have 6400 elements. I mean to say that running this script as-is fails. Not sure why, though. – Scull
OK, this is because of the dtype. You could just specify it via np.frombuffer(encoded[8:], dtype=np.uint16). But a better option would indeed be to encode the dtype as well. – Scull
Could you please update your answer by adding the data-type persistence as well? – Salvo

You could also consider using msgpack-numpy, which provides "encoding and decoding routines that enable the serialization and deserialization of numerical and array data types provided by numpy using the highly efficient msgpack format." -- see https://msgpack.org/.

Quick proof-of-concept:

import msgpack
import msgpack_numpy as m
import numpy as np
m.patch()               # Important line to monkey-patch for numpy support!

from redis import Redis

r = Redis('127.0.0.1')

# Create an array, then use msgpack to serialize it 
d_orig = np.array([1,2,3,4])
d_orig_packed = m.packb(d_orig)

# Set the data in redis
r.set('d', d_orig_packed)

# Retrieve and unpack the data
d_out = m.unpackb(r.get('d'))

# Check they match
assert np.all(d_orig == d_out)   # np.alltrue is deprecated; np.all is equivalent
assert d_orig.dtype == d_out.dtype

On my machine, msgpack runs much quicker than using struct:

In: %timeit struct.pack('4096L', *np.arange(0, 4096))
1000 loops, best of 3: 443 µs per loop

In: %timeit m.packb(np.arange(0, 4096))
The slowest run took 7.74 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.6 µs per loop
Snippy answered 5/3, 2020 at 2:14 Comment(2)
Whilst I certainly appreciate the simplicity and elegance of using msgpack, I am not sure what your sample timing is trying to say. You seem to compare timings for msgpack with struct-packing, but if you read my answer carefully, I only struct-pack the dimensions, not the array data itself, for which I use np.tobytes(). If you compare np.tobytes() with msgpack, on my machine at least it is 50x faster, i.e. 314 ns versus 17.3 µs. – Dinka
@MarkSetchell Ah yes, you're totally right, not a fair comparison. If I take out just the pack logic from your answer to test the speed and call it def pack(a), then on an 80x80 array %timeit pack(a) gives 4.62 µs, whereas %timeit m.packb(a) takes 12 µs, so it is 2.5x slower. Still, msgpack-numpy is a great package! – Snippy

You can check Mark Setchell's answer for how to actually write the bytes into Redis. Below I rewrite the functions fromRedis and toRedis to handle arrays of any dimensionality and to encode the dtype along with the shape; a round-trip example follows the functions.

import numpy as np

def toRedis(arr: np.ndarray) -> bytes:
    arr_dtype = bytearray(str(arr.dtype), 'utf-8')
    arr_shape = bytearray(','.join([str(a) for a in arr.shape]), 'utf-8')
    sep = bytearray('|', 'utf-8')
    arr_bytes = arr.ravel().tobytes()
    # Convert to bytes so the result can be passed straight to redis-py
    return bytes(arr_dtype + sep + arr_shape + sep + arr_bytes)

def fromRedis(serialized_arr: bytes) -> np.ndarray:
    sep = '|'.encode('utf-8')
    i_0 = serialized_arr.find(sep)
    i_1 = serialized_arr.find(sep, i_0 + 1)
    arr_dtype = serialized_arr[:i_0].decode('utf-8')
    arr_shape = tuple(int(a) for a in serialized_arr[i_0 + 1:i_1].decode('utf-8').split(','))
    arr_str = serialized_arr[i_1 + 1:]
    return np.frombuffer(arr_str, dtype=arr_dtype).reshape(arr_shape)
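
A quick round trip through Redis (a usage sketch, assuming a local server):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

a = np.random.rand(3, 4, 5).astype(np.float32)
r.set('arr', toRedis(a))
b = fromRedis(r.get('arr'))
np.testing.assert_array_equal(a, b)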
Tracheitis answered 28/2, 2020 at 18:59 Comment(0)

Give plasma a try as it avoids serialization/deserialization overhead.

Install plasma using pip install pyarrow

Documentation: https://arrow.apache.org/docs/python/plasma.html

First, launch the plasma store with 1 GB of memory (in a terminal):

plasma_store -m 1000000000 -s /tmp/plasma

import pyarrow.plasma as pa
import numpy as np

client = pa.connect("/tmp/plasma")
temp = np.random.rand(80, 80)

# Write: client.put() copies the array into shared memory and returns an ObjectID
object_id = client.put(temp)

# Read: client.get() returns the stored value (zero-copy where possible)
result = client.get(object_id)

Write time: 130 µs vs 782 µs (Redis implementation: Mark Setchell's answer)

Write time can be improved by using plasma huge pages, but this is only available on Linux machines: https://arrow.apache.org/docs/python/plasma.html#using-plasma-with-huge-pages
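
Per the linked docs, that means pre-mounting a hugetlbfs filesystem and pointing the store at it; the launch then looks something like this (the mount path is illustrative, and the flags should be checked against the docs):

plasma_store -m 1000000000 -s /tmp/plasma -d /mnt/hugepages -h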

Fetch time: 31.2 µs vs 99.5 µs (Redis implementation: Mark Setchell's answer)

PS: Code was run on a MacPro

Oversubscribe answered 24/9, 2020 at 14:4 Comment(2)
Thanks for the pyarrow example. A welcome contribution! – Alti
Interesting - I was unaware of plasma/pyarrow. A couple of things though: 1) your code doesn't actually show how to write or read plasma at all, 2) your code uses a different dtype and different data on a different machine, so the timings are not at all comparable, 3) if I use plasma and client.put() with the same array as I create in my answer, Redis takes around 70 µs and plasma takes 196 µs - though I have to say I have no experience with plasma or optimising it. – Dinka

The tobytes() output is not very storage efficient if the value ever has to be handled as text. To decrease the storage written to the Redis server in that case, you can use the base64 package:

import base64

import numpy as np

def encode_vector(ar):
    return base64.encodebytes(ar.tobytes()).decode('ascii')

def decode_vector(b):
    # 'b' is the base64 payload as returned by redis.get()
    return np.frombuffer(base64.decodebytes(b), dtype='uint16')

@EDIT: OK, since Redis stores values as byte strings, it is more storage efficient to store the byte string directly; base64 actually inflates the payload by about a third. However, if you need to convert the value to a string, print it to the console, or store it in a text file, it makes sense to do the encoding.
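
To put a number on that overhead (a quick illustrative check):

import base64

import numpy as np

a = np.arange(6400, dtype=np.uint16)
raw = a.tobytes()
b64 = base64.b64encode(raw)
print(len(raw), len(b64))  # 12800 vs 17068: base64 adds roughly 1/3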

Diadiabase answered 5/9, 2019 at 6:43 Comment(0)

Here is the code I modified from Jadiel de Armas' answer; his code was almost correct, just missing the decode part. I tested it, and it works for me. A usage example follows the functions.

import base64

import numpy as np
import redis

def set_numpy(r: redis.Redis, key: str, np_value: np.ndarray):
    d_type = bytearray(str(np_value.dtype), 'utf-8')
    d_shape = bytearray(','.join([str(a) for a in np_value.shape]), 'utf-8')
    sep = bytearray('|', 'utf-8')
    data = np_value.ravel().tobytes()
    value = base64.a85encode(d_type + sep + d_shape + sep + data)
    r.set(key, value)

def get_numpy(r: redis.Redis, key: str) -> np.ndarray:
    binary_value = base64.a85decode(r.get(key))
    sep = '|'.encode('utf-8')
    i_0 = binary_value.find(sep)
    i_1 = binary_value.find(sep, i_0 + 1)
    arr_dtype = binary_value[:i_0].decode('utf-8')
    arr_shape = tuple(int(a) for a in binary_value[i_0 + 1:i_1].decode('utf-8').split(','))
    arr_str = binary_value[i_1 + 1:]
    return np.frombuffer(arr_str, dtype=arr_dtype).reshape(arr_shape)
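
A quick usage check, continuing from the imports above (a sketch, assuming a local Redis server):

r = redis.Redis(host='localhost', port=6379, db=0)

arr = np.ones((2, 3), dtype=np.float32)
set_numpy(r, 'my_arr', arr)
assert np.array_equal(get_numpy(r, 'my_arr'), arr)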
Smarten answered 19/12, 2023 at 7:11 Comment(0)
