I'm wondering if there is a fast on-disk key-value store with Python bindings that supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpus (Wikipedia) and continually updating the co-occurrence counts. This involves reading and writing ~300 million values 70 times, with 64-bit keys and 64-bit values.
I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.
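Roughly, the keys/values I have in mind look like this (a sketch, assuming word IDs fit in 32 bits; the exact packing may differ):

```python
# Sketch: pack an (i, j) word-ID pair (i <= j, upper triangle) into an
# 8-byte key and the count into an 8-byte value. Assumes IDs fit in 32 bits.
import struct

def make_key(i: int, j: int) -> bytes:
    if i > j:                        # keep only the upper triangle
        i, j = j, i
    return struct.pack(">II", i, j)  # big-endian so keys sort by (i, j)

def make_value(count: int) -> bytes:
    return struct.pack(">Q", count)
```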
So far I have tried:
- Redis (64 GB of RAM is not large enough)
- TileDB SparseArray (no way to add to existing values)
- SQLite (way too slow)
- LMDB (batching the 300 million reads/writes into transactions takes multiple hours to execute)
- Zarr (coordinate-based updating is SUPER slow)
- SciPy .npz (can't keep the matrices in memory for the addition step)
- sparse COO with memory-mapped coords and data (RAM usage is massive when adding matrices)
Right now the only solution that works well enough is LMDB, but the runtime is ~12 days, which seems unreasonable since it does not feel like I'm processing that much data. Saving a sub-matrix (with ~300M values) to disk as .npz is almost instant.
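For reference, the LMDB update loop is roughly the following batched read-modify-write pattern (a sketch; the key/value packing from above, the `map_size`, and the batching are assumptions):

```python
# Sketch of the batched LMDB read-modify-write loop (key/value packing
# as in the sketch above; map_size and batch handling are placeholders).
import struct
import lmdb

env = lmdb.open("cooc.lmdb", map_size=200 * 2**30)  # 200 GiB map

def flush(batch):  # batch: dict of packed key -> count delta for one sub-matrix
    with env.begin(write=True) as txn:
        for key, delta in batch.items():
            old = txn.get(key)
            new = delta + (struct.unpack(">Q", old)[0] if old is not None else 0)
            txn.put(key, struct.pack(">Q", new))
```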
Any ideas?
Using `append=True` in your `put` for LMDB would give a huge speedup. – Chicky
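A minimal sketch of what that suggestion looks like with py-lmdb; `append=True` only helps when keys are written in sorted order (`sorted_items` is a placeholder iterable):

```python
# Sketch of the append=True idea with py-lmdb: keys must already arrive in
# sorted order, then put() can skip the B-tree search and append at the end.
import lmdb

env = lmdb.open("cooc_sorted.lmdb", map_size=200 * 2**30)
with env.begin(write=True) as txn:
    for key, value in sorted_items:       # placeholder: (key, value) pairs sorted by key
        txn.put(key, value, append=True)
```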
`ndbm` (native in the Python standard library) probably deserves a try. – Pantaloon
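A minimal sketch of the `dbm` suggestion (standard-library module; keys and values must be bytes; `updates` is a placeholder iterable):

```python
# Sketch of the dbm suggestion: standard-library, bytes-only keys/values,
# plain read-modify-write per key (no batching or atomic increments).
import dbm
import struct

with dbm.open("cooc_dbm", "c") as db:     # "c" creates the file if needed
    for key, delta in updates:            # placeholder: (bytes key, int delta) pairs
        old = db.get(key)
        new = delta + (struct.unpack(">Q", old)[0] if old is not None else 0)
        db[key] = struct.pack(">Q", new)
```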