Fast key-value disk storage for Python

I'm wondering if there is a fast on-disk key-value store with Python bindings that supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpus (Wikipedia) and continually updating the co-occurrence counts. This involves reading and writing ~300 million values 70 times, with 64-bit keys and 64-bit values.

I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.
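
For concreteness, each co-occurrence pair maps to a single 64-bit key, roughly like this (an illustrative helper, not my exact code; the word IDs comfortably fit in 32 bits each):

import struct

def make_key(word_i: int, word_j: int) -> bytes:
    # order the pair (upper-triangular), then pack the two 32-bit word IDs
    # into a single 8-byte key
    if word_i > word_j:
        word_i, word_j = word_j, word_i
    return struct.pack(">II", word_i, word_j)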

So far I have tried:

  • Redis (64GB RAM is not large enough)
  • TileDB SparseArray (no way to add to values)
  • SQLite (way too slow)
  • LMDB (batching the 300 million read/write in transactions takes multiple hours to execute)
  • Zarr (coordinate based updating is SUPER slow)
  • SciPy .npz (can't keep the matrices in memory for the addition step)
  • sparse COO with memmapped coords and data (RAM usage is massive when adding matrices)

Right now the only solution that works well enough is LMDB, but the runtime is ~12 days, which seems unreasonable since it doesn't feel like I'm processing that much data. Saving a sub-matrix (with ~300M values) to disk with .npz is almost instant.

Any ideas?

Distinctly asked 2/4, 2020 at 7:43 Comment(8)
Try PySpark. This is the kind of task it was designed for. – Coil
@Coil I have considered PySpark but have no clue where to start for this problem. When testing locally I keep getting OutOfMemoryErrors for very small matrices. There also seems to be no way of adding SparseMatrices, nor the distributed CoordinateMatrix. – Distinctly
Solved the merging problem by converting the submatrices (COO format) to CSV and merging them using PySpark. However, the problem of turning this CSV into a key-value store remains. I'll try TileDB SparseArray, since write performance was really good. – Distinctly
You could have sorted your keys and then used append=True in your LMDB puts; this would give a huge speedup (see the sketch after these comments). – Chicky
How do you do concurrency with Python? – Helpmate
By the way, that is the reason I moved to Chez Scheme. – Helpmate
SQLite being way too slow is odd, because it was built to be fast. It is easy to get poor performance out of SQLite, but with a correctly configured table, accesses should be fast. If you can hope that memory caching will help, I would give SQLite a second try. If accesses are more full writes than full reads, a true direct-access database like ndbm (native in Python's standard library) probably deserves a try. – Pantaloon
LMDB is also memory-based, by the way. – Scotfree
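
A minimal sketch of the sorted-key, append-mode LMDB write suggested in the comment above (this assumes the key/value pairs are already produced in ascending key order, which is what append=True requires; keys_and_counts is a hypothetical iterable of (bytes, bytes) pairs):

import lmdb

# map_size must be large enough for the final database
env = lmdb.open("./cooc.lmdb", map_size=2**40)

with env.begin(write=True) as txn:
    # keys_and_counts: hypothetical iterable of (key, value) byte pairs,
    # already sorted by key in ascending order
    for key, value in keys_and_counts:
        txn.put(key, value, append=True)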

You might want to check out my project.

pip install rocksdict

This is a fast on-disk key-value store based on RocksDB; it can take any Python object as a value. I consider it reliable and easy to use. Its performance is on par with GDBM, but it is cross-platform, whereas GDBM is only available for Python on Linux.

https://github.com/Congyuwang/RocksDict

Below is a demo:

from rocksdict import Rdict, Options

path = "./test_dict"

# create a Rdict with default options at `path`
db = Rdict(path)

db[1.0] = 1
db[1] = 1.0
db["huge integer"] = 2343546543243564534233536434567543
db["good"] = True
db["bad"] = False
db["bytes"] = b"bytes"
db["this is a list"] = [1, 2, 3]
db["store a dict"] = {0: 1}

import numpy as np
db[b"numpy"] = np.array([1, 2, 3])

import pandas as pd
db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})

# close Rdict
db.close()

# reopen Rdict from disk
db = Rdict(path)
assert db[1.0] == 1
assert db[1] == 1.0
assert db["huge integer"] == 2343546543243564534233536434567543
assert db["good"] == True
assert db["bad"] == False
assert db["bytes"] == b"bytes"
assert db["this is a list"] == [1, 2, 3]
assert db["store a dict"] == {0: 1}
assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))

# iterate through all elements
for k, v in db.items():
    print(f"{k} -> {v}")

# batch get:
print(db[["good", "bad", 1.0]])
# [True, False, 1]
 
# release the handle, then delete the database files from disk
del db
Rdict.destroy(path)
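
For the co-occurrence-counting use case in the question, here is a rough sketch of how updates could be batched on top of Rdict (the Counter buffer, the integer key packing, and the KeyError-based missing-key check are my own assumptions, not something the rocksdict docs promise; pair_stream is a hypothetical iterable of word-ID pairs):

from collections import Counter
from rocksdict import Rdict

db = Rdict("./cooc_counts")

def flush_counts(pending):
    # Add buffered counts into the on-disk store. This assumes reading a
    # missing key raises KeyError, as with a regular dict; check the
    # rocksdict docs for the exact missing-key behaviour.
    for key, count in pending.items():
        try:
            db[key] = db[key] + count
        except KeyError:
            db[key] = count

pending = Counter()
for i, j in pair_stream:                 # pair_stream: hypothetical (i, j) word-ID pairs
    pending[(i << 32) | j] += 1          # pack the pair into one integer key
    if len(pending) >= 1_000_000:        # flush periodically to bound memory use
        flush_counts(pending)
        pending.clear()
flush_counts(pending)
db.close()
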
Spue answered 1/1, 2022 at 7:26 Comment(3)
Looks good. Does it allow adding values while reading? Are there any locks? – Scotfree
This package is thread-safe, so yes. But it does not allow multiprocessing (i.e., you cannot open two instances in two different Python processes; a file lock prevents this). You can use multithreading, though, sharing the same db object across threads. – Spue
You cannot have two instances reading the same file? – Scotfree

Have a look at Plyvel, which is a Python interface to LevelDB.

I used it successfully several years ago, and both projects still appear to be active. My own use case was storing hundreds of millions of key-value pairs, and I was more focused on read performance, but it looks optimized for writes as well.
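
A minimal sketch of what that could look like with Plyvel (LevelDB stores raw bytes, so the struct-based key/value packing here is my own choice, not something Plyvel mandates):

import struct
import plyvel

db = plyvel.DB("./cooc_leveldb", create_if_missing=True)

word_i, word_j = 42, 1337                         # example word IDs
key = struct.pack(">Q", (word_i << 32) | word_j)  # 8-byte key

# read-modify-write a single counter
current = db.get(key)
count = struct.unpack(">Q", current)[0] if current is not None else 0
db.put(key, struct.pack(">Q", count + 1))

# batched writes are usually much faster than one put() per key
with db.write_batch() as wb:
    wb.put(key, struct.pack(">Q", count + 1))

db.close()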

Dyslexia answered 3/4, 2023 at 16:5 Comment(1)
How is the read performance? – Scotfree

PySpark is more useful here. The example below (from Learning Spark, linked underneath) uses the Java API; the same pattern applies in PySpark.

PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2(x.split(" ")[0], x);
    }
  };

JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
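
A rough PySpark equivalent of the Java snippet above, extended with a simple per-key count (the input and output paths are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="word-pairs")
lines = sc.textFile("corpus.txt")                       # placeholder input path

# key each line by its first word, as in the Java example
pairs = lines.map(lambda x: (x.split(" ")[0], x))

# aggregate a count per key
counts = pairs.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("pair_counts")                    # placeholder output path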

Watchcase answered 10/11, 2021 at 8:54 Comment(0)
