Reliable and efficient key-value database for Linux? [closed]

I need a fast, reliable and memory-efficient key-value database for Linux. My keys are about 128 bytes, and the maximum value size can be 128 KB or 256 KB. The database subsystem shouldn't use more than about 1 MB of RAM. The total database size is 20 GB (!), but only a small random fraction of the data is accessed at a time. If necessary, I can move some data blobs out of the database (into regular files), so that the size goes down to 2 GB maximum. The database must survive a system crash without losing any data that wasn't modified recently. I'll have about 100 times more reads than writes. It is a plus if the database can use a block device (without a filesystem) as storage. I don't need client-server functionality, just a library. I need Python bindings (but I can implement them if they are not available).

Which solutions should I consider, and which one do you recommend?

Candidates I know of which could work:

  • Tokyo Cabinet (Python bindings are pytc; see also the pytc example code; supports hashes and B+trees, transaction log files and more; the size of the bucket array is fixed at database creation time; the writer must close the file to give readers a chance; lots of small writes that each reopen the file are very slow; the Tyrant server can help with lots of small writes; see a speed comparison between Tokyo Cabinet, Tokyo Tyrant and Berkeley DB)
  • VSDB (safe even on NFS, without locking; what about barriers?; updates are very slow, but not as slow as in cdb; last version in 2003)
  • BerkeleyDB (provides crash recovery; provides transactions; the bsddb Python module provides bindings)
  • Samba's TDB (with transactions and Python bindings; some users experienced corruption; sometimes mmap()s the whole file; the repack operation sometimes doubles the file size; produces mysterious failures if the database is larger than 2 GB (even on 64-bit systems); a cluster implementation (CTDB) is also available; the file grows too large after lots of modifications; the file becomes too slow after lots of hash contention; no built-in way to rebuild the file; very fast parallel updates by locking individual hash buckets)
  • aodbm (append-only so survives a system crash, with Python bindings)
  • hamsterdb (with Python bindings)
  • C-tree (mature, versatile commercial solution with high performance, has a free edition with reduced functionality)
  • the old TDB (from 2001)
  • bitcask (log-structured, written in Erlang)
  • various other DBM implementations (such as GDBM, NDBM, QDBM, Perl's or Ruby's SDBM; they probably don't have proper crash recovery)

I won't use these:

  • MemcacheDB (client-server, uses BerkeleyDB as a backend)
  • cdb (needs to regenerate the whole database upon each write)
  • http://www.wildsparx.com/apbcdb/ (ditto)
  • Redis (keeps the whole database in memory)
  • SQLite (it becomes very slow without periodic vacuuming; see the autocompletion in the location bar in Firefox 3.0, even though SQLite 3.1 and later support auto_vacuum (see the sketch after this list); beware: small write transactions can be very slow; beware: if a busy process is doing many transactions, other processes starve and may never get the lock)
  • MongoDB (too heavy-weight, treats values as objects with internal structure)
  • Firebird (SQL-based RDBMS, too heavy-weight)
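
To illustrate the SQLite auto_vacuum point above, a minimal sketch (Python 3, using the built-in sqlite3 module; the kv table layout is just an illustration, not a recommended schema):

    import sqlite3

    conn = sqlite3.connect('kv.db')
    # auto_vacuum takes effect only on a brand-new database, or after a
    # manual VACUUM rebuilds the file.
    conn.execute('PRAGMA auto_vacuum = FULL')
    conn.execute('VACUUM')
    conn.execute('CREATE TABLE IF NOT EXISTS kv (k BLOB PRIMARY KEY, v BLOB)')
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (b'key', b'value'))
    conn.commit()
    conn.close()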

FYI, a recent article about key-value databases in Linux Magazine.

FYI, an older software list

FYI, a speed comparison of MemcacheDB, Redis and Tokyo Tyrant

Related questions on Stack Overflow:

Obvert answered 6/11, 2009 at 21:32 Comment(6)
I know this question is pretty old, but have you looked at aodbm ( sourceforge.net/projects/aodbm )? It fulfils your requirements, and its use of an append-only file means it will survive a system crash.Batchelor
@dan_waterworh: aodbm is an interesting project, thanks for mentioning it. Please add documentation for all public functions. Does aodbm support quickly appending a few bytes to an existing value? If I append 1 byte 1000 times, will it be slower to retrieve the value than if I had appended 1000 bytes once?Obvert
@pts, sorry, I missed your message when you first posted it. aodbm doesn't support quick appending of values, this is because there is no speed advantage in doing so for append-only formats. I am working on the documentation.Batchelor
@pts, I think I missed your msg because you missed out the 'h' in my name. I've added documentation for the complete C API into the README file.Batchelor
@dan_waterworth: So let's suppose that I append 1024 bytes to a value each day for 365 days. I can imagine two solutions: A. database size grows quadratically (1024*366*365/2), fetching the final value needs 1 disk seek; B. database size grows linearly (1024*365), fetching the final value needs 365 disk seeks. Which of these does aodbm support?Obvert
@pts, At the moment it would grow quadratically, however I'm going to be modifying it soon so that you have the option of "log structuring", in which case it will grow until it reaches a limit and it will not exceed this limit.Batchelor

I've had good luck with the Tokyo Cabinet/pytc solution. It's very fast (a bit faster than the shelve module backed by anydbm in my implementation), both for reading and writing (though I too do far more reading). The problem for me was the spartan documentation of the Python bindings, but there's enough example code around to figure out how to do what you need. Additionally, Tokyo Cabinet is quite easy to install (as are the Python bindings), doesn't require a server (as you mention) and seems to be well supported (stable, though no longer under active development). You can open files in read-only mode, allowing concurrent access, or in read/write mode, preventing other processes from accessing the database.
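
For reference, a minimal sketch of the open-modes pattern described above, assuming pytc's hash-database API (pytc.HDB and the HDBO* flags, which mirror Tokyo Cabinet's C API; double-check the names against your pytc version):

    import pytc

    # Writer: holds the file exclusively; close it to let readers in.
    db = pytc.HDB()
    db.open('data.tch', pytc.HDBOWRITER | pytc.HDBOCREAT)
    db.put('key', 'value')
    db.close()

    # Reader: read-only mode allows concurrent access by other processes.
    db = pytc.HDB()
    db.open('data.tch', pytc.HDBOREADER)
    print(db.get('key'))
    db.close()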

I was looking at various options over the summer, and the advice I got then was this: try out the different options and see what works best for you. It'd be nice if there were simply a "best" option, but everyone is looking for slightly different features and are willing to make different trade-offs. You know best.

(That said, it'd be useful to others if you shared what ended up working the best for you, and why you chose that solution over others!)

Hemihedral answered 18/11, 2009 at 18:51 Comment(3)
Tokyo Cabinet is unsupported and unreliable. The CfEngine folks used to use it but had constant unresolvable corruption issues with it.Benedictbenedicta
I've been using it for several years and have never had any corruption issues. You are correct that it is no longer supported, however.Hemihedral
google.com/search?q=cfengine+tokyo+cabinet+corruption It seems to have been quite a persistent problem in some software communities.Benedictbenedicta

LMDB is the most memory-efficient database around: http://symas.com/mdb/inmem/

It has also proven to be the most reliable, completely crash-proof: http://wisdom.cs.wisc.edu/workshops/spring-14/talks/Thanu.pdf

Of the ones you've mentioned, Tokyo Cabinet has documented corruption issues: https://www.google.com/search?q=cfengine+tokyo+cabinet+corruption

BerkeleyDB also has well-documented corruption issues, as does Bitcask. (And Bitcask is an in-memory-only DB anyway, so useless for your 1 MB RAM requirement.)

LMDB is also well supported in Python, with a couple of different bindings available: https://github.com/dw/py-lmdb/ and https://github.com/tspurway/pymdb-lightning
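
For illustration, a minimal sketch using the py-lmdb binding (map_size is an upper bound on the database file size, here 20 GB, not on RAM usage; the path is hypothetical):

    import lmdb

    # map_size caps the database size; LMDB memory-maps the file and only
    # touches the pages it needs, so resident memory stays small.
    env = lmdb.open('/var/lib/mydb', map_size=20 * 1024 ** 3)

    with env.begin(write=True) as txn:  # one crash-safe write transaction
        txn.put(b'some-128-byte-key', b'a value blob of up to 256 KB')

    with env.begin() as txn:            # readers never block the writer
        value = txn.get(b'some-128-byte-key')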

Disclaimer - I am the author of LMDB. But these are documented facts: LMDB is the smallest, most efficient, and most reliable key/value store in the world and nothing else comes anywhere close.

Benedictbenedicta answered 6/11, 2009 at 21:32 Comment(0)

Riak runs on Linux and allows you to dynamically add nodes.

Onions answered 6/11, 2009 at 21:32 Comment(0)

cdb can handle databases only up to 4 GB, making it too small for the 20 GB dataset at hand.

Limbic answered 6/11, 2009 at 21:32 Comment(0)

How about Python 3.0's dbm.ndbm?
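
It is dict-like and in the standard library; a minimal sketch (keeping in mind that, as the question notes, the classic DBM family probably has no proper crash recovery):

    import dbm.ndbm

    db = dbm.ndbm.open('data', 'c')  # 'c': create the file if needed
    db[b'key'] = b'value'
    print(db[b'key'])
    db.close()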

Limbic answered 7/11, 2009 at 15:17 Comment(1)
dbm.ndbm uses NDBM, which I've already mentioned in the question. Could you please provide reasons why I should use NDBM?Obvert

Another suggestion is TDB (part of the Samba project). I've used it through the tdb module; however, I can't say I've tested its reliability on crashes, since the projects I used it in didn't have such requirements, and I can't find relevant documentation.
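
If it helps, a minimal sketch of transactional use, assuming Samba's tdb Python binding (a dict-like Tdb class with transaction_start/transaction_commit/transaction_cancel; names from memory, so verify them against your version):

    import os
    import tdb

    db = tdb.Tdb('data.tdb', 0, tdb.DEFAULT, os.O_CREAT | os.O_RDWR)
    db.transaction_start()          # group writes into one atomic unit
    try:
        db[b'key'] = b'value'
        db.transaction_commit()     # durable once this returns
    except Exception:
        db.transaction_cancel()     # roll back on any error
        raise
    db.close()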

Embryologist answered 18/11, 2009 at 23:56 Comment(0)

In my question about a cross-platform ISAM-style database (a similar question to this one), I also received suggestions for the embedded version of Firebird and for GLib.

Dagenham answered 6/11, 2009 at 21:32 Comment(1)
Does it have a key-value API or only SQL?Obvert

How about SQLite?

Nibelungenlied answered 6/11, 2009 at 21:34 Comment(3)
I'm afraid SQLite would get too slow after lots of writes (without vacuuming).Obvert
What would you think of setting up a cron job to do the vacuuming at regular intervals? And what happens if the key-value database later gets upgraded to a more familiar relational database? In short, having to migrate would be a pain. By using SQLite, you're pretty much on safe ground for the future.Ninebark
I see you clarified your question; in that case, yes, SQLite won't be your best solution. Just a recommendation: keep in mind that if your app is cross-platform, Windows has a hard 2 GB limit on file size. The reason I mention this is that it came up in a different question. If the DB of your choice stores everything in one file and that file gets too big, it will crash on Windows, similar to how Thunderbird or Outlook crash when their storage gets too big.Nibelungenlied

I've used bsddb.hashopen() with Python; it worked pretty well.
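
For completeness, a minimal sketch of that approach with Python 2's bsddb module (dict-like access backed by Berkeley DB hash tables; note that plain hashopen() gives no transactions, so full crash recovery needs the lower-level DBEnv API):

    import bsddb

    db = bsddb.hashopen('data.db', 'c')  # 'c': create the file if needed
    db['key'] = 'value'
    print(db['key'])
    db.close()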

Railroad answered 6/11, 2009 at 21:35 Comment(1)
Thanks for mentioning the bsddb Python module. It uses BerkeleyDB.Obvert

You might like djb's cdb, which has the properties you mention.

Aurie answered 6/11, 2009 at 21:37 Comment(4)
As far as I understand, I have to rebuild the cdb for each write. That would be too slow.Obvert
Reading isn't interrupted during a write, making this a non-issue in the real world; see cr.yp.to/cdb/cdbmake.html. Could you explain why you believe performance would be unacceptable?Cardona
esm: did you read the question? The database size is 20 GiB, and there are 100 reads per write, so there are writes. Now, do you consider it acceptable to move 20 GiB back and forth per write?Embryologist
@Cardona cdb might not work: "No random limits: cdb can handle any database up to 4 gigabytes. There are no other restrictions; records don't even have to fit into memory. Databases are stored in a machine-independent format."Chart
