BerkeleyDB write performance problems

I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.

I'm trying the C BerkeleyDB library (5.1.25) from Java and I'm seeing serious performance problems.

I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.

I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
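
Roughly, the pool looks like this (a simplified sketch with illustrative class and method names, not my actual code):

    import com.sleepycat.db.*;

    import java.io.FileNotFoundException;
    import java.util.Arrays;

    public class DocStorePool {
        private final Database[] shards;

        public DocStorePool(Environment env, int n) throws DatabaseException, FileNotFoundException {
            DatabaseConfig cfg = new DatabaseConfig();
            cfg.setAllowCreate(true);
            cfg.setType(DatabaseType.HASH);
            shards = new Database[n];
            for (int i = 0; i < n; i++) {
                shards[i] = env.openDatabase(null, "docs-" + i + ".db", null, cfg);
            }
        }

        public void put(byte[] key, byte[] value) throws DatabaseException {
            // Pick a shard from the key hash so writes spread evenly across the files.
            int idx = Math.floorMod(Arrays.hashCode(key), shards.length);
            shards[idx].put(null, new DatabaseEntry(key), new DatabaseEntry(value));
        }
    }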

I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
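
For reference, here is a minimal sketch of the kind of setup I've been tuning (the sizes below are placeholders, not the exact values I tried):

    import com.sleepycat.db.*;

    import java.io.File;
    import java.io.FileNotFoundException;

    public class DocStore {
        public static Database open(File home) throws DatabaseException, FileNotFoundException {
            EnvironmentConfig envCfg = new EnvironmentConfig();
            envCfg.setAllowCreate(true);
            envCfg.setInitializeCache(true);
            envCfg.setCacheSize(512L * 1024 * 1024);   // one of several cache sizes tried
            Environment env = new Environment(home, envCfg);

            DatabaseConfig dbCfg = new DatabaseConfig();
            dbCfg.setAllowCreate(true);
            dbCfg.setType(DatabaseType.HASH);          // BTREE behaves the same for me
            dbCfg.setHashNumElements(20000000);        // pre-size the hash so the file doesn't grow incrementally
            return env.openDatabase(null, "docs.db", null, dbCfg);
        }
    }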

I'm about to give up on BerkeleyDB and try much more complex solutions like Cassandra, but I want to make sure I'm not doing something wrong with BerkeleyDB before writing it off.

Anybody here with experience achieving sustained write performance with BerkeleyDB?

Edit 1:

I tried several things already:

  1. Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). Didn't work: once a certain number of docs has been written, performance drops regardless.
  2. Write incoming items to a queue. This has two problems: A) It defeats the purpose of freeing up RAM. B) The queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent (see the sketch below).

In other words, even if I throttle the incoming data to stay below the hardware's capability and use RAM to hold items while BerkeleyDB takes time to adapt to the growth, those pauses keep getting longer and throughput approaches zero.
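
For reference, the queue from point 2 is essentially a bounded buffer in front of a single writer thread (simplified sketch, illustrative names); write() blocks as soon as the buffer fills, i.e. whenever BerkeleyDB freezes for longer than the queue can absorb:

    import com.sleepycat.db.*;

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class QueuedWriter {
        // Bounded buffer between the producers and the single BerkeleyDB writer thread.
        private final BlockingQueue<byte[][]> queue = new ArrayBlockingQueue<byte[][]>(100000);

        public QueuedWriter(final Database db) {
            Thread writer = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            byte[][] kv = queue.take();   // wait for the next doc
                            db.put(null, new DatabaseEntry(kv[0]), new DatabaseEntry(kv[1]));
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (DatabaseException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        // Producers call this; it blocks once the queue is full, which is exactly
        // what happens when BerkeleyDB freezes for long periods.
        public void write(byte[] key, byte[] value) throws InterruptedException {
            queue.put(new byte[][] { key, value });
        }
    }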

This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...

Edit 2:

After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.

But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.

Chequered answered 24/3, 2011 at 17:52 Comment(0)

What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point), but once it fills, the application, or even the whole system, can slow dramatically, even stop.

Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion created by caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.

I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.

Another option is to use SSD drives if the updates are bursty (they can do 10K+ writes per second as they have no moving parts). Combined with caching, this should give you more than you need, but SSDs have a limited number of writes.

Ruvolo answered 24/3, 2011 at 18:8 Comment(11)
I'm seeing an average of 500 writes/second on a 70 seeks/second disk (thanks to the controller cache, right?). I would be satisfied if I could get constant 500 writes/second, but I'm getting a huge variability which I want to avoid. I tried throttling the writes down to 500/s but once a certain number of docs has been written, performance drops regardless. SSD is not an option.Chequered
@user305175: To see if disk is the problem you can try two things: monitor your disk IO performance using PerfMon on Windows or vmstat on Linux. Or try using a BerkeleyDB in RAM-only mode, if you have enough RAM.Addle
@Zan: The whole point of using BerkeleyDB is to offload stuff from memory. The issue is that BerkeleyDB doesn't seem to be able to provide a steady write performance.Chequered
@Alex: I'm talking about debugging. If debugging reveals a disk IO problem then you buy SSDs or whatever.Addle
My point is that your hardware is likely to be the root cause of the unsteady performance. There is only so much the OS/software can do to work around your hardware limitations. I am guessing you are seeing very little CPU time and mostly waiting for disk. Don't just take it from me. A 7200 RPM disk has a typical access time of 8 ms, which means you can perform 125 writes per second sustained (1000/8). tomshardware.com/reviews/notebook-hdd-750gb,2832-7.htmRuvolo
@Peter: I know I can get 500 writes per second from the hardware. The hardware is not the issue. Performance is not unsteady: it follows a pattern. BerkeleyDB is periodically stopping everything to reorganize the file, and I can't write to the db during that time. These periods get longer and more frequent as the number of items grows. I'm looking for a way around this issue, for example writing to an alternate file, but I can't think of any way to do this such that I will be able to get to those items when I need to read them.Chequered
@Alex, the behaviour you describe is exactly the same as what I have seen writing to a number of log files, without any re-organization. How do you know Berkeley DB is re-organizing the files?Ruvolo
@Peter: Interesting. I don't see that kind of behavior when writing to logs. Appending to a file, even when competing for the disk with other processes, doesn't seem to slow down with the size of the file. Why would it? I don't know that BerkeleyDB is reorganizing, but the speed drop pattern seems to indicate it and I can't think of anything else. The test machine is not doing anything else and the disk is local. I'll try to upload a graph of the speed and link it in the original post. Also, 8 ms is average seek time; you're assuming each write means a seek, which is not the case.Chequered
If you are adding data to a HashMap persisted to disk, won't the writes be random? (The hash is random.) Writing large amounts of data to disk continuously should be very efficient. Random writes will look efficient up to a point, but then you reach a point where the system behaves in the pattern you describe.Ruvolo
@Peter: In other words BerkeleyDB is unable to handle more than a couple million documents without degrading beyond usability? That can't be right.Chequered
I have used it to store tens of millions of "documents" (<1 KB each); however, I didn't find its write performance impressive, and it was rather hardware dependent. From what I remember there were ways to do bulk loads which worked well by combining updates. (It has been a while)Ruvolo

BerkeleyDB does not perform file reorganizations, unless you're manually invoking the compaction utility. There are several causes of the slowdown:

  1. Writes to keys are in random-access order, which causes much higher disk I/O load.
  2. Writes are durable by default, which forces a lot of extra disk flushes.
  3. A transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.

When you say "documents", do you mean to say that you're using BDB for storing records larger than a few kbytes? BDB overflow pages have more overhead, and so you should consider using a larger page size.
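
If you are using the com.sleepycat.db Java bindings, a rough sketch of how these knobs can be adjusted (values are illustrative; relaxing durability trades crash safety for throughput):

    import com.sleepycat.db.*;

    public class TuningSketch {
        public static EnvironmentConfig relaxedDurability() {
            EnvironmentConfig envCfg = new EnvironmentConfig();
            envCfg.setAllowCreate(true);
            envCfg.setInitializeCache(true);
            envCfg.setInitializeLogging(true);
            envCfg.setTransactional(true);
            // Don't flush the log to disk at every commit (DB_TXN_WRITE_NOSYNC):
            // only the last few transactions are at risk on a crash, but commits
            // no longer pay for a synchronous disk flush.
            envCfg.setTxnWriteNoSync(true);
            return envCfg;
        }

        public static DatabaseConfig largePages() {
            DatabaseConfig dbCfg = new DatabaseConfig();
            dbCfg.setAllowCreate(true);
            dbCfg.setType(DatabaseType.BTREE);
            // Larger pages keep multi-KB records out of overflow pages.
            dbCfg.setPageSize(64 * 1024);   // 64 KB is BDB's maximum page size
            return dbCfg;
        }
    }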

Millerite answered 22/5, 2011 at 4:42 Comment(0)

We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB's caching works by loading tree pages into RAM so that a tree traversal does not incur file I/O (reading from disk).

Empennage answered 22/5, 2011 at 4:54 Comment(2)
Have you tried BDB partitioning? I have not tried it myself though and hence cannot back my suggestion with facts.Empennage
Berkeley DB uses B+trees, not B-trees (despite what BDB calls them). See aosabook.org/en/bdb.html.Baneberry

This is an old question and the problem is probably gone, but I have recently had similar problems (insert speed dropping dramatically after a few hundred thousand records), and they were solved by giving more cache to the database (DB->set_cachesize). With 2 GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
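
With the Java bindings used in the question, the equivalent of DB->set_cachesize / DB_ENV->set_cachesize would look roughly like this (a sketch, not my actual code):

    import com.sleepycat.db.*;

    import java.io.File;
    import java.io.FileNotFoundException;

    public class BigCacheEnv {
        public static Environment open(File home) throws DatabaseException, FileNotFoundException {
            EnvironmentConfig envCfg = new EnvironmentConfig();
            envCfg.setAllowCreate(true);
            envCfg.setInitializeCache(true);
            // 2 GB cache, as suggested above. The C equivalent is
            // DB_ENV->set_cachesize(dbenv, 2, 0, 1), or DB->set_cachesize
            // on a standalone database handle.
            envCfg.setCacheSize(2L * 1024 * 1024 * 1024);
            return new Environment(home, envCfg);
        }
    }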

Wamble answered 18/3, 2012 at 20:27 Comment(1)
This answer helped me a lot! Increasing the cache to 8 GB allowed me to smoothly import nearly 100 million records. If anyone is interested in how to set the cache size using the Python 3 bindings, see my commit here. Also, do not import into two databases at once!Kiangsu

I need a disk-based key-value store that can sustain high write and read performance for large data sets.

Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.
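
A rough sketch of a persisted Chronicle Map for this kind of workload (the entry count and average sizes are assumptions, not measurements):

    import net.openhft.chronicle.map.ChronicleMap;

    import java.io.File;
    import java.io.IOException;

    public class ChronicleExample {
        public static void main(String[] args) throws IOException {
            // Off-heap map persisted to a file; survives process restarts.
            ChronicleMap<CharSequence, byte[]> docs = ChronicleMap
                    .of(CharSequence.class, byte[].class)
                    .name("docs")
                    .entries(50000000L)       // expected number of documents (assumption)
                    .averageKeySize(32)       // assumed average key length in bytes
                    .averageValueSize(1024)   // assumed average document size in bytes
                    .createPersistedTo(new File("docs.dat"));

            docs.put("doc-1", new byte[] {1, 2, 3});
            docs.close();
        }
    }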

Promontory answered 18/3, 2017 at 23:1 Comment(0)
