Can I expect a significant performance boost by moving a large key value store from MySQL to a NoSQL DB?
I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.

I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.

Queries are typically for around 100 consecutive values, e.g. SELECT Value WHERE ID BETWEEN 7000000 AND 7000100;
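For anyone wanting to reproduce this access pattern, here is a minimal sketch using SQLite as a stand-in for MySQL/MyISAM (the table name "data" is an assumption; the two columns match the schema described in the comments below):

```python
# Toy reproduction of the question's access pattern. SQLite here is only a
# stand-in for MySQL/MyISAM, and the table name "data" is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER PRIMARY KEY, Value REAL)")
conn.executemany(
    "INSERT INTO data (ID, Value) VALUES (?, ?)",
    ((i, i * 0.5) for i in range(7_000_000, 7_001_000)),
)

# The typical query: ~100 consecutive values looked up by primary key.
rows = conn.execute(
    "SELECT Value FROM data WHERE ID BETWEEN 7000000 AND 7000100"
).fetchall()
print(len(rows))  # 101 - BETWEEN is inclusive on both ends
```

Because ID is the primary key, this range scan is a single index walk over consecutive entries, which is exactly the pattern a key/value store would also serve.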

I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.

Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.

Thanks

Weepy answered 6/8, 2010 at 18:20 Comment(6)
Before abandoning RDBMSes, I'd want to profile against MySQL/InnoDB and PostgreSQL. I'd also want to be sure I had appropriate indices on the table. – Podgorica
Re: indices, my table has two columns: ID BIGINT; Value FLOAT; and I have ID as my primary key, as my queries always use WHERE ID BETWEEN ... – Weepy
Wow! 50GB of data in a two-column table. I'd think 0.1 to 0.3 secs isn't to be sneezed at, in the circumstances. If it's any of our business, perhaps you could tell us what you're keeping in what must be a near-record-breaking table? – Immediately
If you move to MongoDB you can shard your db over multiple machines so that the whole dataset fits in memory, which will make it really fast. Sharding in MongoDB can deal with range queries. Of course, those extra machines cost money; it is up to you. You can also try using an SSD. – Precipitation
Brian: I should have said that I'm not currently running with a full dataset, so my 0.1-0.3s retrievals only apply to a total table size of just over 5GB (it will eventually be 50GB). NB: my current server only has 512MB RAM(!). The data is a set of geophysical satellite data. TTT: Good point about the sharding. That is definitely an option. – Weepy
Any of the key/value NoSQL DBs should work well for you, especially if you have some other machines lying around that you can make a cluster with. – Slipslop
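The range-sharding idea raised in the comments can be sketched in a few lines. The split points and shard count below are made up for illustration; in MongoDB the balancer maintains such chunk boundaries automatically:

```python
import bisect

# Hypothetical split points: shard 0 owns IDs below 3,000,000, shard 1 owns
# IDs below 6,000,000, and so on. This only illustrates the routing step a
# range-sharded store performs; real systems track these boundaries for you.
SPLIT_POINTS = [3_000_000, 6_000_000, 9_000_000]

def shard_for(key: int) -> int:
    """Route a single key to the shard that owns its range."""
    return bisect.bisect_right(SPLIT_POINTS, key)

def shards_for_range(lo: int, hi: int) -> range:
    """A BETWEEN query only touches the shards covering [lo, hi]."""
    return range(shard_for(lo), shard_for(hi) + 1)

print(shard_for(7_000_000))                          # 2
print(list(shards_for_range(7_000_000, 7_000_100)))  # [2]
```

The point of range (rather than hash) sharding is visible in the second call: a query for 100 consecutive IDs lands on a single shard instead of fanning out to all of them.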
I use MongoDB in production for a write-intensive operation where I do well over the rates you are referring to, for both WRITE and READ operations. The database is around 90GB, and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15ms on a database with 150M entries, with query times reaching 30-50ms under heavy load. At any rate, 200ms is way too much for a key/value store.

If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn. If you are looking for a distributed solution, you can try any Dynamo clone, with Cassandra (Facebook) and Project Voldemort (LinkedIn) being the most popular. Keep in mind that requiring strong consistency slows these systems down quite a bit.

Marchant answered 9/8, 2010 at 9:53 Comment(1)
Thanks - am running some benchmarks now with MongoDB, Tokyo Tyrant and Cassandra. I am definitely seeing vast improvements in query times. However, FYI, bulk inserts are proving not quite so fast (compared to MySQL's LOAD DATA INFILE). – Weepy
Please also consider OrientDB. It uses indexes based on an RB+Tree algorithm. In my tests with a 100GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends on how the keys/values are distributed inside the index.

Running your own test with it should take less than an hour.

One piece of bad news is that OrientDB does not support a clustered configuration yet (planned for September 2010).

Nevins answered 12/8, 2010 at 16:6 Comment(0)
I would expect Cassandra to do better than a B-tree-based system like Tokyo Cabinet, MySQL, or MongoDB when the dataset does not fit in memory. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.
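A rough sketch of why log-structured storage (which Cassandra uses) handles heavy writes well: writes go to an in-memory table and are flushed as sequential runs, instead of updating a B-tree in place. This is a toy model, not Cassandra's actual design (which adds commit logs, compaction, and Bloom filters):

```python
# Toy log-structured store: puts hit an in-memory table; when it fills, it
# is flushed as one sorted run. Gets check the memtable first, then flushed
# runs from newest to oldest. Only the write/read split is modeled here.
class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}
        self.sstables = []          # flushed runs, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value  # in-memory: no random disk I/O per write
        if len(self.memtable) >= self.flush_threshold:
            # Flush as one sorted, sequential run (cheap on spinning disks).
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest flushed data wins
            if key in table:
                return table[key]
        return None

store = ToyLSM()
for i in range(10):
    store.put(i, i * 2)
print(store.get(3), len(store.sstables))  # 6 2
```

The trade-off is also visible: a read may have to consult several runs, which is why such systems can lose to B-trees on read-heavy, memory-resident workloads but win when writes dominate.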

Choanocyte answered 8/8, 2010 at 12:5 Comment(0)