When to use a key-value store for web development?

When would someone use a key-value (Redis, memcache, etc) store for web development? An actual use case would be most helpful.

My confusion is that a simple database seems so much more functional because, to my understanding, it can do everything a key-value store can do PLUS it allows you to do filtering/querying. That is, to my understanding, you can NOT run a filter like:

select * from homes where price > 100000

with a key-value store.

Example

Let's pretend that StackOverflow uses a key-value store (memcache, redis, etc).

How would a key-value store benefit Stack Overflow's hosting needs?

Embolic answered 4/8, 2011 at 2:33 Comment(1)
Pretty sure you could apply filters on key-value stores if you wanted to - depends partly on the implementation of the store and maybe on your own ingenuity. - Ortego

I can't answer the question of when to use a key-value (herein kv) data store, but I can show you some examples and answer your Stack Overflow example.

With database access, most of what you need is a kv store. For example, a user logs in with the username "joe". So you look up "user:joe" in your database and retrieve his password (a hash, of course). Or maybe you keep his password under "user:pass:joe"; it really doesn't matter. If it were Stack Overflow and you were rendering the page https://mcmap.net/q/529305/-when-to-use-a-key-value-store-for-web-development, you would look up "question:6935566" and use that. It is easy to see how kv stores can solve most of your problems.
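To make the lookup pattern concrete, here is a minimal sketch with the store stubbed by a plain dict (a real app would talk to Redis, memcached, etc.). The key names like "user:pass:joe" and "question:6935566" follow the hypothetical naming scheme above:

```python
# Stub key-value store: a plain dict standing in for Redis/memcached.
store = {}

def kv_set(key, value):
    store[key] = value

def kv_get(key):
    return store.get(key)

# Store a user's password hash under a composed key...
kv_set("user:pass:joe", "5f4dcc3b5aa765d61d8327deb882cf99")
# ...and a question document under its id.
kv_set("question:6935566", {"title": "When to use a key-value store?"})

# Rendering a page is then a single lookup by key:
question = kv_get("question:6935566")
assert question["title"] == "When to use a key-value store?"
```

Every page render becomes one or two gets by a known key, which is exactly the operation kv stores are optimized for.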

I would like to say that a kv store provides a subset of the functionality of a traditional RDBMS. This is because the design of a traditional RDBMS runs into many scaling issues, and it generally loses features as you scale. kv stores don't come with these features, so they don't limit you. However, these features can often be rebuilt anyway, designed from the start to be scalable (because it becomes immediately obvious when they are not).

That doesn't mean there are things you can't do, however. For example, you mention querying. This is a pitfall of many kv stores, as they are generally agnostic of the value (not always true; Redis, for example, understands some value types) and have no way of finding what you are looking for. Worse, they are not designed to do that quickly; they are just really quick at looking up by key.

One solution to this problem is to sort your keys lexicographically and allow range queries. This is essentially "give me everything between question:1 and question:5". Now that example is fairly useless, but there are many uses of range queries.
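A range query like "everything between question:1 and question:5" can be sketched with a sorted list of keys and Python's bisect module standing in for the store's keyscan (key names are the hypothetical ones from above):

```python
import bisect

# Sorted key space, simulating a store that keeps keys in
# lexicographic order. Note: numeric ids would need zero-padding
# for string order to match numeric order ("question:10" sorts
# before "question:2").
keys = sorted(["question:1", "question:2", "question:3",
               "question:5", "question:9"])

def keyscan(start, end):
    """All keys k with start <= k < end, in sorted order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]

assert keyscan("question:1", "question:5") == \
    ["question:1", "question:2", "question:3"]
```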

You said you want all houses over $100 000. To do this you would create an index of houses by price. Say you had the following houses:

house:0 -> {"color":"blue","sold":false,"city":"Stackoverville","price":500000}
house:1 -> {"color":"red","sold":true,"city":"Toronto","price":150000}
house:2 -> {"color":"beige","sold":false,"city":"Toronto","price":40000}
house:3 -> {"color":"blue","sold":false,"city":"The Blogosphere","price":110000}

In SQL you would store each field in a column rather than having it all in one (in this case JSON) document, and you could SELECT * FROM houses WHERE price > 100000. This seems all fine and dandy, but if there isn't an index built, it requires looking at every house in your table and checking its price, which, if you have a couple million houses, could be slow. So with a kv store you need an index as well. The main difference is that the SQL database will silently do the slow thing, whereas the kv store simply can't.

If you don't have range queries you would need to stick your index in a single document, which makes safely updating it a pain and means that you would have to download the whole index for every query, again, limiting scalability.

house:index:price -> [{"price":500000,"id":"0"},{"price":150000,"id":"1"},{"price":110000,"id":"3"},{"price":40000,"id":"2"}]

But if you have range queries (often called keyscans) you can create an index like this:

house:index:price:040000 -> 2
house:index:price:110000 -> 3
house:index:price:150000 -> 1
house:index:price:500000 -> 0

And then you could request the keys between house:index:price:100000 and house:index:price:: (the ':' character comes right after '9' in ASCII) and you would get [3,1,0], which is all the houses more expensive than $100 000 (helpfully, already in order). Another nice thing about this is that these keys will likely live on one "partition" of your cluster, so the query will take about the same time as a single get (plus the tiny extra transfer overhead), or two gets if your range happens to cross a server boundary (but those can be done in parallel!).
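The price index above can be sketched as follows, with the keyscan simulated by a sorted list and bisect (a real store with range queries would do this server-side):

```python
import bisect

# The index entries from the example: zero-padded price -> house id.
# Zero-padding makes string order match numeric order.
index = {
    "house:index:price:040000": 2,
    "house:index:price:110000": 3,
    "house:index:price:150000": 1,
    "house:index:price:500000": 0,
}
keys = sorted(index)

def range_query(start, end):
    """Return values for keys in [start, end), in key order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return [index[k] for k in keys[lo:hi]]

# Houses more expensive than $100 000; ':' sorts right after '9',
# so "house:index:price::" is past every padded price.
assert range_query("house:index:price:100000",
                   "house:index:price::") == [3, 1, 0]
```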

So that shows how to do queries in a kv store. You can query anything that can be ordered as a string (just about anything) and look it up very quickly. If you don't have range queries you will need to store your whole index under one key which sucks, but if you have range queries it is very nice, and very fast. Here is a more complex example.

I want unsold houses in Toronto that are more than $100 000. I simply have to design my index. (I added a couple of houses to make it more meaningful.) At first thought you might just build another index for every property, but you will quickly realize that this means you have to select every unsold house and download it from the database. (This is what I meant when I said scaling problems become immediately obvious.) The solution is to use a multi-index. Once built, you can select exactly the values you want.

house:index:sold:city:price:f~Fooville~000010:5        -> ""
house:index:sold:city:price:f~Toronto~040000:2         -> ""
house:index:sold:city:price:f~Toronto~140000:4         -> ""
house:index:sold:city:price:t~Stackoverville~500000:0  -> ""
house:index:sold:city:price:t~The Blogosphere~110000:3 -> ""
house:index:sold:city:price:t~Toronto~150000:1         -> ""

Now, unlike the last example, I put the id in the key. This allows two houses to have the same properties. I could have merged them into the value, but then adding and removing index entries becomes more difficult. I also chose to separate my fields with a ~. This is because ~ sorts lexicographically after all of the letters, ensuring that full city names sort correctly and I don't have to pad every city to the same length. In a production system I would probably use the byte 255 or 0.

Now the range house:index:sold:city:price:f~Toronto~100000 - house:index:sold:city:price:f~Toronto~~ will select all houses that match the query. The important thing to note is that the query scales linearly with the number of results. It does mean that you have to build an index for every set of properties you want to query by (although the index in our example also covers sold and sold-city queries). This may seem like a lot of work, but in the end you realize it is the same work your database would otherwise do; you are just doing it yourself. I'm sure we will begin to see libraries for this kind of thing coming out soon :D
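The multi-index above can be sketched end-to-end: build composite keys from the house records, keep them sorted, and range-scan. As before, the sorted list and bisect simulate a store with keyscans:

```python
import bisect

# House records, matching the example data (Fooville's price is $10,
# as in the 000010 entry above).
houses = {
    0: {"sold": True,  "city": "Stackoverville",  "price": 500000},
    1: {"sold": True,  "city": "Toronto",         "price": 150000},
    2: {"sold": False, "city": "Toronto",         "price": 40000},
    3: {"sold": True,  "city": "The Blogosphere", "price": 110000},
    4: {"sold": False, "city": "Toronto",         "price": 140000},
    5: {"sold": False, "city": "Fooville",        "price": 10},
}

def index_key(hid, h):
    # '~' separates fields (it sorts after all letters); the id is
    # appended so two identical houses get distinct keys.
    sold = "t" if h["sold"] else "f"
    return "house:index:sold:city:price:%s~%s~%06d:%d" % (
        sold, h["city"], h["price"], hid)

index = sorted(index_key(hid, h) for hid, h in houses.items())

def scan(start, end):
    lo = bisect.bisect_left(index, start)
    hi = bisect.bisect_left(index, end)
    # The house id is the last ':'-separated component of each key.
    return [int(k.rsplit(":", 1)[1]) for k in index[lo:hi]]

# Unsold houses in Toronto over $100 000:
matches = scan("house:index:sold:city:price:f~Toronto~100000",
               "house:index:sold:city:price:f~Toronto~~")
assert matches == [4]
```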

After stretching the topic a bit, I have shown:

  • Some uses of a kv store.
  • How to do queries in a kv store.

I think you will find that kv stores are enough for many applications and can often provide better performance and availability than a traditional RDBMS. That being said, every app is different, so it is impossible to answer the original question definitively.

Chromolithograph answered 9/4, 2013 at 17:2 Comment(1)
This is one of the most informative and eye-opening answers I have ever read on Stack Overflow. Before reading this I had no idea how the internals of a database actually worked. Now I feel ready to go build something with a kv store to which I would have previously said "welp, you can only do that with SQL." - Caa

Do not confuse a NoSQL type database with something like memcached (which is not intended to store data permanently).

Typical use for memcached is to store query results that can be accessed by a cluster of web servers, i.e. a shared cache. For example, on this page there is a list of related posts, and there is likely a bit of work for the database to do to produce that list. If you did that every time someone loaded the page, you would create a lot of work for the database. Instead, once the results are retrieved for the first time, they could be stored on a memcached server with the page ID as the key. Any of the web servers in the cluster can then fetch that information very quickly without constantly hitting the database. After a while, the cache entry would be purged by memcached so that the results for old articles don't use up space. [Disclaimer: I've no idea if Stack Overflow does this in reality.]

A "NoSQL" database on the other hand is for storing information permanently. If your data schema is quite simple and so are your queries, then it may be faster than a standard SQL database. A lot of web applications don't need hugely complex databases, and so NoSQL databases can be a good fit.

Titania answered 4/8, 2011 at 3:18 Comment(2)
Why wouldn't you just cache the ENTIRE page instead? - Embolic
You could cache parts of the page, but not all of it, since (for example) it has my login name at the top of my version. But it's a fair point - you could cache quite a lot of it as an HTML snippet. - Titania

There are two generally viable use-cases for noSQL:

  1. Rapid application development
  2. Massively scalable systems

The fact that most noSQL solutions are effectively schema-less; require far less ceremony to operate; are light-weight (in terms of API); and provide significant performance gains in contrast to the more canonical relational persistence systems informs their suitability for the above 2 use-cases (in the general sense).

Being cynical -- or perhaps practical in the business sense -- one can propose a 3rd general use-case for noSQL systems (still informed by the above set of characteristics/features):

It is easier to grok, and any inexperienced (but un-brain-dead) aspiring geek can pick it up in a snap. That is a very powerful feature. (Try that with Oracle ..)

So, the use-cases of noSQL systems -- which in general can be characterized as relaxed persistent systems -- are all optimally informed by practical considerations.

There is absolutely no question -- outside of hugely massively scalable systems -- that RDBMS systems are formally perfect systems designed to ensure data integrity.

Caning answered 6/8, 2011 at 3:38 Comment(0)

Key-value stores are usually really fast so it's good to have them as a cache for data that is heavily accessed and rarely updated to reduce load on your DBs.

As you said, you are usually limited with queries (though MongoDB handles them pretty well), but key-value stores are mostly meant for accessing precise data: user X's profile, session X's info, etc.

A "traditional" DB will probably be more than enough for the average website, but if you experience high loads key-value stores can really help your load times.

EDIT: And by "high loads", I mean really high loads. Key-value stores are rarely necessary.

See this comparison of key-value stores.

Myke answered 4/8, 2011 at 2:39 Comment(1)
does your answer still apply if you have a json array with 1000 items and 8 string fields per item that needs to be refreshed every 20 seconds and will be accessed by fuzzy searching the keys? - Inquisition

Just adding to bstrawson's answer: memcached is a caching mechanism, while Redis is permanent storage, but both store data as key-value pairs.

Searching by value in a key-value store (something like Redis or Membase) is like scanning every value in a relational database: too slow. If you want to do some querying, you may need to move to a document-oriented NoSQL DB such as MongoDB or CouchDB, which can handle the query part.

In the near future you will be able to use Couchbase Server 2.0, which promises to address your burning issues with NoSQL data querying via the newly introduced UnQL, plus caching (directly derived from the memcached source code).

Derosa answered 4/8, 2011 at 12:5 Comment(0)

Stack Overflow does indeed use Redis, and extensively. There is a detailed answer to your question, with Stack Overflow as the example, in a couple of nice blog posts by @Mark Gravell. Mark is the author of the superb BookSleeve fully-asynchronous .NET Redis binding library.

Antique answered 10/8, 2011 at 2:0 Comment(0)