What is a distributed cache?
I am confused about the concept of a distributed cache. From searching, I roughly understand what it is: a distributed cache may span multiple servers so that it can grow in size and in transactional capacity. However, I do not really understand how it works or how it distributes the data.

For example, let's say we have Data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and two cache servers, A and B. If we use a distributed cache, then one possible arrangement is that Data 1, 3, 5, 7, 9 are stored on cache server A, and Data 2, 4, 6, 8, 10 are stored on cache server B.

So is this correct or did I misunderstand it?

Second question: I often hear the term server node. What is it? In the above example, server A is a server node, right?

Third question: if a server (let's say server A) goes down, what can we do about that? If my example above is correct, we cannot get Data 1, 3, 5, 7, 9 from the cache while server A is down; what can the cache do in this case?

Logjam answered 17/3, 2013 at 5:42 Comment(1)
Relevant read 8bitmen.com/…Cordate
  1. Yes, half the data on server A and half on server B would be a distributed cache. There are many methods of distributing the data, though some sort of hashing of the keys seems to be the most popular.

  2. The terms server and node are generally interchangeable. A node is generally a single unit of some collection, often called a cluster. A server is generally a single piece of hardware. In Erlang, you can run multiple instances of the Erlang runtime on a single server, and thus you'd have multiple Erlang nodes... but generally you'd want to have one node per server for more optimal scheduling. (For non-distributed languages and platforms, you have to manage your processes based on your needs.)

  3. If a server goes down, and it is a cache server, then the data has to come from its original source. A cache is usually a memory-based database designed for quick retrieval, and data in the cache sticks around only as long as it is being used regularly; eventually it will be purged. But for distributed systems where you need persistence, a common technique is to keep multiple copies. E.g.: you have servers A, B, C, D, E, and F. For data 1, you would put it on A, and then copies on B and C. Couchbase and Riak do this. For data 2, it could be on B, with copies on C and D. This way, if any one server goes down, you still have two copies.
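To make points 1 and 3 concrete, here is a minimal Python sketch of hash-based placement with two extra copies. All names (`nodes_for`, `NODES`, `REPLICAS`) are illustrative, not any real cache's API; real systems use more sophisticated schemes such as consistent hashing or vbuckets.

```python
import hashlib

NODES = ["A", "B", "C", "D", "E", "F"]
REPLICAS = 3  # one primary copy plus two backups

def nodes_for(key: str, nodes=NODES, replicas=REPLICAS):
    """Pick the primary node by hashing the key, then place the
    extra copies on the next nodes in the list (wrapping around)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Every key lands on a primary node plus two consecutive backups,
# so the loss of any single node still leaves two live copies.
placement = nodes_for("data-1")
```

This mirrors the answer's example: whichever node `nodes_for` picks as primary, the copies go on the next two, so "data 1 on A, copies on B and C" and "data 2 on B, copies on C and D" both fall out of the same rule.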

Fruiterer answered 17/3, 2013 at 5:48 Comment(8)
Firstly, thanks for your quick response; it is very clear and I really appreciate it. About point 3: can I use servers A and B for real-time caching, and servers C, D, E, F only for backups? I.e., for data 1, I put it on server A and then copies on C and D; for data 2, I put it on server B and then copies on E and F. So C, D, E, F are only used if A or B goes down. This way looks more structured. Is that what you mean in point 3?Logjam
To make it clearer: the difference is that you mix the data and its copies across all the servers. For example, you said to put data 1 on server A with copies on B and C, and data 2 on server B with copies on C and D. Then server B holds both the "original" data 2 and a copy of data 1. Would it be better if I separated all "original data" from the copies and put them on different servers?Logjam
Well, it sounds like you're writing your own software so you can, of course, do whatever you want. But when designing these kinds of systems, you have to keep aware of failure modes and the cost of management of the servers, especially if you get complex systems with many servers.Fruiterer
What you're describing is a "hot failover" system where servers A and B are the masters and the rest simply sit idle until they are needed. This is more expensive, because the load your cluster can handle is only two servers' worth (A and B). I described a fully distributed system where all of the servers can answer queries for some data, which lets you handle more load for a given number of servers because none are idle.Fruiterer
Finally, Couchbase 2.0 provides basically the system I described. An in memory RAM cache on every server, with all data being persisted to disk, and all data being distributed around the cluster. Unless you're solving this problem for learning purposes, you can use Couchbase to provide this and much more (such as incremental indexes, hot failover and automatic rebalancing of the data.) This is a hard thing to do completely and well, which is why I'm not doing it myself. :-)Fruiterer
I have another question about your approach of keeping 2 extra copies of each piece of data. For example, suppose I keep putting data on cache C, with copies on D and E, until cache C is full. Now I put another piece of data, say DataX, on cache A; its copies are supposed to go on B and C, but C is already full. What can I do then? @Bill WarrenLogjam
I know we can use a replacement algorithm such as LRU on a single cache server. But in the above case, if I use such an algorithm to replace an entry (let's say Data 1) with the copy of DataX, then Data 1 will only have 2 copies across all the cache servers, because the copy of Data 1 on cache C was evicted. Then it is inconsistent. Do you have any good ideas to fix this problem? @Bill WarrenLogjam
Look at the way memcached buckets work in Couchbase 2.0. It's much better to just use that (it's open source) than to implement your own. But the answer to your question is: if this is really a cache, then when it has too many items it should delete some. That's its purpose. When it purges an item because it's full, it also purges the copies from the other two servers. By using consistent hashing, the data is spread evenly around the cluster, so no cache should get too full, since the hash is essentially a random assignment.Fruiterer
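The consistent hashing mentioned in that comment has a key property worth seeing in code: when a node leaves, only the keys that lived on it move, and everything else stays put. Below is a toy ring in Python (no virtual nodes, for brevity; the `HashRing` name and structure are illustrative, not Couchbase's actual implementation).

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Deterministic hash into a large integer key space."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node sits at the position of
    its own hash; a key belongs to the first node at or after it."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        positions = [p for p, _ in self.ring]
        i = bisect.bisect(positions, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["A", "B", "C"])
before = {k: ring.node_for(k) for k in map(str, range(100))}

ring2 = HashRing(["A", "B"])  # server C goes down
after = {k: ring2.node_for(k) for k in before}

# Only the keys that lived on C get reassigned; keys on A and B
# keep their old owner, so only C's share of the data must be
# refetched or restored from replicas.
moved = sum(before[k] != after[k] for k in before)
```

With naive `hash(key) % len(nodes)` placement instead, removing one node would reshuffle almost every key, which is exactly what consistent hashing avoids.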

I have been using distributed caching solutions for quite some time now (NCache, AppFabric, etc.), and I am going to answer all three questions based on my experience with distributed caching.

1: A distributed caching solution lets you keep data across all the servers by creating a cache cluster. Let's say you have 2 cache servers (server nodes) and you have added 10 items to your cache. Ideally, each server node should hold 5 items, since the data load gets distributed across the servers in your cache cluster. This is usually achieved with the help of hashing and intelligent data distribution algorithms. As a result, your data request load also gets divided between all cache servers, and you achieve linear growth in transactional capacity as you add more servers to the cache cluster.

2: A cache cluster can contain many server machines which are also called server nodes. Yes, Server A is a server node or server machine in your example.

3: Typically, distributed caching systems are very reliable thanks to replication support. If one or more servers go down and you had replication turned on, there will be no data loss or downtime. NCache has different topologies to tackle this, such as the replicated topology and the partitioned-replica topology, where each server's data is also replicated to another server. If one server goes down, the replicated data of that server is automatically made available from a surviving server node.

In your example, the data of server A (1, 3, 5, 7, 9) is replicated to server B, and the data of server B (2, 4, 6, 8, 10) is replicated to server A. If server A goes down, the data of server A that is present on server B is made available and used from there, so no data loss occurs. So if server A goes down and the application requests data 1, it will be retrieved from server B, since server B holds a backup of all of server A's data. This is seamless to your applications and is managed automatically by the caching system.
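The failover described above can be sketched in a few lines of Python. This is a toy model, not NCache's API: two nodes, each holding a primary partition plus a mirror of the other node's data, with `get` falling back to the mirror when the owner is down. The odd/even split matches the question's example.

```python
class Node:
    """One cache server: its own partition plus a mirror of the peer's."""
    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.up = True

nodes = {"A": Node(), "B": Node()}

def owner_of(key: int):
    # Odd keys live on A, even keys on B, as in the question's example.
    return ("A", "B") if key % 2 else ("B", "A")

def put(key: int, value):
    owner, mirror = owner_of(key)
    nodes[owner].primary[key] = value
    nodes[mirror].backup[key] = value   # replicated copy on the peer

def get(key: int):
    owner, mirror = owner_of(key)
    if nodes[owner].up:
        return nodes[owner].primary[key]
    return nodes[mirror].backup[key]    # transparent failover

for i in range(1, 11):
    put(i, f"value-{i}")

nodes["A"].up = False        # server A goes down
value = get(1)               # data 1 is served from B's replica
```

The caller of `get` never sees which copy answered, which is the "seamless to your applications" part of the answer.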

Dystopia answered 4/6, 2018 at 6:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.