Aerospike Design | Request Flow Internals | Resources
Where can I find information about how a read/write request flows through the cluster when fired from the client API?

In the Aerospike configuration reference ( http://www.aerospike.com/docs/reference/configuration ), transaction queues, service threads, transaction threads, etc. are mentioned, but they are not discussed in the architecture documentation. I want to understand how they work so that I can configure them accordingly.

Norvun answered 19/5, 2017 at 17:7 Comment(0)
From client to cluster node

In your application, a record's key is the 3-tuple (namespace, set, identifier). The key is passed to the client for all key-value methods (such as get and put).

The client then hashes the (set, identifier) portion of the key through RIPEMD-160, resulting in a 20-byte digest. This digest is the actual unique identifier of the record within the specified namespace of your Aerospike cluster. Each namespace has 4096 partitions, which are distributed across the nodes of the cluster.

The client uses 12 bits of the digest to determine the partition ID of this specific key. Using the partition map, the client looks up the node that owns the master partition corresponding to the partition ID. As the cluster grows, the cost of finding the correct node stays constant (O(1)), as it does not depend on the number of records or the number of nodes.
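As an illustration, the digest-to-partition mapping can be sketched in a few lines. The exact byte handling below follows the open-source clients in spirit but is a sketch, not library code, and the sample digest is a hypothetical stand-in rather than a real RIPEMD-160 output:

```python
# Sketch of how a client maps a record digest to a partition ID:
# take 12 bits of the digest (read little-endian) modulo the fixed
# partition count of 4096.

N_PARTITIONS = 4096

def partition_id(digest: bytes) -> int:
    # Interpret the first two digest bytes as a little-endian integer,
    # then reduce to 12 bits -> a value in [0, 4095].
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# A hypothetical 20-byte digest (a real one comes from RIPEMD-160
# over the set name + identifier).
digest = bytes(range(20))
print(partition_id(digest))  # 0x0100 % 4096 = 256
```

With the partition ID in hand, the client only needs one partition-map lookup to find the owning node, which is why routing cost does not grow with cluster size.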

The client converts the operation and its data into an Aerospike wire protocol message, then uses an existing TCP connection from its pool (or creates a new one) to send the message to the correct node (the one holding this partition ID's master replica).

Service threads and transaction queues

When an operation message comes in as a NIC transmit/receive queue interrupt, a service thread picks up the message from the NIC. What happens next depends on the namespace this operation is supposed to execute against. If it is an in-memory namespace, the service thread will perform all of the following steps. If it's a namespace whose data is stored on SSD, the service thread will place the operation on a transaction queue. One of the queue's transaction threads will perform the following steps.
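A toy sketch of this dispatch decision (the queue, thread, and field names here are illustrative, not Aerospike internals):

```python
# In-memory namespace: the service thread executes the operation inline.
# SSD-backed namespace: the service thread hands the operation to a
# transaction queue, whose worker (transaction) threads execute it.
import queue
import threading

transaction_queue: "queue.Queue" = queue.Queue()
results = []

def execute(op):
    results.append(f"done:{op['name']}")

def service_thread(op, namespace_in_memory: bool):
    if namespace_in_memory:
        execute(op)                 # service thread does all the work
    else:
        transaction_queue.put(op)   # defer to a transaction thread

def transaction_thread():
    while True:
        op = transaction_queue.get()
        if op is None:              # sentinel: shut down
            break
        execute(op)

worker = threading.Thread(target=transaction_thread)
worker.start()
service_thread({"name": "read-mem"}, namespace_in_memory=True)
service_thread({"name": "write-ssd"}, namespace_in_memory=False)
transaction_queue.put(None)
worker.join()
print(sorted(results))  # ['done:read-mem', 'done:write-ssd']
```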

Primary index lookup

Every record has a 64-byte metadata entry in the in-memory primary index. The primary index is expressed as a collection of sprigs per partition, with each sprig implemented as a red-black tree.

The thread (either a transaction thread or the service thread, as mentioned above) finds the partition ID from the record's digest, and skips to the correct sprig of the partition.
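A simplified model of that lookup path, using a plain dict in place of the red-black tree (the sprig count and the digest bits used for sprig selection are assumptions for illustration):

```python
# Model of the primary index: each partition holds a number of "sprigs",
# and a record's digest selects both the partition and the sprig.
N_PARTITIONS = 4096
SPRIGS_PER_PARTITION = 256  # hypothetical count

# primary_index[partition_id][sprig_id] -> {digest: metadata entry}
primary_index = {}

def locate(digest: bytes):
    pid = int.from_bytes(digest[:2], "little") % N_PARTITIONS
    sprig = digest[2] % SPRIGS_PER_PARTITION  # assumed selection bits
    return pid, sprig

def index_insert(digest: bytes, metadata: dict):
    pid, sprig = locate(digest)
    primary_index.setdefault(pid, {}).setdefault(sprig, {})[digest] = metadata

def index_lookup(digest: bytes):
    pid, sprig = locate(digest)
    return primary_index.get(pid, {}).get(sprig, {}).get(digest)

d = bytes(range(20))
index_insert(d, {"generation": 1, "ptr": ("device0", 4096)})
print(index_lookup(d))  # {'generation': 1, 'ptr': ('device0', 4096)}
```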

Exists, Read, Update, Replace

If the operation is an exists, a read, an update, or a replace, the thread acquires a record lock, during which other operations wait to access the specific sprig. This is a very short-lived lock. The thread walks the red-black tree to find the entry with this digest. If the operation is an exists and the metadata entry does exist, the thread packages the appropriate message and responds. For a read, the thread uses the pointer metadata to read the record from the namespace storage.

An update needs to read the record as described above, and then merge in the bin data. A replace is similar to an update, but it skips reading the current record first. If the namespace is in-memory, the service thread writes the modified record to memory. If the namespace stores data on SSD, the merged record is placed in a streaming write buffer, pending a flush to the storage device. The metadata entry in the primary index is adjusted, updating its pointer to the new location of the record. Aerospike performs a copy-on-write for create/update/replace.
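The copy-on-write path can be sketched as follows; the append-only list stands in for a streaming write buffer, and all names are illustrative:

```python
# Copy-on-write update: read the current record, merge in the new bins,
# append the merged record to a new location, then swap the index pointer.
storage = []   # append-only "device"
index = {}     # digest -> offset into storage

def write_record(digest, bins):
    storage.append(dict(bins))    # new copy; old block left untouched
    index[digest] = len(storage) - 1

def read_record(digest):
    return storage[index[digest]]

def update(digest, new_bins):
    merged = dict(read_record(digest))  # copy-on-write: never mutate in place
    merged.update(new_bins)             # merge bin data
    write_record(digest, merged)

def replace(digest, new_bins):
    write_record(digest, new_bins)      # skips reading the old record

d = b"\x01" * 20
write_record(d, {"a": 1, "b": 2})
update(d, {"b": 3, "c": 4})
print(read_record(d))  # {'a': 1, 'b': 3, 'c': 4}
replace(d, {"x": 9})
print(read_record(d))  # {'x': 9}
```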

Updates and replaces also need to be communicated to the replica(s) if the replication factor of the namespace is greater than 1. After the record locking process, the operation is also parked in the RW Hash (Serializer) while the replica write completes. This is where other transactions on the same record queue up until they hit the transaction pending limit (AKA a hot key). The replica write(s) are handled by a different thread (rw-receive), releasing the transaction or service thread to move on to the next operation. When the replica writes complete, the RW Hash lock is released, and the rw-receive thread packages the reply message and sends it back to the client.
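A toy model of the RW Hash serializer and the hot-key limit (the pending-limit value and status names below are made up; in a real cluster the limit is configurable):

```python
# Operations on the same digest queue up while a replica write is in
# flight; beyond a pending limit the operation fails as a hot key.
PENDING_LIMIT = 4  # hypothetical; the real limit is a config parameter

rw_hash = {}  # digest -> list of parked operations

def park(digest, op):
    pending = rw_hash.setdefault(digest, [])
    if len(pending) >= PENDING_LIMIT:
        return "HOT_KEY_ERROR"          # too many transactions queued
    pending.append(op)
    return "PARKED"

def replica_write_done(digest):
    # Replica ack arrived: release everything queued on this digest.
    return rw_hash.pop(digest, [])

d = b"\x02" * 20
statuses = [park(d, f"op{i}") for i in range(6)]
print(statuses)               # 4 parked, then hot-key errors
print(replica_write_done(d))  # ['op0', 'op1', 'op2', 'op3']
```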

Create and Delete

If the operation is a new record being written, or a record being deleted, the partition sprig needs to be modified.

Like update/replace, these operations acquire the record-level lock and go through the RW hash. Because they add or remove a metadata entry from the red-black tree representing the sprig, they must also acquire the index tree reduction lock. This process also happens when the namespace supervisor thread finds expired records and removes them from the primary index. A create operation adds an element to the partition sprig.

If the namespace stores on SSD, the create will load the record into a streaming write buffer, pending a flush to SSD, and ahead of the replica write. It will update the metadata entry in the primary index, adjusting its pointer to the new block.

A delete removes the metadata entry from the partition sprig of the primary index.
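The extra lock that create/delete take can be sketched like this, with threading.Lock objects standing in for the real sprig-level locks (the lock granularity and names here are simplified for illustration):

```python
# create/delete change the shape of the sprig tree, so they take the
# index "reduction" lock in addition to the record lock; a read only
# needs the record lock.
import threading

record_lock = threading.Lock()
tree_reduction_lock = threading.Lock()
sprig = {}  # digest -> metadata, standing in for the red-black tree

def create(digest, metadata):
    with record_lock:
        with tree_reduction_lock:   # tree shape changes: extra lock
            sprig[digest] = metadata

def delete(digest):
    with record_lock:
        with tree_reduction_lock:
            sprig.pop(digest, None)

def read(digest):
    with record_lock:               # no tree-shape change: record lock only
        return sprig.get(digest)

d = b"\x03" * 20
create(d, {"generation": 1})
print(read(d))   # {'generation': 1}
delete(d)
print(read(d))   # None
```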

Summary

  • exists/read grab the record-level lock, and hold it for the shortest amount of time. That's also the case for update/replace when replication factor is 1.
  • update/replace also grab the RW hash lock, when replication factor is higher than 1.
  • create/delete also grab the index tree reduction lock.
  • For in-memory namespaces the service thread does all the work up to potentially the point of replica writes.
  • For data on SSD namespaces the service thread throws the operation onto a transaction queue, after which one of its transaction threads handles things such as loading the record into a streaming write buffer for writes, up until the potential replica write.
  • The rw-receive thread deals with replica writes and returning the message after the update/replace/create/delete write operation.


Tosh answered 2/6, 2017 at 22:12 Comment(6)
@mohit-gupta does this answer your question? – Tosh
Thanks a lot, Ronen, for such a lucid and detailed explanation. This is what I was looking for. This should be a part of the documentation, in my opinion :). A few more follow-up questions: – Norvun
1. Are both tree locks (I think you referred to them as record-level locks) and reduce locks exclusive? If so, what's the benefit of having two types of locks? 2. What's the data structure of the "RW Hash (Serializer)"? Is it a hash table of queues, keyed by the hash of the record's key, where the value is the list of transactions waiting for that record? If so, how is concurrency handled in this case? 3. What happens when, say, a replica is down and the rw-receive thread times out, esp. wrt the write commit policy? At which point can a transaction be said to be committed? Thanks again. – Norvun
1. Both locks are for the sprig (the subtree), and they're separated to reduce contention between read/update/replace and create/delete. 2. As described, there are situations where multiple operations accessing the same record are serialized. Aerospike is a multi-node distributed database, with each node being a multi-core, multi-threaded system, so multiple records can be handled in parallel. At the single-record level, operations are guaranteed to be isolated; they line up. How to handle hot keys is a separate topic: discuss.aerospike.com/t/hot-key-error-code-14/986 – Tosh
Regarding 3: Aerospike is an AP database, emphasizing high availability. A replica would only be down for a very short time, until the cluster automatically reorganizes around the N-1 remaining nodes. A new replica partition will be created elsewhere. If the write came just as this was happening, the replica write might time out, but the write will remain in the master partition and then get replicated as the cluster rebalances. The write therefore returns as successful if it only reaches the master while the cluster is changing. – Tosh
Again regarding (3): actually, with the default commit-level write policy of ALL, if any of the replica writes fail, an exception will be raised. Only if you relax the policy to MASTER might you lose a replica write, with recovery happening as I described. – Tosh
Set service threads == transaction queues == the number of cores in your CPU, or use CPU pinning (the auto-pin config parameter), if it is available in your version and possible in your OS environment.

transaction threads per queue -> 3 (the default is 4; for object sizes < 1KB in a non-data-in-memory namespace, 3 is optimal)
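For reference, these knobs live in the service context of aerospike.conf on pre-4.7 servers. The values below only illustrate the tuning described above, assuming a hypothetical 8-core machine; they are not universal recommendations:

```
service {
    service-threads 8                # one per CPU core
    transaction-queues 8             # match service threads / cores
    transaction-threads-per-queue 3  # per the tuning note above
    auto-pin cpu                     # if supported by your version and OS
}
```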

Cupreous answered 22/5, 2017 at 7:14 Comment(1)
That is what Aerospike recommends. It seems to be about exploiting each core by uniformly spreading the incoming transaction workload, while at the same time not making it greater than the number of cores, to avoid unnecessary context switching. – Cupreous
This changes with server version 4.7+: the transaction is now handled by the service thread itself. By default, the number of service threads is now set to 5 x the number of CPU cores. Once a service thread picks up a transaction from the socket buffer, it carries it through to completion, unless the transaction ends up in the rwHash (e.g. writes waiting for replication). The transaction queue still exists internally, but it is only relevant for transaction restarts when transactions queue up in the rwHash (multiple pending transactions for the same digest).

Cupreous answered 10/3, 2020 at 20:41 Comment(0)
