How do non-ACID databases like RethinkDB or MongoDB maintain secondary indexes for non-equality queries?

This is more of an 'inner workings' understanding question:

How do NoSQL databases that do not support the *A* in ACID (meaning that they cannot update/insert and then roll back data for more than one object in a single transaction) update their secondary indexes?

My understanding is that in order to keep the secondary index in sync (otherwise it will become stale for reads), this has to happen within the same transaction.

Furthermore, if it is possible for the index to reside on a different host than the data, then a distributed lock and/or two-phase commit needs to be present for such an update to work atomically.

But if these databases do not support multi-object transactions (which means they do not do two-phase commit on data across multiple hosts), what method do they use to guarantee that secondary indexes residing in B-tree structures separate from the data are not stale?
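To make the failure mode concrete, here is a minimal sketch, with plain Python dictionaries standing in for a storage engine and a B-tree (the names are illustrative, not any database's API), of why the two writes must be atomic:

```python
# Toy model: a primary store and a secondary index over 'email'.
store = {1: {'email': 'old@example.com'}}   # primary data, keyed by id
index = {'old@example.com': {1}}            # secondary index: email -> set of ids

def update_email(doc_id, new_email):
    old_email = store[doc_id]['email']
    store[doc_id]['email'] = new_email                  # (1) write primary data
    # A crash between (1) and (3) leaves index readers seeing the stale
    # entry, which is exactly the problem described above.
    index.setdefault(new_email, set()).add(doc_id)      # (2) add new index entry
    index[old_email].discard(doc_id)                    # (3) remove stale entry

update_email(1, 'new@example.com')
```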

Soundproof answered 10/8, 2013 at 15:25 Comment(0)

This is a great question.

RethinkDB always stores secondary indexes on the same host as the primary index/data for the table. Even in the case of joins, RethinkDB brings the query to the data, so the secondary indexes, primary indexes, and data always reside on the same node. As a result, there is no need for distributed locking protocols such as two-phase commit.

RethinkDB does support a limited set of transactional functionality: single-document transactions. Changes to a single document are recorded atomically. Relevant secondary index changes are also recorded as part of that transaction, so either the entire change is recorded, or nothing is recorded at all.

It would be easy to extend the limited transactional functionality to support multiple documents in a single shard, but it would be hard to do it across shards (for the distributed locking reasons you brought up), so we decided not to implement transactions for multiple documents yet.
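As an illustration, here is a hedged sketch using the RethinkDB Python driver of the era; the table and field names (users, email, age) are hypothetical, not from the original answer. Because the index lives with the data, both the equality lookup and the range query below are served by the same node that holds the documents:

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Create a secondary index and wait until it is ready to serve reads.
r.table('users').index_create('email').run(conn)
r.table('users').index_wait('email').run(conn)

# Equality query against the secondary index.
list(r.table('users').get_all('alice@example.com', index='email').run(conn))

# Range ("non-equality") query against a secondary index on a numeric field.
r.table('users').index_create('age').run(conn)
r.table('users').index_wait('age').run(conn)
list(r.table('users').between(18, 30, index='age').run(conn))
```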

Hope this helps.

Gothenburg answered 12/8, 2013 at 20:8 Comment(1)
To clarify a bit on this: every shard of a table contains all of the indexing data for that shard. This means there are as many copies of each index as there are replicas. It is incorrect to say that "the entire change is recorded, or nothing is recorded at all." Each replica will either record or not record the change (and the resulting index changes). A replica, and thus its index, can be out of date (and the stale data can be accessed by passing a use_outdated = True flag). This data will never be used without that flag being set and will eventually be brought up to date.Ritornello
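For reference, a hedged sketch of the use_outdated flag as it existed in the Python driver of this era (it was later replaced by a read-mode setting); the table and index names are hypothetical:

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Opt in to possibly-stale reads: the query may be served by a replica
# whose secondary index has not yet caught up with the primary.
stale_rows = list(
    r.table('users', use_outdated=True)
     .get_all('alice@example.com', index='email')
     .run(conn)
)
```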

This is a MongoDB answer.

I am not quite sure what your logic here is. Updating a secondary index has nothing to do with being able to roll back multi-statement transactions such as a multi-document update.

MongoDB has transactions per single document, and that is what matters for updating indexes. These operations can be reversed using the journal if the need arises.

this has to happen within the same transaction.

Yes, much like an RDBMS would. The more indexes you apply, the slower your writes will be, and it seems to me you know why.

As the write occurs, MongoDB will update all indexes that apply to that collection with the fields that apply to specific indexes.
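For illustration, a hedged sketch with PyMongo (the collection and field names are hypothetical, not from the original answer); each write below atomically updates the document and every index entry derived from it:

```python
from pymongo import MongoClient, ASCENDING

coll = MongoClient().test.users

# Two secondary indexes; every write now maintains both B-trees.
coll.create_index([('email', ASCENDING)])
coll.create_index([('age', ASCENDING)])

# A single-document write: the document and its index entries are
# updated atomically, so index readers never see a half-applied write.
coll.insert_one({'email': 'alice@example.com', 'age': 25})

# Range ("non-equality") query served by the 'age' index.
list(coll.find({'age': {'$gte': 18, '$lt': 30}}))
```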

if it is possible for the index to reside on a different host than the data

I am unsure whether MongoDB allows that; I believe there is a JIRA ticket for it, but I cannot find it currently.

then a distributed lock and/or two-phase commit needs to be present for such an update to work atomically.

Most likely. Allowing this feature would be... well, let's just say it would create a hairball.

Even in a sharded setup, the index of each range resides on the shard itself, not on the config servers.

But if these databases do not support multi-object transactions (which means they do not do two-phase commit on data across multiple hosts)

That is not what a two-phase commit means. I believe you need to brush up on what a two-phase commit is: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/

I suppose if you are talking about a transaction covering more than one shard, then, hmm, OK.
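For context, here is a hedged, compressed sketch of the client-side two-phase commit pattern described in that tutorial, in PyMongo. The collection layout (transactions, accounts, pendingTransactions) follows the tutorial, but all error handling and recovery steps are omitted:

```python
from pymongo import MongoClient

db = MongoClient().test

# Phase 0: record the intent in a transactions collection.
db.transactions.insert_one({'_id': 1, 'source': 'A', 'destination': 'B',
                            'value': 100, 'state': 'initial'})
db.transactions.update_one({'_id': 1, 'state': 'initial'},
                           {'$set': {'state': 'pending'}})

# Phase 1: apply the change to both accounts, marking each as a participant.
db.accounts.update_one({'_id': 'A', 'pendingTransactions': {'$ne': 1}},
                       {'$inc': {'balance': -100},
                        '$push': {'pendingTransactions': 1}})
db.accounts.update_one({'_id': 'B', 'pendingTransactions': {'$ne': 1}},
                       {'$inc': {'balance': 100},
                        '$push': {'pendingTransactions': 1}})

# Phase 2: commit, then clean up the participant markers.
db.transactions.update_one({'_id': 1, 'state': 'pending'},
                           {'$set': {'state': 'applied'}})
db.accounts.update_one({'_id': 'A'}, {'$pull': {'pendingTransactions': 1}})
db.accounts.update_one({'_id': 'B'}, {'$pull': {'pendingTransactions': 1}})
db.transactions.update_one({'_id': 1, 'state': 'applied'},
                           {'$set': {'state': 'done'}})
```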

what method do they use to guarantee that secondary indexes residing in B-tree structures separate from the data are not stale?

Again, I am unsure why a multi-document transaction would affect whether an index is stale or not; you are not grouping across documents. The exception to that is a unique index, but that works on single-document updates as well; note that its uniqueness gets somewhat hairy in sharded setups and cannot be guaranteed.

In an index you are normally creating one entry per document, unless it is a multikey index on the document, in which case a single document can produce multiple index entries (see the sketch below). Either way, index updating is done per single object, not by multi-document transactions, and I am unsure what your logic here is; as such, this is the answer I have placed.
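To show the one-entry-per-document versus multikey distinction, a hedged PyMongo sketch (collection and field names hypothetical):

```python
from pymongo import MongoClient

coll = MongoClient().test.posts

# A scalar field yields one index entry per document...
coll.create_index('author')
coll.insert_one({'author': 'alice', 'tags': ['db', 'nosql', 'index']})

# ...while an index on an array field becomes multikey: the single
# document above contributes one 'tags' index entry per array element.
coll.create_index('tags')
list(coll.find({'tags': 'nosql'}))  # matched via any one of those entries
```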

Hendrika answered 10/8, 2013 at 20:36 Comment(18)
Two-phase commit is a protocol used to maintain transaction integrity across multiple hosts. So yes, I am talking about 'transactions on multiple documents' across multiple hosts. So the question was: if Mongo and Rethink do not do two-phase commit transactions today, how can they maintain secondary indexes on a shard different from the data? I think your answer was that they do not maintain a secondary index on a different shard than the data. Therefore they do not need two-phase commit.Soundproof
@VP OK, so you mean in a sharded setup. Then yes, they will not maintain that index on the other shard. Also, two-phase commit has no requirement to be between two hosts; it is a case of two distinct transactions being completely atomic as one operation.Hendrika
OK, understood for Mongo. I would like to get some feedback on RethinkDB. Rethink has joins, which means an index may be in the memory of a host different from the data host. Also, two-phase commit has a requirement of separate hosts: "A distributed transaction is an operations bundle, in which two or more network hosts are involved" en.wikipedia.org/wiki/Two-phase_commit_protocolSoundproof
@VP I do not find that text on that pageHendrika
@Sammaya: the two-phase commit link above has the text "It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction". The definition of a distributed atomic transaction is then given by this link: en.wikipedia.org/wiki/Distributed_transaction, and that link has the text I referred to when stating that two-phase commit necessarily involves more than one host.Soundproof
@VP I would consider that page incorrect, and it is disputed. You can have a two-phase commit to the same shard. Let's take an example: updating the index and data files, where the index and data reside on the same shard; a two-phase commit is used. You can use two-phase commits in any "distributed" (such a terribly vague word to use for these technologies) environment that sits on a single server.Hendrika
@VP I mean, let's take another example from the docs: let's modify the two-phase commit of transferring money so that the two accounts reside on the same shard. If MongoDB supported it (which it doesn't; you must do this client-side at the moment), this distributed transaction would be to a single host.Hendrika
@Sammaya I do not consider the Wikipedia page on two-phase commit wrong. In your example there is no reason to use two-phase commit, because you can use in-memory locking and the durable-write guarantee provided by the OS APIs of the same host. The two-phase commit protocol exists for when you cannot rely on a durable write guarantee and on the stability of a lock. Your scenario is handled by what is called a transaction monitor, which is a client-side construct above the database, and is not related to a database-level (client-invisible) transaction.Soundproof
@VP One of the main points (as stated by Wikipedia) of a two-phase commit is to ensure database consistency in rollback scenarios; the OS does not provide such abilities, and in-memory state is useless since it is not durable. The OS cannot understand the true consistency levels required by the database; it concerns itself only with the file system. The main problem with two-phase commits in MongoDB at the moment is that they are client-side, which means they are not as good as RDBMS alternatives; you don't put transaction monitors client-side, it is too flimsy.Hendrika
@Sammaya. As a reference: shared-nothing, multi-host systems like the MarkLogic database (NoSQL, native XML and JSON), SAP HANA, Teradata, and Oracle (via database links) do this. They also ensure that multi-object transactions are replayed in the correct order during replication and recovery from backup. But again, this discussion of the definition and examples of two-phase commit is not helping me with the original question; it is more of a sideline, but I am happy to continue if it is helping you.Soundproof
@VP Yes, that is part of a two-phase commit. I am unsure how that comment helped; however, that is why they have the transaction monitor server-side and not client-side as you suggest.Hendrika
@Sammaya, when I posted my last comment, I had not seen your last comment. But I think, just as with two-phase commit, you are somewhat misinformed about rollback as well. I did not claim that multi-object transaction rollback is solely implemented by two-phase commit; neither did the Wikipedia article. Even in single-host databases, multi-object rollback is handled by a redo log. Two-phase commit on its own is not enough for multi-object transactions, but it is necessary for them during multi-host operations, and is not necessary in single-host (database-level) scenarios.Soundproof
@VP One of the defining points of a two-phase commit is to write a journal for rollback; otherwise it breaks one of the constraints of the purpose of the two-phase commit protocol. It is necessary in all cases where you wish to ensure consistency in multi-object updates using two-phase commits.Hendrika
@VP In most database systems, multi-object rollback is ensured by two-phase commit; it is a pretty standard tool for this job. Fair enough, there are other tools, but that is the most common one you see in most databases. Also, two-phase commit should be good enough; it is used by all the ACID technologies to ensure the same thing.Hendrika
@Sammaya. No, writing a journal for rollback is not a 'defining point' of two-phase commit. It is something two-phase commit must rely on. The reliance is also documented in the same article: it is called reliance on stable storage and a write-ahead log (see the assumptions section) en.wikipedia.org/wiki/Two-phase_commit_protocol. Transactional file system operations are not called two-phase commit; they are a separate and quite useful feature.Soundproof
@VP Two-phase commit and the two-phase commit protocol are two different things; you will notice I mentioned two-phase commit. The protocol is as you link in the Wikipedia article, but the tool implemented in many databases includes rollback journals. Also, the file system is useless for the database unless you are using Oracle on a dedicated file system; how do file system transactions help ensure consistency between joins they cannot see?Hendrika
@VP Let's put it this way: using the typical example of transferring money between accounts, how does the file system help in a visible manner that has an impact on the user's workflow and helps to define the rollback consistency of that data?Hendrika
@VP In fact, that Wikipedia page does hint at rollback: "To accommodate recovery from failure (automatic in most cases) the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures." But its English is misleading in places.Hendrika

RethinkDB always stores secondary index data on the same machine as the data it is indexing. This allows the index to be updated within the same transaction. Rethink promises to be ACIDy for single-document operations and considers the indexing of a document to be part of the document itself.
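As a hedged illustration of that guarantee in the Python driver of the era (table and field names hypothetical): a single-document update rewrites the document's secondary index entries as part of the same atomic write.

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# One document, one atomic write: the 'email' secondary index entry for
# this document is replaced in the same transaction, so a reader using
# the index on this node never sees the old and new state mixed.
r.table('users').get(1).update({'email': 'new@example.com'}).run(conn)
```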

Ritornello answered 12/8, 2013 at 19:30 Comment(0)
