How do databases perform atomic I/O?

Databases like Oracle, SQL Server, etc. are very good at data integrity. If I wanted to write a data store that I knew would either store some data or fail (i.e. be ACID), then I would use a database like MySQL underneath it as the actual store, because those problems are already solved.

However, as a non-comp-sci graduate, I'm left wondering how ACID actually works at a very low level. I know that Oracle, for example, writes data to "online redo logs" all the time, then performs a "commit" at some point when the app signals that the transaction should be committed.

It's this "commit" stage that I want to zoom right in on and understand. Is it a case of just writing "one more byte" to disk, or flipping a 0 to a 1 to say a given row has been stored successfully?

Neuro answered 9/2, 2012 at 12:18 Comment(9)
Better suited to dba.stackexchange.com - as this is not a programming question, as such.Mohammedmohammedan
I respectfully disagree. If I were "administering a database" then fine, but if I were "writing" one, then it's a programming question, answerable by people who did software engineering or computer science degrees etc.Neuro
Fair enough. The question seems to ask about current implementations (ie. how does Oracle do this), hence my comment.Mohammedmohammedan
You could have a look at the PostgreSQL source code to find out how they do it. Apparently the code is very clean and relatively "easy" to understand.Wendling
Yes I was actually thinking of looking at RavenDB.Neuro
You may want to read Atomic Commit in SQLite which explains how SQLite does it.Tympanic
RavenDB probably won't help you much because it uses ESENT.dll to do its transactional bits.Courtney
@DamianPowell I'm not sure that's the case? ayende.com/blog/4686/raven-muninNeuro
@DamianPowell - That's what was originally used. Ayende moved on since.Mohammedmohammedan

I remember once finding the BerkeleyDB docs very useful for getting an idea of how these implementations can work, because it is/was quite a low-level database that implemented transactions without the whole relational/query-planning infrastructure.

Not all databases (even just the ones you mention) work in quite the same way. PostgreSQL's bottom-level implementation is quite different to both Oracle and SQL Server's snapshot implementation, even though they are all based on the same approach (MVCC: multi-version concurrency control).

One way to implement the ACID properties is to write all the changes that you ("you" here is some transaction making changes) make to the database into a "transaction log", as well as locking each row (unit of atomicity) to ensure no other transaction can mutate it until you have either committed or rolled back. At the end of the transaction, if committing, you simply write a record into the log saying you committed, and release the locks. If you roll back, you need to walk back through the transaction log undoing all your changes, so each change written to the log file contains a "before image" of how the data looked originally. (In practice it will also contain an "after image", because transaction logs are replayed for crash recovery too.) By locking each row you are changing, concurrent transactions do not see your changes until you release the locks after ending the transaction.
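
To make the "commit" step concrete, here is a minimal sketch of that scheme in C. The record layout, file name and field sizes are all invented for illustration, and a real engine would add checksums, log sequence numbers and torn-write protection; but it shows the key point: the commit is, at bottom, just one more record reaching stable storage.

    /* Toy write-ahead log: every change is logged with its before and
       after images; "commit" is just one more record in the same log. */
    #include <stdio.h>

    enum rec_type { REC_UPDATE, REC_COMMIT, REC_ABORT };

    struct log_record {
        enum rec_type type;
        int  txn_id;
        long row_id;
        char before[32]; /* before image: to undo, and to serve readers */
        char after[32];  /* after image: to redo after a crash */
    };

    static void append(FILE *log, const struct log_record *r) {
        fwrite(r, sizeof *r, 1, log);
        fflush(log); /* push to the OS; a real engine would also fsync */
    }

    int main(void) {
        FILE *log = fopen("txn.log", "wb");
        if (!log) return 1;

        /* txn#25 changes row 1 from FOO to BAR: logged before the data
           page is touched, while holding the row lock. */
        struct log_record upd = { REC_UPDATE, 25, 1, "FOO", "BAR" };
        append(log, &upd);

        /* The commit record. If this reaches disk, recovery replays the
           after images; if it doesn't, recovery applies the before
           images - so the transaction is all-or-nothing. */
        struct log_record commit = { REC_COMMIT, 25, 0, "", "" };
        append(log, &commit);

        fclose(log);
        return 0;
    }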

MVCC is a method by which concurrent transactions that want to read rows, rather than being blocked by you updating, can access the "before image" instead. Every transaction has an identity and a way of determining which transactions' data it can "see" and which it can't: different rules for producing this set are used to implement different isolation levels. To get "repeatable read" semantics, for example, a transaction must find the "before image" for any row that was updated by a transaction that started after it did. You could naively implement this by having transactions look back through the transaction log for the before images, but in practice they are stored somewhere else: hence Oracle has separate redo and undo spaces (redo is the transaction log; undo holds before images of blocks for concurrent transactions to use), while SQL Server stores the before images in tempdb. By contrast, PostgreSQL always creates a new copy of a row whenever it is updated, so the before images live in the data blocks themselves: this has some advantages (commit and rollback are both very simple operations, with no additional space to manage) and some tradeoffs (those outdated row versions have to be vacuumed up in the background).

In PostgreSQL's case (and this is the DB whose internals I'm most familiar with), each row version on disk has some additional properties that transactions must examine to decide whether that row version is "visible" to them. For simplicity, consider that they have "xmin" and "xmax": "xmin" specifies the transaction ID that created the row version, and "xmax" the (optional) transaction ID that deleted it (which may mean creating a new row version to represent an update to the row). So you start with a row created by txn#20:

xmin xmax id value
20   -    1  FOO

and then txn#25 performs update t set value = 'BAR' where id = 1

xmin xmax id value
20   25   1  FOO
25   -    1  BAR

Until txn#25 is finished, new transactions will know to regard its changes as not visible. So a transaction scanning this table will take the "FOO" version, since its xmax is a non-visible transaction.

If txn#25 is rolled back, new transactions will not immediately skip its row versions; instead they check whether txn#25 committed or rolled back (PostgreSQL maintains a "commit status" lookup table, pg_clog, to serve this). Since txn#25 rolled back, its changes are not visible, so again the "FOO" version is taken (and the "BAR" version is skipped, since its xmin transaction is invisible).

If txn#25 is committed, then the "FOO" row version is no longer taken, since its xmax transaction is visible (that is, the changes made by that transaction are now visible). By contrast, the "BAR" row version is taken, since its xmin transaction is visible (and it has no xmax).
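
Those rules are mechanical enough to sketch in C, using the FOO/BAR rows above. The toy clog array below stands in for pg_clog and all the names are invented; real PostgreSQL additionally consults a snapshot (a transaction that committed after your snapshot was taken is still invisible to you) and caches commit status in per-tuple hint bits.

    #include <stdbool.h>
    #include <stdio.h>

    enum txn_status { IN_PROGRESS, COMMITTED, ABORTED };

    /* Toy stand-in for pg_clog: commit status, indexed by txn ID. */
    static enum txn_status clog_status[100];

    #define INVALID_XID 0

    struct row_version { int xmin, xmax; const char *value; };

    /* Did xid commit, visibly to us? A real snapshot test would also
       treat transactions that committed after our snapshot as invisible. */
    static bool visible_commit(int xid) {
        return clog_status[xid] == COMMITTED;
    }

    static bool row_is_visible(const struct row_version *v) {
        if (!visible_commit(v->xmin))
            return false;  /* creator invisible: row doesn't exist for us */
        if (v->xmax == INVALID_XID)
            return true;   /* never deleted */
        return !visible_commit(v->xmax); /* deleter invisible: still see it */
    }

    int main(void) {
        struct row_version foo = { 20, 25, "FOO" };          /* xmax set by txn#25 */
        struct row_version bar = { 25, INVALID_XID, "BAR" }; /* created by txn#25 */

        clog_status[20] = COMMITTED;

        clog_status[25] = IN_PROGRESS;  /* txn#25 running (or rolled back) */
        printf("FOO=%d BAR=%d\n", row_is_visible(&foo), row_is_visible(&bar)); /* 1 0 */

        clog_status[25] = COMMITTED;    /* txn#25 committed */
        printf("FOO=%d BAR=%d\n", row_is_visible(&foo), row_is_visible(&bar)); /* 0 1 */
        return 0;
    }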

While txn#25 is still in progress (again, this can be read from pg_clog), any other transaction that wants to update the row will wait until txn#25 completes, by trying to take a shared lock on the transaction ID. I'm highlighting this point because it's why PostgreSQL doesn't usually have "row locks" as such, only transaction locks: there is no in-memory lock for each row changed. (Locking via select ... for update is done by setting xmax, plus a flag indicating that this xmax means locking rather than deletion.)

Oracle... does something somewhat similar, but my knowledge of the details is much hazier. In Oracle each transaction is issued a System Change Number, and that is recorded in the top of each block. When a block changes, its original contents are put in the undo space, with the new block pointing at the old one: so you essentially have a linked list of versions of block N, with the latest version in the data file and progressively older versions in the undo tablespace. And at the top of the block is a list of the "interested transactions", which somehow implements locking (again without an in-memory lock for each row changed), and I can't remember the details beyond that.

SQL Server's snapshot isolation mechanism I believe is largely similar to Oracle's, using tempdb to store blocks that are being changed rather than a dedicated file.

Hope this rambling answer was useful. It's all from memory so large swathes of misinformation are possible, particularly for the non-postgresql implementations.

Sociometry answered 9/2, 2012 at 14:6 Comment(1)
Actually, I think MS SQL Server, PostgreSQL and Firebird/Interbase are cousins regarding how they implement MVCC. The basic technique was pioneered by Jim Starkey for what was to be eventually released as "Interbase", and is fundamentally row-oriented. Oracle uses a different (more "page oriented") algorithm, which IIRC is patented, thus keeping the competition at bay.Chronometer

A high-level overview for Oracle:

Each Oracle session is unique, and each session can have 1* active transaction. When a transaction begins, Oracle assigns it a monotonically increasing System Change Number (SCN). As Oracle updates/inserts/deletes rows, it locks the rows of interest in the table and supporting indexes by updating the headers of the blocks being written, as well as saving the "original" blocks to Oracle's rollback (undo) space. Oracle also writes redo log entries to a memory buffer describing the changes being made to the table and index blocks as well as to the undo blocks. Note that these changes are being made in memory, not directly to disk.

When committing, Oracle ensures that the entire log buffer, up to and including the SCN for the transaction, has been written to disk before returning control to the client.

When rolling back, Oracle uses the information in the rollback (undo) space to reverse the changes made.
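
At the I/O level, then, the commit itself reduces to: append the commit record, force the log buffer to disk, and report success only if those calls succeeded. A hedged sketch of that step, with invented names (in real Oracle a dedicated log writer process, LGWR, does this as a group-commit of the shared redo buffer):

    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char   redo_buffer[1 << 16]; /* in-memory redo entries; the commit
                                           record for this SCN is the last one */
    static size_t redo_used;

    /* Returns 0 only once every redo entry up to and including this
       transaction's commit record is known to be on stable storage. */
    int commit_transaction(int log_fd) {
        if (write(log_fd, redo_buffer, redo_used) != (ssize_t)redo_used)
            return -1;            /* the write itself can fail: not durable */
        if (fsync(log_fd) != 0)   /* force it through the OS page cache */
            return -1;
        redo_used = 0;
        return 0;                 /* only now tell the client "committed" */
    }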

So how does this implement ACID:

Atomicity: My session, my transaction, it all goes or none of it goes. When I commit, I can't do anything until the commit finishes.

Consistency: Oracle checks that dates are dates, character data is character data, and numbers are valid. The same goes for check constraints. Foreign key constraints rely on checking that the parent key being referenced is valid - and has not been updated or deleted by a transaction in flight. If the parent key has been updated or deleted, your statement hangs - it's in limbo, really - waiting for the statement affecting the parent record to commit or roll back.

Isolation: Remember those system change numbers? If you're not making changes, Oracle knows what the SCN was at the time you started a statement or declared a cursor. So if you have a long-running statement where the data is changing out from under you, Oracle checks to get the data AS IT WAS COMMITTED when your statement started running. It's multi-version concurrency control, and it's pretty complicated. Oracle doesn't implement all of the isolation levels called for by the various SQL standards - for example, Oracle never permits dirty reads (there is no READ UNCOMMITTED level).

Durability: The redo log buffer being flushed to disk is the root level of durability. When a redo log file is full, Oracle forces a checkpoint. This process causes Oracle to write all modified table and index blocks out of memory to disk, regardless of whether they've been committed. If the instance crashes and the datafiles contain uncommitted changes, Oracle uses the redo logs to roll those changes back, since the undo information is contained in the redo log as well.
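
The recovery half of that can be sketched the same way (structures and function names invented; real recovery works in terms of block images, SCNs and checkpoints): roll everything in the redo log forward, then roll back any transaction that never got a commit record to disk.

    #include <stdbool.h>

    #define MAX_TXN 100

    struct redo_entry {
        int  txn_id;
        bool is_commit;  /* a commit marker, vs. a logged block change */
        /* ...before/after block images would live here... */
    };

    /* Stand-ins for applying block images; bodies omitted in this sketch. */
    void apply_after_image(const struct redo_entry *e);
    void apply_before_image(const struct redo_entry *e);

    void recover(const struct redo_entry *log, int n) {
        bool committed[MAX_TXN] = { false };

        /* Pass 1 (roll forward): reapply every logged change, committed
           or not, bringing the datafiles to their state as of the crash. */
        for (int i = 0; i < n; i++) {
            if (log[i].is_commit)
                committed[log[i].txn_id] = true;
            else
                apply_after_image(&log[i]);
        }

        /* Pass 2 (roll back): a transaction with no commit record on
           disk never committed; restore its before images - the undo
           data, which is itself protected by the redo log. */
        for (int i = n - 1; i >= 0; i--)
            if (!log[i].is_commit && !committed[log[i].txn_id])
                apply_before_image(&log[i]);
    }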

*Ignore autonomous transactions for the moment, because they are a severe complication.

Mucker answered 9/2, 2012 at 17:46 Comment(2)
"When committing, Oracle ensures that the entire log buffer up to and including the SCN for the transaction has been written to disk..." How does it do that? This is the very core of my question - does it start writing the contents of the log buffer and SCN to disk, then do one last little thing like set a bit on disk from 0 to 1 to say that it all worked okay?Neuro
It uses fwrite and fflush calls (I believe the Oracle kernel's still in C) and relies on the status returned from those calls to know whether it worked or not. At some point Oracle has to trust that the OS did what it says it did.Mucker

Ayende suggested to me on Twitter that I look at Munin, the actual data storage mechanism he's using for RavenDB and Raven MQ.

Neuro answered 9/2, 2012 at 15:5 Comment(0)
