How to version control a record in a database [closed]

207

Let's say I have a record in the database that both admins and normal users can update.

Can anyone suggest a good approach/architecture for version controlling every change in this table, so that it's possible to roll a record back to a previous revision?

Wolframite answered 27/11, 2008 at 6:39 Comment(1)
See stackoverflow.com/questions/125877.Hemielytron
195

Let's say you have a FOO table that admins and users can update. Most of the time you can write queries against the FOO table. Happy days.

Then, I would create a FOO_HISTORY table. This has all the columns of the FOO table. The primary key is the same as FOO plus a RevisionNumber column. There is a foreign key from FOO_HISTORY to FOO. You might also add columns related to the revision, such as the UserId and RevisionDate. Populate the RevisionNumbers in an ever-increasing fashion across all the *_HISTORY tables (e.g. from an Oracle sequence or equivalent). Do not rely on there only being one change in a second (i.e. do not put RevisionDate into the primary key).

Now, every time you update FOO, just before you do the update you insert the old values into FOO_HISTORY. You do this at some fundamental level in your design so that programmers can't accidentally miss this step.
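
A minimal sketch of what this can look like (the column names other than the key and RevisionNumber are illustrative, and the sequence/bind-parameter syntax varies by database):

-- History table: all the columns of FOO plus revision metadata
CREATE TABLE FOO_HISTORY (
    FOO_ID          INTEGER       NOT NULL,   -- same key as FOO
    REVISION_NUMBER BIGINT        NOT NULL,   -- from a shared sequence
    NAME            VARCHAR(100),             -- ...and every other FOO column
    REVISION_DATE   TIMESTAMP     NOT NULL,
    REVISION_USER   VARCHAR(30)   NOT NULL,
    PRIMARY KEY (FOO_ID, REVISION_NUMBER),
    FOREIGN KEY (FOO_ID) REFERENCES FOO (FOO_ID)
);

-- Just before updating FOO, copy the current values into the history
INSERT INTO FOO_HISTORY (FOO_ID, REVISION_NUMBER, NAME, REVISION_DATE, REVISION_USER)
SELECT FOO_ID,
       NEXT VALUE FOR REVISION_SEQ,  -- REVISION_SEQ.NEXTVAL in Oracle, nextval('revision_seq') in PostgreSQL
       NAME,
       CURRENT_TIMESTAMP,
       :current_user
FROM FOO
WHERE FOO_ID = :id;

UPDATE FOO SET NAME = :new_name WHERE FOO_ID = :id;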

If you want to delete a row from FOO you have some choices. Either cascade and delete all the history, or perform a logical delete by flagging FOO as deleted.

This solution is good when you are largely interested in the current values and only occasionally in the history. If you always need the history then you can put effective start and end dates and keep all the records in FOO itself. Every query then needs to check those dates.

Xeniaxeno answered 27/11, 2008 at 7:15 Comment(8)
You can do the audit table updating with database triggers if your data access layer doesn't directly support it. Also, it's not hard to build a code generator that uses introspection of the system data dictionary to produce those triggers.Unvoice
I would recommend that you actually insert the new data, not the previous, so the history table has all of the data. Although it stores redundant data, it eliminates the special cases required to deal with searching both tables when historical data is required.Foreigner
@Foreigner - I meant insert all the current values into the history table, and then update the main table with new values.Xeniaxeno
Personally I'd recommend not deleting anything (defer this to a specific housekeeping activity) and have an "action type" column to specify whether it is insert/update/delete. For a delete you copy the row as normal, but put "delete" in the action type column.Lille
If the HISTORY table includes the current values, why not just define a view which shows only the most-current-version from the HISTORY table? Would that cause performance problems vs. maintaining a separate table?Patrica
@Patrica A table holding the current values will perform better than a view of the historic table. You might also want to define foreign keys referencing the current values.Xeniaxeno
"There is a foreign key from FOO_HISTORY to FOO": bad idea. I would like to delete records from FOO without changing the history; the history table should be insert-only in normal use.Consume
Will this not result in massively ballooning your data size?Angadresma
51

I think you are looking for versioning the content of database records (as StackOverflow does when someone edits a question/answer). A good starting point might be looking at some database model that uses revision tracking.

The best example that comes to mind is MediaWiki, the Wikipedia engine. Compare the database diagram here, particularly the revision table.

Depending on what technologies you're using, you'll have to find some good diff/merge algorithms.

Check this question if it's for .NET.

Kazue answered 27/11, 2008 at 6:59 Comment(0)
38

In the BI world, you could accomplish this by adding a startDate and endDate to the table you want to version. When you insert the first record into the table, the startDate is populated, but the endDate is null. When you insert the second record, you also update the endDate of the first record with the startDate of the second record.

When you want to view the current record, you select the one where endDate is null.

This is sometimes called a type 2 Slowly Changing Dimension. See also TupleVersioning
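
A rough sketch of that pattern (table and column names are illustrative):

-- the current version is the row whose end_date is still NULL
SELECT *
FROM foo
WHERE business_key = :key
  AND end_date IS NULL;

-- on change: close off the current row, then insert the new version
UPDATE foo
SET end_date = :now
WHERE business_key = :key
  AND end_date IS NULL;

INSERT INTO foo (business_key, some_value, start_date, end_date)
VALUES (:key, :new_value, :now, NULL);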

William answered 27/11, 2008 at 7:9 Comment(7)
Won't my table grow quite large using this approach?Wolframite
Yes, but you can deal with that by indexing and/or partitioning the table. Also, there will only be a small handful of large tables. Most will be much smaller.Unvoice
If I'm not mistaken, the only downside here is that it limits changes to once per second, correct?Lauree
@pimbrouwers yes, it ultimately depends on the precision of the fields and the function that populates them.William
One problem is that this architecture defeats basic advantages of an RDBMS, like referential integrity and duplicate prevention.Pigskin
@Pigskin A generated ID column (that is not part of the data) can be used as the primary key for reference in other tables. UNIQUE and/or CHECK constraints can preserve the natural integrity of the data, while triggers can ensure the changes always move forward in time without gaps.William
I've used this solution. The problem was keeping relationships with other tables, so I set up a primary key "foo_id" with autoincrement and another (non primary) "foo_ref" with a value that is not incremented (I saw the '_ref' suffix in some book, but now I don't know if it's a good naming). Another approach I used is to use a composite primary key with a foo_id and a counter. There's no limit in change rate. Some versions may have the same timestamp, but you can still sort by primary key and check if endDate is null to get the last one.Jellyfish
10

Upgrade to SQL Server 2008.

Try using Change Tracking in SQL Server 2008. Instead of timestamp and tombstone-column hacks, you can use this new feature to track changes to the data in your database.

MSDN SQL 2008 Change Tracking
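
Roughly, enabling it looks like this (the database and table names are placeholders, and the table must have a primary key). Note that Change Tracking records which rows changed; the related Change Data Capture feature is the one that captures the actual values:

-- enable change tracking at the database level
ALTER DATABASE MyDatabase
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

-- enable it for a specific table
ALTER TABLE dbo.Foo
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- ask which rows changed since a previously saved sync version
DECLARE @last_sync_version BIGINT = 0;  -- normally saved from CHANGE_TRACKING_CURRENT_VERSION() at the last sync
SELECT ct.*
FROM CHANGETABLE(CHANGES dbo.Foo, @last_sync_version) AS ct;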

Lozenge answered 28/11, 2008 at 0:36 Comment(0)
9

Two options:

  1. Have a history table - insert the old data into this history table whenever the original is updated.
  2. Audit table - store the before and after values (just for the modified columns) in an audit table, along with other information such as who made the update and when.
Alecalecia answered 27/11, 2008 at 9:4 Comment(0)
9

Just wanted to add that one good solution to this problem is to use a Temporal database. Many database vendors offer this feature either out of the box or via an extension. I've successfully used the temporal table extension with PostgreSQL but others have it too. Whenever you update a record in the database, the database holds on to the previous version of that record too.
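
For example, with the temporal_tables extension for PostgreSQL the setup looks roughly like this (table names are illustrative and the exact API may differ by version):

CREATE EXTENSION temporal_tables;

-- the period during which each row version is/was current
ALTER TABLE foo ADD COLUMN sys_period tstzrange
    NOT NULL DEFAULT tstzrange(current_timestamp, null);

-- history table with the same structure as foo
CREATE TABLE foo_history (LIKE foo);

-- the extension's trigger moves superseded versions into foo_history
CREATE TRIGGER foo_versioning
BEFORE INSERT OR UPDATE OR DELETE ON foo
FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'foo_history', true);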

Guesswarp answered 10/12, 2015 at 3:29 Comment(0)
6

You can perform auditing on a SQL table via SQL triggers. From a trigger you can access two special tables, inserted and deleted, which contain the exact rows that were inserted or deleted each time the table is modified. In the trigger SQL you can take these modified rows and insert them into the audit table. This approach makes the auditing transparent to the programmer, requiring no effort or implementation knowledge from them.

An added bonus of this approach is that the auditing happens regardless of whether the SQL operation took place via your data access DLLs or via a manual SQL query, as the auditing is performed on the server itself.
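
In SQL Server terms, a minimal sketch might look like this (the Foo and Foo_Audit tables and their columns are assumed for illustration):

CREATE TRIGGER trg_foo_audit
ON dbo.Foo
AFTER UPDATE
AS
BEGIN
    -- "deleted" holds the rows as they were before the update,
    -- "inserted" holds them as they are afterwards
    INSERT INTO dbo.Foo_Audit (FooId, OldName, NewName, ChangedAt, ChangedBy)
    SELECT d.FooId, d.Name, i.Name, SYSUTCDATETIME(), SUSER_SNAME()
    FROM deleted d
    JOIN inserted i ON i.FooId = d.FooId;
END;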

Couperin answered 27/11, 2008 at 12:40 Comment(0)
5

Alok suggested an audit table above; I would like to explain it further in my post.

I adopted this schema-less, single-table design in my project.

Schema:

  • id - INTEGER AUTO INCREMENT
  • username - STRING
  • tablename - STRING
  • oldvalue - TEXT / JSON
  • newvalue - TEXT / JSON
  • createdon - DATETIME

This table can hold historical records for every table all in one place, with the complete old and new state of the object captured in each record. It can be populated using triggers / hooks wherever data changes, storing the old and new value snapshots of the target row.
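
A PostgreSQL-flavoured sketch of such a trigger (assuming the table above is named audit_log; the JSON functions and trigger syntax vary by database, and the foo table is just an example):

CREATE OR REPLACE FUNCTION audit_row_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (username, tablename, oldvalue, newvalue, createdon)
    VALUES (current_user, TG_TABLE_NAME, to_jsonb(OLD), to_jsonb(NEW), now());
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- attach the same function to every table whose changes you want to keep
CREATE TRIGGER foo_audit
AFTER UPDATE ON foo
FOR EACH ROW EXECUTE FUNCTION audit_row_change();  -- EXECUTE PROCEDURE on PostgreSQL < 11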

Pros with this design:

  • Fewer tables to manage for history tracking.
  • Stores a full snapshot of each row's old and new state.
  • Easy to search the history of each table.
  • Can be partitioned by table.
  • Can define a data retention policy per table.

Cons with this design:

  • Data size can grow large if the system has frequent changes.
Zildjian answered 3/8, 2018 at 18:36 Comment(1)
I came up with the same approach. I believe you are not maintaining individual fields here, i.e. versioning based on fields, but rather taking into account a record's state change in general.Bergschrund
4

You don't say which database, and I don't see it in the post tags. If it's Oracle, I can recommend the approach built into Oracle Designer: use journal tables. If it's any other database, well, I basically recommend the same approach too...

The way it works, in case you want to replicate it in another DB or just want to understand it, is that for each table a shadow table is created too. It is just a normal database table with the same field specs, plus some extra fields: what action was last taken (a string, with typical values "INS" for insert, "UPD" for update and "DEL" for delete), a datetime for when the action took place, and the user id of who did it.

Through triggers, every action on any row in the table inserts a new row into the journal table with the new values, the action taken, when it happened, and by which user. You don't ever delete any rows (at least not for the last few months). Yes, it will grow big, easily to millions of rows, but you can easily track the value of any record at any point in time since journaling started (or since the old journal rows were last purged), and who made the last change.
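
A trimmed-down example of such a journaling trigger in Oracle-style PL/SQL (the foo columns and the foo_jn journal table are placeholders):

CREATE OR REPLACE TRIGGER foo_jn_trg
AFTER INSERT OR UPDATE OR DELETE ON foo
FOR EACH ROW
DECLARE
  v_op VARCHAR2(3);
BEGIN
  IF INSERTING THEN
    v_op := 'INS';
  ELSIF UPDATING THEN
    v_op := 'UPD';
  ELSE
    v_op := 'DEL';
  END IF;

  -- copy the new values (or the old ones when deleting) plus the journal metadata
  INSERT INTO foo_jn (id, name, jn_operation, jn_datetime, jn_user)
  VALUES (COALESCE(:NEW.id, :OLD.id),
          COALESCE(:NEW.name, :OLD.name),
          v_op, SYSDATE, USER);
END;
/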

In Oracle everything you need is generated automatically as SQL code; all you have to do is compile/run it, and it comes with a basic CRUD application (actually only the "R") to inspect the journal.

Argentite answered 27/11, 2008 at 8:10 Comment(0)
3

I am also doing the same thing. I am building a database for lesson plans. These plans need to be versioned at the level of each atomic change. In other words, each change to a lesson plan, no matter how small, needs to be allowed, but the old version needs to be kept intact as well. That way, lesson creators can edit lesson plans while students are using them.

The way it would work is that once a student has done a lesson, their results are attached to the version they completed. If a change is made later, their results will still point to their version.

In this way, if a lesson criterion is deleted or moved, their results won't change.

The way I am currently doing this is by handling all the data in one table. Normally I would just have one id field, but with this system, I am using an id and a sub_id. The sub_id always stays with the row, through updates and deletes. The id is auto-incremented. The lesson plan software will link to the newest sub_id. The student results will link to the id. I have also included a timestamp for tracking when changes happened, but it isn't necessary to handle the versioning.
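
For illustration, finding the newest version of a lesson under this scheme might look like the following (table and column names are assumed):

-- sub_id identifies the logical lesson; id grows with every new version
SELECT *
FROM lesson_plan
WHERE sub_id = :sub_id
  AND id = (SELECT MAX(id) FROM lesson_plan WHERE sub_id = :sub_id);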

One thing I might change, once I've tested it, is that I might use the previously mentioned endDate-is-null idea. In my system, to find the newest version, I have to find the max(id); the other system just looks for endDate = null. I'm not sure whether the benefits outweigh having another date field.

My two cents.

Castigate answered 22/10, 2012 at 23:55 Comment(0)
2

While @WW.'s answer is a good one, another way is to add a version column and keep all your versions in the same table.

For the one-table approach you either:

  • Use a flag to indicate the latest, à la WordPress, or
  • Do a (somewhat nasty) greater-than outer join on the revision number.

Example SQL for the outer-join method using revision numbers:

SELECT tc.*
FROM text_content tc
LEFT OUTER JOIN text_content mc ON tc.path = mc.path
AND mc.revision > tc.revision
WHERE mc.revision is NULL 
AND tc.path = '/stuff' -- path in this case is our natural id.

The bad news is that the above requires an outer join, and outer joins can be slow. The good news is that creating new entries is theoretically cheaper, because you can do it in one write operation without transactions (assuming your database is atomic).

An example making a new revision for '/stuff' might be:

INSERT INTO text_content (id, path, data, revision, revision_comment, enabled, create_time, update_time)
(
SELECT
(md5(random()::text)) -- {id}
, tc.path
, 'NEW' -- {data}
, (tc.revision + 1)
, 'UPDATE' -- {comment}
, 't' -- {enabled}
, tc.create_time
, now() 
FROM text_content tc
LEFT OUTER JOIN text_content mc ON tc.path = mc.path
AND mc.revision > tc.revision
WHERE mc.revision is NULL 
AND tc.path = '/stuff' -- {path}
)

We insert by using the old data. This is particularly useful if, say, you only want to update one column and want to avoid optimistic locking and/or transactions.

The flag approach and the history table approach both require two rows to be inserted/updated.

The other advantage of the outer-join revision number approach is that you can always refactor to the multiple-table approach later with triggers, because your trigger would essentially do something like the above.

Expressionism answered 28/11, 2012 at 2:25 Comment(0)
0

As an additional step to the answers above, I would suggest giving each generated change a unique ID, likely something built from the date/time plus a unique counter per day (so that multiple updates in the same second don't overlap). I would include an action-type code within this ID, e.g. "9129128213939REPLACE". This adds robustness, allowing you to sanity-check that the rest of your history mechanism is working correctly.

Angadresma answered 1/11, 2020 at 16:37 Comment(0)
