Design of snapshots in a transactional database along with versioning of reference data

Disclaimer: I have read everything I could find on the topic of snapshots and versioning, both on Stack Overflow and on the wider internet. My requirement is not version tracking for an audit trail, nor database-level snapshots. I have spent more than a week researching on my own and thinking through the possible options. I may have missed some links - if the solution to my problem is already discussed in some other thread, please point me there.

It is a bit long; please bear with me.

Here is the situation: We are trying to create a generic design to store snapshots of the transactional data in our transactional database and also to keep a revision history of reference data.

As part of the business process, a user can press a button to publish a certain object. For the purpose of illustration, let us say that the user can publish a proposal from the vendor before negotiation starts. Then, at different points in time through the negotiation process, the user can publish the proposal data. The proposal contains a budget, sales targets and many other items. When a proposal is snapshotted, all the linked entities have to be snapshotted. Finally, after the negotiation, a contract is signed. At this point, a complete snapshot of the contract has to be created. Not all the entities in the contract are present in the proposal – there are a lot of overlapping entities, but there are also entities unique to the proposal and to the contract.

We have to keep both these published versions and the latest active versions available. Published versions are made available on a website to be referenced by both vendors and the management team. Not all published versions are made available on the website, but the last published proposal and the latest published contract are always available there. This website also has to be populated from the same database.

Also, a finance user can decide to snapshot the budget alone, and a sales manager can snapshot just the sales targets. So, snapshots are available at multiple granularities.

We also have a requirement to track versions of the master data. It is a business requirement to track all changes to key master data columns over time. For example, we have region information associated with the sales targets. The name of a region can change, and we want to track these changes. Let us assume that at the time of the proposal, the region’s name is R1 and a snapshot is created. Then the name of the region changes to R2, and 2 other snapshots are created. We want to be able to link the sales targets to the correct region name at those points in time, not necessarily to the latest region name.
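To make the region example concrete, here is a minimal sketch of one common way to version reference data with validity date ranges; the table and column names (RegionVersion, ValidFrom, ValidTo) are hypothetical, not part of our actual schema:

    -- Each change to a region's name closes the current row and opens a new one.
    CREATE TABLE RegionVersion (
        RegionVersionID int IDENTITY(1,1) PRIMARY KEY,
        RegionID        int           NOT NULL, -- stable business key
        RegionName      nvarchar(100) NOT NULL, -- e.g. 'R1', later 'R2'
        ValidFrom       datetime      NOT NULL,
        ValidTo         datetime      NULL      -- NULL = current version
    );

    -- Resolve the region name that was in effect when a given snapshot was taken:
    DECLARE @SnapshotDate datetime = '2011-03-22';
    SELECT rv.RegionName
    FROM   RegionVersion rv
    WHERE  rv.RegionID = 42
      AND  rv.ValidFrom <= @SnapshotDate
      AND  (rv.ValidTo IS NULL OR rv.ValidTo > @SnapshotDate);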

We have some flexibility in modelling as we have both a transaction DB and a data warehouse DB and we can decide to store some of this information either in the transaction DB or in the data warehouse DB.

Here is our design. We have a Publication table which captures basic information about the published data – who published it, when, the reason, and the type of object published (proposal, budget or sales targets).

We store the snapshots in the same table as the original data. So, proposal snapshots would be stored with live proposals in the proposals table. We have a column called Publication ID in every table whose rows can be published. This column is a foreign key to the Publication table. If the Publication ID is null, that record is the active version.
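Here is a minimal sketch of this design (names illustrative, simplified to one object type):

    CREATE TABLE Publication (
        PublicationID int IDENTITY(1,1) PRIMARY KEY,
        PublishedBy   nvarchar(100) NOT NULL,
        PublishedOn   datetime      NOT NULL,
        Reason        nvarchar(500) NULL,
        ObjectType    varchar(50)   NOT NULL -- 'Proposal', 'Budget', 'SalesTargets'
    );

    CREATE TABLE Proposal (
        ProposalID     int IDENTITY(1,1) PRIMARY KEY,
        ProposalNumber int           NOT NULL, -- business key shared by all versions
        Title          nvarchar(200) NOT NULL,
        PublicationID  int           NULL     -- NULL = the live/active row
            REFERENCES Publication (PublicationID)
    );

    -- The active version of a proposal:
    SELECT * FROM Proposal WHERE ProposalNumber = 1001 AND PublicationID IS NULL;

    -- The most recently published version (e.g. for the website):
    SELECT TOP (1) p.*
    FROM   Proposal p
    JOIN   Publication pub ON pub.PublicationID = p.PublicationID
    WHERE  p.ProposalNumber = 1001
    ORDER  BY pub.PublishedOn DESC;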

I realize that the post is already lengthy. Hence, rather than listing all the scenario details, I have summarized the design considerations in a mind map: Snapshot Design Considerations.

Now there are 2 solutions that we are leaning towards - both would store a snapshot of all the data, whether it has changed or not. Maintaining only the deltas while keeping the table structures intact would necessitate a very complex stored procedure that would have to run on every insert/update of any of the snapshotted objects. I do not want to go down this route, as it would take longer to build and the volumes are not that huge anyway.

Solution 1: Each time an object is published (like a proposal or budget), we would populate an XML tree and persist it in the database. Only the latest version needs to be available on the website, and old versions are rarely needed. Given this, would I run into big performance issues from using XML? We use SQL Server. The data volumes are not that huge.
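To illustrate, this is roughly what we are considering - a sketch only, with illustrative names (the Budget child table and its columns are assumed for the example), using SQL Server's xml type and FOR XML to serialize a proposal at publish time:

    CREATE TABLE PublicationSnapshot (
        PublicationID int PRIMARY KEY REFERENCES Publication (PublicationID),
        SnapshotXml   xml NOT NULL
    );

    DECLARE @ProposalID int = 1001, @PublicationID int = 7;

    -- Serialize the proposal and its child budget rows into one XML document.
    INSERT INTO PublicationSnapshot (PublicationID, SnapshotXml)
    SELECT @PublicationID,
           (SELECT p.ProposalID, p.Title,
                   (SELECT b.BudgetLine, b.Amount
                    FROM   Budget b
                    WHERE  b.ProposalID = p.ProposalID
                    FOR XML PATH('BudgetItem'), TYPE) AS BudgetItems
            FROM   Proposal p
            WHERE  p.ProposalID = @ProposalID
            FOR XML PATH('Proposal'), TYPE);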

Solution 2: All the transaction tables would have a publication ID, and the reference data would have start and end dates. Whenever an object is published, we would make a copy of all the transaction records and stamp them with the publication ID, and we would copy all the reference data records and set the snapshot date as their end date. This would also give us normal versioning for reference data outside of the publication process.
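A sketch of the publish step for this solution, reusing the illustrative tables from above (copies of Budget, SalesTarget and other linked tables are elided):

    CREATE PROCEDURE PublishProposal
        @ProposalNumber int,
        @PublishedBy    nvarchar(100),
        @Reason         nvarchar(500)
    AS
    BEGIN
        DECLARE @PublicationID int, @SnapshotDate datetime = GETDATE();

        INSERT INTO Publication (PublishedBy, PublishedOn, Reason, ObjectType)
        VALUES (@PublishedBy, @SnapshotDate, @Reason, 'Proposal');
        SET @PublicationID = SCOPE_IDENTITY();

        -- Copy the active transaction rows as frozen snapshot rows.
        INSERT INTO Proposal (ProposalNumber, Title, PublicationID)
        SELECT ProposalNumber, Title, @PublicationID
        FROM   Proposal
        WHERE  ProposalNumber = @ProposalNumber AND PublicationID IS NULL;

        -- Freeze the reference data: copy the current rows with the snapshot
        -- date as their end date (the live rows stay open-ended).
        INSERT INTO RegionVersion (RegionID, RegionName, ValidFrom, ValidTo)
        SELECT RegionID, RegionName, ValidFrom, @SnapshotDate
        FROM   RegionVersion
        WHERE  ValidTo IS NULL;

        -- ... repeat the copy for the other snapshotted tables ...
    END;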

I would appreciate opinions from experienced minds here on the drawbacks of these 2 approaches and on whether there is a better alternative.

Bioecology answered 22/3, 2011 at 17:3 Comment(4)
I wonder, have you looked at temporal databases? – Honeybunch
Some questions: Does your data warehouse have any time/calendar-based dimensions? Is it a requirement that your snapshots be stored in the same table as the original data? – Chyack
Hi Chris, yes, the DW schema (which is separate from the schema that we are designing for the transactional DB) has a time dimension. There is no requirement to store the snapshots in the same table - we can make the decision based on what works best. – Bioecology
Hi Angelom, I have taken a brief look at temporal databases. It seems like a scholarly subject with a lot of research behind it. I browsed the links from the Stack Overflow post relating to temporal databases. – Bioecology

My approach would be to opt for solution 2. Taking your design considerations in order:

  1. I would store a copy of everything in the snapshot. If you store only the changes, you give yourself the problem of also keeping details of the process needed to reconstruct the desired snapshot from those changes. Initially this is not an issue, but as schemas, programs, and processes change, you will have to maintain details of how to retrieve your desired snapshot via a process that has itself changed. Doable, but potentially fragile.

  2. I would go for an option not mentioned in your diagram, though sketched out in your description of solution 2. This is using a schema very similar to that of the transaction DB, but extended to include the information specific to the snapshots. You mention publication ID as a foreign key, and dates for the reference data. You might find you need additional information, such as dates related to the transaction data.

  3. The same schema will not do - you have pointed out (publication ID) that the unmodified schema is not adequate. Nothing in your post suggests you need to adopt a different schema optimised for reading. Even if this proves to be required, it is something that can be incorporated at a later stage, with the current, extended schema as a starting point. I do not have much experience with XML trees, but would ask "why introduce another technology when you have alternatives that can utilise your existing infrastructure?" Any advantage you perceive from this approach would have to be very significant to warrant throwing away the leverage of your existing architecture. Similar considerations apply to a denormalised DB. Why go there until there is a demonstrated need to do so?

  4. Again I would adopt the approach of tracking versioning and snapshots separately. You give a primary benefit of this approach in your solution 2. I would make the snapshotting of the reference data part of the snapshotting process, rather than the versioning process (i.e. when a snapshot is taken, ensure the appropriate reference table rows form part of the snapshot - see the sketch after this list). It seems from your description that you have two different requirements that happen to utilise the same data - the snapshotting and the versioning. There seems to be little dependency between them, so you should keep them as independent as possible - loose coupling.

  5. You make mention of potentially using the data warehouse as storage, though not specifically in your solutions. If your volumes are, as you suggest, low, then I would have thought that a separate database was adequate. You give the impression that the volumes of both data and users for the snapshots are low, so there would not seem to be a prima facie case for using the data warehouse. At the same time, the warehouse does have mechanisms for storing exactly this type of historic data, to be used for reading and analysis.
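To sketch what point 4 might look like in practice (hypothetical names, reusing the kind of tables sketched in the question), the snapshot step could simply record which reference data versions were current at that moment, leaving the versioning process itself untouched:

    -- Bridge table tying a snapshot to the reference versions in effect.
    CREATE TABLE PublicationRegionVersion (
        PublicationID   int NOT NULL REFERENCES Publication (PublicationID),
        RegionVersionID int NOT NULL REFERENCES RegionVersion (RegionVersionID),
        PRIMARY KEY (PublicationID, RegionVersionID)
    );

    -- Run as part of the snapshot step, not on every reference data change:
    DECLARE @PublicationID int = 7; -- the snapshot just created
    INSERT INTO PublicationRegionVersion (PublicationID, RegionVersionID)
    SELECT @PublicationID, rv.RegionVersionID
    FROM   RegionVersion rv
    WHERE  rv.ValidTo IS NULL; -- the versions current right now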

I am sorry that I have not directly answered your questions here - but I hope this provides some pointers and another view on your stated situation.

Chyack answered 13/4, 2011 at 12:52 Comment(2)
First of all, thanks a lot for the very thoughtful pointers. Definitely very helpful, and your effort is really appreciated. I agree with your point no. 1 - a brilliant point about process changes that I did not envisage. I have decided to go with snapshotting using the same schema, adding just the snapshot ID and capturing items related to the snapshot in the Snapshot table (my approach 2). So far, I do not have a need to capture additional details specific to each table in the snapshot. – Bioecology
I also see the insight of point 4 - yes, there is very little dependency between versioning of reference data and snapshotting of reference data, so I have decided to do them separately. I had taken the decisions that you hinted at earlier - having a response from someone confirming my decisions provides extra comfort. Thanks a lot again, Chris, for your time. – Bioecology
