Disclaimer: I have read everything that I can read on the topic of snapshots and versioning on both stack overflow and on internet. My requirement is not version tracking for audit trail or the database-level snapshots. I have spent more than 1 week to research on my own and to think through the possible options. Sorry, I could have missed some links - if the solution to my problem is already discussed in some other thread, please point me there.
It is a bit long; please bear with me.
Here is the situation: We are trying to create a generic design to store snapshots of the transactional data in our transactional database and also to keep a revision history of reference data.
As part of the business process, a user can press a button to publish certain object. For the purpose of illustration, let us say that the user can publish a proposal from the vendor before negotiation starts. Then, at different points in time through the negotiation process, the user can publish the proposal data. The proposal contains a budget, sales targets and many other items. When a proposal is snapshotted, all the linked entities have to be snapshotted. Finally, after the negotiation, a contract is signed. At this point, a complete snapshot of the contract has to be created. Not all the entities in the contract are there in the proposal – there are lot of overlapping entities, but there are unique entities attached to proposal and contract.
We have to keep both these published versions and the latest active versions available. Published versions are made available on a website to be referenced by both vendors as well the management team. Not all published versions are made available on the website, but the last published proposal and latest published contract are always available in the website. This website also has to be populated from the same database.
Also, a finance user can decide to snapshot only the budget alone and a sales manager can snapshot the sales targets. So, snapshots are available at multiple granularities.
We also have a requirement to track versions of the master data. It is a business requirement to track all the changes to key master data columns over time. For example, we have region information associated with the sales targets. The name of the region can change and we want to track these changes. Let us assume that at the time of proposal, the region’s name is R1 and a snapshot is created. Then, the name of the region changes to R2 and then 2 other snapshots are created. We want to able to link the sales targets to the correct region name at those points in time, not necessarily to the latest region name.
We have some flexibility in modelling as we have both a transaction DB and a data warehouse DB and we can decide to store some of this information either in the transaction DB or in the data warehouse DB.
Here is our design. We have a Publication table which captures basic information about the published data – who published and the date, the reason, and the type of object published (proposal or budget or sales targets).
We store the snapshots in the same table as the original data. So, proposal snapshots would be stored with live proposals in the proposals table. We have a column called Publication ID in every table which has to be published. This column is a FK to the Publication table. If the Publication ID is null, that record is the active version.
I realized that the post is very lengthy. Hence, rather than listing the scenario details, I thought of quickly summarizing the design considerations in a mind map.
Now there are 2 solutions that we are leaning towards - both would store a snapshot of all the data whether it has changed or not. Maintaining only the delta while keeping table structures intact would necessitate a very complex stored procedure which has to run on every insert/update of any of the snapshotted object. I do not want to go down this route as this would take longer and the volumes are not that huge anyway.
Solution 1: Each time, an object is published (like proposal or budget), we would populate an XML tree and persist this in the database. Only the latest version need to be available in website and old versions are rarely needed. Given this, would I run into big performance issue due to using XML? We use SQL Server. The data volumes are that not huge.
Solution 2: All the transaction tables would have a publication ID and the reference data would have start and end dates. Whenever an object is published, we would make a copy of all transaction records and put the publication ID there and we would copy all of reference data records and put a snapshot date as the end date. This would allow us to have normal versioning for reference data outside of the publication process.
I would need opinions from experienced minds here as to the drawbacks of these 2 approaches and whether there is any other better scenario.