How to model process and status history in a data warehouse?
Asked Answered
E

1

5

Let's say that we have D_PROCESS, D_WORKER and D_STATUS as dimensions, and the fact F_EVENT that links a process (what) with a worker (who's in charge) and the "current" status.

The process status changes over time. Shoud we store in F_EVENT one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker?

I'm new to Datawarehouse and it's hard to find best practices/tutorial related to data modelization.

Eunuchize answered 20/6, 2012 at 7:44 Comment(0)
L
9

Read The Data Warehouse Toolkit by Ralph Kimball for a good introduction to dimensional modeling.

It sounds like you are storing a process change event in F_EVENT. If this process has a defined beginning and end, I would build a snapshot fact table which would let you track the process over time (simply updating the row each time the process moves from one step to another).

EDIT:

I'll try to make a general case using your dimensions as examples.

For D_PROCESS, modeling a "process" isn't usually modeled as a dimension, and you called it a "what", so I'm going to rename this to "D_ACCOUNT".

The basic data model will be for a "tax processing system" in which WORKERS are processing ACCOUNTS, and each ACCOUNT/WORKER combination has several possible "STATUSES" of where this process currently stands.

D_ACCOUNT
    ACCOUNT_NUMBER
    ACCOUNT_TYPE

D_WORKER
    WORKER_ID
    FIRST_NAME
    LAST_NAME
    BADGE_NUMBER
    SHIFT

D_STATUS
    STATUS_ID
    STATUS_NAME

Now if I want to report on all "events" that have happened to an Account, performed by a worker, I can build a Transaction-level fact table F_EVENT:

F_EVENT
    ACCOUNT_ID
    WORKER_ID
    STATUS_ID
    EVENT_TIME_ID
    Metrics taken at time of the measurement (Cost, Worker time spent, etc)

We call the unique combination of dimensions that identifies a row the Granularity or Grain of the fact table.

The grain of this table is Account, Worker, Status, and Time. It answer questions like "How much time did my workers on shift 3 spend processing accounts on Wednesday?" or "How many events occured that changed the processing status to "CLOSED"?

I'm not sure how much this type of table would help.

Instead, say you are interested in tracking the process itself as it moves through various statuses. I'm going to assume that the status always moves forward in time, from "NOT STARTED" to "IN PROCESS" to "CLOSED".

I'll build what Kimball calls an "Accumulating Snapshot Fact table.

F_TAXPROCESSING
    ACCOUNT_ID
    WORKER_ID 
    CURRENT_STATUS_ID
    NOT_STARTED_DTTM
    NOT_STARTED_FLAG
    IN_PROCESS_DTTM
    IN_PROCESS_FLAG
    CLOSED_DTTM
    CLOSED_FLAG

This table's grain is Account, Worker. This table keeps track of the "process" by updating the date/time of the change to the status, and a flag when that status has been reached.

This allows you to track the process over time, allowing you to see how many accounts have reacted the "IN PROCESS" status, how long it took to get there, et cetera.

Lazar answered 20/6, 2012 at 12:56 Comment(7)
Sorry, I mean process not product (in fact it's totally different but I tried to get rid of our business terms). You said "updating" the row, ok, but what if I want to analyze in my cube all the process at status C (from A to D for instance) at a given date?Eunuchize
Data Warehousing is all about the business terms - don't try to "generalize" too much. DW solutions take relational abstracted models and attempt to model them much as the business would use them. I'll try to answer your question in an edit.Lazar
Thanks. But if try to translate our french tax deduction related terms in english words, it will be ugly for you to read :)Eunuchize
Updated the answer. If you have more questions I'll be in the Data Warehouse chat for a bit..Lazar
@n-west If I would have to track quite a lot of interesting intermediate statuses, then how can this be handled?Helm
@Frankin It depends on how many. More than, say, 20, and you probably want to have both an accumulating snapshot to track the "current" state of a item, and a detailed atomic transaction-level fact to keep track of every change to an interesting intermediate status. The SQL gets more complicated depending on the questions that you are asking when you have the each "change" in status represented by a row.Lazar
Note that I have evolved in my thinking and would not necessarily recommend a Kimball-style approach to this. However, as you will hear that from the majority of data warehouse people, I'll leave the answer as is.Lazar

© 2022 - 2024 — McMap. All rights reserved.