Tab Pages

Monday, February 28, 2011

Slowly Changing Dimensions (SCDs)


Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.
For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?
You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.
Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
The most common slowly changing dimensions are Types 1, 2, and 3.
Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.)
Here is an example of a database table that keeps supplier information:
Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
123
ABC
Acme Supply Co
CA
In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
123
ABC
Acme Supply Co
IL
The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain.
If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
Version
123
ABC
Acme Supply Co
CA
0
124
ABC
Acme Supply Co
IL
1
Another popular method for tuple versioning is to add effective date columns.
Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
Start_Date
End_Date
123
ABC
Acme Supply Co
CA
01-Jan-2000
21-Dec-2004
124
ABC
Acme Supply Co
IL
22-Dec-2004

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.
If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.[1]
Type 3
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:
Supplier_Key
Supplier_Code
Supplier_Name
Original_Supplier_State
Effective_Date
Current_Supplier_State
123
ABC
Acme Supply Co
CA
22-Dec-2004
IL
Note that this record can not track all historical changes, such as when a supplier moves twice.
One version of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which will then track the most recent historical change.

No comments:

Post a Comment