« Spreadsheet errors- Fear, uncertainty and doubt | Estimating the cost of Business Intelligence »
Mystery or Junk data warehouse dimensions
Sometimes, when you are designing a star schema model, you’ll find yourself in a dilemma. You’ve come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking akward questions- where is such and such flag? Where’s the transaction type? Why can’t I sort based on the “e7” code from the system?
You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions and your fact table just won’t stand for 100 million rows of the e7 code, and besides computery things like transaction codes should not be in a business savy data model. But face it, after some digging you determine the user is right (happens quite often in fact)- they really do use that information and it is critical that you include it and you don’t have the time or budget to make the perfect data warehouse.
So how do you deliver to them what they need, and avoid messing up your dimensional model?
One answer is to create one or more Junk dimensions, sometimes also referred to as a mystery dimension.
In the end although the content of a mystery dimension may or may not be mysterious, there is nothing particulary mysterious about how to implement this type of dimension table.
Even if its perfectly clear what the column is, there are often a number of them with very low cardinality (that is they have very few distinct values). It really does not make sense to add columns in the fact table for each one, and to have a bunch of tiny dimension tables with only a handful of rows in them.
Faced with this the data architect can wrap all these columns up into a junk dimension.
A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns a unique key. This key is what is stored in the fact table, in the mystery dimension column.
Lets look at a mystery dimension example. We’ll make up and example dimension thats very small for simplicity sake. Lets say that the transactional table that is used to generate one of our facts has three columns “Zortz” “a3” and “uudl” which we fully satisfy our mystery dimension criteria. (i.e. we don’t know what they are, but people use them in queries.)
“Zortz” is a true/false value, “a3” is one of two values “Confirmed” or “Pending” and “uudl” is either “” or “k”. All the possible combinations of these values would be put into a dimension table and assigned an integer surrogate key. Thus the mystery dimension table would look like this:
A key consideration when forming mystery dimensions is how many combinations exist. If the number of combinations is too high the mystery dimensions size may be unmanageable.
And be careful assuming that all the combinations have been used yet. You are safe if the data type has a fixed set of values (like Boolean, or codes from a known finite set) because you can be sure you’ve created a dimension row for every combination.
But if there are free form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system. This might still be worth while, depending on how many new combinations get created.
You can also manage the size of the mystery dimension tables by having 2 or more mystery dimensions, which might reduce the overall number of dimensional rows depending on the makeup of the data. Different columns and values may tend to cluster together and you will find that grouping them correctly makes say, two small mystery dimensions rather than one huge one.
If, however the number of rows is manageable, a mystery dimension allows all the columns to be queriable, while only adding one column to the fact table, and providing a much more efficient solution in comparison to either creating multiple dimensions, or leaving all the data in the fact table.
By moving it to a junk dimension or “mystery” dimension then you’ve got fewer indexes on the fact table which might be important depending on the size.
So if you find yourself telling your end users that they will just have to do without a column, think twice about it. The role of a data warehouse is to deliver the data- sometimes you just have to find the right packaging to get the job done.
« Spreadsheet errors- Fear, uncertainty and doubt | Estimating the cost of Business Intelligence »
Very nice write-up. Thanks for sharing!
George
You have excellent understanding regarding the topic. I will be using your website as reference to present this case. In my org they have numerous flag columns which are being stored in Facts increasing the no of columns by at least 20% as compared to without the flag columns. We haven’t tested the figures yet. So i don’t know what percentage of decrease will happen in fetch time.