Saturday, 23 February 2019

Chasm Trap problem in Data Warehouses

Chasm trap  in data warehouses occurs when two fact tables relate into one dimension table. This is a data modelling problem which will cause double-counting and bad data when these tables are joined. SQL and relational databases surprise me every day!

Star schemata in data warehouses usually have one fact table and many dimension tables where the fact table joins to it's dimensions tables via foreign keys. This is OK.

But what if you wanted to 'share' the dimension across two or more fact tables, use it commonly, slowly starting to create an intertwined galaxy maybe! These multi-fact schemas are also called Fact-Constellations.

For example think of a CUSTOMERS dimension table with the details of the customers and two fact tables SALES and REFUNDS with order and refund transactions as in the data model below.





Preparing sample data to reproduce the Chasm Trap

CUSTOMERS



SALES


REFUNDS




BANG! Chasm Trap spotted in the join of the SALES, REFUNDS and CUSTOMERS tables





What happened? Do you see the duplicates? We joined by the n-1 principle. That is 3 tables 2 join statements on the keys. It seems is impossible to query these two fact tables via their common dimension. So why can't we know how many SALES and REFUNDS a customer did? Strange, why is the query falling into a Cartesian Product when you want to query from two different tables via a common dimension table? Hint: Think of data modeling. 

But before let's try to sort out the chasm trap.


A quick SQL solution to 'Chasm Trap' is a pre-aggregated join of the tables like this:


Or an alternative solution can be to study again the data model and look at superfluous or wrongly assumed data relationships and dependencies. Maybe the dimension table is not meant to be shared. Review of the data model and data warehouse design might be necessary.

Conclusion

If you are going to use one dimension on two fact tables in your Data Warehouse, make sure you don't fall into the Chasm Trap. Or your data will be incorrect and untrustworthy. You must first pre-aggregate and then join to get correct results or look to better your understanding of the model and the relationships. For example maybe the REFUNDS entity is not directly related to CUSTOMERS but it is to SALES?

Once a gain we see how important data modeling and the understanding of the problem the data is trying to solve is.


Further reading:




No comments: