Pages

Saturday, 4 March 2017

Bad bad bad data




One should feel really sorry about anyone who will rely on filtering and making a decision based on bad, bad data. It is going to be a bad decision.

This is serious stuff. I read the other day a recent study by IBM which shows that "Bad Data" costs US $3.1 trillion per year!

OK, let's say you don't mind the money and have money to burn, how about the implications of using the bad data? As the article hints these could be misinformation, wrong calculations, bad products, weak decisions mind you these will be weak/wrong 'data' driven decisions. Opportunities can be lost here.

So why all this badness, isn't it preventable? Can't we not do something about it?

Here are some options

1) Data Cleansing: This is a reactive solution where you clean, disambiguate, correct, change the bad data and put the right values in the database after you find the bad data. This is something you do when is too late and you have bad data already. Better late than never. A rather expensive and very time consuming solution. Nevermind the chance that you can still get it wrong. There are tools out there which you can buy and can help you do data cleansing. These tools will 'de-dupe' and correct the bad data up to a point. Of course data cleansing tools alone are not enough, you will still need those data engineers and data experts who know your data or who can study your data to guide you. Data cleansing is the damage control option. It is a solution hinted in the article as well.

2) Good Database Design: Use constraints! My favourite option. Constraints are key at the design time of the database, do put native database checks and constraints in the database entities and tables to guarantee the entity integrity and referential integrity of your database schema, validate more! Do not just create plain simple database tables, always think of ways to enforce the integrity of your data. Do not rely only on code. Prevent NULLS or empty strings as much as you can at database design time, put unique indexes and logical check constraints inside the database. Use database tools and features you are already paying for in your database license and already available to you, do not re-invent the wheel, validate your data. This way you will prevent the 'creation' of bad data at the source! Take a proactive approach. In projects don't just skip the database constraints and say I will do it in the app or later. You know you will not, chances are you will forget it in most of the cases. Also apps can change, but databases tend to outlast the apps. Look at a primer on how to do Database Design

My 'modus operandi' is option 2, a Good Database Design and data engineering can save you money, a lot of money, don't rush into projects with neglecting or skipping database tasks, engage the data experts, software engineers with the business find out the requirements, talk about them, ask many questions and do data models. Reverse engineer everything in your database, have a look. Know your data! That's the only way to have good, integral and reliable true data, and it will help you and your customers win.