Data quantity is from Mars, data quality is from Venus:

Data quantity is intrinsic to business: as the business grows, data quantity is bound to grow with it. Data quality, however, is extrinsic. It has no regard for the growth of the business or for the systems you put in place. Data quality will not remain stable; it will actually get worse as you add more systems, more master data, and more users.

Unlike data quantity, data quality is difficult to measure:

Data quantity can be measured and quantified in a way that everybody understands. We understand what 8 terabytes of mortgage data means, or data growth of 2 TB per month; we can fathom the kind of beast we are dealing with. However, there is no such measure for data quality. We cannot say that 200 TB out of 2 PB of data is dirty. We cannot say that 2% of all the data that we have is bad.

Not only measuring, but even identifying bad data is difficult:

Quantifying bad data is difficult because identifying bad data is difficult. This reminds me of a story:

Each time Joe Schmo bought a bottle of water, he would throw away 1/5th of the water and only then proceed to drink from the bottle. His perplexed friend asked Joe about this unique behavior, to which Joe proudly replied, ‘I read in USA Today that almost 20% of the bottled water sold in the country is not drinkable, so I throw away 1/5th of the dirty water before drinking!’

Bad data is like that: it is spread all through your data. It is not like a bad apple that you can pick out and throw away.

Fixing Data Quality later is expensive!

Data quality standards should be embedded within every process when the process is defined. As Al identified, Deming spent his life teaching us to ‘build quality into the product’. Deming could teach us a thing or two about data quality, but that is for another post!
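
To make "embedded within every process" a little more concrete, here is a minimal sketch of one way it could look: the rules run at the moment a record enters the system, not in a cleanup job months later. Everything in it is an assumption made for illustration. The MortgageRecord fields, the three rules, and the accept function are hypothetical and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class MortgageRecord:
    # Hypothetical fields, chosen only for illustration.
    loan_id: str
    principal: float
    annual_rate_pct: float


def validate(record: MortgageRecord) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not record.loan_id.strip():
        problems.append("loan_id is empty")
    if record.principal <= 0:
        problems.append("principal must be positive")
    if not 0 <= record.annual_rate_pct <= 100:
        problems.append("annual_rate_pct must be between 0 and 100")
    return problems


def accept(record: MortgageRecord, store: list[MortgageRecord]) -> bool:
    """Store the record only if it passes every rule, so bad data never gets in."""
    if validate(record):
        return False
    store.append(record)
    return True
```

The point of the sketch is the placement of the check, not the particular rules: a record that fails is turned away at the door, where it is cheap to fix, and it never gets mixed into the rest of the data.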


