What’s the true importance of data quality?

Logic20/20
3 min read · Nov 18, 2020

Article by: Peter Baldridge

Professional worriers, arrested development, and a simple experiment

It’s my humble opinion that data scientists are essentially professional worriers. Don’t believe me? Here are a few examples of thoughts that might run through a typical data scientist’s head on any given day:

How accurate is my model?

What does my sample really say about the population?

How clean is the data that I’m working with?

Did I remember to turn off the stove?

Well, maybe not the last one, but you get the idea.

The other day, a few data scientists and I were stuck on a problem that many data scientists face. We had a model that had been running in production for two years, and in all that time we hadn't been able to improve its performance. We had tried everything. Different architectures, fancy embeddings, and the age-old data science quick fix: adding more data. Our stakeholders were beginning to grow restless. We were beginning to give up hope.

At a certain point, a thought started to circulate among us.

What if the limiting factor was the data quality?

If pop data science were to be believed, the only answer to our problems was more data and more layers. But how important is data quality, really?

I wanted to shed some light on the question, so I decided to set up an experiment. What would happen to a model’s performance if I took its underlying data and added more and more noise?

Garbage in, garbage out? — a simple experiment — and some surprising results!

I set up an experiment with the Titanic dataset, an entry-level machine learning benchmark in which passenger profile data (sex, age, ticket class, etc.) is used to predict whether or not a passenger survived the voyage. To mimic data quality issues, I took a random subset of the training data and reversed the outcomes. Passengers who had once perished had now miraculously survived, and vice versa. The goal was to see whether adding noise to the data would confuse the model. I tested varying levels of noise across three common models: logistic regression, random forest, and support vector machine.
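The original article doesn't include code, but the label-flipping step might look something like this minimal sketch, assuming the Titanic features have already been loaded into a pandas DataFrame with a 0/1 "survived" column. The add_label_noise helper and model settings here are illustrative assumptions, not the author's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical helper: flip the 0/1 "survived" label for a random fraction of rows.
def add_label_noise(y: pd.Series, noise_rate: float, rng: np.random.Generator) -> pd.Series:
    y_noisy = y.copy()
    n_flips = int(noise_rate * len(y))
    flip_idx = rng.choice(y.index, size=n_flips, replace=False)
    y_noisy.loc[flip_idx] = 1 - y_noisy.loc[flip_idx]  # survivors "perish" and vice versa
    return y_noisy

# The three models compared in the experiment.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(),
    "support vector machine": SVC(),
}
```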

You’ll notice that neural networks are missing from the analysis. This is largely because I think neural networks are overrated for tasks that don’t involve audio, images, or language. It’s also because neural networks are easy to overfit without large amounts of data.

Next, I calculated baseline accuracy using clean data. Then I took the difference between accuracy on the dirty data and the baseline to calculate accuracy loss. I ran each trial 100 times to get a better estimate.
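A sketch of how that trial loop might look, building on the hypothetical add_label_noise helper above; the 70/30 split, the averaging scheme, and the function name are my assumptions rather than the author's actual code:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_loss(model, X, y, noise_rate, n_trials=100, seed=0):
    """Average (clean accuracy - noisy accuracy) across repeated random trials."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_trials):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        # Baseline: train on the clean labels and score on the held-out set.
        baseline = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        # Dirty run: train on labels with a fraction flipped, score on the same held-out set.
        y_dirty = add_label_noise(y_train, noise_rate, rng)
        dirty = accuracy_score(y_test, model.fit(X_train, y_dirty).predict(X_test))
        losses.append(baseline - dirty)
    return float(np.mean(losses))
```

In this setup, noise is applied only to the training labels, so the held-out test set still measures accuracy against the true outcomes.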

How important is data quality?

According to the results of this experiment, data quality is not as important as I had originally thought. Even at 20% noise, all three models only experienced moderate accuracy loss. Of course, even small losses in accuracy can translate to drastically different results. In fraud prediction, for example, a small loss in model accuracy could be catastrophic.

To find out which models tend to thrive in scenarios of poor data quality, and to see more conclusions from this experiment, please read the full article here: https://www.logic2020.com/insight/tactical/data-quality-importance?utm_source=social&utm_medium=Medium&utm_campaign=Technical_Insight
