Why Your Team Needs To Implement Data Quality For Your AI Strategy
Companies ranging from start-ups to enterprises are looking to build AI and ML into their data strategy.
With that, it’s important not to forget about data quality.
Regardless of how fancy or sophisticated a company’s AI model might be, poor data quality will break it. It will make the outputs of these models useless at best, and misleading and company-destroying at worst.
So, as your company is rolling out its internally developed LLM or implementing a dynamic pricing model, it’s a great time to review your data quality strategy. Maybe add it as a line item in your data strategy budget.
Or…even better, just make sure your CEO or VP knows that their LLM or AI model will never work without better data quality.
As you’re planning out how to implement better data quality, it’s important to remember that data quality is not just about “accuracy”.
Instead, there are several key pillars that need to be considered when developing your data quality system.
We’ve outlined the widely accepted pillars below.
Accuracy
This refers to the extent to which data is free from errors. That is to say, if your customer with ID 72 has purchased 3 units of product X for 12 each on different days, then there should likely be 3 lines representing exactly that.
Completeness
This ensures all relevant data is present. Incomplete data can lead to incorrect analysis and decisions. For instance, if particular rows or columns are missing from a dataset, its conclusions might be flawed. An interesting point here, from my experience, is that you might send a sample set of data to a business stakeholder, and even when you specify that it is merely a sample, they will sometimes analyze it as if it were the complete data set.
Validity
Data should adhere to a specified format, range, or set of values. Invalid data can be the result of errors during data entry or transmission. This is likely where many data issues occur: not because the data pipelines being developed are wrong, but because the data coming from the source is incorrect or has changed. We will go over a few examples of this below.
Integrity
This pertains to the data’s structural soundness. Data should maintain its integrity throughout its lifecycle. This means relationships between datasets should be maintained, and there shouldn’t be any orphaned or stray data. A great example of this is checking that the dimension IDs inside a fact table actually have a matching row in the dim table. This becomes an issue particularly when you have to manually update dims (which you should try to avoid).
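As a rough illustration, an orphaned-ID check might look like the query below; FACT_SALES, DIM_CUSTOMER, and CUSTOMER_ID are hypothetical names, and any non-zero count means referential integrity has been broken somewhere.
-- Count fact rows whose CUSTOMER_ID has no matching row in the dimension table
SELECT COUNT(*) AS ORPHANED_ROWS
FROM FACT_SALES FS
LEFT JOIN DIM_CUSTOMER DC
    ON FS.CUSTOMER_ID = DC.CUSTOMER_ID
WHERE DC.CUSTOMER_ID IS NULL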
Timeliness
Data should be up-to-date and available when needed. Stale or outdated data can lead to incorrect insights or decisions. In particular, this check is often paired with an SLA that connects to a specific business use case. Perhaps an executive board meeting happens once a month at a specific time, or an operations team needs information live, and so on.
Consistency
This pillar assures that data across all systems and processes is presented in a consistent manner. What that means is that naming, formats, units, etc. are all consistent. In turn, you reduce confusion between users as they go through the various data sets across your data analytics platform.
Reliability
This relates to the trustworthiness of the data source. Reliable data can be depended upon for decision-making without needing extensive verification.
Now the question becomes, what do these pillars look like in an actual system?
How Companies Implement Data Quality Checks
There are a few different ways data teams may set up checks.
For example, for companies using dbt, you might implement dbt tests or perhaps you’ll integrate a solution like Elementary which provides data freshness, anomaly and row count checks out of the box. Still others will create Airflow tasks that run prior to pushing data into production to ensure the data is accurate.
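If you go the dbt route, a singular test is simply a SQL file placed in the tests/ directory that returns the rows violating a rule; dbt fails the test if any rows come back. Here is a minimal sketch, assuming a hypothetical stg_orders model with an order_amount column:
-- tests/assert_order_amount_in_range.sql (hypothetical singular dbt test)
-- Any rows returned are treated as failures, so this test fails whenever an
-- order amount falls outside the range we expect
SELECT *
FROM {{ ref('stg_orders') }}
WHERE order_amount < 0
   OR order_amount > 50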
Now, I will point out that the standard unit tests that ensure code operates as expected often don’t detect data quality issues in live systems. That is to say, the data quality issues described above are caused by incorrectly inserted data, delayed data, data duplication, etc.
Meaning your code might operate as expected but still push bad data or be delayed.
That’s why data quality checks occur while the pipeline is running or once the data is inserted into a staging table (sometimes that’s the only real purpose of a staging table: to keep bad data from going into production). They are checking the data every time new data is processed, not just when new code is pushed.
So let’s go over a few basic examples of how these checks might be implemented.
Data Validity – Range Checks
A recent project I was working on had an issue with data validity where a field that often expects values ranging from 0 to 50 suddenly shot up to 10,000. This was caught pretty quickly as our daily checks detected that the numbers started exceeding the expected range.
These checks are usually implemented as anomaly detection checks. Data anomaly checks can easily catch these types of problems and can either have upper- and lower-bound variables hard-coded or use an automated detection system that flags values falling outside the normal range based on standard deviations or interquartile ranges.
Below is an example of how you might implement a data range check. You’ll notice that there is a table that manages the data quality ranges. This allows you to implement multiple checks as a function vs rewriting a new SQL script every time.
Instead, you could have multiple range checks stored in a table (sketched right after this list) along with:
- The upper and lower bound of the check (unless you plan to use a dynamic method)
- The threshold for when a check should actually fire off a notification or kill the data pipeline altogether (is it 100%, 99%, 99.9%?)
- The name of the check so you can filter it and perhaps store historical information
- The field and table name you’ll be checking
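As a rough sketch, that configuration table might look something like the definition below; the table and column names are assumptions, not a standard.
-- One row per range check; the query that follows reads its bounds from here
CREATE TABLE DATA_QUALITY_RANGE_CHECK_TABLE (
    DATA_QUALITY_CHECK_NAME VARCHAR(100),  -- e.g. 'RANGE OF XYZ FIELD'
    TABLE_NAME              VARCHAR(100),  -- table being checked
    FIELD_NAME              VARCHAR(100),  -- field being checked
    LOWER_BOUND             DECIMAL(18,2),
    UPPER_BOUND             DECIMAL(18,2),
    ALERT_THRESHOLD         DECIMAL(5,4)   -- e.g. 0.999 = alert below 99.9% in range
);
With a table like that in place, the range check itself could look like this: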
SELECT
    SUM(CASE
            -- You can either use hard-coded ranges or use some form of dynamic
            -- ranges that can increase and decrease based on the data profile
            -- of the column
            WHEN ST.VALUE_BEING_CHECKED BETWEEN DT.LOWER_BOUND AND DT.UPPER_BOUND
                THEN 1
            ELSE 0
        END) * 1.0 / COUNT(*) AS PERCENTAGE_ROWS_IN_RANGE
    -- You may define a specific tolerance for what percentage of rows outside
    -- the expected range warrants a notification being fired off
    ,DT.DATA_QUALITY_CHECK_NAME
FROM SOURCE_TABLE ST
LEFT JOIN DATA_QUALITY_RANGE_CHECK_TABLE DT
    -- You want to apply the check to every row, and there is only one row in
    -- the range check table for this check, so a 1 = 1 join works
    ON 1 = 1
WHERE
    -- You could create a single table that manages all expected ranges and
    -- provide different names per check, meaning you can implement this
    -- query in a function without rewriting the check
    DT.DATA_QUALITY_CHECK_NAME = 'RANGE OF XYZ FIELD'
GROUP BY DT.DATA_QUALITY_CHECK_NAME
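If you would rather not hard-code LOWER_BOUND and UPPER_BOUND, one rough way to derive them dynamically is to profile the column over a recent window and use the mean plus or minus a few standard deviations. This is only a sketch: the column names and the 30-day window are assumptions, and the date arithmetic will vary by warehouse.
-- Derive bounds as mean +/- 3 standard deviations over the last 30 days,
-- then write them back into DATA_QUALITY_RANGE_CHECK_TABLE on a schedule
SELECT
    AVG(VALUE_BEING_CHECKED) - 3 * STDDEV(VALUE_BEING_CHECKED) AS LOWER_BOUND
    ,AVG(VALUE_BEING_CHECKED) + 3 * STDDEV(VALUE_BEING_CHECKED) AS UPPER_BOUND
FROM SOURCE_TABLE
WHERE INSERT_DATE >= CURRENT_DATE - INTERVAL '30' DAY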
I ran into a similar issue when working with a finance team, where we implemented a comparable check. It went off when a director spent an amount that exceeded what should have been possible under the company’s financial controls.
The finance system should have caught this, but it didn’t. We did.
These checks are simple to implement and great at catching issues before they enter production and get seen by an end-user on a dashboard. Yes, the dashboard might be late, but it’s arguably better than having an end-user point out that your company that does $1 million in sales a month suddenly seems to be doing $100 million.
Data Validity – Category Checks
Another type of check that is often implemented also fits in the data validity category. For example, I once had to implement a state abbreviation check. This check would ensure that all the values in the state abbreviation field were valid states. You’d think this type of check wouldn’t be needed, right?
The field for state abbreviations should be populated by a drop-down, so the data will always be right…right?
Well, it wasn’t, and we’d actually get plenty of abbreviations that weren’t valid. This is where you can implement a category check where you either hard-code a list of expected categories or place a list into a table. From there, the check could look like the code below.
SELECT
    SUM(CASE
            -- A successful join (DT.VALUE_BEING_CHECKED IS NOT NULL) means the
            -- value exists in the approved list. You could also use a function
            -- like ARRAY_CONTAINS instead of a join
            WHEN DT.VALUE_BEING_CHECKED IS NOT NULL
                THEN 1
            ELSE 0
        END) * 1.0 / COUNT(*) AS PERCENTAGE_ROWS_IN_RANGE
    ,'STATE ABBREVIATIONS CHECK' AS DATA_QUALITY_CHECK_NAME
FROM SOURCE_TABLE ST
LEFT JOIN DATA_QUALITY_CATEGORY_CHECK_TABLE DT
    -- I have actually seen some teams keep a single data quality check table
    -- with an extra field defining the type of check being run. That way you
    -- can reduce the number of tables you manage, but I feel like that can be
    -- a little complicated
    ON ST.STATE_ABBREVIATION = DT.VALUE_BEING_CHECKED
    -- Keep the check-name filter in the join condition; putting it in the
    -- WHERE clause would turn the LEFT JOIN into an INNER JOIN and hide the
    -- rows that fail the check
    AND DT.DATA_QUALITY_CHECK_NAME = 'STATE ABBREVIATIONS CHECK'
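The approved list itself can be as simple as a small seeded table. A minimal sketch, again with hypothetical names and only a few states shown:
-- Seed the approved list that the category check joins against
CREATE TABLE DATA_QUALITY_CATEGORY_CHECK_TABLE (
    DATA_QUALITY_CHECK_NAME VARCHAR(100),
    VALUE_BEING_CHECKED     VARCHAR(10)
);

INSERT INTO DATA_QUALITY_CATEGORY_CHECK_TABLE
    (DATA_QUALITY_CHECK_NAME, VALUE_BEING_CHECKED)
VALUES
    ('STATE ABBREVIATIONS CHECK', 'AL'),
    ('STATE ABBREVIATIONS CHECK', 'AK'),
    ('STATE ABBREVIATIONS CHECK', 'AZ');
-- ...and so on for the remaining valid abbreviations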
In the past few checks, we covered data issues that center on what many likely consider data quality problems in the traditional sense. But data quality isn’t just about whether the data makes sense.
It’s also about timeliness.
Timeliness Or Freshness
Whether at Facebook or at many of the other companies I have worked with, one issue that comes up is that the data isn’t updated by the time the end-user expects it.
In one case, this expectation was Tuesday at 8 a.m. prior to a board meeting; in another, it was every day before 10 a.m. Some data teams will have SLAs to ensure these time requirements are called out. But many don’t. In turn, this causes a lot of issues between expectations and what data is actually being shown.
These are interesting cases because the data is arguably “accurate” as of a specific time. But when end-users look at the data, they expect it to be more recent, and thus, to them, the data is wrong.
The way this issue is generally solved is two-fold. The first part is a check that calculates the max date of a table (whichever date represents the timeliness factor: update_date, effective_date, insert_date, create_date, etc.). You then compare that to the current date and time to figure out how long it’s been since the last data load.
This informs the data team when there is an issue, often either via an alert that fires when the table hasn’t been updated for a certain period of time, or via a dashboard that tracks the key tables and their most recent update dates.
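A bare-bones version of that freshness check might look like the query below. The ORDERS table, UPDATE_DATE column, and 24-hour SLA are assumptions for illustration, and the exact date functions (DATEDIFF here) will depend on your warehouse.
-- Hours since the table last received data; anything over the SLA should
-- trigger an alert or show up on the freshness dashboard
SELECT
    MAX(UPDATE_DATE) AS LAST_LOADED_AT
    ,DATEDIFF('hour', MAX(UPDATE_DATE), CURRENT_TIMESTAMP) AS HOURS_SINCE_LAST_LOAD
    ,CASE
        WHEN DATEDIFF('hour', MAX(UPDATE_DATE), CURRENT_TIMESTAMP) > 24
            THEN 'STALE'
        ELSE 'FRESH'
     END AS FRESHNESS_STATUS
FROM ORDERS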
For the second intervention, the data team will place a timestamp on the dashboards that rely on the data source, likely saying something like “As Of MM-DD-YYYY”.
Overall, data freshness is an important aspect of making sure business teams are not just making accurate decisions but decisions that are up to date.
Garbage In, Garbage Out
Data quality will be key to ensuring your machine learning models and dashboards are reliable. It’s also important to point out that as the systems supplying said data continue to grow, this will only become a more complex problem.
Thus, the more data teams can do now to implement even basic data quality checks, the more problems will be avoided. That means creating range checks, creating a dashboard to manage data quality and more. Of course, it is important to find a solid balance, as you can’t check every column for every possible issue.
But once your team has implemented a reliable system for data quality, it’ll all become much easier.
As always, thanks for reading!