Learnings from Pursuing High Data Quality: A Reflective Piece
Source: Dev.to
Introduction
Before diving in, let us discuss what data quality entails. IBM puts it well:
“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness and fitness for purpose, and it is critical to all data governance initiatives within an organisation.”
Data quality is not only about the data being pristine; it is also about the data being fit for its intended use, meaning that data quality is context‑specific. The domain in which the data is collected and used matters as much as any technical check. In many situations, the domain provides the foundation for defining checks for accuracy, validity, consistency, timeliness and uniqueness. Furthermore, data quality can build or destroy trust within a team.
This is a reflective piece that encapsulates my experience setting up a roadmap for maintaining high‑quality data. When asked previously about enforcing and implementing data quality, I always responded with “we can use this tool or that technology to achieve it”. In practice, I was faced with a rude awakening about how limited that response was.
To provide better context, consider a case where a group of skilled data analysts and scientists were asked to calculate the same metric for a product over a given time window: a month, a week, a day. Everyone came up with a different number. Such inconsistencies erode trust, and the output of any data process is predicated on trustworthiness. It matters because if people get different numbers for the same metric, the questions become: “Is the data being collected good? Are we introducing errors during processing?”
That was the point at which I realised that data quality is not just about the tools. It is about key elements that must be present for the effective delivery of data products. I categorise them into three main elements that must work together in harmony: people, technology, and process. The following sections expand on each element.
People
As long as the data is going to be read and interpreted by more than one person, the people element must be considered. While this may seem like a downstream activity, it is essential to address it as early as possible in any workflow—whether automated, routine, or ad‑hoc.
To make this practical, the quality of the data starts from understanding the request being made. There must be a free flow of knowledge among all stakeholders in the delivery of any analysis. For example, when a senior executive asks, “What is the day‑3 retention for product A?”, instead of immediately writing fancy SQL or Python scripts, respond with clarifying questions such as:
- Do you mean classic retention or rolling retention?
- For a global product, do you want regional retention that may show regional patterns?
- Should the retention be measured in UTC 24‑hour cycles or in local time zones?
These questions either give you clarity on what exactly needs to be calculated or provide leeway for assumptions to be made. Overall, the quality of the resulting data and its interpretation depends on people communicating effectively. In a team where analysts and engineers work collaboratively, clear definitions must be documented and made readily available to produce reliable downstream data.
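To make the first of those clarifying questions concrete, here is a minimal Python sketch of the difference between classic and rolling day‑3 retention. The activity log, user IDs, and dates are invented for illustration; real pipelines would compute this over event tables, but the two definitions are the point.

```python
from datetime import date, timedelta

# Hypothetical activity log: user_id -> set of dates the user was active.
activity = {
    "u1": {date(2024, 1, 1), date(2024, 1, 4)},  # returned exactly on day 3
    "u2": {date(2024, 1, 1), date(2024, 1, 6)},  # returned on day 5, not day 3
    "u3": {date(2024, 1, 1)},                    # never returned
}
cohort_start = date(2024, 1, 1)

def classic_retention(activity, start, day_n):
    """Share of the start-date cohort active exactly on day N."""
    cohort = [u for u, days in activity.items() if start in days]
    target = start + timedelta(days=day_n)
    returned = [u for u in cohort if target in activity[u]]
    return len(returned) / len(cohort)

def rolling_retention(activity, start, day_n):
    """Share of the start-date cohort active on day N or any later day."""
    cohort = [u for u, days in activity.items() if start in days]
    target = start + timedelta(days=day_n)
    returned = [u for u in cohort if any(d >= target for d in activity[u])]
    return len(returned) / len(cohort)

print(round(classic_retention(activity, cohort_start, 3), 3))  # 0.333
print(round(rolling_retention(activity, cohort_start, 3), 3))  # 0.667
```

The same three users yield a day‑3 retention of one third or two thirds depending on the definition chosen, which is exactly why the clarifying questions matter.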
Technology
The tooling list is ever‑increasing. With tools like Great Expectations, Amazon Deequ, Soda, and platform‑embedded solutions such as tests in dbt, AWS Glue Data Quality, and similar offerings, data‑quality checks are technologically a solved problem. The only questions worth asking concern cost, team competencies, and fit with the existing tech stack, essentially the criteria of a standard tool assessment.
These tools do a good job of creating valid definitions of what the data should contain. Typical examples of what they provide are consistent ways to:
- Create data‑quality expectations
- Store the results of checks
- Report outcomes to stakeholders
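The three capabilities above can be sketched without any particular framework. The following plain‑Python example, with made‑up rows and expectation names, shows the shape of what these tools standardise: define expectations, store the results of running them, and report the outcomes.

```python
# Hypothetical rows to validate; the third violates the not-null expectation.
rows = [
    {"user_id": 1, "country": "NG"},
    {"user_id": 2, "country": "GH"},
    {"user_id": None, "country": "KE"},
]

# Create data-quality expectations as (name, predicate) pairs.
expectations = [
    ("user_id_not_null", lambda r: r["user_id"] is not None),
    ("country_is_two_letters",
     lambda r: isinstance(r["country"], str) and len(r["country"]) == 2),
]

# Run each expectation and store the results of the checks.
results = []
for name, check in expectations:
    failures = [r for r in rows if not check(r)]
    results.append({"expectation": name,
                    "passed": not failures,
                    "failures": len(failures)})

# Report outcomes to stakeholders (here, just stdout).
for res in results:
    status = "PASS" if res["passed"] else f"FAIL ({res['failures']} rows)"
    print(f"{res['expectation']}: {status}")
```

Dedicated tools add what this sketch lacks: shared vocabularies for expectations, persistent result stores, and reporting surfaces that non‑engineers can read.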
Additionally, it is now commonplace to treat data‑processing and analysis efforts similarly to software‑development practices. Writing maintainable, readable, and modular code becomes a requirement to foster collaboration and longevity, rather than a luxury. Using version‑control systems like Git is non‑negotiable for achieving this.
Process
We have seen how people working together in harmony play a vital role in aligning expectations. I consider process to be the wrapper around people and technology. Good processes foster seamless interaction among people and with tools to achieve defined goals.
For instance, a group of data engineers and analysts might define a workflow using the Write‑Audit‑Publish (WAP) pattern, with data‑quality and validation tests at the audit layer. Consequently, no data product is published without passing all defined tests. Large datasets might also have preliminary checks that leverage fail‑fast mechanisms.
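A toy sketch of that Write‑Audit‑Publish flow, with hypothetical staging and published stores and invented audit rules, might look like this: data lands in staging, audits run there and fail fast on the first violation, and only validated data is promoted.

```python
staging, published = [], []

def audit(rows):
    """Fail fast: raise on the first violated check so bad data never publishes."""
    for row in rows:
        if row.get("amount") is None:
            raise ValueError(f"audit failed: missing amount in {row}")
        if row["amount"] < 0:
            raise ValueError(f"audit failed: negative amount in {row}")

def write_audit_publish(rows):
    staging.extend(rows)       # Write: land raw data in the staging area
    audit(staging)             # Audit: validate before any exposure
    published.extend(staging)  # Publish: promote only validated data
    staging.clear()

write_audit_publish([{"order": 1, "amount": 25.0},
                     {"order": 2, "amount": 10.5}])
print(len(published))  # 2
```

A batch containing a negative or missing amount would raise inside `audit` and never reach `published`, which is the guarantee the pattern is meant to provide.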
Building effective processes is not always straightforward:
- Too many steps make achieving a goal tedious.
- Too few steps may lack the robustness needed to define safe boundaries and guidelines for consistent, sustainable results.
A good process strikes a balance, and it may take several iterations to achieve that equilibrium.
Conclusion
It is tempting to argue that data quality can be achieved with tooling alone. But without the right processes, tools become useless, and without the right people committed to upholding a structure, processes are easily circumvented. That interplay is what really matters.
Without technology, process, and people working in harmony, any organisation's or project's data‑quality framework remains fragile. Furthermore, the ideas discussed here may be framed not as data quality but as an aspect of data governance, and they align strongly with the principles behind data contracts. Whatever it is called in an organisation, whether data governance, data contracts, or a data‑quality framework, its effective application will rely on these elements and more.