Data Quality is a User Behavior Problem
When most engineers think about data quality, they think about schemas. Are the types correct? Are there null values where there shouldn't be? Does the data conform to the contract? These checks are necessary. They're also the easy part.
The hardest data quality problems I've encountered have nothing to do with schemas or missing values. They come from a shift that no static validation can catch: the people generating the signals changed how they behave.
The signal looks fine. The signal is wrong.
Imagine you're processing billions of daily signals to evaluate an AI system. You've built validation frameworks, regression detection, automated checks. Every metric is green. Then one day, the AI outputs start degrading. Not dramatically, just a slow, quiet drift in quality that shows up in user-facing metrics weeks later.
You dig into the data. The schema is fine. The volume is fine. The distribution looks almost the same. Almost. There's a subtle shift in the patterns. Not a bug, not an outage, not a schema violation. What happened is that users started interacting with the product differently. Maybe a new feature changed their workflow. Maybe adoption expanded to a different demographic. Maybe they simply learned the system better and started asking different kinds of questions.
The data your pipeline receives is downstream of human behavior. When that behavior changes, the statistical properties of your signals change, even if every individual record passes every validation check you've ever written.
Why traditional data quality misses this
Most data quality frameworks are built around a static mental model: define the rules, validate against the rules, alert when rules are broken. This works well for structural problems like wrong types, missing fields, duplicates, format violations. It doesn't work when the problem isn't that the data broke the rules, but that the rules no longer describe reality.
Behavioral drift is hard to catch because it's gradual, it's contextual, and it's not "wrong" in any obvious way. A user asking a different kind of question is generating valid data. A new cohort of users with different patterns is generating valid data. The records pass every check. But the aggregate distribution has shifted, and your downstream system, whether it's a model, an evaluation pipeline, or a ranking algorithm, was calibrated for the old distribution.
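To make the contrast concrete, here's a minimal stdlib-only sketch of the situation: every record in a shifted batch still passes a per-record validity check, yet a two-sample Kolmogorov-Smirnov statistic shows the distribution has moved. The distributions, thresholds, and check function are invented for illustration, not taken from any real pipeline.

```python
import bisect
import random

def passes_record_checks(x):
    # Per-record validation: type and range. The kind of check
    # that keeps passing even after behavior drifts.
    return isinstance(x, float) and 0.0 <= x <= 100.0

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the two empirical CDFs. Larger values mean the distributions differ more.
    a, b = sorted(a), sorted(b)
    def cdf(s, v):
        return bisect.bisect_right(s, v) / len(s)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in a + b)

random.seed(0)
baseline = [random.gauss(40.0, 8.0) for _ in range(5000)]  # last quarter's behavior
current = [random.gauss(48.0, 8.0) for _ in range(5000)]   # users engage differently now

print(all(passes_record_checks(x) for x in current))   # every record is still "valid"
print(round(ks_statistic(baseline, current), 2))       # but the aggregate shape moved
```

Record-level validation and distribution-level monitoring answer different questions; the point is that you need both.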
You need to model the customer, not just the data
The realization that changed how I think about data quality is this: your validation framework needs a model of the user, not just a model of the data. You need to understand what "normal" looks like in terms of user behavior patterns, not just in terms of data schemas. And "normal" is not static. It evolves as the product evolves, as adoption grows, as the world changes.
In practice, this means a few things. First, your monitoring needs to track distributional properties over time, not just per-record validation. Averages, percentiles, cardinality, co-occurrence patterns. The shape of the data, not just the content. Second, you need baselines that update. A signal distribution from three months ago may not be a valid reference point if user behavior has shifted meaningfully. Third, and most importantly, you need to close the loop between what users are doing and what your data pipeline expects. If a product change rolls out that alters user workflows, your data quality team needs to know. Not because the data will break, but because the definition of "correct" just changed.
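The first two points can be sketched together: a monitor that tracks distributional summaries per period and compares each new batch against a baseline built from recent history, so the baseline evolves with behavior instead of being frozen. The class name, window size, and z-score threshold below are my own illustration, assuming weekly batches of a numeric signal; they are not a standard API.

```python
from collections import deque
import statistics

class RollingBaseline:
    """Distributional monitor whose baseline updates over time.

    Keeps summaries of the last `window` periods and flags a new period
    whose mean sits far outside the recent historical means, rather than
    comparing against a frozen reference distribution.
    """
    def __init__(self, window=12, z_threshold=3.0):
        self.summaries = deque(maxlen=window)  # (mean, p95) per period
        self.z_threshold = z_threshold

    def check(self, batch):
        mean = statistics.fmean(batch)
        p95 = statistics.quantiles(batch, n=20)[-1]  # 95th-percentile cut point
        drifted = False
        if len(self.summaries) >= 4:  # need some history before alerting
            hist_means = [m for m, _ in self.summaries]
            mu = statistics.fmean(hist_means)
            sd = statistics.stdev(hist_means) or 1e-9
            drifted = abs(mean - mu) / sd > self.z_threshold
        self.summaries.append((mean, p95))  # the baseline evolves with behavior
        return drifted

import random
random.seed(1)
monitor = RollingBaseline()
for week in range(12):  # twelve stable weeks build and refresh the baseline
    monitor.check([random.gauss(40.0, 8.0) for _ in range(1000)])
# a behavioral shift: same schema, same valid records, different distribution
alert = monitor.check([random.gauss(48.0, 8.0) for _ in range(1000)])
print("drift alert:", alert)
```

Because old summaries fall out of the deque, a gradual, product-driven change in "normal" gets absorbed into the baseline, while an abrupt shift still trips the alert.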
The customer is the upstream system
In distributed systems, we think carefully about upstream dependencies. If a service you depend on changes its behavior, you want to know. You build contracts, you version APIs, you monitor for drift. But when the upstream system is a human being, a user, a customer, a person interacting with your product, most teams don't apply the same rigor.
I think this is backwards. User behavior is the most important upstream signal in any AI system. It's also the least predictable, the hardest to validate, and the one most likely to shift without warning. If you're building data quality frameworks for AI systems and you're only validating the data itself, you're checking the plumbing while the water source is changing.
What this looks like in practice
You don't need to boil the ocean. Start with a few things: add distributional monitoring alongside your per-record checks. Track how key signal distributions evolve week over week. Build alerts for statistical drift, not just rule violations. And this is the cultural part: create a feedback loop between product changes and data quality. When a feature ships, ask "How will this change the signals we depend on?"
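One common way to turn "alert on statistical drift" into code is the Population Stability Index, which bins the current sample by the baseline's quantiles and scores how far the bin proportions have moved. A sketch, with the usual caveat that the 0.2 alert threshold is a widely used rule of thumb rather than a standard, and the session-length numbers are made up:

```python
import math
import random

def psi(baseline, current, bins=10):
    # Population Stability Index: bin the current sample by the baseline's
    # quantiles and sum the divergence of the bin proportions.
    srt = sorted(baseline)
    cuts = [srt[len(srt) * i // bins] for i in range(1, bins)]
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > c for c in cuts)] += 1  # index of x's bin
        return [max(c, 0.5) / len(sample) for c in counts]  # smooth empty bins
    return sum((a - b) * math.log(a / b)
               for b, a in zip(fractions(baseline), fractions(current)))

random.seed(2)
last_week = [random.expovariate(1 / 30) for _ in range(5000)]  # e.g. session length
this_week = [random.expovariate(1 / 60) for _ in range(5000)]  # users now engage longer

score = psi(last_week, this_week)
if score > 0.2:  # rule-of-thumb threshold for meaningful drift
    print(f"drift alert: PSI = {score:.2f}")
```

A check like this runs per signal, per week, next to the per-record validation you already have; it alerts on the shape of the data rather than on any individual record.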
The teams I've seen do this well treat data quality as a product problem, not just an engineering problem. They understand that the data is a reflection of the customer, and that understanding the customer is the first step to understanding the data.
The bottom line
Data quality isn't about making sure the data is "right." It's about making sure the data still reflects the reality it's supposed to represent. And that reality is shaped by people, people whose behavior is constantly evolving. If your validation framework doesn't account for that, you'll catch every schema violation and miss the regressions that actually hurt your users.
This article was written by me and reflects my own personal and professional experience. AI models were used to assist with revision and editing.