
Seeing the Problem Was Never the Hard Part. Fixing It Is.

Data observability surfaces the problems. Now you need a way to fix them that is actually doable.

A few years ago, the pitch for data observability was simple and compelling: know what’s happening in your environment. Dashboards, alerts, telemetry. When something breaks, you see it.

That was a real step forward. The alternative was flying blind, and plenty of teams were doing exactly that.

But here’s the thing everybody is now realizing. Knowing what’s wrong and being able to fix it are two very different capabilities. And at the scale most enterprises are operating at today, just knowing is no longer enough.

The gap between seeing and doing

Observability solved the visibility problem. It did not solve the action problem.

Think about the typical workflow. Something goes wrong. An alert fires. A data engineer picks it up, triages it, investigates the root cause, writes a fix, tests it, and deploys. That process takes hours on a good day. Sometimes days.

Meanwhile, the damage is already done. The SLA was missed. The pipeline ran for 16 hours too long. The query burned $2,000 more than it should have. By the time a human acts on what the dashboard showed them, the cost has already been absorbed.

This worked when data environments were smaller, and the pace of change was manageable. It doesn’t work when you have hundreds of users generating thousands of new workloads every week, running on data sets that grow by orders of magnitude month over month.

Observability assumes a human is fast enough to act on what they see. That assumption doesn’t hold anymore.

Four layers between a problem and a fix

It helps to break this down into what actually needs to happen when something goes wrong.

The first layer is detection: what happened. This is where observability lives. Your pipeline failed. Your query took 10 times longer than expected. Your costs spiked.

The second layer is diagnosis: why it happened. Was it a bad join? A misconfigured cluster? A data set that grew beyond what the original code was designed to handle? Most teams struggle here because diagnosis requires deep context about the code, the infrastructure, and the data.

The third layer is recommendation: what should you do about it? Not just “something is wrong” but “here is the specific change that will fix it, and here is why we’re confident it will work.”

The fourth layer is action: doing it. Making the change, validating it didn’t break anything else, and confirming the improvement.

Most organizations are stuck at layer one, maybe layer two. The real value lives in layers three and four. That’s where problems actually get resolved, not just reported.
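The four layers can be sketched as a pipeline that carries an alert from detection all the way to a validated change. Everything here, the `Finding` type and the three stage functions, is an illustrative assumption, not a real API; the hard-coded diagnosis and fix are placeholders for what a real system would derive from code, infrastructure, and data context.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    what: str               # layer 1: detection -- the alert itself
    why: str = ""           # layer 2: diagnosis -- the root cause
    fix: str = ""           # layer 3: recommendation -- the specific change
    applied: bool = False   # layer 4: action -- change made and validated

def diagnose(finding: Finding) -> Finding:
    # Layer 2 needs deep context; this stand-in just fills in one plausible cause.
    finding.why = "data set outgrew the original join strategy"
    return finding

def recommend(finding: Finding) -> Finding:
    # Layer 3: not just "something is wrong" but the concrete change to make.
    finding.fix = "broadcast the small side of the join"
    return finding

def act(finding: Finding) -> Finding:
    # Layer 4: apply the change, validate, confirm the improvement.
    finding.applied = True
    return finding

alert = Finding(what="query ran 10 times longer than expected")
resolved = act(recommend(diagnose(alert)))
```

Observability tooling, in this framing, produces the `Finding` and stops; the value of the later layers is that the object comes out the other end with `applied=True`.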

Not all automation is created equal

This is where the conversation gets nuanced, and where I think a lot of people get it wrong.

The instinct is to treat automation as binary: either you let the system do it, or a human does it. That framing misses the reality of how enterprises actually adopt this stuff.

The better framework is a grid: risk, effort, and impact. Map every potential action to that grid, and the right approach becomes clear.

Low-risk, low-effort, high-impact changes are the obvious wins. Right-sizing a Snowflake warehouse. Adjusting resource allocation for a Spark job on Databricks. These affect everything running on that infrastructure, so the impact is broad, and the risk of breaking something is minimal. Automate these end-to-end. Even the most cautious enterprises will accept that.
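The grid can be read as a routing decision. A minimal sketch, assuming a three-way policy and simple categorical scores, neither of which comes from the source:

```python
def route(risk: str, effort: str, impact: str) -> str:
    """Map a potential action on the risk/effort/impact grid to a handling policy."""
    if risk == "low" and effort == "low" and impact == "high":
        return "automate end-to-end"            # e.g. right-sizing a warehouse
    if risk == "high":
        return "validate, then human approval"  # e.g. editing a critical pipeline
    return "recommend with context"             # everything in between

print(route("low", "low", "high"))   # automate end-to-end
print(route("high", "low", "high"))  # validate, then human approval
```

The point of the grid is that the same system applies different policies to different actions, rather than one binary automation switch.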

On the other end, you have high-risk changes. Modifying code in a business-critical pipeline, especially code written by someone who left the company years ago. Nobody fully understands all the dependencies. A mistake here doesn’t just waste money. It can disrupt a production process on which the business depends.

For those changes, the system needs to be absolutely accurate. Not “kind of accurate.” Not “very accurate.” Absolutely accurate. That means running testing and validation loops, ensuring the optimized version produces exactly the same results as the original, and only then presenting the recommendation with full context for a human to approve.
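The validation loop above amounts to a simple contract: the optimized version must produce exactly the same results as the original on every test input before a recommendation is surfaced. A hedged sketch, where `run_query` is a hypothetical stand-in for whatever executes SQL in your environment:

```python
def validate_rewrite(run_query, original_sql, optimized_sql, test_inputs):
    """Return True only if both versions agree exactly on every test input."""
    for params in test_inputs:
        if run_query(original_sql, params) != run_query(optimized_sql, params):
            return False  # any divergence at all disqualifies the rewrite
    return True

# Toy executor standing in for a real engine, just so the sketch runs:
# it "executes" a query by summing its parameters.
def run_query(sql, params):
    return sum(params)

ok = validate_rewrite(run_query, "SELECT ... original", "SELECT ... optimized",
                      [[1, 2], [3, 4], [5]])
```

The asymmetry is deliberate: one mismatch vetoes the change, because one bad automated change is all it takes to lose the trust described below.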

I think about it like mountain climbing. You have to get every step right. One slip and you lose the trust you spent months building. It takes only one bad automated change for an enterprise to shut the whole thing down and never turn it back on.

Trust is earned incrementally

The good news is that the market is actually moving. Three years ago, enterprises would watch a demo of automated optimization and say, “Love it. Not happening inside my company.” The idea of letting software make changes to production systems was a non-starter, regardless of how good the technology was.

That’s shifted dramatically. The broader AI wave has created an expectation that systems should be able to act, not just observe. People now ask, “Why doesn’t your product automate this?” instead of recoiling from the idea.

But adoption follows the trust curve. You start with the low-risk, high-impact wins. The system proves itself. It gets every step right, consistently, across hundreds of changes. Then teams get comfortable expanding the scope. Today, 60 to 70 percent of the improvements available in a mature data environment can be automated, even at the world’s largest banks and healthcare companies.

The key is that trust builds through the grid, not through a single leap of faith. You earn the right to automate code-level changes by first proving you can nail the infrastructure-level ones.

The best fix is the one you never have to make

There’s one more layer to this that most people miss. The highest form of actionability isn’t fixing problems faster. It’s preventing them from reaching production in the first place.

Can you check code during the dev-to-prod pipeline and flag inefficiencies before they go live? Can you distinguish healthy code from unhealthy code and establish guardrails that only let the healthy stuff through?

Think about what happens when AI generates a SQL query. It might return the right results, but is it the fastest and most efficient way to do so? That’s a check that can happen before the query ever runs on a production cluster, before it costs anything. We’ve seen cases where catching a single inefficient query in staging, one that would have triggered a full scan on a petabyte table, saved a customer tens of thousands of dollars in a single month. Multiply that across hundreds of queries, and the numbers get serious fast.
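A pre-production guardrail like that can start very small. The check below is only an illustrative heuristic, not a real product feature: it flags `SELECT *` queries with no `WHERE` or `LIMIT` clause before they reach a production cluster. A real system would use a proper SQL parser plus table statistics rather than a regex.

```python
import re

def flag_full_scan(sql: str) -> bool:
    """Heuristically flag queries that look like unbounded full scans."""
    s = sql.strip().lower()
    selects_everything = re.match(r"select\s+\*", s) is not None
    unbounded = "where" not in s and "limit" not in s
    return selects_everything and unbounded

flag_full_scan("SELECT * FROM events")                           # True: stop it in CI
flag_full_scan("SELECT * FROM events WHERE day = '2024-01-01'")  # False: let it through
```

Wired into the dev-to-prod pipeline, a check like this runs before the query costs anything, which is exactly where prevention beats cure.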

It’s the same principle as the expanding user base. Marketing, finance, legal, and analysts are all now hitting these platforms. Most of them aren’t SQL experts. Training them to be better citizens of the platform and catching their mistakes early are far more valuable than cleaning up the damage after the fact.

The shift from reactive firefighting to proactive prevention is the real transformation. Cure is necessary. Prevention is where the compounding value lives.

The question has changed

Five years ago, the right question was “Can I see what’s happening in my data environment?” Observability was the answer, and it was a good one.

The question has moved on. Now it’s: can my systems act on what they see, faster and more accurately, to produce meaningful impact?

The organizations that answer yes are the ones that stop losing hours to triage, stop absorbing preventable cost overruns, and free their data engineers to build instead of firefight. The ones that stay at the dashboard, watching problems scroll by, are going to keep falling behind.

Seeing was never the hard part.

This article was inspired by a conversation between Unravel Data CEO Kunal Agarwal and Eric Kavanagh on Inside Analysis.