Data Observability for Databricks Register

DataOps

DataOps Resiliency: Tracking Down Toxic Workloads

By Jason Bloomberg, Managing Partner, Intellyx Part 4 of the Demystifying Data Observability Series for Unravel Data In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the […]

  • 4 min read
Open Collection

By Jason Bloomberg, Managing Partner, Intellyx
Part 4 of the Demystifying Data Observability Series for Unravel Data

In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the connection between DevOps and DataOps, and data-centric FinOps best practices.

In this concluding article in the series, I’ll explore DataOps resiliency – not simply how to prevent data-related problems, but also how to recover from them quickly, ideally without impacting the business and its customers.

Observability is essential for any kind of IT resiliency – you can’t fix what you can’t see – and DataOps is no exception. Failures can occur anywhere in the stack, from the applications on down to the hardware. Understanding the root causes of such failures is the first step to fixing, or ideally preventing, them.

The same sorts of resiliency problems that impact the IT environment at large can certainly impact the data estate. Even so, traditional observability and incident management tools don’t address specific problems unique to the world of data processing.

In particular, DataOps resiliency must address the problem of toxic workloads.

Understanding Toxic Workloads

Toxic data workloads are as old as relational database management systems (RDBMSs), if not older. Anyone who works with SQL on large databases knows there are some queries that will cause the RDBMS to slow dramatically or completely grind to a halt.

The simplest example: SELECT * FROM TRANSACTIONS where the TRANSACTIONS table has millions of rows. Oops! Your resultset also has millions of rows!

JOINs, of course, are more problematic, because they are difficult to construct, and it’s even more difficult to predict their behavior in databases with complex structures.

Such toxic workloads caused problems in the days of single on-premises databases. As organizations implemented data warehouses, the risks compounded, requiring increasing expertise from a scarce cadre of query-building experts.

Today we have data lakes as well as data warehouses, often running in the cloud where the meter is running all the time. Organizations also leverage streaming data, as well as complex data pipelines that mix different types of data in real time.

With all this innovation and complexity, the toxic workload problem hasn’t gone away. In fact, it has gotten worse, as the nuances of such workloads have expanded.

Breaking Down the Toxic Workload

Poorly constructed queries are only one of the causes of a modern toxic workload. Other root causes include:

  • Poor quality data – one table with NULL values, for example, can throw a wrench into seemingly simple queries. Expand that problem to other problematic data types and values across various cloud-based data services and streaming data sources, and small data quality problems can easily explode into big ones.
  • Coding issues – Data engineers must create data pipelines following traditional coding practices – and whenever there’s coding, there are software bugs. In the data warehouse days, tracking down toxic workloads usually revealed problematic queries. Today, coding issues are just as likely to be the root cause.
  • Infrastructure issues – Tracking down the root causes of toxic workloads means looking everywhere – including middleware, container infrastructure, networks, hypervisors, operating systems, and even the hardware. Just because a workload runs too slow doesn’t mean it’s a data issue. You have to eliminate as many possible root causes as you can – and quickly.
  • Human issues – Human error may be the root cause of any of the issues above – but there is more to this story. In many cases, root causes of toxic workloads boil down to a shortage of appropriate skills among the data team or a lack of effective collaboration within the team. Human error will always crop up on occasion, but a skills or collaboration issue will potentially cause many toxic workloads over time.

The bottom line: DataOps resiliency includes traditional resiliency challenges but extends to data-centric issues that require data observability to address.

Data Resiliency at Mastercard

Mastercard recently addressed its toxic workload problem on Hadoop, as well as Impala, Spark, and Hive.

The payment processor has petabytes of data across hundreds of nodes, as well as thousands of users who access the data in an ad hoc fashion – that is, they build their own queries.

Mastercard’s primary issue was poorly constructed queries, a combination of users’ inexperience as well as the complexity of the required queries.

In addition, the company faced various infrastructure issues, from overburdened data pipelines to maxed-out storage and disabled daemons.

All these problems led to application failures, system slowdowns and crashes, and resource bottlenecks of various types.

To address these issues, Mastercard brought in Unravel Data. Unravel quickly identified hundreds of unused data tables. Freeing up the associated resources improved query performance dramatically.

Mastercard also uses Unravel to help users tune their own query workloads as well as automate the monitoring of toxic workloads in progress, preventing the most dangerous ones from running in the first place.

Overall, Unravel helped Mastercard improve its mean time to recover (MTTR) – the best indicator of DataOps Resiliency.

The Intellyx Take

The biggest mistake an organization can make around DataOps observability and resiliency is to assume these topics are special cases of the broader discussion of IT observability and resiliency.

In truth, the areas overlap – after all, infrastructure issues are often the root causes of data-related problems – but without the particular focus on DataOps, many problems would fall through the cracks.

The need for this focus is why tools like Unravel’s are so important. Unravel adds AI optimization and automated governance to its core data observability capabilities, helping organizations optimize the cost, performance, and quality of their data estates.

DataOps resiliency is one of the important benefits of Unravel’s approach – not in isolation, but within the overall context for resiliency that is so essential to modern IT.

Copyright © Intellyx LLC. Unravel Data is an Intellyx customer. None of the other organizations mentioned in this article is an Intellyx customer. Intellyx retains final editorial control of this article. No AI was used in the production of this article.