Cox Automotive is a large, global business. It’s part of Cox Enterprises, a media conglomerate with a strong position in the Fortune 500, and a leader in diversity. Cox also has a strong history of technological innovation, with its core cable television business serving as a leader in the growth and democratization of media over the last several decades.
At DataOps Unleashed, senior data engineer James Fielder talked about how Cox Automotive has established a strong DataOps practice, while moving from on-premises Cloudera to Databricks on Microsoft Azure. Fielder is part of Data Services, the central data intelligence team providing analytics, reporting, and data science services for Cox Automotive in the UK and Europe.
Through extensive use of Azure’s built-in capabilities, the addition of Unravel Data, and a lot of hard work, Cox Automotive has created a data platform that supports a large business, all with a DataOps team that currently numbers just eight people. Crucial to the team’s effectiveness is making the platform as self-service as possible for its internal users.
Note. If you want to make your own data platform self-service, you might be interested in the Unravel Data webinar series, The Self-Service Data Roadmap, featuring our CDO and VP Engineering, Sandeep Uttamchandani. You can also find a detailed description of the book of the same name on the O’Reilly website.
Building a Self-Service Data Platform
“By pushing this kind of self-service mentality, we can reduce cycle time for projects because we don’t have to collaborate as much.” – James Fielder
The Data Services group is a full-stack data team, handling cloud engineering, creating data pipelines, doing business intelligence (BI) development, and carrying out data science tasks. The team also handles operations, supporting everything in production, with only eight full-time members.
As a result, Cox Automotive Data Services seeks to head off problems before they can begin. Proposed solutions have to scale, technically and product-wise. Outside the team, the business is still largely legacy-oriented; for instance, data comes in from relational databases, using SQL, and as CSV extracts. So Data Services works to make their solutions self-service, enabling autonomy for users, while maintaining good data engineering practices wherever possible.
As James Fielder puts it: “Thinking in DataOps terms, by pushing this kind of self-service mentality, we can reduce cycle time for projects because we don’t have to collaborate as much.” For instance, the Data Services team makes it possible for users to move SQL-based extract, transform, and load (ETL) operations to production-level Spark or Databricks jobs with minimal friction. To enable this, everything possible is automated.
In the last five years, the team has completed a multi-step journey:
- 2016-2017: Cloudera infrastructure as a service (IaaS), with manual provisioning and configuration. This took a lot of time. Configurations tended to be inconsistent, leading to a lot of hard-to-find errors and a high mean time to recovery (MTTR) from errors.
- 2017-2018: Move to Cloudera IaaS with infrastructure as code (IaC). Used Ansible to systematize cluster creation and operations, reducing manual effort and reducing MTTR.
- 2019 to present: Azure Databricks platform as a service (PaaS), with maintenance and upkeep handled by the service provider. GUIs are available for infrastructure management; users can handle some tasks directly.
In the original setup, users would set up their own Cloudera, Spark, and Hadoop clusters, SSHing onto nodes and so on. It was impossible to scale up and down smoothly. The move to infrastructure as code helped: with Ansible, spinning clusters up, and even spinning them down, became easier.
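To illustrate why declaring infrastructure as code reduces inconsistent configurations, here is a minimal Python sketch of the idea behind tools like Ansible: the desired configuration is stated as data, and a planning step computes only the changes needed to reach it. The setting names, values, and the `plan_changes` and `apply_changes` helpers are hypothetical; this is not Ansible’s API or the team’s actual configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical desired state for every node in a cluster (illustrative settings only).
DESIRED: Dict[str, str] = {
    "java_version": "8",
    "spark_version": "2.4",
    "data_dir": "/data",
}


def plan_changes(actual: Dict[str, str], desired: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (setting, value) pairs that must be applied to reach the desired state."""
    return [
        (key, value)
        for key, value in sorted(desired.items())
        if actual.get(key) != value
    ]


def apply_changes(actual: Dict[str, str], changes: List[Tuple[str, str]]) -> Dict[str, str]:
    """Apply the planned changes and return the new node state (input is not mutated)."""
    updated = dict(actual)
    updated.update(changes)
    return updated


# A manually configured node that has drifted from the standard.
node = {"java_version": "7", "spark_version": "2.4"}
changes = plan_changes(node, DESIRED)
node = apply_changes(node, changes)

# Planning a second time finds nothing to do: the step is idempotent.
second_run = plan_changes(node, DESIRED)
print(changes)
print(second_run)
```

Because the plan is derived from the difference between actual and desired state, running it against every node converges them on the same configuration, which is the property that manual provisioning lacked.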
Databricks, Unravel Data, and Success
“We’re using Unravel to optimize our virtual machine choice on Databricks, optimize our jobs, making them quicker and making them cost less.” – James Fielder
In 2019, the team took the next step. They moved to Azure Databricks, a successful platform-as-a-service offering, rather than continuing to run their own VMs and similar infrastructure.
“Databricks,” according to Fielder, “has been a wonderful move for us. It’s really made the platform an awful lot better, and an awful lot easier for us to maintain internally.”
Now data comes in from Azure Blob Storage, SFTP servers, SQL servers, and a few other sources – no streaming as yet. Outputs are also typically in file form, or go to Azure Blob Storage or SFTP.
The team makes heavy use of Databricks services:
- Interactive and job clusters
- The job scheduler
- All the major languages that are supported on Databricks
- The Databricks file system
- Azure Data Lake gen2, with Hive metastore
- Azure Key Vault data services
- Tableau, running in Azure, for reporting
- Unravel Data for monitoring
The team saves a great deal of money by only paying for compute as they use it. They keep things simple, with no orchestrator (such as Airflow – see the Airflow talk from DataOps Unleashed – or Oozie). Fielder estimates that they were able to cut their cloud costs in half as part of the move to Databricks, with help from Unravel Data.
The team has used Unravel Data to save time, save money, and keep their data pipelines flowing. “Unravel is a really useful tool if you’re interested in observability and optimization in your Spark jobs,” says Fielder. “We’re trying to do a lot of things around optimization for our platforms at the moment. So we’re using Unravel to optimize our virtual machine choice on Databricks, optimize our jobs, making them quicker and making them cost less.”
“Unravel gives us a lot of visibility into what the jobs are doing,” Fielder continues; “how much memory they use over time, and all this kind of stuff that we really struggle to get ourselves and to build ourselves. We’re a big fan of Unravel, and we are going to be trying to get as much out of it as we can.”
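As a rough illustration of how memory-over-time visibility can feed a virtual machine choice, the sketch below picks the cheapest VM size whose memory covers a job’s observed peak plus some headroom. The VM names, sizes, prices, and memory samples are all made up for the example; this is not Unravel’s method and not real Azure pricing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VmSize:
    name: str              # hypothetical size name
    memory_gb: float       # memory available on this size
    price_per_hour: float  # hypothetical price, arbitrary currency


def choose_vm(samples_gb: List[float], sizes: List[VmSize],
              headroom: float = 0.2) -> Optional[VmSize]:
    """Pick the cheapest size that fits the peak memory sample plus headroom.

    Returns None when no size is large enough.
    """
    if not samples_gb:
        raise ValueError("at least one memory sample is required")
    required_gb = max(samples_gb) * (1 + headroom)
    candidates = [size for size in sizes if size.memory_gb >= required_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda size: size.price_per_hour)


# Illustrative catalogue and memory samples for a single job run.
SIZES = [
    VmSize("small", 14, 0.25),
    VmSize("medium", 28, 0.50),
    VmSize("large", 56, 1.00),
]
memory_samples_gb = [6.0, 11.5, 19.0, 17.2, 9.8]

choice = choose_vm(memory_samples_gb, SIZES)
print(choice.name if choice else "no size fits")
```

The headroom factor is a judgment call: too little risks out-of-memory failures, while too much pays for capacity the job never uses.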
Fielder has several recommendations for small teams that want to achieve similar results:
- Keep it simple, always
- Only introduce technology when it’s really needed
- Use PaaS, to offload management tasks
- Build your own tools to automate best practices
- Build libraries to embody common functionality and make it repeatable
- Build big Spark jobs to maximize parallelism, the compiler’s ability to check properties, and the availability of quality checks on data
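One way to read the advice about libraries and quality checks is to put each common check in a small reusable function, so that every pipeline applies it the same way. The sketch below uses plain Python rows rather than Spark DataFrames to stay self-contained, and the function names, column names, and sample data are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def check_required_columns(rows: List[Row], required: List[str]) -> List[str]:
    """Return a readable problem for every missing or empty required value."""
    problems = []
    for index, row in enumerate(rows):
        for column in required:
            if row.get(column) in (None, ""):
                problems.append(f"row {index}: missing value for '{column}'")
    return problems


def check_unique(rows: List[Row], key: str) -> List[str]:
    """Return a readable problem for every repeated value of the key column."""
    seen = set()
    problems = []
    for index, row in enumerate(rows):
        value = row.get(key)
        if value in seen:
            problems.append(f"row {index}: duplicate {key} '{value}'")
        seen.add(value)
    return problems


# Illustrative input with one empty value and one duplicate key.
vehicles = [
    {"vin": "A1", "make": "Ford"},
    {"vin": "B2", "make": ""},
    {"vin": "A1", "make": "Audi"},
]

problems = check_required_columns(vehicles, ["vin", "make"]) + check_unique(vehicles, "vin")
for problem in problems:
    print(problem)
```

Returning a list of problems, instead of raising on the first one, lets a pipeline decide whether to fail the job or only report.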
In the future, the Data Services team at Cox Automotive will be doing more with Azure capabilities and Unravel. They will be using Azure DevOps deployment comprehensively, and they are considering adopting an orchestrator, such as Airflow. Unravel will provide monitoring and observability for their data jobs. The team will use serverless where possible, as it’s even more hands-off and easy to manage than their current approaches. And they will be further optimizing data science approaches to the data they already have in the cloud.
The Move to DataOps with Waimak
“The idea with Waimak is that we create a flow, and that flow is effectively a program which represents a data pipeline.” – James Fielder
The team makes a practice of building and maintaining their own tools where possible. In particular, they are developing a DataOps framework, written in Scala, called Waimak, which they have made available as open source on GitHub.
Waimak wraps up the team’s best practices and ways of doing things in a reusable data engineering framework. With Waimak, data scientists can create their own data pipelines – without really needing to, for instance, know the intricacies of Spark.
Waimak lets users embed data flows in a Spark job and have the framework manage parallelism and deal with environmental concerns: getting secrets, handling table partitioning correctly, and similar tasks are all lifted into the framework layer. All of the team’s production data pipelines are written on top of it.
According to Fielder, “The idea with Waimak is that we create a flow, and that flow is effectively a program which represents a data pipeline. Then you place actions on that flow and that creates a sort of data structure, which represents the actions that Spark needs to take. And then once you’re done with that, you ask the executor to execute it. The executor works out what can be done in parallel and does this for you.” For more information, you can see a talk about Waimak.
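The description above can be sketched in a few lines of Python. This illustrates the general idea only (a flow as a data structure of actions, and an executor that derives what can run in parallel from each action’s inputs and outputs). It is not Waimak’s actual Scala API, and the `Flow`, `Action`, and `execute` names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    name: str
    inputs: Tuple[str, ...]    # labels this action reads
    output: str                # label this action produces
    run: Callable[..., object]


@dataclass
class Flow:
    """A pipeline described as data: nothing runs until it is executed."""
    actions: List[Action] = field(default_factory=list)

    def add(self, name: str, inputs: Sequence[str], output: str,
            run: Callable[..., object]) -> "Flow":
        self.actions.append(Action(name, tuple(inputs), output, run))
        return self


def execute(flow: Flow) -> Tuple[Dict[str, object], List[List[str]]]:
    """Run the flow in waves; actions in the same wave run in parallel."""
    results: Dict[str, object] = {}
    pending = list(flow.actions)
    waves: List[List[str]] = []
    with ThreadPoolExecutor() as pool:
        while pending:
            # An action is ready once all of its input labels have been produced.
            ready = [a for a in pending if all(i in results for i in a.inputs)]
            if not ready:
                raise ValueError("flow has unsatisfied or cyclic dependencies")
            futures = [
                (a, pool.submit(a.run, *(results[i] for i in a.inputs)))
                for a in ready
            ]
            for action, future in futures:
                results[action.output] = future.result()
            waves.append([a.name for a in ready])
            pending = [a for a in pending if a not in ready]
    return results, waves


# Hypothetical pipeline: two independent loads, then two dependent steps.
flow = (
    Flow()
    .add("load_sales", [], "sales", lambda: [100, 250, 75])
    .add("load_rate", [], "rate", lambda: 2)
    .add("convert", ["sales", "rate"], "converted",
         lambda sales, rate: [s * rate for s in sales])
    .add("total", ["converted"], "total", lambda converted: sum(converted))
)

results, waves = execute(flow)
print(waves)
print(results["total"])
```

Here the two load actions share a wave because neither depends on the other, while `convert` and `total` wait for their inputs. The user only declares actions and labels; the ordering and parallelism come from the executor.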
To the Cloud, and Beyond
“Once you adopt a tool, really use it as best as you can.” – James Fielder
Fielder finished by saying: “You know, once you adopt a tool, really use it as best as you can” – as demonstrated by the team’s use of the Azure cloud platform, Databricks, Unravel Data, and their own Waimak framework. “And don’t just follow the hype,” he continues, “because all of the fundamentals are the same. For instance, we’re still doing SQL. We’ll all still be doing SQL in 20 years’ time because it’s just great.”
While this blog post provides a summary, you can also view the Cox Automotive talk directly, along with other videos from DataOps Unleashed. You can also download The Unravel Guide to DataOps, which was made available for the first time during the conference.