At Airflow Summit 2021, Shivnath Babu, Unravel’s co-founder and CTO, and Hari Nyer, Senior Software Engineer, delivered a talk titled “Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow.” This story, along with the slides and videos included in it, comes from the presentation.
Data pipelines convert rich, varied, and high-volume data sources into insights that power the innovative data products that many of us run today. Shivnath represents a typical data pipeline using the diagram below.
In a data pipeline, data is continuously captured and then stored in a distributed storage system, such as a data lake or data warehouse. From there, a lot of computation happens on the data to transform it into the key insights that you want to extract. These insights are then published and made available for consumption.
Many enterprises have already built data pipelines on stacks such as Hadoop, or using solutions such as Hive and HDFS. Many of these pipelines are orchestrated with enterprise schedulers, such as Autosys, Tidal, Informatica, or Pentaho, or with native schedulers. For example, Hadoop comes with a native scheduler called Oozie.
In these environments, there are common challenges people face when it comes to their data pipelines. These problems include:
These challenges are causing many enterprises to modernize their stacks. In the process, they are picking innovative schedulers, such as Airflow, and they’re changing their stacks to incorporate systems like Databricks, Snowflake, or Amazon EMR. With modernization, companies are often striving for:
Shivnath shares even more goals of modernization, including removing resources as a constraint when it comes to how fast you can release apps and drive ROI, as well as reducing cost.
So why does Airflow often get picked as part of modernization? The goals that motivated the creation of Airflow tie in very nicely with the goals of modernization efforts. Airflow enables agile development and is better suited to cloud-native architectures than traditional schedulers, especially in terms of how fast you can customize or extend it. In keeping with the modern methodology of agility, Airflow is also available as a service from companies like Amazon and Astronomer.
At a high level, modernization has two main phases: Assess and Plan (Phase 1) and Migrate, Validate, and Optimize (Phase 2). The rest of the presentation dives deep into the key lessons that Shivnath and Hari have learned from helping a large number of enterprises migrate from their traditional enterprise schedulers and stacks to Airflow and modern data stacks.
Phase 1: Assess and Plan
The assessment and planning phase of modernization is made up of a series of other phases, including:
Shivnath said that he has learned two main lessons from the assessment and planning phase:
Lesson 1: Don’t underestimate the complexity of pipeline discovery
Multiple schedulers may be in use, such as Autosys, Informatica, Oozie, Pentaho, and Tidal. Worse, there may be no common pattern in how these pipelines work, access data, schedule and name apps, or allocate resources.
Lesson 2: You need very fine-grained tracking from a telemetry data perspective
Because pipeline discovery is so complex, fine-grained tracking is needed to estimate resource usage, analyze dependencies, and map the complexity and cost of running pipelines in a newer environment.
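To make the dependency-analysis idea concrete, here is a minimal Python sketch. It is not taken from the talk: the job names, dataset names, and the `reads`/`writes` record layout are all hypothetical stand-ins for whatever telemetry your schedulers and data platform actually expose. The sketch assumes each dataset has a single producing job, infers job-to-job dependencies from shared datasets, and derives an order in which jobs could be migrated so that upstream producers move first.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical telemetry: the datasets each discovered job reads and writes.
jobs = {
    "ingest_orders": {"reads": [], "writes": ["raw.orders"]},
    "load_customers": {"reads": [], "writes": ["dim.customers"]},
    "clean_orders": {"reads": ["raw.orders"], "writes": ["clean.orders"]},
    "daily_report": {
        "reads": ["clean.orders", "dim.customers"],
        "writes": ["report.daily"],
    },
}


def build_dependencies(jobs):
    """Map each job to the set of jobs that produce the datasets it reads."""
    # Assumption: every dataset is written by exactly one job.
    producers = {}
    for name, io in jobs.items():
        for dataset in io["writes"]:
            producers[dataset] = name

    deps = {}
    for name, io in jobs.items():
        # Datasets with no known producer (external sources) are skipped.
        deps[name] = {producers[d] for d in io["reads"] if d in producers}
    return deps


def migration_order(jobs):
    """Return jobs ordered so that every producer precedes its consumers."""
    return list(TopologicalSorter(build_dependencies(jobs)).static_order())


if __name__ == "__main__":
    for job in migration_order(jobs):
        print(job)
```

In practice the hard part is collecting the `reads` and `writes` information reliably across several schedulers, which is exactly why fine-grained telemetry matters. `TopologicalSorter` also raises `CycleError` if the discovered dependencies are circular, which can be a useful signal that the discovery data is incomplete or wrong.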
After describing the two lessons, Shivnath goes through an example to further illustrate what he has learned.
Shivnath then hands over to Hari, who speaks about the lessons learned during the migration, validation, and optimization phase of modernization.
While Shivnath shared various methodologies that have to do with finding artifacts and discovering the dependencies between them, there is also a need to instill a sense of confidence in the entire migration process. This confidence can be achieved by validating the operational side of the migration journey.
Data pipelines, regardless of where they live, are prone to suffer from the same issues, such as:
To maintain the overall quality of your data pipelines, Hari recommends constantly evaluating pipelines using three major factors: correctness, performance, and cost. Here’s a deeper look into each of these factors:
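As a rough illustration of evaluating a pipeline on all three factors, the Python sketch below compares a run of a migrated pipeline against a baseline run from the legacy environment. It is an assumption-laden example rather than anything shown in the talk: the `RunStats` fields, the `evaluate` function, the thresholds, and the sample numbers are all hypothetical, and real correctness checks are usually richer than a row count plus a checksum.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one pipeline run."""
    row_count: int     # rows written by the final stage
    checksum: str      # digest of the output data
    duration_s: float  # wall-clock run time in seconds
    cost_usd: float    # estimated compute cost of the run


def evaluate(baseline, migrated, max_slowdown=1.2, max_cost_ratio=1.0):
    """Check a migrated run against a baseline on the three factors.

    max_slowdown and max_cost_ratio are illustrative tolerances: by default
    the migrated run may be up to 20% slower but must not cost more.
    """
    return {
        "correctness": (
            baseline.row_count == migrated.row_count
            and baseline.checksum == migrated.checksum
        ),
        "performance": migrated.duration_s <= baseline.duration_s * max_slowdown,
        "cost": migrated.cost_usd <= baseline.cost_usd * max_cost_ratio,
    }


if __name__ == "__main__":
    legacy = RunStats(row_count=1_000_000, checksum="abc123",
                      duration_s=600.0, cost_usd=12.5)
    airflow = RunStats(row_count=1_000_000, checksum="abc123",
                       duration_s=540.0, cost_usd=14.0)
    for factor, passed in evaluate(legacy, airflow).items():
        print(f"{factor}: {'ok' if passed else 'needs attention'}")
```

Reporting the three factors separately, rather than as one pass/fail flag, makes it easier to see that a migrated pipeline can be correct and fast yet still more expensive than before.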
Hari then demos several use cases where he can apply the lessons learned. To set the stage, many of Unravel’s enterprise customers are migrating pipelines from more traditional on-prem schedulers, such as Oozie and Tidal, to Airflow. The examples in this demo are motivated by real scenarios that customers have faced in their migration journey.
As exemplified in the demos and throughout this blog, Shivnath and Hari have learned some invaluable lessons by migrating data pipelines from enterprise schedulers to Airflow. You can view Shivnath and Hari’s full session from Airflow Summit 2021 here. If you’re interested in assessing Unravel for your own data-driven applications, you can try Unravel for free or contact us to learn how we can help.