Migrating from a Monolithic Orchestrator to Apache Airflow
Photo by Corinne Kutz on Unsplash
Before we knew better
Our orchestration system started as a simple internal solution to manage event pipelines and trigger downstream jobs. Over time, as more workflows and dependencies were added, it gradually evolved into a tightly coupled monolithic scheduler that became increasingly difficult to understand and maintain.
Understanding how a workflow executed often meant looking through multiple files, configurations and database tables.
For newer team members, onboarding into the system took time because much of the workflow context was distributed across different parts of the codebase. Even relatively small changes required careful coordination to ensure existing pipelines continued to work as expected. Similarly, debugging typically involved manually tracing logs and rerunning jobs to better understand execution behavior.
Limitations of our legacy design
We had a monolithic architecture written in Clojure that bundled all our event pipelines together, added dependencies between them and triggered a Lambda function.
Legacy Workflow

This Lambda function added a single monolithic step to the EMR cluster. If there was an issue in any one of the pipelines, the entire flow would fail due to the single step on the cluster.
We did not have step-wise monitoring in the old design, so during on-call situations it became very difficult to identify which part of the pipeline was causing the issue.
There was no single place to answer basic questions like:
- What runs first?
- What happens if this step fails?
- How do I re-run just one part safely?
The scheduler worked, but it was hard to understand, hard to maintain and even harder to explain. That’s when we realized we needed a better way.
What we actually needed
Our aim was less about fancy scheduling features and more about making our daily work easier and more reliable.
1. Simpler onboarding, less mental overhead
Our existing step scheduler was built in Clojure and came with a steep learning curve. New engineers had to spend a lot of time understanding how jobs were defined, chained and monitored before they could confidently make changes. In contrast, Apache Airflow's Python-based DAGs made workflows far more readable. The structure of dependencies, retries and scheduling became obvious at a glance, and because Python was already familiar to a much broader set of engineers, onboarding became significantly faster. The shift improved maintainability and made the scheduler more accessible to the wider engineering team.
2. One failure shouldn’t kill everything
In our old setup, if one step in a multi-step job failed, the entire job would terminate. This was painful, especially for long-running pipelines. A single transient issue like a network glitch or a temporary dependency failure meant rerunning everything from scratch. We needed better fault isolation, where a failed step doesn’t automatically bring down the whole workflow.
3. Clear visibility and easy recovery from failures
When something failed, it wasn’t always obvious what failed or why. We wanted a clear view of each step or task, with the ability to:
- Quickly see which step failed
- Get alerted when failures happen
- Re-trigger only the failed step instead of restarting the entire job
In short, we were looking for a workflow system that was easier to understand, more forgiving when things go wrong and faster to recover from failures without adding more operational complexity.
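To make the fault-isolation and partial re-run requirements concrete, here is a minimal plain-Python sketch (not Airflow code, and not the scheduler described in this post). It assumes the steps are independent of each other; the function names `run_steps` and `failed_steps` are hypothetical and exist only for illustration.

```python
def run_steps(steps, only=None):
    """Run each independent step and record its outcome separately.

    `steps` maps a step name to a zero-argument callable. When `only` is
    given, just those steps are run, which is how a failed step can be
    re-triggered without restarting the rest.
    """
    results = {}
    for name, step in steps.items():
        if only is not None and name not in only:
            continue
        try:
            step()
            results[name] = "success"
        except Exception as error:
            # Record the failure and keep going instead of aborting everything.
            results[name] = f"failed: {error}"
    return results


def failed_steps(results):
    """Return the names of the steps that need to be re-run."""
    return [name for name, status in results.items() if status != "success"]
```

Calling `run_steps(steps, only=failed_steps(previous_results))` re-runs only what failed. An orchestrator applies the same idea per task, with the added knowledge of which tasks depend on which.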
Why we chose Apache Airflow
Once we were clear about our needs, Apache Airflow turned out to be a good fit: it is designed to act as a robust orchestration platform.
1. Airflow Was Already Available and Production-Ready
One of the biggest factors was that Apache Airflow was already set up and running in our environment. This allowed us to avoid the upfront cost of evaluating, deploying and writing a new scheduler from scratch. Instead, we could focus on solving workflow problems rather than infrastructure problems.
2. Better Observability and Debugging
Airflow provides rich visibility into workflows:
- Task-level logs
- Retry history
- Execution timelines
- Clear success/failure states
- Explicit dependencies between tasks
This significantly improved our ability to debug failures compared to the limited visibility offered by the step scheduler.
3. Support for Retries, Backfills and Scheduling
Apache Airflow provides core orchestration capabilities such as configurable retries, historical backfills, and flexible time- and event-based scheduling out of the box. These features are built into the platform and require minimal configuration. As a result, we were able to eliminate a large amount of custom logic from our legacy scheduler. This simplified our workflows and reduced long-term maintenance overhead.
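As an illustration of what a configurable retry policy does for a single task, here is a minimal plain-Python sketch. It is not Airflow's implementation; the function name `run_with_retries` and its parameters are assumptions made for this example, loosely mirroring the idea of a retry count and a delay between attempts.

```python
import time


def run_with_retries(task, retries=2, retry_delay_seconds=0.0, sleep=time.sleep):
    """Call `task` until it succeeds or the retry budget is exhausted.

    Returns (result, attempts). Re-raises the last error once the initial
    attempt plus `retries` further attempts have all failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise
            # Wait before the next attempt so transient issues can clear.
            sleep(retry_delay_seconds)
```

With a policy like this declared once per task, a transient failure costs one more attempt of that task rather than a rerun of the whole pipeline.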
4. Separation of Orchestration and Business Logic
With Airflow, orchestration logic lives in DAGs while business logic lives in independent services or jobs. This separation made workflows easier to test, reason about and evolve over time, something that was harder to enforce with the legacy step scheduler.
Migration approach
The first step in the migration was mapping the existing scheduler concepts into Airflow primitives.
- Each pipeline was converted into an Airflow DAG.
- Existing steps became Airflow tasks/operators.
- Dependencies were explicitly defined using DAG relationships instead of implicit execution logic.
This made the execution flow much easier to understand and visualize.
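To show what explicitly declared dependencies make possible, here is a minimal plain-Python sketch (not Airflow code) that describes one pipeline's steps and their upstream dependencies as data, then derives an execution order and the set of steps affected by a failure. The step names are hypothetical; in a real DAG the same relationships would be declared between tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical steps for one pipeline: each key lists the steps it depends on.
PIPELINE_DEPENDENCIES = {
    "create_cluster": set(),
    "ingest_events": {"create_cluster"},
    "aggregate_events": {"ingest_events"},
    "export_report": {"aggregate_events"},
    "terminate_cluster": {"export_report"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream_of(step, dependencies):
    """Return every step that directly or transitively depends on `step`."""
    affected = set()
    frontier = [step]
    while frontier:
        current = frontier.pop()
        for candidate, upstream in dependencies.items():
            if current in upstream and candidate not in affected:
                affected.add(candidate)
                frontier.append(candidate)
    return affected
```

Once dependencies are data rather than implicit execution logic, questions such as "what runs first?" and "what is affected if this step fails?" can be answered by inspecting the graph.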

In the new setup, Airflow is responsible for workflow coordination, scheduling DAGs, managing task dependencies, handling retries, monitoring execution and sending notifications. The compute-heavy processing workloads such as Spark, Hive and Presto jobs run independently on transient EMR clusters.
The execution flow starts when a scheduled trigger initiates a DAG run in Airflow. Airflow then creates an EMR cluster, submits the required processing steps using EMR operators and monitors their execution using EMR sensors such as EmrStepSensor.
During this process the sensor keeps polling until the step reaches a terminal state such as COMPLETED, FAILED or CANCELLED. If the step completes successfully, Airflow marks the sensor task as successful and proceeds with downstream tasks. If the step fails, the sensor task fails on its next poll and Airflow applies the configured retry policies, failure callbacks or alerting mechanisms.
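The polling behaviour can be pictured with a minimal plain-Python sketch. This is not the EmrStepSensor implementation: `wait_for_step`, `StepFailed` and the `get_state` callable are hypothetical names for this example, and the terminal states are the ones named in the text.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


class StepFailed(Exception):
    """Raised when a step ends in a terminal state other than COMPLETED."""


def wait_for_step(get_state, poke_interval=60.0, timeout=3600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `get_state()` until the step reaches a terminal state.

    Returns "COMPLETED" on success, raises StepFailed on FAILED or CANCELLED,
    and raises TimeoutError if no terminal state is seen within `timeout`.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == "COMPLETED":
            return state
        if state in TERMINAL_STATES:
            raise StepFailed(f"step ended in state {state}")
        if clock() >= deadline:
            raise TimeoutError("step did not reach a terminal state in time")
        # Not terminal yet: wait one interval before asking again.
        sleep(poke_interval)
```

The polling interval bounds how quickly a failure is noticed, which is why a failed step surfaces on the next poll rather than instantly.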
This allows Airflow to track the progress of jobs while EMR handles the actual processing and compute resources. The Airflow UI reflects the current state of each task, making it easy to track long-running jobs and troubleshoot failures.
In addition to status monitoring, Airflow captures execution metadata and integrates with EMR logs stored in locations such as S3 and CloudWatch. This allows us to quickly investigate failed jobs by navigating from the Airflow task instance to the corresponding EMR step logs.
Once all submitted steps have completed successfully, Airflow initiates cluster termination using EmrTerminateJobFlowOperator. Since we use transient clusters, this ensures that compute resources exist only for the duration of the workflow, helping optimize infrastructure costs and avoid idle cluster usage.
Finally, Airflow updates the DAG status, records execution metrics and sends notifications through configured channels such as Slack and email. This end-to-end orchestration model provides a clear separation between workflow management and data processing while improving observability, reliability and operational efficiency.
Separating business logic
Previously, orchestration and execution logic were tightly coupled in the same system. During migration:
- Spark jobs, Hive queries and scripts were extracted into reusable components.
- Airflow was used purely for orchestration.
- DAGs became lightweight and focused only on workflow coordination.
This improved maintainability and simplified testing.

Incremental rollout strategy
Instead of migrating everything at once:
- Low-risk workflows were migrated first
- Legacy scheduler and Airflow ran in parallel initially
- Workflow executions were compared across both systems over a defined observation window, validating not only data correctness but also runtime behavior and infrastructure efficiency. Key metrics such as total execution time, EMR instance-hours consumed, job completion rates and failure patterns were analyzed across multiple pipeline runs.
- Critical pipelines were migrated gradually
This minimized operational risk during the transition.
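The parallel-run comparison described above can be sketched as follows. This is a minimal illustration, not the tooling used during the migration: the `RunMetrics` record and its field names are hypothetical, chosen to mirror the metrics listed in the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run record collected during the observation window."""
    execution_minutes: float
    emr_instance_hours: float
    succeeded: bool


def summarize(runs):
    """Aggregate a non-empty list of RunMetrics into summary metrics."""
    return {
        "avg_execution_minutes": mean(r.execution_minutes for r in runs),
        "total_emr_instance_hours": sum(r.emr_instance_hours for r in runs),
        "completion_rate": sum(r.succeeded for r in runs) / len(runs),
    }


def compare(legacy_runs, airflow_runs):
    """Return airflow-minus-legacy deltas for each summarized metric."""
    legacy, airflow = summarize(legacy_runs), summarize(airflow_runs)
    return {name: airflow[name] - legacy[name] for name in legacy}
```

A negative delta for execution time or instance-hours, together with a non-negative delta for completion rate, is the kind of signal that would support moving the next pipeline over.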
Logging and observability improvements
The previous system required manually tracing logs across multiple components.
With Airflow:
- Each task had centralized logs
- DAG execution history became visible through the UI
- Failures could be traced directly from task graphs
This reduced troubleshooting effort for developers.
Success metrics (post migration)

Conclusion
Migrating from a monolithic orchestration system to Apache Airflow was more than just a technology change. It was an architectural shift toward a more scalable, maintainable and observable workflow platform.
By decoupling orchestration from business logic, standardizing workflow management and leveraging Airflow’s built-in capabilities such as scheduling, retries and monitoring, we significantly reduced operational complexity and improved developer experience.
While there are still areas we plan to improve, moving to Airflow gave us a solid foundation to scale our data and event-driven workflows more reliably in the future.
Migrating from a Monolithic Orchestrator to Apache Airflow was originally published in helpshift-engineering on Medium.