Our development cycle has four steps that we have found to be both the most effective and the most efficient:
- Local development
- Run in a docker-compose environment
- Run in a staging environment
- Release to production
This four-step process ensures that we can identify problems before they reach production, fix them quickly, and keep everything contained.
Of course, we don’t want the testing phase to add too much overhead, so if our environments are set up correctly, testing in each of them should take no more than a few minutes.
Local development: The obvious step of developing our Python code locally and making sure it runs in our local Airflow environment.
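To make this concrete, here is a minimal sketch of the kind of DAG we iterate on at this stage. The `pipelines.orders` module, the task names, and the pipeline itself are hypothetical, data handoff between tasks is omitted for brevity, and the `dag.test()` call at the bottom assumes Airflow 2.5 or later.

```python
# dags/orders_pipeline.py -- hypothetical example DAG for local iteration
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Business logic lives outside the DAG file so it can also run without Airflow
from pipelines.orders import extract_orders, load_orders  # hypothetical module

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    extract >> load

if __name__ == "__main__":
    # Airflow 2.5+ can run the whole DAG in a single local process for a quick check
    dag.test()
```

Running the file directly executes the DAG in one process, which keeps the local feedback loop short before we move on to docker-compose.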
Docker-compose environment: After finishing developing our pipelines locally and making sure everything works, we spin up this environment using the same Docker image we use in our staging and production environments. This step helps us easily locate and fix problems in our pipelines that we didn’t encounter during development, without waiting for a k8s deployment or CI pipelines to run.
Debugging our DAGs in a semi-production system with docker-compose is really easy, and we can fix issues and test again quickly.
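Because this environment runs the same image as staging and production, it is also a convenient place for cheap sanity checks on the DAG folder. Below is a minimal sketch of a DagBag integrity test, a common community pattern rather than anything specific to our setup; the file name and layout are hypothetical, and it can be run with pytest inside the scheduler container.

```python
# tests/test_dag_integrity.py -- generic DAG integrity checks (hypothetical file layout)
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Load every DAG file from the DAGs folder the same way the scheduler would
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_every_dag_has_tasks():
    # Catch DAG files that import fine but define no work
    dag_bag = DagBag(include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"DAG {dag_id} has no tasks"
```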
Staging environment: Our staging environment is almost identical to production. It runs on a Kubernetes cluster deployed with Helm 3; the difference is that we create a staging environment for each git branch. After development is finished and everything works as expected in the docker-compose environment, the developer pushes their code to the GitLab repository and opens a Merge Request, and our CI/CD starts building an image with the new code and deploying it to the cluster in a dedicated namespace. We use git-sync for our DAG folder in this environment, so we can quickly fix issues we find here without waiting for a CD job to deploy our image again. When the code is merged, this environment is deleted.
Production environment: This is the step where we publish our DAGs to production. The developer merges their changes into master, and the CI/CD process picks up the new changes and deploys them to production. We have reached a point where it is very rare to hit a problem in our production pipelines that we didn’t already catch earlier in the cycle.
Wrapping things up with some concrete tips:
- Structure your project so that your business logic is isolated and accessible whether you want to run it from Airflow or from another orchestration system (see the sketch after this list).
- Use docker-compose for development, and Kubernetes with Helm for staging and production deployments. A semi-production environment with docker-compose makes it much easier to locate and fix issues in your pipelines and keeps the development cycle short and effective.
- As part of your CI/CD, make sure you create a staging environment for each Merge Request. That way you can test your DAGs in an almost equivalent environment before they go to production.
- In your staging environment, make sure your DAG folder is git-synced with the deployment branch. This lets you quickly fix issues in staging without waiting for a CI/CD pipeline to finish.
- Set up your environments correctly to ensure this process is efficient and not time consuming. Follow docker-compose and Helm best practices and share configuration between them.
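To make the first tip concrete, here is a minimal sketch of what isolated business logic can look like; the module path and function names are hypothetical, and the bodies are stubbed. The DAG file then stays a thin wrapper that only wires these functions into tasks, as in the local-development example above, and the same functions can be called from a plain script, a cron job, or another orchestrator.

```python
# pipelines/orders.py -- hypothetical business-logic module with no Airflow imports
def extract_orders(execution_date="2023-01-01"):
    """Fetch raw order records for the given date (stubbed for the example)."""
    return [{"order_id": 1, "date": execution_date}]


def load_orders(orders=None):
    """Persist the orders (stubbed for the example)."""
    orders = orders if orders is not None else extract_orders()
    print(f"loading {len(orders)} orders")


if __name__ == "__main__":
    # The same entry points run outside Airflow, e.g. from a plain Python script
    load_orders(extract_orders())
```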
Your production environment is a temple. Following the steps above will help you make sure that all code entering it has been carefully tested.
Learn more about Databand’s continuous data observability platform and how it helps detect data incidents earlier, resolve them faster and deliver more trustworthy data to the business. If you’re ready to take a deeper look, book a demo today.