Common Pitfalls in Deploying Airflow for Data Teams
If you’re a data engineer, then you’ve likely at least heard of Airflow.
Apache Airflow is one of the most popular open-source workflow orchestration tools used for data pipelines.
DAG Folder, Scheduler and Web Server in One Repo
Airflow comes off as easy.
You can set it up locally, with or without Docker, or even spin up an MWAA environment on AWS.
You could do all of that easily.
But if your DAGs aren’t in a separate repo from your Airflow deployment, then the moment you need to push a quick change to a DAG, guess what happens: you’ll have to deploy your entire project.
In other words, you’ll likely be taking down your webserver and scheduler for a few seconds or minutes just to push some code changes. During that time, the jobs that are currently running may or may not continue. So, if you’ve got a job that’s been running for 15 hours…
Too bad.
It’s starting over.
This is far from ideal behavior.
Instead, your DAGs folder should live in a separate repo, or perhaps several repos, that push to a centralized location like an S3 bucket, which is then synced down to a file system attached to your Airflow instance(s).
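As a rough illustration of that push step, here is a hedged sketch of what a CI job in one of the DAG repos might do. The bucket and prefix names are hypothetical, and it assumes boto3 is installed and credentialed; your sync mechanism may well look different.

import pathlib

import boto3

s3 = boto3.client("s3")
bucket = "my-company-airflow-dags"  # hypothetical centralized DAG bucket
prefix = "team-analytics/"          # one prefix per DAG repo

# Push every DAG file in this repo to the shared bucket; a sync process on the
# Airflow side then pulls the bucket down into the mounted DAGs folder.
for path in pathlib.Path("dags").rglob("*.py"):
    s3.upload_file(str(path), bucket, prefix + str(path.relative_to("dags")))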
I have seen this approach both at some clients and in Shopify’s and Scribd’s articles. I’ll go over some other points that Shopify made later; for now, I’ll link to Scribd’s article. It focuses on breaking up a DAG monorepo, but you’ll also get an understanding of how they structured their DAGs folder.
Overall, replicating the problems you will face when deploying Airflow ahead of time is hard, but knowing more about Airflow before you hit them isn’t.
Not Using Features Airflow Provides
In particular, Hooks and Variables are two that come to mind.
Both of these features are self-explanatory. But if you’re learning Airflow while also trying to deploy and hit deadlines, I think they can easily get missed.
Now, instead of talking about a mistake, I want to talk about a project that implemented Hooks and Variables well. There isn’t much to say other than that it made my development process smooth.
Instead of constantly searching for Variables or wondering if a connection would work, I, as a data engineer, flew through development because I could easily point to the correct abstracted data sources and variables.
Just in case you haven’t been exposed to them, here is a quick background on Hooks and Variables.
Hooks
Hooks are interfaces to external platforms and databases like Hive, BigQuery, and Snowflake.
They are used to abstract the methods to connect to these external systems. This means that instead of referencing the same connection string over and over again, you can pull in a hook, like the examples below.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

s3_hook = S3Hook(aws_conn_id="aws_conn")
or
from airflow.providers.http.hooks.http import HttpHook

h = HttpHook(method='GET', http_conn_id=hubspot_conn_id)
Why are Hooks useful:
- Abstraction: Instead of having custom code in each DAG for each type of connection, you can use Hooks to avoid writing a different connection string for every database/source.
- Centralized Management: Connection information is stored centrally in Airflow’s metadata database. This means that credentials, host information, and other connection settings are managed in one place, which helps in security and maintenance.
- Extendability: Airflow has a lot of pre-built Hooks for popular systems, but if you have a custom or niche system, you can create a custom hook.
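To make that concrete, here is a minimal sketch of the two hooks above working together inside a single task. It assumes Airflow 2.x with the Amazon and HTTP provider packages installed, plus connections named "aws_conn" and "hubspot_conn" created in the Airflow UI; the bucket name and endpoint are hypothetical.

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.http.hooks.http import HttpHook


def copy_hubspot_contacts_to_s3():
    # Call the HubSpot API through the stored HTTP connection.
    http = HttpHook(method="GET", http_conn_id="hubspot_conn")
    response = http.run(endpoint="/crm/v3/objects/contacts")

    # Land the raw payload in S3 through the stored AWS connection.
    s3 = S3Hook(aws_conn_id="aws_conn")
    s3.load_string(
        string_data=response.text,
        key="raw/hubspot/contacts.json",
        bucket_name="my-data-lake",  # hypothetical bucket
        replace=True,
    )

Notice that no credentials or hostnames appear in the code — everything sensitive lives behind the two connection IDs.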
Variables
Variables are a simple key-value store within Airflow. They can also be stored encrypted, which is useful for data that needs to stay secure but isn’t managed by a connection ID. You can see a few examples below of how some teams might use Variables.
Why are Variables useful:
- Dynamic Configuration: Instead of hardcoding specific values in your DAGs, you can use Variables, making updating configurations easier without altering the DAG code.
- Security: Sensitive data can be stored as an encrypted variable within Airflow. This means secrets or passwords can be kept out of the DAG code and stored securely.
- Reusability: If multiple DAGs require the same piece of information, instead of replicating the data, you can store it as a Variable and reference it in the needed DAGs.
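Here is a hedged example of the kind of usage I’m describing — the keys are hypothetical ones you would create under Admin > Variables.

from airflow.models import Variable

# Plain key-value lookup, with a default so the DAG still parses if the key is missing.
target_schema = Variable.get("etl_target_schema", default_var="analytics")

# deserialize_json=True turns a JSON-valued Variable into a Python dict.
backfill_config = Variable.get("backfill_config", default_var={}, deserialize_json=True)

# Keys containing words like "api_key" or "secret" are masked in the UI and logs by default.
hubspot_api_key = Variable.get("hubspot_api_key")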
With Airflow constantly rolling out new versions, there are probably a few features that I have missed that would, once again, make my job easier. So do keep up to date with the new Airflow roll-outs!
Not Preparing For Scale
If you’re rushed to deploy Airflow, you might not realize, or you might not take the time to think through, how workers and schedulers will scale out as more jobs start to be deployed.
When you first start deploying Airflow DAGs, you won’t notice this issue. All your DAGs will run without a problem (unless you have a few long-running ones).
But then you start having 20, 30, 100 DAGs, and you’ll notice DAG runs sitting in the light green “running” state for a while before their tasks actually execute. Now, one issue might be that you need to change some configurations in your airflow.cfg (if you can’t tell, this is your new friend when you use Airflow), but another might be that you’re using the wrong executor.
So, this is a great time to review some of the executors that exist:
- SequentialExecutor – runs tasks one at a time, ideal for debugging.
- LocalExecutor – permits parallel task execution on a single machine.
- CeleryExecutor – distributes tasks across multiple nodes using the Celery framework.
- KubernetesExecutor – dynamically allocates tasks as isolated Kubernetes pods, ideal for cloud-native setups.
- DebugExecutor – tailored for in-depth task debugging using the Python debugger.
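If you’re not sure what your deployment is actually running with, a quick sanity check looks like this — a minimal sketch; run it in the same environment as your scheduler so it reads the same airflow.cfg.

from airflow.configuration import conf

print(conf.get("core", "executor"))        # e.g. LocalExecutor or CeleryExecutor
print(conf.getint("core", "parallelism"))  # cap on concurrently running tasks across the deployment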
But the truth is, it’s not even that simple. As Megan Parker from Shopify put it in the article Lessons Learned From Running Apache Airflow at Scale:
There’s a lot of possible points of resource contention within Airflow, and it’s really easy to end up chasing bottlenecks through a series of experimental configuration changes. Some of these resource conflicts can be handled within Airflow, while others may require some infrastructure changes.
Following this section, she discusses Pools, Priority Weights, Celery Queues, and Isolated Workers, and perhaps the solutions that Shopify came up with aren’t even the best fit for your setup. I think my favorite line is:
“through a series of experimental configuration changes.”
This isn’t a programming issue.
You’ll need to consider multiple factors to set up Airflow for scale. I could pretend to know every solution, but what I have used in the past might not work for everyone, and your fix might require a little testing before you can confirm you’re able to scale.
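For reference, the knobs Megan mentions surface as ordinary task arguments — the code below is the easy part; sizing the pools, queues, and workers behind them is where the experimentation comes in. A hedged sketch, assuming Airflow 2.4+ with the CeleryExecutor, where "reporting_pool" and "heavy_workers" are hypothetical names that must already exist in your deployment.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="resource_contention_example", start_date=datetime(2024, 1, 1), schedule=None):
    heavy_task = BashOperator(
        task_id="rebuild_large_table",
        bash_command="echo rebuilding...",
        pool="reporting_pool",  # caps how many tasks sharing this pool run at once
        priority_weight=10,     # scheduled ahead of lower-weight tasks competing for pool slots
        queue="heavy_workers",  # only Celery workers subscribed to this queue pick it up
    )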
Airflow Is Easy Until It’s Not
Airflow is nearly ten years old and has continued to be adopted at companies ranging from start-ups to enterprises (I say that because some people don’t think enterprises are using it). Anecdotally speaking, I have worked with several such companies this year alone.
That being said, I’d put money down that Azure Data Factory is probably doing more in terms of total jobs.
Regardless, Airflow is challenging to manage in production. Perhaps that’s why there are so many alternatives to Airflow popping up.
Don’t let the basic DAG tutorials fool you. Building DAGs is easy. Managing and deploying Airflow is hard.
But, I guess that really is how it is with all things.
Your first website is easy to build because it’s local.
Building an ML model is “easy-ish.”
But putting things into operation is always hard.
Thanks for reading! If you’d like to read more about data engineering, then check out the articles below.
How to screen 1000 resume in 50 sec with SQL?
Normalization Vs Denormalization – Taking A Step Back
What Is Change Data Capture – Understanding Data Engineering 101