Simplifying the creation of DAGs
In the first story about an Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.
That setup lets us scale Airflow workers and executors, but some problems remain. This article shows how to:
- Use the Airflow KubernetesPodOperator to isolate all business rules from the Airflow pipelines;
- Create YAML DAGs with schema validation, to make Airflow simpler for some users;
- Define a pipeline pattern.
Kubernetes Pod Operator
This is accomplished by using only the KubernetesPodOperator, so users keep all their code (and business rules) in their own repository/Docker image.
Of course, we lose some great operators that are already available, but this way our pipelines stay easier to read and free of Airflow dependencies.
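As a sketch of what such a pipeline looks like, here is a minimal DAG whose only operator is the KubernetesPodOperator. The image name, namespace and task id are illustrative assumptions, and the import path matches Airflow 1.10:

```python
# Minimal sketch: every task is a KubernetesPodOperator, so the business
# logic lives in the user's Docker image, not in the Airflow repository.
# Image name, namespace and task id are illustrative assumptions.
from datetime import datetime


def pod_task_kwargs(task_id, image, arguments):
    """Build keyword arguments for a KubernetesPodOperator task."""
    return {
        "task_id": task_id,
        "name": task_id.replace("_", "-"),  # pod names cannot contain "_"
        "namespace": "airflow",
        "image": image,
        "arguments": arguments,
        "get_logs": True,
        "is_delete_operator_pod": True,  # clean up pods after each run
    }


def build_dag():
    """Wire the DAG. Requires an Airflow 1.10 installation, so it is
    only defined here, not executed."""
    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    with DAG("sample_pipeline", start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        KubernetesPodOperator(
            **pod_task_kwargs(
                "extract_data",
                "mycompany/docker-test:latest",
                ["--agg", "1", "--name", "test"],
            )
        )
    return dag
```

Because every task is just "run this image with these arguments", the DAG file carries no business logic at all.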
Simplifying docker images
When we decide to use the KubernetesPodOperator, we have to consider that users will be working with Docker images. This can create other problems:
- Lack of Docker knowledge;
- Lots of similar images in the Docker repository.
To solve this, we created the concept of a “base” image.
Whenever users want to run their own Docker image, they simply reference it in the pipeline definition.
But if they have Python code in any repository of the company, they can use the base image instead.
The base image works like this:
- It is a Docker image with Python, git and some common libraries, like pandas;
- It receives arguments like “which Python file should I run?” or “do you need to install any other requirements?”;
- The entrypoint of this image performs a git clone of the repository, installs the requirements and executes the file.
Example:

args: ["--agg", "1", "--name", "test"]

In this case, the image will sparse-checkout the sample_project working directory in the docker-test repository and run exec.py like this:

python exec.py --agg 1 --name test
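The entrypoint logic described above can be sketched in Python. The flags --repository, --workdir and --entry are invented for illustration; they are not the original image's real interface:

```python
# Sketch of the base image's entrypoint: clone the repo, install its
# requirements, then run the target file with the remaining arguments.
# The flags --repository, --workdir and --entry are assumptions made
# for illustration, not the original image's interface.
import subprocess
import sys

ENTRYPOINT_FLAGS = ("--repository", "--workdir", "--entry")


def split_args(argv):
    """Separate entrypoint flags from arguments passed through to the script."""
    opts, passthrough = {}, []
    tokens = iter(argv)
    for token in tokens:
        if token in ENTRYPOINT_FLAGS:
            opts[token.lstrip("-")] = next(tokens)
        else:
            passthrough.append(token)
    return opts, passthrough


def build_command(entry_file, passthrough):
    """Final command, e.g. ['python', 'exec.py', '--agg', '1']."""
    return [sys.executable, entry_file] + passthrough


def main(argv):
    # Not executed here: inside the image, this runs as the ENTRYPOINT.
    opts, extra = split_args(argv)
    workdir = "repo/" + opts["workdir"]
    subprocess.check_call(["git", "clone", "--depth", "1",
                           opts["repository"], "repo"])
    subprocess.check_call([sys.executable, "-m", "pip", "install",
                           "-r", workdir + "/requirements.txt"])
    subprocess.check_call(build_command(workdir + "/" + opts["entry"], extra))
```

Anything not recognized as an entrypoint flag is forwarded untouched to the user's script, which is how "--agg 1 --name test" ends up on the exec.py command line.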
Using YAML to create DAGs
One good thing about Airflow DAGs being written in Python is the flexibility to write our own code in the pipeline. But that flexibility also brings problems:
- We can’t forbid people from using other operators inside Airflow;
- Many business rules end up inside the DAG code;
- Programming patterns differ a lot between DAGs;
- There are some default configurations that some users are not aware of, like the image pull policy.
The concept created here is very simple:
- The user adds/updates a YAML file containing all the info necessary to create the DAG;
- The YAML is validated against a schema using Cerberus, and the DAG is generated from it;
- The generated DAG Python file is pushed to master.
YAML Dags Validation
The Cerberus library was used to validate each YAML file against a schema:
- Set mandatory fields;
- Apply regex rules;
- Set dict and list fields;
- Set field types;
Another validation is checking whether Airflow itself considers the generated DAG file valid. This can be done by executing CLI commands inside the Airflow installation and checking for error messages.
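A cheap version of that check is simply running the generated file with the Python interpreter and looking at the exit status, which catches import and syntax errors before the file ever reaches Airflow. This is a sketch of that idea, not the exact commands used in the original setup:

```python
# Sketch: a minimal "is this generated DAG file importable?" gate,
# run during CI/CD before the file is pushed. The real check can also
# use Airflow CLI commands and inspect their error messages.
import os
import subprocess
import sys
import tempfile


def dag_file_importable(path):
    """Return (ok, stderr) for a generated DAG file."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr


# Quick demo with a stand-in for a generated DAG file:
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("import datetime\n")  # a trivially valid "DAG" file
ok, err = dag_file_importable(f.name)
os.unlink(f.name)
```

If the subprocess fails, `err` contains the traceback, which can be surfaced in the CI/CD log for the user who edited the YAML.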
Python Dag Generator
The DAG is generated using the jinja2 Python library. As with an HTML page, you create a template and render it, converting the YAML description into Python. The library itself provides the methods for this rendering.
When do we create these Python files? In our case, these scripts run during CI/CD.
After generating the new files, the updates are pushed to the git repository from which Airflow syncs all the DAGs.
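The rendering step can be sketched with jinja2 like this. The template and field names are a toy version, not the original one:

```python
# Sketch: turning a parsed YAML pipeline description into a DAG file
# with jinja2. The template and field names are toy assumptions.
from jinja2 import Template

DAG_TEMPLATE = Template("""\
from airflow import DAG

dag = DAG(
    "{{ dag_id }}",
    schedule_interval="{{ schedule }}",
)
{% for task in tasks %}
# task: {{ task.name }} -> runs image {{ task.image }}
{% endfor %}""")

# In practice this dict comes from yaml.safe_load() on the user's file.
pipeline = {
    "dag_id": "sample_pipeline",
    "schedule": "@daily",
    "tasks": [{"name": "extract", "image": "mycompany/docker-test"}],
}
rendered = DAG_TEMPLATE.render(**pipeline)
```

The rendered string is what gets written to a .py file and pushed to the DAG repository.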
Here I’ll show an example of a DAG as a YAML file and its conversion. Some of the functions in the Python file were “hidden” in a utils file, so the generated DAG files are smaller and don’t expose some default parameters.
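Since the original example is not reproduced here, this is a hedged sketch of what such a YAML pipeline description could look like; all field names are assumptions:

```yaml
# Illustrative only: field names are assumptions, not the original schema.
dag_id: sample_pipeline
schedule: "@daily"
tasks:
  - name: extract
    image: mycompany/docker-test
    arguments: ["--agg", "1", "--name", "test"]
```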
For a team or company that already knows Airflow and Python very well, all of this may look like we are “removing” Airflow features.
But in our case, it simplified the process for new users and gave us (the team responsible for Airflow) some advantages:
- The pipeline in YAML is a description of the job with no dependency on Airflow;
- We can change the generator’s behavior and recreate all DAGs very easily;
- All of our DAGs are very similar, and everyone has to use dockerized applications;
- For people who don’t know Docker, we can create more base images to run their code;
- And most importantly, Airflow was very quickly accepted by many users in our stack.
Thank you for reading this :)