Airflow with YAML DAGs and the Kubernetes Operator

Rodrigo Lazarini Gil
4 min read · Apr 28, 2020

Simplifying the creation of DAGs


In the first story about an Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.

That setup allows us to scale Airflow workers and executors, but some problems remain. This article shows how to:

  1. Use the Airflow Kubernetes Pod Operator to isolate all business rules from the Airflow pipelines;
  2. Create YAML DAGs with schema validation, to simplify the usage of Airflow for some users;
  3. Define a pipeline pattern.

Kubernetes Pod Operator

The isolation is accomplished by using only the Kubernetes Pod Operator, so users keep all of their code (and business rules) in their own repository/Docker image.

Of course we lose some great operators that are already available, but this way our pipelines stay easier to read and free of any Airflow dependencies.
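
For reference, here is a minimal sketch of what such a task can look like with the Kubernetes Pod Operator on Airflow 1.10.x (the image name, namespace and arguments are placeholders, not our actual setup):

from datetime import datetime

from airflow import DAG
# Import path for Airflow 1.10.x; in Airflow 2 the operator lives in the
# cncf.kubernetes provider package.
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="sample_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2020, 1, 1),
)

# The task only points to the user's Docker image: all business rules live
# inside that image, none inside Airflow.
run_job = KubernetesPodOperator(
    task_id="run-my-job",
    name="run-my-job",
    namespace="airflow",
    image="my-registry/my-team/my-job:latest",  # placeholder image
    cmds=["python"],
    arguments=["main.py", "--date", "{{ ds }}"],
    image_pull_policy="Always",
    is_delete_operator_pod=True,
    get_logs=True,
    dag=dag,
)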

Simplifying Docker images

When we decide to use the Kubernetes Pod Operator, we have to keep in mind that users will be working with Docker images. This can create other problems:

  • Lack of Docker knowledge;
  • Lots of similar images in the Docker repository;

To solve this, we created a concept called “base” image.

Every time someone wants to use their own Docker image, they just need to add a custom task.

But if they have Python code in any of the company's repositories, they can use the base image instead.

The base image works like this:

  • A Docker image with Python, git and some common libraries, like pandas;
  • It receives arguments like “which Python file should I run?” or “do you need to install any other requirements?”;
  • The entrypoint of the image performs a git clone of the repository, installs the requirements, and executes the main file.

Example of such a task:

name: base-pod
base:
  repository: docker-test
  workdir: sample_project
  main_file: exec.py
  requirements: False
  args: ["--agg", "1", "--name", "test"]

In this case, the image will sparse-checkout the sample_project workdir from the docker-test repository and run exec.py like this:

python exec.py --agg 1 --name test
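
Under the hood, the entrypoint does roughly the following. This is a simplified sketch, not our actual image: the flag names, repository URL and paths are illustrative, and the real entrypoint does a sparse checkout of the workdir rather than a full clone.

#!/usr/bin/env python
"""Simplified sketch of the "base" image entrypoint (illustrative only)."""
import argparse
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--repository", required=True)   # e.g. docker-test
    parser.add_argument("--workdir", required=True)      # e.g. sample_project
    parser.add_argument("--main-file", required=True)    # e.g. exec.py
    parser.add_argument("--requirements", default="False")
    args, extra_args = parser.parse_known_args()          # extra args go to the user's script

    # Clone the user's repository (the real image only checks out the workdir).
    subprocess.check_call(
        ["git", "clone", "--depth", "1",
         "git@github.com:my-company/%s.git" % args.repository, "/app"]
    )
    workdir = "/app/%s" % args.workdir

    # Optionally install the project's own requirements.
    if args.requirements.lower() == "true":
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", workdir + "/requirements.txt"]
        )

    # Finally, run the user's main file with the remaining arguments.
    subprocess.check_call([sys.executable, args.main_file] + extra_args, cwd=workdir)

if __name__ == "__main__":
    main()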

Using YAML to create DAGs

One good thing about Airflow DAGs being written in Python is the flexibility to write our own code for each pipeline.

But:

  • We can’t forbid people from using other operators inside Airflow;
  • Many business rules end up inside the DAG code;
  • Programming patterns differ a lot from one DAG to another;
  • There are default configurations that some users are not aware of, like image pull policy or Kubernetes annotations.

The concept created here is very simple:

  1. The user adds/updates a YAML file containing all the necessary info to create the DAG;
  2. The YAML is validated (using cerberus) and the DAG is generated;
  3. The generated Python DAG file is pushed to master.

YAML DAG Validation

The cerberus library is used to check that each YAML file follows the expected schema:

  • Set mandatory fields;
  • Apply regex rules;
  • Set dict and list fields;
  • Set field types;

Another validation is to check whether Airflow itself considers the generated DAG file valid. This can be done by executing CLI commands inside the Airflow installation and checking for error messages.
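
A minimal sketch of the schema check with cerberus (the schema below only covers a subset of the fields from the base-pod example, and the rules and paths are illustrative):

import yaml
from cerberus import Validator

# Illustrative subset of the rules applied to a YAML DAG file.
schema = {
    "name": {"type": "string", "required": True, "regex": "^[a-z0-9-]+$"},
    "base": {
        "type": "dict",
        "schema": {
            "repository": {"type": "string", "required": True},
            "workdir": {"type": "string", "required": True},
            "main_file": {"type": "string", "required": True},
            "requirements": {"type": "boolean"},
            "args": {"type": "list", "schema": {"type": "string"}},
        },
    },
}

with open("dags/yaml/base-pod.yaml") as f:  # hypothetical path
    document = yaml.safe_load(f)

validator = Validator(schema)
if not validator.validate(document):
    raise ValueError("Invalid YAML DAG: %s" % validator.errors)

For the Airflow check, one option is running the CLI (for example, airflow list_dags on Airflow 1.10) against the generated files and looking for error messages; loading the files into a DagBag and inspecting import_errors also works.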

Python DAG Generator

The DAG is generated using the jinja2 Python library. Just like templating an HTML website, you create a template and render the YAML file’s contents into a Python file.
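
A minimal sketch of the generation step (the template name and file paths are hypothetical):

import yaml
from jinja2 import Environment, FileSystemLoader

# Load the YAML definition and render the Python DAG from a jinja2 template.
env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("dag.py.j2")  # hypothetical template name

with open("dags/yaml/base-pod.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

with open("dags/generated/%s.py" % config["name"], "w") as out:
    out.write(template.render(**config))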

Deploy

When do we create these Python files? In our case, the generator scripts run during CI/CD.

After the new files are created, the updates are pushed to a git repository from which Airflow syncs all the DAGs.

Example

Here I’ll show an example of a DAG as a YAML file and its conversion. Some of the functions in the Python file were “hidden” in a utils file, so the generated DAG files are smaller and don’t show some default parameters.

  • YAML DAG
  • Python DAG
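
As a rough illustration of the generated Python side for the base-pod YAML above, assuming Airflow 1.10.x, a hypothetical utils helper module and illustrative default arguments (this is a sketch, not the actual generated file):

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# "Hidden" defaults: default_args, image names, annotations, etc. come from a
# shared utils module (hypothetical names).
from utils import get_default_args, get_base_image

dag = DAG(
    dag_id="base-pod",
    default_args=get_default_args(),
    schedule_interval="@daily",  # illustrative schedule
)

base_pod = KubernetesPodOperator(
    task_id="base-pod",
    name="base-pod",
    namespace="airflow",
    image=get_base_image(),  # the shared "base" image
    arguments=[
        "--repository", "docker-test",
        "--workdir", "sample_project",
        "--main-file", "exec.py",
        "--agg", "1", "--name", "test",
    ],
    image_pull_policy="Always",
    is_delete_operator_pod=True,
    get_logs=True,
    dag=dag,
)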

Conclusion

For a team or company that already knows Airflow and Python very well, all of this may seem like “removing” some Airflow features.

But in our case, it simplified the process for new users and gave us (the team responsible for Airflow) some advantages:

  • The YAML pipeline is a description of the job with no dependencies on Airflow;
  • We can change Airflow-related behavior in the generator and recreate all DAGs very easily;
  • All of our DAGs are very similar, and everyone has to use dockerized applications;
  • For people who don’t know Docker, we can create more base images to run their code;
  • And most importantly, Airflow was very quickly accepted by many users in our stack.

Thank you for reading this :)


Rodrigo Lazarini Gil

Working over the years with SQL, data modeling, data platforms and engineering. Currently focused on data platforms and Spark jobs with Python.