Whenever I hear someone in a meeting complaining about some broken package in Python, I am reminded that I should write about this.

Here I’ll share a bit of my experience.

What is Poetry?

Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.

Okay, I copied that definition from the documentation page: https://python-poetry.org/docs/ . There you can get a better understanding of the library and go through the usual quick start.
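As a quick taste (assuming a project already set up with poetry new or poetry init; requests here is just an example dependency), declaring and installing a library is as simple as:

poetry add requests
poetry install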

Why Poetry?

For me, the main reason for using Poetry comes down to “stop breaking my build/deploy”. With the…

Anytime we want to trigger a Spark job, we also need to deploy our jar package, or, in the PySpark case, our zip package, so it can be executed.

Usually we create a CI/CD job that builds this package and delivers it to a storage location, from where our Spark cluster can download and run it.

This is good enough when everything is ready to deploy, but not for fast-paced development with a lot of experimentation.

Recently I’ve tried something different: inside the PySpark Docker entrypoint, I added a git sparse checkout, so that we only need to provide some parameters for…

I’ve always wanted an official Docker image to run some local jobs, run some tests and use in my projects.

Recently I found out that the Spark 3.0.1 source distribution ships a Dockerfile and some scripts that let you build your own image with additional jars and libraries.

I created a process that:

  1. Downloads the Spark source code.
  2. Builds the PySpark image without changing anything.
  3. Uses that image as the base for another Dockerfile, adding some jars (in my case, for GCS access) and a few widely used Python libraries.
    In my code, you will find an example that reads a CSV file.

If you run the following command, you can check out how the image turned out:

make run_local

You can look at the code in more detail: https://github.com/rodrigolazarinigil/docker-spark

I’ve always wanted an official Spark Docker image to run local jobs, run some tests and prepare deployments for my projects.

Recently I found out that the Spark 3.0.1 source distribution ships a Dockerfile, and I created some scripts to build it and add some jars and common Python libraries.

The process:

  1. Downloads the base Spark code
  2. Builds the Spark Dockerfile for PySpark, without changing anything
  3. Adds another Dockerfile with the GCS jar and some extra Python requirements, as an example.
    You can also find a sample job inside that reads a CSV (sketched below).
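The sample is roughly along these lines (a minimal sketch, not the exact code from the repository; the gs:// path is a placeholder):

from pyspark.sql import SparkSession

# Hypothetical CSV-reading job; the GCS connector jar baked into the image
# is what makes the gs:// scheme resolvable.
spark = SparkSession.builder.appName("csv-sample").getOrCreate()

df = spark.read.option("header", "true").csv("gs://my-bucket/sample.csv")  # placeholder path
df.show()

spark.stop()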

If you just run the following code…

Simplifying the creation of DAGs

In the first article about this Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.

That gave us scalable Airflow executors, but we still have problems like this one. This article will show how to:

  1. Use the Airflow Kubernetes operator to isolate all business rules from the Airflow pipelines;
  2. Create DAGs in YAML, using schema validators to make Airflow simpler for some users;
  3. Define a standard for the pipelines we create

Kubernetes Pod Operator

We can achieve this using only the Kubernetes Pod Operator, so that users keep all their…
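As an illustration only (not the code from the article; the image, namespace and command are placeholders), a task built with the Kubernetes Pod Operator in Airflow 1.10.x could look roughly like this:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical DAG: the user's own image carries all the business rules,
# Airflow only launches it as a pod.
dag = DAG("my_business_job", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

run_job = KubernetesPodOperator(
    task_id="run_business_rules",
    name="run-business-rules",
    namespace="data-jobs",              # placeholder namespace
    image="my-registry/my-job:latest",  # placeholder image
    cmds=["python", "-m", "my_job"],    # placeholder command
    dag=dag,
)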

Simplifying the creation of DAGs

In the first story about this Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.

This allowed us to scale Airflow workers and executors, but we still have problems like this one. This article is going to show how to:

  1. Use the Airflow Kubernetes operator to isolate all business rules from the Airflow pipelines;
  2. Create YAML DAGs, using schema validations to simplify the usage of Airflow for some users;
  3. Define a pipeline pattern.

Kubernetes Pod Operator

This is accomplished by using only the Kubernetes Pod Operator, so that users keep all the code (and business rules) in…
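Tying the two ideas together, a YAML-driven DAG factory might be sketched like this (hypothetical schema and names; the article's real validator is more complete than this required-keys check):

import yaml
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

REQUIRED_KEYS = {"dag_id", "schedule", "image", "command"}  # hypothetical schema

def build_dag(spec_path):
    # Load the user's YAML spec and fail early if it doesn't match the schema.
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError("invalid DAG spec, missing keys: %s" % missing)

    dag = DAG(spec["dag_id"], start_date=datetime(2020, 1, 1),
              schedule_interval=spec["schedule"])
    KubernetesPodOperator(
        task_id="main",
        name=spec["dag_id"],
        namespace="data-jobs",      # placeholder namespace
        image=spec["image"],
        cmds=spec["command"],
        dag=dag,
    )
    return dag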

Using Airflow to provide a solution for multiple teams

Here at GrupoZap, we had some needs that we ended up addressing by adding Airflow to our data stack. We needed a tool to which we could delegate an environment where every team could create and monitor their own data pipelines.

Before starting, some terms I will use throughout this text:

Example of a DAG
  • DAG: Directed Acyclic Graph — a finite, directed, acyclic graph; it defines the set of tasks to be executed and the order they must follow
  • Workers: pods created by Airflow to execute tasks
  • Airflow UI…

Using Airflow to provide a solution for multiple teams

This article summarizes a way to use Airflow on Kubernetes, with DAGs synced through Git.

This architecture shows:

  • Airflow with scalable workers and executors as Kubernetes pods;
  • The Airflow UI and scheduler also running inside Kubernetes;
  • DAGs added through git-sync, allowing users to create and update pipelines without restarting Airflow
Airflow Kubernetes architecture

Airflow Docker image

You can use any Airflow image you have built. The important thing is that Airflow has to be installed with the kubernetes extra:

apache-airflow[kubernetes]==1.10.6

The entrypoint of my image initializes the Airflow metadata DB and starts the webserver and the…
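To make the git-sync part concrete, here is a minimal sketch of the kind of DAG file a user could push to the synced repository (names and schedule are placeholders, not a DAG from the actual setup):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def hello():
    # Runs inside a worker pod spawned by the Kubernetes Executor.
    print("hello from a Kubernetes worker pod")

# Hypothetical DAG: once committed to the git-synced repository, the
# scheduler picks it up without any Airflow restart.
dag = DAG("hello_git_sync", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

task = PythonOperator(task_id="hello", python_callable=hello, dag=dag)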

Luigi’s basic structure

I started working with Luigi only a few weeks ago. I understood the basics and felt like trying it in a data warehouse project to wire up the whole dependency graph.

It was really easy to make this work until a colleague of mine asked me: “How do I execute one task multiple times, with different parameters? I want to load all the historical data processed last year.”

After some struggle, I managed to do that in two ways. One is faster, the other may be more flexible in other situations.

Using a WrapperTask

A wrapper task is a task that doesn’t do anything. …
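To give an idea of the shape (a minimal sketch, with made-up task and parameter names rather than the ones from my project), a WrapperTask can fan the same task out over a whole range of dates:

import luigi
from datetime import date, timedelta

class ProcessDay(luigi.Task):
    # Hypothetical task standing in for the real warehouse load.
    day = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/processed_%s.csv" % self.day)

    def run(self):
        with self.output().open("w") as f:
            f.write("processed %s\n" % self.day)

class BackfillLastYear(luigi.WrapperTask):
    # Does no work itself; it only requires one ProcessDay per date.
    start = luigi.DateParameter(default=date(2019, 1, 1))
    end = luigi.DateParameter(default=date(2019, 12, 31))

    def requires(self):
        day = self.start
        while day <= self.end:
            yield ProcessDay(day=day)
            day += timedelta(days=1)

Triggering BackfillLastYear once schedules one ProcessDay for every date in the range, which is exactly the “run one task many times with different parameters” case from the question above.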

Rodrigo Lazarini Gil
