I'm always reminded to write about this whenever I hear someone in a meeting complaining about some broken package in Python.
Here I'll share a bit of my experience.
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.
Okay, I copied that definition from their documentation page: https://python-poetry.org/docs/ . There you can learn more about the library and run the usual quick start.
For me, the main reason to use Poetry comes down to “stop breaking my build/deploy”. With the…
Anytime we want to trigger a Spark job, we also need to deploy our jar package, or in the PySpark case, our zip package, to be executed.
Usually we create a CI/CD job that builds this package and delivers it to storage, from where our Spark cluster can download and run it.
This is good enough when you have everything ready to deploy, but not for quick development and a lot of experimentation.
Recently I’ve tried something different: inside the PySpark Docker entrypoint, I’ve added a git sparse checkout, so that we just need to provide some parameters for…
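Just to illustrate the idea (this is a rough sketch, not the actual entrypoint; the environment variable names, paths and entry script below are made up), such an entrypoint could sparse-checkout only the job code and then hand off to spark-submit:

# sketch_entrypoint.py - hypothetical illustration of a PySpark Docker entrypoint
# that sparse-checkouts only the job code before running spark-submit.
# GIT_REPO, GIT_BRANCH and JOB_PATH are assumed parameters, not real ones.
import os
import subprocess

repo = os.environ["GIT_REPO"]          # e.g. https://github.com/my-org/my-jobs.git
branch = os.environ.get("GIT_BRANCH", "main")
job_path = os.environ["JOB_PATH"]      # e.g. jobs/daily_load

# Clone without file contents, then materialize only the job directory (git >= 2.25).
subprocess.run(["git", "clone", "--depth", "1", "--branch", branch,
                "--filter=blob:none", "--sparse", repo, "/opt/job"], check=True)
subprocess.run(["git", "-C", "/opt/job", "sparse-checkout", "set", job_path], check=True)

# Hand off to spark-submit with the checked-out entry script.
subprocess.run(["spark-submit", f"/opt/job/{job_path}/main.py"], check=True)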
I've always wanted an official Docker image to run some local jobs, run some tests and use in my projects.
Recently I found out that Spark 3.0.1 (in the source) ships a Dockerfile and some scripts that let you build your own image with some additional jars and libraries.
I created a process that:
If you run the following command, you can check out the resulting image:
make run_local
You can look at the code in more detail: https://github.com/rodrigolazarinigil/docker-spark
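Once the image is built, a quick way to sanity-check it (this is just a generic PySpark snippet of my own, not something from the repository) is to run a trivial job against local[*] inside the container:

# smoke_test.py - minimal local PySpark job to verify the image works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
assert df.count() == 2  # trivial check that the local Spark session is alive

spark.stop()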
I’ve always wanted an official Spark Docker image to run local jobs, run some tests and prepare to deploy them in my projects.
Recently I found out that the Spark 3.0.1 source distribution includes a Dockerfile, and I’ve created some scripts to build it and add some jars and common libraries.
The process does:
If you just run the following code…
Simplifying the creation of DAGs
In the first article about the Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.
That allowed us to have scalable Airflow executors, but we still have problems like this one. This article will show how to:
We accomplish this by using only the Kubernetes Pod Operator, so that users keep all their…
Simplifying the creation of DAGs
In the first story about an Airflow architecture (https://medium.com/@nbrgil/scalable-airflow-with-kubernetes-git-sync-63c34d0edfc3), I explained how to use Airflow with the Kubernetes Executor.
This allowed us to scale Airflow workers and executors, but we still have problems like this one. This article is going to show how to:
This is accomplished by using only the Kubernetes Pod Operator, so the users will keep all the code (and business rules) in…
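To give a rough idea of what that looks like (a minimal sketch assuming Airflow 1.10.x; the DAG id, namespace and image are placeholders, not the ones from the article), a task defined only with the Kubernetes Pod Operator can be as small as this:

# example_pod_dag.py - hypothetical DAG using only the KubernetesPodOperator
# (Airflow 1.10.x import path; names and image are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG("team_a_example", start_date=datetime(2020, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    run_job = KubernetesPodOperator(
        task_id="run_job",
        name="run-job",
        namespace="data-jobs",
        image="my-registry/team-a-job:latest",  # the team's own image, with its own code
        cmds=["python", "-m", "team_a.job"],
        get_logs=True,
        is_delete_operator_pod=True,
    )

The point of this pattern is that Airflow only schedules a pod; everything the task actually does lives in the team's own image.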
Using Airflow to provide a solution for multiple teams
Here at GrupoZap, we had some needs that we ended up addressing by deploying Airflow in our data stack. We needed a tool that would let us delegate an environment where every team could create and monitor their own data pipelines.
Before starting, some terms I'll use throughout this text:
Using airflow to provide a solution for multiple teams
This article summarizes a way to use Airflow with Kubernetes, with DAGs synced through Git.
This architecture shows:
You can use any Airflow image you like. The important thing is that Airflow has to be installed with the kubernetes extra:
apache-airflow[kubernetes]==1.10.6
The entrypoint of my image starts the Airflow metadata DB, the webserver and the…
I started working with Luigi only a few weeks ago. I understood the basics and felt like trying it in a data warehouse project to wire up the whole dependency graph.
It was really easy to make this work until a colleague of mine asked me “How do I execute one task, multiple times, with different parameters? I wanna load all the historical data processed last year”
After some struggle, I managed to do that in two ways. One is faster, the other may be more flexible in other situations.
A wrapper task is a task that doesn’t do anything. …
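To make that concrete, here is a minimal sketch of the wrapper-task approach (the task names, target paths and the 2019 date range are just illustrative, not taken from the project): a luigi.WrapperTask whose requires() yields the same task once per parameter value, so a single invocation fans out over the whole history.

# backfill.py - hypothetical example of a wrapper task that fans one task
# out over many dates (names and range are illustrative).
import datetime

import luigi


class LoadDay(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/loaded_{self.date.isoformat()}.txt")

    def run(self):
        with self.output().open("w") as out:
            out.write(f"loaded {self.date}\n")


class LoadLastYear(luigi.WrapperTask):
    """Does no work itself; it only requires LoadDay for every day of 2019."""

    def requires(self):
        day = datetime.date(2019, 1, 1)
        while day <= datetime.date(2019, 12, 31):
            yield LoadDay(date=day)
            day += datetime.timedelta(days=1)

Invoking luigi --module backfill LoadLastYear then schedules one LoadDay per date in the range.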