PySpark image with jobs syncing through git
Any time we want to trigger a Spark job, we also need to deploy our JAR package, or in the PySpark case, our ZIP package to be executed.
Usually we create a CI/CD job that builds this package and delivers it to storage, from where our Spark cluster can download and run it.
This is good enough when you have everything ready to deploy, but not for quick development cycles with a lot of experimentation.
Recently I tried something different: inside the PySpark Docker entrypoint, I added a git sparse checkout step, so that to run a job we only need to provide a few parameters (see the sketch after this list):
- Repository URL
- Directory (for sparse checkout)
- Branch
- Git deploy key to read the repository
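
Below is a minimal sketch of what that entrypoint step could look like, assuming the parameters arrive as environment variables (the names `JOB_GIT_URL`, `JOB_GIT_BRANCH`, `JOB_GIT_DIR`, and `JOB_DEPLOY_KEY_PATH` are hypothetical). The real entrypoint may well be a shell script; this expresses the same git commands in Python:

```python
# sparse_checkout.py - hypothetical entrypoint helper; all names are illustrative.
import os
import subprocess


def sparse_checkout(repo_url: str, branch: str, directory: str,
                    deploy_key_path: str, dest: str = "/opt/job") -> str:
    """Clone only `directory` from `branch` of `repo_url`, using a read-only deploy key."""
    env = dict(
        os.environ,
        GIT_SSH_COMMAND=f"ssh -i {deploy_key_path} -o StrictHostKeyChecking=no",
    )
    # Shallow, blobless clone that checks out nothing yet.
    subprocess.run(
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
         "--branch", branch, repo_url, dest],
        check=True, env=env,
    )
    # Restrict the working tree to the job's directory only.
    subprocess.run(["git", "-C", dest, "sparse-checkout", "set", directory], check=True)
    return os.path.join(dest, directory)


if __name__ == "__main__":
    job_dir = sparse_checkout(
        repo_url=os.environ["JOB_GIT_URL"],
        branch=os.environ["JOB_GIT_BRANCH"],
        directory=os.environ["JOB_GIT_DIR"],
        deploy_key_path=os.environ["JOB_DEPLOY_KEY_PATH"],
    )
    print(f"Job code available at {job_dir}")
```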
This way, the drivers and executors always run a version of the code that is up to date with your git branch.
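
One possible way to wire the checked-out code into both the driver and the executors (a sketch, not my exact setup; `/opt/job/jobs` and `my_job.entrypoint.run` are placeholders) is to zip the sparse-checked-out directory and ship it with `addPyFile`:

```python
# launch_job.py - hypothetical launcher using the directory produced by sparse_checkout().
import shutil
import sys

from pyspark.sql import SparkSession

job_dir = "/opt/job/jobs"  # directory returned by sparse_checkout(); illustrative
archive = shutil.make_archive("/tmp/job_code", "zip", job_dir)

spark = SparkSession.builder.appName("git-synced-job").getOrCreate()
# Make the freshly checked-out code importable on the driver...
sys.path.insert(0, job_dir)
# ...and distribute the same code to every executor.
spark.sparkContext.addPyFile(archive)

from my_job.entrypoint import run  # hypothetical package inside the checked-out directory
run(spark)
```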
This allowed me to quickly run jobs from my local environment without worrying too much about packaging.
Let me know your opinion about this! :)