PySpark image with jobs syncing through Git

Rodrigo Lazarini Gil
Jan 27, 2021

Any time we want to trigger a Spark job, we also need to deploy our jar package, or in PySpark's case, our zip package, to be executed.

Usually we create a CI/CD job that builds this package and delivers it to storage, where our Spark cluster can download it to run.

This is good enough when everything is ready to deploy, but not for quick development and a lot of experimentation.

Recently I tried something different: inside the PySpark Docker entrypoint, I added a git sparse checkout step, so that a job only needs a few parameters to run (see the sketch after this list):

  • Repository URL
  • Directory (for sparse checkout)
  • Branch
  • Git deploy key to read the repository
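
Here is a minimal sketch of what that entrypoint step could look like, written as a small Python helper. It is only illustrative: the function name, the default destination path and the environment variable names are assumptions rather than the actual image's code, and the sparse-checkout subcommand needs a reasonably recent git (2.25+).

```python
import os
import subprocess


def sparse_checkout(repo_url: str, directory: str, branch: str,
                    deploy_key: str, dest: str = "/opt/job") -> str:
    """Check out only `directory` from `branch` of `repo_url` into `dest`.

    All names and paths here are illustrative, not the real entrypoint.
    """
    env = os.environ.copy()
    # Authenticate with the read-only deploy key mounted into the container.
    env["GIT_SSH_COMMAND"] = f"ssh -i {deploy_key} -o StrictHostKeyChecking=no"

    # Clone metadata only: no working tree yet, single branch, shallow.
    subprocess.run(
        ["git", "clone", "--no-checkout", "--depth", "1",
         "--branch", branch, repo_url, dest],
        check=True, env=env,
    )
    # Restrict the working tree to the job's directory, then materialize it.
    subprocess.run(["git", "sparse-checkout", "set", directory],
                   check=True, cwd=dest)
    subprocess.run(["git", "checkout", branch], check=True, cwd=dest, env=env)

    return os.path.join(dest, directory)


if __name__ == "__main__":
    # The entrypoint could read the four parameters from the environment.
    job_path = sparse_checkout(
        repo_url=os.environ["JOB_REPO_URL"],
        directory=os.environ["JOB_DIRECTORY"],
        branch=os.environ["JOB_BRANCH"],
        deploy_key=os.environ["JOB_DEPLOY_KEY_PATH"],
    )
    print(f"Job code available at {job_path}")
```

With the job directory checked out, the entrypoint can point spark-submit at the files inside it, so no pre-built zip has to be shipped.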

This way, the driver and executors always have a version that is up to date with your Git branch.

This allowed me to run jobs quickly from my local environment without worrying too much about packaging.

Let me know your opinion about this! :)



Rodrigo Lazarini Gil

Working over the years with SQL, data modeling, data platforms and engineering. Currently focused on data platforms and Spark jobs with Python.