PySpark image with jobs syncing through git
Any time we want to trigger a Spark job, we also need to deploy our JAR package, or in the PySpark case, our ZIP package to be executed.
Usually we create a CI/CD job that builds this package and delivers it to storage, from where our Spark cluster can download and run it.
This is good enough when you have everything ready to deploy, but not for quick development cycles with a lot of experimentation.
Recently I tried something different: inside the PySpark Docker entrypoint, I added a git sparse checkout step, so that to run a job we only need to provide a few parameters (see the sketch after this list):
- Repository URL
- Directory (for sparse checkout)
- Branch
- Git deploy key to read the repository
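
Below is a minimal sketch of what that entrypoint step could look like, assuming the parameters arrive as environment variables (the names `JOB_GIT_URL`, `JOB_GIT_BRANCH`, `JOB_GIT_DIR`, and `JOB_DEPLOY_KEY_PATH` are hypothetical). The real entrypoint may well be a shell script; this expresses the same git commands in Python:

```python
# sparse_checkout.py - hypothetical entrypoint helper; all names are illustrative.
import os
import subprocess


def sparse_checkout(repo_url: str, branch: str, directory: str,
                    deploy_key_path: str, dest: str = "/opt/job") -> str:
    """Clone only `directory` from `branch` of `repo_url`, using a read-only deploy key."""
    env = dict(
        os.environ,
        GIT_SSH_COMMAND=f"ssh -i {deploy_key_path} -o StrictHostKeyChecking=no",
    )
    # Shallow, blobless clone that checks out nothing yet.
    subprocess.run(
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
         "--branch", branch, repo_url, dest],
        check=True, env=env,
    )
    # Restrict the working tree to the job's directory only.
    subprocess.run(["git", "-C", dest, "sparse-checkout", "set", directory], check=True)
    return os.path.join(dest, directory)


if __name__ == "__main__":
    job_dir = sparse_checkout(
        repo_url=os.environ["JOB_GIT_URL"],
        branch=os.environ["JOB_GIT_BRANCH"],
        directory=os.environ["JOB_GIT_DIR"],
        deploy_key_path=os.environ["JOB_DEPLOY_KEY_PATH"],
    )
    print(f"Job code available at {job_dir}")
```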
This way, the drivers and executors always run a version of the code that is up to date with your git branch.
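
One possible way to wire the checked-out code into both the driver and the executors (a sketch, not my exact setup; `/opt/job/jobs` and `my_job.entrypoint.run` are placeholders) is to zip the sparse-checked-out directory and ship it with `addPyFile`:

```python
# launch_job.py - hypothetical launcher using the directory produced by sparse_checkout().
import shutil
import sys

from pyspark.sql import SparkSession

job_dir = "/opt/job/jobs"  # directory returned by sparse_checkout(); illustrative
archive = shutil.make_archive("/tmp/job_code", "zip", job_dir)

spark = SparkSession.builder.appName("git-synced-job").getOrCreate()
# Make the freshly checked-out code importable on the driver...
sys.path.insert(0, job_dir)
# ...and distribute the same code to every executor.
spark.sparkContext.addPyFile(archive)

from my_job.entrypoint import run  # hypothetical package inside the checked-out directory
run(spark)
```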
This allowed me to quickly run jobs from my local environment without worrying too much about packaging.
Let me know your opinion about this! :)