Creating and customizing your PySpark Docker image
I’ve always wanted an official Spark Docker image to run local jobs, run some tests, and prepare deployments in my projects.
Recently I found out that the Spark 3.0.1 source distribution ships with a Dockerfile, so I created some scripts to build it and add extra jars and common Python libraries.
The process does the following:
- Downloads the base Spark distribution
- Builds the PySpark image from the bundled Dockerfile, without changing anything
- Adds another Dockerfile that layers in the GCS connector jar and extra Python requirements, as an example (sketched below)
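Here is a minimal sketch of that flow, assuming the Spark 3.0.1 binary distribution and its bundled docker-image-tool.sh. The image names, the custom Dockerfile, and the GCS connector URL are illustrative assumptions, not the repository’s exact scripts:

```bash
# A hedged sketch of the build flow; versions, paths, and image tags
# are assumptions -- see the repository for the real scripts.

# 1. Download the base Spark distribution (it ships a Dockerfile)
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
cd spark-3.0.1-bin-hadoop2.7

# 2. Build the stock PySpark image with the bundled tool, unchanged;
#    this typically produces an image named spark-py:3.0.1
./bin/docker-image-tool.sh -t 3.0.1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# 3. Layer the GCS connector jar and extra Python requirements on top
#    (expects a requirements.txt in the current directory)
cat > Dockerfile.custom <<'EOF'
FROM spark-py:3.0.1
# the base image runs as a non-root user, so switch to root to install
USER root
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/gcs-connector-hadoop2-latest.jar
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
# drop back to the default non-root Spark user
USER 185
EOF
docker build -t spark-py-custom:3.0.1 -f Dockerfile.custom .
```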
You can also find a sample job inside that reads a CSV file.
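In essence, such a job is just a few lines of PySpark. Here is a minimal sketch; the script name, input path, and read options are placeholders of mine, not necessarily what the repository ships:

```python
# sample_csv_job.py -- illustrative CSV-reading job; the path and
# options are placeholders, not the repository's exact sample.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sample-csv-job")
    .getOrCreate()
)

# read a CSV with a header row, letting Spark infer column types
df = spark.read.csv("/data/sample.csv", header=True, inferSchema=True)

df.printSchema()
df.show(10)

spark.stop()
```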
If you just run the following commands, you will be able to try this Spark image out.
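Roughly, that means cloning the repository and running its build; the build entry point and image name below are guesses on my part, so check the repository for the exact commands:

```bash
# clone the repo and build the images (the "make build" entry point is
# an assumption -- check the repository's README for the real command)
git clone https://github.com/rodrigolazarinigil/docker-spark
cd docker-spark
make build

# quick smoke test: ask the image for its Spark version
docker run --rm spark-py-custom:3.0.1 /opt/spark/bin/spark-submit --version
```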
You can check the repository for more details: https://github.com/rodrigolazarinigil/docker-spark
For future articles, I’d like to write about:
- Using git-sync with a Spark Docker image
- Reading binary files and extracting their metadata (in my case, DICOM files)