Creating and customizing your PySpark Docker image

Rodrigo Lazarini Gil
1 min read · Dec 28, 2020

I’ve always wanted an official Spark Docker image to run local jobs, run some tests, and prepare deployments for my projects.

Recently I found out that the Spark 3.0.1 source distribution ships with a Dockerfile, so I created some scripts to build it and add some JARs and common Python libraries.

The process does the following:

  1. Downloads the base Spark source code
  2. Builds the Spark-provided Dockerfile for PySpark, without changing anything
  3. Adds another Dockerfile on top with the GCS connector JAR and extra Python requirements, as an example.
    You can also find a sample job inside that reads a CSV; a minimal sketch of such a job appears right after this list.
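As a rough idea of what that sample job looks like, here is a minimal sketch. The file name, input path, and session settings are illustrative assumptions, not the repository’s actual code:

# sample_csv_job.py: minimal sketch of a PySpark job that reads a CSV.
# The input path is a placeholder; with the GCS connector JAR on the
# classpath (and credentials configured), it could just as well be a gs:// URI.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sample-csv-job")
    .getOrCreate()
)

# Read a CSV with a header row and let Spark infer column types.
df = spark.read.csv("/tmp/data/sample.csv", header=True, inferSchema=True)

df.printSchema()
df.show(10)

spark.stop()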

If you just run the following command, you will be able to try out this Spark image:

make run_local

You can check the repository for more details: https://github.com/rodrigolazarinigil/docker-spark

For the next articles, I’d like to write about:

  • Using git-sync with a Spark Docker image
  • Reading binary files and extracting their metadata (in my case, DICOM files)


Rodrigo Lazarini Gil

Working over the years with SQL, data modeling, data platforms, and data engineering. Currently focused on data platforms and Spark jobs with Python.