Spark

Loading packages in pyspark jupyter notebook/lab

Marti Sec

09 Sep 2022 — 1 min read

Using spark in an interactive way it's a bit cumbersome sometimes if you don't want to go to the good old terminal and you decide something like a jupyter notebook better suits you. Or you are doing a more complex analysis also.

If that is the case and you don't want the trouble of setting-up the infrastructure provided by a cloud provider, and you are just using spark because you are used to it, want to try it or scales well, then using pyspark it's great due to the seamless integration with data science and visualization tools.

Just install the package from pip, and you are ready to go!

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("myApp") \
    .config("spark.driver.memory", "32g") \
    .getOrCreate()

You can find some problems however if you want to use those extra utilities coming from the spark packages. Reading from avro, kafka... using graph frames or anything you can imagine it's easy if you use the command line to start-up a shell or submitting a package. Just write --package and are good to go.

However, if you want to do it from a jupyter instance there are some solutions not working well. The easiest one is not well documented, but it works like a charm. Just add an extra config to your spark session builder ( spark.jars.packages ) and restart the kernel.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("myapp") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0,org.package2....")\
    .config("spark.driver.memory", "32g") \
    .getOrCreate()

Screenshot of spark downloading the packages

That's it. as easy as it gets!

Leptos 0.8 + tailwind4 + daisyUI5 for easy websites

An update on how to configure leptos 0.8 tailwindcss 4 and daisyui 5 with trunk for csr or cargo watch for ssr

Learn rust from scratch

Those last months I decided to learn this new fancy and hipster language, so let me share a couple of links I would now recommend. I'm coming from mostly python, but a bit of functional experience with scala, so I'm comfortable with the typing system (and

Full stack with Rust: axum + leptos + tailwind.css

Leptos is a production-ready full stack server written in rust that brings component reactivity, server functions and rust's type safety without needing to know Javascript

Data Concept: Idempotence

Idempotentity is not only useful in the design of data pipelines, but also as a system-wide concern, since if one step is not idempotent, the entire quality of the results will be affected. Let's look at a practical example.

Read more

Leptos 0.8 + tailwind4 + daisyUI5 for easy websites

Learn rust from scratch

Full stack with Rust: axum + leptos + tailwind.css

Data Concept: Idempotence