Spark
Spark and Hugging Face: Named Entity Recognition for aircraft ownership
How do you use one of the best AI/ML libraries (Hugging Face) with Spark? Let's try it
Spark
The pyspark library communicates with a Spark driver. If the driver dies, we get odd error messages that do not directly point to the root cause.
Spark
Just a quick post on an error you might find when using Apache Spark and Avro due to a version mismatch. Software versions: Spark 3.3.2 (3.3), spark-avro 2.12:3.4.0. If you have an older version of Spark (latest as of now is
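If the cause is indeed the mismatch, the usual fix is to pin the spark-avro artifact to the running Spark version. A sketch of the submit line under that assumption (the job file name is a placeholder):

```bash
# Match the spark-avro artifact to the Spark version (3.3.2 here) and to
# the Scala build (2.12); mixing a 3.4.x spark-avro with a 3.3.x Spark
# is exactly the kind of mismatch that produces these errors.
spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.2 my_job.py
```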
Spark
It's sometimes difficult to access S3 files in Apache Spark if you don't use a prebuilt environment like Zeppelin, Glue Notebooks, Hue, Databricks Notebooks or other alternatives. And googling around might get you half-working solutions. But do not worry, I'll show you how
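One half of those half-working solutions is usually the missing S3A connector. A hedged sketch of a launch line that wires it in (the hadoop-aws version must match your Hadoop build, and the credential values are placeholders):

```bash
# Pull in the S3A filesystem connector and pass AWS credentials as
# Hadoop options, so spark.read can resolve s3a:// paths.
pyspark \
  --packages org.apache.hadoop:hadoop-aws:3.3.2 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```

After that, something like `spark.read.parquet("s3a://your-bucket/path")` should work from the shell.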
Spark
Using Spark interactively is sometimes a bit cumbersome if you don't want to go to the good old terminal and decide that something like a Jupyter notebook suits you better. Or you are doing a more complex analysis. If that is the
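For the record, the stock pyspark launcher can be pointed at Jupyter with two environment variables; a minimal sketch:

```bash
# Tell the pyspark launcher to start a Jupyter notebook server as its
# driver Python instead of the plain REPL.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
```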
Spark
In physics or biology you sometimes simulate processes on a two-dimensional lattice, or discrete space. In those cases you usually compute some local interactions of "cells" and, with that, calculate a result. An example of this is the Ising model, which was proposed in 1920 for
Spark
After reading this, you will be able to execute Python files and Jupyter notebooks that run Apache Spark code in your local environment. This tutorial applies to OS X and Linux systems. We assume you already have knowledge of Python and a console environment. 1. Download Apache Spark We will
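As a preview of step 1, the unpack-and-expose part usually looks like this (the exact version in the file name depends on the release you download):

```bash
# Unpack the downloaded release and put its launchers on the PATH, so
# `pyspark` and `spark-submit` are available from any shell.
tar -xzf spark-3.3.2-bin-hadoop3.tgz
export SPARK_HOME="$PWD/spark-3.3.2-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```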
Spark
Spark has a way to compute histograms; however, it is hidden in low-level and sometimes obscure classes. In this post I'll give you a function that returns the desired values given a DataFrame. In the official documentation the only mention [https://spark.apache.org/
Spark
This is one of those things that makes sense when you stop to think about it. When you perform the same aggregation over a window on more than one column, it is recommended to define that window only once. Let's dive in with an example: I am working with
Spark
Nginx is a common webserver used as a reverse proxy for things like adding TLS, basic authentication and forwarding requests to other internal servers on your network. In this case, we are going to serve the Spark UI adding security (https) and authentication, and serving it on a different
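A minimal sketch of such a server block, assuming the Spark UI runs on its default port 4040 (the hostname, certificate paths and htpasswd file are placeholders):

```nginx
# Reverse-proxy the Spark UI behind TLS and basic authentication.
server {
    listen 443 ssl;
    server_name spark.example.com;          # hypothetical hostname
    ssl_certificate     /etc/nginx/tls/cert.pem;
    ssl_certificate_key /etc/nginx/tls/key.pem;

    location / {
        auth_basic           "Spark UI";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:4040;
    }
}
```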
Spark
When working with change data capture data, which sometimes contains just the updated cells for a given PK, it is not easy to efficiently obtain the latest value for an entry (row). Let's dive into an example. First, though, we need a couple of definitions about change data capture
Spark
Imagine we have a table with a sort of primary key where information is added or updated partially: not all the columns for a key are updated each time. We now want a consolidated view of the information, with just one value of the key containing the
Spark
So today I was trying to use the handy function sha1() provided by Spark, and I needed to concatenate all my columns into one, since it did not support multiple ones. The solution seemed easy at first: use concat(). However, something odd was happening. It turns out I had
Spark
Recently I have been discussing with some colleagues the advantages of type-safe languages, the main reason I ditched Python for Scala. Scala provides, for me, the best of both worlds: the compiler is smart enough to determine variable types, and in most cases it can even infer return types
Spark
Graphs have been around for quite a while. In fact, the first paper in the history of graph theory was written by Euler himself in 1736! However, it wasn't until 142 years later that the term graph was introduced in a paper published in Nature. Despite those shiny names,
Spark
A pivot table is an old friend of Business Intelligence: it allows you to summarize the data of another table, usually by performing aggregations (sum, mean, max, ...) on grouped data. They are found in multiple reporting tools (e.g. Qlik, Tableau and Excel), and analytical RDBMS (e.g. Oracle) also implement