Spark
Spark and Hugging Face: Named Entity Recognition for aircraft ownership
How do you use one of the best AI/ML libraries (Hugging Face) with Spark? Let's try it
Spark
The pyspark library communicates with a Spark driver. If the driver dies, we get odd error messages that do not directly point to the root cause.
Spark
Just a quick post on an error you might find when using Apache Spark and Avro due to a version mismatch. Software versions: Spark 3.3.2 (3.3), spark-avro 2.12:3.4.0. If you have an older version of Spark (latest as of now is
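If the cause is indeed the mismatch, the usual fix is to pin the spark-avro artifact to the running Spark version. A sketch of the submit line under that assumption (the job file name is a placeholder):

```bash
# Match the spark-avro artifact to the Spark version (3.3.2 here) and to
# the Scala build (2.12); mixing a 3.4.x spark-avro with a 3.3.x Spark
# is exactly the kind of mismatch that produces these errors.
spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.2 my_job.py
```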
Spark
It's sometimes difficult to access S3 files in Apache Spark if you don't use a prebuilt environment like Zeppelin, Glue Notebooks, Hue, Databricks Notebooks or other alternatives. And googling around might get you half-working solutions. But do not worry, I'll show you how
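One half of those half-working solutions is usually the missing S3A connector. A hedged sketch of a launch line that wires it in (the hadoop-aws version must match your Hadoop build, and the credential values are placeholders):

```bash
# Pull in the S3A filesystem connector and pass AWS credentials as
# Hadoop options, so spark.read can resolve s3a:// paths.
pyspark \
  --packages org.apache.hadoop:hadoop-aws:3.3.2 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```

After that, something like `spark.read.parquet("s3a://your-bucket/path")` should work from the shell.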
Spark
Using Spark interactively is sometimes a bit cumbersome if you don't want to go to the good old terminal and decide that something like a Jupyter notebook suits you better. Or you are doing a more complex analysis. If that is the
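For the record, the stock pyspark launcher can be pointed at Jupyter with two environment variables; a minimal sketch:

```bash
# Tell the pyspark launcher to start a Jupyter notebook server as its
# driver Python instead of the plain REPL.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
```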
Spark
In physics or biology you sometimes simulate processes on a two-dimensional lattice, or discrete space. In those cases you usually compute some local interactions of "cells" and, with that, calculate a result. An example of this is the Ising model, which was proposed in 1920 for
Spark
After reading this, you will be able to execute Python files and Jupyter notebooks that run Apache Spark code in your local environment. This tutorial applies to OS X and Linux systems. We assume you already have knowledge of Python and a console environment. 1. Download Apache Spark We will
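As a preview of step 1, the unpack-and-expose part usually looks like this (the exact version in the file name depends on the release you download):

```bash
# Unpack the downloaded release and put its launchers on the PATH, so
# `pyspark` and `spark-submit` are available from any shell.
tar -xzf spark-3.3.2-bin-hadoop3.tgz
export SPARK_HOME="$PWD/spark-3.3.2-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```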
Spark
Spark has a way to compute histograms; however, it is hidden in low-level and sometimes obscure classes. In this post I'll give you a function that returns the desired values given a DataFrame. In the official documentation the only mention [https://spark.apache.org/
Spark
This is one of those things that makes sense when you stop to think about it. When you perform the same aggregation over a window on more than one column, it is recommended to define that window only once. Let's dive in with an example: I am working with
Spark
Nginx is a common webserver used as a reverse proxy for things like adding TLS, basic authentication and forwarding requests to other internal servers on your network. In this case, we are going to serve the Spark UI adding security (https) and authentication, and serving it on a different
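A minimal sketch of such a server block, assuming the Spark UI runs on its default port 4040 (the hostname, certificate paths and htpasswd file are placeholders):

```nginx
# Reverse-proxy the Spark UI behind TLS and basic authentication.
server {
    listen 443 ssl;
    server_name spark.example.com;          # hypothetical hostname
    ssl_certificate     /etc/nginx/tls/cert.pem;
    ssl_certificate_key /etc/nginx/tls/key.pem;

    location / {
        auth_basic           "Spark UI";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:4040;
    }
}
```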
Spark
When working with change data capture data, which sometimes contains just the updated cells for a given PK, it is not easy to efficiently obtain the latest value for an entry (row). Let's dive into an example. First, though, we need a couple of definitions about change data capture
Spark
Imagine we have a table with a sort of primary key where information is added or updated partially: not all the columns for a key are updated each time. We now want a consolidated view of the information, with just one value of the key containing the
Spark
So today I was trying to use the handy function sha1() provided by Spark, and I needed to concatenate all my columns into one, since it did not support multiple ones. The solution seemed easy at first: use concat(). However, something odd was happening. It turns out I had
Spark
Recently I have been discussing with some colleagues the advantages of type-safe languages, the main reason I ditched Python for Scala. Scala provides, for me, the best of both worlds: the compiler is smart enough to determine variable types, and in most cases it can even infer return types
Spark
Graphs have been around for quite a while. In fact, the first paper in the history of graph theory was written by Euler himself in 1736! However, it wasn't until 142 years later that the term graph was introduced in a paper published in Nature. Despite those shiny names,
Spark
A pivot table is an old friend of Business Intelligence: it allows you to summarize the data of another table, usually by performing aggregations (sum, mean, max, ...) on grouped data. They are found in multiple reporting tools (e.g. Qlik, Tableau and Excel), and analytical RDBMS (e.g. Oracle) also implement