Configure Nginx as a reverse proxy for the Spark UI

Nginx is a web server commonly used as a reverse proxy: it can add TLS, provide basic authentication, and forward requests to other internal servers on your network. In this case, we are going to serve the Spark UI, adding security (HTTPS) and authentication, and exposing it under a different location: myserver.com/spark

Prerequisites

  • Have Spark installed somewhere. For our test set-up, it will be installed on the same server as Nginx.
  • Some Nginx knowledge.
  • Nginx installed with the http_sub_module enabled. This is usually the case by default. You can check it with sudo nginx -V 2>&1 | grep http_sub_module (note the 2>&1: nginx -V prints to stderr).

Set-up the basic authentication (optional)

This step is straightforward: create a file with the users and passwords, then reference that file from the location block in the site configuration. You can find a more detailed guide on the Nginx page.

# Add the '-c' option if the file does not exist (first run only)
# Execute once for each user
$ sudo htpasswd -b /etc/nginx/htpasswd User Password

Then, add the following directives wherever you want to require authentication:

auth_basic "Put here any message you wish";
auth_basic_user_file htpasswd;
(Image: the authentication prompt shown when trying to access a restricted area.)
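If the htpasswd tool (from apache2-utils/httpd-tools) is not available, a compatible entry can also be generated with openssl. This is a sketch assuming openssl is installed; the user name, salt, and password are placeholders:

```shell
# The auth file holds one "user:hash" line per user; nginx accepts
# apr1 (MD5-based) hashes. Generate one and print the entry
# (the fixed salt is only there to make the output reproducible):
hash=$(openssl passwd -apr1 -salt xyzsalt Password)
echo "User:$hash"
# Then append it to the file referenced by auth_basic_user_file, e.g.:
# echo "User:$hash" | sudo tee -a /etc/nginx/htpasswd
```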

The Spark UI is not prepared for reverse proxies

When I started setting this up, I thought it would be pretty easy. If Spark is like other services, adding a parameter telling it which URL the user will see should be enough.

Looking through the documentation, I found something that looked like it could do the trick: spark.ui.reverseProxy and spark.ui.reverseProxyUrl. Promising!

I tried it and thought: hey, it seems to work. We get the following message:

$ spark/bin/spark-shell --conf='spark.ui.reverseProxy=true' --conf='spark.ui.reverseProxyUrl=https://8vi.cat/spark' 
...
Spark Context Web UI is available at https://8vi.cat/spark/proxy/local-1587994999941

It is not great: we get a random value for each application, and I could not get Nginx to forward my connection to the Spark UI. Furthermore, I would have to remember to add those configuration options every time I run Spark, so it is not the best option, in my opinion.
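As an aside: if you did want to keep this approach, the options can live in Spark's defaults file instead of being passed on every invocation. A sketch, assuming a default Spark layout and reusing the URL from above:

```
# spark/conf/spark-defaults.conf
spark.ui.reverseProxy     true
spark.ui.reverseProxyUrl  https://8vi.cat/spark
```

With that in place, a plain spark/bin/spark-shell picks the settings up automatically.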

Let's investigate a little more. I came across this three-year-old improvement proposal (SPARK-20044), which looked similar to what I was trying to achieve:

Purpose: allow to run the Spark web UI behind a reverse proxy with URLs prefixed by a context root, like www.mydomain.com/spark. In particular, this allows to access multiple Spark clusters through the same virtual host, only distinguishing them by context root, like www.mydomain.com/cluster1, www.mydomain.com/cluster2, and it allows to run the Spark UI in a common cookie domain (for SSO) with other services.

This looks promising. There is a PR for it, but it never got merged into open source Spark :( Not everything is lost, though: it shows how to set up the Nginx configuration to expose our UI.

server {
    server_name abc.xyj;

    # spark UI configuration
    set $SPARK_MASTER http://127.0.0.1:4040;

    # redirect master UI path without terminating slash,
    # so that relative URLs are resolved correctly
    location ~ ^(?<prefix>/spark$) {
        return 302 $scheme://$host:$server_port$prefix/;
    }

    # split spark UI path into prefix and local path within master UI
    location ~ ^(?<prefix>/spark)(?<local_path>/.*) {
        # Set authentication
        auth_basic "Spark UI";
        auth_basic_user_file htpasswd;
        # Modify HTML payloads to redirect links to the correct subfolder
        proxy_set_header Accept-Encoding "";
        sub_filter_types *;
        sub_filter 'href="/' 'href="/spark/';
        sub_filter 'src="/' 'src="/spark/';
        sub_filter_once off;
        # strip prefix when forwarding request
        rewrite ^ $local_path break;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        # fix host (implicit) and add prefix on redirects
        proxy_redirect $SPARK_MASTER $prefix;
    }

    # more things here
}

Here we use sub_filter to rewrite the HTML content the Spark UI sends to the browser, adding in our extra location prefix: /spark
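To see what those two sub_filter rules actually do, you can emulate them with sed on a sample payload. A sketch; the HTML fragment is a made-up stand-in for what the Spark UI might return:

```shell
# A simplified, hypothetical fragment of the Spark UI's HTML:
html='<link rel="stylesheet" href="/static/webui.css"><img src="/static/spark-logo.png">'
# Apply the same substitutions as the two sub_filter directives
# (the global /g flag mirrors sub_filter_once off):
echo "$html" | sed -e 's|href="/|href="/spark/|g' -e 's|src="/|src="/spark/|g'
# -> absolute links now point at /spark/static/... instead of /static/...
```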

And it works, as you can see in this post's image!