Recently I tried to use a Jupyter notebook to test some data science pipelines in Spark. This project turned out to be more difficult than expected, with a couple of nasty errors and the promise of a new blog post.

TL;DR:
Endless problems installing a Scala Spark kernel in an existing Jupyter notebook. The solution I found is to use a Docker image that comes with Jupyter and Spark preinstalled. Far from perfect.

docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook

Some time ago there was spark-kernel, which has since been renamed Apache Toree. I downloaded Spark 2.2.0 and set $SPARK_HOME accordingly:

export SPARK_HOME=/bigdata/spark

Then I installed Toree with pip install toree and registered the kernel with jupyter toree install. I started Jupyter and selected the Scala kernel (Toree). However, it was not possible to run any command. This nasty error was the result:

Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
	at akka.actor.ActorCell$.<init>(ActorCell.scala:336)
	at akka.actor.ActorCell$.<clinit>(ActorCell.scala)
	at akka.actor.RootActorPath.$div(ActorPath.scala:185)
	at akka.actor.LocalActorRefProvider.<init>(ActorRefProvider.scala:465)
	at akka.actor.LocalActorRefProvider.<init>(ActorRefProvider.scala:453)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
	at scala.util.Try$.apply(Try.scala:192)
	at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
	at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
	at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
	at scala.util.Success.flatMap(Try.scala:231)
	at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
	at akka.actor.ActorSystemImpl.liftedTree1$1(ActorSystem.scala:585)
	at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:578)
	at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
	at akka.actor.ActorSystem$.apply(ActorSystem.scala:109)
	at org.apache.toree.boot.layer.StandardBareInitialization$class.createActorSystem(BareInitialization.scala:71)
	at org.apache.toree.Main$$anon$1.createActorSystem(Main.scala:34)
	at org.apache.toree.boot.layer.StandardBareInitialization$class.initializeBare(BareInitialization.scala:60)
	at org.apache.toree.Main$$anon$1.initializeBare(Main.scala:34)
	at org.apache.toree.boot.KernelBootstrap.initialize(KernelBootstrap.scala:70)
	at org.apache.toree.Main$delayedInit$body.apply(Main.scala:39)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.apache.toree.Main$.main(Main.scala:23)
	at org.apache.toree.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[W 10:42:31.352 NotebookApp] Timeout waiting for kernel_info reply from 81

After some time hunting for a solution, I found an explanation: the Toree version installed (1.X) only supports Spark up to version 1.6, so no fancy 2.X :(

However, not everything is lost! The solution is to compile the new Toree version from source. First of all, make sure you have gpg and Docker installed (guide for OS X installation), and that Docker is running!

Then, run:

pip install py4j
git clone https://github.com/apache/incubator-toree.git
cd incubator-toree
make clean release APACHE_SPARK_VERSION=2.2.0
pip install --upgrade ./dist/toree-pip/toree-0.2.0.dev1.tar.gz
pip freeze | grep toree 
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]" --interpreters=Scala,PySpark,SparkR,SQL

This makes it possible to use pyspark in a regular Python notebook (import pyspark), but it does not solve the problem. The previous error changes a bit and now says:

Exception in thread "main" scala.reflect.internal.FatalError: package scala does not have a member Int
	at scala.reflect.internal.Definitions$DefinitionsClass.scala$reflect$internal$Definitions$DefinitionsClass$$fatalMissingSymbol(Definitions.scala:1186)
	at scala.reflect.internal.Definitions$DefinitionsClass.getMember(Definitions.scala:1203)
...

It is very similar to the previous error: the Spark initializer can't find some Scala classes. I tried using APACHE_SPARK_VERSION=2.1.0, but this had no effect.

Maybe it is because of the Scala version. Since Spark 2.0, Spark is built with Scala 2.11 by default, so maybe we just have to downgrade the Scala version to 2.10 (and yaaay, build Spark from source).
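
A cheap way to check this hypothesis before rebuilding anything is to look at which scala-library jar the Spark distribution bundles. The sketch below assumes the Spark 2.x binary layout, where the jars live under $SPARK_HOME/jars and include a file named like scala-library-2.11.8.jar. The function names are invented for this illustration.

```python
import os
import re
from pathlib import Path

# Matches names such as "scala-library-2.11.8.jar" (assumed naming scheme).
_SCALA_JAR = re.compile(r"scala-library-(\d+)\.(\d+)\.(\d+)\.jar")


def scala_version_from_jar_names(names):
    """Return (major, minor, patch) of the first scala-library jar, or None."""
    for name in names:
        match = _SCALA_JAR.fullmatch(name)
        if match:
            return tuple(int(part) for part in match.groups())
    return None


def spark_scala_version(spark_home):
    """Inspect <spark_home>/jars; return None if the directory is missing."""
    jars_dir = Path(spark_home) / "jars"
    if not jars_dir.is_dir():
        return None
    return scala_version_from_jar_names(sorted(p.name for p in jars_dir.iterdir()))


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    print(spark_scala_version(home) if home else "SPARK_HOME is not set")
```

If the reported version starts with 2.11 while the kernel was built against 2.10 (or the other way round), a NoSuchMethodError inside the scala package is a plausible outcome.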

  1. Install Maven. On OS X, use brew install maven
  2. Download the Spark 2.2.0 source code from here
    spark-source
  3. Let's compile Spark 2.2.0 with Scala 2.10. This will take a loooong time.
tar xvf spark-2.2.0.tgz
cd spark-2.2.0
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Dscala-2.10 -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package

Once it has finished, you can check that it works by running bin/spark-shell. You should see the correct Scala version:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
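
If you want to script this check instead of reading the banner by eye, the version can be pulled out of the "Using Scala version ..." line. This is only a sketch: it assumes the banner keeps the wording shown above, and both helper names are hypothetical.

```python
import re


def scala_version_from_banner(banner):
    """Extract (major, minor, patch) from spark-shell banner text, or None."""
    match = re.search(r"Using Scala version (\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(part) for part in match.groups()) if match else None


def scala_binary_version(version):
    """Reduce a full version tuple to its binary version, e.g. '2.10'."""
    return "{}.{}".format(version[0], version[1])


banner = "Using Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)"
version = scala_version_from_banner(banner)
# The rebuild took effect only if the binary version is the one we asked for.
print(scala_binary_version(version) == "2.10")
```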

  4. Now, update your $SPARK_HOME variable to point to the new location (in our case, spark-2.2.0)
  5. Run
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]" --interpreters=Scala,PySpark,SparkR,SQL
jupyter notebook
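
It is also worth confirming that the installed kernel really points at the freshly built Spark and not at the old one. jupyter kernelspec list prints the directory of each installed kernel, and each directory holds a kernel.json file. The sketch below assumes that Toree records SPARK_HOME in the env section of that file; treat that as an assumption and check your own kernel.json. The function names are hypothetical.

```python
import json
from pathlib import Path


def kernel_env(kernel_dir):
    """Return the "env" mapping from <kernel_dir>/kernel.json (empty if absent)."""
    text = (Path(kernel_dir) / "kernel.json").read_text(encoding="utf-8")
    return json.loads(text).get("env", {})


def points_at(kernel_dir, expected_spark_home):
    """True if the kernel's recorded SPARK_HOME resolves to the expected directory."""
    recorded = kernel_env(kernel_dir).get("SPARK_HOME")
    if recorded is None:
        return False
    return Path(recorded).resolve() == Path(expected_spark_home).resolve()
```

Call points_at with the kernel directory reported by jupyter kernelspec list and the path of the new Spark build. A False result means the kernel is still launching the previous Spark installation.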

But the kernel failed miserably. Once again. And now with a new error:

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
	at org.apache.toree.boot.CommandLineOptions.toConfig(CommandLineOptions.scala:142)
	at org.apache.toree.Main$$anon$1.<init>(Main.scala:35)

So, for now, the solution I have found is far from perfect and I don't like it, but it works: there is a Docker image with Jupyter and Apache Toree installed, with Spark 2.2.0 and Scala 2.11.

docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook

I will keep trying to solve the problems I found, and I will try to put together a guide that actually works.


Auroras on Jupiter. Photo from NASA