Setting up a Jupyter kernel with the latest version of Spark: the nightmare of versions
Recently I tried to use a Jupyter notebook to test some data science pipelines in Spark. The project turned out to be more difficult than expected, with a couple of nasty errors along the way, and it ended with the promise of a new blog post.
TL;DR:
Endless problems installing a Scala/Spark kernel into an existing Jupyter installation. The solution I found is to use a Docker image that comes with Jupyter and Spark pre-installed. Far from perfect, but it works:
docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook
Some time ago there was a project called spark-kernel, which has since been renamed to Apache Toree. I downloaded Spark 2.2.0 and set the $SPARK_HOME location accordingly:
export SPARK_HOME=/bigdata/spark
Then I installed Toree directly with pip install toree, and installed the kernel into Jupyter with jupyter toree install. I booted up Jupyter and selected the Scala (Toree) kernel. However, it was not possible to run any command. This nasty error was the result:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
at akka.actor.ActorCell$.<init>(ActorCell.scala:336)
at akka.actor.ActorCell$.<clinit>(ActorCell.scala)
at akka.actor.RootActorPath.$div(ActorPath.scala:185)
at akka.actor.LocalActorRefProvider.<init>(ActorRefProvider.scala:465)
at akka.actor.LocalActorRefProvider.<init>(ActorRefProvider.scala:453)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:192)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:231)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.actor.ActorSystemImpl.liftedTree1$1(ActorSystem.scala:585)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:578)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:109)
at org.apache.toree.boot.layer.StandardBareInitialization$class.createActorSystem(BareInitialization.scala:71)
at org.apache.toree.Main$$anon$1.createActorSystem(Main.scala:34)
at org.apache.toree.boot.layer.StandardBareInitialization$class.initializeBare(BareInitialization.scala:60)
at org.apache.toree.Main$$anon$1.initializeBare(Main.scala:34)
at org.apache.toree.boot.KernelBootstrap.initialize(KernelBootstrap.scala:70)
at org.apache.toree.Main$delayedInit$body.apply(Main.scala:39)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at org.apache.toree.Main$.main(Main.scala:23)
at org.apache.toree.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[W 10:42:31.352 NotebookApp] Timeout waiting for kernel_info reply from 81
After some time hunting for a solution, I found an explanation: the Toree version installed from pip (1.x) only supports Spark up to version 1.6, so no fancy 2.x :(
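If you want to double-check this on your own machine, pip can report the installed version directly:
pip show toree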
However, not everything is lost! The solution is to compile the new Toree version from source. First of all, make sure you have gpg and docker installed (guide for macOS installation). And Docker must be running!
Then, run:
pip install py4j
git clone https://github.com/apache/incubator-toree.git
cd incubator-toree
# Build Toree against Spark 2.2.0 (this step is what needs gpg and a running Docker daemon)
make clean release APACHE_SPARK_VERSION=2.2.0
# Install the freshly built pip package and confirm it is there
pip install --upgrade ./dist/toree-pip/toree-0.2.0.dev1.tar.gz
pip freeze | grep toree
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]" --interpreters=Scala,PySpark,SparkR,SQL
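To confirm what actually got registered, jupyter kernelspec list shows every installed kernelspec (the exact names Toree generates for each interpreter may vary):
jupyter kernelspec list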
This setup at least makes pyspark importable from a regular Python notebook (import pyspark).
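A quick sanity check from the shell (assuming python points to the same environment where the package was installed):
python -c "import pyspark; print(pyspark.__version__)"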
However, it does not solve the main problem. The previous error changes a bit, and now it says:
Exception in thread "main" scala.reflect.internal.FatalError: package scala does not have a member Int
at scala.reflect.internal.Definitions$DefinitionsClass.scala$reflect$internal$Definitions$DefinitionsClass$$fatalMissingSymbol(Definitions.scala:1186)
at scala.reflect.internal.Definitions$DefinitionsClass.getMember(Definitions.scala:1203)
...
It is very similar to the previous error: the Spark initializer can't find some Scala classes. I tried building with APACHE_SPARK_VERSION=2.1.0 instead, but it had no effect.
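Concretely, that meant re-running the release build with the other version and reinstalling the result (the tarball name below is from my 2.2.0 build and may differ for yours):
make clean release APACHE_SPARK_VERSION=2.1.0
pip install --upgrade ./dist/toree-pip/toree-0.2.0.dev1.tar.gz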
Maybe it is the Scala version. Since Spark 2.0, the prebuilt packages ship with Scala 2.11, so maybe we just have to downgrade the Scala version to 2.10 (and, yaaay, build Spark from source).
- Install Maven. On macOS, use brew install maven.
- Download the Spark 2.2.0 source code from here.
- Compile Spark 2.2.0 against Scala 2.10. This will take a loooong time.
tar xvf spark-2.2.0.tgz
cd spark-2.2.0
# Give Maven enough memory for the build
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
# Switch the build's POMs to Scala 2.10, then build
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Dscala-2.10 -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package
Once it has finished, you can check that it works by executing bin/spark-shell. You will see the correct Scala version in the banner:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
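If you prefer a non-interactive check, spark-submit --version prints the same banner, including the Scala version:
./bin/spark-submit --version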
- Now, update your $SPARK_HOME variable to point to the new location (in our case, the freshly built spark-2.2.0 directory; see the example after this list).
- Run:
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark" --spark_opts="--master=local[*]" --interpreters=Scala,PySpark,SparkR,SQL
jupyter notebook
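The $SPARK_HOME update from the first step could look like this (the /bigdata path is just my setup; point it to wherever you built Spark):
export SPARK_HOME=/bigdata/spark-2.2.0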
But it failed miserably. Again. And this time with a new error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.toree.boot.CommandLineOptions.toConfig(CommandLineOptions.scala:142)
at org.apache.toree.Main$$anon$1.<init>(Main.scala:35)
So, for now, the solution I have found is far from perfect, and I don't like it, but it works: there is a Docker image with Jupyter and Apache Toree installed, along with Spark 2.2.0 and Scala 2.11.
docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook
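If you want your notebooks to survive the --rm, mount a local directory into the image's work folder (these Jupyter images use /home/jovyan/work as the notebook directory):
docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/all-spark-notebook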
I will keep trying to solve the problems I found, and I will try to put together a guide that actually works.
Auroras on Jupiter. Photo from NASA.