I would like to use Intel BigDL in notebooks on Data Science Experience on Cloud.
How can I install it?
If your notebooks are backed by an Apache Spark as a Service instance in DSX, installing BigDL is simple. First, though, you need two pieces of version information: the Spark version of your service instance and the BigDL version you want to install.
With this information, you can determine the URL of the required BigDL JAR file in the Maven repository.
For the example versions, BigDL 0.3.0 with Spark 2.1, the download URL is
https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar
For other versions, replace 0.3.0 and 2.1 in that URL as required. Note that both versions appear twice, once in the path and once in the filename.
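The substitution rule above can be sketched in Python so that each version is spelled out only once. The helper name bigdl_jar_url and the constant MAVEN_BASE are hypothetical, introduced here for illustration; the URL layout is taken from the example URL above, and the sketch does not check that a JAR actually exists for a given combination of versions.

```python
# Hypothetical helper: builds the Maven URL of the BigDL JAR from the two
# version strings, following the layout of the example URL above.
MAVEN_BASE = "https://repo1.maven.org/maven2/com/intel/analytics/bigdl"


def bigdl_jar_url(spark_version, bigdl_version):
    """Return the download URL for the given Spark and BigDL versions."""
    # Both versions appear twice: once in the path, once in the filename.
    artifact = "bigdl-SPARK_{}".format(spark_version)
    filename = "{}-{}-jar-with-dependencies.jar".format(artifact, bigdl_version)
    return "{}/{}/{}/{}".format(MAVEN_BASE, artifact, bigdl_version, filename)


print(bigdl_jar_url("2.1", "0.3.0"))
```

For the example versions, this prints the same URL as shown above.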
You need both the JAR and the matching Python package. The Python package depends only on the version of BigDL, not on the Spark version. The installation steps can be executed from a Python notebook:
Install the JAR.
!(export sv=2.1 bv=0.3.0 ; cd ~/data/libs/ && wget https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_${sv}/${bv}/bigdl-SPARK_${sv}-${bv}-jar-with-dependencies.jar)
Here, the versions of Spark (sv) and BigDL (bv) are defined as environment variables, so you can easily adjust them without having to change the URL.
Install the Python module.
!pip install bigdl==0.3.0 --no-deps | cat
If you want to switch your notebooks between Python versions, execute this step once with each Python version.
(Without --no-deps, a conflicting version of pyspark would be installed.)
After restarting the notebook kernel, BigDL is ready for use.
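To confirm that both pieces ended up where expected, a quick check along the following lines may help. This is a minimal sketch, assuming a Python 3 kernel and the ~/data/libs directory used above; the function name bigdl_status is hypothetical. It only looks for files and an importable package, and does not verify that the JAR and the Python package versions match.

```python
import glob
import importlib.util
import os


def bigdl_status(libs_dir="~/data/libs"):
    """Return (jar_paths, module_found) as a quick sanity check."""
    # Look for BigDL JARs directly in the library directory (not recursive).
    pattern = os.path.join(os.path.expanduser(libs_dir), "bigdl-*.jar")
    jars = sorted(glob.glob(pattern))
    # find_spec returns None when no top-level package named bigdl exists.
    module_found = importlib.util.find_spec("bigdl") is not None
    return jars, module_found


jars, module_found = bigdl_status()
print("JARs:", jars)
print("Python module importable:", module_found)
```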
If you install the JAR as described above for Python, it is also available in Scala kernels.
If you want to use BigDL exclusively with Scala, it is better not to install the JAR at all. Instead, use the %AddJar magic at the beginning of the notebook. It's best to do this in the very first code cell, to avoid class loading issues.
%AddJar https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar
By not installing the JAR, you gain the flexibility of using different versions of Spark and BigDL in different Scala notebooks sharing the same service. As soon as you install a JAR, you're likely to run into conflicts between that one and the one you pull in with %AddJar.
If you want to install a different version of BigDL, you'll have to clean up first. Here are commands to check what is installed, and to get rid of it. Execute these commands from a Python notebook.
Check what JAR is installed. If the output is empty, none is installed.
!find ~/data/libs -name bigdl-\*
Check what Python module is installed. If the output is empty, BigDL is not installed.
!pip freeze | grep -i BigDL
Remove installed BigDL JARs.
!find ~/data/libs -name bigdl-\* -exec rm -vf {} +
Remove the installed BigDL Python module for the current Python version.
!rm -rf ~/.local/lib/python${_py_version_}/site-packages/{bigdl,BigDL}*
If re-installation fails with a "multiple dist-info directories" message, also execute:
!rm -rf $PIP_BUILD
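If you prefer to preview the JAR cleanup before deleting anything, the find commands above can be approximated in plain Python. This is a sketch under assumptions: the function names find_bigdl_jars and remove_bigdl_jars are hypothetical, the directory is the ~/data/libs used above, and only regular files are considered. It does not touch the Python module or the pip build directory.

```python
import os


def find_bigdl_jars(libs_dir="~/data/libs"):
    """Recursively collect files whose name starts with 'bigdl-'."""
    root = os.path.expanduser(libs_dir)
    found = []
    # os.walk yields nothing if the directory does not exist.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("bigdl-"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def remove_bigdl_jars(libs_dir="~/data/libs", dry_run=True):
    """Return the matching files; delete them only if dry_run is False."""
    jars = find_bigdl_jars(libs_dir)
    if not dry_run:
        for path in jars:
            os.remove(path)
    return jars


# Dry run: lists what would be removed without deleting anything.
print(remove_bigdl_jars())
```

Calling remove_bigdl_jars(dry_run=False) performs the actual deletion. Restart the kernel afterwards, as after installation.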