apache-spark pyspark blas arpack netlib

How to properly set up native ARPACK for Spark 2.2.0


I get the following warnings when I run my PySpark job:

17/10/06 18:27:16 WARN ARPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK

17/10/06 18:27:16 WARN ARPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefARPACK

My code is:

from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(tf_rdd_vec.cache())
svd = mat.computeSVD(num_topics, computeU=False)

I am using an Ubuntu 16.04 EC2 instance, and I have installed the following libraries on the system:

sudo apt install libarpack2 Arpack++ libatlas-base-dev liblapacke-dev libblas-dev gfortran libblas-dev liblapack-dev libnetlib-java libgfortran3 libatlas3-base libopenblas-base

I have set LD_LIBRARY_PATH to point to the shared library path as follows:

export LD_LIBRARY_PATH=/usr/lib/

When I list the $LD_LIBRARY_PATH directory, it shows the following .so files:

ubuntu:~$ ls $LD_LIBRARY_PATH/*.so | grep "pack\|blas"
/usr/lib/libarpack.so
/usr/lib/libblas.so
/usr/lib/libcblas.so
/usr/lib/libf77blas.so
/usr/lib/liblapack_atlas.so
/usr/lib/liblapacke.so
/usr/lib/liblapack.so
/usr/lib/libopenblasp-r0.2.18.so
/usr/lib/libopenblas.so
/usr/lib/libparpack.so

But I am still not able to use the native ARPACK implementation. Also, I am caching the RDD that I pass to the matrix, but it still throws a cache warning. Any suggestions on how to solve these 3 warnings?
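Note that `ls $LD_LIBRARY_PATH/*.so` only works while LD_LIBRARY_PATH holds a single directory. A minimal Python sketch of the same check that also handles a list of directories might look like the following. The function name `find_native_libs` and the default name pattern are illustrative assumptions, not part of any Spark or netlib tooling.

```python
import os
import re


def find_native_libs(search_path, pattern=r"pack|blas"):
    """Return sorted paths of shared libraries whose file names match `pattern`.

    `search_path` uses the LD_LIBRARY_PATH format: directories joined by
    os.pathsep (':' on Linux). Missing or empty entries are skipped.
    """
    name_re = re.compile(pattern)
    found = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            # Accept both "libfoo.so" and versioned names like "libfoo.so.3".
            is_shared_lib = name.endswith(".so") or ".so." in name
            if is_shared_lib and name_re.search(name):
                found.append(os.path.join(directory, name))
    return sorted(found)


if __name__ == "__main__":
    for path in find_native_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

This only confirms that the system libraries are present on disk. It does not show whether the JVM side is able to load them.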

I downloaded the precompiled version of spark-2.2.0 from the Spark download page.


Solution

  • After some exploring, I was able to remove these warnings and use native ARPACK as follows.

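    The solution was to rebuild Spark with the -Pnetlib-lgpl argument. The prebuilt Spark packages do not include the netlib-java native proxies because of their LGPL license, so installing the system libraries alone is not enough.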

    Build Spark for Native Support

    These are the steps I followed on Ubuntu 16.04:

    # Make sure you use the correct download link from the Spark download section
    wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0.tgz
    tar -xpf spark-2.2.0.tgz
    cd spark-2.2.0/
    ./dev/make-distribution.sh --name custom-spark --pip --tgz -Psparkr -Phadoop-2.7 -Pnetlib-lgpl
    

    The first time I ran it, the build failed with the following error:

    Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly installed.
    [ERROR] Command execution failed.
    [TRUNCATED]
    [INFO] BUILD FAILURE
    [INFO]
    [INFO] Total time: 02:38 min (Wall Clock)
    [INFO] Finished at: 2017-10-13T21:04:11+00:00
    [INFO] Final Memory: 59M/843M
    [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.5.0:exec (sparkr-pkg) on project spark-core_2.11: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
    [ERROR]

    So I installed R:

    sudo apt install r-base-core
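
A missing prerequisite like this can be caught before starting a multi-minute build. The following is a minimal sketch, assuming Python 3.3+ for `shutil.which`. The helper name `missing_tools` and the tool list are assumptions based on the flags and versions mentioned in this answer (for example, R for -Psparkr), not an official list of Spark build requirements.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed prerequisites, not an authoritative list:
    # java for the Maven build, R for -Psparkr, python for --pip.
    required = ["java", "R", "python"]
    missing = missing_tools(required)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All assumed prerequisites were found on PATH.")
```

If `R` shows up as missing, installing it as shown above should resolve the 'R_HOME' error.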
    

    Then I re-ran the build command above, and it completed successfully.
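
One rough way to check that the new build actually picked up the netlib-java native wrappers is to look for them among the distribution's jars. This is a minimal sketch under two assumptions: that the unpacked distribution has a `jars/` directory, and that the relevant jar file names contain "netlib". The helper name `find_netlib_jars` and the example path are hypothetical.

```python
import os


def find_netlib_jars(jars_dir):
    """Return sorted jar file names in `jars_dir` whose names contain 'netlib'."""
    return sorted(
        name
        for name in os.listdir(jars_dir)
        if name.endswith(".jar") and "netlib" in name.lower()
    )


if __name__ == "__main__":
    # Hypothetical location of the unpacked custom distribution.
    jars_dir = os.path.join("spark-2.2.0", "dist", "jars")
    if os.path.isdir(jars_dir):
        for jar in find_netlib_jars(jars_dir):
            print(jar)
    else:
        print("No jars directory found at " + jars_dir)
```

An empty result is only a hint, since the naming is an assumption. The more reliable test is to rerun the job and confirm that the ARPACK warnings no longer appear.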

    These are the relevant tool versions I used when building this release:

    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    
    $ python --version
    Python 2.7.12
    
    $ R --version
    R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
    Copyright (C) 2015 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    $ make --version
    GNU Make 4.1
    Built for x86_64-pc-linux-gnu