I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster.
Currently I'm just trying to get some basic examples up and running to validate that I've got everything configured properly. The problem I am having is that I'm not seeing the performance I expect from the ATLAS BLAS libraries on the Amazon machine instance.
Below is a code snippet of my simple benchmark. It's just a square matrix multiply, followed by a short, fat multiply and a tall, thin multiply to yield a small matrix that can be printed (I wanted to be sure Scala would not skip any part of the computation due to lazy evaluation).
I'm using Breeze for the linear algebra library and netlib-java to pull in the local native libraries for BLAS/LAPACK.
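For reference, the dependency setup is roughly as follows (shown here in sbt syntax as a sketch; my actual project is built as a jar-with-dependencies assembly, and the version numbers are only illustrative):

// build.sbt (sketch) -- breeze-natives pulls in the netlib-java native loaders
scalaVersion := "2.10.6"                                          // assumed; match the cluster's Spark build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % "1.6.1" % "provided",  // Spark itself is provided by EMR at runtime
  "org.scalanlp"     %% "breeze"         % "0.12",
  "org.scalanlp"     %% "breeze-natives" % "0.12"                 // enables native BLAS/LAPACK via netlib-java
)

The benchmark code itself: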
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.SparkConf
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import scala.reflect.ClassTag
object App {
  def NaiveMultiplication(n: Int): Unit = {
    val vl = java.text.NumberFormat.getIntegerInstance.format(n)
    println(s"Naive Multiplication with vector length $vl")
    println(blas.getClass().getName())                     // report which BLAS implementation netlib-java loaded
    val sm: DenseMatrix[Double] = DenseMatrix.rand(n, n)   // n x n square matrix
    val a: DenseMatrix[Double] = DenseMatrix.rand(2, n)    // short, fat matrix
    val b: DenseMatrix[Double] = DenseMatrix.rand(n, 3)    // tall, thin matrix
    val c: DenseMatrix[Double] = sm * sm                   // the n x n multiply being benchmarked
    val cNormal: DenseMatrix[Double] = (a * c) * b         // reduce to a 2 x 3 result so nothing is skipped
    println(s"Dot product of a and b is \n$cNormal")
  }
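The timing harness and argument handling aren't shown above; the main method is essentially just argument parsing plus a System.nanoTime wrapper around the call, roughly like this (simplified sketch, matching the spark-submit invocation shown further down):

  def main(args: Array[String]): Unit = {
    val n = args(0).toInt                    // matrix dimension, e.g. 3000
    val mode = args(1)                       // "naive" selects the benchmark above
    val start = System.nanoTime()
    if (mode == "naive") NaiveMultiplication(n)
    println(f"Elapsed run time: ${(System.nanoTime() - start) / 1e9}%.1fs")
  }
}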
Based on a web survey of benchmarks I'm expecting a 3000x3000 matrix multiply to take approx. 2-4s using a native, optimized BLAS library. When I run locally on my MacBook Air this benchmark completes in 1.8s. When I run it on EMR it completes in approx. 11s (using a g2.2xlarge instance, though similar results were obtained on an m3.xlarge instance). As another cross-check I ran a prebuilt EC2 AMI from the BIDMach project on the same EC2 instance type, g2.2xlarge, and got 2.2s (note, the GPU benchmark for the same calculation yielded 0.047s).
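As a rough sanity check on those expectations (my own back-of-the-envelope arithmetic): a 3000x3000 double-precision multiply is about 2 * 3000^3 ≈ 5.4e10 floating-point operations, so 2-4s corresponds to roughly 13-27 GFLOPS, which seems plausible for an optimized multi-threaded BLAS on a single node, while 11s works out to only about 5 GFLOPS.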
At this point I suspect that netlib-java is not loading the correct lib, but this is where I am stuck. I've gone through the netlib-java README many times and it seems the ATLAS libs are already installed as required (see below):
[hadoop@ip-172-31-3-69 ~]$ ls /usr/lib64/atlas/
libatlas.a libcblas.a libclapack.so libf77blas.so liblapack.so libptcblas.so libptf77blas.so
libatlas.so libcblas.so libclapack.so.3 libf77blas.so.3 liblapack.so.3 libptcblas.so.3 libptf77blas.so.3
libatlas.so.3 libcblas.so.3 libclapack.so.3.0 libf77blas.so.3.0 liblapack.so.3.0 libptcblas.so.3.0 libptf77blas.so.3.0
libatlas.so.3.0 libcblas.so.3.0 libf77blas.a liblapack.a libptcblas.a libptf77blas.a
[hadoop@ip-172-31-3-69 ~]$ cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
[hadoop@ip-172-31-3-69 ~]$ ls /etc/ld.so.conf.d
atlas-x86_64.conf kernel-4.4.11-23.53.amzn1.x86_64.conf kernel-4.4.8-20.46.amzn1.x86_64.conf mysql55-x86_64.conf R-x86_64.conf
[hadoop@ip-172-31-3-69 ~]$ cat /etc/ld.so.conf.d/atlas-x86_64.conf
/usr/lib64/atlas
Below I've shown two examples of running the benchmark on the Amazon EMR instance. The first shows the case where the native system BLAS supposedly loads correctly. The second shows the case where the native BLAS does not load and the package falls back to the reference implementation. So it does appear to be loading a native BLAS, based on the messages and the timing. Compared to running locally on my Mac, the no-BLAS case runs in approximately the same time, but the native BLAS case runs in 1.8s on my Mac compared to 15s in the case below. The info messages are the same on my Mac as on EMR (other than specific dir/file names, etc.).
[hadoop@ip-172-31-3-69 ~]$ spark-submit --class "com.cyberatomics.simplespark.App" --conf "spark.driver.extraClassPath=/home/hadoop/simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar" --master local[4] simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar 3000 naive
Naive Multiplication with vector length 3,000
Jun 16, 2016 12:30:39 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /tmp/jniloader2856061049061057802netlib-native_system-linux-x86_64.so
com.github.fommil.netlib.NativeSystemBLAS
Dot product of a and b is
1.677332076284315E9 1.6768329748988206E9 1.692150656424957E9
1.6999000993276503E9 1.6993872020220244E9 1.7149145239563465E9
Elapsed run time: 15.1s
[hadoop@ip-172-31-3-69 ~]$
[hadoop@ip-172-31-3-69 ~]$ spark-submit --class "com.cyberatomics.simplespark.App" --master local[4] simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar 3000 naive
Naive Multiplication with vector length 3,000
Jun 16, 2016 12:31:32 AM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Jun 16, 2016 12:31:32 AM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
com.github.fommil.netlib.F2jBLAS
Dot product of a and b is
1.6640545115052865E9 1.6814609592261212E9 1.7062846398842275E9
1.64471099826913E9 1.6619129531594608E9 1.6864479674870768E9
Elapsed run time: 28.7s
At this point my best guess is that it is actually loading a native lib, but it is loading a generic one. Any suggestions on how I can verify which shared library it is picking up at run time? I tried 'ldd', but that seems not to work with spark-submit. Or maybe my expectations for ATLAS are wrong, but it seems hard to believe AWS would pre-install the libs if they weren't running at reasonably competitive speeds.
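One idea I'm considering is to dump the native libraries mapped into the JVM from inside the job itself by reading /proc/self/maps after the first multiply. A rough sketch (Linux only, untested on EMR):

  import scala.io.Source

  // Print every file-backed shared object currently mapped into this JVM process.
  // Call this after the first matrix multiply so the BLAS native lib has already been loaded.
  def printLoadedNativeLibs(): Unit = {
    val libs = Source.fromFile("/proc/self/maps").getLines()
      .filter(_.contains(".so"))
      .map(_.split("\\s+").last)   // last column of each mapping is the file path
      .toSet
    libs.toSeq.sorted.foreach(println)
  }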
If you see that the libs are not linked up correctly on EMR, please provide guidance on what I need to do in order for the ATLAS libs to get picked up by netlib-java.
thanks tim
Follow-up:
My tentative conclusion is that the ATLAS libs installed by default on the Amazon EMR instance are simply slow. Either they are a generic build that has not been optimized for the specific machine type, or they are fundamentally slower than other libraries. Using this script as a guide, I built and installed OpenBLAS for the specific machine type where I was running the benchmarks (I also found some helpful info here). Once OpenBLAS was installed, my 3000x3000 matrix multiply benchmark completed in 3.9s (as compared to the 15.1s listed above when using the default ATLAS libs). This is still slower than the same benchmark run on my Mac (by a factor of about 2), but this difference falls in a range that could credibly be due to underlying hardware performance.
Here is a complete listing of the commands I used to install the OpenBLAS libs on Amazon's EMR Spark instance:
sudo yum install git
git clone https://github.com/xianyi/OpenBlas.git
cd OpenBlas/
make clean
make -j4
sudo mkdir /usr/lib64/OpenBLAS
sudo chmod o+w,g+w /usr/lib64/OpenBLAS/
make PREFIX=/usr/lib64/OpenBLAS install
sudo rm /etc/ld.so.conf.d/atlas-x86_64.conf
sudo ldconfig
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so.3
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so.3.5
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so.3
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so.3.5