First of all, I don't understand why people downvoted this question. Please explain how I can improve it; I can elaborate further. This is just feedback from my side. Although I am new here, I have no intention of asking a question without putting in effort.
I am trying to run a Spark job written in Scala on a Google Cloud Platform Dataproc cluster; the job uses the Jep interpreter.
What is the full, short way to get Jep running from Scala on a Google Cloud Platform Dataproc cluster?

I have added Jep as a dependency:

"black.ninia" % "jep" % "3.9.0"

and in my install.sh script I have written:
sudo -E pip install jep
export JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
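(For reference, on Linux pip places Jep's native library, libjep.so, inside the installed package directory, which is what the export above points at. A quick sanity check on a node, assuming the same pip environment, is something like:)

# Check that libjep.so is actually where LD_LIBRARY_PATH will point
JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
ls "$JEP_PATH/jep" | grep -i libjep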
Still, I am getting the error below (no jep in java.library.path):
20/01/07 09:07:23 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0 in stage 9.0 (TID 74, fs-xxxx-xxx-xxxx-test-w-1.c.xx-xxxx.internal, executor 1): java.lang.UnsatisfiedLinkError: no jep in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at jep.MainInterpreter.initialize(MainInterpreter.java:128)
at jep.MainInterpreter.getMainInterpreter(MainInterpreter.java:101)
at jep.Jep.<init>(Jep.java:256)
at jep.SharedInterpreter.<init>(SharedInterpreter.java:56)
at dunnhumby.sciencebank.SubsCommons$$anonfun$getUnitVecEmbeddings$1.apply(SubsCommons.scala:33)
at dunnhumby.sciencebank.SubsCommons$$anonfun$getUnitVecEmbeddings$1.apply(SubsCommons.scala:31)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
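(One way to see whether the exports from install.sh actually reached Spark's configuration is to inspect the standard config locations on a cluster node, for example:)

# Look for any library-path settings that Spark will actually read
grep -H "LibraryPath" /etc/spark/conf/spark-defaults.conf
grep -H "LD_LIBRARY_PATH" /etc/spark/conf/spark-env.sh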
(Edited):
1.) I have seen specific answers for on-premises machines, but not for Google Cloud Platform.
2.) I found https://github.com/ninia/jep/issues/141, but it did not help.
3.) I also found a similar question, but it is unanswered and has no accepted answer for Google Cloud Platform either, and I have already performed all the steps from there.
4.) If the question is missing any snapshots, I will attach them. But please provide some comments.
(Edited 08012020: adding the install.sh I used:)
#!/bin/bash
set -x -e
# Disable ipv6 since it seems to cause intermittent SocketTimeoutException when collecting data
# See CENG-1268 in Jira
printf "\nnet.ipv6.conf.default.disable_ipv6=1\nnet.ipv6.conf.all.disable_ipv6=1\n" >> /etc/sysctl.conf
sysctl -p
if [[ $(/usr/share/google/get_metadata_value attributes/dataproc-role) == Master ]]; then
config_bucket="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-configuration-directory | cut -d'/' -f3)"
dataproc_cluster_name="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)"
hdfs dfs -mkdir -p gs://${config_bucket}/${dataproc_cluster_name}/spark_events
systemctl restart spark-history-server.service
fi
tee -a /etc/hosts << EOM
$(/usr/share/google/get_metadata_value attributes/preprod-mjr-dataplatform-metrics-mig-ip) influxdb
EOM
echo "[global]
index-url = https://cs-anonymous:XXXXXXXX@artifactory.xxxxxxxx.com/artifactory/api/pypi/pypi-remote/simple" >/etc/pip.conf
PIP_REQUIREMENTS_FILE=gs://preprod-xxx-dpl-artif/dataproc/requirements.txt
PIP_TRANSITIVE_REQUIREMENTS_FILE=gs://preprod-xxx-dpl-artif/dataproc/transitive-requirements.txt
gsutil cp ${PIP_REQUIREMENTS_FILE} .
gsutil cp ${PIP_TRANSITIVE_REQUIREMENTS_FILE} .
gsutil -q cp gs://preprod-xxx-dpl-artif/dataproc/apt-transport-https_1.4.8_amd64.deb /tmp/apt-transport-https_1.4.8_amd64.deb
export http_proxy=http://preprod-xxx-securecomms.preprod-xxx-securecomms.il4.us-east1.lb.dh-xxxxx-media-55595.internal:3128
export https_proxy=http://preprod-xxx-securecomms.preprod-xxx-securecomms.il4.us-east1.lb.dh-xxxxx-media-55595.internal:3128
export no_proxy=google.com,googleapis.com,localhost
echo "deb https://cs-anonymous:Welcome123@artifactory.xxxxxxxx.com/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list
echo "deb https://cs-anonymous:Welcome123@artifactory.xxxxxxxx.com/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list
echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update
echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout
echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout
sudo dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb
sudo apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb
sudo -E apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
sudo -E apt-get --allow-unauthenticated -y install python-pip gcc python-dev python-tk curl
#requires index-url specifying because the version of pip installed by previous command
#installs an old version that doesn't seem to recognise pip.conf
sudo -E pip install --index-url https://cs-anonymous:xxxxxxx@artifactory.xxxxxxxx.com/artifactory/api/pypi/pypi-remote/simple --ignore-installed pip setuptools wheel
sudo -E pip install jep
sudo -E pip install gensim
JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
cat << EOF >> /etc/spark/conf/spark-env.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOF
tee -a /etc/spark/conf/spark-defaults.conf << EOM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOM
tee -a /etc/*bashrc << EOM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOM
source /etc/*bashrc
sudo -E apt-get install --allow-unauthenticated -y \
pkg-config \
freetype* \
python-matplotlib \
libpq-dev \
libssl-dev \
libcrypto* \
python-dev \
libtext-csv-xs-perl \
libmysqlclient-dev \
libfreetype* \
libzmq3-dev \
libzmq3*
sudo -E pip install -r ./requirements.txt
Assuming you're using install.sh as an initialization action for Dataproc, your export commands only export those environment variables in the local shell session that runs the init action, not persistently for all Spark processes that run afterwards.

The way to have Spark pick up custom environment variables is to add them to /etc/spark/conf/spark-env.sh. Here's a Spark user-list discussion about how to set java.library.path in Spark.

Essentially, you can use a heredoc in your init action around the parts that export environment variables. However, as shown in https://issues.apache.org/jira/browse/SPARK-1719, the environment variable alone is not enough to propagate the library path to executors running under YARN; Spark explicitly sets the executor's library path rather than passing LD_LIBRARY_PATH through, so you must also set spark.executor.extraLibraryPath in spark-defaults.conf:
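# Resolve where pip installed the jep package (the directory containing libjep.so).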
JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
# spark-env.sh for driver process.
cat << EOF >> /etc/spark/conf/spark-env.sh
# Note the backslash before $LD_LIBRARY_PATH on the right-hand side;
# it is important that the variable is evaluated inside spark-env.sh rather
# than clobbered with the local $LD_LIBRARY_PATH of the process running
# the init action.
export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$JEP_PATH/jep
EOF
# For executor processes
cat << EOF >> /etc/spark/conf/spark-defaults.conf
spark.executor.extraLibraryPath=$JEP_PATH/jep
EOF
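As an alternative sketch: if the jep location is known up front, the same executor property can also be supplied through Dataproc's --properties flag instead of appending to spark-defaults.conf in the init action. The cluster name, bucket, jar, main class, and dist-packages path below are placeholders.

# Hypothetical: set the executor library path at cluster creation time
# (the "spark:" prefix targets spark-defaults.conf).
gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://my-bucket/dataproc/install.sh \
  --properties 'spark:spark.executor.extraLibraryPath=/usr/local/lib/python2.7/dist-packages/jep'

# Or per job, at submit time:
gcloud dataproc jobs submit spark --cluster my-cluster \
  --class your.main.Class --jars gs://my-bucket/jars/your-job.jar \
  --properties spark.executor.extraLibraryPath=/usr/local/lib/python2.7/dist-packages/jep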