I am using Amazon EMR 7.x, which ships with Python 3.9 by default, and I added a custom Python 3.11 installation to the cluster.
Here is my EMR bootstrap script:
#!/usr/bin/env bash
set -e
PYTHON_VERSION=3.11.7
sudo yum --assumeyes install \
  bzip2-devel \
  expat-devel \
  gcc \
  libffi-devel \
  make \
  systemtap-sdt-devel \
  tar \
  zlib-devel
curl --silent --fail --show-error --location "https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tar.xz" | tar -x -J -v
cd "Python-${PYTHON_VERSION}"
export CFLAGS="-march=native"
./configure \
  --enable-loadable-sqlite-extensions \
  --with-dtrace \
  --with-lto \
  --enable-optimizations \
  --with-system-expat \
  --prefix="/usr/local/python${PYTHON_VERSION}"
sudo make altinstall
sudo "/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install --upgrade pip
echo "# Install my Amazon EMR cluster-scoped dependencies"
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/1.5.0/sedona-spark-shaded-3.4_2.12-1.5.0.jar
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.5.0-28.2/geotools-wrapper-1.5.0-28.2.jar
"/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install \
  apache-sedona[spark]==1.5.0
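A note on the `${PYTHON_VERSION%.*}` expansion used twice above: it strips the patch component of the version, which matches the `python3.11` binary name that `make altinstall` creates. A minimal illustration:

```shell
# "%.*" removes the shortest trailing ".<suffix>" match, so the
# full version 3.11.7 collapses to the major.minor pair used in
# the interpreter's binary name.
PYTHON_VERSION=3.11.7
echo "python${PYTHON_VERSION%.*}"  # prints python3.11
```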
And I have a step validating the Python version:
import sys
from pyspark.sql import SparkSession
SparkSession.builder.getOrCreate()
print(sys.version_info)
# sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
assert (sys.version_info.major, sys.version_info.minor) == (3, 11)
which succeeds too. If I change the assertion to compare against (3, 9) instead, it fails, so I know the step really does validate the version.
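(For the validation step to see 3.11 at all, Spark has to be pointed at the new interpreter. That part is not shown here, but on EMR it is typically done with a configuration classification along these lines; the interpreter path is assumed from the bootstrap prefix above:)

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/local/python3.11.7/bin/python3.11"
        }
      }
    ]
  }
]
```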
When I SSH into the EMR master node, I can see the folder /usr/local/python3.11.7:
[hadoop@ip-172-31-177-28 ~]$ cd /usr/local
[hadoop@ip-172-31-177-28 local]$ ls
bin etc games include lib lib64 libexec man python3.11.7 sbin share src
However, in JupyterLab, when I select the PySpark or Python 3 kernel, the script below still shows that I am using Python 3.9:
import sys
print(sys.version_info)
# sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
If I open a terminal in JupyterLab on this EMR cluster, it shows:
[notebook@ip-10-131-38-159 /]$ cd /usr/local/
[notebook@ip-10-131-38-159 local]$ ls
bin etc games include lib lib64 libexec sbin share src
So I suspect this JupyterLab is running in a Docker container.
How can I add Python 3.11 to JupyterLab? Thanks!
I found out that the JupyterLab Python is separate from the EMR cluster's custom Python version.
I first need to create a new conda Python 3.11 environment for JupyterLab, and then register it as a new kernel.
Because JupyterLab gets installed after the bootstrap script runs, I need to add an EMR step with this script:
#!/usr/bin/env bash
set -e
echo "# Install JupyterLab-scoped dependencies"
PYTHON_VERSION=3.11.7
sudo /emr/notebook-env/bin/conda create --name="python${PYTHON_VERSION}" python=${PYTHON_VERSION} --yes
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m pip install \
  apache-sedona[spark]==1.5.0 \
  attrs==23.1.0 \
  descartes==1.1.0 \
  ipykernel==6.28.0 \
  matplotlib==3.8.2 \
  pandas==2.1.4 \
  shapely==2.0.2
echo "# Add JupyterLab kernel"
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m ipykernel install --name="python${PYTHON_VERSION}"
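For reference, a step like this can be attached to a running cluster with the AWS CLI; the cluster id and S3 locations below are placeholders, not values from this setup:

```shell
# Placeholder cluster id and bucket -- substitute your own.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=AddJupyterKernel,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/add-jupyter-kernel.sh]'
```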
Now the new Python 3.11 kernel shows up in JupyterLab, and it prints the correct Python version:
import sys
print(sys.version_info)
# sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
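As a side note, `sys.version_info` is a named tuple, so it compares element-wise with plain tuples; that is what makes the `== (3, 11)` assertion in the validation step work, and it also allows minimum-version checks instead of pinning one exact release:

```python
import sys

# Named-tuple comparison is element-wise, so a plain tuple can act
# as a lower bound rather than one exact (major, minor) pair.
print(sys.version_info >= (3, 9))
```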