I am using Amazon EMR 7.x, which ships with Python 3.9 by default, and I added a custom Python 3.11 installation to the cluster.
Here is my EMR bootstrap script:
#!/usr/bin/env bash
set -e
PYTHON_VERSION=3.11.7
sudo yum --assumeyes install \
  bzip2-devel \
  expat-devel \
  gcc \
  libffi-devel \
  make \
  systemtap-sdt-devel \
  tar \
  zlib-devel
curl --silent --fail --show-error --location "https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tar.xz" | tar -x -J -v
cd "Python-${PYTHON_VERSION}"
export CFLAGS="-march=native"
./configure \
  --enable-loadable-sqlite-extensions \
  --with-dtrace \
  --with-lto \
  --enable-optimizations \
  --with-system-expat \
  --prefix="/usr/local/python${PYTHON_VERSION}"
sudo make altinstall
sudo "/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install --upgrade pip
echo "# Install my Amazon EMR cluster-scoped dependencies"
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/1.5.0/sedona-spark-shaded-3.4_2.12-1.5.0.jar
sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.5.0-28.2/geotools-wrapper-1.5.0-28.2.jar
"/usr/local/python${PYTHON_VERSION}/bin/python${PYTHON_VERSION%.*}" -m pip install \
  apache-sedona[spark]==1.5.0
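A note on the `${PYTHON_VERSION%.*}` expansion used twice above: it strips the patch component of the version, which matches the `python3.11` binary name that `make altinstall` creates. A minimal illustration:

```shell
# "%.*" removes the shortest trailing ".<suffix>" match, so the
# full version 3.11.7 collapses to the major.minor pair used in
# the interpreter's binary name.
PYTHON_VERSION=3.11.7
echo "python${PYTHON_VERSION%.*}"  # prints python3.11
```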
And I have a step validating the Python version:
import sys
from pyspark.sql import SparkSession
SparkSession.builder.getOrCreate()
print(sys.version_info)
# sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
assert (sys.version_info.major, sys.version_info.minor) == (3, 11)
which succeeds too. If I change the assertion to compare against (3, 9) instead, it fails, so I know the step really does validate the version.
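(For the validation step to see 3.11 at all, Spark has to be pointed at the new interpreter. That part is not shown here, but on EMR it is typically done with a configuration classification along these lines; the interpreter path is assumed from the bootstrap prefix above:)

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/local/python3.11.7/bin/python3.11"
        }
      }
    ]
  }
]
```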
When I SSH into the EMR master node, I can see the folder /usr/local/python3.11.7:
[hadoop@ip-172-31-177-28 ~]$ cd /usr/local
[hadoop@ip-172-31-177-28 local]$ ls
bin etc games include lib lib64 libexec man python3.11.7 sbin share src
However, in JupyterLab, when I select the PySpark or Python 3 kernel, the script below still shows that I am using Python 3.9:
import sys
print(sys.version_info)
# sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
If I open a terminal in JupyterLab on this EMR cluster, it shows:
[notebook@ip-10-131-38-159 /]$ cd /usr/local/
[notebook@ip-10-131-38-159 local]$ ls
bin etc games include lib lib64 libexec sbin share src
So I suspect this JupyterLab is running in a Docker container.
How can I add Python 3.11 to JupyterLab? Thanks!
I found out that the JupyterLab Python is separate from the EMR cluster's custom Python version.
I first need to create a new conda Python 3.11 environment for JupyterLab, and then register it as a new kernel.
Because JupyterLab gets installed after the bootstrap script runs, I need to add an EMR step with this script:
#!/usr/bin/env bash
set -e
echo "# Install JupyterLab-scoped dependencies"
PYTHON_VERSION=3.11.7
sudo /emr/notebook-env/bin/conda create --name="python${PYTHON_VERSION}" python=${PYTHON_VERSION} --yes
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m pip install \
  apache-sedona[spark]==1.5.0 \
  attrs==23.1.0 \
  descartes==1.1.0 \
  ipykernel==6.28.0 \
  matplotlib==3.8.2 \
  pandas==2.1.4 \
  shapely==2.0.2
echo "# Add JupyterLab kernel"
sudo "/emr/notebook-env/envs/python${PYTHON_VERSION}/bin/python" -m ipykernel install --name="python${PYTHON_VERSION}"
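For reference, a step like this can be attached to a running cluster with the AWS CLI; the cluster id and S3 locations below are placeholders, not values from this setup:

```shell
# Placeholder cluster id and bucket -- substitute your own.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=AddJupyterKernel,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/add-jupyter-kernel.sh]'
```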
Now the new Python 3.11 kernel shows up in JupyterLab, and it prints the correct Python version:
import sys
print(sys.version_info)
# sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)
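As a side note, `sys.version_info` is a named tuple, so it compares element-wise with plain tuples; that is what makes the `== (3, 11)` assertion in the validation step work, and it also allows minimum-version checks instead of pinning one exact release:

```python
import sys

# Named-tuple comparison is element-wise, so a plain tuple can act
# as a lower bound rather than one exact (major, minor) pair.
print(sys.version_info >= (3, 9))
```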