Tags: pyspark, jupyter-notebook, user-defined-functions, amazon-emr, aws-emr-studio

Simple UDF apply function from the doc is failing with Spark 3.3


This simple example from the latest documentation fails on an EMR Studio Spark cluster (current version: 3.3.1-amzn-0):

import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a pandas.DataFrame holding all rows for one id group
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()

The error looks like this:

An error was encountered:
An error occurred while calling o184.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 59) (ip-10-130-55-119.us-east-1.aws.(website).com executor 7): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --no-pip --system-site-packages virtualenv_application_1693557403809_0024_0
    at org.apache.spark.api.python.VirtualEnvFactory.execCommand(VirtualEnvFactory.scala:125)
    at org.apache.spark.api.python.VirtualEnvFactory.setupVirtualEnv(VirtualEnvFactory.scala:83)
    at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:95)

I am convinced this is a Python package version problem, as another user hit a similar issue with a previous version of Spark (see here). However, I have not managed to find the right versions of pandas/pyarrow to use...
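
For anyone debugging the same symptom, here is a minimal sketch (assuming the notebook is attached to a live SparkSession bound to `spark`, as in EMR notebooks) to compare the pandas/pyarrow versions the driver and the executors actually load; `worker_versions` is a hypothetical helper name:

import pandas as pd
import pyarrow

print("driver:", pd.__version__, pyarrow.__version__)

def worker_versions(_):
    # Imported inside the function so it resolves in the executor's Python env
    import pandas, pyarrow
    yield (pandas.__version__, pyarrow.__version__)

print("executor:",
      spark.sparkContext.parallelize([0], 1)
           .mapPartitions(worker_versions)
           .collect())

If the driver and executor versions disagree, that mismatch is a likely culprit for pandas-UDF failures.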


Solution

  • The solution was to open a ticket with AWS Support, and they fixed the issue. Part of the fix was to put this in the first notebook cell:

    %%configure -f
    {
        "conf": {
            "spark.pyspark.python": "python3",
            "spark.pyspark.virtualenv.enabled": "true",
            "spark.pyspark.virtualenv.type": "native",
            "spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv"
        }
    }
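
Note that %%configure -f force-restarts the session, so it has to run before any other cell. Once the session comes back, a quick sanity check (a sketch using the standard `spark.conf.get` API, which accepts a default for unset keys) that the settings were applied:

for key in ("spark.pyspark.python",
            "spark.pyspark.virtualenv.enabled",
            "spark.pyspark.virtualenv.type",
            "spark.pyspark.virtualenv.bin.path"):
    print(key, "=", spark.conf.get(key, "<not set>"))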