Tags: bash, apache-spark, ini, spark-submit

Spark-submit warns about non-Spark properties from an ini file but works when passed directly


I have an ini file that sets some environment-specific properties.

cmds.ini
[DEV]
spark_submit=spark3-submit

[PROD]
spark_submit=spark3-submit

I am parsing this file in a shell script and defining a spark-submit() function that replaces the original spark-submit command.

source_cmds.sh
#!/bin/bash
env=${1^^}
eval $(
  awk -v section="[$env]" '
    $0 == section {found=1; next}
    /^\[/{found=0}
    found && /^[^#;]/ {
      gsub(/^[ \t]+|[ \t]+$/, "")
      print
    }
  ' /path/to/cmds.ini |
  sed 's/ *= */=/g'
)

spark-submit(){
  $spark_submit "$@"
}
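
To sanity-check the parsing, the awk/sed pipeline can be run on its own to inspect the assignments that eval receives; a small debugging sketch, assuming the same /path/to/cmds.ini as above:

#!/bin/bash
# Print the assignments that would be eval'd for the DEV section,
# without actually running eval.
awk -v section="[DEV]" '
  $0 == section {found=1; next}
  /^\[/{found=0}
  found && /^[^#;]/ {
    gsub(/^[ \t]+|[ \t]+$/, "")
    print
  }
' /path/to/cmds.ini |
sed 's/ *= */=/g'
# Expected output: spark_submit=spark3-submit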

Here's how I source and use this script in another wrapper script (wrapper.sh):

wrapper.sh
#!/bin/bash
source /path/to/source_cmds.sh DEV

spark-submit /path/to/pyspark_script.py

I wanted to include additional environment-specific properties in cmds.ini and updated it as follows:

[DEV]
spark_submit=spark3-submit
conf_args="--conf 'spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/' --conf 'spark.executor.extraJavaOptions=-Djava.io.tmpdir=/tmp/path'"

[PROD]
spark_submit=spark3-submit
conf_args=

I also modified source_cmds.sh to pass the conf_args to the spark-submit function:


spark-submit(){
  $spark_submit $conf_args "$@"
}

Now, when I run wrapper.sh, Spark shows the following warnings:

Warning: Ignoring non-Spark config property: 'spark.driver.extraJavaOptions

Warning: Ignoring non-Spark config property: 'spark.executor.extraJavaOptions

However, passing the same properties directly on the spark-submit command line works without any issues:

spark-submit \
--conf 'spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/' \
--conf 'spark.executor.extraJavaOptions=-Djava.io.tmpdir=/tmp/path' \
/path/to/pyspark_script.py
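
To compare the two invocations, the wrapper can be traced so that each command is printed after expansion and quote removal; a debugging sketch (bash -x shows the exact words handed to spark3-submit):

#!/bin/bash
# Re-run the wrapper with command tracing enabled; the xtrace output
# shows exactly which arguments spark3-submit receives.
bash -x /path/to/wrapper.sh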

Questions:

  1. Why does Spark treat the properties as "non-Spark" when they are read from cmds.ini but work fine when passed directly?
  2. Do I need to change the way conf_args is defined in cmds.ini?
  3. Are there any changes needed in my spark-submit function implementation to properly handle such arguments?

Solution

  • Seems like the issue lies in the way I store conf_args. Because the whole value is one double-quoted string, the embedded single quotes are stored literally; when $conf_args is expanded unquoted, the shell word-splits it but does not strip those quotes, so Spark sees property names that begin with a literal ' (exactly what the warnings show). A short demonstration is included at the end of this answer.

    Fix:

    Store conf_args as a bash array and expand it with "${conf_args[@]}", as below. Each array element is then passed to spark-submit as its own argument.

    cmds.ini

    [DEV]
    spark_submit=spark3-submit
    conf_args=(--conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/" --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/tmp/path")
    
    [PROD]
    spark_submit=spark3-submit
    conf_args=()
    

    source_cmds.sh

    spark-submit(){
      $spark_submit "${conf_args[@]}" "$@"
    }
    

    With this, Spark receives each --conf and its value as a separate argument, and the warnings go away.
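
    The quoting problem can be reproduced without Spark at all. Here is a minimal sketch comparing the old string form with the array form (printf stands in for spark3-submit; variable names are illustrative):

    #!/bin/bash
    # String form: the unquoted expansion is word-split, but the embedded
    # single quotes stay literal, so the property name Spark would see
    # starts with a quote character.
    conf_args_str="--conf 'spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/'"
    printf '<%s>\n' $conf_args_str
    # <--conf>
    # <'spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/'>

    # Array form: each element is already a separate, quote-free word.
    conf_args=(--conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/")
    printf '<%s>\n' "${conf_args[@]}"
    # <--conf>
    # <spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp/path/>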