apache-spark, pyspark, memory, jvm

How do you access user memory in a PySpark application?


How do I access the user memory set aside in a PySpark application?

[Image: diagram of the Spark executor memory layout, showing the User Memory region]

My guess is that this is not possible in a PySpark application, since it is part of the JVM memory, which is not accessible from Python.

If I am correct, then in a PySpark application there is no need to set aside any memory for this (as it is inaccessible), right?


Solution

  • I'm afraid that is not entirely correct. PySpark interacts with the JVM's memory through the py4j module, and you can access this memory through the Java gateway as well.

    For example, let's check the Spark JVM's classpath (these are all the Java classes that could potentially be loaded into your "User Memory"):

    $ pyspark
    Python 3.11.2 (main, Feb 17 2023, 09:28:16) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.4.0.
          /_/
    
    Using Python version 3.11.2 (main, Feb 17 2023 09:28:16)
    Spark context Web UI available at http://xxx.xxx.xxx:4040
    Spark context available as 'sc' (master = yarn, app id = application_9999999999999_9999).
    SparkSession available as 'spark'.
    >>> 
    >>> cp = spark._jvm.System.getProperty("java.class.path")
    >>> for jar in sorted(cp.split(":")): print(jar)
    ...
    /etc/hive/conf/
    /opt/spark340/lib/hadoop/client/avro.jar
    /opt/spark340/lib/hadoop/client/aws-java-sdk-bundle-1.12.599.jar
    /opt/spark340/lib/hadoop/client/aws-java-sdk-bundle.jar
    /opt/spark340/lib/hadoop/client/azure-data-lake-store-sdk-2.3.6.jar
    /opt/spark340/lib/hadoop/client/azure-data-lake-store-sdk.jar
    /opt/spark340/lib/hadoop/client/checker-qual-2.8.1.jar
    :
    :
    
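    Along the same lines, you can ask the driver JVM about its own heap through that same gateway. The snippet below is only a sketch, and the numbers it prints depend entirely on how your driver was configured:

    >>> # java.lang.Runtime of the driver JVM, reached through the py4j gateway
    >>> rt = spark._jvm.java.lang.Runtime.getRuntime()
    >>> print(rt.maxMemory() // (1024 * 1024), "MiB max heap")      # upper bound the heap may grow to
    >>> print(rt.totalMemory() // (1024 * 1024), "MiB allocated")   # heap currently reserved by the JVM
    >>> print(rt.freeMemory() // (1024 * 1024), "MiB free")         # unused portion of the allocated heap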

    A side comment: your diagram (a link to the source would have been nice!) is neither accurate nor complete. Spark has used the unified memory model since version 1.6, so execution and storage memory are no longer fixed, separate regions (the legacy settings that controlled that split were deprecated and removed in 3.0). There is also an option to reserve memory specifically for Python in a PySpark application (spark.executor.pyspark.memory). For the complete list of config settings, see https://spark.apache.org/docs/latest/configuration.html#application-properties
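
    For instance, the Python-side allocation mentioned above is controlled by spark.executor.pyspark.memory, and like the other executor memory settings it has to be fixed before the executors are launched. Below is a minimal sketch of setting it when building a session; the sizes are placeholders, not recommendations:

    >>> from pyspark.sql import SparkSession
    >>> spark = (
    ...     SparkSession.builder
    ...     .config("spark.executor.memory", "4g")          # JVM heap per executor
    ...     .config("spark.executor.memoryOverhead", "1g")  # non-heap JVM / native overhead
    ...     .config("spark.executor.pyspark.memory", "2g")  # memory reserved for the Python workers
    ...     .getOrCreate()
    ... )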