Tags: dataframe, pyspark

PySpark memory consumption is very low


I am using Anaconda Python and installed PySpark on top of it. In the PySpark program, I use DataFrames as the data structure. The program goes like this:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test").getOrCreate()

# read the ORC files under ../data/ into a DataFrame
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("data")

# group by both selected columns (a selected column must be grouped or aggregated)
df = spark_session.sql("select field1, field2 from data group by field1, field2")

# writes result.csv as a directory of CSV part files
df.write.csv("result.csv")

This works, but it is slow and the memory usage is very low (~2 GB), even though there is much more physical memory installed.

I tried to increase the memory usage by:

from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')

But it does not seem to help at all.

Is there any way to speed up the program? In particular, how can I fully utilize the system memory?

Thanks!


Solution

  • You can either pass a configuration object when building your session:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf()
    conf.set('spark.executor.memory', '16g')
    spark_session = SparkSession.builder \
            .config(conf=conf) \
            .appName('test') \
            .getOrCreate()
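
    For a single setting you can also skip SparkConf and pass the key straight to the builder; a minimal sketch of the same idea:

    from pyspark.sql import SparkSession

    # config(key, value) sets one property on the session being built
    spark_session = SparkSession.builder \
            .appName('test') \
            .config('spark.executor.memory', '16g') \
            .getOrCreate()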
    

    Or run the script with spark-submit:

    spark-submit --conf spark.executor.memory=16g yourscript.py
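
    The executor memory setting also has a dedicated spark-submit shortcut; a sketch with the same example value (yourscript.py as above), keeping all options before the script path:

    spark-submit --executor-memory 16g yourscript.py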
    

    You should probably also set spark.driver.memory to something reasonable.
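
    If the script is launched with plain python, the driver JVM only starts when the session is created, so the builder can carry that setting too; a minimal sketch (the 16g values are just examples, size them to your machine):

    from pyspark.sql import SparkSession

    spark_session = SparkSession.builder \
            .appName('test') \
            .config('spark.driver.memory', '16g') \
            .config('spark.executor.memory', '16g') \
            .getOrCreate()

    When launching through spark-submit or the pyspark shell, the driver JVM is already running before your code executes, so pass --driver-memory (or --conf spark.driver.memory=16g) on the command line instead. In local mode everything runs inside the driver process, so spark.driver.memory is the value that actually governs how much heap the job gets.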

    Hope this helps, good luck!