Tags: dataframe, pyspark

PySpark memory consumption is very low


I am using Anaconda Python and installed PySpark on top of it. In the PySpark program, I use DataFrames as the data structure. The program goes like this:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test").getOrCreate()

# read the ORC files under ../data/ into a DataFrame
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("data")

# group by both selected columns (a selected column must be grouped or aggregated)
df = spark_session.sql("select field1, field2 from data group by field1, field2")

# writes result.csv as a directory of CSV part files
df.write.csv("result.csv")

This works, but it is slow and the memory usage is very low (~2 GB), even though there is much more physical memory installed.

I tried to increase the memory usage by:

from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')

But it does not seem to help at all.

Is there any way to speed up the program? In particular, how can I fully utilize the system memory?

Thanks!


Solution

  • You can either pass a configuration object when building your session:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf()
    conf.set('spark.executor.memory', '16g')
    spark_session = SparkSession.builder \
            .config(conf=conf) \
            .appName('test') \
            .getOrCreate()
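
    For a single setting you can also skip SparkConf and pass the key straight to the builder; a minimal sketch of the same idea:

    from pyspark.sql import SparkSession

    # config(key, value) sets one property on the session being built
    spark_session = SparkSession.builder \
            .appName('test') \
            .config('spark.executor.memory', '16g') \
            .getOrCreate()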
    

    Or run the script with spark-submit:

    spark-submit --conf spark.executor.memory=16g yourscript.py
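
    The executor memory setting also has a dedicated spark-submit shortcut; a sketch with the same example value (yourscript.py as above), keeping all options before the script path:

    spark-submit --executor-memory 16g yourscript.py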
    

    You should probably also set spark.driver.memory to something reasonable.
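
    If the script is launched with plain python, the driver JVM only starts when the session is created, so the builder can carry that setting too; a minimal sketch (the 16g values are just examples, size them to your machine):

    from pyspark.sql import SparkSession

    spark_session = SparkSession.builder \
            .appName('test') \
            .config('spark.driver.memory', '16g') \
            .config('spark.executor.memory', '16g') \
            .getOrCreate()

    When launching through spark-submit or the pyspark shell, the driver JVM is already running before your code executes, so pass --driver-memory (or --conf spark.driver.memory=16g) on the command line instead. In local mode everything runs inside the driver process, so spark.driver.memory is the value that actually governs how much heap the job gets.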

    Hope this helps, good luck!