Tags: csv, apache-spark, pyspark

How to use emptyValue option in pyspark while reading a csv file?


According to docs of csv options:

Property Name: emptyValue
Default: (for reading), "" (for writing)
Meaning: Sets the string representation of an empty value.

But it doesn't seem to work:

with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='emptyStr').collect()

prints:

[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]

I expected val='' for id='2' (instead of val='emptyStr').

How do I use the emptyValue option? The aim is to be able to represent both NULL and empty strings in a csv file.

Also see: How to read empty string as well as NULL values from a csv file in pyspark?


Solution

  • When reading, the emptyValue option replaces empty (quoted "") strings in the csv file with the value you specify; it does not map that value back to an empty string:

    with open("c.csv", "w") as f:
        f.write('''id,val
    1,
    2,emptyStr
    3,str1
    4,""
    ''')
    
    # Using emptyValue
    df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
    df.show()
    
    +---+------------------+
    | id|               val|
    +---+------------------+
    |  1|              NULL|
    |  2|          emptyStr|
    |  3|              str1|
    |  4|empty_string_value|
    +---+------------------+
    
    # Without using emptyValue
    df = spark.read.csv('c.csv', header=True)
    df.show()
    
    +---+--------+
    | id|     val|
    +---+--------+
    |  1|    NULL|
    |  2|emptyStr|
    |  3|    str1|
    |  4|    NULL|
    +---+--------+
    

    When writing back to csv, Spark converts empty strings in the dataframe to the specified emptyValue:

    from pyspark.sql.functions import when, col

    df = df.withColumn(
        "transformed_val",
        when(col("val") == "empty_string_value", "").otherwise(col("val"))
    )
    df.show()
    
    +---+------------------+---------------+
    | id|               val|transformed_val|
    +---+------------------+---------------+
    |  1|              NULL|           NULL|
    |  2|          emptyStr|       emptyStr|
    |  3|              str1|           str1|
    |  4|empty_string_value|               |
    +---+------------------+---------------+
    
    
    
    df.write \
        .mode("overwrite") \
        .option("header", True) \
        .option("emptyValue", "EMPTY") \
        .csv('csv_path')
    
    with open("csv_path/part-XXX.csv", "r") as f:
        print(f.read())
    
    id,val,transformed_val
    1,,
    2,emptyStr,emptyStr
    3,str1,str1
    4,empty_string_value,EMPTY
    

    If you want the behavior you were expecting in the question, you could instead transform the "emptyStr" value into an empty string or null value:

    with open("c.csv", "w") as f:
        f.write('''id,val
    1,
    2,emptyStr
    3,str1
    ''')
    
    from pyspark.sql.functions import when, col

    df = spark.read.csv('c.csv', header=True)
    df = df.withColumn(
        "val",
        when(col("val") == "emptyStr", "").otherwise(col("val"))
    )
    df.collect()
    
    [Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]