According to the docs for the CSV options:

Property Name: emptyValue
Default: (for reading), "" (for writing)
Meaning: Sets the string representation of an empty value.
But it doesn't seem to work:
with open("/dbfs/tmp/c.csv", "w") as f:
f.write('''id,val
1,
2,emptyStr
3,str1
''')
spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='emptyStr').collect()
prints:
[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]
I expected val='' for id='2' (instead of val='emptyStr').
How do I use the emptyValue option? The aim is to be able to represent NULL as well as empty strings in a CSV file.
Also see: How to read empty string as well as NULL values from a csv file in pyspark?
The emptyValue option converts quoted empty strings ("") in the CSV file into the specified emptyValue when the file is read into the dataframe, not the other way around (unquoted empty fields are still read as NULL):
with open("c.csv", "w") as f:
f.write('''id,val
1,
2,emptyStr
3,str1
4,""
''')
# Using emptyValue
df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
df.show()
+---+------------------+
| id| val|
+---+------------------+
| 1| NULL|
| 2| emptyStr|
| 3| str1|
| 4|empty_string_value|
+---+------------------+
# Without using emptyValue
df = spark.read.csv('c.csv', header=True)
df.show()
+---+--------+
| id| val|
+---+--------+
| 1| NULL|
| 2|emptyStr|
| 3| str1|
| 4| NULL|
+---+--------+
When writing back to CSV, Spark will convert empty strings to the specified emptyValue:
from pyspark.sql.functions import col, when

df = df.withColumn(
    "transformed_val",
    when(col("val") == "empty_string_value", "").otherwise(col("val"))
)
df.show()
+---+------------------+---------------+
| id| val|transformed_val|
+---+------------------+---------------+
| 1| NULL| NULL|
| 2| emptyStr| emptyStr|
| 3| str1| str1|
| 4|empty_string_value| |
+---+------------------+---------------+
df.write \
.mode("overwrite") \
.option("header", True) \
.option("emptyValue", "EMPTY") \
.csv('csv_path')
with open("csv_path/part-XXX.csv", "r") as f:
print(f.read())
id,val,transformed_val
1,,
2,emptyStr,emptyStr
3,str1,str1
4,empty_string_value,EMPTY
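Reading that file back, the markers come through as ordinary strings, so the sentinel can be mapped back to an empty string. A minimal round-trip sketch, assuming the same csv_path directory as above:

from pyspark.sql.functions import col, when

# "EMPTY" is a plain non-empty field in the file, so it reads back as the
# literal string "EMPTY"; the unquoted empty fields in row 1 read back as NULL.
df_back = spark.read.csv('csv_path', header=True)

# Map the sentinel back to '' so NULL and empty string remain distinguishable.
df_back = df_back.withColumn(
    "transformed_val",
    when(col("transformed_val") == "EMPTY", "").otherwise(col("transformed_val"))
)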
If you want the behavior you intend in the question, you can instead transform the "emptyStr" value into an empty string or null value:
with open("c.csv", "w") as f:
f.write('''id,val
1,
2,emptyStr
3,str1
''')
df = spark.read.csv('c.csv', header=True)
df = df.withColumn(
"val",
when(col("val") == "emptyStr", "").otherwise(col("val"))
)
df.collect()
[Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]
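If the goal is a CSV file in which NULL and empty string are both representable, another option is to write each with its own marker. A sketch, assuming arbitrary marker strings (nullValue and emptyValue are both supported write options; "\N" is just a common convention, not a requirement):

# One NULL row and one empty-string row, to show both cases in the file.
df = spark.createDataFrame([(1, None), (2, "")], ["id", "val"])

# NULLs are written as \N and empty strings as EMPTY, so the two cases
# stay distinguishable in the file itself.
df.write \
    .mode("overwrite") \
    .option("header", True) \
    .option("nullValue", "\\N") \
    .option("emptyValue", "EMPTY") \
    .csv('distinct_markers_path')

The written part file should then contain something like:

id,val
1,\N
2,EMPTY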