Tags: csv, apache-spark, pyspark

How to use emptyValue option in pyspark while reading a csv file?


According to docs of csv options:

Property Name: emptyValue
Default: (for reading), "" (for writing)
Meaning: Sets the string representation of an empty value.

But it doesn't seem to work:

with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='emptyStr').collect()

prints:

[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]

I expected val='' for id='2' (instead of val='emptyStr').

How do I use the emptyValue option? The aim is to be able to represent both NULL and empty strings in a csv file.

Also see: How to read empty string as well as NULL values from a csv file in pyspark?


Solution

  • When reading, the emptyValue option replaces empty (quoted "") strings in the csv file with the value you specify; it does not map that value back to an empty string:

    with open("c.csv", "w") as f:
        f.write('''id,val
    1,
    2,emptyStr
    3,str1
    4,""
    ''')
    
    # Using emptyValue
    df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
    df.show()
    
    +---+------------------+
    | id|               val|
    +---+------------------+
    |  1|              NULL|
    |  2|          emptyStr|
    |  3|              str1|
    |  4|empty_string_value|
    +---+------------------+
    
    # Without using emptyValue
    df = spark.read.csv('c.csv', header=True)
    df.show()
    
    +---+--------+
    | id|     val|
    +---+--------+
    |  1|    NULL|
    |  2|emptyStr|
    |  3|    str1|
    |  4|    NULL|
    +---+--------+
    

    When writing back to csv, Spark converts empty strings in the dataframe to the specified emptyValue:

    from pyspark.sql.functions import when, col

    df = df.withColumn(
        "transformed_val",
        when(col("val") == "empty_string_value", "").otherwise(col("val"))
    )
    df.show()
    
    +---+------------------+---------------+
    | id|               val|transformed_val|
    +---+------------------+---------------+
    |  1|              NULL|           NULL|
    |  2|          emptyStr|       emptyStr|
    |  3|              str1|           str1|
    |  4|empty_string_value|               |
    +---+------------------+---------------+
    
    
    
    df.write \
        .mode("overwrite") \
        .option("header", True) \
        .option("emptyValue", "EMPTY") \
        .csv('csv_path')
    
    with open("csv_path/part-XXX.csv", "r") as f:
        print(f.read())
    
    id,val,transformed_val
    1,,
    2,emptyStr,emptyStr
    3,str1,str1
    4,empty_string_value,EMPTY
    

    If you want the behavior you were expecting in the question, you could instead transform the "emptyStr" value into an empty string or null value:

    with open("c.csv", "w") as f:
        f.write('''id,val
    1,
    2,emptyStr
    3,str1
    ''')
    
    from pyspark.sql.functions import when, col

    df = spark.read.csv('c.csv', header=True)
    df = df.withColumn(
        "val",
        when(col("val") == "emptyStr", "").otherwise(col("val"))
    )
    df.collect()
    
    [Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]