Tags: java, csv, apache-spark, java-8, apache-spark-2.3

Writing CSV file using Spark and java - handling empty values and quotes


The initial data is in a Dataset<Row> and I am trying to write it to a pipe-delimited file. I want each non-empty, non-null value to be wrapped in quotes; empty or null values should not be quoted.

result.coalesce(1).write()
            .option("delimiter", "|")
            .option("header", "true")
            .option("nullValue", "")
            .option("quoteAll", "false")
            .csv(Location);

Expected output:

"London"||"UK"
"Delhi"|"India"
"Moscow"|"Russia"

Current Output:

London||UK
Delhi|India
Moscow|Russia

If I change "quoteAll" to "true", the output I get is:

"London"|""|"UK"
"Delhi"|"India"
"Moscow"|"Russia"

The Spark version is 2.3 and the Java version is 8.


Solution

  • Java answer. CSV escaping is not just adding " symbols around a value; you also have to handle " characters inside the strings. So let's use StringEscapeUtils and define a UDF that calls it, then apply the UDF to each column.

    import org.apache.commons.text.StringEscapeUtils;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.expressions.UserDefinedFunction;
    import org.apache.spark.sql.types.DataTypes;
    
    import java.util.Arrays;
    
    public class Test {
    
        void test(Dataset<Row> result, String Location) {
            // define UDF
            // guard against null as well as empty, otherwise the lambda throws a NullPointerException
            UserDefinedFunction escape = udf(
                (String str) -> str == null || str.isEmpty() ? "" : StringEscapeUtils.escapeCsv(str),
                DataTypes.StringType
            );
            // call udf for each column
            Column[] columns = Arrays.stream(result.schema().fieldNames())
                    .map(f -> escape.apply(col(f)).as(f))
                    .toArray(Column[]::new);
    
            // save the result
            result.select(columns)
                    .coalesce(1).write()
                    .option("delimiter", "|")
                    .option("header", "true")
                    .option("nullValue", "")
                    .option("quoteAll", "false")
                    .csv(Location);
        }
    }
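
    For intuition, here is a minimal plain-Java sketch of the quoting rule the UDF is meant to enforce. The `quoteField` helper is illustrative only (it is not part of the answer's code, which delegates to `StringEscapeUtils.escapeCsv`): empty and null values stay empty, non-empty values are wrapped in quotes, and any embedded `"` is doubled, per RFC 4180.

    ```java
    public class QuoteDemo {

        // Illustrative helper: wrap non-empty values in quotes, doubling embedded quotes.
        static String quoteField(String s) {
            if (s == null || s.isEmpty()) {
                return "";
            }
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }

        public static void main(String[] args) {
            // Matches the expected output in the question: "London"||"UK"
            System.out.println(String.join("|",
                    quoteField("London"), quoteField(""), quoteField("UK")));
            // Embedded quotes are doubled: "say ""hi"""
            System.out.println(quoteField("say \"hi\""));
        }
    }
    ```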
    

    Side note: coalesce(1) is a bad call. It collects all the data onto a single executor, so you can get an executor OOM in production for a huge dataset.