[SOLVED] Is there an explanation when spark-csv won't save a DataFrame to file?

dataFrame.coalesce(1).write().save("path") sometimes writes only the _SUCCESS and ._SUCCESS.crc files, without the expected *.csv.gz file, even when the input DataFrame is not empty.

Code that saves the file:

private static void writeCsvToDirectory(Dataset<Row> dataFrame, Path directory) {
    dataFrame.coalesce(1)
            .write()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "\t")
            .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
            .mode(SaveMode.Overwrite)
            .save("file:///" + directory);
}

Code that fetches the file:

static Path getTemporaryCsvFile(Path directory) throws IOException {
    String glob = "*.csv.gz";
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
        return stream.iterator().next();
    } catch (NoSuchElementException e) {
        throw new RuntimeException(getNoSuchElementExceptionMessage(directory, glob), e);
    }
}

Example of the error when fetching the file:

java.lang.RuntimeException: directory /tmp/temp5889805853850415940 does not contain a file with glob *.csv.gz. Directory listing:
    /tmp/temp5889805853850415940/_SUCCESS, 
    /tmp/temp5889805853850415940/._SUCCESS.crc

I rely on this expectation. Can someone explain to me why it works this way?

Solution

The output file should (logically, must) contain at least the header line and some data lines. But it does not exist at all.

This comment was a bit misleading. According to the code on GitHub, a missing output file will happen only if the DataFrame is empty, and in that case no SUCCESS files are produced either. Since those files are present, the DataFrame is not empty and the writeCsvToDirectory method from your code is being triggered.

I have a couple of questions:

Does your Spark job finish without errors?
Does the timestamp of the SUCCESS file get updated?
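
One way to answer the timestamp question, as a minimal sketch: record the time just before calling writeCsvToDirectory and compare it with the modification time of _SUCCESS afterwards. The names SuccessMarkerCheck and successMarkerIsFresh are made up for illustration, and the sketch assumes the output sits on the local file system, as the file:/// path in your code suggests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// Hypothetical helper, not part of Spark: checks whether the _SUCCESS
// marker in an output directory was written by the current run.
final class SuccessMarkerCheck {

    // Returns true only if directory/_SUCCESS exists and its last-modified
    // time is at or after jobStart (the instant taken just before the write).
    static boolean successMarkerIsFresh(Path directory, Instant jobStart) throws IOException {
        Path marker = directory.resolve("_SUCCESS");
        if (!Files.exists(marker)) {
            return false; // no marker at all, so nothing from this run
        }
        Instant modified = Files.getLastModifiedTime(marker).toInstant();
        return !modified.isBefore(jobStart);
    }
}
```

If this returns false after a run, the _SUCCESS file you are looking at was most likely left over from an earlier run. Some file systems store modification times with coarse granularity, so it may be necessary to truncate jobStart to whole seconds before comparing.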

My two main suspects are:

coalesce(1) - if you have a lot of data, this might fail, because all of it has to go through a single partition and therefore a single task
SaveMode.Overwrite - I have a feeling that those SUCCESS files are left over in that folder from previous runs
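
To rule out the leftover-files suspicion, you could empty the output directory yourself right before the write, so that anything found there afterwards must come from the current run. This is only a diagnostic sketch using java.nio. OutputDirectoryReset and deleteContents are hypothetical names, not part of Spark, and the sketch assumes the directory is on the local file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical diagnostic helper: clears an output directory before a write.
final class OutputDirectoryReset {

    // Deletes everything below `directory` but keeps the directory itself.
    // Returns the number of files and subdirectories that were removed.
    static int deleteContents(Path directory) throws IOException {
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(directory)) {
            entries = walk
                    .filter(p -> !p.equals(directory))
                    // reverse order puts children before their parent directories
                    .sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
        return entries.size();
    }
}
```

Calling OutputDirectoryReset.deleteContents(directory) just before writeCsvToDirectory(dataFrame, directory) and then listing the directory should show whether Spark really produced only _SUCCESS in this run, or whether the markers you saw were stale.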