Tags: r, csv, apache-spark, dplyr, sparklyr

In R and sparklyr, writing a table to CSV (spark_write_csv) yields many files, not one single file. Why, and can I change that?


Background

I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.

The Problem

Here's the code I'm using to output a .csv file to a folder on my hard drive:

spark_write_csv(d1, "C:/d1.csv")

When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:

[Screenshot: the newly created d1 folder, containing ~10 files named part-*.csv]

The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".

What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
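(One workaround, not from the original post, is to stitch the part files back together in base R after the fact. This is a sketch that assumes the part files share a header and the combined data fits in local memory; the `C:/d1` path matches the folder described above.)

```r
# Collect all part files Spark wrote into the d1 folder
parts <- list.files("C:/d1", pattern = "\\.csv$", full.names = TRUE)

# Read each part and bind the rows into one data frame
combined <- do.call(rbind, lapply(parts, read.csv))

# Write out a single conventional CSV
write.csv(combined, "C:/d1.csv", row.names = FALSE)
```

This defeats the purpose of Spark for truly large data, since everything passes through R's memory, but it is a quick fix when the result of the manipulation is small enough to collect.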

Edit

A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.


Solution

  • Spark divides the data into multiple partitions, and when you save a dataframe to CSV you get one file per partition. To get a single output file, bring all the data into a single partition before calling spark_write_csv.

    You can use sparklyr's sdf_coalesce method (a wrapper around Spark's coalesce) to achieve this:

    sdf_coalesce(df, 1)
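    Put together, the full pipeline might look like the sketch below. It assumes a local Spark connection, and uses a copy of mtcars to stand in for the asker's d1 table:

    ```r
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # Stand-in for the real d1 table
    d1 <- copy_to(sc, mtcars, "d1", overwrite = TRUE)

    # Collapse to one partition, then write: the C:/d1.csv folder
    # will now contain exactly one part-*.csv file
    d1 %>%
      sdf_coalesce(1) %>%
      spark_write_csv("C:/d1.csv")
    ```

    Note that the output path is still a folder containing a single part file, not a bare d1.csv; if you need that exact filename, rename the part file afterwards (e.g. with list.files and file.rename).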