Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr so that I could keep using my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and inside it are ~10 .csv files whose names all begin with "part". The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file the way write.csv does?
Edit
A user below suggested that this post may answer the question, and it nearly does, but the asker there is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
Spark divides the data into multiple partitions, and when you save the DataFrame to CSV you get one file per partition. To get a single file, you need to bring all the data into a single partition before calling spark_write_csv.
You can use a method called coalesce to achieve this.
In sparklyr, that method is exposed as sdf_coalesce():
sdf_coalesce(df, 1)
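Putting it together in R, here is a minimal sketch (assuming a local Spark connection and a Spark table d1 as in the question; the output path d1_single is an arbitrary choice):

```r
library(sparklyr)
library(dplyr)

# Connect to the local Spark installation (assumed setup)
sc <- spark_connect(master = "local")

# Collapse the Spark DataFrame down to one partition, then write.
# The output directory will now contain a single "part-*.csv" file.
d1 %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv(path = "C:/d1_single", mode = "overwrite")
```

Note that Spark still writes a directory (here C:/d1_single) containing one part file plus metadata files; it never writes a bare d1.csv. If the result is small enough to fit in R's memory, an alternative is d1 %>% collect() %>% write.csv("C:/d1.csv", row.names = FALSE), which produces a true single file.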