I'm fairly new to R & scripting and I'm trying to write a script that will convert each sheet in an excel file to its own csv file. This script will be used on many excel files that won't all have the same number of sheets, number of rows/columns per sheet, and may have columns with empty values. Also, the script will probably be used by other people at work besides myself, so I don't want to use too many packages that they will have to install.
The following is an outline of what I want the script to do:
I searched online & tried various methods to get excel sheets converted to csv files, but I didn’t have much luck with most of the methods I found. The one method that got me most of the way there is Yifu Yan's answer here. However, exactly as he mentioned, the column types did not convert correctly.
I tried modifying Yifu Yan’s code a bit, trying to figure out the column type conversion myself, but was unsuccessful. Here is the code I used (with readxl & purrr packages):
file <- readline("Enter file name: ")
# Use if/else, for, or while loop to check file extension, which will replace “.xlsx” in the following line with the determined extension
path <- paste0(file, “.xlsx”)
sheet_names <- readxl::excel_sheets(path)
df <- purrr::map(sheet_names, ~readxl::read_excel(path, .x, col_types = "guess", col_names = FALSE))
purrr::walk2(df, sheet_names, write.csv(.x, paste0(file, "-", .y, ".csv"), col_names = FALSE))
When I tested the code with an excel file, I found that some of the columns converted fine, but others didn’t. Here is an example of what the data should have looked like after conversion:
Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
---|---|---|---|---|
2233.09 | 501 | 1234567 | ab | 3/20/2011 6:09:17 |
1122.09 | 502 | 1234569 | ac | 3/20/2011 6:12:25 |
What it actually looked like:
Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
---|---|---|---|---|
2233.0900000000001 | 501 | 1234567 | ab | 40622.25644675926 |
1122.0900000000002 | 502 | 1234569 | ac | 40622.258622685185 |
The 3 things I want help with:
EDIT: Since I couldn't get anything else to work while only using one package (besides tools), I will just use readxl & tidyverse (includes readr & purrr). Not ideal, but does what I need it to.
Number 1 is the hardest and I fear you won't find a one fits all solution. Most of the Excel readers I know try really hard to determine the column type. This is not an easy task, because Excel cells contain arbitrary values, and R must find a suitable heuristic to get the proper cell type. This guessing works in my opinion most of the times and even in cases where it doesn't it is often the case that R is right and just the formatting in Excel is misleading.
So for example your first column looks totally valid to me. R recognizes the number format and reads in the floating point number.
In floating point arithmetic (cf. Circle 1 - Falling into the Floating Point Trap the two values are indeed the same:
all.equal(2233.0900000000001, 2233.09)
# [1] TRUE
For your date problem it is quite similar. Internally in Excel dates are store as numbers (and times as fractions), in the following screenshot both columns A
and B
contain the same numbers, but column B
hast datetime format:
The conversion is a bit tricky and on e needs to know that Excel dates start at 1899-12-30
(for whatever reason)
as.POSIXct(40622.25644675926 * 24 *60*60, origin = "1899-12-30", tz = "GMT")
# [1] "2011-03-20 06:09:17 GMT"
With all this subtleties and as you are anyways interested in CSV format (which does not know about types other than numbers or text), why not using text for all columns?
readxl::read_excel(path, .x, col_types = "text")
You can use tools::file_ext
. tools
is a base library so should be installed anyways, but the implementation is easy enough:
function (x) {
pos <- regexpr("\\.([[:alnum:]]+)$", x)
ifelse(pos > -1L, substring(x, pos + 1L), "")
}
Thus, you can easily get the file extension then by tools::file_ext(file)
.
base
Loopsyou can use lapply
as a direct "replacement" (actually it is the other way round, purrr::map
is a replacement of lapply
) for both map
and walk
:
df <- lapply(sheet_names,
\(.x) readxl::read_excel(path, .x, col_types = "text",
col_names = FALSE))
lapply(names(df),
\(.y) sheet_names, write.csv(df[.y], paste0(file, "-", .y, ".csv"),
col_names = FALSE))