I have a large dataset (3.6 billion lines, 20 columns, I'm working in manageable chunks of 1m rows). My cells can include entries like this string:
"sometext "some other text"... ; , "a" some more text \t . ;"
This is one cell. fread() somehow manages to understand where the next column starts; it is amazing. I want to save my chunks as txt files in a way that any future program (actually the arrow package) that is not as good as fread() at handling such cases will still understand where columns end and start, but without changing the text inside the outer "...". I'm getting lots of errors with different commands, I think because all the usual separators also appear as characters inside my strings, which are themselves wrapped in "...". I thought I could just set a nonsensical separator like ";-.-;" or something like that which can never appear in my 3.6bn x 20 cells, but apparently the separator needs to be a single character. However, every character imaginable appears somewhere in my 3.6 billion rows at some point. How does one set separators in such a dataset?
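For example, even a tiny test fails immediately (exact error wording may differ):
library(data.table)
fwrite(data.frame(x = "sometext"), "test.csv", sep = ";-.-;")
# errors: the separator must be a single character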
p.s.: I guess a case in point is that Stack Overflow's coloring scheme does not recognize that the cell above should be one string ;)
The best way around this is to use a format other than csv, like Parquet. You have said in a comment that your goal is to read it using arrow.
Let's create some sample data with two columns: one repeats the string you included (with some random Unicode control characters and emojis added), and the other is a random shuffle of those characters:
library(data.table)
str <- '"sometext \U002 \U003 😊 🙄 "some other text"... ; , "a" some more text \t . ;"'
set.seed(2024); N_ROWS <- 100
dat <- data.frame(
  col1 = rep(str, N_ROWS),
  # col2: a fresh random shuffle of the characters in str for each row
  col2 = sapply(seq(N_ROWS), \(., x = str) paste0(sample(strsplit(x, "")[[1]], nchar(x)), collapse = ""))
)
arrow::write_parquet() should be able to handle data like this without any pre-processing:
arrow::write_parquet(dat, "./tmp.parquet")
dat_arrow_pq <- arrow::read_parquet("./tmp.parquet") |>
  data.frame()
identical(dat, dat_arrow_pq) # [1] TRUE
You could also use the Feather format:
arrow::write_feather(dat, "./tmp.feather")
dat_arrow_feather <- arrow::read_feather("./tmp.feather") |>
  data.frame()
identical(dat, dat_arrow_feather) # [1] TRUE
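Since you are already working in 1m-row chunks, you could also write each chunk to its own Parquet file and later scan them all as one dataset. This is only a sketch: read_next_chunk() and n_chunks stand in for however you currently produce each chunk.
out_dir <- "./chunks_parquet"
dir.create(out_dir, showWarnings = FALSE)
for (i in seq_len(n_chunks)) {
  chunk <- read_next_chunk(i) # placeholder for your existing chunking step
  arrow::write_parquet(chunk, file.path(out_dir, sprintf("chunk-%06d.parquet", i)))
}
ds <- arrow::open_dataset(out_dir) # lazily treats all the chunk files as one table
arrow::open_dataset() only reads the data you actually query, which helps at this scale.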
Although in your question you have the right number of columns, I wouldn't assume that reading and writing to/from csv has not mangled the text. If we write dat to csv and read it back in, we have the right number of columns:
fwrite(dat, "./tmp.csv")
dat_dt <- fread("./tmp.csv", data.table = FALSE)
names(dat) # [1] "col1" "col2"
names(dat_dt) # [1] "col1" "col2"
However, using the default fread() and fwrite() options, the quotes are now doubled:
print(dat$col1[1], quote = FALSE)
# [1] "sometext \002 \003 😊 🙄 "some other text"... ; , "a" some more text \t . ;"
print(dat_dt$col1[1], quote = FALSE)
# [1] ""sometext \002 \003 😊 🙄 ""some other text""... ; , ""a"" some more text \t . ;""
identical(dat_dt, dat) # [1] FALSE
If you must use a plain text format, then you need a separator that either is not in the text, or that can be replaced if it is.
Here, I'll use "\U001". This is the Start of Heading control character, which is non-printable and obsolete, so it is very unlikely to be in your data, and if it is, there should be no problem removing it. In the very unlikely event that your data does contain this character and it has meaning, Unicode has around a million other characters in many scripts that you could choose for the same purpose.
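If you want to double-check a chunk you already have in memory, a quick scan of the character columns will tell you (a sketch using the sample data from above):
sep <- "\U001"
char_cols <- names(dat)[sapply(dat, is.character)]
any(sapply(dat[char_cols], \(x) any(grepl(sep, x, fixed = TRUE)))) # [1] FALSE here, so "\U001" is safe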
Here is a function to remove "\U001" from your data and then save the file using it as a separator:
save_clean <- function(df, outfile = "./tmp.csv", sep = "\U001", repl = "") {
  char_cols <- names(df)[sapply(df, is.character)]
  # strip any pre-existing occurrences of the separator, then write without quoting
  df[char_cols] <- lapply(df[char_cols], \(x) gsub(sep, repl, x))
  fwrite(df, outfile, sep = sep, quote = FALSE)
}
This will allow you to save and read in the file:
save_clean(dat)
dat_dt <- fread("./tmp.csv", sep = "\U001", data.table = FALSE, quote = "", strip.white = FALSE)
identical(dat, dat_dt) # [1] TRUE
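You can also peek at the raw file to confirm the single control character really is acting as the separator:
readLines("./tmp.csv", n = 1)
# [1] "col1\001col2"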
This will also work in base R and with readr or arrow:
dat_base <- read.csv("./tmp.csv", sep = "\U001", quote = "")
dat_readr <- readr::read_delim("./tmp.csv", delim = "\U001", quote = "") |>
  data.frame()
dat_arrow_csv <- arrow::read_delim_arrow("./tmp.csv", delim = "\U001", quote = "") |>
  data.frame()
identical(dat, dat_base) # [1] TRUE
identical(dat, dat_readr) # [1] TRUE
identical(dat, dat_arrow_csv) # [1] TRUE
If your text may start with a space that you want to keep, it's important to set quote = "" and strip.white = FALSE.
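To see the difference, here is a small sketch (the file "./tmp_ws.csv" and its contents are made up for illustration; the comments show the expected results):
s <- "\U001"
writeLines(c(paste0("a", s, "b"), paste0("  keep me", s, "x")), "./tmp_ws.csv")
fread("./tmp_ws.csv", sep = "\U001", quote = "", strip.white = FALSE, data.table = FALSE)$a
# expected: "  keep me"  (leading spaces kept)
fread("./tmp_ws.csv", sep = "\U001", quote = "", data.table = FALSE)$a
# expected: "keep me"    (leading spaces stripped by the default strip.white = TRUE)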
Running gsub() over every character column is wasteful when "\U001" is unlikely to appear in the data at all. I suggest you scan the files beforehand to see whether the character actually needs to be replaced. On Linux/Mac:
grep -l $'\x01' *.csv   # $'\x01' is the literal SOH byte (bash/zsh)
Or on Windows PowerShell:
Get-ChildItem -Path . -Filter *.csv -Recurse | ForEach-Object {
  if (Select-String -Path $_.FullName -Pattern ([char]0x01) -Quiet) {
    $_.FullName
  }
}
These commands will return a list of the csv files that contain the character. If there are none, you can simply save with fwrite(df, "./tmp.csv", sep = "\U001", quote = FALSE) and skip the potentially expensive substitution. However, given that you cannot guarantee which characters might appear in the text, and that your data is fairly large, the best solution (if it's an option) is to avoid a plain text format altogether.