
What separator to set in text files with complicated string entries (that include all the standard separators)


I have a large dataset (3.6 billion rows, 20 columns; I'm working in manageable chunks of 1 million rows). My cells can include entries like this string:

"sometext  "some other text"... ; , "a" some more text \t . ;"

This is one cell. fread() somehow manages to understand where the next column starts; it is amazing. I want to save my chunks as txt files in a way that any future program (actually the arrow package) that is not as good as fread() at handling such cases will understand where columns end and start, but without changing the text within the outer "...". I'm getting lots of errors with different commands, and I think this is because all the usual separators appear as characters inside my strings, which are themselves wrapped in "...". I thought I could just set a nonsensical separator like ";-.-;" or something similar that can never appear in my 3.6bn × 20 cells, but apparently the separator needs to be a single character. However, every character imaginable appears somewhere in my 3.6 billion rows at some point. How does one set a separator for such a dataset?

P.S.: I guess a case in point is that Stack Overflow's syntax highlighting does not recognize that the cell above should be one string ;)


Solution

  • Use a different format

    The best way around this is to use a format other than a csv, like Parquet. You have said in a comment that your goal is to read it using arrow.

    Let's create some sample data with two columns. The first repeats the string from your question (with some control characters and emojis added); the second is a random shuffle of those same characters:

    library(data.table)

    # The problem string, with some control characters and emojis added
    str <- '"sometext \U002 \U003 😊 🙄 "some other text"... ; , "a" some more text \t . ;"'

    set.seed(2024); N_ROWS <- 100
    dat <- data.frame(
        col1 = rep(str, N_ROWS),
        # Each row of col2 is a random permutation of the characters of str
        col2 = sapply(seq(N_ROWS), \(., x = str) paste0(sample(strsplit(x, "")[[1]], nchar(x)), collapse = ""))
    )
    

    arrow::write_parquet should be able to handle data like this without any pre-processing:

    arrow::write_parquet(dat, "./tmp.parquet")
    dat_arrow_pq <- arrow::read_parquet("./tmp.parquet") |>
        data.frame()
    identical(dat, dat_arrow_pq) # [1] TRUE
    

    You could also use the feather format:

    arrow::write_feather(dat, "./tmp.feather")
    dat_arrow_feather <- arrow::read_feather("./tmp.feather") |>
        data.frame()
    identical(dat, dat_arrow_feather) # [1] TRUE
    
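    Since you are already working in 1m-row chunks, you do not even need to combine them into a single file: arrow can treat a whole directory of Parquet files as one dataset. Here is a minimal sketch (the out_dir path and the chunk loop are hypothetical placeholders for your real chunking code):

    out_dir <- "./chunks"  # hypothetical output directory
    dir.create(out_dir, showWarnings = FALSE)

    # Write each chunk to its own Parquet file as you process it
    for (i in 1:3) {  # stand-in for your real 1m-row chunk loop
        arrow::write_parquet(dat, file.path(out_dir, sprintf("chunk-%03d.parquet", i)))
    }

    # Later, treat the whole directory as one lazily-read dataset
    ds <- arrow::open_dataset(out_dir)
    nrow(ds)  # total rows across all chunk files
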

    Writing to csv

    Your text may be getting mangled

    Although in your question you end up with the right number of columns, I wouldn't assume that a round trip to/from csv has not mangled the text. If we write dat to csv and read it back in, we do get the right number of columns:

    fwrite(dat, "./tmp.csv")
    dat_dt <- fread("./tmp.csv", data.table = FALSE)
    names(dat) # [1] "col1" "col2"
    names(dat_dt) # [1] "col1" "col2"
    

    However, using the default fread() and fwrite() options, the quotes are now doubled:

    print(dat$col1[1], quote = FALSE)
    # [1] "sometext \002 \003 😊 🙄 "some other text"... ; , "a" some more text \t . ;"
    print(dat_dt$col1[1], quote = FALSE)
    # [1] ""sometext \002 \003 😊 🙄 ""some other text""... ; , ""a"" some more text \t . ;""
    identical(dat_dt, dat) # [1] FALSE
    
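    If you do end up with files like this, you can in principle undo the doubling after reading. This is a fragile workaround, valid only under the assumption that the doubling shown above is the sole change made to the text:

    # Collapse each doubled quote back to a single quote
    dat_dt$col1 <- gsub('""', '"', dat_dt$col1, fixed = TRUE)
    identical(dat_dt$col1, dat$col1) # [1] TRUE, if doubling was the only change
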

    Use a separator that is not in the text

    If you must use a plain text format, then you need a separator that either is not in the text, or that can be replaced if it is.

    Here, I'll use \U001. This is the start of heading character, which is non-printable and obsolete, so it is very unlikely to be in your data; if it is, there should be no problem removing it. In the very unlikely event that your data contains this character and it has meaning, Unicode has around a million other characters in many languages that you could choose for the same purpose.

    Here is a function to remove "\U001" from your data and then to save the file using it as a separator:

    save_clean <- function(df, outfile = "./tmp.csv", sep = "\U001", repl = "") {
        # Strip any existing occurrences of the separator from character columns
        char_cols <- names(df)[sapply(df, is.character)]
        df[char_cols] <- lapply(df[char_cols], \(x) gsub(sep, repl, x, fixed = TRUE))
        # Write unquoted, with the control character as the separator
        fwrite(df, outfile, sep = sep, quote = FALSE)
    }
    

    This will allow you to save and read in the file:

    save_clean(dat)
    dat_dt <- fread("./tmp.csv", sep = "\U001", data.table = FALSE, quote = "", strip.white = FALSE)
    identical(dat, dat_dt) # [1] TRUE
    

    This will also work in base R and with readr or arrow:

    dat_base <- read.csv("./tmp.csv", sep = "\U001", quote = "")
    dat_readr <- readr::read_delim("./tmp.csv", delim = "\U001", quote = "") |>
        data.frame()
    dat_arrow_csv <- arrow::read_delim_arrow("./tmp.csv", delim = "\U001", quote = "") |>
        data.frame()
    identical(dat, dat_base)      # [1] TRUE
    identical(dat, dat_readr)     # [1] TRUE
    identical(dat, dat_arrow_csv) # [1] TRUE
    

    If your text may start with a space that you want to keep, it's important to set quote = "" and strip.white = FALSE, as the sketch below shows.
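
    A quick illustration with made-up data: under fread()'s default strip.white = TRUE, a leading space in an unquoted field is silently dropped:

    tmp <- data.frame(x = " leading space", y = "other")
    fwrite(tmp, "./tmp2.csv", sep = "\U001", quote = FALSE)
    fread("./tmp2.csv", sep = "\U001", quote = "")$x                       # "leading space"
    fread("./tmp2.csv", sep = "\U001", quote = "", strip.white = FALSE)$x  # " leading space"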

    A note on character choice

    Running gsub() over every chunk is wasteful when "\U001" is very unlikely to appear in the data. I suggest you scan the files beforehand to see whether the character needs to be replaced at all. On Linux (or on a Mac with GNU grep installed, as BSD grep does not support -P):

    grep -P -l '\x01' *.csv
    

    Or on Windows PowerShell:

    Get-ChildItem -Path . -Filter *.csv -Recurse | ForEach-Object {
        if (Select-String -Path $_.FullName -Pattern ([char]0x01) -Quiet) {
            $_.FullName
        }
    }
    
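    If you would rather stay in R, here is a hedged sketch of a portable alternative that scans each file a chunk of lines at a time (the chunk size is arbitrary):

    # Returns TRUE as soon as the separator is found anywhere in the file
    has_sep <- function(path, sep = "\U001", chunk_lines = 1e6L) {
        con <- file(path, open = "r", encoding = "UTF-8")
        on.exit(close(con))
        repeat {
            lines <- readLines(con, n = chunk_lines, warn = FALSE)
            if (length(lines) == 0L) return(FALSE)  # reached end of file: not found
            if (any(grepl(sep, lines, fixed = TRUE))) return(TRUE)
        }
    }

    # List the csv files in the current directory that contain the character
    Filter(has_sep, list.files(".", pattern = "\\.csv$", full.names = TRUE))
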

    These commands will return a list of csv files that contain the character. If none do, you can just save normally with fwrite(df, "./tmp.csv", sep = "\U001", quote = FALSE) and skip the potentially expensive substitution. However, given that you cannot make any guarantees about which characters appear in the text, and given that your data is fairly large, the best solution (if it is an option) is to avoid a plain text format altogether.