Tags: r, csv, sqldf, read.csv

How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql?


I am a relative beginner to R trying to load and explore a large (7GB) CSV file.

It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).

It's too large to read straight into R, and my searching suggests the sqldf package could be useful. But when I try to read the file in with this code ...

library(sqldf)
library(here)

read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")

I get this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 10 did not have 196 elements

Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE to get around this, but I can't work out how to do the same with the read.csv.sql function. I also can't open the CSV in Excel to inspect it because it's too large.

Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).


Solution

  • Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. To analyze the data within it, I suggest changing your data access to something else, such as:

    There are ways to get a large CSV file into each of these.
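
    Whichever route you take, it may help to look at the lines the error complains about first, since Excel is not an option. Below is a minimal base R sketch of one way to do that, reading only the first few lines rather than the whole file. The temporary file and its contents are made up for illustration; for the real data you would point `path` at your own file. Turning off quote and comment handling (`quote = ""`, `comment.char = ""`) is an assumption about what may be tripping up the parser (stray quotes or `#` inside product names), not a confirmed diagnosis.

```r
# Hypothetical small tab-separated file standing in for the real 7GB export.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "code\tproduct_name\tbrands",
  "001\tOat milk\tAcme",
  "002\tJoe's \"best\" jam",
  "003\tPlain # yoghurt\tAcme"
), path)

# Read only the first few raw lines, so the full file never has to fit in memory.
head_lines <- readLines(path, n = 20)

# Count the tab-separated fields on each of those lines.
# quote = "" and comment.char = "" stop quotes and '#' from being treated specially.
con <- textConnection(head_lines)
n_fields <- count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)

# Lines whose field count differs from the header line (line 1).
bad_lines <- which(n_fields != n_fields[1])
print(n_fields)
print(head_lines[bad_lines])

# Parse a small sample of rows; fill = TRUE pads short rows with blank fields,
# and colClasses = "character" keeps codes such as "001" from losing leading zeros.
sample_rows <- read.delim(path, nrows = 3, quote = "", comment.char = "",
                          fill = TRUE, colClasses = "character")
str(sample_rows)
```

    If the field counts only disagree when the default quote or comment settings are in effect, that points at the parsing options rather than at genuinely missing values.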