r import readr

Losing a column of data when importing a large csv file with read_csv


I'm importing a 2.2GB csv file with 24 million rows using read_csv(). One of the columns (vital sign_date_time), a character variable, is not being read and is importing with only NA values.

I've opened the .csv file in SQL Server and can confirm the data is there in the file. I've also broken the large file up into smaller chunks in the macOS Terminal. When I import the smaller files, again with read_csv(), the data is present.

I'm using the import dialog box in RStudio to minimize any typing errors. In the data view section of the dialog box, it shows only NA data in the column in question and is trying to import the column as a logical field. I've tried manually changing this to character type and it still reads only NA values.

Here's a screenshot of the dialog box:

[screenshot of the dialog box]

Any ideas about what might be happening?

Thanks.

Take care, Jeff


Solution

  • I was bitten by a similar problem recently, so this is a guess based on that experience.

    By default, readr::read_csv guesses each column's type from the first 1000 rows. If every one of those entries in a column is NA, the column is guessed as logical, and later values that cannot be parsed as logical are read as NA. You can control this by setting the guess_max argument. Here is the documentation:

    guess_max: Maximum number of records to use for guessing column types.
    

    For example,

    library(readr)
    # Raise guess_max so type guessing looks at more than the first 1000 rows
    dat <- read_csv("file.csv", guess_max = 100000)
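
If the column's values start later than any practical guess_max, another option is to declare the column type yourself so that no guessing is needed for it. The sketch below is only an illustration under assumptions: the file is a small temporary one built for the demo, and the column name sign_time and the sizes n_empty and n_filled are hypothetical stand-ins, not the asker's data.

```r
library(readr)

# Hypothetical demo file: the character column is NA for the first
# 2000 rows and only has values after that.
n_empty <- 2000
n_filled <- 10
demo <- data.frame(
  id = seq_len(n_empty + n_filled),
  sign_time = c(rep(NA_character_, n_empty),
                rep("01JAN2020:08:30:00", n_filled)),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write_csv(demo, path)

# Option 1: declare the type up front; other columns are still guessed.
dat_typed <- read_csv(
  path,
  col_types = cols(sign_time = col_character())
)

# Option 2: let the guess consider as many rows as the file has.
dat_guessed <- read_csv(path, guess_max = n_empty + n_filled)

# Both reads should keep the late values instead of turning them into NA.
sum(!is.na(dat_typed$sign_time))
sum(!is.na(dat_guessed$sign_time))
```

On a 24-million-row file, the explicit col_types route is probably the cheaper of the two, since a very large guess_max makes readr inspect many more rows before parsing. If values still come back as NA, readr's problems() function applied to the result should list the rows that failed to parse.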