rfreadread.csvreadrcsv-import

How to read data in R when some rows contain commas as thousand separator and " flag and the rows without decimals don´t have flag


Im loading in R a csv (separated by comma) that has quotes " wrapping each row that contains a column with a value with decimals, and the particular value is wrapped with double quotes "" , the rows without this issue, have no " wrapping

the csv file looks like this:

YEAR,COUNTRY,VALUE_A,VALUE_B
2019,SPAIN, 2000, 300
"2019,SPAIN,""2000,54"",300"
"2014,SPAIN,""2003,223"",""125,057"""
2018,FRANCE,1900,280


Solution

  • Thats a doozy... the best I could come up with is reading it as text and then using gsub to clean the thousands mark and double quotes.

    # Read the csv as text, so we can run it through gsub
    #
    file_connection <- file("path_to_csv.csv")
    text <- readLines(file_connection)
    close(file_connection)
    

    After reading the content of the csv as a string we can deal with the text "formatting"

    # 1. Remove the comma as thousand mark
    # There HAS to be a better way to do this regex but I couldn't remember
    #
    sanitized_mark <- gsub('(\\"\\"[0-9]+),([0-9]+\\"\\")', '\\1\\2', text)
    
    # 2. Remove all double quotes
    # 
    sanitized_quotes <- gsub('\\"', '', sanitized_mark)
    
    # Paste it all together adding a newline character after each element
    #
    sanitized <- paste0(sanitized_quotes, collapse="\n")
    

    The resulting string can be read as if it was the content of a .csv using the text argument

    df <- read.csv(text=sanitized)