Tags: r, csv, utf-8, non-ascii-characters

How do I open in R a downloaded .csv file that contains both correct accented characters and faulty ones?


I have a .csv file that contains both correctly and incorrectly read accented characters. For example, the first line has "Veríssimo" and the second has "VirgÃ­nia" (which should be "Virgínia"). If I do nothing, the file opens with "Virgínia" misspelled. If I try one of the correction methods I know, such as saving the file with UTF-8 encoding, then "Veríssimo" is misspelled instead.

In R, I tried dados_MG2 <- read_csv("dados_MG.csv"), which detects UTF-8 encoding and opens the file with "Veríssimo" misspelled.

I also tried forcing a different encoding: dados_MG <- read_csv("Dados/extra/dados_MG.csv", locale = locale(encoding = "ISO-8859-1")). With it, "Veríssimo" is spelled correctly, but "Virgínia" is not.
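These two results together suggest the file mixes encodings in one column: some rows store "í" as the single Latin-1 byte 0xED, others as the UTF-8 pair 0xC3 0xAD, so no single-encoding read can fix both. A byte-level sketch in a UTF-8 R session (illustrative, not tied to the actual file) shows why:

```r
# The same character "í" has different byte representations in the
# two encodings involved:
utf8_bytes   <- charToRaw(enc2utf8("\u00ed"))                  # c3 ad
latin1_bytes <- iconv("\u00ed", from = "UTF-8", to = "latin1",
                      toRaw = TRUE)[[1]]                       # ed

# The lone Latin-1 byte is not valid UTF-8, so it breaks a UTF-8 read;
# conversely, a Latin-1 read turns the UTF-8 pair into two characters.
validUTF8(rawToChar(latin1_bytes))   # FALSE
validUTF8(rawToChar(utf8_bytes))     # TRUE
```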

Here is the link to my dataset: https://github.com/elisa-fink/THM


Solution

  • You can use nchar to find strings with invalid UTF-8 characters and replace them with the same strings read using the ISO-8859-1 encoding:

    library(readr)
    
    dados_MG <- read_csv('dados_MG.csv')
    dados_MG.iso <- read_csv('dados_MG.csv', locale = locale(encoding = 'ISO-8859-1'))
    
    # nchar() returns NA (instead of an error) for strings whose bytes
    # are not valid UTF-8, so is.na() flags exactly the misread rows
    not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
    dados_MG$DS_NOME[not.utf] <- dados_MG.iso$DS_NOME[not.utf]
    
    grep('^Ver.ss|^Virg.ni', dados_MG$DS_NOME, value = TRUE)
    # [1] "Veríssimo" "Virgínia" 
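    Why the is.na(nchar(...)) trick works: with allowNA = TRUE, nchar() returns NA rather than erroring for strings whose bytes are not valid in the session's encoding, so in a UTF-8 locale it flags exactly the rows that were read with the wrong encoding. A minimal sketch with hypothetical strings (a UTF-8 locale is assumed):

```r
# One valid UTF-8 name and one containing a stray Latin-1 byte:
x <- c("Ver\u00edssimo", "Virg\xednia")

# NA marks the entry whose bytes are not valid UTF-8
nchar(x, allowNA = TRUE)
# [1]  9 NA
is.na(nchar(x, allowNA = TRUE))
# [1] FALSE  TRUE
```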
    

    A simpler variant that reads the CSV file only once (inspired by @rps1227's answer):

    dados_MG <- read_csv('dados_MG.csv')
    not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
    # Re-encode only the flagged entries, assuming they are ISO-8859-1
    dados_MG$DS_NOME[not.utf] <- 
      iconv(dados_MG$DS_NOME[not.utf], from = 'ISO-8859-1', to = 'UTF-8')
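    To sanity-check the approach without the original file, the whole round trip can be reproduced on a tiny synthetic CSV. This sketch uses base read.csv to stay dependency-free, and the DS_NOME column name simply mirrors the question's data (a UTF-8 locale is assumed):

```r
# Write a two-row file that deliberately mixes encodings in one column
tmp <- tempfile(fileext = ".csv")
writeLines(c("DS_NOME",
             "Ver\xedssimo",     # Latin-1 byte 0xED
             "Virg\u00ednia"),   # UTF-8 bytes 0xC3 0xAD
           tmp, useBytes = TRUE)

# Read, flag the rows whose bytes are not valid UTF-8, and re-encode them
dados <- read.csv(tmp)
not.utf <- is.na(nchar(dados$DS_NOME, allowNA = TRUE))
dados$DS_NOME[not.utf] <- iconv(dados$DS_NOME[not.utf],
                                from = "ISO-8859-1", to = "UTF-8")
dados$DS_NOME
# [1] "Veríssimo" "Virgínia"
```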