I have a .csv file that contains both correctly and incorrectly encoded accented characters. For example, the first line has "Veríssimo", and the second has "VirgÃ-nia" (which should be "Virgínia"). If I do nothing, the file opens with "Virgínia" misspelled. If I try one of the correction methods I know, such as saving the file with UTF-8 encoding, then "Veríssimo" is misspelled instead.
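To make the symptom concrete, here is a small base-R sketch (just an illustration, not my pipeline) of how the UTF-8 bytes of "í" turn into "Ã" plus a soft hyphen when interpreted as Latin-1:

x <- "Virgínia"
bytes <- charToRaw(enc2utf8(x))  # "í" is the two bytes C3 AD in UTF-8
y <- rawToChar(bytes)
Encoding(y) <- "latin1"          # misdeclare those bytes as Latin-1
y                                # prints roughly as "VirgÃ­nia"; the soft hyphen (AD) may render as "-"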
In R, I tried:
library(readr)
dados_MG2 <- read_csv("dados_MG.csv")
which assumes UTF-8 encoding and opens the file with "Veríssimo" misspelled.
dados_MG <- read_csv("Dados/extra/dados_MG.csv", locale = locale(encoding = "ISO-8859-1"))
I tried forcing a different encoding, and with it, "Veríssimo" is spelled correctly, but "Virgínia" is not.
Here is the link to my dataset: https://github.com/elisa-fink/THM
You can use nchar() with allowNA = TRUE to find the strings that contain invalid UTF-8 characters and replace those entries with the same strings read using the ISO-8859-1 encoding:
library(readr)

# Read the file twice, once per candidate encoding
dados_MG <- read_csv('dados_MG.csv')
dados_MG.iso <- read_csv('dados_MG.csv', locale = locale(encoding = 'ISO-8859-1'))

# nchar(allowNA = TRUE) gives NA for strings that are not valid UTF-8;
# patch those entries with the ISO-8859-1 reading
not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
dados_MG$DS_NOME[not.utf] <- dados_MG.iso$DS_NOME[not.utf]
grep('^Ver.ss|^Virg.ni', dados_MG$DS_NOME, value = TRUE)
# [1] "Veríssimo" "Virgínia"
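This works because nchar(allowNA = TRUE) returns NA for a string with an invalid multibyte encoding instead of raising an error. A minimal sketch, assuming an R session with a UTF-8 locale (the "\xed" byte below is a hypothetical Latin-1 "í"):

nchar("Ver\xedssimo", allowNA = TRUE)  # NA: the byte \xed is not valid UTF-8
nchar("Veríssimo", allowNA = TRUE)     # 9: valid UTF-8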
A simpler variant that reads the CSV file only once (inspired by @rps1227's answer):
dados_MG <- read_csv('dados_MG.csv')

# Flag invalid UTF-8 strings, then reinterpret their bytes as ISO-8859-1
not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
dados_MG$DS_NOME[not.utf] <-
  iconv(dados_MG$DS_NOME[not.utf], from = 'ISO-8859-1', to = 'UTF-8')
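If your R version is 3.3.0 or later, base R's validUTF8() expresses the same detection more directly; a sketch swapping it in for the nchar() trick:

# validUTF8() returns FALSE for strings whose bytes are not valid UTF-8
not.utf <- !validUTF8(dados_MG$DS_NOME)
dados_MG$DS_NOME[not.utf] <-
  iconv(dados_MG$DS_NOME[not.utf], from = 'ISO-8859-1', to = 'UTF-8')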