r, string, corrupt-data

Extract corrupted strings


I received a file with an odd encoding and wondered whether there is a way to check for 'corrupted' strings. For example:

dat <- c("天脊煤化工集团股份有é\231\220å…¬å\217¸", "AB \"\"Achema\"\"", 
         "Abu Qir Fertilizers & Chemical", "Abu Zaabal Fertilizer &", 
         "ADP - Adubos De Portugal SA")

The first and second elements of the vector above are corrupted, since they contain stray non-ASCII characters and escaped quotes. How can I filter these out, or generate an index of the corrupted strings in the vector dat?


Solution

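  error_string_idx <- which(
    # NA where a string cannot be converted to ASCII (non-ASCII characters)
    is.na(iconv(dat, to = "ASCII")) |
      # TRUE where a string contains a backslash or a double quote
      grepl('\\\\|\\"', dat)
  )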
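As a rough sketch of how the index might then be used to filter the vector: the sample data, the explicit from = "UTF-8" argument, and the names non_ascii, has_escape, is_corrupt, bad_idx and clean below are assumptions for illustration, not part of the original answer. It also swaps the regular expression for two fixed-string matches, which should flag the same characters (backslash or double quote).

```r
# Hypothetical sample data; the first two elements count as "corrupted"
dat <- c("\u5929\u810a\u00e9\u00e5", "AB \"\"Achema\"\"",
         "Abu Qir Fertilizers & Chemical", "ADP - Adubos De Portugal SA")

# TRUE where a string cannot be converted to ASCII
# (assumes the input is UTF-8; adjust `from` to match your file)
non_ascii <- is.na(iconv(dat, from = "UTF-8", to = "ASCII"))

# TRUE where a string contains a literal backslash or a double quote
has_escape <- grepl("\\", dat, fixed = TRUE) | grepl("\"", dat, fixed = TRUE)

is_corrupt <- non_ascii | has_escape

bad_idx <- which(is_corrupt)   # positions of the corrupted strings
clean   <- dat[!is_corrupt]    # vector with the corrupted strings removed
```

Keeping the two checks as separate logical vectors makes it easy to see why a given string was flagged. Note that legitimate names containing accented characters or quotes would be flagged as well, so the index is probably best treated as a list of candidates to review.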