r, data-import

Why can't I import the following dataset from UCI?


Good afternoon,

Assume we have the following function:

data_preprocessing <- function(link, drop_last_column = TRUE) {
  
  link <- as.character(link)
  # download the file and parse it; "?" marks missing values
  DT <- data.table::fread(link,
                          fill = TRUE,
                          na.strings = "?")
  DT <- DT[-1, ]           # drop the first row
  DT <- as.data.frame(DT)
  
  if (drop_last_column) {
    DT <- DT[, -ncol(DT)]  # drop the last column
  }
  
  return(DT)
}

When I try to import the acute dataset from UCI, I get the following error:

acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
 [100%] Downloaded 7276 bytes...
Error in data.table::fread(link, fill = TRUE, na.strings = "?") : 
  File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

I also tried:

acute=read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  embedded nul(s) found in input

Thank you for your help!


Solution

  • Use read.table() with the appropriate fileEncoding instead.

    data_preprocessing <- function(link, drop_last_column = TRUE) {
      
      link <- as.character(link)
      # read.table() decodes the UTF-16 file through fileEncoding
      DT <- read.table(link,
                       fileEncoding = "UTF-16",
                       fill = TRUE,
                       na.strings = "?")
      DT <- DT[-1, ]           # drop the first row
      
      if (drop_last_column) {
        DT <- DT[, -ncol(DT)]  # drop the last column
      }
      
      return(DT)
    }
    
    acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
    
    head(acute)
        V1 V2  V3  V4  V5  V6  V7
    2 35,9 no  no yes yes yes yes
    3 35,9 no yes  no  no  no  no
    4 36,0 no  no yes yes yes yes
    5 36,0 no yes  no  no  no  no
    6 36,0 no yes  no  no  no  no
    7 36,2 no  no yes yes yes yes
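
    Note that the first column is imported as character because the file uses a
    comma as the decimal separator (e.g. 35,9). A minimal follow-up sketch,
    assuming you want that column numeric (the name V1 comes from the
    header-less import):

    # convert the comma-decimal temperature column to numeric
    acute$V1 <- as.numeric(sub(",", ".", acute$V1, fixed = TRUE))
    str(acute$V1)

    Alternatively, read.table() accepts dec = ",", which would parse the column
    as numeric during the import itself.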
    

    Edit: to detect the encoding of the data file automatically, you can use the guess_encoding() function from the readr package.

    data_preprocessing <- function(link, drop_last_column = TRUE) {
      
      link <- as.character(link)
      # let readr guess the file encoding and keep the most confident candidate
      enc_guess <- readr::guess_encoding(link)
      enc <- enc_guess$encoding[which.max(enc_guess$confidence)]
      DT <- read.table(link,
                       fileEncoding = enc,
                       fill = TRUE,
                       na.strings = "?")
      DT <- DT[-1, ]           # drop the first row
      
      if (drop_last_column) {
        DT <- DT[, -ncol(DT)]  # drop the last column
      }
      
      return(DT)
    }
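
    A quick usage sketch of this auto-detecting version (it assumes the readr
    package is installed and that the UCI URL is still reachable):

    # inspect the candidate encodings and their confidence scores
    readr::guess_encoding("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")

    acute <- data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
    head(acute)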