rdata.tableheading

Dealing with ambiguous column names in data.table


I'm reading data from an Excel file into a data.table in R. The file is formatted like so:

   COL_1_STUFF COL_2_STUFF COL_3_STUFF
ID EST MOE PCT EST MOE PCT EST MOE PCT

That is, for each variable (the COLs), there is an estimate, a margin of error, and a percentage given.

The trouble is being created by read.xlsx2, which I'm using to import the file like so:

data <- as.data.table(read.xlsx2(
  "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L))

The problem is that read.xlsx2 assigns the same column name to a bunch of stuff--the import looks something like:

ID EST EST EST

even if I set header = FALSE, I'm apt to get something like

X1 X2 X2 X2

To circumvent this, I've done the following subsequent to import:

data[ , c("col1_est", "EST") := .(EST, NULL)]
data[ , c("col2_est", "EST") := .(EST, NULL)]
data[ , c("col3_est", "EST") := .(EST, NULL)]

This strikes me as an odd way to deal with the problem; can anyone suggest an alternate approach to this?


Solution

  • This can be done with the check.names argument to fread, data.table, and setDT:

    DT <- read.xlsx2(
      "file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L)
    )
    setDT(DT, check.names = TRUE)
    

    Automatically adds .1, .2, etc. to duplicated column names.