r dataframe subset

How to drop columns by name in a data frame


I have a large data set and I would like to read specific columns or drop all the others.

data <- read.dta("file.dta")

I select the columns that I'm not interested in:

var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and then I'd like to do something like:

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

to drop all the unwanted columns. Is this the optimal solution?


Solution

  • You should use either indexing or the subset function. For example:

    R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
    R> df
      x y z u
    1 1 2 3 4
    2 2 3 4 5
    3 3 4 5 6
    4 4 5 6 7
    5 5 6 7 8
    

    Then you can use the which function and the - operator when indexing columns:

    R> df[ , -which(names(df) %in% c("z","u"))]
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6
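
    One caveat with the -which() form is worth sketching: if none of the names match, which() returns integer(0) and the result has no columns at all. The snippet below is only an illustration on the same df; the names "a" and "b" are made up to trigger the case, and the logical-negation variant is an alternative, not something the indexing above requires.

    ```r
    df <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)

    # No column matches, so which() returns integer(0) and every column is dropped.
    empty <- df[ , -which(names(df) %in% c("a", "b"))]
    ncol(empty)   # 0

    # Logical negation keeps all columns when nothing matches.
    safe <- df[ , !(names(df) %in% c("a", "b")), drop = FALSE]
    ncol(safe)    # 4

    # drop = FALSE keeps a data frame even when a single column remains.
    one <- df[ , !(names(df) %in% c("y", "z", "u")), drop = FALSE]
    class(one)    # "data.frame"
    ```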
    

    Or, more simply, use the select argument of the subset function: you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names!

    R> subset(df, select=-c(z,u))
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6
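
    The unquoted form is convenient at the console, but when the names to drop are already stored in a character vector (as with var.out in the question), plain indexing with setdiff is one possible way to get the same result. This is a minimal sketch; drop_cols is a hypothetical variable name.

    ```r
    df <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)

    # Hypothetical character vector holding the names to drop.
    drop_cols <- c("z", "u")

    # Keep every column whose name is not listed in drop_cols.
    kept <- df[setdiff(names(df), drop_cols)]
    names(kept)   # "x" "y"
    ```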
    

    Note that you can also select the columns you want to keep instead of dropping the others:

    R> df[ , c("x","y")]
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6
    
    R> subset(df, select=c(x,y))
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6
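
    Applied to the situation in the question, the loop fails because paste() only builds a string, which cannot be assigned to. A hedged sketch of some working equivalents follows; the small data frame stands in for read.dta("file.dta"), and the columns extra1 and extra2 are invented for illustration.

    ```r
    # Stand-in for the imported file; extra1 and extra2 are made-up columns.
    data <- data.frame(iden = 1:3, name = c("a", "b", "c"),
                       x_serv = 4:6, m_serv = 7:9,
                       extra1 = 0, extra2 = 1)

    keep <- c("iden", "name", "x_serv", "m_serv")
    var.out <- setdiff(names(data), keep)   # names of the unwanted columns

    # Working version of the loop: [[ ]] accepts a column name held in a variable.
    by_loop <- data
    for (v in var.out) by_loop[[v]] <- NULL

    # Same effect without a loop: assign NULL to all unwanted columns at once.
    by_null <- data
    by_null[var.out] <- NULL

    # Or simply keep the wanted columns.
    by_keep <- data[keep]
    ```

    All three leave only iden, name, x_serv and m_serv; selecting the columns to keep is usually the shortest to read.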