r, dataframe, subset

Function for median similar to "which.max" and "which.min" / Extracting median rows from a data.frame


I occasionally need to extract specific rows from a data.frame based on values from one of the variables. R has built-in functions for maximum (which.max()) and minimum (which.min()) that allow me to easily extract those rows.

Is there an equivalent for median? Or is my best bet to just write my own function?

Here's an example data.frame and how I would use which.max() and which.min():

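# Note: R >= 3.6.0 changed sample()'s default algorithm, so V4 may differ there
# unless you use set.seed(1, sample.kind = "Rounding") instead.
set.seed(1) # so you can reproduce this example
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10),
                 V4 = sample(1:20, 10, replace = TRUE))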

# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]

For this particular example, since there is an even number of observations, I would need two rows returned: in this case, rows 2 and 10.

Update

It would seem that there is not a built-in function for this. As such, using the reply from Sacha as a starting point, I wrote this function:

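which.median = function(x) {
  n = length(x)
  if (n %% 2 != 0) {
    # odd length: the median is an actual element of x
    which(x == median(x))
  } else {
    # even length: take the two middle order statistics
    a = sort(x)[c(n/2, n/2 + 1)]
    # unique() avoids duplicate indices when both middle values are equal
    unique(c(which(x == a[1]), which(x == a[2])))
  }
}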

I'm able to use it as follows:

# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]

Are there any suggestions to improve this?


Solution

  • While Sacha's solution is quite general, the median (like any other quantile) is an order statistic, so you can calculate the corresponding indices from order (x) (instead of using sort (x) to get the quantile values).
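
    To illustrate the difference, here is a minimal sketch with a made-up vector v (not taken from the question): sort() returns the ordered values, while order() returns the positions of those values in the original vector, which is what row indexing needs.

    ```r
    v <- c(30, 10, 20)            # hypothetical example data

    sorted_values <- sort(v)      # the values in increasing order
    positions     <- order(v)     # where each of those values sits in v

    # indexing v by order(v) reproduces sort(v)
    same <- identical(v[positions], sorted_values)

    # position in v of the middle order statistic (odd length, so it is the median)
    mid_pos <- positions[(length(v) + 1) / 2]
    ```

    Because mid_pos is a position rather than a value, it can be used directly as a row index into a data.frame.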

    Looking into quantile, types 1 or 3 could be used; all other types return (weighted) averages of two values in certain cases.
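
    A small sketch of why the type matters, using a made-up vector with an even number of values (the data are an assumption for illustration only): types 1 and 3 return an element of the vector, whereas the default type 7 interpolates between the two middle values.

    ```r
    v <- c(1, 3, 5, 7)   # hypothetical data, even number of values

    q1 <- unname(quantile(v, 0.5, type = 1))  # an actual element of v
    q3 <- unname(quantile(v, 0.5, type = 3))  # an actual element of v
    q7 <- unname(quantile(v, 0.5, type = 7))  # default type: interpolated value

    in_data <- c(type1 = q1 %in% v, type3 = q3 %in% v, type7 = q7 %in% v)
    ```

    Only a quantile that is an element of the data can be traced back to a row, which is why an interpolating type is of no use for a which-style function.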

    I chose type 3, and a bit of copy & paste from quantile leads to:

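    which.quantile <- function (x, probs, na.rm = FALSE){
      # without na.rm, any NA in x makes every requested index NA
      if (! na.rm && any (is.na (x)))
        return (rep (NA_integer_, length (probs)))
    
      o <- order (x)              # order () puts NAs last ...
      n <- sum (! is.na (x))
      o <- o [seq_len (n)]        # ... so this keeps only the non-NA positions
    
      # index computation as in quantile (type = 3)
      nppm <- n * probs - 0.5
      j <- floor(nppm)
      h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
      j <- j + h
    
      j [j == 0] <- 1             # j = 0 (very small probs) maps to the minimum
      o[j]
    }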
    

    A little test:

    > x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
    > probs <- c (0, .23, .5, .6, 1)
    > which.quantile (x, probs, na.rm = TRUE)
    [1] 10  1  6  6  4
    > x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
    
      0%  23%  50%  60% 100% 
    TRUE TRUE TRUE TRUE TRUE 
    

    Here's your example:

    > dat [which.quantile (dat$V4, c (0, .5, 1)),]
      V1         V2          V3 V4
    7  7  0.4874291 -0.01619026  1
    2  2  0.1836433  0.38984324 13
    1  1 -0.6264538  1.51178117 17
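
    Note that a type 3 quantile picks a single order statistic per probability, so for an even number of rows only one median row comes back. If both middle rows are wanted, as in the question, one possible sketch is the following. The helper name which.middle and the data frame d are made up for illustration and are not part of the function above.

    ```r
    # Hypothetical helper: positions of the middle order statistic(s) of x,
    # one position for an odd number of non-NA values, two for an even number.
    which.middle <- function(x) {
      o <- order(x, na.last = NA)        # na.last = NA drops missing values
      n <- length(o)
      if (n == 0L) return(integer(0))
      o[unique(c(floor((n + 1) / 2), ceiling((n + 1) / 2)))]
    }

    # Made-up data, since the output of sample() depends on the R version
    d <- data.frame(id = 1:6, val = c(40, 10, 60, 30, 50, 20))
    even_rows <- d[which.middle(d$val), ]          # two middle rows
    odd_rows  <- d[-6, ][which.middle(d$val[-6]), ] # one middle row
    ```

    Unlike a comparison with ==, this returns exactly one position per middle order statistic, so rows that merely tie with a middle value are not all returned. Whether that is the desired behaviour depends on the use case.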