rdataframesubsethighest

Subset/extract the highest values of groups in one column based on values of groups in other columns


I'm looking at oxygen concentrations in relation to bottom trawling at different depths in inner Danish waters for the last 40 years. I have a data frame (Oxy) with four columns: ID, Date, Depth and Oxygen. The Oxygen has been measured throughout many years (Date), at many different locations (ID) and at many different Depths down the water column, spanning from 0-50 meters.

I would like to create a data frame where I have the Oxygen for the last 4 meters (Depth) (from the bottom and 4 meters up in the water column) for each station and the corresponding date. The measurements are not by every whole meter but at varying depths. The Depth where the Oxygen has been measured are not the same for each ID, so for one ID it has been sampled at 0.2, 0.4, 0.6 meters etc. and for another ID it has been sampled at 0.67, 1.3, 1.55 meters etc. The Depth for each ID also varies, so for one station the deepest measurement is at 30 meters and for another one it's 46 meters.

I have about 4 million rows, so this is just an output of my data:

ID         Date          Depth    Oxygen
------    ----------     -----    ------
957001    2002-01-14     1.20     12.10
967503    2002-01-28     2.00     11.60
957001    2002-01-22    25.00      7.80
965206    2002-01-28     5.40     11.70
953001    2002-01-31    23.60     10.30
941101    2002-01-22     8.67     12.00
940201    2002-01-17     5.00     11.70
965404    2002-01-30    38.80      9.40
952003    2002-01-08    23.40      6.30
957101    2002-01-15     6.00     11.60

I have been searching on google for an answer but can't seem to find the right one. I can extract the highest value or the top 5 highest values by using arrange(), group_by() and slice(). However, that wouldn't work for my data frame because the measurement intervals vary in depth and it needs to be similar for all ID's and Dates.

I imagine that it could be something like; take the highest value and then keep the values that are within -4 from that highest value.

So, I need to end up with all the deepest (last 4 meters for Depth) measurements for Oxygen dependent on ID and Date.

It would look something like this:

ID         Date          Depth     Oxygen
------    ----------     -----     ------
957001    2002-01-14     30.20      2.10
967503    2002-01-28     28.00      1.60
957001    2002-01-22     29.00      7.80
965206    2002-01-28     30.40      5.70
953001    2002-01-31     23.60     10.30
941101    2002-01-22     28.67      7.00
940201    2002-01-17     30.00      8.70
965404    2002-01-30     38.80      9.40
952003    2002-01-08     23.40      6.30
957101    2002-01-15     46.00      1.60

Solution

  • Just as you said, filter to Depth greater than max() - 4 within each ID. Using dplyr:

    library(dplyr)
    
    oxy %>%
      group_by(ID) %>%
      filter(Depth >= max(Depth) - 4) %>%
      ungroup() 
    
    # A tibble: 9 × 4
          ID Date       Depth Oxygen
       <dbl> <date>     <dbl>  <dbl>
    1 967503 2002-01-28  2      11.6
    2 957001 2002-01-22 25       7.8
    3 965206 2002-01-28  5.4    11.7
    4 953001 2002-01-31 23.6    10.3
    5 941101 2002-01-22  8.67   12  
    6 940201 2002-01-17  5      11.7
    7 965404 2002-01-30 38.8     9.4
    8 952003 2002-01-08 23.4     6.3
    9 957101 2002-01-15  6      11.6