rcustom-function

Rollify with custom function


First four columns are the starting data, Onset column is desired output:

Day Hr Min Cnts/min Onset
1 6 0 9.7 False
1 6 10 0.7 False
1 6 20 0.5 False
1 6 30 32.9 False
1 6 40 2.9 False
1 6 50 0.1 False
1 7 10 12.3 False
1 7 20 0.1 False
1 7 30 34.3 TRUE
1 7 40 23.3 False
1 7 50 26.3 False
1 8 10 2.3 False

and so on... there are 2,790 rows for roughly 20 days worth of data.

The Goal: Find activity onset on each day.

  1. Re-organize data so that each new column after the day, Hr, Min columns, represents one day.
  2. Apply a rolling function on each of these columns per day. Specifically, get the max for that column, then compare a rolling window of 6 cells. If there are at least 3 cells >20% of the maximum count for that column, return TRUE otherwise return FALSE (or perhaps even better return the TIME of onset per column i.e. Hr:Min)

The following appends a column called result based on a test case of the x2 column, but returning all NAs. What am I doing wrong?

library(zoo)
library(janitor)
library(tidyverse)


# load bin10 file
bin10 <- read_csv("./data/bin10.csv", skip=3)

bin10_wide <- bin10 %>%
  group_by(Day, Hr, Min) %>%
  summarize(`Cnts/min` = mean(`Cnts/min`, na.rm = TRUE)) %>%
  pivot_wider(names_from = Day, values_from = `Cnts/min`) %>% clean_names()


custom_function <- function(x, max_val) {
  # Check if the current cell is >20% of the max
  if (x[1] > 0.2 * max_val) {
    # Check if there are at least 3 cells in the next 6 that satisfy the criteria
    count <- sum(x[2:7] > 0.2 * max_val)
    return(count >= 3)
  } else {
    return(FALSE)
  }
}

data <- bin10_wide %>%
  mutate(result = rollapply(x2, width = 7, FUN = function(x) custom_function(x, max(x2, na.rm = TRUE)), 
                            by.column = FALSE, align = "right", fill = NA))



Solution

  • look for the first time in the day where there is an amount of activity above a certain threshold, and that activity is sustained for at least 3 bins of the next 6 bins.

    "next" is a contradiction to right-alignment. Please clarify. I suspect you are looking for right-alignment (a look behind) which means you are incorrectly using the term "next.

    Assumptions

    Approach

    based on base R and zoo::rollapplyr.

    We use with() to write xyzzy less often. ave() allows us to do an operation on groups, on days (day) here. To the FUN-argument of ave() we give zoo::rollapplyr(), where the suffix r stands for right-alignment: for the first five observations we cannot compute something, since the window is of width 6. Normally, we fill those values with NA (cp. fill = NA).

    > with(xyzzy, ave(Cnts.min, Day, FUN = \(x) {
    +   zoo::rollapplyr(data = x > max(x, na.rm = TRUE) / 5, 
    +                   width = 6, FUN = sum, fill = NA) > 2 })) 
     [1] NA NA NA NA NA  0  0  0  1  1  1  1
    

    As expected, the first five values are NA. Then, within a window of width 6 (a look behind), we find 4 occurences where the sum of days meeting the condition is greater than 2. As you are only interested in the first occurence, we can use which.max(). To my understanding, if there are several maximum values, which.max() returns the index of the first occurence.

    xyzzy$Onset = 
      with(xyzzy, ave(Cnts.min, Day, FUN = \(x) {
      ires = zoo::rollapplyr(data = x > max(x, na.rm = TRUE) / 5, 
                      width = 6, FUN = sum, fill = NA) > 2   
      # just the first occurence is relevant: 
      res = rep(0, length(ires)); res[which.max(ires)] = 1; res
      })) 
    

    Result

    > xyzzy
       Day Hr Min Cnts.min Onset
    1    1  6   0      9.7     0
    2    1  6  10      0.7     0
    3    1  6  20      0.5     0
    4    1  6  30     32.9     0
    5    1  6  40      2.9     0
    6    1  6  50      0.1     0
    7    1  7  10     12.3     0
    8    1  7  20      0.1     0
    9    1  7  30     34.3     1
    10   1  7  40     23.3     0
    11   1  7  50     26.3     0
    12   1  8  10      2.3     0
    

    Since you mentioned that you like to return "Hr:Min" for some values, I am not using logical. E.g. if you like to replace the TRUE (1) values:

    > with(xyzzy, ifelse(Onset == 1, paste0(Hr, ":", Min), Onset))
     [1] "0"    "0"    "0"    "0"    "0"    "0"    "0"    "0"    "7:30" "0"   
    [11] "0"    "0" 
    

    Notice, that ifelse() is considered to be slow, but easy-to-understand. In R, we can do rapidly fast operations based on vector indices and logicals (if necessary). However, speed is usally not of concern for medium-sized data (~2,700 observations).

    Data

    xyzzy = read.table(text = "Day  Hr  Min     Cnts/min    
    1   6   0   9.7     
    1   6   10  0.7     
    1   6   20  0.5     
    1   6   30  32.9    
    1   6   40  2.9     
    1   6   50  0.1     
    1   7   10  12.3    
    1   7   20  0.1     
    1   7   30  34.3    
    1   7   40  23.3    
    1   7   50  26.3    
    1   8   10  2.3", header = TRUE)