rcoding-efficiency

How to sum a frequency table into smaller intervals without loops?


I would like to sum a frequency table into a more compact frequency table by filtering by a variable and certain intervals without using loops and manually creating most of the data frame.

What I'm working with:

head(ride_duration)
  member_casual   duration frequency
1        casual 0.00000000       235
2        member 0.00000000       228
3        casual 0.01666667       567
4        member 0.01666667       813
5        casual 0.03333333      1527
6        member 0.03333333      3570
...

tail(ride_duration)
      member_casual duration frequency
45193        casual 35820.63         1
45194        member 35820.63         0
45195        casual 36257.80         1
45196        member 36257.80         0
45197        casual 40705.02         1
45198        member 40705.02         0

Desired outcome:

   member_casual        interval frequency
1         member      0-1 minute   sum of duration <=1
2         member     1-5 minutes   sum of duration >1<=5
3         member    5-10 minutes   sum of duration >5<=10
4         member   10-15 minutes         .
5         member   15-30 minutes         .
6         member   30-60 minutes         .
7         member  60-120 minutes         .
8         member  120-180 minute         .
9         member 180-240 minutes         .
10        member    240+ minutes         .
11        casual      0-1 minute   sum of duration <=1
12        casual     1-5 minutes   sum of duration >1<=5
13        casual    5-10 minutes   sum of duration >5<=10
14        casual   10-15 minutes         .
15        casual   15-30 minutes         .
16        casual   30-60 minutes         .
17        casual  60-120 minutes         .
18        casual  120-180 minute         .
19        casual 180-240 minutes         .
20        casual    240+ minutes         .

I was able to solve this problem using loops, which took me quite a while to figure out, but I feel like there should have been a function that would have made this process significantly easier. Here's what I ended up doing that worked:

# Create table with set minute intervals with frequencies
ride_duration_pyramid <- data.frame("member_casual"=c("casual","casual","casual","casual","casual","casual","casual","casual","casual","casual",
                                                      "member","member","member","member","member","member","member","member","member","member"),
                                    "interval"=c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes", "15-30 minutes",
                                                 "30-60 minutes","60-120 minutes","120-180 minute", "180-240 minutes", "240+ minutes",
                                                 "0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes", "15-30 minutes",
                                                 "30-60 minutes","60-120 minutes","120-180 minute", "180-240 minutes", "240+ minutes"),
                                    "frequency"=replicate(20,0))
current_interval <- 1 # Starting interval
interval_high <- c(1,5,10,15,30,60,120,180,240,40706) # Interval maxs
for (i in 1:nrow(ride_duration)) { # Scan all rows
  row <- ride_duration[i,]
  # Walk through intervals to find the row this data point falls into.
  while (row[2] >= interval_high[current_interval]) {
    current_interval <- current_interval + 1;
  }
  if (row[1] == "casual") {
    ride_duration_pyramid[current_interval,3] <- ride_duration_pyramid[current_interval,3] + row[3];
  } else {
    ride_duration_pyramid[current_interval+10,3] <- ride_duration_pyramid[current_interval+10,3] + row[3];
  }
}

Solution

  • Here is a way with cut and aggregate.

    Note that the breaks vector is not exactly the same as posted in the question. It's not a maxima vector, it starts at the lowest possible value and ends at the greatest possible one. The breaks vector must cover the entire data range and since the upper limit is described as "240+", Inf is used.

    ride_duration <- "
    member_casual   duration frequency
    1        casual 0.00000000       235
    2        member 0.00000000       228
    3        casual 0.01666667       567
    4        member 0.01666667       813
    5        casual 0.03333333      1527
    6        member 0.03333333      3570
    45193        casual 35820.63         1
    45194        member 35820.63         0
    45195        casual 36257.80         1
    45196        member 36257.80         0
    45197        casual 40705.02         1
    45198        member 40705.02         0
    "
    ride_duration <- read.table(textConnection(ride_duration), header = TRUE)
    
    #
    interval_high <- c(0,1,5,10,15,30,60,120,180,240,Inf)
    interval_labels <- c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes", 
                         "15-30 minutes", "30-60 minutes", "60-120 minutes", "120-180 minute", 
                         "180-240 minutes", "240+ minutes")
    interval <- cut(ride_duration$duration, interval_high, labels = interval_labels, include.lowest = TRUE)
    
    aggregate(frequency ~ interval + member_casual, ride_duration, sum, na.rm = TRUE)[c(2:1, 3)]
    #>   member_casual     interval frequency
    #> 1        casual   0-1 minute      2329
    #> 2        casual 240+ minutes         3
    #> 3        member   0-1 minute      4611
    #> 4        member 240+ minutes         0
    

    Created on 2022-12-05 with reprex v2.0.2