I would like to sum a frequency table into a more compact frequency table by filtering by a variable and certain intervals without using loops and manually creating most of the data frame.
What I'm working with:
head(ride_duration)
member_casual duration frequency
1 casual 0.00000000 235
2 member 0.00000000 228
3 casual 0.01666667 567
4 member 0.01666667 813
5 casual 0.03333333 1527
6 member 0.03333333 3570
...
tail(ride_duration)
member_casual duration frequency
45193 casual 35820.63 1
45194 member 35820.63 0
45195 casual 36257.80 1
45196 member 36257.80 0
45197 casual 40705.02 1
45198 member 40705.02 0
Desired outcome:
member_casual interval frequency
1 member 0-1 minute sum of duration <=1
2 member 1-5 minutes sum of duration >1<=5
3 member 5-10 minutes sum of duration >5<=10
4 member 10-15 minutes .
5 member 15-30 minutes .
6 member 30-60 minutes .
7 member 60-120 minutes .
8 member 120-180 minute .
9 member 180-240 minutes .
10 member 240+ minutes .
11 casual 0-1 minute sum of duration <=1
12 casual 1-5 minutes sum of duration >1<=5
13 casual 5-10 minutes sum of duration >5<=10
14 casual 10-15 minutes .
15 casual 15-30 minutes .
16 casual 30-60 minutes .
17 casual 60-120 minutes .
18 casual 120-180 minute .
19 casual 180-240 minutes .
20 casual 240+ minutes .
I was able to solve this problem using loops, which took me quite a while to figure out, but I feel like there should have been a function that would have made this process significantly easier. Here's what I ended up doing that worked:
# Create table with set minute intervals with frequencies
ride_duration_pyramid <- data.frame("member_casual"=c("casual","casual","casual","casual","casual","casual","casual","casual","casual","casual",
"member","member","member","member","member","member","member","member","member","member"),
"interval"=c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes", "15-30 minutes",
"30-60 minutes","60-120 minutes","120-180 minute", "180-240 minutes", "240+ minutes",
"0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes", "15-30 minutes",
"30-60 minutes","60-120 minutes","120-180 minute", "180-240 minutes", "240+ minutes"),
"frequency"=replicate(20,0))
current_interval <- 1 # Starting interval
interval_high <- c(1,5,10,15,30,60,120,180,240,40706) # Interval maxs
for (i in 1:nrow(ride_duration)) { # Scan all rows
row <- ride_duration[i,]
# Walk through intervals to find the row this data point falls into.
while (row[2] >= interval_high[current_interval]) {
current_interval <- current_interval + 1;
}
if (row[1] == "casual") {
ride_duration_pyramid[current_interval,3] <- ride_duration_pyramid[current_interval,3] + row[3];
} else {
ride_duration_pyramid[current_interval+10,3] <- ride_duration_pyramid[current_interval+10,3] + row[3];
}
}
Here is a way with cut
and aggregate
.
cut
the numeric vector of durations by the break points and set the factor's labels;aggregate
the data.Note that the breaks vector is not exactly the same as posted in the question. It's not a maxima vector, it starts at the lowest possible value and ends at the greatest possible one. The breaks vector must cover the entire data range and since the upper limit is described as "240+"
, Inf
is used.
ride_duration <- "
member_casual duration frequency
1 casual 0.00000000 235
2 member 0.00000000 228
3 casual 0.01666667 567
4 member 0.01666667 813
5 casual 0.03333333 1527
6 member 0.03333333 3570
45193 casual 35820.63 1
45194 member 35820.63 0
45195 casual 36257.80 1
45196 member 36257.80 0
45197 casual 40705.02 1
45198 member 40705.02 0
"
ride_duration <- read.table(textConnection(ride_duration), header = TRUE)
#
interval_high <- c(0,1,5,10,15,30,60,120,180,240,Inf)
interval_labels <- c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-15 minutes",
"15-30 minutes", "30-60 minutes", "60-120 minutes", "120-180 minute",
"180-240 minutes", "240+ minutes")
interval <- cut(ride_duration$duration, interval_high, labels = interval_labels, include.lowest = TRUE)
aggregate(frequency ~ interval + member_casual, ride_duration, sum, na.rm = TRUE)[c(2:1, 3)]
#> member_casual interval frequency
#> 1 casual 0-1 minute 2329
#> 2 casual 240+ minutes 3
#> 3 member 0-1 minute 4611
#> 4 member 240+ minutes 0
Created on 2022-12-05 with reprex v2.0.2