I have a large dataset (> 9 million rows) of times and locations when individual animals were detected at stations. I would like to calculate the distance between each station along each animal's path as it travelled between stations, as well as the time it took to travel between stations. And then I would like to summarize the total distance and time across all sections of the path.
For each individual, the dataset contains one record for each time it was detected at a station. If the individual stayed at a station for a long, consecutive period of time, there are multiple records (each ~30 s apart) for that period.
I can summarize the data as shown below to get 1 row for each station an individual visited. However, the output doesn't recognize when an individual travels to the same station more than once.
E.g.
library(dplyr)

id <- c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B")
site <- c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b")
time <- seq(1:10)
lat <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2)
lon <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2)
df <- data.frame(id, site, time, lat, lon)

df %>%
  group_by(id, site, lat, lon) %>%
  summarize(timeStart = min(time),
            timeEnd = max(time))
# A tibble: 6 x 6
# Groups:   id, site, lat [?]
  id    site    lat   lon timeStart timeEnd
  <fct> <fct> <dbl> <dbl>     <dbl>   <dbl>
1 A     a         1     1         1       4
2 A     b         2     2         3       3
3 A     c         3     3         5       7
4 A     d         4     4         8       8
5 B     a         1     1         9       9
6 B     b         2     2        10      10
I need an approach to group the data so that multiple visits to the same station (with trips to other stations in between) are each recognized as a separate "leg" of the trip.
Then, I need to calculate the great circle distance between consecutive stations, as well as the time difference between timeEnd (1st station) and timeStart (2nd station).
This may not be your complete solution, but it is a good start. It finds the distance and time difference between each pair of consecutive rows, and sets the values to NA for the first row of each id and for rows where the site did not change.
library(geosphere)
library(dplyr)

df <- data.frame(id, site, time, lat, lon)

# sort data by id and time
df <- df[order(df$id, df$time), ]

# distance (in meters) between each pair of consecutive rows
# note: longitude must be the first column
df$distance <- c(NA, distGeo(df[, c("lon", "lat")]))

# time difference between consecutive rows within each id,
# kept only where the site changes
df <- df %>%
  group_by(id) %>%
  mutate(dtime = case_when(site != lag(site) ~ time - lag(time),
                           TRUE ~ NA_integer_))

# remove distances where there is no delta time
# (first row of each id, or no change of site)
df$distance[is.na(df$dtime)] <- NA

# per-id summary (df is still grouped by id)
df %>% summarize(disttraveled = sum(distance, na.rm = TRUE),
                 totaltime = sum(dtime, na.rm = TRUE))
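If you also want one row per visit, so that a return to the same station counts as a separate leg, one possible approach is to number the visits with a running counter that increases whenever the site changes within an id. The following is only a sketch under some assumptions: the column names `visit`, `travelTime` and `distance_m` are made up for illustration, `.groups = "drop"` assumes dplyr 1.0.0 or later, and `haversine_m()` is a hypothetical base R helper that uses a spherical Earth of radius 6371 km as a stand-in for `geosphere::distGeo()`, so the distances will differ slightly from the ellipsoidal ones.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id   = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B"),
  site = c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b"),
  time = 1:10,
  lat  = c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2),
  lon  = c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2)
)

# Hypothetical helper: haversine great-circle distance in meters,
# assuming a spherical Earth (not the ellipsoid used by distGeo)
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# One row per visit: the counter increases each time the site
# differs from the previous row of the same id
visits <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(visit = cumsum(site != lag(site, default = first(site))) + 1L) %>%
  group_by(id, visit, site, lat, lon) %>%
  summarize(timeStart = min(time), timeEnd = max(time), .groups = "drop") %>%
  arrange(id, visit)

# Travel time and distance from the previous visit to the current one
# (NA for the first visit of each id)
legs <- visits %>%
  group_by(id) %>%
  mutate(
    travelTime = timeStart - lag(timeEnd),
    distance_m = haversine_m(lag(lon), lag(lat), lon, lat)
  ) %>%
  ungroup()

# Totals per id across all legs
totals <- legs %>%
  group_by(id) %>%
  summarize(totalDistance_m = sum(distance_m, na.rm = TRUE),
            totalTravelTime = sum(travelTime, na.rm = TRUE))
```

With the example data, individual A ends up with five visits (a, b, a, c, d) instead of four, because the return to station a is counted separately. On 9 million rows the grouped `mutate()` may be slow; `data.table::rleid()` is an alternative worth trying for the visit counter.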