I am using the flight dataset that is freely available in R.
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")
Now, lets say i want to find all flight that have been flying for at least three consecutive years: so there are dates available for three years in the date
column. Basically i am only interested in the year
part of the data.
i was thinking of the following approach: create a unique list of all plane names and then for each plane get all the dates and see if there are three consecutive years.
I started as follows:
NOyears = 3
planes <- unique(flights$plane)
# at least 3 consecutive years
for (plane in planes){
plane = "N576AA"
allyears <- which(flights$plane == plane)
}
but i am stuck here. This whole approach start looking too complicated to me. Is there an easier/faster way? Considering that i am working on a very large dataset...
Note: I want to be able to specify the number of year later on, that is why i included NOyears = 3
in the first place.
EDIT:
I have just noticed this question on SO. Very interesting use of diff
and cumsum
which are both new to me. Maybe a similiar approach is possible here using data.table?
Here is another option using data.table
:
#summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]
#add a dummy flight to demonstrate consecutive years
yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight="111", year=2011:2014)))
setkey(yearly, carrier, flight, year)
yearly[, c("rl", "rw") := {
iscons <- cumsum(c(0L, diff(year)!=1L))
.(iscons, rowid(carrier, flight, iscons))
}]
yearly[rl %in% yearly[rw>=3L]$rl]
output:
carrier flight year rl rw
1: ZZ 111 2011 5117 1
2: ZZ 111 2012 5117 2
3: ZZ 111 2013 5117 3
4: ZZ 111 2014 5117 4