rfunctionloopsdata.tableposixlt

finding all flights that have at least three years of data in R


I am using the flight dataset that is freely available in R.

flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")

Now, lets say i want to find all flight that have been flying for at least three consecutive years: so there are dates available for three years in the date column. Basically i am only interested in the year part of the data.

i was thinking of the following approach: create a unique list of all plane names and then for each plane get all the dates and see if there are three consecutive years.

I started as follows:

NOyears = 3
planes <- unique(flights$plane) 

# at least 3 consecutive years 
for (plane in planes){
  plane = "N576AA"
  allyears <- which(flights$plane == plane)
}

but i am stuck here. This whole approach start looking too complicated to me. Is there an easier/faster way? Considering that i am working on a very large dataset...

Note: I want to be able to specify the number of year later on, that is why i included NOyears = 3 in the first place.

EDIT:

I have just noticed this question on SO. Very interesting use of diff and cumsum which are both new to me. Maybe a similiar approach is possible here using data.table?


Solution

  • Here is another option using data.table:

    #summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
    yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]
    
    #add a dummy flight to demonstrate consecutive years
    yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight="111", year=2011:2014)))
    
    setkey(yearly, carrier, flight, year)    
    yearly[, c("rl", "rw") := {
        iscons <- cumsum(c(0L, diff(year)!=1L))
        .(iscons, rowid(carrier, flight, iscons))
    }]
    
    yearly[rl %in% yearly[rw>=3L]$rl]
    

    output:

       carrier flight year   rl rw
    1:      ZZ    111 2011 5117  1
    2:      ZZ    111 2012 5117  2
    3:      ZZ    111 2013 5117  3
    4:      ZZ    111 2014 5117  4