rsortingfilterconditional-statementslarge-files

Ordering rows based on multiple conditions in R


I am working with a large dataset of over 1.3 million fish detection points. I am trying to calculate distances and swim speed based on the location and time stamp data. However, I noticed that random portions of the data are collected out of order. For instance, Fish #1 has locations recorded every 2-3 seconds, but in some cases, record A with a time stamp 22:00:05 is followed by record B with a timestamp of 22:00:01. This is a problem for me, because I am generating swim speeds based on the difference between consecutive records. So in this case, I end up with a time difference value of -4, which produces negative swim speeds...

Here is a sample block of code of what my data looks like.

Treatment <- c("Sound", "Sound", "Feed", "Control")
Time <- c("20:00:05", "20:00:01", "20:00:06", "21:00:01")
FishID = c("2897", "2897", "2897", "2744")
Date = c("5/13/2024", "5/13/2024", "5/13/2024", "5/14/2024")
FishData <- data.frame(FishID, Treatment, Time, Date)
FishData
FishID Treatment     Time      Date
1   2897     Sound 20:00:05 5/13/2024
2   2897     Sound 20:00:01 5/13/2024
3   2897      Feed 20:00:06 5/13/2024
4   2744   Control 21:00:01 5/14/2024

I want to re-order the columns so they are in chronological order, while retaining the various categorical groupings. For example, I want row 2 to precede row 1.

I don't know how to perform complex conditional filtering tasks.


Solution

  • Base function order can stably sort date time vectors.
    From the documentation:

    Method "radix" stably sorts logical, numeric and character vectors in linear time. It outperforms the other methods, although there are drawbacks, especially for character vectors (see sort).

    And from the documentation for sort, my emphasis:

    Method "radix" relies on simple hashing to scale time linearly with the input size, i.e., its asymptotic time complexity is O(n). The specific variant and its implementation originated from the data.table package and are due to Matt Dowle and Arun Srinivasan. For small inputs (< 200), the implementation uses an insertion sort (O(n^2)) that operates in-place to avoid the allocation overhead of the radix sort. For integer vectors of range less than 100,000, it switches to a simpler and faster linear time counting sort. In all cases, the sort is stable; the order of ties is preserved. It is the default method for integer vectors and factors.

    So if you create an permutation of the data frame's rows sorting by date and time, you will preserve any preexisting order.

    Treatment <- c("Sound", "Sound", "Feed", "Control")
    Time <- c("20:00:05", "20:00:01", "20:00:06", "21:00:01")
    FishID = c("2897", "2897", "2897", "2744")
    Date = c("5/13/2024", "5/13/2024", "5/13/2024", "5/14/2024")
    FishData <- data.frame(FishID, Treatment, Time, Date)
    
    i <- FishData |>
      with(as.POSIXct(paste(Date, Time), format = "%m/%d/%Y %H:%M:%S")) |>
      order(method = "radix")
    FishData[i,]
    #>   FishID Treatment     Time      Date
    #> 2   2897     Sound 20:00:01 5/13/2024
    #> 1   2897     Sound 20:00:05 5/13/2024
    #> 3   2897      Feed 20:00:06 5/13/2024
    #> 4   2744   Control 21:00:01 5/14/2024
    

    Created on 2024-09-21 with reprex v2.1.0


    Edit

    This pipe is clearer.

    i <- with(FishData, paste(Date, Time)) |>
      as.POSIXct(format = "%m/%d/%Y %H:%M:%S") |>
      order(method = "radix")