databasenosqlarangodbaql

What's the best way to return a sample of data over the a period?


Let's say I have a collection with 3600 documents — one per second for the last hour — and each document has two fields: timestamp and value.

What is the best (read: most performant) method to select a sample of this data, say, 12 documents, with five minutes between? Or 60 documents, one per minute?

In reality, this collection will have tens of millions of records, and the query will be ran quite often, so performance really is key. With an index on the two fields a query filtering by timestamp > {one hour ago} is relatively quick on a collection with 200,000 records.

This post has been succeeded by Aggregating averages from large datasets for number of steps over period of time in ArangoDB.


Solution

  • The short answer to this is:

    LET steps = 24
    LET stepsRange = 0..23
    LET diff = @end - @start
    LET interval = diff / steps
    
    FOR step IN stepsRange
    RETURN FIRST(
        LET stepStart = start + (interval * step)
        LET stepEnd = stepStart + interval
    
        RETURN FIRST(
            FOR f IN filteredObservations
            FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(f.temperature)
            RETURN temperature
        )
    )
    

    For more details, see the answer in my question with superseded this one: https://stackoverflow.com/a/72886996/1138620