pythonsqlduckdb

Generating key - value map from aggregates


I have raw data that appears like this:

┌─────────┬────────┬─────────────────────┐
│  price  │  size  │      timestamp      │
│  float  │ uint16 │      timestamp      │
├─────────┼────────┼─────────────────────┤
│  1697.0 │     11 │ 2009-09-27 18:00:00 │
│  1697.0 │      5 │ 2009-09-27 18:00:00 │
│  1697.0 │      5 │ 2009-09-27 18:00:00 │
│  1697.0 │      5 │ 2009-09-27 18:00:00 │
│  1697.0 │      5 │ 2009-09-27 18:00:00 │
│  1697.0 │      4 │ 2009-09-27 18:00:00 │
│  1697.0 │      1 │ 2009-09-27 18:00:00 │
│  1697.0 │      1 │ 2009-09-27 18:00:00 │
│  1697.0 │      1 │ 2009-09-27 18:00:00 │
│  1697.5 │      3 │ 2009-09-27 18:00:00 │
│  1697.5 │      2 │ 2009-09-27 18:00:00 │
│  1697.0 │      1 │ 2009-09-27 18:00:00 │
│  1698.0 │      1 │ 2009-09-27 18:00:01 │
│ 1698.25 │      1 │ 2009-09-27 18:00:01 │
│ 1698.25 │     10 │ 2009-09-27 18:00:02 │
│ 1698.25 │      4 │ 2009-09-27 18:00:02 │
│ 1697.25 │      6 │ 2009-09-27 18:00:02 │
│ 1697.25 │      2 │ 2009-09-27 18:00:02 │
│  1697.0 │     28 │ 2009-09-27 18:00:02 │
│ 1697.25 │      6 │ 2009-09-27 18:00:03 │
├─────────┴────────┴─────────────────────┤
│ 20 rows                      3 columns │

Using DuckDB, I wanted to create histograms for each timestamp, both the price and size.

My attempt:

    vp = conn.query(f"""
      SET enable_progress_bar = true;
      SELECT
        timestamp,
        histogram(price)
      FROM 'data/tickdata.parquet'
      GROUP BY timestamp
      ORDER BY timestamp

                    """)

This produces the following:

┌─────────────────────┬─────────────────────────────────────────────────────────────────┐
│      timestamp      │                        histogram(price)                         │
│      timestamp      │                       map(float, ubigint)                       │
├─────────────────────┼─────────────────────────────────────────────────────────────────┤
│ 2009-09-27 18:00:00 │ {1697.0=10, 1697.5=2}                                           │
│ 2009-09-27 18:00:01 │ {1698.0=1, 1698.25=1}                                           │
│ 2009-09-27 18:00:02 │ {1697.0=1, 1697.25=2, 1698.25=2}                                │
│ 2009-09-27 18:00:03 │ {1696.0=2, 1696.5=2, 1697.0=2, 1697.25=1}                       │
│ 2009-09-27 18:00:04 │ {1696.0=2, 1696.25=2, 1696.75=1, 1697.0=1, 1697.25=3, 1697.5=1} 

At first glance, it "appears correct", however, the "values" associated with each key are not the SUM of the size but the COUNTs of the size. What I would expect to see:

┌─────────────────────┬─────────────────────────────────────────────────────────────────┐
│      timestamp      │                        histogram(price)                         │
│      timestamp      │                       map(float, ubigint)                       │
├─────────────────────┼─────────────────────────────────────────────────────────────────┤
│ 2009-09-27 18:00:00 │ {1697.0=39, 1697.5=5}                                           │
│ 2009-09-27 18:00:01 │ {1698.0=1, 1698.25=1}                                           │
│ 2009-09-27 18:00:02 │ {1697.0=28, 1697.25=8, 1698.25=14}    

Alternatively: I am able to generate the following table, but unsure if there is a way I can map it into the above example?

┌─────────────────────┬─────────┬───────────┐
│      timestamp      │  price  │ sum(size) │
│      timestamp      │  float  │  int128   │
├─────────────────────┼─────────┼───────────┤
│ 2009-09-27 18:00:00 │  1697.0 │        39 │
│ 2009-09-27 18:00:00 │  1697.5 │         5 │
│ 2009-09-27 18:00:01 │  1698.0 │         1 │
│ 2009-09-27 18:00:01 │ 1698.25 │         1 │
│ 2009-09-27 18:00:02 │ 1698.25 │        14 │
│ 2009-09-27 18:00:02 │ 1697.25 │         8 │
│ 2009-09-27 18:00:02 │  1697.0 │        28 │

Solution

  • Use this query to calculate the total size (sum_size) for each price at every timestamp in your dataset.

    WITH aggregated_data AS (
      SELECT
        timestamp,
        price,
        SUM(size) AS sum_size
      FROM tickdata
      GROUP BY timestamp, price
    )
    SELECT
      timestamp,
      MAP(ARRAY_AGG(price), ARRAY_AGG(sum_size)) AS histogram
    FROM aggregated_data
    GROUP BY timestamp
    ORDER BY timestamp;