I have raw data that appears like this:
┌─────────┬────────┬─────────────────────┐
│ price │ size │ timestamp │
│ float │ uint16 │ timestamp │
├─────────┼────────┼─────────────────────┤
│ 1697.0 │ 11 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 5 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 5 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 5 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 5 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 4 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 1 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 1 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 1 │ 2009-09-27 18:00:00 │
│ 1697.5 │ 3 │ 2009-09-27 18:00:00 │
│ 1697.5 │ 2 │ 2009-09-27 18:00:00 │
│ 1697.0 │ 1 │ 2009-09-27 18:00:00 │
│ 1698.0 │ 1 │ 2009-09-27 18:00:01 │
│ 1698.25 │ 1 │ 2009-09-27 18:00:01 │
│ 1698.25 │ 10 │ 2009-09-27 18:00:02 │
│ 1698.25 │ 4 │ 2009-09-27 18:00:02 │
│ 1697.25 │ 6 │ 2009-09-27 18:00:02 │
│ 1697.25 │ 2 │ 2009-09-27 18:00:02 │
│ 1697.0 │ 28 │ 2009-09-27 18:00:02 │
│ 1697.25 │ 6 │ 2009-09-27 18:00:03 │
├─────────┴────────┴─────────────────────┤
│ 20 rows 3 columns │
Using DuckDB, I wanted to create histograms for each timestamp, both the price and size.
My attempt:
vp = conn.query(f"""
SET enable_progress_bar = true;
SELECT
timestamp,
histogram(price)
FROM 'data/tickdata.parquet'
GROUP BY timestamp
ORDER BY timestamp
""")
This produces the following:
┌─────────────────────┬─────────────────────────────────────────────────────────────────┐
│ timestamp │ histogram(price) │
│ timestamp │ map(float, ubigint) │
├─────────────────────┼─────────────────────────────────────────────────────────────────┤
│ 2009-09-27 18:00:00 │ {1697.0=10, 1697.5=2} │
│ 2009-09-27 18:00:01 │ {1698.0=1, 1698.25=1} │
│ 2009-09-27 18:00:02 │ {1697.0=1, 1697.25=2, 1698.25=2} │
│ 2009-09-27 18:00:03 │ {1696.0=2, 1696.5=2, 1697.0=2, 1697.25=1} │
│ 2009-09-27 18:00:04 │ {1696.0=2, 1696.25=2, 1696.75=1, 1697.0=1, 1697.25=3, 1697.5=1}
At first glance, it "appears correct", however, the "values" associated with each key are not the SUM of the size but the COUNTs of the size. What I would expect to see:
┌─────────────────────┬─────────────────────────────────────────────────────────────────┐
│ timestamp │ histogram(price) │
│ timestamp │ map(float, ubigint) │
├─────────────────────┼─────────────────────────────────────────────────────────────────┤
│ 2009-09-27 18:00:00 │ {1697.0=39, 1697.5=5} │
│ 2009-09-27 18:00:01 │ {1698.0=1, 1698.25=1} │
│ 2009-09-27 18:00:02 │ {1697.0=28, 1697.25=8, 1698.25=14}
Alternatively: I am able to generate the following table, but unsure if there is a way I can map it into the above example?
┌─────────────────────┬─────────┬───────────┐
│ timestamp │ price │ sum(size) │
│ timestamp │ float │ int128 │
├─────────────────────┼─────────┼───────────┤
│ 2009-09-27 18:00:00 │ 1697.0 │ 39 │
│ 2009-09-27 18:00:00 │ 1697.5 │ 5 │
│ 2009-09-27 18:00:01 │ 1698.0 │ 1 │
│ 2009-09-27 18:00:01 │ 1698.25 │ 1 │
│ 2009-09-27 18:00:02 │ 1698.25 │ 14 │
│ 2009-09-27 18:00:02 │ 1697.25 │ 8 │
│ 2009-09-27 18:00:02 │ 1697.0 │ 28 │
Use this query to calculate the total size (sum_size) for each price at every timestamp in your dataset.
WITH aggregated_data AS (
SELECT
timestamp,
price,
SUM(size) AS sum_size
FROM tickdata
GROUP BY timestamp, price
)
SELECT
timestamp,
MAP(ARRAY_AGG(price), ARRAY_AGG(sum_size)) AS histogram
FROM aggregated_data
GROUP BY timestamp
ORDER BY timestamp;