sqlpostgresqltimescaledb

Return the last n entries


I'm trying to build a pandas df with an SQL Postgres timescale db query and this works below but the problem is the query builds an enormous file for the one point I am after /5231/REGCHG total_power. The data has been recording for a few years on one minute intervals and I can get it all ordered nicely by the ORDER BY minute but I don't need that much data. I only need to go back 3 days from the current day.

I'm trying to average the data by 15 minute increments so 3 days in 15 minute increments is the last 288 rows of this query below. Would a quick hack be just returning the tail(288) of this query in SQL?

I was also trying to use the #ORDER BY minute limit 3000 which I have commented out because it would only limit the first 3000 entries or when the data first started recording I need the last entries not the first.

import psycopg2
from sqlalchemy import create_engine
import pandas as pd
import datetime


SQL_PASS = 'asdf'
SQL_USER = 'asf'
SQL_HOST = '10.10.10.10'
dsn = f'postgresql://{SQL_USER}:{SQL_PASS}@{SQL_HOST}:5432/my_db'
dbengine = create_engine(dsn)

qry = '''
SELECT
  time_bucket('15 minute', "time") AS minute,
  metric,
  avg(value)
FROM slipstream_volttron
WHERE metric LIKE '/5231/REGCHG total_power' 
GROUP BY minute, metric
ORDER BY minute
'''
#ORDER BY minute limit 3000

dbconn = psycopg2.connect(host=SQL_HOST, dbname='my_db', user=SQL_USER, password=SQL_PASS, connect_timeout=5)


t1 = datetime.datetime.utcnow()
df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=True)
t2 = datetime.datetime.utcnow()

duration = t2 - t1
seconds = duration.total_seconds()

hours = seconds // 3600
minutes = (seconds % 3600) // 60
seconds = seconds % 60

elapsed_time = f' db retrieval time is {minutes} minutes, {seconds} seconds'

print(df.columns)
print(df.head())
print(df.describe())

df.to_csv('main_meter_data.csv')
dbconn.close()

Edit

In Pandas when I print the dataframe I am noticing that the SQL query isn't most recent data. For example in my script if I add in a print of time and the df:

df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=True)
print(time.ctime())
print("df: ",df)

This will return the current time is 9:13 AM but the request of data drops off 2023-01-27 15:00:00 34898.357143 which is way out into the future... is this UTC time?:

Fri Jan 27 09:13:34 2023
df:                                avg
minute                           
2023-01-24 15:15:00  35075.933333
2023-01-24 15:30:00  33908.866667
2023-01-24 15:45:00  33375.666667
2023-01-24 16:00:00  34529.133333
2023-01-24 16:15:00  34748.533333
...                           ...
2023-01-27 14:00:00  35977.200000
2023-01-27 14:15:00  35458.333333
2023-01-27 14:30:00  37695.933333
2023-01-27 14:45:00  36792.000000
2023-01-27 15:00:00  34898.357143

Solution

  • I need the last entries not the first

    So use DESCENDING sort order:

    SELECT time_bucket('15 minute', "time") AS minute
         , avg(value)
    FROM  (
       SELECT time, value
       FROM   slipstream_volttron
       WHERE  metric = '/5231/REGCHG total_power' 
       ORDER  BY time DESC  -- !!!
       LIMIT  4320  -- 24 * 60 * 3
       ) sub
    GROUP  BY minute
    ORDER  BY minute DESC;
    

    And limit in a subquery before you aggregate. This way, an index on (metric, time) gives you a (bitmap) index scan. Append INCLUDE (value) to the index if that gives you index-only scans. See:

    And maybe add something like

    WHERE time > '2023-1-1’
    

    to the index to make it a very small partal index. And mirror rhe same filter in the query (logically redundantly) to match the index.

    Since Postgres 14 you can also use date_bin() from core Postgres instead of time_bucket(). See: