The below ASOF query that I would expect to return quickly is currently timing out. It returns last 100 option trades matched to the most recent stock trade.
SELECT
o.timestamp AS option_ts,
o.symbol AS option_symbol,
o.underlying_ticker,
o.price AS option_price,
s.timestamp AS stock_ts,
s.price AS stock_price,
s.symbol AS stock_symbol
FROM realtime_option_trades o
ASOF JOIN realtime_stock_trades s ON o.underlying_ticker = s.symbol ORDER BY o.timestamp desc LIMIT 100;
I have 2 tables, realtime_option_trades
, and realtime_stock_trades
CREATE TABLE IF NOT EXISTS realtime_option-trades (
timestamp TIMESTAMP,
symbol SYMBOL CAPACITY 1048576, -- Option ticker symbol
exchange INT, -- Exchange ID
price DOUBLE, -- Trade price
size LONG, -- Trade size/volume
conditions VARCHAR, -- Trade conditions (comma-separated list of ints)
sequence_number LONG, -- Sequence number
underlying_ticker SYMBOL CAPACITY 32768, -- Extracted underlying stock symbol
expiration_date DATE, -- Option expiration date
option_type CHAR, -- Option type (C/P)
strike_price DOUBLE -- Strike price
) TIMESTAMP(timestamp)
PARTITION BY HOUR
TTL 1 DAY; -- Keep data for 1 day
CREATE TABLE IF NOT EXISTS realtime_stock_trades (
timestamp TIMESTAMP,
symbol SYMBOL CAPACITY 32768, -- Stock ticker symbol
exchange INT, -- Exchange ID
price DOUBLE, -- Trade price
size LONG, -- Trade size/volume (using LONG for safety)
conditions VARCHAR, -- Trade conditions (comma-separated list of ints)
trade_id VARCHAR, -- Trade ID
tape INT, -- Tape (1=NYSE, 2=AMEX, 3=Nasdaq)
sequence_number LONG, -- Sequence number (using LONG for safety)
trf_id INT, -- Trade Reporting Facility ID
trf_timestamp TIMESTAMP -- Trade Reporting Facility Timestamp
) TIMESTAMP(timestamp)
PARTITION BY HOUR
TTL 1 DAY; -- Keep data for 1 day
For context, we ingest ~500 rows/s for realtime_option_trades and then ~5000 rows/s realtime_stock_trades if that helps. There's ~100 miillion rows in realtime_stock_trades and ~7 million rows in realtime_option_trades.
We have ~4000 tickers to match and some tickers aren't as active as others. Some tickers won't have a match as well. So that's prob the reason it's slow because it needs to keep looking back?
You can estimate how many rows must be scanned by finding the min timestamp in your result, and the max, and then counting the rows between them. If you find the problem is that some symbols don't appear recently, and that slows everything else, there might be some workaround to mitigate it.
I would suggest putting time-bounds on the query, to current day or similar. This removes the worst case scenario, and means you catch your active symbols in the query.
Then on the application side, you have a slow-path process to handle the inactive symbols, which you fall back to if they do not appear in the first query. Essentially, a 2-step process.