I'm trying to select rows with the max value based on a metric. So, each row should be "compared" to its own max value based on the metric of that row. This can be done with the following query in PostgreSQL: This SQL runs correctly in PostgreSQL:
DROP TABLE IF EXISTS test_max_date;
CREATE TABLE IF NOT EXISTS test_max_date
(
timestamp date,
value double precision,
metric text,
accept_date date
);
INSERT INTO test_max_date (timestamp, value, metric, accept_date)
VALUES ('1990-01-01', 1.1, 'foo', '1999-02-01'), -- Set 1
('1990-02-01', 2.2, 'foo', '1999-02-01'),
('1990-03-01', 3.3, 'foo', '1999-02-01'),
('1990-01-01', 1.2, 'bar', '2021-02-01'), -- Set 2
('1990-02-01', 2.3, 'bar', '2021-02-01'),
('1990-03-01', 3.4, 'bar', '2021-02-01'),
('1990-01-01', 1.1, 'foo', '1999-04-01'), -- Set 3
('1990-02-01', 2.2, 'foo', '1999-04-01'),
('1990-03-01', 3.3, 'foo', '1999-04-01'),
('1990-01-01', 1.2, 'bar', '2022-02-01'), -- Set 4
('1990-02-01', 2.3, 'bar', '2022-02-01'),
('1990-03-01', 3.4, 'bar', '2022-02-01')
;
SELECT timestamp, value, metric, accept_date
FROM test_max_date tmd1
WHERE
accept_date = (SELECT max(accept_date) FROM test_max_date tmd2 WHERE tmd2.metric = tmd1.metric)
But, the outer query column cannot be used in the inner query WHERE clause in Clickhouse. The exact same SQL does not run in Clickhouse:
DROP TABLE IF EXISTS test_max_date;
CREATE TABLE IF NOT EXISTS test_max_date
(
timestamp date,
value double precision,
metric text,
accept_date date
)
ENGINE = ReplacingMergeTree
ORDER BY (metric, accept_date, timestamp);
INSERT INTO test_max_date (timestamp, value, metric, accept_date)
VALUES ('1990-01-01', 1.1, 'foo', '1999-02-01'),
('1990-02-01', 2.2, 'foo', '1999-02-01'),
('1990-03-01', 3.3, 'foo', '1999-02-01'),
('1990-01-01', 1.2, 'bar', '2021-02-01'),
('1990-02-01', 2.3, 'bar', '2021-02-01'),
('1990-03-01', 3.4, 'bar', '2021-02-01'),
('1990-01-01', 1.1, 'foo', '1999-04-01'),
('1990-02-01', 2.2, 'foo', '1999-04-01'),
('1990-03-01', 3.3, 'foo', '1999-04-01'),
('1990-01-01', 1.2, 'bar', '2022-02-01'),
('1990-02-01', 2.3, 'bar', '2022-02-01'),
('1990-03-01', 3.4, 'bar', '2022-02-01')
;
SELECT timestamp, value, metric, accept_date
FROM test_max_date tmd1
WHERE
accept_date = (SELECT max(accept_date) FROM test_max_date tmd2 WHERE tmd2.metric = tmd1.metric)
The error is:
Code: 47. DB::Exception: Missing columns: 'tmd1.metric' while processing query: 'SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE metric = tmd1.metric', required columns: 'metric' 'tmd1.metric' 'accept_date', maybe you meant: ['metric','accept_date']: While processing (SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE tmd2.metric = tmd1.metric) AS _subquery805: While processing accept_date = ((SELECT max(accept_date) FROM test_max_date AS tmd2 WHERE tmd2.metric = tmd1.metric) AS _subquery805). (UNKNOWN_IDENTIFIER) (version 22.12.1.1752 (official build)) , server ClickHouseNode [uri=http://localhost:8123/default, options={custom_http_params=session_id=DataGrip_d050598f-af8a-4c56-b9d4-e88fbec47962}]@-1760860067
I assume there's something different between the two, but I was unable to find anything about outer query column access in clickhouse.
1)
SELECT timestamp, value, metric, accept_date
FROM test_max_date tmd1
WHERE (accept_date,metric) in (SELECT max(accept_date),metric FROM test_max_date group by metric)
order by 1,2
┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
│ 1990-01-01 │ 1.1 │ foo │ 1999-04-01 │
│ 1990-01-01 │ 1.2 │ bar │ 2022-02-01 │
│ 1990-02-01 │ 2.2 │ foo │ 1999-04-01 │
│ 1990-02-01 │ 2.3 │ bar │ 2022-02-01 │
│ 1990-03-01 │ 3.3 │ foo │ 1999-04-01 │
│ 1990-03-01 │ 3.4 │ bar │ 2022-02-01 │
└────────────┴───────┴────────┴─────────────┘
2)
SELECT timestamp, value, metric, accept_date
from ( SELECT timestamp, value, metric, accept_date,
rank() over (partition by metric order by accept_date desc) r
FROM test_max_date tmd1 ) t
WHERE r=1
order by 1,2
┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
│ 1990-01-01 │ 1.1 │ foo │ 1999-04-01 │
│ 1990-01-01 │ 1.2 │ bar │ 2022-02-01 │
│ 1990-02-01 │ 2.2 │ foo │ 1999-04-01 │
│ 1990-02-01 │ 2.3 │ bar │ 2022-02-01 │
│ 1990-03-01 │ 3.3 │ foo │ 1999-04-01 │
│ 1990-03-01 │ 3.4 │ bar │ 2022-02-01 │
└────────────┴───────┴────────┴─────────────┘
3)
select timestamp, value, metric, accept_date
from (
SELECT metric, accept_date, groupArray( (timestamp, value) ) ga
FROM test_max_date
group by metric, accept_date
order by metric, accept_date desc
limit 1 by metric
) array join ga.1 as timestamp, ga.2 as value
order by 1, 2
┌──timestamp─┬─value─┬─metric─┬─accept_date─┐
│ 1990-01-01 │ 1.1 │ foo │ 1999-04-01 │
│ 1990-01-01 │ 1.2 │ bar │ 2022-02-01 │
│ 1990-02-01 │ 2.2 │ foo │ 1999-04-01 │
│ 1990-02-01 │ 2.3 │ bar │ 2022-02-01 │
│ 1990-03-01 │ 3.3 │ foo │ 1999-04-01 │
│ 1990-03-01 │ 3.4 │ bar │ 2022-02-01 │
└────────────┴───────┴────────┴─────────────┘