i have 2 table something like this. i'm running a hive query and windows function seems pretty limited in hive.
Table dept
id | name |
1 | a |
2 | b |
3 | c |
4 | d |
Table time (build with heavy load query so it's make a very slow process if i need to join to another newly created table time.)
id | date | first | last |
1 | 1992-01-01 | 1 | 1 |
2 | 1993-02-02 | 1 | 2 |
2 | 1993-03-03 | 2 | 1 |
3 | 1993-01-01 | 1 | 3 |
3 | 1994-01-01 | 2 | 2 |
3 | 1995-01-01 | 3 | 1 |
i need to retrieve something like this :
SELECT d.id,d.name,
t.date AS firstdate,
td.date AS lastdate
FROM dbo.dept d LEFT JOIN dbo.time t ON d.id=t.id AND t.first=1
LEFT JOIN time td ON d.id=td.id AND td.last=1
How the most optimized answer ?
GROUP BY
operation that will be done in a single map-reduce job
select id
,max(name) as name
,max(case when first = 1 then `date` end) as firstdate
,max(case when last = 1 then `date` end) as lastdate
from (select id
,null as name
,`date`
,first
,last
from time
where first = 1
or last = 1
union all
select id
,name
,null as `date`
,null as first
,null as last
from dept
) t
group by id
;
+----+------+------------+------------+
| id | name | firstdate | lastdate |
+----+------+------------+------------+
| 1 | a | 1992-01-01 | 1992-01-01 |
| 2 | b | 1993-02-02 | 1993-03-03 |
| 3 | c | 1993-01-01 | 1995-01-01 |
| 4 | d | (null) | (null) |
+----+------+------------+------------+