sqlhivequery-optimizationpartitionhive-partitions

Hive: read table partitions defined in subselect


I have a Hive table which is partitioned by partitionDate field. I can read partition of my choice via simple

select * from myTable where partitionDate = '2000-01-01'

My task is to specify the partition of my choise dynamically. I.e. first I want to read it from some table, and only then run select to myTable. And of course, I want the power of partitions to be used.

I have written a query which looks like

select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate

The query works but looks like partitions are not used. The query works too long.

I tried another approach:

select * from myTable where partitionDate in (select reportDate from thatTable)

.. and again I see that the query works too slowly.

Is there a way to implement this in Hive?

update: create table for myTable

CREATE TABLE `myTable`(            
  `theDate` string,            
 ')            
PARTITIONED BY (           
  `partitionDate` string) 
TBLPROPERTIES (             
  'DO_NOT_UPDATE_STATS'='true',         
  'STATS_GENERATED_VIA_STATS_TASK'='true',                
  'spark.sql.create.version'='2.2 or prior',              
  'spark.sql.sources.schema.numPartCols'='1',    
  'spark.sql.sources.schema.numParts'='2',          
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...         
  'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',               
  'spark.sql.sources.schema.partCol.0'='partitionDate')  

Solution

  • If you are running Hive on Tez execution engine, try

    set hive.tez.dynamic.partition.pruning=true;
    

    Read more details and related configuration in the Jira HIVE-7826

    and at the same time try to rewrite as a LEFT SEMI JOIN:

    select * 
      from myTable t 
           left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate 
    

    If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344

    Or this one: https://stackoverflow.com/a/53279839/2700344

    Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins