sqlhiveimpala

Insert Overwrite in Hive (Impala) a single file instead of multiple small files


I'm inserting into the table db.A_full using this query:

INSERT OVERWRITE db.A_full
PARTITION (year, month, day)
SELECT DISTINCT
    A.ID
    , A.doc_type
    , A.city
    , A.gender
    , A.age
    , A.year
    , A.month
    , A.day
FROM db.A_parcial as A

db.A_parcial only contains a single date and it's a relatively small table that should be saved in a single file of 148Mb in db.A_full; but when I look in the clouse SHOW PARTITIONS db.A_full the single partition was saved as 138 files. How can I do the insert avoiding small files creation?


Solution

  • There is a way to compact an insert by setting some gonfig options:

    -- this next command tells impala to use 1 node 
    -- it has to be set JUST for the insert
    
    set num_nodes = 1 ;  
    set PARQUET_FILE_SIZE = 256m ;  -- this tells impala the filesize of result
     
    insert into table my_table as select fields from other_table ;
     
    set num_nodes = 0 ;  -- this has to be set immediatelly after the insert
     
    compute stats my_table ;  -- this is just good manners
    

    This is an easy way to set an insert to reduce the number of files on an insert. However, letting the set num_nodes = 1 will impact your performance, so it only needs to be set on an insert.