hivemissing-datamanagedhive-partitionshiveddl

Hive - incomplete rows in select from managed partitioned table


I need to copy data from a CSV file to a managed partitioned table in Hive.

CSV file rows are:

id,nome,cognome,ruolo  
16,Mike,Maignan,Portiere    
23,Fikayo,Tomori,Centrale   
24,Simon,Kjaer,Centrale   
19,Theo,Hernandez,Terzino 
...

-------

I created a managed partitioned table on ruolo column.

create table squadre_part 
(id int, nome string, cognome string) 
partitioned by (ruolo string)  
row format delimited fields terminated by ','  
stored as textfile  
TBLPROPERTIES ("skip.header.line.count"="1") ;  

-------

then I created an external table to load data from CSV file (then I will select data from external table and copy them into managed partitioned table)

create external table external_squadre  
(id int, nome string, cognome string, ruolo string)    
row format delimited fields terminated by ','    
stored as textfile    
location '/ulisse/prove/external/'    
TBLPROPERTIES ("skip.header.line.count"="1")  ;

-------

First of all I set these 2 properties:

set hive.exec.dynamic.partition=true; 
set hive.exec.dynamic.partition.mode=nonstrict; 

--------

when I put the CSV file in /ulisse/prove/external/ HDFS dir and execute select from external table, I see ALL THE ROWS.

-------

After the "copy" from external table to managed table:

insert into squadre_part partition (ruolo) select * from external_squadre;

I see into managed table ONLY FEW ROWS (???).

The strange thing is that in HDSF, under /user/hive/warehouse/<mydb>/<managed table>/...

I see all sub directories (and text files inside sub dir), with ALL THE ROWS of the original CSV file.

-------

The command:

msck repair table squadre_part 

(after the insert into command...) didn't solve the problem.

Thank you in advance for any reply.

Moreno


Solution

  • Some rows can be lost because you have TBLPROPERTIES ("skip.header.line.count"="1") in managed table DDL and actually there are no headers created during INSERT. Then one row from each file will be lost. If there are many files, then many rows will be lost. Remove the property from managed table.

    Use skip.header.line.count property if you are loading files with headers using LOAD command or by directly putting files with headers into table location.