I have a table in Hive.
When I run the command show tblproperties myTableName, it gives the following result:
numFiles 12
numRows 1688092
rawDataSize 934923162
totalSize 936611254
That means rawDataSize is 934.92 MB and totalSize is 936.61 MB.
I then ran a command to calculate the data size at the HDFS location of the same table:
[user@server1 ~]$ hdfs dfs -du -h -s /apps/hive/warehouse/test.db/myTableName
893.2 M /apps/hive/warehouse/test.db/myTableName
The resulting data size is 893.2 MB.
There is a big difference between the two sizes reported for the same table. I am trying to understand why they differ and am looking for a detailed explanation.
Table Type - MANAGED_TABLE
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
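totalSize is reported in bytes, and hdfs dfs -du -h prints sizes in binary units (1 M = 1024 * 1024 bytes), so the two numbers are the same size expressed in different units:
936611254 / 1024 / 1024 = 893.2 M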
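The arithmetic can be checked with a short script. This is only a minimal sketch: it assumes that hdfs dfs -du -h uses 1024-based units and that the 934.92 MB and 936.61 MB figures in the question were obtained by dividing by 1000 * 1000. The function names are hypothetical and are not part of Hive or HDFS.

```python
# Byte counts taken from the show tblproperties output in the question.
TOTAL_SIZE = 936611254
RAW_DATA_SIZE = 934923162


def to_decimal_mb(num_bytes: int) -> float:
    """Convert bytes to megabytes using 1 MB = 1000 * 1000 bytes."""
    return num_bytes / 1000 / 1000


def to_binary_mb(num_bytes: int) -> float:
    """Convert bytes to binary megabytes using 1 M = 1024 * 1024 bytes."""
    return num_bytes / 1024 / 1024


for name, size in (("totalSize", TOTAL_SIZE), ("rawDataSize", RAW_DATA_SIZE)):
    print(f"{name}: {to_decimal_mb(size):.2f} MB (decimal), "
          f"{to_binary_mb(size):.1f} M (binary)")
```

Under these assumptions, totalSize comes out as 936.61 MB in decimal units and 893.2 M in binary units, which matches the hdfs dfs -du -h output. The remaining gap between rawDataSize and totalSize is a separate matter and is not explained by the unit conversion.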