
Difference between external table and internal table when using Apache HAWQ?


I am using HAWQ to handle a column-based file. While reading the Pivotal documentation, I saw that they suggest using gpfdist to read and write readable external tables in order to process the data quickly and in parallel.

I created a table as recommended in the documentation and checked my data with the SQL statements below.

CREATE EXTERNAL TABLE ext_data
(col1 text, col2 text, col3 text, col4 text, col5 int, col6 int, col7 int, col8 int)
LOCATION ('gpfdist://hawq2:8085/*.csv')
FORMAT 'CSV' (DELIMITER ',');

SELECT gp_segment_id, count(*) FROM ext_data GROUP BY gp_segment_id;

The data was evenly distributed on all the slave nodes.

My original goal was to create the table, read the data from the file, and verify that the loaded data was well distributed. That was achieved by the procedure above using gpfdist.

But my question is about the difference between an external table and an internal table. Why should I use an external table or an internal table when the two approaches seem to offer the same functionality?

I found some blog posts in which users follow the procedure below when using HAWQ or the Greenplum database (I've sketched my understanding of it in SQL after the list):

1. Create an external table using gpfdist.
2. Create an internal table.
3. Load the data from the external table into the internal table.
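
Here is that sketch, reusing my ext_data table from above (int_data is just a name I made up for the internal table):

-- 2. an ordinary (internal) HAWQ table with the same columns as ext_data
CREATE TABLE int_data
(col1 text, col2 text, col3 text, col4 text, col5 int, col6 int, col7 int, col8 int);

-- 3. copy the rows from the external table into the internal table
INSERT INTO int_data SELECT * FROM ext_data;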

I don't fully understand the reasoning behind this pattern. Above all, I don't know why both external and internal tables exist, and when each should be used for handling data in Apache HAWQ or the Greenplum database.


Solution

  • An External Table that uses gpfdist vs. an Internal Table:

    External Tables just make it easier to load data into the database and make it do so faster.

    Think about this scenario. You get a file from your Accounting system that you need to load. You could do this:

    1. scp the file to an edge node
    2. hdfs put the file into HDFS
    3. Create an external table in HAWQ using PXF to read the file
    4. Insert the data into a HAWQ table (steps 3 and 4 are sketched below)
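
    A rough sketch of those two steps, reusing the column list from the question; the namenode host, PXF port, HDFS path, table names, and the HdfsTextSimple profile here are assumptions about a typical PXF setup, not something taken from your cluster:

    -- Step 3: an external table that reads the file already in HDFS through PXF
    -- (host, port 51200, path, and the HdfsTextSimple profile are assumed values)
    CREATE EXTERNAL TABLE ext_accounting
    (col1 text, col2 text, col3 text, col4 text, col5 int, col6 int, col7 int, col8 int)
    LOCATION ('pxf://namenode:51200/data/accounting/file.csv?PROFILE=HdfsTextSimple')
    FORMAT 'CSV' (DELIMITER ',');

    -- Step 4: land the rows in an internal HAWQ table
    CREATE TABLE accounting AS SELECT * FROM ext_accounting;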

    That will work and PXF will read the file in HDFS in parallel. However, step 2 is a single process and a bottleneck. Instead, do this:

    1. scp the file to an edge node
    2. Start a gpfdist process
    3. Create an external table in HAWQ using gpfdist to read the file
    4. Insert the data into a HAWQ table

    Now the "put" into HDFS is done in parallel because HAWQ will start virtual segments on each node to put the data. There are typically 6 virtual segments per data node, so in a 10-node cluster you'll have 60 processes putting data into HDFS rather than a single one.
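
    The "6 virtual segments per data node" figure corresponds to the default of HAWQ's hawq_rm_nvseg_perquery_perseg_limit setting, assuming you haven't changed it. You can check the value on your own cluster with:

    -- default is 6 in a stock Apache HAWQ install; your value may differ
    SHOW hawq_rm_nvseg_perquery_perseg_limit;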