
Where can I find the location of distributed files on slaves in Apache HAWQ?


I am using Apache HAWQ and trying to handle some data. I have one master node and two HAWQ slaves.

I created a table, inserted data, and verified the inserted data using PostgreSQL. I assumed the data would be distributed across the slaves.

When I executed the command below, multiple gp_segment_id values appeared, giving the impression that multiple slaves were used.

my_db=# select gp_segment_id, count(*) from retail_demo.order_lineitems_hawq GROUP BY gp_segment_id;
 gp_segment_id | count  
---------------+--------
             3 | 170682
             4 | 170680
             2 | 170696
             1 | 341397
             0 | 170703
(5 rows)

Now I really want to see where my data is distributed on the slave nodes, but I couldn't figure out how despite searching several blogs. I found https://discuss.pivotal.io/hc/en-us/articles/204072646-Pivotal-HAWQ-find-data-files-for-specific-tables and followed the commands it describes.

However, when I executed the SELECT * FROM pg_filespace_entry WHERE fselocation LIKE '%gpseg0'; statement, it returned an empty result, which left me a little confused.

What I want to do is pinpoint the location of the distributed data on the slave nodes. The query above (select gp_segment_id, count(*) from retail_demo.order_lineitems_hawq GROUP BY gp_segment_id;) was not sufficient to identify where the data lives.

Also, where can I specify the number of segments? Whenever I run queries, the number of gp_segment_id values differs depending on the input data size.

In summary, my questions are as follows:

1) Where can I find the location of the distributed data on HDFS on the slave nodes? If I can't, how can I verify that my data was distributed well?

2) Is there a way to fix the number of segments? It would be easier for me to understand the HAWQ system if the number of segments stayed fixed regardless of the input data size.

Any help will be appreciated :D


Solution

  • Your table, retail_demo.order_lineitems_hawq, must be hash-distributed. When you hash-distribute a table in HAWQ, the number of buckets is determined by default_hash_table_bucket_number, which is set when the database is initialized. There is one file in HDFS per bucket, because hash-distributed tables use a fixed number of virtual segments, or vsegs.
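
    A quick way to confirm how a table is distributed is to check the catalog. This is a minimal sketch using gp_distribution_policy, the Greenplum-derived catalog table that HAWQ inherits:

    -- Look up the distribution policy for the table in question.
    -- attrnums holds the hash-key column numbers; NULL means random distribution.
    SELECT * FROM gp_distribution_policy
    WHERE localoid = 'retail_demo.order_lineitems_hawq'::regclass;
    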

    You can specify the number of buckets in two ways. One is to do it when you create the table:

    CREATE TABLE foo (id int) WITH (BUCKETNUM=4) DISTRIBUTED BY (id);
    

    If you don't specify the BUCKETNUM, the number of buckets is set by default_hash_table_bucket_number.

    The other way is to use random distribution and let the optimizer figure out how many vsegs it needs. This is the recommended approach: in HAWQ, you should generally use random distribution.

    CREATE TABLE foo (id int) DISTRIBUTED RANDOMLY;
    

    With randomly distributed tables, you can cap the number of vsegs the database may use by adjusting the GUC hawq_rm_nvseg_perquery_perseg_limit. The default is 6 for clusters with 1 to 85 nodes, and you can increase it for better performance if you have enough resources.
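
    For example, assuming the GUC can be adjusted at the session level (a sketch; the right value depends on your cluster's resources):

    -- Check the current per-segment vseg limit, then raise it for this session.
    SHOW hawq_rm_nvseg_perquery_perseg_limit;
    SET hawq_rm_nvseg_perquery_perseg_limit = 10;
    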

    As for where the data is located:

    gpadmin=# CREATE TABLE foo (id int) DISTRIBUTED RANDOMLY;
    CREATE TABLE
    gpadmin=# insert into foo select i from generate_series(1,100) as i;
    INSERT 0 100
    

    Now get the oid of the database.

    gpadmin=# select oid from pg_database where datname = 'gpadmin';
      oid  
    -------
     16508
    (1 row)
    

    And get the oid of the table.

    gpadmin=# select oid from pg_class where relname = 'foo';
      oid  
    -------
     16519
    (1 row)
    

    The file path is a string of the form hdfs://hostname:port/hawq_filespace_name/tablespace_oid/database_oid/table_oid/file_number. The HDFS location is set when the database is initialized, and it is typically "/hawq_data" (the filespace). In that directory there is always a sub-directory named "16385", which is the tablespace oid. Below that there is a directory for the database, and under it a directory for the table.
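
    You can read the filespace root for your own cluster from pg_filespace_entry. A sketch (this also likely explains the empty result from the pg_filespace_entry query in the question: HAWQ 2.x keeps no per-segment gpseg directories in HDFS, so no location ends with gpseg0):

    -- List the registered filespace locations; expect a single HDFS URL
    -- such as hdfs://hostname:8020/hawq_data, with no gpseg suffix.
    SELECT fselocation FROM pg_filespace_entry;

    With the oids gathered above, list the table's directory in HDFS: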

    hdfs dfs -ls /hawq_data/16385/16508/16519
    Found 1 items
    -rw-------   1 gpadmin hdfs       1016 2017-03-10 10:17 /hawq_data/16385/16508/16519/1
    

    That single file is the data for foo. Since I'm using a randomly distributed table, there is only one file.

    Changing this to a hash-distributed table gives me more files.

    gpadmin=# drop table foo;
    DROP TABLE
    gpadmin=# CREATE TABLE foo (id int) DISTRIBUTED BY (id);
    CREATE TABLE
    gpadmin=# insert into foo select i from generate_series(1,100) as i;
    INSERT 0 100
    gpadmin=# select oid from pg_class where relname = 'foo';
      oid  
    -------
     16524
    (1 row)
    

    I have two files in HDFS because my single-node cluster is configured with a default bucket number of 2.

    [gpadmin@hdb210 ~]$ hdfs dfs -ls /hawq_data/16385/16508/16524/
    Found 2 items
    -rw-------   1 gpadmin hdfs        520 2017-03-10 10:24 /hawq_data/16385/16508/16524/1
    -rw-------   1 gpadmin hdfs        520 2017-03-10 10:24 /hawq_data/16385/16508/16524/2
    [gpadmin@hdb210 ~]$ psql
    psql (8.2.15)
    Type "help" for help.
    
    gpadmin=# show default_hash_table_bucket_number;
     default_hash_table_bucket_number 
    ----------------------------------
     2
    (1 row)
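
    Putting it together, you can derive the HDFS directory for any table with one catalog query. This is a sketch that assumes the default filespace mount "/hawq_data" and the tablespace oid 16385 shown above; adjust both to whatever pg_filespace_entry reports on your cluster.

    -- Hypothetical helper: build the HDFS directory path for table foo.
    -- '/hawq_data' and 16385 are the defaults from the example above.
    SELECT '/hawq_data/16385/' || d.oid::text || '/' || c.oid::text AS hdfs_dir
    FROM pg_database d, pg_class c
    WHERE d.datname = current_database()
      AND c.relname = 'foo';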