Tags: hdfs, vertica, dbt

Can I connect to Vertica and HDFS in the same project using dbt Core?


I want to connect to both Vertica and HDFS in the same project. I created a dbt project with the dbt init command and connected it to Vertica. That works, but I don't know how to connect to HDFS to read data and load it into Vertica.

I need to read data from HDFS and load it into a Vertica database. Is that possible?


Solution

  • You could start by reading the documentation here: https://docs.vertica.com/23.4.x/en/data-load/working-with-external-data/creating-external-tables/

    An example for Parquet on HDFS:

    Step 1: Let Vertica infer the Parquet file's definition:

    \a
    Output format is unaligned.
    \t
    Showing only tuples.
    SELECT infer_table_ddl(
      'hdfs:///data/tpch/region/'
    ||'3d5f4653-v_sbx_node0001-140498341713664-0.parquet' 
      USING PARAMETERS 
        format     = 'Parquet'
      , table_name = 'region'
      , table_type = 'external'
    );
    vsql:infer.sql:7: WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
    create external table "region"(
      "r_regionkey" int, 
      "r_name" varchar, 
      "r_comment" varchar
    ) as copy from 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' parquet;
    
    

    Step 2: Change the generated DDL statement to reflect the maximum lengths of r_name and r_comment (varchar(25) and varchar(152), for example) before running it.

    Once the command is executed, you can select from region as if it were a normal table.
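    With the adjustments from Step 2 applied, the DDL and a quick check query might look like this (the file path is the one from the example above; the varchar lengths are the ones suggested in Step 2):

    ```sql
    -- External table with explicit varchar lengths (25 and 152)
    -- instead of the generated default of 80
    CREATE EXTERNAL TABLE "region"(
      "r_regionkey" int,
      "r_name"      varchar(25),
      "r_comment"   varchar(152)
    ) AS COPY FROM 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' PARQUET;

    -- The data stays in HDFS; Vertica reads it from there on each query
    SELECT r_regionkey, r_name FROM region ORDER BY r_regionkey;
    ```

    Note that an external table does not copy anything into Vertica by itself; every query against it goes back to the Parquet file in HDFS.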
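    As for the dbt side of the question: dbt itself never talks to HDFS; Vertica does, through the external table. Once that table exists, you can declare it as a dbt source and materialize it as a regular Vertica table. A minimal sketch, assuming hypothetical names (the source name hdfs_external, the schema, and the model file name are not from the original):

    ```sql
    -- models/stg_region.sql (hypothetical model name)
    -- Requires a matching source definition in a schema .yml file, e.g.:
    --   sources:
    --     - name: hdfs_external
    --       schema: public
    --       tables:
    --         - name: region
    -- Materializing as 'table' copies the HDFS-backed data into Vertica on each run
    {{ config(materialized='table') }}

    select
        r_regionkey,
        r_name,
        r_comment
    from {{ source('hdfs_external', 'region') }}
    ```

    Running dbt run would then build a managed Vertica table from the external table, which is exactly the read-from-HDFS, load-into-Vertica flow asked about.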