I want to connect to Vertica and HDFS in the same project. I created a dbt project with the dbt init command and tried to connect to Vertica. That works, but I don't know how to connect to HDFS to read data and load it into Vertica.
I need to read data from HDFS and load it into a Vertica database. Is that possible?
You could start by reading the documentation here: https://docs.vertica.com/23.4.x/en/data-load/working-with-external-data/creating-external-tables/
An example for Parquet on HDFS:
Step 1: Let Vertica infer the Parquet file's definition (the \a and \t vsql meta-commands just switch to unaligned, tuples-only output so the generated DDL prints cleanly):
\a
Output format is unaligned.
\t
Showing only tuples.
SELECT infer_table_ddl(
'hdfs:///data/tpch/region/'
||'3d5f4653-v_sbx_node0001-140498341713664-0.parquet'
USING PARAMETERS
format = 'Parquet'
, table_name = 'region'
, table_type = 'external'
);
vsql:infer.sql:7: WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80
create external table "region"(
"r_regionkey" int,
"r_name" varchar,
"r_comment" varchar
) as copy from 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' parquet;
I would then suggest that you change the generated DDL statement to reflect the actual maximum lengths of r_name and r_comment (varchar(25) and varchar(152), for example) before running it.
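A sketch of the adjusted statement, assuming those lengths actually cover your data (check your own maximums first):
create external table "region"(
"r_regionkey" int,
"r_name" varchar(25), -- longest r_name value expected in the data
"r_comment" varchar(152) -- longest r_comment value expected in the data
) as copy from 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' parquet;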
Once that statement has been executed, you can select from region as if it were a normal table.
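For example:
SELECT r_regionkey, r_name
FROM region
ORDER BY r_regionkey;
And if you want the data physically stored in Vertica instead of being re-read from HDFS on every query, one simple option is CREATE TABLE ... AS SELECT (the target name region_native below is just a placeholder):
-- materialize the external table's contents into a native Vertica table
CREATE TABLE region_native AS
SELECT * FROM region;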