Tags: postgresql, apache-spark, hortonworks-data-platform, hawq

Spark + HAWQ Integration (HDP 2.4.2)


I am using HDP 2.4.2 and want to connect Spark to HAWQ for data ingestion.

Please let me know if there is a recommended approach. Currently I am using the PostgreSQL JDBC driver to connect Spark to HAWQ, and I am facing the following issues:

-The DataFrame write creates the table automatically in HAWQ if it does not already exist.

-Record ingestion is too slow.

-It intermittently fails with errors such as "org.postgresql.util.PSQLException: ERROR: relation "table_name" already exists".


Solution

  • For reading HAWQ data from Spark as an RDD, see this example Scala project: https://github.com/kdunn926/sparkHawq

    If you want HAWQ to read data produced by Spark, your best option is to write the data to HDFS from Spark and have HAWQ read it through PXF. See the PXF documentation here: http://hdb.docs.pivotal.io/200/hawq/pxf/PivotalExtensionFrameworkPXF.html
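The write-to-HDFS-then-PXF route could be sketched roughly as follows in PySpark. This is only an illustration under several assumptions, not a tested recipe for HDP 2.4.2: the helper names (`to_delimited_line`, `write_dataframe_to_hdfs`, `pxf_external_table_ddl`), the table and column names, the NameNode host, and the HDFS path are all hypothetical. The PXF port 51200 and the `HdfsTextSimple` profile are taken to be the PXF defaults, so check them against your installation. The code assumes Python 3 strings and a DataFrame `df` that you have already built.

```python
def to_delimited_line(values, delimiter=","):
    """Render one row as a delimited line using TEXT-format conventions
    (backslash escapes, \\N for NULL). Assumed to match the FORMAT 'TEXT'
    settings of the external table generated below."""
    fields = []
    for value in values:
        if value is None:
            fields.append("\\N")  # NULL marker in TEXT format
        else:
            text = str(value)
            # Escape backslashes first, then the delimiter and newlines.
            text = text.replace("\\", "\\\\")
            text = text.replace(delimiter, "\\" + delimiter)
            text = text.replace("\n", "\\n")
            fields.append(text)
    return delimiter.join(fields)


def write_dataframe_to_hdfs(df, hdfs_path, delimiter=","):
    """Write a Spark DataFrame as plain delimited text files under hdfs_path.
    Uses df.rdd plus saveAsTextFile, which needs no extra CSV package."""
    lines = df.rdd.map(lambda row: to_delimited_line(row, delimiter))
    lines.saveAsTextFile(hdfs_path)


def pxf_external_table_ddl(table, columns, namenode_host, hdfs_path,
                           pxf_port=51200, delimiter=","):
    """Build the HAWQ DDL for a readable PXF external table over hdfs_path.
    `columns` is a list of (name, sql_type) pairs; all names are hypothetical."""
    cols = ", ".join("%s %s" % (name, sql_type) for name, sql_type in columns)
    location = "pxf://%s:%d%s?PROFILE=HdfsTextSimple" % (
        namenode_host, pxf_port, hdfs_path)
    return (
        "CREATE EXTERNAL TABLE %s (%s)\n"
        "LOCATION ('%s')\n"
        "FORMAT 'TEXT' (DELIMITER '%s');" % (table, cols, location, delimiter)
    )
```

A possible usage would be to call `write_dataframe_to_hdfs(df, "/data/events")` from the Spark job, then run the statement returned by `pxf_external_table_ddl("ext_events", [("id", "int"), ("name", "text")], "namenode", "/data/events")` in HAWQ (for example through psql). If the data should end up in a regular HAWQ table, an `INSERT INTO ... SELECT * FROM ext_events` can copy it over inside HAWQ. Because Spark only writes files here, this path does not go through the JDBC writer at all.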