[SOLVED] Spark Warehouse VS Hive Warehouse

Hortonworks Data Platform (HDP) 3.0 ships Spark 2.3 and Hive 3.1. By default, Spark 2.3 applications (pyspark, spark-sql, etc.) use the Spark data warehouse, and Spark 2.3 integrates with Apache Hive in a different way, through the Hive Warehouse Connector.

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector

I can see two default databases in the Hive metastore (MySQL): one pointing to the Hive location and the other to the Spark location.

mysql> SELECT NAME, DB_LOCATION_URI FROM hive.DBS;
+--------+----------------------------------------------------------+
| NAME   | DB_LOCATION_URI                                          |
+--------+----------------------------------------------------------+
| default| hdfs://<hostname>:8020/warehouse/tablespace/managed/hive |
| default| hdfs://<hostname>:8020/apps/spark/warehouse              |
+--------+----------------------------------------------------------+

mysql>

Can anyone explain the difference between these two types of warehouse? I could not find any article about this. Can we use the Spark warehouse instead of the Hive one? (I understand that the Spark warehouse would not be accessible through Hive; or is there a way?) What are the pros and cons of the two (Spark warehouse and Hive warehouse)?

Solution

From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated, and each engine uses its own catalog; that is, they are mutually exclusive. The Apache Hive catalog can only be accessed by Apache Hive or by the Hive Warehouse Connector library, and the Apache Spark catalog can only be accessed by the existing APIs in Apache Spark. In other words, some features, such as ACID tables or Apache Ranger with Apache Hive tables, are only available in Apache Spark via the Hive Warehouse Connector. Those Hive tables should not be directly accessible through the Apache Spark APIs themselves.

By default, Spark uses the Spark catalog. The article below explains how Apache Hive tables can be accessed from Spark.

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector

GitHub link with some additional details:

HiveWarehouseConnector - GitHub
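
To make the split between the two catalogs concrete, here is a minimal PySpark sketch, not a definitive setup. It assumes the Hive Warehouse Connector jar and its `pyspark_llap` package are already on the application's path and that the connector's settings (HiveServer2 JDBC URL, metastore URI, and so on) are configured on the cluster. The helper names `hive_session`, `read_hive_table`, and `read_spark_table` are made up for illustration.

```python
# Sketch only: assumes an HDP 3.x cluster with the Hive Warehouse Connector
# (HWC) jar and its pyspark_llap package available to the Spark application.


def hive_session(spark):
    """Build an HWC session on top of an existing SparkSession."""
    # Imported lazily: pyspark_llap ships with HWC, not with stock PySpark.
    from pyspark_llap import HiveWarehouseSession
    return HiveWarehouseSession.session(spark).build()


def read_hive_table(spark, query):
    """Query the Hive catalog (managed/ACID tables) through HWC."""
    return hive_session(spark).executeQuery(query)


def read_spark_table(spark, query):
    """Query Spark's own catalog through the regular Spark SQL API."""
    return spark.sql(query)
```

With a live `spark` session, `read_hive_table(spark, "SELECT * FROM some_hive_table")` would go through the connector to the Hive warehouse, while `read_spark_table(spark, "SELECT * FROM some_spark_table")` would only see tables under the Spark warehouse. Both table names are placeholders.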