apache-sparkpysparkspark-thriftserver

spark ETL and spark thrift server


Some details:

ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads

BI tool is connect via odbc driver

After activating Spark Thrift Server I'm unable to run pyspark script using spark-submit as they both use the same metastore_db

error:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see the next exception for details.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
        ... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /tmp/metastore_db.

I need to be able to run PySpark (Spark ETL) while having spark thrift server up for BI tool queries. Any workaround for it?

Thanks!


Solution

  • In my case the solution was to move the metastore_db to a database server like MySql (in my case) or Postgresql.

    You will have to configure $SPARK_HOME/conf/hive-site.xml and include your jdbc driver in $SPARK_HOME/jars path

    hive-site.xml example for MySQL connection

    <configuration>
      <!-- Hive Execution Parameters -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://XXX.XXX.XXX.XXX:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=FALSE&amp;autoReconnect=true&amp;nullCatalogMeansCurrent=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>YOUR_USER</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>YOUR_PASSWORD</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
      <name>hive.server2.transport.mode</name>
      <value>http</value>
    </property>
    <property>
      <name>hive.server2.thrift.http.port</name>
      <value>10000</value>
    </property>
    <property>
      <name>hive.server2.http.endpoint</name>
      <value>cliservice</value>
    </property>
    
    
    <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description/>
    </property>
      <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
      </property>
      <property>
        <name>datanucleus.fixedDatastore</name>
        <value>true</value>
      </property>
      <property>
        <name>datanucleus.autoCreateTables</name>
        <value>true</value>
      </property>
      <property>
        <name>datanucleus.schema.autoCreateTables</name>
        <value>true</value>
      </property>
    
    </configuration>