apache-spark, pyspark, hivecontext

How to Stop or Delete HiveContext in Pyspark?


I'm facing the following problem:

def my_func(table, usr, psswrd):
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import HiveContext

    sconf = SparkConf()
    sconf.setAppName('TEST')
    sconf.set("spark.master", "local[2]")

    sc = SparkContext(conf=sconf)
    hctx = HiveContext(sc)

    ## Initialize variables (url, driver, etc. are set here)

    df = hctx.read.format("jdbc").options(url=url,
                                          user=usr,
                                          password=psswrd,
                                          driver=driver,
                                          dbtable=table).load()
    pd_df = df.toPandas()

    sc.stop()
    return pd_df

The problem here is the persistence of the HiveContext (i.e. if I do hctx._get_hive_ctx() it still returns a JavaObject id=Id). So if I use my_func several times in the same script, it fails the second time. I would like to remove the HiveContext, which apparently is not deleted when I stop the SparkContext.
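
For example (the table names and credentials below are just placeholders), calling the function twice in the same script reproduces the failure:

pd_df1 = my_func("schema.table_a", "my_user", "my_password")  # works
pd_df2 = my_func("schema.table_b", "my_user", "my_password")  # fails: HiveContext state persists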

Thanks


Solution

  • Removing the HiveContext is not possible: some state persists after sc.stop(), which makes it unusable in some cases.

    But there is a workaround (caution!! it's dangerous) if it's feasible for you: delete the metastore_db directory every time you start/stop your SparkContext. Again, check whether that is acceptable for you. The Java code is below (in your case you have to adapt it to Python; a Python sketch follows the links at the end of this answer).

    // FileUtils comes from Apache Commons IO (org.apache.commons.io.FileUtils)
    File hiveLocalMetaStorePath = new File("metastore_db");
    FileUtils.deleteDirectory(hiveLocalMetaStorePath);
    

    You can better understand it from the following links.

    https://issues.apache.org/jira/browse/SPARK-10872

    https://issues.apache.org/jira/browse/SPARK-11924
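
    A minimal Python sketch of the same workaround (the helper name clear_local_metastore is just for illustration; it assumes metastore_db is created in the current working directory, which is the default):

    import os
    import shutil

    def clear_local_metastore(path="metastore_db"):
        # Remove the local Derby-backed Hive metastore directory so the
        # next HiveContext starts from a clean state
        if os.path.isdir(path):
            shutil.rmtree(path)

    Call it right after sc.stop() inside my_func, before the next SparkContext/HiveContext is created.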