apache-spark hive kerberos azure-hdinsight

AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] when using Hive warehouse


We recently enabled Kerberos authentication on our Spark cluster, but we found that when we submit Spark jobs in cluster mode, the code cannot connect to Hive. Should we be using Kerberos to authenticate to Hive, and if so, how? As detailed below, I think we have to specify a keytab and principal, but I don't know exactly which ones.

This is the exception we get:

Traceback (most recent call last):
  File "/mnt/resource/hadoop/yarn/local/usercache/sa-etl/appcache/application_1649255698304_0003/container_e01_1649255698304_0003_01_000001/__pyfiles__/utils.py", line 222, in use_db
    spark.sql("CREATE DATABASE IF NOT EXISTS `{db}`".format(db=db))
  File "/usr/hdp/current/spark3-client/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/current/spark3-client/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/hdp/current/spark3-client/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: java.lang.RuntimeException: java.io.IOException: DestHost:destPort hn1-pt-dev.MYREALM:8020 , LocalHost:localPort wn1-pt-dev/10.208.3.12:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

Additionally, I saw this exception:

org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over hn0-pt-dev.myrealm/10.208.3.15:8020

This is the script that produces the exception. As you can see, it fails on the CREATE DATABASE statement:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS TestDb")

Environment and relevant information

We have an ESP-enabled HDInsight cluster in Azure, inside a virtual network. AADDS works fine for logging into the cluster. The cluster is connected to a Storage Account, which it accesses over ABFS and which holds the Hive warehouse. We are using YARN. We want to execute Spark jobs using PySpark from Azure Data Factory, which uses Livy, but if we can get it to work with the spark-submit CLI it will hopefully also work with Livy. We are using Spark 3.1.1 and Kerberos 1.10.3-30.

The exception only occurs when we use spark-submit --deploy-mode cluster. In client mode there is no exception and the database is created.

When we remove .enableHiveSupport() the exception also disappears, so it apparently has something to do with authentication to Hive. We do need the Hive warehouse though, because we need to access tables from multiple Spark sessions, so they have to be persisted.

We can access HDFS, also in cluster mode, as sc.textFile('/example/data/fruits.txt').collect() works fine.

Similar questions and possible solutions

In the exception, I see that it is the worker node that tries to access the head node. The port is 8020, which I think is the NameNode port, so this does sound HDFS related. As far as I can tell, though, we can access HDFS but not Hive.

Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: for principal: myusername@MYREALM from keytab /etc/krb5.keytab javax.security.auth.login.LoginException: Unable to obtain password from user

Maybe I have the wrong keytab file though, because when I run klist -k /etc/krb5.keytab I only get slots with entries like HN0-PT-DEV@MYREALM and host/hn0-pt-dev.myrealm@MYREALM. If I look at the hdfs/hive keytabs in /etc/security/keytabs, I also see only entries for the hdfs/hive users.

When I add all the extraJavaOptions specified in "How to use Apache Spark to query Hive table with Kerberos?" but don't specify a principal/keytab, I get KrbException: Cannot locate default realm, even though the default realm in /etc/krb5.conf is correct.

In Ambari, I can see the settings spark.yarn.keytab={{hive_kerberos_keytab}} and spark.yarn.principal={{hive_kerberos_principal}}.

It appears that many other answers/websites also suggest to specify principal/keytab explicitly:

Other questions:

For a Spark application to interact with HDFS, HBase and Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application —that is, the principal whose identity will become that of the launched Spark application. This is normally done at launch time: in a secure cluster Spark will automatically obtain a token for the cluster’s HDFS filesystem, and potentially for HBase and Hive.

Well, the user launching the application has a valid ticket, as can be seen in the output of klist. The user has contributor access to the blob storage (not sure if that is actually needed). I don't understand what is meant by "Spark will automatically obtain a token for Hive [at launch time]" though. I did restart all services on the cluster, but that didn't help.

in yarn-cluster mode, the Spark client uses the local Kerberos ticket to connect to Hadoop services and retrieve special auth tokens that are then shipped to the YARN container running the driver; then the driver broadcasts the token to the executors

Possible things to try:


Updates

When logged in as Hive user:

kinit then supply hive password:

Password for hive/hn0-pt-dev.myrealm@MYREALM: 
kinit: Password incorrect while getting initial credentials


hive@hn0-pt-dev:/tmp$ klist -k /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
hive@hn0-pt-dev:/tmp$ kinit -k /etc/security/keytabs/hive.service.keytab
kinit: Client '/etc/security/keytabs/hive.service.keytab@MYREALM' not found in Kerberos database while getting initial credentials
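
A note on the last command: kinit -k reads the keytab path from -t and treats the positional argument as the principal, which is why the keytab path shows up as the client name in the error. The sketch below, which is only an illustration and not part of the original session, pulls the principal names out of klist -k output in the format shown above and assembles the corresponding kinit invocation. The helper names are hypothetical, and it assumes the plain klist -k layout (no -e or -t columns).

```python
def principals_from_klist(klist_output):
    """Return the distinct principals listed by `klist -k`, in order of appearance."""
    principals = []
    in_table = False
    for line in klist_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("----"):
            in_table = True  # keytab entries start after the dashed separator
            continue
        if not in_table or not stripped:
            continue
        parts = stripped.split(None, 1)  # "<KVNO> <principal>"
        if len(parts) == 2 and parts[1] not in principals:
            principals.append(parts[1])
    return principals


def kinit_command(keytab, principal):
    """Build a kinit call that logs in from a keytab instead of a password."""
    # -k: use a keytab; -t: path of that keytab; principal goes last.
    return ["kinit", "-k", "-t", keytab, principal]


# Sample taken from the klist output shown in the question.
sample = """Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
"""
keytab = "/etc/security/keytabs/hive.service.keytab"
for principal in principals_from_klist(sample):
    print(" ".join(kinit_command(keytab, principal)))
```

Running the printed command as a user who can read the keytab should obtain a ticket for that service principal without a password prompt.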

Solution

  • In general, you have to either complete a kinit successfully or pass a principal/keytab to be able to use Kerberos with Spark/Hive. There are some settings that complicate the use of Hive (impersonation).

    Generally speaking, if you can kinit and use hdfs to write to your own folder, your keytab is working:

    kinit # enter user info
    hdfs dfs -touch /home/myuser/somefile # confirms you have a writable home directory; Spark needs this
    

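    Once you know that is working, check whether you can write to Hive:

    Either use a JDBC connection or use beeline with a connection string like the one below. Note that principal= in a Hive JDBC URL names the Kerberos principal of the HiveServer2 service, not the connecting user; your own identity comes from the ticket you got with kinit.

    jdbc:hive2://HiveHost:10001/default;principal=hive/HiveHost@HOST1.COM;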
    

    This helps to find where the issue is.
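
To make the Hive check repeatable, the connection string can be assembled from its parts. This is a minimal sketch: hive_jdbc_url and beeline_command are hypothetical helper names, and the host, port, and principal values are placeholders in the style of the example string, not values known to be right for any particular cluster. If HiveServer2 runs in HTTP transport mode, settings such as transportMode=http and httpPath=cliservice would go in extra; whether that applies is an assumption to verify against the cluster configuration.

```python
def hive_jdbc_url(host, port, database, server_principal, extra=None):
    """Assemble a HiveServer2 JDBC URL that authenticates with Kerberos.

    server_principal is the principal HiveServer2 itself runs as.
    extra holds optional session settings appended as key=value pairs.
    """
    parts = [
        "jdbc:hive2://{}:{}/{}".format(host, port, database),
        "principal={}".format(server_principal),
    ]
    for key, value in (extra or {}).items():
        parts.append("{}={}".format(key, value))
    return ";".join(parts)


def beeline_command(url, query):
    """Build a beeline call that runs a single statement and exits."""
    return ["beeline", "-u", url, "-e", query]


# Placeholder values; replace with the real HiveServer2 host and realm.
url = hive_jdbc_url("HiveHost", 10001, "default", "hive/HiveHost@HOST1.COM")
print(url)
print(beeline_command(url, "SHOW DATABASES;"))
```

If beeline connects after a successful kinit but Spark in cluster mode still fails, the problem is more likely in how tokens reach the driver than in Hive itself.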

    If you are looking at an issue with hive you need to check impersonation:

    HiveServer2 Impersonation Important: This is not the recommended method to implement HiveServer2 authorization. Cloudera recommends you use Sentry to implement this instead. HiveServer2 impersonation lets users execute queries and access HDFS files as the connected user rather than as the super user. Access policies are applied at the file level using the HDFS permissions specified in ACLs (access control lists). Enabling HiveServer2 impersonation bypasses Sentry from the end-to-end authorization process. Specifically, although Sentry enforces access control policies on tables and views within the Hive warehouse, it does not control access to the HDFS files that underlie the tables. This means that users without Sentry permissions to tables in the warehouse may nonetheless be able to bypass Sentry authorization checks and execute jobs and queries against tables in the warehouse as long as they have permissions on the HDFS files supporting the table.

    If you are on Windows, watch out for the ticket cache. Consider setting up your own personal ticket cache location, because Windows typically uses one generic location for all users, which lets users log in over the top of each other and creates weird errors.

    If you are having Hive issues, the Hive logs themselves often help you understand why the process isn't working. You will only have a log if some of the Kerberos handshake succeeded, though; if it failed completely, you won't see anything.

    Check Ranger and see if there are any errors.

    Using a Keytab: By providing Spark with a principal and keytab (e.g. using spark-submit with --principal and --keytab parameters), the application will maintain a valid Kerberos login that can be used to retrieve delegation tokens indefinitely.

    Note that when using a keytab in cluster mode, it will be copied over to the machine running the Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it’s strongly recommended that both YARN and HDFS be secured with encryption, at least.
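
As an illustration of the keytab approach quoted above, here is a minimal sketch that builds a cluster-mode spark-submit call with --principal and --keytab. The principal, keytab path, and script name are hypothetical placeholders: the keytab has to belong to the user the job should run as, which this example cannot know.

```python
import subprocess


def spark_submit_command(app, principal, keytab, deploy_mode="cluster"):
    """Build a spark-submit call that logs in from a keytab on YARN."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--principal", principal,  # Kerberos principal the job runs as
        "--keytab", keytab,        # keytab holding that principal's key
        app,
    ]


# Hypothetical values for illustration only.
cmd = spark_submit_command(
    app="test.py",
    principal="myuser@MYREALM",
    keytab="/path/to/myuser.keytab",
)
print(" ".join(cmd))

# Uncomment to actually submit the job from a cluster node:
# subprocess.run(cmd, check=True)
```

Building the argument list rather than a shell string avoids quoting problems if the keytab path or principal contains unusual characters.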

    If you are using Livy, --proxy-user will conflict with --principal, but that's easy to fix: set livy.impersonation.enabled=false.