I am running the simplest driver-only, long-running job to reproduce this error.
Hadoop version: 2.7.3.2.6.5.0-292
Spark-core version: 2.3.0.2.6.5.0-292 (spark-core_2.11)
Code:
// Touch HDFS periodically so the long-running driver keeps issuing RPCs.
FileSystem fs = tmpPath.getFileSystem(sc.hadoopConfiguration());
log.info("Path {} is {}", tmpPath, fs.exists(tmpPath));
Behaviour: My job runs without any problem for ~17-18 hours. After that, new tokens are issued by HadoopFSDelegationTokenProvider and the job continues to run with the newly issued delegation tokens. But within an hour of the delegation token renewal, the job fails with the error "token can't be found in cache". I went ahead and generated my own delegation tokens programmatically (FileSystem#addDelegationTokens) for the involved NameNodes, and I see the same behaviour.
Question:
- What are the chances that a delegation token gets removed on the server side, and which properties control this?
- Which server-side logs show that a token is about to be removed, or has been removed, from the cache?
Logs:
Path /test/abc.parquet is true
Path /test/abc.parquet is true
INFO Successfully logged into KDC
INFO getting token for DFS[DFSClient][clientName=DFSClient_NONMAPREDUCE_2324234_29,ugi=qa_user@ABC.com(auth:KERBEROS)](org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider)
INFO Created HDFS_DELEGATION_TOKEN token 31615466 for qa_user on ha:hdfs:hacluster
INFO getting token for DFS[DFSClient][clientName=DFSClient_NONMAPREDUCE_2324234_29,ugi=qa_user@ABC.com(auth:KERBEROS)](org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider)
INFO Created HDFS_DELEGATION_TOKEN token 31615467 for qa_user on ha:hdfs:hacluster
INFO writing out delegation tokens to hdfs://abc/user/qa/.sparkstaging/application_121212.....tmp
INFO delegation tokens written out successfully, renaming file to hdfs://.....
INFO delegation token file rename complete(org.apache.spark.deploy.yarn.security.AMCredentialRenewer)
Scheduling login from keytab in 64799125 millis
Path /test/abc.parquet is true
Path /test/abc.parquet is true
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 31615466 for qa_user) can't be found in cache
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
at org.apache.hadoop.ipc.Client.call(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1398)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy13.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:620)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
FYI: submitted in yarn-cluster mode with --keytab /path/to/the/headless-keytab --principal principalNameAsPerTheKeytab --conf spark.hadoop.fs.hdfs.impl.disable.cache=true. Note that the token renewer is issuing new tokens, and the new tokens work too, but somehow the old token gets revoked on the server, and the AM logs don't have any clue about it.
Answering my own question:
There are a couple of very important points to take away from here.
- The delegation tokens are a single copy stored in UserGroupInformation.getCurrentUser().getCredentials().getAllTokens(), and they can be updated (or cancelled) by any other thread running in the same JVM. My problem was fixed by setting mapreduce.job.complete.cancel.delegation.tokens=false for all other jobs running in the same context, especially the ones run in MapReduce contexts (see the configuration sketch after this list).
- HadoopFSDelegationTokenProvider should renew tokens every
(fraction * renewal time), i.e. 0.75 * 24 hours by default, provided you submitted the job with --keytab and --principal.
- Make sure you set fs.hdfs.impl.disable.cache=true for the HDFS filesystem. Getting a new FileSystem object every time is a costly operation, but you are guaranteed a fresh object carrying the new tokens instead of a stale one served from the internal CACHE.get(fsname) (also covered in the sketch after this list).
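A minimal sketch of those two settings, applied to the job's Hadoop configuration before any FileSystem object is created. The property keys are the real Hadoop/MapReduce ones; the helper name is hypothetical:

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: call this on the driver's Hadoop configuration
// (e.g. sc.hadoopConfiguration()) before obtaining any FileSystem.
static void hardenTokenHandling(Configuration conf) {
    // Stop completed MapReduce jobs in the same JVM from cancelling
    // the delegation tokens shared through the UGI credentials.
    conf.setBoolean("mapreduce.job.complete.cancel.delegation.tokens", false);
    // Bypass the FileSystem cache so every getFileSystem() call returns
    // a fresh object that picks up the newly issued tokens.
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
}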
In case none of these work, you can create your own delegation tokens with a new Credentials() by calling FileSystem#addDelegationTokens (https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/fs/FileSystem.html#addDelegationTokens(java.lang.String,%20org.apache.hadoop.security.Credentials)), but this method must be called inside kerberosUGI.doAs(...). A sketch follows.
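A minimal sketch of that fallback, reusing the principal and keytab from the submit command above. The renewer principal ("yarn") and the cluster URI are placeholders, and the snippet assumes the HA cluster configuration is on the classpath:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

static void refreshDelegationTokens() throws Exception {
    // Log in from the keytab so the doAs block runs with a Kerberos TGT.
    UserGroupInformation kerberosUGI = UserGroupInformation
            .loginUserFromKeytabAndReturnUGI("qa_user@ABC.com", "/path/to/the/headless-keytab");
    Credentials creds = kerberosUGI.doAs((PrivilegedExceptionAction<Credentials>) () -> {
        Credentials c = new Credentials();
        // Fetch fresh HDFS delegation tokens for the involved NameNodes;
        // the URI is a placeholder for the HA cluster from the logs.
        FileSystem fs = new Path("hdfs://hacluster/").getFileSystem(new Configuration());
        fs.addDelegationTokens("yarn", c); // renewer principal is a placeholder
        return c;
    });
    // Make the fresh tokens visible to every thread using the current UGI.
    UserGroupInformation.getCurrentUser().addCredentials(creds);
}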