Tags: java, hive, kerberos, metastore

Hive metastore suffering from kerberos "Clock skew too great" error


Recently we have been encountering the problem described in the title about once a month. On the metastore node, we have installed and started the ntpd service to sync time with the Kerberos server. The krb5.conf on the node looks like this:

[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = true
dns_lookup_kdc = true
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

So it seems unlikely that the problem is caused by the metastore's clock drifting out of sync with the Kerberos server (by 5 minutes or more), or by a network outage.
Looking at the metastore log, the logging times of the "Clock skew too great" exceptions are out of order, for example:
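(For reference, the 5-minute threshold is the default permitted clock skew; in MIT Kerberos it is controlled by the clockskew setting in the [libdefaults] section of krb5.conf, in seconds. A sketch of how it would appear, assuming the default value:)

```
[libdefaults]
    clockskew = 300
```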

2016-01-16 18:18:48,071 ERROR [pool-3-thread-63735]
2016-01-16 19:07:03,699 ERROR [pool-3-thread-63798]
2016-01-16 19:06:55,998 ERROR [pool-3-thread-63796]
2016-01-16 19:06:41,653 ERROR [pool-3-thread-63812]
2016-01-16 19:04:28,659 ERROR [pool-3-thread-63806]
2016-01-16 19:04:13,937 ERROR [pool-3-thread-63804]
2016-01-16 19:02:19,312 ERROR [pool-3-thread-63809]
2016-01-16 19:02:13,115 ERROR [pool-3-thread-63794]
2016-01-16 19:02:06,028 ERROR [pool-3-thread-63800]
2016-01-16 19:01:50,767 ERROR [pool-3-thread-63795]
2016-01-16 18:59:36,926 ERROR [pool-3-thread-63810]
2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]

The exception stack:

2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]: transport.TSaslTransport (TSaslTransport.java:open(296)) - SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))]
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:177)
        at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:509)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:264)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:678)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:675)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1536)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge20S.java:675)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))
        at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)  
        ... 10 more

Environment:

 java version "1.7.0_45"
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
 hive-0.13.1.2.1.10.0-hdp

So what should I do if I want to figure out the root cause? Any suggestions? Many thanks.


Solution

  • I've seen this error as well, and in my case the root cause had nothing to do with Kerberos. If you are using a MySQL database as your data store, there is a pretty severe memory leak, https://issues.apache.org/jira/browse/HIVE-15551, that was introduced in 0.13 and not fixed until Hive 1.3.0. Basically, whoever originally wrote the code either forgot, or didn't realize, that you have to explicitly close JDBC statements, and the leaked statements lead to excessive garbage collection once the process approaches its memory limit. When this happens, everything in the process gets progressively slower, until authentication exchanges take long enough that you start to see these clock skew errors.

    You can tell whether this is your issue by taking a live heap histogram of the metastore process with jmap -histo:live <pid>. If JDBC objects appear near the top of the list (in my case it was com.mysql.jdbc.JDBC42ResultSet and com.mysql.jdbc.StatementImpl), you are likely hitting this issue. I recommend that you either apply the patch, upgrade to Hive 1.3.0, or use the workaround mentioned in the issue to see if that clears things up.
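    The leak in question is the classic pattern of creating JDBC statements without closing them. As a hedged, self-contained sketch (using a hypothetical stand-in class instead of a real java.sql.Statement, so it runs without a database), the fix is the try-with-resources form:

```java
// Sketch of the JDBC statement-leak anti-pattern and its fix.
// MockStatement is a stand-in for java.sql.Statement; the
// open/close pattern is the same as with a real driver.
public class StatementLeakSketch {
    static int openStatements = 0;

    static class MockStatement implements AutoCloseable {
        MockStatement() { openStatements++; }
        @Override
        public void close() { openStatements--; }
    }

    // Anti-pattern: the statement is never closed, so its
    // resources linger until finalization/GC (the root of
    // the reported leak).
    static void leaky() {
        MockStatement stmt = new MockStatement();
        // ... use stmt, then return without calling close() ...
    }

    // Fix: try-with-resources guarantees close() on every path,
    // including exceptional ones.
    static void fixed() {
        try (MockStatement stmt = new MockStatement()) {
            // ... use stmt ...
        }
    }

    public static void main(String[] args) {
        leaky();
        System.out.println("after leaky(): open=" + openStatements); // 1
        fixed();
        System.out.println("after fixed(): open=" + openStatements); // still 1
    }
}
```

    In a long-running metastore, each un-closed statement is a small object graph the collector must repeatedly scan, which is how the leak turns into the GC pressure described above.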