Tags: jdbc, hive, apache-zeppelin, hortonworks-sandbox

Apache Zeppelin configuration to connect to Hive on the HDP VirtualBox sandbox


I've been struggling with setting up an Apache Zeppelin 0.10.0 notebook for a while. The idea is to connect it to a remote Hortonworks HDP 2.6.5 sandbox that runs locally in VirtualBox on Ubuntu 20.04. I am using an image downloaded from:

https://www.cloudera.com/downloads/hortonworks-sandbox.html

Of course, the image comes with Zeppelin pre-installed, which works fine on port 9995, but it is an old 0.7.3 version that doesn't support the Helium plugins I would like to use. I know that HDP 3.0.1 ships with the updated Zeppelin 0.8, but using it is impossible at the moment due to my hardware resources. Additionally, from what I remember, there was also a problem enabling the Leaflet map plugin there.

My first thought was to update the notebook on the server, but after following the instructions from the Cloudera forums (unfortunately they are down at the moment, so I cannot provide a link or look for another solution), Zeppelin failed to start correctly. A simpler approach now seemed to be connecting a newer notebook version to the virtual server, but despite many attempts with various configurations from threads here, I was not able to connect to Hive via JDBC. I also use Zeppelin with a local Spark 3.0.3, but I have some geodata in Hive that I would like to visualize this way.

I used, among others, the description on the Zeppelin website:

https://zeppelin.apache.org/docs/latest/interpreter/jdbc.html#apache-hive

This is my current JDBC interpreter configuration:

hive.driver     org.apache.hive.jdbc.HiveDriver
hive.url        jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2 
hive.user       hive
Artifact        org.apache.hive:hive-jdbc:3.1.2

Depending on the driver version, there were different errors, but this time after typing:

%jdbc(hive)
SELECT * FROM mydb.mytable;

I get the following error:

Could not open client transport for any of the Server URI's in ZooKeeper: Could not establish connection to jdbc:hive2://sandbox-hdp.hortonworks.com:10000/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;hive.server2.proxy.user=hive;?tez.application.tags=paragraph_1645270946147_194101954;mapreduce.job.tags=paragraph_1645270946147_194101954;: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:mapreduce.job.tags=paragraph_1645270946147_194101954, set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, hive.server2.proxy.user=hive, use:database=default, set:hiveconf:tez.application.tags=paragraph_1645270946147_194101954})

I will be very grateful to everyone for any help. Regards.


Solution

  • So, after many hours of trials, here is a working solution. First of all, the most important thing is to use drivers that match your version of Hadoop; the "Required field 'client_protocol' is unset" error above is the classic symptom of a Thrift protocol mismatch between a newer hive-jdbc driver and an older HiveServer2. You need jar files such as 'hive-jdbc-standalone' and 'hadoop-common' in their respective versions, and to avoid adding all of them in the 'Artifact' field of the %jdbc interpreter in Zeppelin, it is best to use one complete jar containing all required dependencies. Thanks to Tim Veil, one is available in his GitHub repository below:

    https://github.com/timveil/hive-jdbc-uber-jar/
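    Staging the jar can be sketched with a couple of shell commands. The Zeppelin install path, the HDP version string, and the release asset name below are assumptions; match them to your sandbox build (here 2.6.5.0-292):

```shell
# Stage the uber jar where the %jdbc interpreter can load it.
# HDP_VERSION and TARGET_DIR are assumptions; adjust to your setup.
HDP_VERSION="2.6.5.0-292"
JAR="hive-jdbc-uber-${HDP_VERSION}.jar"
TARGET_DIR="/opt/zeppelin/interpreter/jdbc"
# Download a matching release from the repository (release tag and asset
# name may differ; check the releases page first):
# wget "https://github.com/timveil/hive-jdbc-uber-jar/releases/download/<tag>/${JAR}"
# sudo mkdir -p "${TARGET_DIR}" && sudo cp "${JAR}" "${TARGET_DIR}/"
# This is the path to paste into the interpreter's Artifact field:
echo "${TARGET_DIR}/${JAR}"
```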

    These are my complete Zeppelin %jdbc interpreter settings:

    default.url                                     jdbc:postgresql://localhost:5432/
    default.user                                    gpadmin
    default.password
    default.driver                                  org.postgresql.Driver
    default.completer.ttlInSeconds                  120
    default.completer.schemaFilters     
    default.precode     
    default.statementPrecode        
    common.max_count                                1000    
    zeppelin.jdbc.auth.type                         SIMPLE 
    zeppelin.jdbc.auth.kerberos.proxy.enable        false 
    zeppelin.jdbc.concurrent.use                    true    
    zeppelin.jdbc.concurrent.max_connection         10  
    zeppelin.jdbc.keytab.location       
    zeppelin.jdbc.principal         
    zeppelin.jdbc.interpolation                     false   
    zeppelin.jdbc.maxConnLifetime                   -1
    zeppelin.jdbc.maxRows                           1000    
    zeppelin.jdbc.hive.timeout.threshold            60000 
    zeppelin.jdbc.hive.monitor.query_interval       1000    
    hive.driver                                     org.apache.hive.jdbc.HiveDriver     
    hive.password       
    hive.proxy.user.property                        hive.server2.proxy.user     
    hive.splitQueries                               true    
    hive.url                                        jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default  
    hive.user                                       hive
    
    Dependencies
    Artifact
    /opt/zeppelin/interpreter/jdbc/hive-jdbc-uber-2.6.5.0-292.jar
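
    With these settings in place, a quick smoke test from a notebook paragraph (any database visible to the 'hive' user will do):

```sql
%jdbc(hive)
SHOW DATABASES;
```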
    

    The next step is to go to Ambari at http://localhost:8080/ and log in as admin. To do that, first log in to the sandbox's root account via SSH:

    ssh root@127.0.0.1 -p 2222
    root@127.0.0.1's password: hadoop
    

    After a successful login, you will be prompted to change your password immediately; please do so, then set the Ambari admin password with:

    [root@sandbox-hdp ~]# ambari-admin-password-reset
    

    After that you can use the admin account in Ambari (log in and click the Hive link in the left panel):

    Ambari -> Hive -> Configs -> Advanced -> Custom hive-site

    Click Add Property

    Insert the following into the window that opens:

    hive.security.authorization.sqlstd.confwhitelist.append=tez.application.tags
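
    For reference, after saving, Ambari writes this into hive-site.xml as an ordinary property:

```xml
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>tez.application.tags</value>
</property>
```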
    

    After saving, restart all Hive services in Ambari. Everything should now work, provided you set the proper Java path in 'zeppelin-env.sh' and the port in 'zeppelin-site.xml' (you must copy and rename 'zeppelin-env.sh.template' and 'zeppelin-site.xml.template' in Zeppelin's conf directory; remember that Ambari also uses port 8080!).
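    As a sketch, assuming Zeppelin was unpacked under /opt/zeppelin (adjust the paths, the JAVA_HOME value, and the chosen port to your machine):

```shell
# Prepare Zeppelin config from the shipped templates.
ZEPPELIN_HOME="${ZEPPELIN_HOME:-/opt/zeppelin}"
cd "${ZEPPELIN_HOME}/conf" || exit 1
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
# JAVA_HOME path is an assumption; point it at your JDK:
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> zeppelin-env.sh
# Move Zeppelin off 8080 so it does not collide with Ambari, e.g. to 8085:
sed -i 's|<value>8080</value>|<value>8085</value>|' zeppelin-site.xml
```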

    In my case, the only thing left to do was to add or uncomment the fragment responsible for the Helium plugin repository (in 'zeppelin-site.xml'):

    <property>
      <name>zeppelin.helium.registry</name>
      <value>helium,https://s3.amazonaws.com/helium-package/helium.json</value>
      <description>Enable helium packages</description>
    </property>
    

    Now you can go to the Helium tab in the top right corner of the Zeppelin sheet and install the plugins of your choice, in my case the 'zeppelin-leaflet' visualization. And voilà! A sample visualization from this Kaggle dataset loaded into Hive:

    https://www.kaggle.com/kartik2112/fraud-detection

    (Screenshot: Zeppelin fraud map)

    Have a nice day!