hadoop webhdfs httpfs

HttpFs benefit over high availability and nameservice


I am using Apache Hadoop 2.7.1 on a cluster that consists of three nodes:

nn1 (master name node)
nn2 (second name node)
dn1 (data node)

We know that if we configure high availability in this cluster, we will have two main nodes: one active and the other standby.

If we also configure the cluster to be addressed through a nameservice, the following scenario should work.

The scenario is:

1- nn1 is active and nn2 is standby.

If we want to get a file (called myfile) from dn1, we can send this URL from a browser (a WebHDFS request):

http://nn1/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN

2- The name node daemon on nn1 is killed, so under high availability nn1 is no longer active and nn2 becomes active. We can now get myfile by sending this web request to nn2, because it is the active node:

http://nn2/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN

So configuring a nameservice with high availability seems to be enough to handle a name node failure and keep WebHDFS working.

What, then, is the benefit of adding HttpFs here, given the claim that WebHDFS is not supported with high availability and that HttpFs must be configured instead?


Solution

  • I understand that this is a follow-up to your previous question.

    WebHDFS and HttpFs are two different things. WebHDFS is part of the Namenode, so it is the NN that handles WebHDFS API calls. HttpFs is a separate service, independent of the Namenodes, and it is the HttpFs server that handles the API calls.

    what is the benefit of adding httpfs

    Your REST API calls remain the same irrespective of which NN is in the Active state. HttpFs, being HA aware, directs the request to the current Active NN.

    Let us assume the HttpFs server is started on nn1.

    WebHDFS GET request

    curl -L "http://nn1:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
    

    This is served by the Namenode daemon running on nn1.
    Scenario 1: nn1 is Active. The request returns a valid response.
    Scenario 2: nn2 is Active. The same request fails, because the Namenode on nn1 is not the Active NN.

    So the REST call must be changed to target nn2:

    curl -L "http://nn2:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
    

    Now, this will be served by the NN daemon running in nn2.
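
    Without HttpFs, the client itself has to handle the switch from one Namenode to the other. A minimal Python sketch of that client-side retry, assuming the hosts, port and path from the examples above. The helper names webhdfs_url and open_with_failover are hypothetical, and fetch is injectable so that the retry logic can be exercised without a live cluster:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def webhdfs_url(host, port, path, user, op="OPEN"):
    """Build a WebHDFS REST URL for the given host (hypothetical helper)."""
    query = urlencode({"user.name": user, "op": op})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def http_fetch(url):
    """Default fetcher; urlopen follows the redirect to the datanode."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def open_with_failover(hosts, path, user, port=50070, fetch=http_fetch):
    """Try each Namenode in turn and return the first successful response."""
    last_error = None
    for host in hosts:
        try:
            return fetch(webhdfs_url(host, port, path, user))
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            last_error = err    # e.g. a standby or stopped Namenode; try the next
    raise RuntimeError(f"no Namenode in {hosts} served the request") from last_error
```

    A call such as open_with_failover(["nn1", "nn2"], "/hadoophome/myfile", "root") would first ask nn1 and fall back to nn2 only if nn1 refuses or is unreachable. This is the logic that HttpFs takes off the client.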

    HttpFs GET request

    curl "http://nn1:14000/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
    

    This request is served by the HttpFs service running on nn1.
    Scenario 1: nn1 is Active. The HttpFs server on nn1 directs the request to the current Active Namenode, nn1.
    Scenario 2: nn2 is Active. The HttpFs server on nn1 directs the request to the current Active Namenode, nn2.

    In both scenarios, the REST call is the same. The request fails only if the HttpFs server itself is down.
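
    The contrast can be made concrete by building the URL a client would need in each case. A small Python sketch, assuming the ports used in the examples above (50070 for the Namenode, 14000 for HttpFs); rest_url, webhdfs_target and httpfs_target are hypothetical helper names:

```python
from urllib.parse import urlencode

NAMENODE_PORT = 50070  # WebHDFS, served by each Namenode (port from the examples)
HTTPFS_PORT = 14000    # HttpFs, a separate gateway service (port from the examples)


def rest_url(host, port, path, user, op="OPEN"):
    """Both services expose the same /webhdfs/v1 REST layout."""
    query = urlencode({"user.name": user, "op": op})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def webhdfs_target(active_nn, path, user):
    """Plain WebHDFS: the client must address whichever Namenode is Active."""
    return rest_url(active_nn, NAMENODE_PORT, path, user)


def httpfs_target(httpfs_host, path, user):
    """HttpFs: the URL depends only on where the HttpFs server runs."""
    return rest_url(httpfs_host, HTTPFS_PORT, path, user)


# Compare the two URLs for each possible Active Namenode.
for active in ("nn1", "nn2"):
    print("active:", active)
    print("  WebHDFS:", webhdfs_target(active, "/hadoophome/myfile", "root"))
    print("  HttpFs: ", httpfs_target("nn1", "/hadoophome/myfile", "root"))
```

    The WebHDFS URL changes with the Active Namenode, while the HttpFs URL stays fixed on the HttpFs host.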

    configuring name service with high availability is enough for name node failure and for webhdfs to work fine

    The nameservice is the logical name given to the pair of Namenodes. It is not an actual host, so it cannot be used in place of the host in the REST API calls.