I am using Apache Hadoop 2.7.1 on a cluster that consists of three nodes:
nn1 (master name node)
nn2 (second name node)
dn1 (data node)
We know that if we configure high availability in this cluster, we will have two main nodes: one active and one standby. And if we also configure the cluster to be addressed by a nameservice, the following scenario works.
The scenario is:
1- nn1 is active and nn2 is standby, so if we want to get a file (called myfile) from dn1, we can send this URL from a browser (a WebHDFS request):
http://nn1/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN
2- The name node daemon on nn1 is killed, so under high availability nn1 becomes standby and nn2 becomes active. We can now get myfile by sending the request to nn2 instead, because it is the active node:
http://nn2/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN
So configuring a nameservice with high availability is enough to survive a name node failure and keep WebHDFS working. What, then, is the benefit of adding HttpFs, given that WebHDFS with high availability is not supported and we have to configure HttpFs?
I understand that this is a follow-up to your previous question here.
WebHDFS and HttpFs are two different things. WebHDFS is part of the Namenode, and it is the NN itself that handles the WebHDFS API calls, whereas HttpFs is a separate service, independent of the Namenodes, and the HttpFs server handles the API calls.
what is the benefit of adding httpfs
Your REST API calls will remain the same irrespective of which NN is in the Active state. HttpFs, being HA-aware, will direct the request to the current Active NN.
Let us assume the HttpFs server is started on nn1.
WebHDFS GET request:
curl "http://nn1:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
This is served by the Namenode daemon running on nn1. (Note the quotes around the URL: without them, the shell would treat the & as a background operator and truncate the request.)
Scenario 1: nn1 is Active. The request is answered with a valid response.
Scenario 2: nn2 is Active. The same request fails, as there is no Active NN running on nn1.
So the REST call must be modified to target nn2:
curl "http://nn2:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
Now this will be served by the NN daemon running on nn2.
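Without HttpFs, this failover handling has to live in the client. A small sketch of what such client-side retry logic could look like (the helper names are illustrative, not part of Hadoop; nn1/nn2 and port 50070 are from the setup above):

```shell
# Build the WebHDFS OPEN URL for a given Namenode host.
webhdfs_url() {
  echo "http://$1:50070/webhdfs/v1$2?user.name=root&op=OPEN"
}

# Try each Namenode in turn until one answers successfully.
fetch_with_failover() {
  local path=$1; shift
  for nn in "$@"; do
    # -f makes curl return a non-zero status on HTTP errors,
    # e.g. when a standby NN rejects the read.
    if curl -sf "$(webhdfs_url "$nn" "$path")"; then
      return 0
    fi
  done
  return 1
}

# Usage: fetch_with_failover /hadoophome/myfile nn1 nn2
```

This is essentially the loop that HttpFs (or an HA-aware HDFS client) saves you from writing yourself.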
HttpFs GET request:
curl "http://nn1:14000/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"
This request is served by the HttpFs service running on nn1.
Scenario 1: nn1 is Active. The HttpFs server running on nn1 directs the request to the current Active Namenode, nn1.
Scenario 2: nn2 is Active. The HttpFs server running on nn1 directs the request to the current Active Namenode, nn2.
In both scenarios the REST call is the same. The request fails only if the HttpFs server itself is down.
configuring name service with high availability is enough for name node failure and for webhdfs to work fine
A nameservice is the logical name given to the pair of Namenodes. This nameservice is not an actual host and cannot be substituted for the host in the REST API calls.
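To see why, recall that the nameservice exists only in configuration. A minimal sketch of the HA client settings in hdfs-site.xml might look like this (the nameservice name "mycluster" is an assumed example; nn1/nn2 are the hosts above):

```xml
<!-- "mycluster" is an assumed example nameservice name -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>nn1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>nn2:50070</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

HA-aware clients (including the HttpFs server) resolve the logical name through these settings; a plain HTTP client like a browser or curl cannot, which is why a WebHDFS URL must always name a real host.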