I am trying to configure a 3-node Apache Hadoop cluster. I already set it up in a Docker environment and everything worked fine there. Now I am trying to move to an OpenNebula environment. I have 3 VMs with Ubuntu and Hadoop. When I start Hadoop using ./sbin/start-dfs.sh, it starts datanodes on all the slaves and everything looks fine up to this point. But if I run "./bin/hdfs dfsadmin -report", it only shows me 1 live datanode. Check out the following:
Here is the result of the jps command on my master:
jps command on a slave:
I am also able to SSH into all the machines. My guess is that something is wrong with my hosts file, because my slaves are not able to reach the master. Here is my master's /etc/hosts:
<my_ip_1> master
<my_ip_2> slave-1
<my_ip_3> slave-2
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I have not modified my /etc/hostname file, but it looks like this, where "my_ip_1" represents the current IP of the VM:
<my_ip_1>.cloud.<domain>.de
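Name resolution between the nodes can be checked on each VM with something like the following (a quick sketch; getent just reads /etc/hosts and DNS, and the hostnames are the ones from my hosts file above):

getent hosts master      # should resolve to <my_ip_1>
getent hosts slave-1     # should resolve to <my_ip_2>
getent hosts slave-2     # should resolve to <my_ip_3>
hostname -f              # the name this VM reports about itself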
Further, if I run the Hadoop Pi example using the command
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 100 10000000
I get the following errors in the slave-1 and slave-2 log files, but the master node solves the Pi problem on its own.
2015-08-25 15:27:03,249 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/<my_ip_1>:54310. Already tried 10 time(s); maxRetries=45
2015-08-25 15:27:23,270 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/<my_ip_1>:54310. Already tried 11 time(s); maxRetries=45
2015-08-25 15:27:43,290 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/<my_ip_1>:54310. Already tried 12 time(s); maxRetries=45
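For context, port 54310 is the NameNode RPC port taken from fs.defaultFS in core-site.xml. One way to confirm which address each node actually tries to contact is something like this (a sketch; hdfs getconf is part of Hadoop 2.x):

./bin/hdfs getconf -confKey fs.defaultFS    # should print hdfs://master:54310 on every node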
I have already tried: http://www.quora.com/The-master-node-shows-only-one-live-data-node-when-I-am-running-multi-node-cluster-in-Hadoop-What-should-I-do
OK, I managed to figure out the problem and found the fix.
Problem:
My slave nodes were not able to communicate with the master, so I checked the firewall settings on my machines (Ubuntu) using the following command:
sudo ufw status verbose
The output of the command was:
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)
New profiles: skip
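To see which ports the Hadoop daemons are actually listening on (and therefore which ports the firewall needs to allow), something like this can be run on the master (a sketch; ss comes with iproute2 on Ubuntu):

sudo ss -tlnp | grep java    # lists the listening TCP sockets of the NameNode/DataNode processes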
Solution:
So my machines were denying all incoming requests. I disabled the firewall to verify this assumption:
sudo ufw disable
Before disabling the firewall,
telnet <my_ip_1> 54310
was giving me a connection timeout. After disabling the firewall, it worked fine. Then I disabled the firewall on all the machines and ran the Hadoop Pi example again. It worked.
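If telnet is not installed, the same check can be done with netcat, for example (a sketch):

nc -zv <my_ip_1> 54310    # -z only scans, -v reports whether the connection succeeded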
Then I re-enabled the firewall on all machines:
sudo ufw enable
and added firewall rules to allow incoming requests from my own IP addresses, like
sudo ufw allow from XXX.XXX.XXX.XXX
Or, if you want to allow a whole range (last octet 0 to 255), then
sudo ufw allow from XXX.XXX.XXX.0/24
Since I had 3 machines, on each machine I added the IP addresses of the other 2 machines. Everything went fine.
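Putting it together, the rules end up looking something like this (a sketch using the placeholder IPs from above; the same pattern is mirrored on each node with the other two addresses):

# on master: allow the two slaves
sudo ufw allow from <my_ip_2>
sudo ufw allow from <my_ip_3>

# on slave-1: allow master and slave-2
sudo ufw allow from <my_ip_1>
sudo ufw allow from <my_ip_3>

# on slave-2: allow master and slave-1
sudo ufw allow from <my_ip_1>
sudo ufw allow from <my_ip_2>

sudo ufw status numbered    # verify the rules were added on each machine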