Two of my three slave VMs are down and I can't ssh into them. We performed a hard reboot, but they are still down. Any idea how to bring them back, or how to debug this and find the reason? Here's the output of jps:
3542 RunJar
9920 SecondaryNameNode
10094 ResourceManager
10244 NodeManager
8677 DataNode
31634 Jps
8536 NameNode
Here's another detail:
ubuntu@anmol-vm1-new:~$ sudo netstat -atnp | grep 8020
tcp 0 0 10.0.1.190:8020 0.0.0.0:* LISTEN 8536/java
tcp 0 0 10.0.1.190:50957 10.0.1.190:8020 ESTABLISHED 8677/java
tcp 0 0 10.0.1.190:8020 10.0.1.190:50957 ESTABLISHED 8536/java
tcp 0 0 10.0.1.190:8020 10.0.1.193:46627 ESTABLISHED 8536/java
tcp 0 0 10.0.1.190:44300 10.0.1.190:8020 TIME_WAIT -
tcp 0 0 10.0.1.190:8020 10.0.1.190:44328 ESTABLISHED 8536/java
tcp 0 0 10.0.1.190:8020 10.0.1.193:44610 ESTABLISHED 8536/java
tcp6 0 0 10.0.1.190:44292 10.0.1.190:8020 TIME_WAIT -
tcp6 0 0 10.0.1.190:44328 10.0.1.190:8020 ESTABLISHED 10244/java
tcp6 0 0 10.0.1.190:44252 10.0.1.190:8020 TIME_WAIT -
tcp6 0 0 10.0.1.190:44247 10.0.1.190:8020 TIME_WAIT -
tcp6 0 0 10.0.1.190:44287 10.0.1.190:8020 TIME_WAIT -
When I run the following command:
hadoop fsck /
the result is:
The filesystem under path '/' is CORRUPT
More details are in this pastebin.
If the VMs are down and you cannot ssh into them, the filesystem may be full. You have to log in through the VM console and clean up the filesystem, since ssh will no longer work.
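Once you are on the VM console, something like the following can help confirm whether a disk is actually full. This is only a sketch: the helper name full_mounts and the 90% threshold are made up for illustration, and it assumes a POSIX shell with df and awk available.

```shell
#!/bin/sh
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` formatted text on stdin, so it also works on saved output.
# Assumes mount points contain no spaces.
full_mounts() {
    threshold="${1:-90}"   # percent; 90 is an arbitrary default
    awk -v t="$threshold" '
        NR > 1 {                  # skip the df header line
            use = $5              # capacity column, e.g. "99%"
            sub(/%/, "", use)
            if (use + 0 >= t) print $6, $5
        }'
}
```

On the console you could run `df -P | full_mounts 90` to list nearly full mounts, then `du -xsh /var/log/* | sort -h | tail` to find the largest entries under one directory. /var/log is just a guess at where the space went; substitute whichever mount shows up as full.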