Tags: java, rmi, ehcache

One of ehcache nodes mistakenly tries to connect to 127.0.0.1


Could you help me understand why one of my ehcache nodes mistakenly tries to connect to 127.0.0.1?

I'm using ehcache 2.8.3. One of my nodes is running under VMware in NAT mode, so the host machine has the IP 192.168.10.1 (Windows 7) and the VMware guest has 192.168.10.128 (CentOS 6).

I have the following ehcache config:

<cacheManagerPeerProviderFactory class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
                                 properties="peerDiscovery=manual, rmiUrls=//192.168.10.128:51000/myCache1|//192.168.10.1:51000/myCache1"/>

<cacheManagerPeerListenerFactory class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
                                 properties="hostName=0.0.0.0,port=51000,socketTimeoutMillis=2000"/>

<diskStore path="java.io.tmpdir"/>

<defaultCache
        maxEntriesLocalHeap="10000"
        eternal="false"
        timeToIdleSeconds="120"
        timeToLiveSeconds="120"
        diskSpoolBufferSizeMB="30"
        maxEntriesLocalDisk="10000000"
        diskExpiryThreadIntervalSeconds="120"
        memoryStoreEvictionPolicy="LRU"
        statistics="false">
    <persistence strategy="localTempSwap"/>
</defaultCache>

<cache name="myCache1"
       maxEntriesLocalHeap="10000"
       maxEntriesLocalDisk="10000"
       eternal="false"
       diskSpoolBufferSizeMB="20"
       timeToIdleSeconds="300"
       timeToLiveSeconds="600"
       memoryStoreEvictionPolicy="LFU"
       transactionalMode="off">
    <persistence strategy="localTempSwap"/>

    <cacheEventListenerFactory class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"/>
</cache>

Messages from 192.168.10.128 to 192.168.10.1 get routed successfully, but the opposite direction doesn't work. I get the following error in the log on 192.168.10.1:

2014-07-11 02:02:19.260 +0400 DEBUG Lookup URL //192.168.10.128:51000/myCache1
2014-07-11 02:02:20.262 +0400 DEBUG Lookup URL //192.168.10.1:51000/myCache1
2014-07-11 02:02:21.264 +0400 WARN  Unable to send message to remote peer.  Message was: Connection refused to host: 127.0.0.1; nested exception is:
        java.net.ConnectException: Connection refused: connect
java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested exception is:
        java.net.ConnectException: Connection refused: connect
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619) ~[na:1.7.0_60]
        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216) ~[na:1.7.0_60]
        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202) ~[na:1.7.0_60]
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:129) ~[na:1.7.0_60]
        at net.sf.ehcache.distribution.RMICachePeer_Stub.send(Unknown Source) ~[services.jar:1.1]
        at net.sf.ehcache.distribution.RMIAsynchronousCacheReplicator.writeReplicationQueue(RMIAsynchronousCacheReplicator.java:314) [services.jar:1.1]
        at net.sf.ehcache.distribution.RMIAsynchronousCacheReplicator.replicationThreadMain(RMIAsynchronousCacheReplicator.java:127) [services.jar:1.1]
        at net.sf.ehcache.distribution.RMIAsynchronousCacheReplicator.access$000(RMIAsynchronousCacheReplicator.java:58) [services.jar:1.1]
        at net.sf.ehcache.distribution.RMIAsynchronousCacheReplicator$ReplicationThread.run(RMIAsynchronousCacheReplicator.java:389) [services.jar:1.1]
Caused by: java.net.ConnectException: Connection refused: connect
        at java.net.DualStackPlainSocketImpl.connect0(Native Method) ~[na:1.7.0_60]
        at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:79) ~[na:1.7.0_60]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[na:1.7.0_60]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) ~[na:1.7.0_60]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[na:1.7.0_60]
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172) ~[na:1.7.0_60]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.7.0_60]
        at java.net.Socket.connect(Socket.java:579) ~[na:1.7.0_60]
        at java.net.Socket.connect(Socket.java:528) ~[na:1.7.0_60]
        at java.net.Socket.<init>(Socket.java:425) ~[na:1.7.0_60]
        at java.net.Socket.<init>(Socket.java:208) ~[na:1.7.0_60]
        at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40) ~[na:1.7.0_60]
        at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:147) ~[na:1.7.0_60]
        at net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory.createSocket(ConfigurableRMIClientSocketFactory.java:71) ~[services.jar:1.1]
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613) ~[na:1.7.0_60]
        ... 8 common frames omitted

Why is it trying to connect to 127.0.0.1 if I don't use it anywhere in the config file?

I can telnet to 192.168.10.128:51000 from 192.168.10.1.

I also tried enabling bootstrapping and started seeing the following log message:

2014-07-11 02:35:30.515 +0400 DEBUG cache peers: [RMICachePeer_Stub[UnicastRef2 [liveRef: [endpoint:[127.0.0.1:18405,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@7d0](remote),objID:[-43892557:1472247d06b:-7fff, -5287536613776006259]]]]]
2014-07-11 02:35:30.516 +0400 DEBUG Bootstrapping myCache1 from RMICachePeer_Stub[UnicastRef2 [liveRef: [endpoint:[127.0.0.1:18405,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@7d0](remote),objID:[-43892557:1472247d06b:-7fff, -5287536613776006259]]]]

Why does it think I have a peer on 127.0.0.1:18405?


Solution

  • After many hours of debugging through the JDK and ehcache source code, I figured it out.

    My core incorrect assumption was that something was wrong with the Windows node where I saw the error. It turned out to be the Linux node that supplies the incorrect address.

    The official Ehcache FAQ says:

    This is caused by a 2008 change to the Ubuntu/Debian Linux default network configuration. Essentially, the Java call InetAddress.getLocalHost(); always returns the loopback address, which is 127.0.0.1. Why? Because in these recent distros, a system call of $ hostname always returns an address mapped onto the loopback device, and this causes Ehcache's RMI peer creation logic to always assign the loopback address, which causes the error you are seeing. All you need to do is crack open the network config and make sure that the hostname of the machine returns a valid network address accessible by other peers on the network.
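
    To see what the JVM actually resolves, a check like the one below can be run on the CentOS node (my own sketch, not part of the FAQ; the class name is just for illustration). On an affected box it prints 127.0.0.1 instead of 192.168.10.128:

    import java.net.InetAddress;

    // Prints the address that RMI will bake into exported stubs when
    // java.rmi.server.hostname is not set. On a misconfigured box this
    // prints 127.0.0.1 rather than the real interface address.
    public class LocalHostCheck {
        public static void main(String[] args) throws Exception {
            System.out.println(InetAddress.getLocalHost().getHostAddress());
        }
    }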

    The Linux node was getting "127.0.0.1" in the following method of the class java.rmi.registry.LocateRegistry:

    public static Registry getRegistry(String host, int port, RMIClientSocketFactory csf) throws RemoteException
    {
        Registry registry = null;
    
        if (port <= 0)
            port = Registry.REGISTRY_PORT;
    
        if (host == null || host.length() == 0) {
            // If host is blank (as returned by "file:" URL in 1.0.2 used in
            // java.rmi.Naming), try to convert to real local host name so
            // that the RegistryImpl's checkAccess will not fail.
            try {
                host = java.net.InetAddress.getLocalHost().getHostAddress();
            } catch (Exception e) {
                // If that failed, at least try "" (localhost) anyway...
                host = "";
            }
        }
    
        LiveRef liveRef = new LiveRef(new ObjID(ObjID.REGISTRY_ID), new TCPEndpoint(host, port, csf, null), false);
        RemoteRef ref = (csf == null) ? new UnicastRef(liveRef) : new UnicastRef2(liveRef);
    
        return (Registry) Util.createProxy(RegistryImpl.class, ref, false);
    }
    

    My Windows node was receiving it in the following method of the class net.sf.ehcache.distribution.ManualRMICacheManagerPeerProvider, where it calls lookupRemoteCachePeer:

    public final synchronized List listRemoteCachePeers(Ehcache cache) throws CacheException {
        List remoteCachePeers = new ArrayList();
        List staleList = new ArrayList();
        for (Iterator iterator = peerUrls.keySet().iterator(); iterator.hasNext();) {
            String rmiUrl = (String) iterator.next();
            String rmiUrlCacheName = extractCacheName(rmiUrl);
    
            if (!rmiUrlCacheName.equals(cache.getName())) {
                continue;
            }
            Date date = (Date) peerUrls.get(rmiUrl);
            if (!stale(date)) {
                CachePeer cachePeer = null;
                try {
                    cachePeer = lookupRemoteCachePeer(rmiUrl);
                    remoteCachePeers.add(cachePeer);
                } catch (Exception e) {
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Looking up rmiUrl " + rmiUrl + " through exception " + e.getMessage()
                                + ". This may be normal if a node has gone offline. Or it may indicate network connectivity"
                                + " difficulties", e);
                    }
                }
            } else {
                LOG.debug("rmiUrl {} should never be stale for a manually configured cluster.", rmiUrl);
                staleList.add(rmiUrl);
            }
    
        }
    
        //Remove any stale remote peers. Must be done here to avoid concurrent modification exception.
        for (int i = 0; i < staleList.size(); i++) {
            String rmiUrl = (String) staleList.get(i);
            peerUrls.remove(rmiUrl);
        }
        return remoteCachePeers;
    }
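
    The important detail here (my reading of the behaviour, simplified rather than quoted from the ehcache source) is that lookupRemoteCachePeer boils down to an RMI registry lookup, and the stub the registry returns carries the endpoint recorded by the exporting JVM, not the address from the rmiUrl. A minimal sketch of that lookup (the class name is just for illustration):

    import java.rmi.Naming;
    import net.sf.ehcache.distribution.CachePeer;

    // The registry on 192.168.10.128 answers the lookup, but the stub it hands back
    // embeds whatever address the exporting JVM resolved for itself at export time
    // (InetAddress.getLocalHost() unless java.rmi.server.hostname is set). That is why
    // the later send() calls target 127.0.0.1 even though the lookup URL worked fine.
    public class PeerLookupSketch {
        public static void main(String[] args) throws Exception {
            CachePeer peer = (CachePeer) Naming.lookup("//192.168.10.128:51000/myCache1");
            System.out.println(peer); // the printed stub shows its embedded endpoint
        }
    }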
    

    The official advice from Terracotta is to modify the hosts file, which seems too brutal to me. I concluded that it would be easier for the Ops team to simply supply the correct binding address on the command line of my server, which looks like this:

    java -Djava.rmi.server.hostname=192.168.10.128 -jar services.jar
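
    If changing the launch command is not an option, the same property can also be set programmatically, as long as that happens before the CacheManager (and therefore the RMI listener) is created. This is a sketch under that assumption; the class name and the ehcache.xml path are placeholders:

    import net.sf.ehcache.CacheManager;

    // Setting java.rmi.server.hostname before ehcache exports its RMI peers makes the
    // exported stubs advertise the real interface address instead of 127.0.0.1.
    public class ServerBootstrap {
        public static void main(String[] args) {
            System.setProperty("java.rmi.server.hostname", "192.168.10.128");
            CacheManager cacheManager = CacheManager.newInstance("ehcache.xml"); // placeholder config path
            // ... start the rest of the server here ...
        }
    }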