redhawksdr, omniorb

Device Managers on Machines with external 10GigE devices causing problems


The issue I am having is that when I start a device manager on a remote PC that has an external hardware device with 10GigE ports connected, the registerDevice message from the device manager registers the IP address of the 10GigE interface used by the external device instead of the device manager machine's actual IP address.

Network Setup:

PC1
  Domain Manager: 192.168.5.10
  Device Manager A (GPP): 192.168.5.10 (on the same machine as the domain manager)
PC2
  Device Manager B (GPP): 192.168.5.11
  Device Interface: 192.168.100.10 (connected to external hardware)

If I run my scenario without the external device connected to PC2, the device manager on that machine registers with IP address: 192.168.5.11. If I connect the external hardware to PC2 and the 10GigE interface comes online, the device manager registers with IP address: 192.168.100.10 and the entire REDHAWK domain hangs.

I verified this issue by going through Wireshark logs on both PC1 and PC2. We have not had this problem when connecting UHD devices, except for the UHD devices with 10GigE ports. It is important to note that no devices or device managers for them are actually being used at this point; the devices are just powered on and a node with only the GPP is started. In both the UHD case and the external hardware case, the 10GigE ports are custom and implement a limited 10GigE interface. When the PC is connected to another PC over 10GigE, instead of to a device with a limited 10GigE implementation, the device manager works.

If I connect the 10GigE device after the node is active, the FE 2.0 devices work perfectly fine. This scenario, however, will not work for us, as physically walking over and powering on the device is not valid for our use case. In addition, running the devices against a domain started on the same computer does not exhibit these issues; the problem only occurs when the domain is on a remote PC.

We are currently working with the following configurations, and both exhibit the same problem.

CentOS 6.6 with REDHAWK 2.0.3 and OmniORB 4.1
Fedora 24 with REDHAWK 2.0.3 and OmniORB 4.2

Has anyone else had this problem and is there anything I can do about it?


Solution

  • Let's use docker to run through a thorough example. You'll need to bring up 3 terminals and have docker installed, but we can do this all on one host. I'll refer to the 3 terminals as the "Domain", "Sandbox", and "Host" terminals.

    In the Domain Terminal, spin up a fresh redhawk 2.0.2 instance:

    docker run -it --name=domain axios/redhawk:2.0.2 bash -l

    In the Sandbox Terminal, spin up another redhawk 2.0.2 instance:

    docker run -it --name=sandbox axios/redhawk:2.0.2 bash -l

    If you aren't familiar with docker: these two container instances have their own filesystems, memory, and networking. Run ifconfig in each to check its IP address and note the addresses down. Note that they are on the same subnet and can ping one another.
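
    If you prefer to grab the addresses from the Host terminal instead of running ifconfig inside each container, docker can report them directly (a small convenience sketch, assuming the container names domain and sandbox used above):

    # print each container's IP address on every network it is attached to
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' domain
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' sandbox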

    We can now emulate your 10GigE ports by creating two new networks which cannot reach each other. On the Host terminal, use docker to create two separate fake networks and assign them to your container instances.

    docker network create -o "com.docker.network.bridge.host_binding_ipv4"="1.1.1.1" bad_net_1
    docker network create -o "com.docker.network.bridge.host_binding_ipv4"="2.2.2.2" bad_net_2
    docker network connect bad_net_1 domain
    docker network connect bad_net_2 sandbox
    

    Back in the Domain and Sandbox terminals, rerun ifconfig and notice you now have an eth0 and an eth1 interface, where the eth1 interfaces on the Domain and Sandbox instances are on different subnets and cannot communicate.

    Your IP addresses may be different, but for me I have:

    Domain:
      eth0: 172.17.0.2
      eth1: 172.19.0.2
    
    Sandbox:
      eth0: 172.17.0.3
      eth1: 172.20.0.2
    
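
    To convince yourself that the two eth1 interfaces really cannot reach each other, a quick check (using my example addresses above; substitute your own) from the Domain instance:

    # the Sandbox eth0 address on the shared default bridge is reachable
    ping -c 1 172.17.0.3
    # the Sandbox eth1 address on bad_net_2 is not (this ping should fail)
    ping -c 1 172.20.0.2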

    I'll now configure the Domain to be the omniNames host, set omniORB connection timeouts so we don't hang on CORBA calls, and INCORRECTLY configure the endPoint so that the WRONG IP address is advertised.

    On the Domain machine:

    sudo tee /etc/omniORB.cfg << EOF
    InitRef = NameService=corbaname::172.17.0.2:2809
    supportBootstrapAgent = 1
    InitRef = EventService=corbaloc::172.17.0.2:11169/omniEvents
    endPoint = giop:tcp:172.19.0.2:
    serverCallTimeOutPeriod = 5000
    clientConnectTimeOutPeriod = 5000
    clientCallTimeOutPeriod = 5000
    EOF
    

    On the Sandbox machine:

    sudo tee /etc/omniORB.cfg << EOF
    InitRef = NameService=corbaname::172.17.0.2:2809
    supportBootstrapAgent = 1
    InitRef = EventService=corbaloc::172.17.0.2:11169/omniEvents
    endPoint = giop:tcp:172.20.0.2:
    serverCallTimeOutPeriod = 5000
    clientConnectTimeOutPeriod = 5000
    clientCallTimeOutPeriod = 5000
    EOF
    

    On the Domain machine, start omniNames and omniEvents via cleanomni, which will also clear out any stale state:

    cleanomni
    
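
    As an optional sanity check (assuming net-tools is available, which the image already uses for ifconfig), you can confirm omniNames is listening on its default port before moving on:

    # omniNames should be bound to the NameService port 2809
    netstat -tln | grep 2809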

    On the Sandbox machine run nameclt list to view the omniORB objects. Note that it does not work, since the endPoint address advertised for the Domain is wrong. If we turn up the logging in /etc/omniORB.cfg via traceLevel = 40, we can even see the incorrect IP address in the message.
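
    To reproduce that trace yourself, one quick way (a sketch; keep the rest of the Sandbox config from above) is to append the option and rerun the lookup:

    # bump omniORB logging on the Sandbox machine, then retry the name lookup
    echo "traceLevel = 40" | sudo tee -a /etc/omniORB.cfg
    nameclt list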

    omniORB: inputMessage: from giop:tcp:172.17.0.2:2809 236 bytes
    omniORB:
    4749 4f50 0100 0101 e000 0000 0000 0000 GIOP............
    0400 0000 0000 0000 0000 0000 2a00 0000 ............*...
    4944 4c3a 6f6d 672e 6f72 672f 436f 734e IDL:omg.org/CosN
    616d 696e 672f 4269 6e64 696e 6749 7465 aming/BindingIte
    7261 746f 723a 312e 3000 0000 0100 0000 rator:1.0.......
    0000 0000 9400 0000 0101 0200 0b00 0000 ................
    3137 322e 3139 2e30 2e32 0000 23ae 0000 172.19.0.2..#...
    0e00 0000 ff00 bb05 0a58 0100 003c 0000 .........X...<..
    0002 0000 0400 0000 0000 0000 0800 0000 ................
    0100 0000 0054 5441 0100 0000 1c00 0000 .....TTA........
    0100 0000 0100 0100 0100 0000 0100 0105 ................
    0901 0100 0100 0000 0901 0100 0300 0000 ................
    1600 0000 0100 0000 0b00 0000 3137 322e ............172.
    3137 2e30 2e32 0000 f90a 0000 0354 5441 17.0.2.......TTA
    0800 0000 bb05 0a58 0100 003c           .......X...<
    

    On the Domain terminal, use vim or emacs to fix the endPoint in /etc/omniORB.cfg, then run cleanomni to clear out any old references and restart the omni services. From the Sandbox terminal you can now properly run nameclt list.

    On the Domain terminal, start up a domain using nodeBooter -D. From the Sandbox terminal, connect to the domain via python and confirm you can interact with it as expected:

    >>> from ossie.utils import redhawk
    >>> dom = redhawk.attach('REDHAWK_DEV')
    >>> dom.name
    'REDHAWK_DEV'
    >>> fs = dom.fileManager
    >>> fs.list('.')
    

    Note that so far we have only been making calls from the Sandbox to the Domain, so only the advertised endPoint of the Domain machine has mattered. Calls like "start" and "stop" are made from you down to a component, but calls like pushPacket are made from the component out to you. We can confirm this by connecting a SigGen on the Domain machine to a HardLimit on the Sandbox machine. Remember, right now the Domain machine is properly configured and the Sandbox machine is not.

    On the Domain machine, stop the domain and run the following to install a waveform and launch the domain with a GPP:

    sudo yum install -y rh.basic_components_demo
    nodeBooter -D -d /var/redhawk/sdr/dev/nodes/DevMgr_12ef887a9000/DeviceManager.dcd.xml
    
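
    The DCD path above is specific to this image; if your node directory is named differently, list the installed nodes first and substitute the correct name:

    ls /var/redhawk/sdr/dev/nodes/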

    Now, back on the Sandbox machine in python:

    from ossie.utils import redhawk, sb
    import time
    # attach to the remote domain and launch the demo waveform on its GPP
    dom = redhawk.attach('REDHAWK_DEV')
    app = dom.createApplication('/waveforms/rh/basic_components_demo/basic_components_demo.sad.xml')
    siggen = app.comps[0]
    siggen.start()
    # launch a HardLimit and a DataSink locally in the sandbox
    hardlimit = sb.launch('rh.HardLimit')
    sink = sb.DataSink()
    hardlimit.connect(sink)
    # this connection makes the remote SigGen pushPacket back to the sandbox
    siggen.connect(hardlimit)
    sb.start()
    time.sleep(1)
    sink.getData()
    

    You should get no data in the sink, since the wrong endPoint is advertised by the Sandbox machine. Now exit out of python, fix the endPoint on the Sandbox instance, and rerun this experiment. This time you get data, since both endPoints are properly configured.
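
    One quick way to make that edit (using my example addresses from above; yours will differ) is a one-line substitution on the Sandbox machine before restarting python:

    # advertise the Sandbox eth0 address instead of the unreachable eth1 address
    sudo sed -i 's|giop:tcp:172.20.0.2:|giop:tcp:172.17.0.3:|' /etc/omniORB.cfg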

    Lastly, what happens if you do not set an endPoint at all? (As I imagine was your case.) From the omniORB sample configuration file:

    By default, no endPoint configuration is defined. In this case the ORB will create just 1 tcp endpoint as if the line: endPoint = giop:tcp:: is specified in the configuration file

    and

    The host and port parameter are optional. If either or both are missing, the ORB fills in the blank. For example, "giop:tcp::" will cause the ORB to pick an arbitrary tcp port as the endpoint and it will pick one IP address of the host as the host address.

    So you may get very strange behavior depending on which interface the ORB happens to pick. Hopefully this example has helped and is easy enough for everyone to run through and reproduce.
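
    Translating this back to the original network setup (assuming PC2 still has 192.168.5.11 as its management address and PC1 runs omniNames on 192.168.5.10, as described above), pinning the endPoint on PC2 keeps the 10GigE interface from ever being advertised:

    # /etc/omniORB.cfg on PC2 (Device Manager B); adjust addresses to your network
    InitRef = NameService=corbaname::192.168.5.10:2809
    endPoint = giop:tcp:192.168.5.11:

    You would do the same on PC1 with its own 192.168.5.10 address.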

    Now that we are done, we can clean up our docker instances and docker networks with:

    docker rm -f domain sandbox
    docker network rm bad_net_1 bad_net_2