
Can't SSH into RancherOS which is installed in iohyve in FreeNAS within a virtual machine


I'm preparing for a server upgrade, but before doing so I want to do a dry run within a VM.

I'm running Linux Mint on a laptop. Currently I have FreeNAS v9.10.2-U6 installed within QEMU and RancherOS v1.5.6 installed into a VM via iohyve.

[laptop]
  |_ [QEMU]
    |_ [FreeNAS]
      |_ [iohyve]
        |_ [RancherOS]

I'm able to SSH into FreeNAS with no problem, but I can't SSH into Rancher: the connection attempt eventually times out. When I run the ssh command with -vvv, it hangs on debug1: Connecting to <RANCHER_IP> [<RANCHER_IP>] port 22. before timing out.

This is what I've tried so far:

This is my first time dealing with networking within nested VMs, so I'm not certain if there's something simple I'm missing. I look forward to any insight the community may have.


Solution

  • TL;DR: I had to disable hardware offloading within the FreeNAS VM. For a persistent fix, I went to Init/Shutdown Scripts in the FreeNAS GUI and created a Post-Init Command script that ran

    ifconfig vtnet0 -rxcsum -txcsum -rxcsum6 -txcsum6 -vlanmtu -vlanhwtag -vlanhwfilter -vlanhwtso -tso -tso4 -tso6 -lro -vlanhwtso -vlanhwcsum
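
A Post-Init script could also confirm that the flags actually took effect. The following is only a sketch under assumptions: the helper names (remaining_offloads, disable_offloads), the IFACE variable, and the "apply" argument are hypothetical, and the parser assumes the usual FreeBSD ifconfig layout where capabilities appear on an options=...<A,B,C> line.

```shell
#!/bin/sh
# Sketch of a Post-Init helper. IFACE, the function names and the "apply"
# argument are illustrative assumptions, not part of FreeNAS itself.
IFACE="${IFACE:-vtnet0}"

# Read `ifconfig <iface>` output on stdin and print, one per line, any
# checksum/TSO/LRO offload capabilities still listed on the options= line.
remaining_offloads() {
    sed -n 's/^[[:space:]]*options=[0-9a-f]*<\(.*\)>.*$/\1/p' \
        | tr ',' '\n' \
        | grep -E '^(RXCSUM|TXCSUM|RXCSUM_IPV6|TXCSUM_IPV6|TSO4|TSO6|LRO|VLAN_HWTSO|VLAN_HWCSUM)$'
}

# Same flags as the command above (the repeated -vlanhwtso is listed once).
disable_offloads() {
    ifconfig "$IFACE" -rxcsum -txcsum -rxcsum6 -txcsum6 -vlanmtu -vlanhwtag \
        -vlanhwfilter -vlanhwtso -tso -tso4 -tso6 -lro -vlanhwcsum
}

# Only touch the interface when called as: script.sh apply
if [ "${1:-}" = "apply" ]; then
    disable_offloads || exit 1
    left=$(ifconfig "$IFACE" | remaining_offloads | tr '\n' ' ')
    if [ -n "$left" ]; then
        echo "offloads still enabled on $IFACE: $left" >&2
        exit 1
    fi
fi
```

Run without arguments, the script changes nothing, so the parsing helper can be tried against saved ifconfig output first.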
    

    Full Troubleshooting Steps:

    1. Verified the MTU for Host, FreeNAS, and Rancher were all the same (1500)
      • Host: ifconfig | grep mtu
      • FreeNAS: ifconfig | grep mtu
      • Rancher: ifconfig | grep MTU
    2. Verified Rancher has outside access: ping google.com
    3. Verified the Host, FreeNAS, and Rancher could communicate
      • Host to FreeNAS: ping <FREENAS_IP>
      • Host to Rancher: ping <RANCHER_IP>
      • FreeNAS to Host: ping <HOST_IP>
      • FreeNAS to Rancher: ping <RANCHER_IP>
      • Rancher to Host: ping <HOST_IP>
      • Rancher to FreeNAS: ping <FREENAS_IP>
    4. Verified sshd is running in the Rancher VM: ps -ef | grep sshd
      • Also tried restarting sshd: sudo system-docker restart console in case there was some sort of race condition.
    5. Verified the Rancher VM is listening on the SSH port: netstat -nl | grep :22
    6. Verified routing tables, and that there was a default gateway for all
      • Host: route
      • FreeNAS: netstat -r
      • Rancher: route
    7. Tried adding a dedicated SSH port and listening IP for Rancher, and verified via netstat that just that IP and Port were being listened to. This was to rule out any possible port conflicts.
    8. Checked iptables rules on the Host and Rancher (FreeNAS had no firewall rules set up, see the last bullet) and there weren't any rules blocking communication.
      • Turned the firewall rules off, then restarted Rancher's sshd (nada), then rebooted the FreeNAS VM (nada).
      • There is a firewall tool in FreeNAS, but I verified that nothing was set up with: ipfw table all list.
    9. While in FreeNAS I checked network traffic to see whether my SSH request was even arriving. For each case I had two terminals open: one connected to FreeNAS, the other trying to connect to Rancher. The output in the Live env is long (because the SSH connection did complete there), so I'm only including the first logged packet for each case, which contains the pertinent info.
      • On Live: sudo tcpdump -nnvvS '(src <HOST_IP> and dst <RANCHER_IP>) or (src <RANCHER_IP> and dst <HOST_IP>)'.
        tcpdump: listening on ix0, link-type EN10MB (Ethernet), capture size 65535 bytes
        15:01:53.957264 IP (tos 0x0, ttl 64, id 56881, offset 0, flags [DF], proto TCP (6), length 60)
             <HOST_IP>.60648 > <RANCHER_IP>.22: Flags [S], cksum 0xfae8 (correct), seq 468317589, win 64240, options [mss 1460,sackOK,TS val 2321761697 ecr 0,nop,wscale 7], length 0
        
      • On VM: sudo tcpdump -nnvvS '(src <HOST_IP> and dst <RANCHER_IP>) or (src <RANCHER_IP> and dst <HOST_IP>)'
        tcpdump: listening on vtnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
        14:59:03.029922 IP (tos 0x0, ttl 64, id 25421, offset 0, flags [DF], proto TCP (6), length 60)
             <HOST_IP>.45688 > <RANCHER_IP>.22: Flags [S], cksum 0x8403 (incorrect -> 0x69a6), seq 3645881181, win 64240, options [mss 1460,sackOK,TS val 1007017042 ecr 0,nop,wscale 7], length 0
        
      • Noticed that cksum was reported as incorrect a lot, so on the Host I ran ethtool --show-offload <ETHERNET_INTERFACE_NAME> | grep tx-checksumming, which told me it was on. I ran sudo ethtool -K <ETHERNET_INTERFACE_NAME> tx off to disable it, then re-ran tcpdump and the ssh command, but cksum was still incorrect, so I re-enabled checksumming with sudo ethtool -K <ETHERNET_INTERFACE_NAME> tx on. At least I thought that last command reset things: after a reboot of FreeNAS the network was no longer reachable. I ended up running sudo ethtool --reset <ETHERNET_INTERFACE_NAME> all, and eventually recreated the VM from scratch and rebooted my system to get things reset.
    10. Finally came across the solution in this post after a Google search for iohyve tap0 or epair of all things. Quoting the relevant info in case the post disappears at some point.

      I ran into a very similar situation recently. I could ping the jails to & from bhyve guests but I could not pass any actual traffic. From other physical devices I had no issue passing traffic. The problem ended up being the hardware offloaders (TSO, HWSUM, etc) were causing the issue, which I found kind of ironic considering the traffic was not making it to the hardware in my case. I used tcpdump and could see the traffic had checksum errors. I turn off the hardware offloaders and everything started working, took me two weeks to figure this out. In hindsight I should of ran tcpdump on the first day.

      Try turning off the hardware offloading, then rerun ifconfig -v if it took effect, then test to see if you can pass actual traffic.

      Disable hardware offloading:

      ifconfig igb0 -rxcsum -txcsum -rxcsum6 -txcsum6 -vlanmtu -vlanhwtag -vlanhwfilter -vlanhwtso -tso -tso4 -tso6 -lro -vlanhwtso -vlanhwcsum
      
      • So for my use case I SSH'd into FreeNAS, made sure the Rancher VM was stopped, disabled the offloading (replacing igb0 with vtnet0), started the Rancher VM back up, and finally tried to SSH into Rancher... and succeeded. Basically, my previous attempt to disable offloading was the right idea, but I needed to do it within FreeNAS, not on the Host, which is a bit counterintuitive to me considering it's a bridged network and I'm passing my exact hardware resources through to the VMs.
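
Since the checksum column in tcpdump was the clue that mattered, a small filter can make it easier to spot. This is a minimal sketch, assuming verbose tcpdump text (-vv) on stdin in the same format as the captures above; the function name cksum_summary is hypothetical.

```shell
#!/bin/sh
# Summarise checksum verdicts in verbose tcpdump output read from stdin.
# Prints "<bad>/<total>", counting only lines that carry a cksum verdict.
cksum_summary() {
    awk '
        /cksum 0x[0-9a-f]+ \(incorrect/ { bad++ }
        /cksum 0x[0-9a-f]+ \(/          { total++ }
        END { printf "%d/%d\n", bad, total }
    '
}
```

For example, something like sudo tcpdump -nnvvS -c 20 'host <RANCHER_IP> and port 22' | cksum_summary, run inside FreeNAS while attempting the SSH connection, would show how many of the 20 captured packets had a bad checksum. Keep in mind that a capture taken on the sending machine can show incorrect checksums on outgoing packets simply because the checksum is filled in later by the offload engine, so the count is most telling on the receiving side.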