bash ssh pty job-control process-group

Why can a background ssh take over the tty from Bash?


(I'm using Bash 4.4.12 on Debian 8. This question was also asked on the bash mailing list.)

See the following steps to reproduce the problem.

From tty #1 (pts/2):

[STEP 101] # tty
/dev/pts/2
[STEP 102] # ssh -o ControlMaster=yes -o ControlPath=/tmp/socket.ssh -N -f 127.0.0.1
[STEP 103] # ps -C ssh u
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       1390  0.0  0.0  36440   656 ?        Ss   11:33   0:00 ssh -o ControlMaster=yes -o ControlPath=/tmp/so
[STEP 104] #
[STEP 105] # ssh -o ControlMaster=no -o ControlPath=/tmp/socket.ssh \
             127.0.0.1 sleep 3600 &
[1] 1396
[STEP 106] #    <-- Here I cannot input anything except <CTRL-C>

STEP 102 starts the multiplexed SSH master connection, which runs as a daemon. STEP 105 uses that multiplexed connection to run a sleep command in the background. After that I cannot type anything into the current shell. If I kill the ssh ... sleep & process, Bash accepts my input again, so it seems that all input is being consumed by the background ssh process.

Go to tty #2 (pts/3):

[STEP 201] # tty
/dev/pts/3
[STEP 202] # ps t pts/2 j
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
   723   1353   1353   1353 pts/2      1353 Ss+      0   0:00 bash
  1353   1396   1396   1353 pts/2      1353 S        0   0:00 ssh -o ControlMaster=no -o ControlPath=/tmp/socket.ssh 127.0.0.1 sleep 3600
[STEP 203] # ps s 1396
  UID    PID  PENDING  BLOCKED  IGNORED    CAUGHT STAT TTY    TIME COMMAND
    0   1396 00000000 00000000 00001000 188004003 S    pts/2  0:00 ssh -o ControlMaster=no -o ControlPath=/tmp/socket.ssh 127.0.0.1 sleep 3600
[STEP 204] #

I decoded the sig masks:

PENDING (00000000):
BLOCKED (00000000):
IGNORED (00001000):
  13 PIPE
CAUGHT (188004003):
   1 HUP
   2 INT
  15 TERM
  28 WINCH
  32
  33

Here we can see that the ssh process does not catch SIGTTIN, and it does not block or ignore it either. That is what confuses me, because a background job (process group) should receive SIGTTIN and be stopped when it attempts to read from the tty.
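
The masks printed by ps s are hexadecimal bit fields in which bit n-1 stands for signal n. The decoding above can be reproduced with a short Python sketch. It assumes Linux signal numbering, and the helper name decode_sigmask is made up for illustration:

```python
import signal


def decode_sigmask(hex_mask):
    """Return the signals set in a hex mask as printed by `ps s`."""
    mask = int(hex_mask, 16)
    names = []
    for signo in range(1, 65):
        # Bit (signo - 1) of the mask corresponds to signal number signo.
        if mask & (1 << (signo - 1)):
            try:
                names.append(signal.Signals(signo).name)
            except ValueError:
                # No symbolic name is known for this number (e.g. 32, 33).
                names.append(str(signo))
    return names


if __name__ == "__main__":
    for label, mask in [("IGNORED", "00001000"), ("CAUGHT", "188004003")]:
        print(label, decode_sigmask(mask))
    # SIGTTIN does not appear in either mask from STEP 203.
    print("SIGTTIN" in decode_sigmask("188004003"))
```

Signals 32 and 33 have no symbolic name in Python's signal module, so the sketch prints them as plain numbers, matching the unnamed entries in the list above.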


Solution

  • I think I've figured out what's happening. Let me explain it.

    From tty #1 (pts/2):

    [STEP 300] # tty
    /dev/pts/2
    [STEP 301] # ssh -o ControlMaster=yes -o ControlPath=/tmp/socket.ssh -N -f 127.0.0.1 < /dev/null >& /dev/null
    [STEP 302] # ps -C ssh j
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
        1  4052  4052  4052 ?           -1 Ss       0   0:00 ssh -o ControlMaster=yes -o ControlPath=/tmp/socket.ssh -N -f 127.0.0.1
    [STEP 303] # ls -l /proc/4052/fd/
    total 0
    lr-x------ 1 root root 64 2017-06-12 22:59 0 -> /dev/null
    l-wx------ 1 root root 64 2017-06-12 22:59 1 -> /dev/null
    l-wx------ 1 root root 64 2017-06-12 22:59 2 -> /dev/null
    lrwx------ 1 root root 64 2017-06-12 22:59 3 -> socket:[370151]
    lrwx------ 1 root root 64 2017-06-12 22:59 4 -> socket:[370201]
    [STEP 304] # ssh -o ControlMaster=no -o ControlPath=/tmp/socket.ssh 127.0.0.1 sleep 3600 &
    [1] 4062
    [STEP 305] #    <-- Cannot input anything
    

    Go to tty #2 (pts/3):

    [STEP 401] # tty
    /dev/pts/3
    [STEP 402] # ps t pts/2 j
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
      579  3552  3552  3552 pts/2     3552 Ss+      0   0:00 bash
     3552  4062  4062  3552 pts/2     3552 S        0   0:00 ssh -o ControlMaster=no -o ControlPath=/tmp/socket.ssh 127.0.0.1 sleep 3600
    [STEP 403] # ls -l /proc/4062/fd/    # The `ssh ... sleep' process
    total 0
    lrwx------ 1 root root 64 2017-06-12 23:00 0 -> /dev/pts/2
    lrwx------ 1 root root 64 2017-06-12 23:00 1 -> /dev/pts/2
    lrwx------ 1 root root 64 2017-06-12 23:00 2 -> /dev/pts/2
    lrwx------ 1 root root 64 2017-06-12 23:00 3 -> socket:[370349]
    [STEP 404] # ls -l /proc/4052/fd/    # The `ssh -o ControlMaster=yes' process
    total 0
    lr-x------ 1 root root 64 2017-06-12 22:59 0 -> /dev/null
    l-wx------ 1 root root 64 2017-06-12 22:59 1 -> /dev/null
    l-wx------ 1 root root 64 2017-06-12 22:59 2 -> /dev/null
    lrwx------ 1 root root 64 2017-06-12 22:59 3 -> socket:[370151]
    lrwx------ 1 root root 64 2017-06-12 22:59 4 -> socket:[370201]
    lrwx------ 1 root root 64 2017-06-12 23:02 5 -> socket:[370350]
    lrwx------ 1 root root 64 2017-06-12 23:02 6 -> /dev/pts/2
    lrwx------ 1 root root 64 2017-06-12 23:02 7 -> /dev/pts/2
    lrwx------ 1 root root 64 2017-06-12 23:02 8 -> /dev/pts/2
    [STEP 405] #
    

    STEP 403's output shows that the ssh ... sleep process's stdin, stdout and stderr are open on pts/2. This is normal.

    But STEP 404's output (compared with STEP 303) shows that the ssh -o ControlMaster=yes process now has pts/2 open as well. I believe this is how multiplexed SSH works: the new ssh ... sleep process passes its open file descriptors to the ssh -o ControlMaster=yes process through the UNIX domain socket (-o ControlPath=/tmp/socket.ssh). So it is actually the ssh -o ControlMaster=yes process that consumes all input from pts/2. And because the ssh -o ControlMaster=yes process is not in the same session as the bash process (and ssh ... sleep), the job control mechanism does not apply to it, even though it is running in the background (as a daemon) and reading from pts/2.
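
The comparison made by hand in STEP 403 and STEP 404 can also be done in one pass by scanning every /proc/PID/fd directory for links to the terminal. The following Python sketch assumes a Linux-style /proc. The function name tty_holders and its proc_root parameter are hypothetical, introduced only for this illustration:

```python
import os


def tty_holders(target, proc_root="/proc"):
    """Map each PID to the sorted fd numbers it has open on `target`."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fd_names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd_name in fd_names:
            try:
                link = os.readlink(os.path.join(fd_dir, fd_name))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if link == target:
                found.setdefault(int(entry), []).append(int(fd_name))
    return {pid: sorted(fds) for pid, fds in found.items()}


if __name__ == "__main__":
    # Run as root to see other users' processes.
    for pid, fds in sorted(tty_holders("/dev/pts/2").items()):
        print(pid, fds)
```

Run against the situation in STEP 404, a scan like this would be expected to list the daemonized master next to bash and the ssh ... sleep client, even though ps t pts/2 does not show the master.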

    To put it another way: SIGTTIN is only sent to a process that runs as a background job and tries to read from its controlling terminal. Here the ssh -o ControlMaster=yes process is running in the background, but it is not a job in the bash process's session and it does not have a controlling terminal at all.
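
The rule can be written down as a small decision function and applied to the PGID, SID, TTY and TPGID columns from STEP 302 and STEP 402. This is a simplified model and not kernel code: it ignores the cases where SIGTTIN is blocked or ignored and where the process group is orphaned (the read then fails instead). The names Proc, Tty and read_outcome are invented for the sketch:

```python
from typing import NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    pgid: int
    sid: int
    tty: Optional[str]  # controlling terminal; None where ps prints "?"


class Tty(NamedTuple):
    name: str
    sid: int    # session this terminal is the controlling terminal of
    tpgid: int  # foreground process group of the terminal


def read_outcome(proc, tty):
    """Classify what happens when `proc` reads from `tty` (simplified)."""
    if proc.tty != tty.name or proc.sid != tty.sid:
        # Not the reader's controlling terminal: job control does not apply.
        return "no-job-control"
    if proc.pgid == tty.tpgid:
        return "foreground"
    return "SIGTTIN"


pts2 = Tty(name="pts/2", sid=3552, tpgid=3552)
bash = Proc(pid=3552, pgid=3552, sid=3552, tty="pts/2")
client = Proc(pid=4062, pgid=4062, sid=3552, tty="pts/2")  # ssh ... sleep
master = Proc(pid=4052, pgid=4052, sid=4052, tty=None)     # ControlMaster=yes

if __name__ == "__main__":
    for name, proc in [("bash", bash), ("client", client), ("master", master)]:
        print(name, read_outcome(proc, pts2))
```

In this model the ssh ... sleep client would be stopped by SIGTTIN if it read from pts/2 itself, while the master falls into the no-job-control case, which is consistent with the master being the process that actually reads the terminal.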


    A bit more about passing file descriptors between processes through UNIX domain sockets (from Wikipedia):

    In addition to sending data, processes may send file descriptors across a Unix domain socket connection using the sendmsg() and recvmsg() system calls. This allows the sending processes to grant the receiving process access to a file descriptor for which the receiving process otherwise does not have access.
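
To make the quoted mechanism concrete, here is a minimal Python sketch of descriptor passing with sendmsg() and recvmsg() using an SCM_RIGHTS ancillary message. It is an illustration of the general technique and not OpenSSH's code. For simplicity both ends of the socket live in one process, a pipe stands in for pts/2, and the names send_fds, recv_fds and demo are chosen for this example:

```python
import array
import os
import socket


def send_fds(sock, fds):
    """Send the descriptors in `fds` over a connected UNIX domain socket."""
    # One byte of ordinary data is sent along with the ancillary message.
    sock.sendmsg(
        [b"F"],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))],
    )


def recv_fds(sock, maxfds):
    """Receive up to `maxfds` descriptors sent with send_fds()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing bytes that do not form a whole int.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    return list(fds)


def demo():
    """Pass a pipe's read end across a socket pair and read through the copy."""
    client, master = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for the terminal
    try:
        send_fds(client, [read_end])            # "client" hands over its fd
        (received,) = recv_fds(master, maxfds=1)
        try:
            os.write(write_end, b"typed input")
            # The receiver reads the data through its own descriptor.
            return os.read(received, 100)
        finally:
            os.close(received)
    finally:
        os.close(read_end)
        os.close(write_end)
        client.close()
        master.close()


if __name__ == "__main__":
    print(demo())
```

The receiving side gets its own descriptor referring to the same open file, which is why a master process in a different session can read what is typed on pts/2 once a client has handed over its stdin.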