
Mount filesystem after clone with CLONE_NEWNS flag


I'm trying to implement the following scenario:

  1. clone() the main process with the CLONE_NEWNS flag (that is, into a new mount namespace)
  2. mount() a new filesystem in the child process
  3. when the child process finishes, all filesystems mounted in it are unmounted

But it doesn't work as I expected: I still see the mounted filesystems in the main process. What am I doing wrong?

The source is here: https://github.com/dmitrievanthony/sprat/blob/master/src/container.c#L47

The system is a default AWS Ubuntu:

ubuntu@ip-172-31-31-112:~/sprat$ uname -a
Linux ip-172-31-31-112 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Solution

  • Short answer: It looks like the type of mount propagation isn't properly set.


    Explanation

    The Linux kernel defaults all mounts to MS_PRIVATE, but systemd overrides this during early boot to MS_SHARED, for the convenience of nspawn. This can be observed by looking at the optional fields of /proc/$PID/mountinfo. For instance, something like this might be expected:

    $ cat /proc/self/mountinfo
      . . .
    25 0 8:6 / / rw,relatime shared:1 - ext4 /dev/sda6 rw,errors=remount-ro,data=ordered
                             ^^^^^^
      . . .
    

    Notice the shared:1 field marked with carets above (the marking is mine). It indicates that the current propagation type of the / mount point is MS_SHARED and that its peer group ID is 1 (the peer group ID does not matter in our case).
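To check the propagation type from a program rather than by eye, the optional fields can be pulled out of a mountinfo line. The following is a minimal sketch, not part of the original code: mount_propagation is a hypothetical helper name, and it assumes the field layout documented in proc(5), where the optional fields follow the sixth field and end at a lone "-" separator. A line with no optional fields is reported as "private".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the optional fields of one
 * /proc/<pid>/mountinfo line (e.g. "shared:1", "master:2") into out.
 * Writes "private" when the line has no optional fields.
 * Returns 0 on success, -1 if no "-" separator is found after field 6. */
int mount_propagation(const char *line, char *out, size_t outlen)
{
    int field = 0;
    const char *p = line;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    while (*p != '\0') {
        size_t len;

        p += strspn(p, " \n");      /* skip separators */
        len = strcspn(p, " \n");    /* length of the next field */
        if (len == 0)
            break;
        field++;

        if (field > 6) {            /* past the six fixed fields */
            size_t used;

            if (len == 1 && *p == '-') {
                if (out[0] == '\0')
                    snprintf(out, outlen, "private");
                return 0;
            }
            used = strlen(out);
            snprintf(out + used, outlen - used, "%s%.*s",
                     used ? " " : "", (int)len, p);
        }
        p += len;
    }
    return -1;
}
```

Feeding it each line read from /proc/self/mountinfo with fgets() in both the "main" and the "contained" process shows whether a given mount is still shared between them.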

    When using the CLONE_NEWNS flag on clone(2) a new mount namespace is created, which is initialized as a copy of the caller's mount namespace. The new, replicated mount points of the new namespace join the same peer group as their respective original mount points in the caller's mount namespace.

    A new mount point whose parent's propagation type is MS_SHARED is itself MS_SHARED. Thus, when your "contained" process mount()s the filesystem on the loop device, that mount is MS_SHARED by default. All mounts later made under it are then propagated to the "main" process's namespace as well, which is why the "main" process can see them.

    For your request to be satisfied (for the "main" process not to see "contained" process's mount points), the mount propagation type you seek is either MS_SLAVE or MS_PRIVATE, depending on whether you want your "contained" process's root mount point to receive propagation events from other peers or not, respectively. Obviously, MS_PRIVATE offers greater isolation than MS_SLAVE.

    Thus, in your case, it should be sufficient to change the propagation type of "contained" process's root mount point to MS_PRIVATE or MS_SLAVE before you mount the rest of the filesystems, so the mounts won't be propagated to "main" process's namespace.


    The code

    At first, one would try to set the propagation type properly when the "contained" process creates its root mount point.

    However, I noticed the following in man 8 mount (quoting):

    Note that the Linux kernel does not allow to change multiple propagation flags with a single mount(2) system call, and the flags cannot be mixed with other mount options.

    Since util-linux 2.23 the mount command allows to use several propagation flags together and also together with other mount operations. This feature is EXPERIMENTAL. The propagation flags are applied by additional mount(2) system calls when the preceding mount operations were successful.

    Looking at your code, after the "contained" process mount()s the filesystem on the loop device, it chroot()s into it. At that point, you could set the propagation type by adding this mount(2) call:

    if (chroot(".") < 0) {
        // handle error
    }
    
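    // Only the propagation type changes here; mount(2) ignores the source,
    // filesystem type, and data arguments when MS_PRIVATE is given.
    if (mount(NULL, "/", NULL, MS_PRIVATE, NULL) < 0) {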
        // handle error
    }
    
    if (mkdir(...)) {
        // handle error
    }
    

    Now that the propagation type is set to MS_PRIVATE, none of the subsequent mounts that the "contained" process makes under / will be propagated, so they won't be visible in the "main" process's namespace. You can verify this in /proc/mounts or /proc/$PID/mountinfo.
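A common variant of the same idea is to change the propagation type of the whole inherited mount tree as the very first step in the child, before any mount() at all, so that even the loop-device mount itself is not propagated. The sketch below is an assumption about how that could look and is not taken from the linked source: propagation_flags and isolate_mount_namespace are hypothetical names, and the receive_host_mounts switch maps to the MS_SLAVE versus MS_PRIVATE choice discussed above. MS_REC applies the change recursively to every mount under /.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: choose the propagation flags.
 * MS_SLAVE still receives mount events from the original peers,
 * MS_PRIVATE neither sends nor receives any. */
unsigned long propagation_flags(int receive_host_mounts)
{
    return MS_REC | (receive_host_mounts ? MS_SLAVE : MS_PRIVATE);
}

/* Hypothetical helper: call first in the child created with
 * clone(CLONE_NEWNS), before mounting anything. Needs CAP_SYS_ADMIN.
 * Returns 0 on success, -1 with errno set on failure. */
int isolate_mount_namespace(int receive_host_mounts)
{
    /* source, filesystem type, and data are ignored for this operation */
    return mount(NULL, "/", NULL,
                 propagation_flags(receive_host_mounts), NULL);
}
```

With this in place, the child would call isolate_mount_namespace(0) before mounting the loop device. Calling it outside a new mount namespace would change the propagation of the caller's own mounts, so it belongs only in the cloned child.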

