Docker (containers): cgroup/namespace setup vs. running Dockerfile commands as root?

From my understanding, Docker sets up the required cgroups and namespaces so that containers (i.e. container processes) run in isolation (an isolated environment on the host system) and have limited permissions and access to the host system. So even if a process is running as root in the container, it will not have root access on the host system.

But from this article, processes-in-containers-should-not-run-as-root, I see that it is still possible for a container process running as root to access host files that are accessible only to root on the host system.

On host system:

root@srv:/root# ls -l
total 4
-rw------- 1 root root 17 Sep 26 20:29 secrets.txt

Dockerfile:

FROM debian:stretch
CMD ["cat", "/tmp/secrets.txt"]

Running the image built from the above Dockerfile:

marc@srv:~$ docker run -v /root/secrets.txt:/tmp/secrets.txt <img>
top secret stuff

If "top secret stuff" is readable, how is that possible? Then what is the point of container isolation? It seems there is something I am missing.

(Does it have to do with how I use docker run? By default, are all permissions/capabilities given to the container based on the user running the docker run command?)

Solution

A container can only access the host filesystem if the operator explicitly gives it access. For example, try without any docker run -v options:

# --rm cleans up the container when done; -u root explicitly requests the
# root user; busybox is the image to run. The command dumps the
# _container's_ password file, not the host's.
docker run \
  --rm     \
  -u root  \
  busybox  \
  cat /etc/shadow

More generally, the rule (on native Linux without user namespace remapping) is that, if files are bind-mounted from the host into a container, they are accessible if the container's numeric user or group IDs match the file's ownership and permissions. If a file is owned by uid 1000 on the host with mode 0600, it can be read by uids 0 or 1000 in the container, regardless of the corresponding container and host users' names.
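
The ownership rule above can be approximated on the host before anything is mounted. The following is only a sketch under stated assumptions: can_read_as_uid is a hypothetical helper, it relies on GNU stat (Linux), it looks only at the owner and "other" permission bits (ignoring group membership, ACLs, and user namespace remapping), and it assumes uid 0 keeps the default ability to bypass file permission checks.

```shell
#!/bin/sh
# Hypothetical helper: could numeric uid $2 read file $1?
# Simplified on purpose: owner and "other" read bits only (no groups, no ACLs).
can_read_as_uid() {
  path=$1
  uid=$2
  # Assumption: uid 0 bypasses file permission checks (default capabilities).
  [ "$uid" -eq 0 ] && return 0
  owner=$(stat -c '%u' "$path") || return 2   # numeric owner uid
  mode=$(stat -c '%a' "$path") || return 2    # octal mode, e.g. 600
  perm=$((0$mode))                            # interpret the mode as octal
  if [ "$uid" -eq "$owner" ]; then
    [ $((perm & 0400)) -ne 0 ]                # owner read bit
  else
    [ $((perm & 0004)) -ne 0 ]                # "other" read bit
  fi
}

# Example call (path and uid are placeholders):
#   can_read_as_uid /root/secrets.txt 1000 && echo readable || echo "not readable"
```

Only numeric IDs enter the check, which is why the user names on either side of the container boundary are irrelevant.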

The corollary to this is that anyone who can run any docker run command at all can pretty trivially root the entire host.

# -v /:/host bind-mounts the host's root filesystem into the container,
# so the command dumps the host's hashed password file.
docker run   \
  --rm       \
  -u root    \
  -v /:/host \
  busybox    \
  cat /host/etc/shadow

The root user in a container is further limited by Linux capabilities: without giving special additional Docker options, even running as root, a container can't change filesystem mounts, modify the network configuration, load kernel modules, reboot the host, or do several other extra-privileged things. (And it's usually better to do these things outside a container than to give extra permission to Docker; don't casually run containers --privileged.)
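
One way to observe this limit is to decode the effective capability mask that the kernel reports on the CapEff: line of /proc/self/status. The following is a minimal sketch, not a definitive tool: has_cap and current_mask are hypothetical helpers, the bit numbers are taken from linux/capability.h, and the mask a80425fb used in the usage note is the value commonly seen for a root process under Docker's default capability set, which may differ between Docker versions and configurations.

```shell
#!/bin/sh
# Hypothetical helper: is capability bit $2 set in the hex mask $1?
has_cap() {
  mask=$1
  bit=$2
  [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

# Capability bit numbers from linux/capability.h
CAP_DAC_OVERRIDE=1   # bypass file read/write/execute permission checks
CAP_NET_ADMIN=12     # modify the network configuration
CAP_SYS_MODULE=16    # load kernel modules
CAP_SYS_ADMIN=21     # mount filesystems, among many other things
CAP_SYS_BOOT=22      # reboot the host

# Hypothetical helper: print the effective capability mask of a process
# started from this shell (Linux only).
current_mask() {
  awk '/^CapEff:/ { print $2 }' /proc/self/status
}
```

With the assumed default mask, has_cap a80425fb "$CAP_DAC_OVERRIDE" succeeds, while the same check for CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_MODULE, and CAP_SYS_BOOT fails. Inside a running container, has_cap "$(current_mask)" "$CAP_SYS_ADMIN" performs the check against the live process.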

It's still generally better practice to run containers as non-root users. The user ID doesn't need to match any user ID in particular; it just needs to not be 0 (matching a specific host uid isn't portable across hosts and isn't recommended). The files in the container generally should be owned by root, so that the non-root process can't accidentally overwrite them.

FROM debian

# Create the non-root user
RUN adduser --system --no-create-home nonroot

# Do the normal installation, as root
# (COPY without a --chown option, and no RUN chown either)
COPY ...
RUN ...

# Specify the non-root user only for the final container
EXPOSE 12345
USER nonroot
CMD the main container command

If the container does need to read or (especially) write host files, bind-mount the host directory into some data-specific directory in the container (do not overwrite the application code with this mount) and use the docker run -u option to specify the host uid that the container needs to run as. The user does not specifically need to exist in the container's /etc/passwd file.

# -v bind-mounts the current directory as data;
# -u specifies the (host) user ID to run as.
docker run            \
  -v "$PWD:/app/data" \
  -u "$(id -u)"       \
  ...