Why does loading a seccomp filter affect the permitted and effective capability sets?

I have recently been writing programs with libcap and libseccomp, and I noticed a problem when using them together.

In the following minimal reproducible example, I first set the current process's capabilities so that only P(inheritable) = CAP_NET_RAW, with all other capability sets cleared. Then I initialize a seccomp filter with SCMP_ACT_ALLOW as the default action (allowing all system calls), load it, and release it.

Finally, the program prints its current capabilities and executes capsh --print to show its capabilities after execve().

#include <linux/capability.h>
#include <sys/capability.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <seccomp.h>

#define CAPSH "/usr/sbin/capsh"

int main(void) {
    cap_value_t net_raw = CAP_NET_RAW;

    cap_t caps = cap_init();
    cap_set_flag(caps, CAP_INHERITABLE, 1, &net_raw, CAP_SET);
    if (cap_set_proc(caps)) {
        perror("cap_set_proc");
    }
    cap_free(caps);

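    scmp_filter_ctx ctx;
    if ((ctx = seccomp_init(SCMP_ACT_ALLOW)) == NULL) {
        // seccomp_init() is not documented to set errno, so report plainly
        fprintf(stderr, "seccomp_init failed\n");
        return 1;
    }

    int rc = 0;
    rc = seccomp_load(ctx); // comment this line later
    if (rc < 0) {
        // libseccomp returns a negative error code instead of setting errno
        fprintf(stderr, "seccomp_load failed: %d\n", rc);
    }
    seccomp_release(ctx);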

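    // Print the current capabilities, then free what libcap allocated
    cap_t cur = cap_get_proc();
    char *text = cap_to_text(cur, NULL);
    printf("Process capabilities: %s\n", text);
    fflush(stdout); // execve() discards unflushed stdio buffers
    cap_free(text);
    cap_free(cur);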
    
    char *argv[] = {
        CAPSH,
        "--print",
        NULL
    };
    execve(CAPSH, argv, NULL);
    return -1;

}

Compile with -lcap and -lseccomp, run it as root (UID=EUID=0), and you get this:

Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+i
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

This shows that both the current process and the executed capsh have only a non-empty inheritable set. However, if I comment out the line rc = seccomp_load(ctx);, the result is different:

Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+eip cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

Before execve(), the result is the same as above. After it, however, all other capabilities are back in the permitted and effective sets.

I looked up capabilities(7) and found the following in the manual:

Capabilities and execution of programs by root
       In order to mirror traditional UNIX semantics, the kernel performs
       special treatment of file capabilities when a process with UID 0
       (root) executes a program and when a set-user-ID-root program is
       executed.

       After having performed any changes to the process effective ID that
       were triggered by the set-user-ID mode bit of the binary—e.g.,
       switching the effective user ID to 0 (root) because a set-user-ID-
       root program was executed—the kernel calculates the file capability
       sets as follows:

       1. If the real or effective user ID of the process is 0 (root), then
          the file inheritable and permitted sets are ignored; instead they
          are notionally considered to be all ones (i.e., all capabilities
          enabled).  (There is one exception to this behavior, described
          below in Set-user-ID-root programs that have file capabilities.)

       2. If the effective user ID of the process is 0 (root) or the file
          effective bit is in fact enabled, then the file effective bit is
          notionally defined to be one (enabled).

       These notional values for the file's capability sets are then used as
       described above to calculate the transformation of the process's
       capabilities during execve(2).

       Thus, when a process with nonzero UIDs execve(2)s a set-user-ID-root
       program that does not have capabilities attached, or when a process
       whose real and effective UIDs are zero execve(2)s a program, the
       calculation of the process's new permitted capabilities simplifies to:

           P'(permitted)   = P(inheritable) | P(bounding)

           P'(effective)   = P'(permitted)

       Consequently, the process gains all capabilities in its permitted and
       effective capability sets, except those masked out by the capability
       bounding set.  (In the calculation of P'(permitted), the P'(ambient)
       term can be simplified away because it is by definition a proper
       subset of P(inheritable).)

       The special treatments of user ID 0 (root) described in this
       subsection can be disabled using the securebits mechanism described below.

This is what confuses me: the inheritable set is not empty, so by the simplified rule the permitted and effective sets should not be empty either. However, loading the seccomp filter seems to violate this rule.

Solution

Seccomp itself doesn't do this, but libseccomp does.

Using strace, you can see seccomp_load actually performs three syscalls:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
seccomp(SECCOMP_SET_MODE_STRICT, 1, NULL) = -1 EINVAL (Invalid argument)
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=7, filter=0x5572a6213930}) = 0

Note how the first one looks suspicious.

From the kernel documentation on no_new_privs:

With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.

And from capabilities(7) that you quoted:

If the real or effective user ID of the process is 0 (root), then the file inheritable and permitted sets are ignored; instead they are notionally considered to be all ones (i.e., all capabilities enabled).

Your code creates an empty capability set (cap_t caps = cap_init()) and only adds CAP_NET_RAW as inheritable, with no capabilities permitted (hence = cap_net_raw+i). Because NO_NEW_PRIVS is set for this thread, execve may not grant more than the process already had, so the new permitted set is capped at the old (empty) permitted set instead of being restored to the full set as it normally would be for a root process (UID = 0 or EUID = 0). This explains the difference in the capsh --print output with and without seccomp_load().

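The NO_NEW_PRIVS flag cannot be reset once it's set (see prctl(2)), and seccomp_load() sets it by default for a reason: the kernel refuses to install a filter unless the thread either has no_new_privs set or holds CAP_SYS_ADMIN, so that an unprivileged process cannot impose a filter on a set-user-ID program it executes.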

To prevent seccomp_load() from setting NO_NEW_PRIVS, add the following code before loading the context:

seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0);

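See seccomp_attr_set(3) for more details. Be aware that with NO_NEW_PRIVS off, the kernel rejects the filter with EACCES unless the caller holds CAP_SYS_ADMIN in its effective set, which your program has already dropped by that point; you would have to load the filter before calling cap_set_proc().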

However, you should probably do it the right way and also add the desired capabilities to the permitted set:

cap_set_flag(caps, CAP_PERMITTED, 1, &net_raw, CAP_SET);