I've recently been writing programs with libcap and libseccomp, and I noticed a problem when using them together.
In the following minimal reproducible example, I first set the current process's capabilities to P(inheritable) = CAP_NET_RAW only, with all other capability sets cleared. Then I initialize a seccomp filter with the SCMP_ACT_ALLOW default action (allowing all system calls), load it, and release it.
Finally, the program prints its current capabilities and executes capsh --print to show its capabilities after execve().
#include <linux/capability.h>
#include <sys/capability.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <seccomp.h>

#define CAPSH "/usr/sbin/capsh"

int main(void) {
    /* Start from an empty capability state and raise only CAP_NET_RAW
       in the inheritable set. */
    cap_value_t net_raw = CAP_NET_RAW;
    cap_t caps = cap_init();
    cap_set_flag(caps, CAP_INHERITABLE, 1, &net_raw, CAP_SET);
    if (cap_set_proc(caps)) {
        perror("cap_set_proc");
    }
    cap_free(caps);

    /* Build and load a seccomp filter that allows every system call. */
    scmp_filter_ctx ctx;
    if ((ctx = seccomp_init(SCMP_ACT_ALLOW)) == NULL) {
        perror("seccomp_init");
    }
    int rc = 0;
    rc = seccomp_load(ctx); // comment this line later
    if (rc < 0)
        /* libseccomp returns a negative errno value instead of setting errno */
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
    seccomp_release(ctx);

    /* Print the capabilities before execve(), then let capsh print them after. */
    ssize_t y = 0;
    printf("Process capabilities: %s\n", cap_to_text(cap_get_proc(), &y));
    char *argv[] = {
        CAPSH,
        "--print",
        NULL
    };
    execve(CAPSH, argv, NULL);
    return -1;
}
Compile with -lcap and -lseccomp, run it as root (UID=EUID=0), and you get this:
Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+i
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
This shows that both the current process and the executed capsh have only a non-empty inheritable set. However, if I comment out the line rc = seccomp_load(ctx);, things are different:
Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+eip cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
Before execve(), the result is the same as above. But afterwards, all other capabilities are back in the permitted and effective sets.
I looked up capabilities(7) and found the following in the manual:
Capabilities and execution of programs by root
In order to mirror traditional UNIX semantics, the kernel performs
special treatment of file capabilities when a process with UID 0
(root) executes a program and when a set-user-ID-root program is
executed.
After having performed any changes to the process effective ID that
were triggered by the set-user-ID mode bit of the binary—e.g.,
switching the effective user ID to 0 (root) because a set-user-ID-
root program was executed—the kernel calculates the file capability
sets as follows:
1. If the real or effective user ID of the process is 0 (root), then
the file inheritable and permitted sets are ignored; instead they
are notionally considered to be all ones (i.e., all capabilities
enabled). (There is one exception to this behavior, described
below in Set-user-ID-root programs that have file capabilities.)
2. If the effective user ID of the process is 0 (root) or the file
effective bit is in fact enabled, then the file effective bit is
notionally defined to be one (enabled).
These notional values for the file's capability sets are then used as
described above to calculate the transformation of the process's
capabilities during execve(2).
Thus, when a process with nonzero UIDs execve(2)s a set-user-ID-root
program that does not have capabilities attached, or when a process
whose real and effective UIDs are zero execve(2)s a program, the
calculation of the process's new permitted capabilities simplifies to:
P'(permitted) = P(inheritable) | P(bounding)
P'(effective) = P'(permitted)
Consequently, the process gains all capabilities in its permitted and
effective capability sets, except those masked out by the capability
bounding set. (In the calculation of P'(permitted), the P'(ambient)
term can be simplified away because it is by definition a proper
subset of P(inheritable).)
The special treatments of user ID 0 (root) described in this
subsection can be disabled using the securebits mechanism described below.
And this is what confuses me: the inheritable set is not empty, so by the simplified rule the permitted and effective sets should not be empty either. However, loading a seccomp filter seems to violate this rule.
Seccomp itself doesn't do this, but libseccomp does.
Using strace, you can see that seccomp_load() actually performs three syscalls:
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
seccomp(SECCOMP_SET_MODE_STRICT, 1, NULL) = -1 EINVAL (Invalid argument)
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=7, filter=0x5572a6213930}) = 0
Note how the first one looks suspicious.
From the kernel documentation on no_new_privs:
With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
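To see that first call in isolation, here is a minimal sketch that sets and queries the flag with prctl(2) directly, without libseccomp. The helper names enable_no_new_privs and no_new_privs_is_set are made up for illustration and are not part of any library; the sketch assumes a Linux kernel with no_new_privs support (3.5 or later).

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the no_new_privs bit on the calling thread.
 * Returns 0 on success, -1 on error with errno set. Once set, the bit
 * is inherited across fork() and execve() and cannot be cleared. */
int enable_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Hypothetical helper: returns 1 if no_new_privs is set for the calling
 * thread, 0 if it is not, and -1 on error with errno set. */
int no_new_privs_is_set(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Calling something like no_new_privs_is_set() right before execve() in your program is a quick way to confirm whether seccomp_load() has set the flag.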
And from capabilities(7), which you quoted:
If the real or effective user ID of the process is 0 (root), then the file inheritable and permitted sets are ignored; instead they are notionally considered to be all ones (i.e., all capabilities enabled).
Your code creates an empty capability set (cap_t caps = cap_init()) and only adds CAP_NET_RAW as inheritable, with no capabilities permitted (as in = cap_net_raw+i). Then, because NO_NEW_PRIVS is set for this thread, calling execve does not restore the permitted set to the full set as it normally would for a root process (UID = 0 or EUID = 0). This explains the difference in the capsh --print output with and without seccomp_load().
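The effect can be sketched as a small bitmask model. This is a simplification based on my reading of the rules, not kernel code: it assumes real and effective UID 0, a binary without file capabilities, default securebits, and no ambient capabilities, and it models NO_NEW_PRIVS as capping the new permitted set at the old one. The names model_root_exec_permitted and MODEL_CAP_NET_RAW are made up for illustration.

```c
#include <stdint.h>

/* Bit number of CAP_NET_RAW as defined in <linux/capability.h>. */
#define MODEL_CAP_NET_RAW 13

/* Simplified model (not kernel code) of the permitted set after a root
 * process calls execve() on a file without file capabilities.
 * Each argument is a bitmask with one bit per capability. */
uint64_t model_root_exec_permitted(uint64_t inheritable, uint64_t bounding,
                                   uint64_t old_permitted, int no_new_privs)
{
    /* Root rule quoted from capabilities(7):
     * P'(permitted) = P(inheritable) | P(bounding) */
    uint64_t permitted = inheritable | bounding;

    /* Assumption of this model: with no_new_privs set, execve() may not
     * add anything to the permitted set, so the result is capped at the
     * set the process held before the call. */
    if (no_new_privs)
        permitted &= old_permitted;

    return permitted;
}
```

With inheritable = CAP_NET_RAW only, a full bounding set, and an empty old permitted set, the model yields an empty permitted set when no_new_privs is set and the full set when it is not, which matches the two capsh outputs in the question.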
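The NO_NEW_PRIVS flag cannot be reset once it's set (see prctl(2)), and there's a reason seccomp_load() sets it by default: the kernel only lets a thread install a seccomp filter if it either has no_new_privs set or holds CAP_SYS_ADMIN in its user namespace (see seccomp(2)).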
To prevent seccomp_load() from setting NO_NEW_PRIVS, add the following call before loading the context:
seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0);
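See seccomp_attr_set(3) for more details. Note that without NO_NEW_PRIVS the filter load needs CAP_SYS_ADMIN, which your example has already dropped from its effective set by that point, so I would expect seccomp_load() to fail there unless the filter is loaded before the capabilities are dropped.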
However, you should probably do it the right way and add the desired capabilities to the permitted set as well:
cap_set_flag(caps, CAP_PERMITTED, 1, &net_raw, CAP_SET);
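To check that the change took effect before calling execve(), you could query the thread's capability sets. With libcap you would normally use cap_get_proc() and cap_get_flag(); the sketch below uses the raw capget(2) syscall instead, only to stay free of library dependencies. The helper name cap_is_permitted_and_inheritable is made up for illustration.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if capability `cap` is in both the
 * permitted and inheritable sets of the calling thread, 0 if it is
 * missing from either set, and -1 on error or for an out-of-range cap. */
int cap_is_permitted_and_inheritable(int cap)
{
    /* pid 0 in the header means "the calling thread". */
    struct __user_cap_header_struct hdr = { _LINUX_CAPABILITY_VERSION_3, 0 };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {{0}};

    if (cap < 0 || cap >= 32 * _LINUX_CAPABILITY_U32S_3)
        return -1;
    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* Each data[] element carries 32 capability bits per set. */
    unsigned int word = (unsigned int)cap / 32;
    unsigned int bit = 1u << ((unsigned int)cap % 32);
    return (data[word].permitted & bit) && (data[word].inheritable & bit);
}
```

After cap_set_proc() with both flags raised, cap_is_permitted_and_inheritable(CAP_NET_RAW) should return 1; with only the inheritable flag raised, as in the original example, it should return 0.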