linux-kernel ebpf bcc-bpf

Why is detaching from a uprobe using bcc so much slower than attaching?


I have a program that attaches to 5000 functions on both entry and exit. Attaching takes some time, about 30 seconds or so. When my program exits, though, the cleanup takes upwards of 10 minutes to finish.

The C code is

typedef struct FunctionEvent_t {
  u32 pid;
  u64 timestamp;
  
  u64 func_addr;
  u8 entry;
} FunctionEvent;

BPF_PERF_OUTPUT(traceevents);

static void fill_and_submit(struct pt_regs *ctx, FunctionEvent *event) {
  event->pid = bpf_get_current_pid_tgid();
  event->timestamp = bpf_ktime_get_ns();
  event->func_addr = PT_REGS_IP(ctx);
  traceevents.perf_submit(ctx, event, sizeof(FunctionEvent));
}

int do_entry_point(struct pt_regs *ctx) {
  FunctionEvent event = {.entry = 1};
  fill_and_submit(ctx, &event);
  return 0;
}

int do_exit_point(struct pt_regs *ctx) {
  FunctionEvent event = {.entry = 0};
  fill_and_submit(ctx, &event);
  return 0;
}

The bcc code is roughly as follows:

from bcc import BPF

bpf_instance = BPF(text=bpf_program)  # bpf_program holds the C source above
path = "path/to/exe"
addresses = set()  # track addresses we have already attached to

for func_name, func_addr in BPF.get_user_functions_and_addresses(path, ".*"):
    func_name = func_name.decode("utf-8")

    if func_addr in addresses:
        continue

    addresses.add(func_addr)
    try:
        bpf_instance.attach_uprobe(name=path, sym=func_name, fn_name="do_entry_point")
        bpf_instance.attach_uretprobe(name=path, sym=func_name, fn_name="do_exit_point")
    except Exception:
        print(f"Failed to attach to function {func_name}")

Is there any way to speed up the detaching process? Could the attach and detach potentially be parallelized somehow? I wouldn't have thought the cleanup would be this slow.

Also to note, I have no idea whether attaching to 5000 functions is sane; I'm just playing around and trying to find my system's limits.


Solution

  • Note: this is speculation; I have not been able to test this.

    There are two parts to the slowness. The first part has to do with attaching/detaching one-by-one versus in bulk. I found this issue, which describes the exact same problem.

    We do have multi-kprobe support nowadays via BPF links. I don't think BCC supports this, but you should be able to make use of it via libbpf.
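    As a rough, untested sketch of what that could look like with libbpf's kprobe-multi link: the skeleton header, program name, and symbol list below are made-up placeholders, and it assumes the BPF program has been rebuilt for libbpf with SEC("kprobe.multi").

    #include <stdio.h>
    #include <bpf/libbpf.h>
    #include "tracer.skel.h"  /* hypothetical skeleton generated by bpftool gen skeleton */

    int main(void)
    {
      /* Example symbols; in practice this would be your list of thousands. */
      const char *syms[] = { "vfs_read", "vfs_write" };
      struct tracer_bpf *skel;
      struct bpf_link *link;

      skel = tracer_bpf__open_and_load();
      if (!skel)
        return 1;

      LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
        .syms = syms,
        .cnt = sizeof(syms) / sizeof(syms[0]),
        .retprobe = false,  /* a second link with true would cover the return probes */
      );

      /* A single call attaches the program to the whole batch of symbols. */
      link = bpf_program__attach_kprobe_multi_opts(skel->progs.do_entry_point,
                                                   NULL /* no pattern, use syms */,
                                                   &opts);
      if (!link) {
        fprintf(stderr, "attach failed\n");
        tracer_bpf__destroy(skel);
        return 1;
      }

      /* ... trace ... */

      /* Destroying the single link detaches the whole batch at once. */
      bpf_link__destroy(link);
      tracer_bpf__destroy(skel);
      return 0;
    }

    The point is that attaching and detaching each become one BPF link operation for the whole batch instead of thousands of individual perf-event attachments.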

    The second part is why detaching is so much slower than attaching. This seems to be because of kprobe jump optimization. When you attach a kprobe, it is initially unoptimized; a background optimizer later optimizes attached kprobes, and that optimization takes time both to apply and to undo. So when you go to detach the kprobe, the kernel first has to undo the optimization, which takes longer.
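    You can check whether your probes have actually been optimized: the kernel lists every registered kprobe in debugfs, tagged with [OPTIMIZED] once the background optimizer has processed it. A minimal sketch that just dumps that list (assuming debugfs is mounted at /sys/kernel/debug and you run it as root):

    #include <stdio.h>

    int main(void)
    {
      /* Each line looks roughly like: "<addr>  k  <symbol>+0x0  [OPTIMIZED]" */
      FILE *f = fopen("/sys/kernel/debug/kprobes/list", "r");
      char line[512];

      if (!f) {
        perror("fopen");
        return 1;
      }
      while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
      fclose(f);
      return 0;
    }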

    It seems an attempt has been made in the past to make this more asynchronous. I don't think it has been merged, but the commit message seems to confirm my optimization theory:

    ... I found there were another issue.

    Probes: 256 kprobe_events
    Enable events
    real   0m 0.00s
    user   0m 0.00s
    sys    0m 0.00s
    Disable events
    real   0m 21.40s
    user   0m 0.00s
    sys    0m 0.02s
    Remove events
    real   0m 2.24s
    user   0m 0.00s
    sys    0m 0.01s
    

    OK, removing events took more than 2 seconds for 256 probe events. But disabling events took 21 seconds, 10 times longer than removing. Actually, since perf-events (base of BPF tracer) does disable and remove at once, it will take more than that.

    I also measured it without kprobe jump optimization (echo 0 > /proc/sys/debug/kprobe-optimization) and it changed the results as below.

    Probes: 256 kprobe_events
    Enable events
    real   0m 0.00s
    user   0m 0.00s
    sys    0m 0.00s
    Disable events
    real   0m 2.07s
    user   0m 0.00s
    sys    0m 0.04s
    Remove events
    real   0m 2.13s
    user   0m 0.00s
    sys    0m 0.01s
    

    So by disabling jump optimization you might be able to trade higher runtime overhead for much faster detaching. See the kernel kprobes documentation for the measured overhead of unoptimized versus optimized probes.
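    As a rough, untested sketch of that knob: disable optimization before attaching, so the probes never get optimized in the first place, and re-enable it once everything has been detached. Note that in recent kernels the sysctl appears as debug.kprobes-optimization, spelled slightly differently from the path in the quoted message.

    #include <stdio.h>

    /* Toggle kprobe jump optimization system-wide (1 = on, 0 = off); needs root. */
    static int set_kprobes_optimization(int enabled)
    {
      FILE *f = fopen("/proc/sys/debug/kprobes-optimization", "w");

      if (!f)
        return -1;
      fprintf(f, "%d\n", enabled);
      return fclose(f);
    }

    int main(void)
    {
      if (set_kprobes_optimization(0) != 0)  /* disable before attaching anything */
        perror("kprobes-optimization");

      /* ... attach probes, trace, detach probes here ... */

      if (set_kprobes_optimization(1) != 0)  /* restore the default afterwards */
        perror("kprobes-optimization");

      return 0;
    }

    Keep in mind this affects kprobes system-wide, and unoptimized probes cost more per hit for as long as they are attached.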