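While developing eBPF programs in a Linux virtual machine on a MacBook M2, I found that some LSM-type eBPF programs would not run. While trying to pin down the differences, I came across this article, which gives a more concrete picture of how LSM eBPF support differs between ARM64 and AMD64 kernels.

Original article: Exploring BPF LSM support on aarch64 with ftrace

This blog post is a summary of our internal research on `BPF LSM` support on `aarch64` in Linux. If you are not familiar with the kernel code base, starting to read the kernel source is very hard, so we decided to publish this post and show our approach, since it may help anyone who wants to explore the kernel internals.

On `x86_64` we already use `BPF LSM`, while on `aarch64` we rely on `Kprobes`, so we wanted to know which features were missing from the kernel to make `BPF LSM` available on `aarch64`.

We have dug into the kernel source many times, but usually we search for something that already exists in order to understand how it works. In this case we were looking for something that does not exist: we were chasing whatever returns an error because it is not implemented.

Recalling Steven Rostedt's talk on how to start learning the Linux kernel, we started from `ftrace` (and the tools built on top of the tracing infrastructure) to understand what happens when we load an unsupported `BPF` program into the kernel.

This is the output of our software pulsar when we try to load a `BPF LSM` program on an `aarch64` Linux kernel 5.15: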
```shell
root@pine64-1:/home/exein# ./pulsar-enterprise-exec pulsard
[2023-02-16T14:52:45Z INFO pulsar::pulsard::daemon] Starting module process-monitor
[2023-02-16T14:52:45Z INFO pulsar::pulsard::daemon] Starting module file-system-monitor
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module network-monitor
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module logger
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module rules-engine
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module desktop-notifier
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in file-system-monitor: failed program attach lsm path_mknod
Caused by:
    0: `bpf_raw_tracepoint_open` failed
    1: No error information (os error 524)
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module anomaly-detection
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module malware-detection
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in malware-detection: /var/lib/pulsar/malware_detection/models/parameters.json not found
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module platform-connector
[2023-02-16T14:52:46Z INFO platform_connector::client] Connected to https://platform-dev-instance.exein.io:8001/
[2023-02-16T14:52:46Z INFO pulsar::pulsard::daemon] Starting module threat-response
[2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in network-monitor: failed program attach lsm socket_bind
Caused by:
    0: `bpf_raw_tracepoint_open` failed
    1: No error information (os error 524)
```
When trying to load the BPF program attached to the `path_mknod` LSM hook, pulsar failed with error `524`, that is `ENOTSUPP`. Let's dig into the problem.
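
Error 524 deserves a second look: `ENOTSUPP` is a kernel-internal code defined in the kernel's `include/linux/errno.h`, not in the userspace `errno.h`, which is likely why the C library has no message for it. The sketch below (in Python rather than the Rust used by pulsar) shows one way a tool could label such codes. The `KERNEL_INTERNAL_ERRNOS` table and the `describe_errno` helper are hypothetical names introduced only for illustration.

```python
import errno

# A few kernel-internal codes from include/linux/errno.h. They are not part
# of the userspace errno.h, so libc has no name or message for them.
KERNEL_INTERNAL_ERRNOS = {
    512: "ERESTARTSYS",
    517: "EPROBE_DEFER",
    524: "ENOTSUPP",
}


def describe_errno(code: int) -> str:
    """Return a symbolic name for an errno value, if one is known."""
    code = abs(code)  # the kernel returns -524, userspace reports 524
    if code in errno.errorcode:
        return errno.errorcode[code]
    return KERNEL_INTERNAL_ERRNOS.get(code, f"unknown errno {code}")


if __name__ == "__main__":
    print(describe_errno(524))   # the value reported by pulsar
    print(describe_errno(-524))  # the same value as seen inside the kernel
```

Keeping the table separate from Python's `errno.errorcode` makes it clear which names come from the userspace ABI and which exist only inside the kernel.
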
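**Note:** At the time of this research we could not find a prebuilt `aarch64` kernel with `BPF` and `BTF` enabled, so we had to compile a custom kernel. We also enabled the tracing options and the `function_graph` plugin in order to use the tools below.

All experiments were run on a Pine A64 with a custom Armbian image. These images ship a custom kernel with a standard `Ubuntu 22.04 LTS Jammy` userspace.

To investigate the issue we used the following tools: `trace-cmd` and `bpftrace`.

To use these tools you need to enable some options in the Linux kernel; see the official documentation for the full requirements.

**Note:** The same job can be done with other tools, such as `funcgraph` and `kprobe` from perf-tools.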
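
As a quick sanity check before tracing, one can confirm that the running kernel exposes the `function_graph` tracer. The sketch below assumes tracefs is mounted at `/sys/kernel/tracing` or, on older setups, `/sys/kernel/debug/tracing`; `has_tracer` and `read_available_tracers` are hypothetical helper names, not part of any tool used in this article.

```python
from pathlib import Path

# Common tracefs mount points; which one exists depends on the system.
TRACEFS_CANDIDATES = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def has_tracer(available_tracers: str, name: str) -> bool:
    """Check a tracer name against the contents of available_tracers."""
    # The file holds whitespace-separated tracer names on a single line.
    return name in available_tracers.split()


def read_available_tracers():
    """Return the contents of available_tracers, or None if unreadable."""
    for root in TRACEFS_CANDIDATES:
        try:
            return (Path(root) / "available_tracers").read_text()
        except OSError:
            continue  # not mounted, or not readable without root
    return None


if __name__ == "__main__":
    text = read_available_tracers()
    if text is None:
        print("tracefs not readable (try running as root)")
    else:
        print("function_graph available:", has_tracer(text, "function_graph"))
```

Comparing whole tokens rather than substrings matters here, because `function` is itself a tracer name and a prefix of `function_graph`.
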
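
Now let's use these tools to see what happens when we try to load our BPF program on kernel 5.15.

From this point until the end of the article we will use the `probe` binary instead of `pulsar`, because it is simpler. As a quick summary of how it works, here is its command-line help:

```shell
exein@pine64-1:~$ ./probe
Test runner for eBPF programs

Usage: probe [OPTIONS] <COMMAND>

Commands:
  file-system-monitor  Watch file creations
  process-monitor      Watch process events (fork/exec/exit)
  network-monitor      Watch network events
  help                 Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose
  -h, --help     Print help
  -V, --version  Print version
```

In these examples we will try to load the `file-system-monitor` probe.

By running the following commands we can see the function graph of the calls made under `__sys_bpf`, the entry point of the BPF system call:

```shell
trace-cmd record -p function_graph -g __sys_bpf ./probe file-system-monitor
trace-cmd report
```

The output is a very large function graph, too big to paste here. Since we hit an error, we are interested in the last few functions called before the program stops. These are the last lines of the `trace-cmd report` output: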
```shell
...
tokio-runtime-w-1666 [003] 1318.058019: funcgraph_entry: | bpf_trampoline_link_prog() {
tokio-runtime-w-1666 [003] 1318.058020: funcgraph_entry: 2.292 us | bpf_attach_type_to_tramp();
tokio-runtime-w-1666 [003] 1318.058024: funcgraph_entry: 1.250 us | mutex_lock();
tokio-runtime-w-1666 [003] 1318.058028: funcgraph_entry: | bpf_trampoline_update() {
tokio-runtime-w-1666 [003] 1318.058030: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058031: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058036: funcgraph_exit: 6.792 us | }
tokio-runtime-w-1666 [003] 1318.058039: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058042: funcgraph_entry: 2.750 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058046: funcgraph_exit: 6.417 us | }
tokio-runtime-w-1666 [003] 1318.058048: funcgraph_entry: 2.708 us | bpf_jit_charge_modmem();
tokio-runtime-w-1666 [003] 1318.058053: funcgraph_entry: | bpf_jit_alloc_exec_page() {
tokio-runtime-w-1666 [003] 1318.058055: funcgraph_entry: | bpf_jit_alloc_exec() {
tokio-runtime-w-1666 [003] 1318.058057: funcgraph_entry: | vmalloc() {
tokio-runtime-w-1666 [003] 1318.058059: funcgraph_entry: | __vmalloc_node() {
tokio-runtime-w-1666 [003] 1318.058061: funcgraph_entry: | __vmalloc_node_range() {
tokio-runtime-w-1666 [003] 1318.058064: funcgraph_entry: | __get_vm_area_node.constprop.64() {
tokio-runtime-w-1666 [003] 1318.058067: funcgraph_entry: | kmem_cache_alloc_node_trace() {
tokio-runtime-w-1666 [003] 1318.058069: funcgraph_entry: 1.459 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058073: funcgraph_exit: 6.292 us | }
tokio-runtime-w-1666 [003] 1318.058075: funcgraph_entry: | alloc_vmap_area() {
tokio-runtime-w-1666 [003] 1318.058077: funcgraph_entry: | kmem_cache_alloc_node() {
tokio-runtime-w-1666 [003] 1318.058079: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058085: funcgraph_exit: 7.625 us | }
tokio-runtime-w-1666 [003] 1318.058088: funcgraph_entry: | kmem_cache_alloc_node() {
tokio-runtime-w-1666 [003] 1318.058089: funcgraph_entry: 1.208 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058092: funcgraph_exit: 4.584 us | }
tokio-runtime-w-1666 [003] 1318.058104: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-1666 [003] 1318.058107: funcgraph_entry: 2.084 us | __slab_free();
tokio-runtime-w-1666 [003] 1318.058110: funcgraph_exit: 5.667 us | }
tokio-runtime-w-1666 [003] 1318.058112: funcgraph_entry: 6.375 us | insert_vmap_area.constprop.74();
tokio-runtime-w-1666 [003] 1318.058119: funcgraph_exit: + 44.667 us | }
tokio-runtime-w-1666 [003] 1318.058122: funcgraph_exit: + 58.250 us | }
tokio-runtime-w-1666 [003] 1318.058124: funcgraph_entry: | __kmalloc_node() {
tokio-runtime-w-1666 [003] 1318.058125: funcgraph_entry: 1.625 us | kmalloc_slab();
tokio-runtime-w-1666 [003] 1318.058128: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058131: funcgraph_exit: 7.208 us | }
tokio-runtime-w-1666 [003] 1318.058133: funcgraph_entry: | alloc_pages() {
tokio-runtime-w-1666 [003] 1318.058135: funcgraph_entry: 1.583 us | get_task_policy.part.48();
tokio-runtime-w-1666 [003] 1318.058138: funcgraph_entry: 1.500 us | policy_node();
tokio-runtime-w-1666 [003] 1318.058141: funcgraph_entry: 1.209 us | policy_nodemask();
tokio-runtime-w-1666 [003] 1318.058143: funcgraph_entry: | __alloc_pages() {
tokio-runtime-w-1666 [003] 1318.058145: funcgraph_entry: 1.458 us | should_fail_alloc_page();
tokio-runtime-w-1666 [003] 1318.058147: funcgraph_entry: | get_page_from_freelist() {
tokio-runtime-w-1666 [003] 1318.058150: funcgraph_entry: 1.583 us | prep_new_page();
tokio-runtime-w-1666 [003] 1318.058153: funcgraph_exit: 5.459 us | }
tokio-runtime-w-1666 [003] 1318.058154: funcgraph_exit: + 10.542 us | }
tokio-runtime-w-1666 [003] 1318.058155: funcgraph_exit: + 22.083 us | }
tokio-runtime-w-1666 [003] 1318.058157: funcgraph_entry: | __cond_resched() {
tokio-runtime-w-1666 [003] 1318.058158: funcgraph_entry: 1.833 us | rcu_all_qs();
tokio-runtime-w-1666 [003] 1318.058161: funcgraph_exit: 4.167 us | }
tokio-runtime-w-1666 [003] 1318.058166: funcgraph_entry: 5.542 us | vmap_pages_range_noflush();
tokio-runtime-w-1666 [003] 1318.058173: funcgraph_exit: ! 112.375 us | }
tokio-runtime-w-1666 [003] 1318.058175: funcgraph_exit: ! 116.000 us | }
tokio-runtime-w-1666 [003] 1318.058176: funcgraph_exit: ! 119.292 us | }
tokio-runtime-w-1666 [003] 1318.058177: funcgraph_exit: ! 122.542 us | }
tokio-runtime-w-1666 [003] 1318.058179: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-1666 [003] 1318.058180: funcgraph_entry: 1.375 us | find_vmap_area();
tokio-runtime-w-1666 [003] 1318.058183: funcgraph_exit: 4.333 us | }
tokio-runtime-w-1666 [003] 1318.058185: funcgraph_entry: | set_memory_x() {
tokio-runtime-w-1666 [003] 1318.058186: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-1666 [003] 1318.058188: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-1666 [003] 1318.058189: funcgraph_entry: 1.333 us | find_vmap_area();
tokio-runtime-w-1666 [003] 1318.058192: funcgraph_exit: 3.875 us | }
tokio-runtime-w-1666 [003] 1318.058193: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-1666 [003] 1318.058194: funcgraph_entry: | _vm_unmap_aliases.part.58() {
tokio-runtime-w-1666 [003] 1318.058196: funcgraph_entry: 1.542 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058199: funcgraph_entry: 1.208 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058202: funcgraph_entry: 1.166 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058205: funcgraph_entry: 1.208 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058207: funcgraph_entry: 1.208 us | mutex_lock();
tokio-runtime-w-1666 [003] 1318.058210: funcgraph_entry: | purge_fragmented_blocks_allcpus() {
tokio-runtime-w-1666 [003] 1318.058212: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058214: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058217: funcgraph_entry: 1.500 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058220: funcgraph_entry: 1.167 us | rcu_read_unlock_strict();
tokio-runtime-w-1666 [003] 1318.058222: funcgraph_exit: + 11.917 us | }
tokio-runtime-w-1666 [003] 1318.058224: funcgraph_entry: | __purge_vmap_area_lazy() {
tokio-runtime-w-1666 [003] 1318.058232: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-1666 [003] 1318.058234: funcgraph_entry: 1.250 us | __slab_free();
tokio-runtime-w-1666 [003] 1318.058237: funcgraph_exit: 4.791 us | }
tokio-runtime-w-1666 [003] 1318.058241: funcgraph_entry: 1.209 us | __cond_resched_lock();
tokio-runtime-w-1666 [003] 1318.058244: funcgraph_exit: + 19.625 us | }
tokio-runtime-w-1666 [003] 1318.058245: funcgraph_entry: 1.167 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058247: funcgraph_exit: + 53.042 us | }
tokio-runtime-w-1666 [003] 1318.058248: funcgraph_exit: + 55.625 us | }
tokio-runtime-w-1666 [003] 1318.058250: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-1666 [003] 1318.058251: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-1666 [003] 1318.058253: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-1666 [003] 1318.058255: funcgraph_entry: 1.250 us | pud_huge();
tokio-runtime-w-1666 [003] 1318.058258: funcgraph_entry: 1.166 us | pmd_huge();
tokio-runtime-w-1666 [003] 1318.058260: funcgraph_entry: 1.208 us | change_page_range();
tokio-runtime-w-1666 [003] 1318.058263: funcgraph_exit: 9.834 us | }
tokio-runtime-w-1666 [003] 1318.058264: funcgraph_exit: + 12.709 us | }
tokio-runtime-w-1666 [003] 1318.058266: funcgraph_exit: + 15.459 us | }
tokio-runtime-w-1666 [003] 1318.058268: funcgraph_exit: + 80.791 us | }
tokio-runtime-w-1666 [003] 1318.058270: funcgraph_exit: + 84.834 us | }
tokio-runtime-w-1666 [003] 1318.058272: funcgraph_exit: ! 218.500 us | }
tokio-runtime-w-1666 [003] 1318.058274: funcgraph_entry: | __alloc_percpu_gfp() {
tokio-runtime-w-1666 [003] 1318.058276: funcgraph_entry: | pcpu_alloc() {
tokio-runtime-w-1666 [003] 1318.058281: funcgraph_entry: 2.250 us | mutex_lock_killable();
tokio-runtime-w-1666 [003] 1318.058290: funcgraph_entry: | pcpu_find_block_fit() {
tokio-runtime-w-1666 [003] 1318.058293: funcgraph_entry: 2.833 us | pcpu_next_fit_region.constprop.38();
tokio-runtime-w-1666 [003] 1318.058299: funcgraph_exit: 9.084 us | }
tokio-runtime-w-1666 [003] 1318.058301: funcgraph_entry: | pcpu_alloc_area() {
tokio-runtime-w-1666 [003] 1318.058315: funcgraph_entry: 4.000 us | pcpu_block_update_hint_alloc();
tokio-runtime-w-1666 [003] 1318.058320: funcgraph_entry: 2.208 us | pcpu_chunk_relocate();
tokio-runtime-w-1666 [003] 1318.058324: funcgraph_exit: + 22.625 us | }
tokio-runtime-w-1666 [003] 1318.058327: funcgraph_entry: 1.208 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058332: funcgraph_entry: 1.584 us | pcpu_memcg_post_alloc_hook();
tokio-runtime-w-1666 [003] 1318.058335: funcgraph_exit: + 58.833 us | }
tokio-runtime-w-1666 [003] 1318.058336: funcgraph_exit: + 61.834 us | }
tokio-runtime-w-1666 [003] 1318.058338: funcgraph_entry: | kmem_cache_alloc_trace() {
tokio-runtime-w-1666 [003] 1318.058339: funcgraph_entry: 1.167 us | should_failslab();
tokio-runtime-w-1666 [003] 1318.058342: funcgraph_exit: 4.458 us | }
tokio-runtime-w-1666 [003] 1318.058359: funcgraph_entry: | bpf_image_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058360: funcgraph_entry: | bpf_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058363: funcgraph_entry: 1.583 us | __local_bh_enable_ip();
tokio-runtime-w-1666 [003] 1318.058366: funcgraph_exit: 5.750 us | }
tokio-runtime-w-1666 [003] 1318.058369: funcgraph_exit: 9.834 us | }
tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
tokio-runtime-w-1666 [003] 1318.058377: funcgraph_exit: ! 348.625 us | }
tokio-runtime-w-1666 [003] 1318.058379: funcgraph_entry: 1.250 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058382: funcgraph_exit: ! 363.167 us | }
tokio-runtime-w-1666 [003] 1318.058384: funcgraph_entry: | bpf_link_cleanup() {
tokio-runtime-w-1666 [003] 1318.058386: funcgraph_entry: | bpf_link_free_id.part.30() {
tokio-runtime-w-1666 [003] 1318.058392: funcgraph_entry: | call_rcu() {
tokio-runtime-w-1666 [003] 1318.058396: funcgraph_entry: 1.834 us | rcu_segcblist_enqueue();
tokio-runtime-w-1666 [003] 1318.058401: funcgraph_exit: 9.333 us | }
tokio-runtime-w-1666 [003] 1318.058403: funcgraph_entry: 1.542 us | __local_bh_enable_ip();
tokio-runtime-w-1666 [003] 1318.058406: funcgraph_exit: + 19.542 us | }
tokio-runtime-w-1666 [003] 1318.058408: funcgraph_entry: | fput() {
tokio-runtime-w-1666 [003] 1318.058409: funcgraph_entry: | fput_many() {
tokio-runtime-w-1666 [003] 1318.058411: funcgraph_entry: | task_work_add() {
tokio-runtime-w-1666 [003] 1318.058414: funcgraph_entry: 1.625 us | kick_process();
tokio-runtime-w-1666 [003] 1318.058418: funcgraph_exit: 6.750 us | }
tokio-runtime-w-1666 [003] 1318.058419: funcgraph_exit: + 10.333 us | }
tokio-runtime-w-1666 [003] 1318.058420: funcgraph_exit: + 12.708 us | }
tokio-runtime-w-1666 [003] 1318.058422: funcgraph_entry: 2.250 us | put_unused_fd();
tokio-runtime-w-1666 [003] 1318.058426: funcgraph_exit: + 41.416 us | }
tokio-runtime-w-1666 [003] 1318.058428: funcgraph_entry: 1.292 us | mutex_unlock();
tokio-runtime-w-1666 [003] 1318.058430: funcgraph_entry: 1.250 us | kfree();
tokio-runtime-w-1666 [003] 1318.058433: funcgraph_exit: ! 567.458 us | }
tokio-runtime-w-1666 [003] 1318.058435: funcgraph_entry: 2.125 us | __bpf_prog_put.isra.47();
tokio-runtime-w-1666 [003] 1318.058438: funcgraph_exit: ! 602.291 us | }
tokio-runtime-w-1666 [003] 1318.058439: funcgraph_exit: ! 631.791 us | }
```
This is the source code of `bpf_trampoline_update`, the last function executed, from `kernel/bpf/trampoline.c`:

```c
static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
	struct bpf_tramp_image *im;
	struct bpf_tramp_progs *tprogs;
	u32 flags = BPF_TRAMP_F_RESTORE_REGS;
	bool ip_arg = false;
	int err, total;

	tprogs = bpf_trampoline_get_progs(tr, &total, &ip_arg);
	if (IS_ERR(tprogs))
		return PTR_ERR(tprogs);

	if (total == 0) {
		err = unregister_fentry(tr, tr->cur_image->image);
		bpf_tramp_image_put(tr->cur_image);
		tr->cur_image = NULL;
		tr->selector = 0;
		goto out;
	}

	im = bpf_tramp_image_alloc(tr->key, tr->selector);
	if (IS_ERR(im)) {
		err = PTR_ERR(im);
		goto out;
	}

	if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
	    tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
		flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;

	if (ip_arg)
		flags |= BPF_TRAMP_F_IP_ARG;

	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
					  &tr->func.model, flags, tprogs,
					  tr->func.addr);
	if (err < 0)
		goto out;

	WARN_ON(tr->cur_image && tr->selector == 0);
	WARN_ON(!tr->cur_image && tr->selector);
	if (tr->cur_image)
		/* progs already running at this address */
		err = modify_fentry(tr, tr->cur_image->image, im->image);
	else
		/* first time registering */
		err = register_fentry(tr, im->image);
	if (err)
		goto out;
	if (tr->cur_image)
		bpf_tramp_image_put(tr->cur_image);
	tr->cur_image = im;
	tr->selector++;
out:
	kfree(tprogs);
	return err;
}
```

From the previous output we can see:

```shell
tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
```

There are no other function calls between `arch_prepare_bpf_trampoline` and `kfree`, so most likely the first function returned an error code in the `err` variable. Let's verify it!
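
This reading of the trace can also be done mechanically. The sketch below is a minimal parser, assuming the flattened `trace-cmd report` line format shown in this article, that lists the functions called directly by a given parent. `direct_children` is a hypothetical helper, and `SAMPLE` is a short excerpt of the report above with the intermediate lines left out.

```python
def direct_children(report_lines, parent):
    """List the functions called directly by the last invocation of `parent`."""
    stack = []     # functions currently open, innermost last
    children = []
    for line in report_lines:
        if "|" not in line:
            continue
        call = line.rsplit("|", 1)[1].strip()
        if call == "}":                      # funcgraph_exit
            if stack:
                stack.pop()
            continue
        name = call.split("(")[0]
        if name == parent and call.endswith("{"):
            children = []                    # new invocation: start over
        elif stack and stack[-1] == parent:
            children.append(name)
        if call.endswith("{"):               # non-leaf funcgraph_entry
            stack.append(name)
    return children


# Excerpt of the 5.15 report (lines between the first two entries omitted).
SAMPLE = """\
tokio-runtime-w-1666 [003] 1318.058028: funcgraph_entry: | bpf_trampoline_update() {
tokio-runtime-w-1666 [003] 1318.058359: funcgraph_entry: | bpf_image_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058360: funcgraph_entry: | bpf_ksym_add() {
tokio-runtime-w-1666 [003] 1318.058363: funcgraph_entry: 1.583 us | __local_bh_enable_ip();
tokio-runtime-w-1666 [003] 1318.058366: funcgraph_exit: 5.750 us | }
tokio-runtime-w-1666 [003] 1318.058369: funcgraph_exit: 9.834 us | }
tokio-runtime-w-1666 [003] 1318.058371: funcgraph_entry: 1.250 us | arch_prepare_bpf_trampoline();
tokio-runtime-w-1666 [003] 1318.058373: funcgraph_entry: 2.292 us | kfree();
tokio-runtime-w-1666 [003] 1318.058377: funcgraph_exit: ! 348.625 us | }
"""

if __name__ == "__main__":
    for name in direct_children(SAMPLE.splitlines(), "bpf_trampoline_update"):
        print(name)
```

On this excerpt the last two direct children are `arch_prepare_bpf_trampoline` and `kfree`, which matches the manual reading: nothing runs between the architecture hook and the cleanup at the `out` label.
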

By starting `bpftrace` in a shell as follows, we can capture the return value of `arch_prepare_bpf_trampoline` and print it to the console:

```shell
bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
```

After starting `probe` in another terminal, we get the following output from `bpftrace`:

```shell
root@pine64-1:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
Attaching 1 probe...
retval link: -524
```
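This happens because kernel `5.15` lacks an `arch_prepare_bpf_trampoline` implementation for the `aarch64` architecture and falls back to the default placeholder implementation:

```c
int __weak
arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
			    const struct btf_func_model *m, u32 flags,
			    struct bpf_tramp_links *tlinks,
			    void *orig_call)
{
	return -ENOTSUPP;
}
```

So this feature is not supported on this kernel version. The good news is that, thanks to this patch, it was implemented in the 6.x kernels.

Let's move on to the 6.x kernel.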

If we try to run `probe` on kernel 6.1, we get the following output:

```shell
root@pine64:/home/exein# ./probe file-system-monitor
thread 'main' panicked at 'initialization failed: ProgramAttachError { program: "lsm path_mknod", program_error: SyscallError { call: "bpf_raw_tracepoint_open", io_error: Os { code: 524, kind: Uncategorized, message: "No error information" } } }', src/bin/probe.rs:72:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

On kernel 6.1 we still hit the same error as on kernel 5.15! Let's find out why.

Running `bpftrace` on `arch_prepare_bpf_trampoline` this time, we get the following output:

```shell
root@pine64:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval tp link: %d\n", retval); }'
Attaching 1 probe...
retval tp link: 284
```

So the problem is not here: the function no longer returns an error. Let's go back to the function call graph.

This time we start `trace-cmd` and skip some functions to get a cleaner output:

```shell
trace-cmd record \
    -p function_graph \
    -g bpf_trampoline_link_prog \
    -n bpf_jit_alloc_exec \
    -n kmalloc_trace \
    -n arch_prepare_bpf_trampoline \
    -n generic_handle_domain_irq \
    -n do_interrupt_handler \
    -n irq_exit_rcu \
    ./probe file-system-monitor
```

We get the following output from `trace-cmd report`:
```shell
root@pine64:/home/exein# trace-cmd report
CPU 0 is empty
CPU 1 is empty
CPU 3 is empty
cpus=4
tokio-runtime-w-11886 [002] 193385.056283: funcgraph_entry: | bpf_trampoline_link_prog() {
tokio-runtime-w-11886 [002] 193385.056321: funcgraph_entry: + 15.042 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.056373: funcgraph_entry: | __bpf_trampoline_link_prog() {
tokio-runtime-w-11886 [002] 193385.056395: funcgraph_entry: + 14.833 us | bpf_attach_type_to_tramp();
tokio-runtime-w-11886 [002] 193385.056428: funcgraph_entry: | bpf_trampoline_update.isra.23() {
tokio-runtime-w-11886 [002] 193385.056459: funcgraph_entry: 2.917 us | bpf_jit_charge_modmem();
tokio-runtime-w-11886 [002] 193385.056531: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.056540: funcgraph_entry: 3.000 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.056547: funcgraph_exit: + 16.208 us | }
tokio-runtime-w-11886 [002] 193385.056554: funcgraph_entry: | __alloc_percpu_gfp() {
tokio-runtime-w-11886 [002] 193385.056563: funcgraph_entry: | pcpu_alloc() {
tokio-runtime-w-11886 [002] 193385.056568: funcgraph_entry: 4.875 us | mutex_lock_killable();
tokio-runtime-w-11886 [002] 193385.056591: funcgraph_entry: | pcpu_find_block_fit() {
tokio-runtime-w-11886 [002] 193385.056599: funcgraph_entry: 8.625 us | pcpu_next_fit_region.constprop.38();
tokio-runtime-w-11886 [002] 193385.056608: funcgraph_exit: + 17.166 us | }
tokio-runtime-w-11886 [002] 193385.056610: funcgraph_entry: | pcpu_alloc_area() {
tokio-runtime-w-11886 [002] 193385.056639: funcgraph_entry: 9.167 us | pcpu_block_update();
tokio-runtime-w-11886 [002] 193385.056656: funcgraph_entry: 7.667 us | pcpu_block_update_hint_alloc();
tokio-runtime-w-11886 [002] 193385.056671: funcgraph_entry: 7.750 us | pcpu_chunk_relocate();
tokio-runtime-w-11886 [002] 193385.056679: funcgraph_exit: + 69.667 us | }
tokio-runtime-w-11886 [002] 193385.056682: funcgraph_entry: 7.042 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.056703: funcgraph_entry: 2.792 us | pcpu_memcg_post_alloc_hook();
tokio-runtime-w-11886 [002] 193385.056712: funcgraph_exit: ! 148.709 us | }
tokio-runtime-w-11886 [002] 193385.056719: funcgraph_exit: ! 165.250 us | }
tokio-runtime-w-11886 [002] 193385.056866: funcgraph_entry: | bpf_image_ksym_add() {
tokio-runtime-w-11886 [002] 193385.056873: funcgraph_entry: | bpf_ksym_add() {
tokio-runtime-w-11886 [002] 193385.056882: funcgraph_entry: 2.750 us | __local_bh_disable_ip();
tokio-runtime-w-11886 [002] 193385.056897: funcgraph_entry: 4.625 us | __local_bh_enable_ip();
tokio-runtime-w-11886 [002] 193385.056905: funcgraph_exit: + 32.459 us | }
tokio-runtime-w-11886 [002] 193385.056922: funcgraph_entry: 7.584 us | perf_event_ksymbol();
tokio-runtime-w-11886 [002] 193385.056944: funcgraph_exit: + 78.417 us | }
tokio-runtime-w-11886 [002] 193385.057492: funcgraph_entry: | set_memory_ro() {
tokio-runtime-w-11886 [002] 193385.057501: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057504: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.057506: funcgraph_entry: 8.875 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.057518: funcgraph_exit: + 14.250 us | }
tokio-runtime-w-11886 [002] 193385.057522: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057531: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057538: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057544: funcgraph_entry: + 12.791 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057559: funcgraph_entry: 2.708 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057574: funcgraph_entry: + 15.125 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.057591: funcgraph_exit: + 53.792 us | }
tokio-runtime-w-11886 [002] 193385.057597: funcgraph_exit: + 66.083 us | }
tokio-runtime-w-11886 [002] 193385.057610: funcgraph_exit: + 88.125 us | }
tokio-runtime-w-11886 [002] 193385.057619: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-11886 [002] 193385.057622: funcgraph_entry: | _vm_unmap_aliases.part.77() {
tokio-runtime-w-11886 [002] 193385.057625: funcgraph_entry: 9.125 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.057637: funcgraph_entry: 3.084 us | purge_fragmented_blocks_allcpus();
tokio-runtime-w-11886 [002] 193385.057643: funcgraph_entry: | __purge_vmap_area_lazy() {
tokio-runtime-w-11886 [002] 193385.057687: funcgraph_entry: | kmem_cache_free() {
tokio-runtime-w-11886 [002] 193385.057693: funcgraph_entry: + 13.250 us | __slab_free();
tokio-runtime-w-11886 [002] 193385.057705: funcgraph_exit: + 18.750 us | }
tokio-runtime-w-11886 [002] 193385.057718: funcgraph_entry: 7.416 us | __cond_resched_lock();
tokio-runtime-w-11886 [002] 193385.057733: funcgraph_exit: + 90.042 us | }
tokio-runtime-w-11886 [002] 193385.057741: funcgraph_entry: 2.792 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.057747: funcgraph_exit: ! 124.666 us | }
tokio-runtime-w-11886 [002] 193385.057749: funcgraph_exit: ! 130.291 us | }
tokio-runtime-w-11886 [002] 193385.057756: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057759: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057765: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057768: funcgraph_entry: 4.125 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057778: funcgraph_entry: 8.750 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057790: funcgraph_entry: 4.625 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.057797: funcgraph_exit: + 31.958 us | }
tokio-runtime-w-11886 [002] 193385.057803: funcgraph_exit: + 44.375 us | }
tokio-runtime-w-11886 [002] 193385.057817: funcgraph_exit: + 61.208 us | }
tokio-runtime-w-11886 [002] 193385.057820: funcgraph_exit: ! 319.292 us | }
tokio-runtime-w-11886 [002] 193385.057826: funcgraph_exit: ! 333.667 us | }
tokio-runtime-w-11886 [002] 193385.057840: funcgraph_entry: | set_memory_x() {
tokio-runtime-w-11886 [002] 193385.057847: funcgraph_entry: | change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057855: funcgraph_entry: | find_vm_area() {
tokio-runtime-w-11886 [002] 193385.057858: funcgraph_entry: 2.917 us | find_vmap_area();
tokio-runtime-w-11886 [002] 193385.057870: funcgraph_exit: + 14.375 us | }
tokio-runtime-w-11886 [002] 193385.057876: funcgraph_entry: | vm_unmap_aliases() {
tokio-runtime-w-11886 [002] 193385.057879: funcgraph_entry: | _vm_unmap_aliases.part.77() {
tokio-runtime-w-11886 [002] 193385.057882: funcgraph_entry: 3.959 us | mutex_lock();
tokio-runtime-w-11886 [002] 193385.057893: funcgraph_entry: 3.000 us | purge_fragmented_blocks_allcpus();
tokio-runtime-w-11886 [002] 193385.057900: funcgraph_entry: 2.791 us | __purge_vmap_area_lazy();
tokio-runtime-w-11886 [002] 193385.057907: funcgraph_entry: 2.709 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.057913: funcgraph_exit: + 33.708 us | }
tokio-runtime-w-11886 [002] 193385.057915: funcgraph_exit: + 43.000 us | }
tokio-runtime-w-11886 [002] 193385.057922: funcgraph_entry: | __change_memory_common() {
tokio-runtime-w-11886 [002] 193385.057925: funcgraph_entry: | apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057930: funcgraph_entry: | __apply_to_page_range() {
tokio-runtime-w-11886 [002] 193385.057933: funcgraph_entry: 4.292 us | pud_huge();
tokio-runtime-w-11886 [002] 193385.057945: funcgraph_entry: 8.750 us | pmd_huge();
tokio-runtime-w-11886 [002] 193385.057956: funcgraph_entry: 3.958 us | change_page_range();
tokio-runtime-w-11886 [002] 193385.058037: funcgraph_exit: + 32.083 us | }
tokio-runtime-w-11886 [002] 193385.058089: funcgraph_entry: 7.667 us | irq_enter_rcu();
tokio-runtime-w-11886 [002] 193385.058233: funcgraph_exit: ! 308.041 us | }
tokio-runtime-w-11886 [002] 193385.058239: funcgraph_exit: ! 316.709 us | }
tokio-runtime-w-11886 [002] 193385.058247: funcgraph_exit: ! 400.417 us | }
tokio-runtime-w-11886 [002] 193385.058255: funcgraph_exit: ! 415.000 us | }
tokio-runtime-w-11886 [002] 193385.058555: funcgraph_entry: 8.250 us | irq_enter_rcu();
tokio-runtime-w-11886 [002] 193385.058958: funcgraph_entry: | kallsyms_lookup_size_offset() {
tokio-runtime-w-11886 [002] 193385.058974: funcgraph_entry: + 36.333 us | get_symbol_pos();
tokio-runtime-w-11886 [002] 193385.059017: funcgraph_exit: + 59.750 us | }
tokio-runtime-w-11886 [002] 193385.059043: funcgraph_entry: | kfree() {
tokio-runtime-w-11886 [002] 193385.059057: funcgraph_entry: 3.000 us | __kmem_cache_free();
tokio-runtime-w-11886 [002] 193385.059065: funcgraph_exit: + 22.833 us | }
tokio-runtime-w-11886 [002] 193385.059073: funcgraph_exit: # 2644.708 us | }
tokio-runtime-w-11886 [002] 193385.059079: funcgraph_exit: # 2706.292 us | }
tokio-runtime-w-11886 [002] 193385.059095: funcgraph_entry: 2.792 us | mutex_unlock();
tokio-runtime-w-11886 [002] 193385.059101: funcgraph_exit: # 2870.416 us | }
```
This time the program gets past `arch_prepare_bpf_trampoline`, `set_memory_ro` and `set_memory_x`, and the last function we see is `kallsyms_lookup_size_offset`.

As we can see in the `bpf_trampoline_update` function in `kernel/bpf/trampoline.c`, there is no explicit call to `kallsyms_lookup_size_offset`:

```c
static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
{
	// ... OTHER CODE ...

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
again:
	if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
	    (tr->flags & BPF_TRAMP_F_CALL_ORIG))
		tr->flags |= BPF_TRAMP_F_ORIG_STACK;
#endif

	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
					  &tr->func.model, tr->flags, tlinks,
					  tr->func.addr);
	if (err < 0)
		goto out;

	set_memory_ro((long)im->image, 1);
	set_memory_x((long)im->image, 1);

	WARN_ON(tr->cur_image && tr->selector == 0);
	WARN_ON(!tr->cur_image && tr->selector);
	if (tr->cur_image)
		/* progs already running at this address */
		err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
	else
		/* first time registering */
		err = register_fentry(tr, im->image);

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
	if (err == -EAGAIN) {
		/* -EAGAIN from bpf_tramp_ftrace_ops_func. Now
		 * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
		 * trampoline again, and retry register.
		 */
		/* reset fops->func and fops->trampoline for re-register */
		tr->fops->func = NULL;
		tr->fops->trampoline = 0;

		/* reset im->image memory attr for arch_prepare_bpf_trampoline */
		set_memory_nx((long)im->image, 1);
		set_memory_rw((long)im->image, 1);
		goto again;
	}
#endif
	if (err)
		goto out;

	if (tr->cur_image)
		bpf_tramp_image_put(tr->cur_image);
	tr->cur_image = im;
	tr->selector++;
out:
	/* If any error happens, restore previous flags */
	if (err)
		tr->flags = orig_flags;
	kfree(tlinks);
	return err;
}
```
> **Note:** The implementation of `bpf_trampoline_update` differs slightly from the one in kernel 5.15.

The call to `kallsyms_lookup_size_offset` is hidden inside another function. We do not see that function in the graph because the compiler inlined it.

It looks like `kallsyms_lookup_size_offset` is called by `ftrace_location`:

```c
unsigned long ftrace_location(unsigned long ip)
{
	struct dyn_ftrace *rec;
	unsigned long offset;
	unsigned long size;

	rec = lookup_rec(ip, ip);
	if (!rec) {
		if (!kallsyms_lookup_size_offset(ip, &size, &offset))
			goto out;

		/* map sym+0 to __fentry__ */
		if (!offset)
			rec = lookup_rec(ip, ip + size - 1);
	}

	if (rec)
		return rec->ip;

out:
	return 0;
}
```

`ftrace_location` is called by `register_fentry`, which, right after calling `ftrace_location`, checks the `fops` field of `struct bpf_trampoline *tr`:

```c
/* first time registering */
static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
{
	void *ip = tr->func.addr;
	unsigned long faddr;
	int ret;

	faddr = ftrace_location((unsigned long)ip);
	if (faddr) {
		if (!tr->fops)
			return -ENOTSUPP;
		tr->func.ftrace_managed = true;
	}

	if (bpf_trampoline_module_get(tr))
		return -ENOENT;

	if (tr->func.ftrace_managed) {
		ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
		ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
	} else {
		ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
	}

	if (ret)
		bpf_trampoline_module_put(tr);
	return ret;
}
```

Indeed, if `tr->fops` is `NULL`, the function returns the error `-ENOTSUPP`.
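
To make the failing branch explicit, here is a toy Python model of the first check in `register_fentry`. It is not kernel code: `register_fentry_outcome` and its two boolean parameters are hypothetical stand-ins for `ftrace_location()` returning a non-zero address and for `tr->fops` being allocated.

```python
ENOTSUPP = 524  # kernel-internal errno from include/linux/errno.h


def register_fentry_outcome(ftrace_managed_site: bool, has_fops: bool) -> int:
    """Model only the first check in register_fentry().

    ftrace_managed_site: ftrace_location(ip) returned a non-zero address.
    has_fops:            tr->fops is not NULL.
    Returns 0 when the check passes, or a negative errno when it fails.
    """
    if ftrace_managed_site and not has_fops:
        return -ENOTSUPP
    return 0


if __name__ == "__main__":
    # Enumerate all four combinations of the two conditions.
    for site in (False, True):
        for fops in (False, True):
            result = register_fentry_outcome(site, fops)
            print(f"ftrace site={site!s:5} fops={fops!s:5} -> {result}")
```

Only one of the four combinations fails: the attach point is managed by ftrace but the trampoline has no `fops`. That is the case to look for in the rest of the investigation.
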

Let's find out where `tr->fops` is initialized.

If we are right, the trampoline should be created inside the `bpf_trampoline_lookup` function:

```c
static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
	struct bpf_trampoline *tr;
	struct hlist_head *head;
	int i;

	mutex_lock(&trampoline_mutex);
	head = &trampoline_table[hash_64(key, TRAMPOLINE_HASH_BITS)];
	hlist_for_each_entry(tr, head, hlist) {
		if (tr->key == key) {
			refcount_inc(&tr->refcnt);
			goto out;
		}
	}
	tr = kzalloc(sizeof(*tr), GFP_KERNEL);
	if (!tr)
		goto out;
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
	tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
	if (!tr->fops) {
		kfree(tr);
		tr = NULL;
		goto out;
	}
	tr->fops->private = tr;
	tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
#endif

	tr->key = key;
	INIT_HLIST_NODE(&tr->hlist);
	hlist_add_head(&tr->hlist, head);
	refcount_set(&tr->refcnt, 1);
	mutex_init(&tr->mutex);
	for (i = 0; i < BPF_TRAMP_MAX; i++)
		INIT_HLIST_HEAD(&tr->progs_hlist[i]);
out:
	mutex_unlock(&trampoline_mutex);
	return tr;
}
```

After the allocation, the `fops` field of the trampoline is populated only when the `CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS` flag is set. This flag depends on the `HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS` flag, which is not available on `aarch64`.
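
Whether a given kernel was built with this option can be checked from its build configuration. The sketch below assumes the configuration is available as text, for example in `/boot/config-<release>` or in `/proc/config.gz`; both locations depend on how the kernel was packaged. `config_enabled` and `load_running_kernel_config` are hypothetical helper names.

```python
import gzip
import os
from pathlib import Path

OPTION = "CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS"


def config_enabled(config_text: str, option: str) -> bool:
    """True if `option` is set to y in the text of a kernel .config file."""
    return any(line.strip() == f"{option}=y"
               for line in config_text.splitlines())


def load_running_kernel_config():
    """Best-effort read of the running kernel's config; None if not found."""
    try:
        boot = Path(f"/boot/config-{os.uname().release}")
        if boot.is_file():
            return boot.read_text()
        proc = Path("/proc/config.gz")  # needs CONFIG_IKCONFIG_PROC
        if proc.is_file():
            with gzip.open(proc, "rt") as fh:
                return fh.read()
    except OSError:
        pass  # unreadable file: treat as not found
    return None


if __name__ == "__main__":
    text = load_running_kernel_config()
    if text is None:
        print("kernel config not found")
    else:
        state = "enabled" if config_enabled(text, OPTION) else "not enabled"
        print(OPTION, state)
```

Matching the whole line avoids a false positive on the architecture-level `CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS` symbol, whose name contains a similar suffix.
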
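
At present, because the *ftrace direct calls* feature is missing, `BPF LSM` cannot be used on `aarch64`. Fortunately, the current `mainline` branch has already merged a [patch](https://lore.kernel.org/bpf/20230207182135.2671106-5-revest@chromium.org/T/) that enables LSMs (and other features) on aarch64.

These changes are expected to ship in the upcoming Linux kernel 6.4.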
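
CFC4N's blog by CFC4N is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license. Based on a work at https://www.cnxct.com. When reposting, please credit the source: Exploring BPF LSM on the aarch64 architecture with ftrace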