Discussion:
[PATCH RFC v4 net-next 00/26] BPF syscall, maps, verifier, samples, llvm
Alexei Starovoitov
2014-08-13 07:57:11 UTC
Hi All,

one more RFC...

The major difference vs the previous set is a new 'load 64-bit immediate' eBPF
insn. It is the first 16-byte instruction. It shows how the eBPF ISA can be
extended while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
In V3 I used a 'fixup' section in the eBPF program to tell the kernel
which instructions are accessing maps. With the new instruction the 'fixup'
section is gone and the map IDR (internal map_ids) is removed.
To understand the logic behind the new insn, I need to explain two main
eBPF design constraints:
1. The eBPF interpreter must be generic. It should know nothing about maps or
any custom instructions or functions.
2. The llvm compiler backend must be generic. It also should know nothing about
maps, helper functions, sockets, tracing, etc. LLVM just takes normal C
and compiles it for some 'fake' HW that happens to be called the eBPF ISA.

patch #1 implements the BPF_LD_IMM64 insn. It's just a move of a 64-bit
immediate value into a register. Nothing fancy.

The reason it improves eBPF program run-time is the following:
in V3 a program used to look like:
bpf_mov r1, const_internal_map_id
bpf_call bpf_map_lookup
so the in-kernel bpf_map_lookup() helper would do the map_id->map_ptr
conversion via
map = idr_find(&bpf_map_id_idr, map_id);
For the life of the program map_id is constant and that lookup kept returning
the same value, but there was no easy way to store a pointer inside an eBPF insn.

With the new insn the programs look like:
bpf_ld_imm64 r1, const_internal_map_ptr
bpf_call bpf_map_lookup
and the bpf_map_lookup() helper does:
struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
Though it's a small performance gain, every nsec counts.
The new insn also allows further optimizations in JIT compilers.

How does it help to clean up the program interface towards maps?
Obviously user space doesn't know what kernel map pointer is associated
with a process-local map-FD,
so it uses a pseudo BPF_LD_IMM64 instruction.
BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg
BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg
Other values are reserved for now. (They will be used to implement
global variables, strings and other constants and per-cpu areas in the future)
So the programs look like:
BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
BPF_CALL(BPF_FUNC_map_lookup_elem),
The eBPF verifier scans the program for such pseudo instructions, converts
process_local_map_fd -> in-kernel map pointer
and drops the 'pseudo' flag of the BPF_LD_IMM64 instruction.
The eBPF interpreter stays generic and LLVM stays generic, since they know
nothing about pseudo instructions.
Another pseudo instruction is BPF_CALL. User space encodes one of the
BPF_FUNC_xxx function ids in the 'imm' field of the instruction
and the eBPF program loader converts it to an in-kernel helper function pointer.

The idea to use special instructions to access maps was suggested by Jonathan ;)
It took a while to figure out how to do it within the above two design
constraints, but the end result I think is much cleaner than what I had in V2/V3.

Another difference vs the previous set is that the verifier is split into
6 patches and a verifier testsuite is added. Beyond the old checks the verifier
got 'tidiness' checks to make sure all unused fields of instructions are zero.
Unfortunately classic BPF doesn't check for this. Lesson learned.

The tracing use case got some improvements as well. Now eBPF programs can be
attached to tracepoint, syscall and kprobe events, and the C examples are more usable:
ex1_kern.c - demonstrate how programs can walk in-kernel data structures
ex2_kern.c - in-kernel event accounting and user space histograms
See patch #25

TODO:
- the verifier is safe, but not secure, since it allows kernel address leaking;
fix that before lifting the root-only restriction
- allow seccomp to use eBPF
- write a manpage for the eBPF syscall

As always all patches are available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

V3->V4:
- introduced 'load 64-bit immediate' eBPF instruction
- use BPF_LD_IMM64 in LLVM, verifier, programs
- got rid of 'fixup' section in eBPF programs
- got rid of map IDR and internal map_id
- split verifier into 6 patches and added verifier testsuite
- add verifier check for reserved instruction fields
- fixed bug in LLVM eBPF backend (it was miscompiling __builtin_expect)
- fixed race condition in htab_map_update_elem()
- tracing filters can now attach to tracepoint, syscall, kprobe events
- improved C examples

V2->V3:
- fixed verifier register range bug and addressed other comments (Thanks Kees!)
- re-added LLVM eBPF backend
- added two examples in C
- user space ELF parser and loader example

V1->V2:
- got rid of global id, everything now FD based (Thanks Andy!)
- split type enum in verifier (as suggested by Andy and Namhyung)
- switched gpl enforcement to be kmod like (as suggested by Andy and David)
- addressed feedback from Namhyung, Chema, Joe
- added more comments to verifier
- renamed sock_filter_int -> bpf_insn
- rebased on net-next

The FD approach made the eBPF user interface much cleaner for the
sockets/seccomp/tracing use cases. Now the socket and tracing examples
(patches 15 and 16) can be Ctrl-C'd in the middle and the kernel will
auto-cleanup everything, including tracing filters.

----

Old V1 cover letter:

'maps' is a generic storage of different types for sharing data between kernel
and user space. Maps are referenced by file descriptor. A root process can
create multiple maps of different types where key/value are opaque bytes of
data. It's up to user space and the eBPF program to decide what they store in
the maps.

eBPF programs are similar to kernel modules. They are loaded by a user space
program and unloaded on closing of the fd. Each program is a safe
run-to-completion set of instructions. The eBPF verifier statically determines
that the program terminates and is safe to execute. During verification the
program takes a hold of the maps that it intends to use, so selected maps
cannot be removed until the program is unloaded. The program can be attached to
different events. These events can be packets, tracepoint events and other
types in the future. A new event triggers execution of the program, which may
store information about the event in the maps. Beyond storing data the programs
may call into in-kernel helper functions which may, for example, dump the
stack, do trace_printk or other forms of live kernel debugging. The same
program can be attached to multiple events. Different programs can access the
same map:

  tracepoint   tracepoint   tracepoint    sk_buff     sk_buff
   event A      event B      event C      on eth0     on eth1
      |            |            |            |           |
      |            |            |            |           |
      --> tracing <--        tracing       socket      socket
           prog_1            prog_2        prog_3      prog_4
            |  |                |             |           |
        |---    ---|    |-------|             |-----------|
      map_1       map_2                           map_3

User space (via syscall) and eBPF programs access maps concurrently.

------

Alexei Starovoitov (26):
net: filter: add "load 64-bit immediate" eBPF instruction
net: filter: split filter.h and expose eBPF to user space
bpf: introduce syscall(BPF, ...) and BPF maps
bpf: enable bpf syscall on x64
bpf: add lookup/update/delete/iterate methods to BPF maps
bpf: add hashtable type of BPF maps
bpf: expand BPF syscall with program load/unload
bpf: handle pseudo BPF_CALL insn
bpf: verifier (add docs)
bpf: verifier (add ability to receive verification log)
bpf: handle pseudo BPF_LD_IMM64 insn
bpf: verifier (add branch/goto checks)
bpf: verifier (add verifier core)
bpf: verifier (add state pruning optimization)
bpf: allow eBPF programs to use maps
net: sock: allow eBPF programs to be attached to sockets
tracing: allow eBPF programs to be attached to events
tracing: allow eBPF programs to be attached to kprobe/kretprobe
samples: bpf: add mini eBPF library to manipulate maps and programs
samples: bpf: example of stateful socket filtering
samples: bpf: example of tracing filters with eBPF
bpf: llvm backend
samples: bpf: elf file loader
samples: bpf: eBPF example in C
samples: bpf: counting eBPF example in C
bpf: verifier test
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:12 UTC
add a BPF_LD_IMM64 instruction to load a 64-bit immediate value into a register.
All previous instructions were 8-byte. This is the first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
insn[0/1].code = BPF_LD | BPF_DW | BPF_IMM
insn[0/1].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].imm = upper 32-bit

Classic BPF has a similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads a 32-bit immediate value into a register.

x64 JITs it as a single 'movabsq %rax, imm64'
arm64 may JIT it as a sequence of four 'movk x0, #imm16, lsl #shift' insns

Note that old eBPF programs are binary compatible with the new interpreter.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
Documentation/networking/filter.txt | 8 +++++++-
arch/x86/net/bpf_jit_comp.c | 9 +++++++++
include/linux/filter.h | 11 +++++++++++
kernel/bpf/core.c | 5 +++++
lib/test_bpf.c | 22 ++++++++++++++++++++++
5 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index c48a9704bda8..81916ab5d96f 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -951,7 +951,7 @@ Size modifier is one of ...

Mode modifier is one of:

- BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */
+ BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
BPF_ABS 0x20
BPF_IND 0x40
BPF_MEM 0x60
@@ -995,6 +995,12 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.

+eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
+of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single
+instruction that loads 64-bit immediate value into a dst_reg.
+Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
+32-bit immediate value into a register.
+
Testing
-------

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5c8cb8043c5a..67b666aab20e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -393,6 +393,15 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT1_off32(add_1reg(0xB8, dst_reg), imm32);
break;

+ case BPF_LD | BPF_IMM | BPF_DW:
+ /* movabsq %rax, imm64 */
+ EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+ EMIT(insn->imm, 4);
+ insn++;
+ i++;
+ EMIT(insn->imm, 4);
+ break;
+
/* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */
case BPF_ALU | BPF_MOD | BPF_X:
case BPF_ALU | BPF_DIV | BPF_X:
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5227ab8ccb1..73a6d505e729 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -161,6 +161,17 @@ enum {
.off = 0, \
.imm = IMM })

+/* use two of BPF_LD_IMM64 to encode single move 64-bit insn
+ * first macro to carry lower 32-bits and second for higher 32-bits
+ */
+#define BPF_LD_IMM64(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_DW | BPF_IMM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */

#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7f0dbcbb34af..0434c2170f2b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -180,6 +180,7 @@ static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+ [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
};
void *ptr;
int off;
@@ -239,6 +240,10 @@ select_insn:
ALU64_MOV_K:
DST = IMM;
CONT;
+ LD_IMM_DW:
+ DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
+ insn++;
+ CONT;
ALU64_ARSH_X:
(*(s64 *) &DST) >>= SRC;
CONT;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 89e0345733bd..d59444262dc0 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -1697,6 +1697,28 @@ static struct bpf_test tests[] = {
{ },
{ { 1, 0 } },
},
+ {
+ "load 64-bit immediate",
+ .u.insns_int = {
+ BPF_LD_IMM64(R1, 0x1234), /* lower 32-bit */
+ BPF_LD_IMM64(R1, 0x5678), /* higher 32-bit */
+ BPF_MOV64_REG(R2, R1),
+ BPF_MOV64_REG(R3, R2),
+ BPF_ALU64_IMM(BPF_RSH, R2, 32),
+ BPF_ALU64_IMM(BPF_LSH, R3, 32),
+ BPF_ALU64_IMM(BPF_RSH, R3, 32),
+ BPF_ALU64_IMM(BPF_MOV, R0, 0),
+ BPF_JMP_IMM(BPF_JEQ, R2, 0x5678, 1),
+ BPF_EXIT_INSN(),
+ BPF_JMP_IMM(BPF_JEQ, R3, 0x1234, 1),
+ BPF_EXIT_INSN(),
+ BPF_ALU64_IMM(BPF_MOV, R0, 1),
+ BPF_EXIT_INSN(),
+ },
+ INTERNAL,
+ { },
+ { { 0, 1 } }
+ },
};

static struct net_device dev;
--
1.7.9.5
Daniel Borkmann
2014-08-13 09:17:37 UTC
Post by Alexei Starovoitov
add BPF_LD_IMM64 instruction to load 64-bit immediate value into register.
All previous instructions were 8-byte. This is first 16-byte instruction.
insn[0/1].code = BPF_LD | BPF_DW | BPF_IMM
insn[0/1].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].imm = upper 32-bit
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.
x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
Note that old eBPF programs are binary compatible with new interpreter.
For follow-ups on this series, can you put the actual motivation
for this change from the cover letter into this commit log? As it
stands it doesn't clearly say why the change is needed. Code and
test case look good to me.
Alexei Starovoitov
2014-08-13 17:34:00 UTC
Post by Daniel Borkmann
For follow-ups on this series, can you put the actual motivation
for this change from the cover letter into this commit log as it
otherwise doesn't say anything clearly why it is needed. Code and
test case looks good to me.
ok. As you saw, the full explanation is long, so I opted for an 'it_does_this'
commit log. In the next rev I will add more reasons to this log. Sure.
Daniel Borkmann
2014-08-13 17:39:56 UTC
Post by Alexei Starovoitov
ok. As you saw the full explanation is long, so I opted for 'it_does_this'
commit log. In the next rev will add more reasons to this log. Sure.
Great, thanks.
Andy Lutomirski
2014-08-13 16:08:24 UTC
Post by Alexei Starovoitov
add BPF_LD_IMM64 instruction to load 64-bit immediate value into register.
All previous instructions were 8-byte. This is first 16-byte instruction.
insn[0/1].code = BPF_LD | BPF_DW | BPF_IMM
insn[0/1].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].imm = upper 32-bit
This might be unnecessarily difficult for fancy static analysis tools
to reason about. Would it make sense to assign two different codes
for this? For example, insn[0].code = code_for_load_low,
insns[1].code = code_for_load_high, along with a verifier check that
they come in matched pairs and that code_for_load_high isn't a jump
target?

(Something else that I find confusing about eBPF: the instruction
mnemonics are very strange. Have you considered giving them real
names? For example, load.imm.low instead of BPF_LD | BPF_DW | BPF_IMM
is easier to read and pronounce.)

--Andy
Alexei Starovoitov
2014-08-13 17:44:04 UTC
Post by Andy Lutomirski
This might be unnecessarily difficult for fancy static analysis tools
to reason about. Would it make sense to assign two different codes
for this? For example, insn[0].code = code_for_load_low,
insns[1].code = code_for_load_high, along with a verifier check that
they come in matched pairs and that code_for_load_high isn't a jump
target?
see my reply to David for the same thing. Short answer is that a
sequence of instructions (even if it is a pair of instructions like this)
is very hard to detect in the verifier and JITs.
As soon as we give the compiler two instructions instead of one,
the compiler may optimize them in fancy ways. Two loads of a
64-bit immediate with the same upper 32 bits may come out as
4 instructions: load_high, load_low, load_low, mov.
Or in some cases as a single load_low, etc.
The 64-bit immediate load has to stay a single instruction to be verifiable
and patchable easily.
One can argue: force the compiler to emit load_low and load_high
always together, but then that's exactly what I have. It's a single insn.
Post by Andy Lutomirski
(Something else that I find confusing about eBPF: the instruction
mnemonics are very strange. Have you considered giving them real
names? For example, load.imm.low instead of BPF_LD | BPF_DW | BPF_IMM
is easier to read and pronounce.)
BPF_LD | BPF_DW | BPF_IMM is not really a name. It's a macro
for cases when instructions are generated from inside the kernel.
Instruction mnemonics are not defined yet.
llvm emits assembler code like:
bpf_prog2:
ldw r1, 16(r1)
std -8(r10), r1
mov r1, 1
std -16(r10), r1
ld_64 r1, 1
mov r2, r10
addi r2, -8
call 4
jeqi r0, 0 goto .LBB1_2
ldd r1, 0(r0)
addi r1, 1
std 0(r0), r1
.LBB1_3:
mov r0, 0
ret
...
I'm open to change assembler/disassembler mnemonics.
Andy Lutomirski
2014-08-13 18:35:44 UTC
Post by Alexei Starovoitov
see my reply to David for the same thing. Short answer is that
sequence of instructions (even if it is a pair of instructions like this)
is very hard to detect in verifier and JITs.
As soon as we give compiler two instructions instead of one,
compiler may optimize them in a fancy ways. Like two loads of
64-bit immediate with upper 32-bit the same, may came out as
4 instructions: load_high, load_low, load_low, mov.
Or in some cases as single load_low, etc.
load 64-bit imm has to stay as single instruction to be verifiable
and patch-able easily.
One can argue: force compiler to emit load_low and load_hi
always together, but then that's exactly what I have. It's a single insn.
The compiler can still think of it as a single insn, though some
future compiler might not. In any case, I think that, if you use the
same code for high and for low, you need logic in the JIT that's at
least as complicated. For example, what happens if you have two
consecutive 64-bit immediate loads to the same register? Now you have
four consecutive 8-byte insn words that differ only in their immediate
values, and you need to split them correctly.
Post by Alexei Starovoitov
Post by Andy Lutomirski
(Something else that I find confusing about eBPF: the instruction
mnemonics are very strange. Have you considered giving them real
names? For example, load.imm.low instead of BPF_LD | BPF_DW | BPF_IMM
is easier to read and pronounce.)
BPF_LD | BPF_DW | BPF_IMM is not really a name. It's macro
for cases when instructions are generated from inside the kernel.
Instructions mnemonics are not defined yet.
I'm open to change assembler/disassembler mnemonics.
Ah, ok. I didn't realize that there were mnemonics at all.

--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
Alexei Starovoitov
2014-08-13 21:02:50 UTC
Post by Andy Lutomirski
The compiler can still think of it as a single insn, though, but some
future compiler might not.
I think that would be very dangerous.
compiler (user space) and kernel interpreter must have the same
understanding of ISA.
Post by Andy Lutomirski
In any case, I think that, if you use the
same code for high and for low, you need logic in the JIT that's at
least as complicated.
why do you think so? Handling of the pseudo BPF_LD_IMM64 is done
in a single patch (#11), which is one of the smallest...
Post by Andy Lutomirski
For example, what happens if you have two
consecutive 64-bit immediate loads to the same register? Now you have
four consecutive 8-byte insn words that differ only in their immediate
values, and you need to split them correctly.
I don't need to do anything special in this case.
Two 16-byte instructions back to back are not a problem.
The interpreter and JIT don't care whether they move the same or different
immediates into the same or different registers. The interpreter and JITs
are dumb on purpose.
When the verifier sees two back-to-back ld_imm64 insns, the 2nd will simply
overwrite the value loaded by the first one. It's not any different from
two back-to-back 'mov dst_reg, imm32' instructions.
H. Peter Anvin
2014-08-13 21:16:35 UTC
Post by Alexei Starovoitov
Post by Andy Lutomirski
The compiler can still think of it as a single insn, though, but some
future compiler might not.
I think that would be very dangerous.
compiler (user space) and kernel interpreter must have the same
understanding of ISA.
Only at the point of the interface layer. The compiler can treat it as
a single instruction internally, the JIT can do peephole optimization,
but as long as the instruction stream at the boundary matches the
official ISA spec everything is fine.

-hpa
Andy Lutomirski
2014-08-13 21:17:24 UTC
Post by Alexei Starovoitov
Post by Andy Lutomirski
The compiler can still think of it as a single insn, though, but some
future compiler might not.
I think that would be very dangerous.
compiler (user space) and kernel interpreter must have the same
understanding of ISA.
Post by Andy Lutomirski
In any case, I think that, if you use the
same code for high and for low, you need logic in the JIT that's at
least as complicated.
why do you think so? Handling of pseudo BPF_LD_IMM64 is done
in single patch #11 which is one of the smallest...
Post by Andy Lutomirski
For example, what happens if you have two
consecutive 64-bit immediate loads to the same register? Now you have
four consecutive 8-byte insn words that differ only in their immediate
values, and you need to split them correctly.
I don't need to do anything special in this case.
Two 16-byte instructions back to back is not a problem.
Interpreter or JIT don't care whether they move the same or different
immediates into the same or different register. Interpreter and JITs
are dumb on purpose.
when verifier sees two back to back ld_imm64, the 2nd will simply
override the value loaded by first one. It's not any different than
two back to back 'mov dst_reg, imm32' instructions.
But this patch makes the JIT code (and any interpreter) weirdly
stateful. You have:

+ case BPF_LD | BPF_IMM | BPF_DW:
+ /* movabsq %rax, imm64 */
+ EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+ EMIT(insn->imm, 4);
+ insn++;
+ i++;
+ EMIT(insn->imm, 4);
+ break;

If you have more than two BPF_LD | BPF_IMM | BPF_DW instructions in a
row, then the way in which they pair up depends on where you start.

I think it would be a lot clearer if you made these be "load low" and
"load high", with JIT code like:

+ case BPF_LOAD_LOW:
+ /* movabsq %rax, imm64 */
+ if (next insn is BPF_LOAD_HIGH) {
+ EMIT2(add_1mod(0x48, dst_reg),
add_1reg(0xB8, dst_reg));
+ EMIT(insn->imm, 4);
+ insn++;
+ i++;
+ EMIT(insn->imm, 4);
+ } else {
+ emit a real load low;
+ }
+ break;

(and you'd have to deal with whether load low by itself is illegal,
zero extends, sign extends, or preserves high bits).

Alternatively, and possibly better, you could have a real encoding for
multiword instructions. Reserve a bit in the opcode to mark a
continuation of the previous instruction, and do:

+ case BPF_LD | BPF_IMM | BPF_DW:
+ assert(insn[1] in bounds && insn[1].code == BPF_CONT);
+ /* movabsq %rax, imm64 */
+ EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+ EMIT(insn->imm, 4);
+ insn++;
+ i++;
+ EMIT(insn->imm, 4);
+ break;

This has a nice benefit for future-proofing: it gives you 119 bits of
payload for 16-byte instructions.

On the other hand, a u8 for the opcode is kind of small, and killing
half of that space like this is probably bad. Maybe reserve two high
bits, with:

0: normal opcode or start of a multiword sequence
1: continuation of a multiword sequence
2, 3: reserved for future longer opcode numbers (e.g. 2 could indicate
that "code" is actually 16 bits)

--Andy
H. Peter Anvin
2014-08-13 21:21:25 UTC
One thing about this that may be a serious concern: allowing the user to
control 8 contiguous bytes of kernel memory may be a security hazard.

-hpa
Andy Lutomirski
2014-08-13 21:23:31 UTC
Post by H. Peter Anvin
One thing about this that may be a serious concern: allowing the user to
control 8 contiguous bytes of kernel memory may be a security hazard.
I'm confused. What kind of memory? I can control a lot more than 8
bytes of stack very easily.

Or are you concerned about 8 contiguous bytes of *executable* memory?

--Andy
H. Peter Anvin
2014-08-13 21:27:36 UTC
Post by Andy Lutomirski
Post by H. Peter Anvin
One thing about this that may be a serious concern: allowing the user to
control 8 contiguous bytes of kernel memory may be a security hazard.
I'm confused. What kind of memory? I can control a lot more than 8
bytes of stack very easily.
Or are you concerned about 8 contiguous bytes of *executable* memory?
Yes. Useful for some kinds of ROP custom gadgets.

-hpa
Alexei Starovoitov
2014-08-13 21:38:23 UTC
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by H. Peter Anvin
One thing about this that may be a serious concern: allowing the user to
control 8 contiguous bytes of kernel memory may be a security hazard.
I'm confused. What kind of memory? I can control a lot more than 8
bytes of stack very easily.
Or are you concerned about 8 contiguous bytes of *executable* memory?
Yes. Useful for some kinds of ROP custom gadgets.
I don't get it. What is ROP?
What is the concern about 8 bytes?
Alexei Starovoitov
2014-08-13 21:56:44 UTC
Post by Alexei Starovoitov
I don't get it. What is ROP ?
What is the concern about 8 bytes ?
looked it up. too many abbreviations nowadays.
x64 JIT spraying was fixed by Eric some time ago, so a JIT emitting
movabsq doesn't increase the attack surface. Various movs of 32-bit
immediates can be used for a 'custom gadget' just as well.
Worst case, JIT won't be enabled.
In classic BPF we allow junk to be stored in unused fields of
'struct sock_filter' and so far that wasn't a problem.
eBPF is more paranoid regarding verification.
Andy Lutomirski
2014-08-13 21:41:33 UTC
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by H. Peter Anvin
One thing about this that may be a serious concern: allowing the user to
control 8 contiguous bytes of kernel memory may be a security hazard.
I'm confused. What kind of memory? I can control a lot more than 8
bytes of stack very easily.
Or are you concerned about 8 contiguous bytes of *executable* memory?
Yes. Useful for some kinds of ROP custom gadgets.
Hmm.

I think this is moot on non-SMEP machines. And I'm not entirely
convinced that it's worth worrying about in general, especially if we
take some care to randomize the location of the JIT mapping.

But yes, gadgets like jumps relative to gs or something along those
lines could make for interesting ROP tools. But someone will probably
figure out how to turn JIT output into a NOP slide + ROP gadget
regardless, at least on x86.

--Andy
Alexei Starovoitov
2014-08-13 21:43:34 UTC
Post by Andy Lutomirski
I think this is moot on non-SMEP machines. And I'm not entirely
convinced that it's worth worrying about in general, especially if we
take some care to randomize the location of the JIT mapping.
JIT start address is already randomized...
Alexei Starovoitov
2014-08-13 21:37:13 UTC
Post by Andy Lutomirski
But this patch makes the JIT code (and any interpreter) weirdly
stateful. You have:
+ /* movabsq %rax, imm64 */
+ EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+ EMIT(insn->imm, 4);
+ insn++;
+ i++;
+ EMIT(insn->imm, 4);
+ break;
If you have more than two BPF_LD | BPF_IMM | BPF_DW instructions in a
row, then the way in which they pair up depends on where you start.
For the JIT it's not a problem, since it does a linear scan, so it always
starts at an instruction boundary.
But thinking about it further, you're right that it's a bug in the verifier.
I've tried it, and indeed, depending on the type of branch, the verifier
doesn't catch the case of two 16-byte instructions back to back where a jump
goes into the 2nd half of the 1st insn. I need to fix that.
Post by Andy Lutomirski
I think it would be a lot clearer if you made these be "load low" and
+ /* movabsq %rax, imm64 */
+ if (next insn is BPF_LOAD_HIGH) {
Such an 'if' would be costly in the interpreter. I want to avoid it.
Post by Andy Lutomirski
(and you'd have to deal with whether load low by itself is illegal,
zero extends, sign extends, or preserves high bits).
I don't need an instruction that loads the low 32 bits. It already exists.
It's called 'mov'.
I'm going to try this encoding:
insn[0].code = LD | IMM | DW
insn[1].code = 0
Zero is an invalid opcode, so it serves as your 'continuation'.
And it is still a single 16-byte instruction without any interpreter overhead.
Post by Andy Lutomirski
This has a nice benefit for future-proofing: it gives you 119 bits of
payload for 16-byte instructions.
It's already future-proofed. We can add 24-byte instructions and so on
just as well. There is no point in reserving 119 bits when no one is using
them.
Post by Andy Lutomirski
On the other hand, a u8 for the opcode is kind of small, and killing
half of that space like this is probably bad. Maybe reserve two high
That's overkill. We use ~80 opcodes out of 256.
There is plenty of room. I see no reason to switch to 16-bit opcodes
until we get even close to half of the u8 space.
It feels that we're starting to bikeshed.
Andy Lutomirski
2014-08-13 21:38:49 UTC
Permalink
Post by Alexei Starovoitov
I don't need an instruction that loads low 32-bit. It already exists.
It's called 'mov'.
insn[0].code = LD | IMM | DW
insn[1].code = 0
zero is invalid opcode, so it's your 'continuation'.
and it is still single 16-byte instructions without any interpreter overhead.
Works for me.

--Andy
Alexei Starovoitov
2014-08-13 07:57:14 UTC
Permalink
The BPF syscall is a demux for different BPF-related commands.

'maps' are generic storage of different types for sharing data between kernel
and userspace.

The maps can be created from user space via BPF syscall:
- create a map with given type and attributes
fd = bpf_map_create(map_type, struct nlattr *attr, int len)
returns fd or negative error

- close(fd) deletes the map

Next patch allows userspace programs to populate/read maps that eBPF programs
are concurrently updating.

maps can have different types: hash, bloom filter, radix-tree, etc.

The map is defined by:
. type
. max number of elements
. key size in bytes
. value size in bytes

This patch establishes core infrastructure for BPF maps.
Next patches implement lookup/update and hashtable type.
More map types can be added in the future.

The syscall uses a type-length-value style of passing arguments to be backwards
compatible with future extensions to map attributes. Different map types may
use different attributes as well.
The concept of type-length-value is borrowed from netlink, but netlink itself
is not applicable here, since BPF programs and maps can be used in NET-less
configurations.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
Documentation/networking/filter.txt | 71 ++++++++++++++++
include/linux/bpf.h | 42 ++++++++++
include/uapi/linux/bpf.h | 24 ++++++
kernel/bpf/Makefile | 2 +-
kernel/bpf/syscall.c | 156 +++++++++++++++++++++++++++++++++++
5 files changed, 294 insertions(+), 1 deletion(-)
create mode 100644 include/linux/bpf.h
create mode 100644 kernel/bpf/syscall.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 81916ab5d96f..27a0a6c6acb4 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1001,6 +1001,77 @@ instruction that loads 64-bit immediate value into a dst_reg.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
32-bit immediate value into a register.

+eBPF maps
+---------
+'maps' are generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via the BPF syscall, which has these commands:
+- create a map with given type and attributes
+ map_fd = bpf_map_create(map_type, struct nlattr *attr, int len)
+ returns process-local file descriptor or negative error
+
+- lookup key in a given map
+ err = bpf_map_lookup_elem(int fd, void *key, void *value)
+ returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+ err = bpf_map_update_elem(int fd, void *key, void *value)
+ returns zero or negative error
+
+- find and delete element by key in a given map
+ err = bpf_map_delete_elem(int fd, void *key)
+
+- to delete map: close(fd)
+ Exiting process will delete maps automatically
+
+userspace programs use this API to create/populate/read maps that eBPF programs
+are concurrently updating.
+
+maps can have different types: hash, array, bloom filter, radix-tree, etc.
+
+The map is defined by:
+ . type
+ . max number of elements
+ . key size in bytes
+ . value size in bytes
+
+The maps are accessible from an eBPF program with the API:
+ void * bpf_map_lookup_elem(u32 map_fd, void *key);
+ int bpf_map_update_elem(u32 map_fd, void *key, void *value);
+ int bpf_map_delete_elem(u32 map_fd, void *key);
+
+The kernel replaces the process-local map_fd with an internal map pointer
+while loading the eBPF program.
+
+If the eBPF verifier is configured to recognize the extra calls bpf_map_lookup_elem()
+and bpf_map_update_elem() in the program, then access to maps looks like:
+ ...
+ ptr_to_value = bpf_map_lookup_elem(map_fd, key)
+ access memory range [ptr_to_value, ptr_to_value + value_size_in_bytes)
+ ...
+ prepare key2 and value2 on stack of key_size and value_size
+ err = bpf_map_update_elem(map_fd, key2, value2)
+ ...
+
+eBPF programs cannot create or delete maps
+(such calls will be unknown to the verifier)
+
+During program loading the refcnt of used maps is incremented, so they don't get
+deleted while the program is running
+
+bpf_map_update_elem() can fail if the maximum number of elements is reached.
+If key2 already exists, bpf_map_update_elem() replaces it with value2 atomically
+
+bpf_map_lookup_elem() returns NULL or ptr_to_value, so the program must do an
+if (ptr_to_value != NULL) check before accessing it.
+NULL means that the element with the given 'key' was not found.
+
+The verifier will check that the program accesses map elements within the
+specified size. It will not let programs pass junk values to bpf_map_*_elem() functions,
+so these functions (implemented in C inside kernel) can safely access
+the pointers in all cases.
+
Testing
-------

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 000000000000..607ca53fe2af
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,42 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_H
+#define _LINUX_BPF_H 1
+
+#include <uapi/linux/bpf.h>
+#include <linux/workqueue.h>
+
+struct bpf_map;
+struct nlattr;
+
+/* map is generic key/value storage optionally accessible by eBPF programs */
+struct bpf_map_ops {
+ /* funcs callable from userspace (via syscall) */
+ struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
+ void (*map_free)(struct bpf_map *);
+};
+
+struct bpf_map {
+ atomic_t refcnt;
+ enum bpf_map_type map_type;
+ u32 key_size;
+ u32 value_size;
+ u32 max_entries;
+ struct bpf_map_ops *ops;
+ struct work_struct work;
+};
+
+struct bpf_map_type_list {
+ struct list_head list_node;
+ struct bpf_map_ops *ops;
+ enum bpf_map_type type;
+};
+
+void bpf_register_map_type(struct bpf_map_type_list *tl);
+void bpf_map_put(struct bpf_map *map);
+
+#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6f6e10875e95..88b703d59b8c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -311,4 +311,28 @@ struct bpf_insn {
__s32 imm; /* signed immediate constant */
};

+/* BPF syscall commands */
+enum bpf_cmd {
+ /* create a map with given type and attributes
+ * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len)
+ * returns fd or negative error
+ * map is deleted when fd is closed
+ */
+ BPF_MAP_CREATE,
+};
+
+enum bpf_map_attributes {
+ BPF_MAP_UNSPEC,
+ BPF_MAP_KEY_SIZE, /* size of key in bytes */
+ BPF_MAP_VALUE_SIZE, /* size of value in bytes */
+ BPF_MAP_MAX_ENTRIES, /* maximum number of entries in a map */
+ __BPF_MAP_ATTR_MAX,
+};
+#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
+#define BPF_MAP_MAX_ATTR_SIZE 65535
+
+enum bpf_map_type {
+ BPF_MAP_TYPE_UNSPEC,
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a71145e2769..e9f7334ed07a 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o
+obj-y := core.o syscall.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
new file mode 100644
index 000000000000..04cdf7948f8f
--- /dev/null
+++ b/kernel/bpf/syscall.c
@@ -0,0 +1,156 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/syscalls.h>
+#include <net/netlink.h>
+#include <linux/anon_inodes.h>
+
+static LIST_HEAD(bpf_map_types);
+
+static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
+ struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
+{
+ struct bpf_map_type_list *tl;
+ struct bpf_map *map;
+
+ list_for_each_entry(tl, &bpf_map_types, list_node) {
+ if (tl->type == type) {
+ map = tl->ops->map_alloc(tb);
+ if (IS_ERR(map))
+ return map;
+ map->ops = tl->ops;
+ map->map_type = type;
+ return map;
+ }
+ }
+ return ERR_PTR(-EINVAL);
+}
+
+/* boot time registration of different map implementations */
+void bpf_register_map_type(struct bpf_map_type_list *tl)
+{
+ list_add(&tl->list_node, &bpf_map_types);
+}
+
+/* called from workqueue */
+static void bpf_map_free_deferred(struct work_struct *work)
+{
+ struct bpf_map *map = container_of(work, struct bpf_map, work);
+
+ /* implementation dependent freeing */
+ map->ops->map_free(map);
+}
+
+/* decrement map refcnt and schedule it for freeing via workqueue
+ * (underlying map implementation's ops->map_free() might sleep)
+ */
+void bpf_map_put(struct bpf_map *map)
+{
+ if (atomic_dec_and_test(&map->refcnt)) {
+ INIT_WORK(&map->work, bpf_map_free_deferred);
+ schedule_work(&map->work);
+ }
+}
+
+static int bpf_map_release(struct inode *inode, struct file *filp)
+{
+ struct bpf_map *map = filp->private_data;
+
+ bpf_map_put(map);
+ return 0;
+}
+
+static const struct file_operations bpf_map_fops = {
+ .release = bpf_map_release,
+};
+
+static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
+ [BPF_MAP_KEY_SIZE] = { .type = NLA_U32 },
+ [BPF_MAP_VALUE_SIZE] = { .type = NLA_U32 },
+ [BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
+};
+
+/* called via syscall */
+static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len)
+{
+ struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
+ struct bpf_map *map;
+ struct nlattr *attr;
+ int err;
+
+ if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
+ return -EINVAL;
+
+ attr = kmalloc(len, GFP_USER);
+ if (!attr)
+ return -ENOMEM;
+
+ /* copy map attributes from user space */
+ err = -EFAULT;
+ if (copy_from_user(attr, uattr, len) != 0)
+ goto free_attr;
+
+ /* perform basic validation */
+ err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
+ if (err < 0)
+ goto free_attr;
+
+ /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+ map = find_and_alloc_map(type, tb);
+ if (IS_ERR(map)) {
+ err = PTR_ERR(map);
+ goto free_attr;
+ }
+
+ atomic_set(&map->refcnt, 1);
+
+ err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
+
+ if (err < 0)
+ /* failed to allocate fd */
+ goto free_map;
+
+ /* user supplied array of map attributes is no longer needed */
+ kfree(attr);
+
+ return err;
+
+free_map:
+ map->ops->map_free(map);
+free_attr:
+ kfree(attr);
+ return err;
+}
+
+SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
+ unsigned long, arg4, unsigned long, arg5)
+{
+ /* eBPF syscall is limited to root temporarily. This restriction will
+ * be lifted when verifier has enough mileage and security audit is
+ * clean. Note that tracing/networking analytics use cases will be
+ * turning off 'secure' mode of verifier, since they need to pass
+ * kernel data back to user space
+ */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (arg5 != 0)
+ return -EINVAL;
+
+ switch (cmd) {
+ case BPF_MAP_CREATE:
+ return map_create((enum bpf_map_type) arg2,
+ (struct nlattr __user *) arg3, (int) arg4);
+ default:
+ return -EINVAL;
+ }
+}
--
1.7.9.5
Brendan Gregg
2014-08-14 22:28:35 UTC
Permalink
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org> wrote:
[...]
Post by Alexei Starovoitov
maps can have different types: hash, bloom filter, radix-tree, etc.
. type
. max number of elements
. key size in bytes
. value size in bytes
Can values be strings or byte arrays? How would user-level bpf read
them? The two types of use I'm thinking of are:

A. Constructing a custom string in kernel-context, and using that as
the value. Eg, a truncated filename, or a dotted quad IP address, or
the raw contents of a packet.
B. I have a pointer to an existing buffer or string, eg a filename,
that will likely be around for some time (>1s). Instead of the value
storing the string, it could just be a ptr, so long as user-level bpf
has a way to read it.

Also, can keys be strings? I'd ask about multiple keys, but if they
can be a string, I can delimit in the key (eg, "PID:filename").
Thanks,

Brendan
--
http://www.brendangregg.com
Alexei Starovoitov
2014-08-15 06:40:53 UTC
Permalink
On Thu, Aug 14, 2014 at 3:28 PM, Brendan Gregg
Post by Brendan Gregg
[...]
Post by Alexei Starovoitov
maps can have different types: hash, bloom filter, radix-tree, etc.
. type
. max number of elements
. key size in bytes
. value size in bytes
Can values be strings or byte arrays? How would user-level bpf read
A. Constructing a custom string in kernel-context, and using that as
the value. Eg, a truncated filename, or a dotted quad IP address, or
the raw contents of a packet.
B. I have a pointer to an existing buffer or string, eg a filename,
that will likely be around for some time (>1s). Instead of the value
storing the string, it could just be a ptr, so long as user-level bpf
has a way to read it.
Also, can keys be strings? I'd ask about multiple keys, but if they
can be a string, I can delimit in the key (eg, "PID:filename").
Both map keys and values are opaque byte arrays. eBPF program
can decide to store strings in there. Or concatenate multiple
strings as long as sizes are bounded.
High-level scripting languages dazzle with native string support,
but I'm trying to stay away from that in the kernel.
Scripting languages should be able to convert string operations
into the low-level eBPF primitives that are being worked on.
So far I've been able to use ids and pointers and concatenations
of binary things as keys and values, and have user space interpret
them. I agree that having a script that does map[probe_name()]++
is definitely more human readable than storing probe ip into
ebpf map and converting addresses to names in userspace.
I'm hoping that the urge to make a cool scripting language will push
somebody to write a dtrace/ktap/stap language compiler targeting eBPF :)
That will also address your concern about embedded setups where
full llvm is too big but a dtrace-to-eBPF compiler may be just right.
At the same time people who care about last bit of performance
will be using C and llvm or ebpf assembler directly.
Anyway, I will share string-related eBPF helpers soon (not in V5 though).
Alexei Starovoitov
2014-08-13 07:57:17 UTC
Permalink
add a new map type: BPF_MAP_TYPE_HASH
and its simple (not auto-resizable) hash table implementation

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/uapi/linux/bpf.h | 1 +
kernel/bpf/Makefile | 2 +-
kernel/bpf/hashtab.c | 372 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 374 insertions(+), 1 deletion(-)
create mode 100644 kernel/bpf/hashtab.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 804dd8b2ca19..828e873fa435 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -358,6 +358,7 @@ enum bpf_map_attributes {

enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC,
+ BPF_MAP_TYPE_HASH,
};

#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e9f7334ed07a..558e12712ebc 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o
+obj-y := core.o syscall.o hashtab.o
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index 000000000000..bc8d32f0f720
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,372 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <linux/jhash.h>
+
+struct bpf_htab {
+ struct bpf_map map;
+ struct hlist_head *buckets;
+ struct kmem_cache *elem_cache;
+ spinlock_t lock;
+ u32 count; /* number of elements in this hashtable */
+ u32 n_buckets; /* number of hash buckets */
+ u32 elem_size; /* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+ struct hlist_node hash_node;
+ struct rcu_head rcu;
+ struct bpf_htab *htab;
+ u32 hash;
+ u32 pad;
+ char key[0];
+};
+
+#define HASH_MAX_BUCKETS 1024
+#define BPF_MAP_MAX_KEY_SIZE 256
+static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
+{
+ struct bpf_htab *htab;
+ int err, i;
+
+ htab = kzalloc(sizeof(*htab), GFP_USER);
+ if (!htab)
+ return ERR_PTR(-ENOMEM);
+
+ /* look for mandatory map attributes */
+ err = -EINVAL;
+ if (!attr[BPF_MAP_KEY_SIZE])
+ goto free_htab;
+ htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
+
+ if (!attr[BPF_MAP_VALUE_SIZE])
+ goto free_htab;
+ htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
+
+ if (!attr[BPF_MAP_MAX_ENTRIES])
+ goto free_htab;
+ htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
+
+ htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
+ htab->map.max_entries : HASH_MAX_BUCKETS;
+
+ /* hash table size must be power of 2 */
+ if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
+ goto free_htab;
+
+ err = -E2BIG;
+ if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
+ goto free_htab;
+
+ err = -ENOMEM;
+ htab->buckets = kmalloc_array(htab->n_buckets,
+ sizeof(struct hlist_head), GFP_USER);
+
+ if (!htab->buckets)
+ goto free_htab;
+
+ for (i = 0; i < htab->n_buckets; i++)
+ INIT_HLIST_HEAD(&htab->buckets[i]);
+
+ spin_lock_init(&htab->lock);
+ htab->count = 0;
+
+ htab->elem_size = sizeof(struct htab_elem) +
+ round_up(htab->map.key_size, 8) +
+ htab->map.value_size;
+
+ htab->elem_cache = kmem_cache_create("bpf_htab", htab->elem_size, 0, 0,
+ NULL);
+ if (!htab->elem_cache)
+ goto free_buckets;
+
+ return &htab->map;
+
+free_buckets:
+ kfree(htab->buckets);
+free_htab:
+ kfree(htab);
+ return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+ return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+ return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+ void *key, u32 key_size)
+{
+ struct htab_elem *l;
+
+ hlist_for_each_entry_rcu(l, head, hash_node) {
+ if (l->hash == hash && !memcmp(&l->key, key, key_size))
+ return l;
+ }
+ return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_head *head;
+ struct htab_elem *l;
+ u32 hash, key_size;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ head = select_bucket(htab, hash);
+
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (l)
+ return l->key + round_up(map->key_size, 8);
+ else
+ return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_head *head;
+ struct htab_elem *l, *next_l;
+ u32 hash, key_size;
+ int i;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ head = select_bucket(htab, hash);
+
+ /* lookup the key */
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (!l) {
+ i = 0;
+ goto find_first_elem;
+ }
+
+ /* key was found, get next key in the same bucket */
+ next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+ struct htab_elem, hash_node);
+
+ if (next_l) {
+ /* if next elem in this hash list is non-zero, just return it */
+ memcpy(next_key, next_l->key, key_size);
+ return 0;
+ } else {
+ /* no more elements in this hash list, go to the next bucket */
+ i = hash & (htab->n_buckets - 1);
+ i++;
+ }
+
+find_first_elem:
+ /* iterate over buckets */
+ for (; i < htab->n_buckets; i++) {
+ head = select_bucket(htab, i);
+
+ /* pick first element in the bucket */
+ next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
+ struct htab_elem, hash_node);
+ if (next_l) {
+ /* if it's not empty, just return it */
+ memcpy(next_key, next_l->key, key_size);
+ return 0;
+ }
+ }
+
+	/* iterated over all buckets and all elements */
+ return -ENOENT;
+}
+
+static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
+{
+ void *l;
+
+ l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
+ if (!l)
+ return ERR_PTR(-ENOMEM);
+ return l;
+}
+
+static void free_htab_elem_rcu(struct rcu_head *rcu)
+{
+ struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
+
+ kmem_cache_free(l->htab->elem_cache, l);
+}
+
+static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+ l->htab = htab;
+ call_rcu(&l->rcu, free_htab_elem_rcu);
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct htab_elem *l_new, *l_old;
+ struct hlist_head *head;
+ unsigned long flags;
+ u32 key_size;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ l_new = htab_alloc_elem(htab);
+ if (IS_ERR(l_new))
+ return -ENOMEM;
+
+ key_size = map->key_size;
+
+ memcpy(l_new->key, key, key_size);
+ memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
+
+ l_new->hash = htab_map_hash(l_new->key, key_size);
+
+ /* bpf_map_update_elem() can be called in_irq() as well, so
+ * spin_lock() or spin_lock_bh() cannot be used
+ */
+ spin_lock_irqsave(&htab->lock, flags);
+
+ head = select_bucket(htab, l_new->hash);
+
+ l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+
+ if (!l_old && unlikely(htab->count >= map->max_entries)) {
+ /* if elem with this 'key' doesn't exist and we've reached
+ * max_entries limit, fail insertion of new elem
+ */
+ spin_unlock_irqrestore(&htab->lock, flags);
+ kmem_cache_free(htab->elem_cache, l_new);
+ return -EFBIG;
+ }
+
+ /* add new element to the head of the list, so that concurrent
+ * search will find it before old elem
+ */
+ hlist_add_head_rcu(&l_new->hash_node, head);
+ if (l_old) {
+ hlist_del_rcu(&l_old->hash_node);
+ release_htab_elem(htab, l_old);
+ } else {
+ htab->count++;
+ }
+ spin_unlock_irqrestore(&htab->lock, flags);
+
+ return 0;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_delete_elem(struct bpf_map *map, void *key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_head *head;
+ struct htab_elem *l;
+ unsigned long flags;
+ u32 hash, key_size;
+ int ret = -ESRCH;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ spin_lock_irqsave(&htab->lock, flags);
+
+ head = select_bucket(htab, hash);
+
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (l) {
+ hlist_del_rcu(&l->hash_node);
+ htab->count--;
+ release_htab_elem(htab, l);
+ ret = 0;
+ }
+
+ spin_unlock_irqrestore(&htab->lock, flags);
+ return ret;
+}
+
+static void delete_all_elements(struct bpf_htab *htab)
+{
+ int i;
+
+ for (i = 0; i < htab->n_buckets; i++) {
+ struct hlist_head *head = select_bucket(htab, i);
+ struct hlist_node *n;
+ struct htab_elem *l;
+
+ hlist_for_each_entry_safe(l, n, head, hash_node) {
+ hlist_del_rcu(&l->hash_node);
+ htab->count--;
+ kmem_cache_free(htab->elem_cache, l);
+ }
+ }
+}
+
+/* called when map->refcnt goes to zero */
+static void htab_map_free(struct bpf_map *map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+ /* wait for all outstanding updates to complete */
+ synchronize_rcu();
+
+ /* kmem_cache_free all htab elements */
+ delete_all_elements(htab);
+
+ /* and destroy cache, which might sleep */
+ kmem_cache_destroy(htab->elem_cache);
+
+ kfree(htab->buckets);
+ kfree(htab);
+}
+
+static struct bpf_map_ops htab_ops = {
+ .map_alloc = htab_map_alloc,
+ .map_free = htab_map_free,
+ .map_get_next_key = htab_map_get_next_key,
+ .map_lookup_elem = htab_map_lookup_elem,
+ .map_update_elem = htab_map_update_elem,
+ .map_delete_elem = htab_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+ .ops = &htab_ops,
+ .type = BPF_MAP_TYPE_HASH,
+};
+
+static int __init register_htab_map(void)
+{
+ bpf_register_map_type(&tl);
+ return 0;
+}
+late_initcall(register_htab_map);
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:22 UTC
Permalink
eBPF programs passed from userspace use pseudo BPF_LD_IMM64 instructions
to refer to process-local map_fds. Scan the program for such instructions and,
if the FDs are valid, convert them to 'struct bpf_map' pointers, which the
verifier will use to check access to maps in bpf_map_lookup/update() calls.
If the program passes the verifier, convert the pseudo BPF_LD_IMM64 into a
generic one by dropping the BPF_PSEUDO_MAP_FD flag.

Note that the eBPF interpreter is generic and knows nothing about pseudo insns.

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---
include/uapi/linux/bpf.h | 12 ++++
kernel/bpf/verifier.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 158 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 72bed3950bf1..68822671ab5e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -169,6 +169,18 @@ enum {
.off = 0, \
.imm = IMM })

+#define BPF_PSEUDO_MAP_FD 1
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_DW | BPF_IMM, \
+ .dst_reg = DST, \
+ .src_reg = BPF_PSEUDO_MAP_FD, \
+ .off = 0, \
+ .imm = MAP_FD }), \
+ BPF_LD_IMM64(DST, 0)
+
/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */

#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 27b34e1c8fbf..f71d4c494aec 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -143,10 +143,15 @@
* load/store to bpf_context are checked against known fields
*/

+#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
+
/* single container for all structs
* one verifier_env per bpf_check() call
*/
struct verifier_env {
+ struct bpf_prog *prog; /* eBPF program being verified */
+ struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */
+ u32 used_map_cnt; /* number of used maps */
};

/* verbose verifier prints what it's seeing
@@ -318,6 +323,114 @@ static void print_bpf_insn(struct bpf_insn *insn)
}
}

+/* return the map pointer stored inside BPF_LD_IMM64 instruction */
+static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn)
+{
+ u64 imm64 = ((u64) (u32) insn[0].imm) | ((u64) (u32) insn[1].imm) << 32;
+
+ return (struct bpf_map *) (unsigned long) imm64;
+}
+
+/* look for pseudo eBPF instructions that access map FDs and
+ * replace them with actual map pointers
+ */
+static int replace_map_fd_with_map_ptr(struct verifier_env *env)
+{
+ struct bpf_insn *insn = env->prog->insnsi;
+ int insn_cnt = env->prog->len;
+ int i, j;
+
+ for (i = 0; i < insn_cnt; i++, insn++) {
+ if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
+ struct bpf_map *map;
+ struct fd f;
+
+ if (i == insn_cnt - 1 ||
+ insn[1].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+ verbose("invalid bpf_ld_imm64 insn\n");
+ return -EINVAL;
+ }
+
+ if (insn->src_reg == 0)
+ /* valid generic load 64-bit imm */
+ goto next_insn;
+
+ if (insn->src_reg != BPF_PSEUDO_MAP_FD) {
+ verbose("unrecognized bpf_ld_imm64 insn\n");
+ return -EINVAL;
+ }
+
+ f = fdget(insn->imm);
+
+ map = bpf_map_get(f);
+ if (IS_ERR(map)) {
+ verbose("fd %d is not pointing to valid bpf_map\n",
+ insn->imm);
+ fdput(f);
+ return PTR_ERR(map);
+ }
+
+ /* store map pointer inside BPF_LD_IMM64 instruction */
+ insn[0].imm = (u32) (unsigned long) map;
+ insn[1].imm = ((u64) (unsigned long) map) >> 32;
+
+ /* check whether we recorded this map already */
+ for (j = 0; j < env->used_map_cnt; j++)
+ if (env->used_maps[j] == map) {
+ fdput(f);
+ goto next_insn;
+ }
+
+ if (env->used_map_cnt >= MAX_USED_MAPS) {
+ fdput(f);
+ return -E2BIG;
+ }
+
+ /* remember this map */
+ env->used_maps[env->used_map_cnt++] = map;
+
+ /* hold the map. If the program is rejected by verifier,
+ * the map will be released by release_maps() or it
+ * will be used by the valid program until it's unloaded
+ * and all maps are released in free_bpf_prog_info()
+ */
+ atomic_inc(&map->refcnt);
+
+ fdput(f);
+next_insn:
+ insn++;
+ i++;
+ }
+ }
+
+ /* now all pseudo BPF_LD_IMM64 instructions load valid
+ * 'struct bpf_map *' into a register instead of user map_fd.
+ * These pointers will be used later by verifier to validate map access.
+ */
+ return 0;
+}
+
+/* drop refcnt of maps used by the rejected program */
+static void release_maps(struct verifier_env *env)
+{
+ int i;
+
+ for (i = 0; i < env->used_map_cnt; i++)
+ bpf_map_put(env->used_maps[i]);
+}
+
+/* convert pseudo BPF_LD_IMM64 into generic BPF_LD_IMM64 */
+static void convert_pseudo_ld_imm64(struct verifier_env *env)
+{
+ struct bpf_insn *insn = env->prog->insnsi;
+ int insn_cnt = env->prog->len;
+ int i;
+
+ for (i = 0; i < insn_cnt; i++, insn++)
+ if (insn->code == (BPF_LD | BPF_IMM | BPF_DW))
+ insn->src_reg = 0;
+}
+
int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
{
void __user *log_ubuf = NULL;
@@ -334,6 +447,8 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
if (!env)
return -ENOMEM;

+ env->prog = prog;
+
/* grab the mutex to protect few globals used by verifier */
mutex_lock(&bpf_verifier_lock);

@@ -361,8 +476,14 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
log_level = 0;
}

+ ret = replace_map_fd_with_map_ptr(env);
+ if (ret < 0)
+ goto skip_full_check;
+
/* ret = do_check(env); */

+skip_full_check:
+
if (log_level && log_len >= log_size - 1) {
BUG_ON(log_len >= log_size);
/* verifier log exceeded user supplied buffer */
@@ -376,11 +497,36 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
goto free_log_buf;
}

+ if (ret == 0 && env->used_map_cnt) {
+ /* if program passed verifier, update used_maps in bpf_prog_info */
+ prog->info->used_maps = kmalloc_array(env->used_map_cnt,
+ sizeof(env->used_maps[0]),
+ GFP_KERNEL);
+
+ if (!prog->info->used_maps) {
+ ret = -ENOMEM;
+ goto free_log_buf;
+ }
+
+ memcpy(prog->info->used_maps, env->used_maps,
+ sizeof(env->used_maps[0]) * env->used_map_cnt);
+ prog->info->used_map_cnt = env->used_map_cnt;
+
+ /* program is valid. Convert pseudo bpf_ld_imm64 into generic
+ * bpf_ld_imm64 instructions
+ */
+ convert_pseudo_ld_imm64(env);
+ }

free_log_buf:
if (log_level)
vfree(log_buf);
free_env:
+ if (!prog->info->used_maps)
+ /* if we didn't copy map pointers into bpf_prog_info, release
+ * them now. Otherwise free_bpf_prog_info() will release them.
+ */
+ release_maps(env);
kfree(env);
mutex_unlock(&bpf_verifier_lock);
return ret;
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:19 UTC
In native eBPF programs, userspace uses pseudo BPF_CALL instructions
which encode one of 'enum bpf_func_id' inside the insn->imm field.
The verifier checks that the program passes correct function arguments for
the given func_id. If all checks pass, the kernel needs to fix up the
BPF_CALL->imm fields by replacing the func_id with an in-kernel function
pointer. The eBPF interpreter then just calls the function.

In-kernel eBPF users continue to use generic BPF_CALL.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
kernel/bpf/syscall.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4c5f5169f6fc..5a336af61858 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -338,6 +338,40 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl)
list_add(&tl->list_node, &bpf_prog_types);
}

+/* fixup insn->imm field of bpf_call instructions:
+ * if (insn->imm == BPF_FUNC_map_lookup_elem)
+ * insn->imm = bpf_map_lookup_elem - __bpf_call_base;
+ * else if (insn->imm == BPF_FUNC_map_update_elem)
+ * insn->imm = bpf_map_update_elem - __bpf_call_base;
+ * else ...
+ *
+ * this function is called after eBPF program passed verification
+ */
+static void fixup_bpf_calls(struct bpf_prog *prog)
+{
+ const struct bpf_func_proto *fn;
+ int i;
+
+ for (i = 0; i < prog->len; i++) {
+ struct bpf_insn *insn = &prog->insnsi[i];
+
+ if (insn->code == (BPF_JMP | BPF_CALL)) {
+ /* we reach here when program has bpf_call instructions
+ * and it passed bpf_check(), means that
+ * ops->get_func_proto must have been supplied, check it
+ */
+ BUG_ON(!prog->info->ops->get_func_proto);
+
+ fn = prog->info->ops->get_func_proto(insn->imm);
+ /* all functions that have prototype and verifier allowed
+ * programs to call them, must be real in-kernel functions
+ */
+ BUG_ON(!fn->func);
+ insn->imm = fn->func - __bpf_call_base;
+ }
+ }
+}
+
 /* drop refcnt on maps used by eBPF program and free auxiliary data */
static void free_bpf_prog_info(struct bpf_prog_info *info)
{
@@ -485,6 +519,9 @@ static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
if (err < 0)
goto free_prog_info;

+ /* fixup BPF_CALL->imm field */
+ fixup_bpf_calls(prog);
+
/* eBPF program is ready to be JITed */
bpf_prog_select_runtime(prog);
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:34 UTC
A simple .o parser and loader using the BPF syscall.
The .o file is a standard ELF object generated by the LLVM backend.

It parses an ELF file compiled by llvm from .c to .o:
- parses 'maps' section and creates maps via BPF syscall
- parses 'license' section and passes it to syscall
- parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns
by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
- loads eBPF program via BPF syscall
- attaches program FD to tracepoint events

One ELF file can contain multiple BPF programs attached to multiple
tracepoint events

int load_bpf_file(char *path);

bpf_helpers.h is a set of in-kernel helper functions available to eBPF programs

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---
samples/bpf/bpf_helpers.h | 25 +++++
samples/bpf/bpf_load.c | 232 +++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_load.h | 26 +++++
3 files changed, 283 insertions(+)
create mode 100644 samples/bpf/bpf_helpers.h
create mode 100644 samples/bpf/bpf_load.c
create mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
new file mode 100644
index 000000000000..1ac4cd629f63
--- /dev/null
+++ b/samples/bpf/bpf_helpers.h
@@ -0,0 +1,25 @@
+#ifndef __BPF_HELPERS_H
+#define __BPF_HELPERS_H
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+static void *(*bpf_fetch_ptr)(void *unsafe_ptr) = (void *) BPF_FUNC_fetch_ptr;
+static unsigned long long (*bpf_fetch_u64)(void *unsafe_ptr) = (void *) BPF_FUNC_fetch_u64;
+static unsigned int (*bpf_fetch_u32)(void *unsafe_ptr) = (void *) BPF_FUNC_fetch_u32;
+static unsigned short (*bpf_fetch_u16)(void *unsafe_ptr) = (void *) BPF_FUNC_fetch_u16;
+static unsigned char (*bpf_fetch_u8)(void *unsafe_ptr) = (void *) BPF_FUNC_fetch_u8;
+static int (*bpf_printk)(const char *fmt, int fmt_size, ...) = (void *) BPF_FUNC_printk;
+static int (*bpf_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) = (void *) BPF_FUNC_memcmp;
+static void *(*bpf_map_lookup_elem)(void *map, void *key) = (void *) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value) = (void *) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) = (void *) BPF_FUNC_map_delete_elem;
+static void (*bpf_dump_stack)(void) = (void *) BPF_FUNC_dump_stack;
+
+struct bpf_map_def {
+ int type;
+ int key_size;
+ int value_size;
+ int max_entries;
+};
+
+#endif
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
new file mode 100644
index 000000000000..0c36a2060fbb
--- /dev/null
+++ b/samples/bpf/bpf_load.c
@@ -0,0 +1,232 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include "libbpf.h"
+#include "bpf_helpers.h"
+#include "bpf_load.h"
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+static char license[128];
+static bool processed_sec[128];
+int map_fd[MAX_MAPS];
+
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+ int fd, event_fd, err;
+ char fmt[32];
+ char path[256] = DEBUGFS;
+
+ fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog, size, license);
+
+ if (fd < 0) {
+ printf("err %d errno %d\n", fd, errno);
+ return fd;
+ }
+
+ snprintf(fmt, sizeof(fmt), "bpf-%d", fd);
+
+ strcat(path, event);
+ strcat(path, "/filter");
+
+ printf("writing %s -> %s\n", fmt, path);
+
+ event_fd = open(path, O_WRONLY, 0);
+ if (event_fd < 0) {
+ printf("failed to open event %s\n", event);
+ return event_fd;
+ }
+
+ err = write(event_fd, fmt, strlen(fmt));
+ (void) err;
+
+ return 0;
+}
+
+static int load_maps(struct bpf_map_def *maps, int len)
+{
+ int i;
+
+ for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
+
+ map_fd[i] = bpf_create_map(maps[i].type,
+ maps[i].key_size,
+ maps[i].value_size,
+ maps[i].max_entries);
+ if (map_fd[i] < 0)
+ return 1;
+ }
+ return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+ GElf_Shdr *shdr, Elf_Data **data)
+{
+ Elf_Scn *scn;
+
+ scn = elf_getscn(elf, i);
+ if (!scn)
+ return 1;
+
+ if (gelf_getshdr(scn, shdr) != shdr)
+ return 2;
+
+ *shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+ if (!*shname || !shdr->sh_size)
+ return 3;
+
+ *data = elf_getdata(scn, 0);
+ if (!*data || elf_getdata(scn, *data) != NULL)
+ return 4;
+
+ return 0;
+}
+
+static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
+ GElf_Shdr *shdr, struct bpf_insn *insn)
+{
+ int i, nrels;
+
+ nrels = shdr->sh_size / shdr->sh_entsize;
+
+ for (i = 0; i < nrels; i++) {
+ GElf_Sym sym;
+ GElf_Rel rel;
+ unsigned int insn_idx;
+
+ gelf_getrel(data, i, &rel);
+
+ insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+ gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+ if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+ printf("invalid relo for insn->code %d\n",
+ insn[insn_idx].code);
+ return 1;
+ }
+ insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+ insn[insn_idx].imm = map_fd[sym.st_value / sizeof(struct bpf_map_def)];
+ }
+
+ return 0;
+}
+
+int load_bpf_file(char *path)
+{
+ int fd, i;
+ Elf *elf;
+ GElf_Ehdr ehdr;
+ GElf_Shdr shdr, shdr_prog;
+ Elf_Data *data, *data_prog, *symbols = NULL;
+ char *shname, *shname_prog;
+
+ if (elf_version(EV_CURRENT) == EV_NONE)
+ return 1;
+
+ fd = open(path, O_RDONLY, 0);
+ if (fd < 0)
+ return 1;
+
+ elf = elf_begin(fd, ELF_C_READ, NULL);
+
+ if (!elf)
+ return 1;
+
+ if (gelf_getehdr(elf, &ehdr) != &ehdr)
+ return 1;
+
+ /* scan over all elf sections to get license and map info */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+
+ if (0)
+ printf("section %d:%s data %p size %zd link %d flags %d\n",
+ i, shname, data->d_buf, data->d_size,
+ shdr.sh_link, (int) shdr.sh_flags);
+
+ if (strcmp(shname, "license") == 0) {
+ processed_sec[i] = true;
+ memcpy(license, data->d_buf, data->d_size);
+ } else if (strcmp(shname, "maps") == 0) {
+ processed_sec[i] = true;
+ if (load_maps(data->d_buf, data->d_size))
+ return 1;
+ } else if (shdr.sh_type == SHT_SYMTAB) {
+ symbols = data;
+ }
+ }
+
+ /* load programs that need map fixup (relocations) */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+ if (shdr.sh_type == SHT_REL) {
+ struct bpf_insn *insns;
+
+ if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+ &shdr_prog, &data_prog))
+ continue;
+
+ if (0)
+ printf("relo %s into %s\n", shname, shname_prog);
+
+ insns = (struct bpf_insn *) data_prog->d_buf;
+
+ processed_sec[shdr.sh_info] = true;
+ processed_sec[i] = true;
+
+ if (parse_relo_and_apply(data, symbols, &shdr, insns))
+ continue;
+
+ if (memcmp(shname_prog, "events/", sizeof("events/") - 1) == 0)
+ load_and_attach(shname_prog, insns, data_prog->d_size);
+ }
+ }
+
+ /* load programs that don't use maps */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (processed_sec[i])
+ continue;
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+
+ if (memcmp(shname, "events/", sizeof("events/") - 1) == 0)
+ load_and_attach(shname, data->d_buf, data->d_size);
+ }
+
+ close(fd);
+ return 0;
+}
+
+void read_trace_pipe(void)
+{
+ int trace_fd;
+
+ trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+ if (trace_fd < 0)
+ return;
+
+ while (1) {
+ static char buf[4096];
+ ssize_t sz;
+
+ sz = read(trace_fd, buf, sizeof(buf) - 1);
+ if (sz > 0) {
+ buf[sz] = 0;
+ puts(buf);
+ }
+ }
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
new file mode 100644
index 000000000000..209190d793ff
--- /dev/null
+++ b/samples/bpf/bpf_load.h
@@ -0,0 +1,26 @@
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#define MAX_MAPS 64
+
+extern int map_fd[MAX_MAPS];
+
+/* parses elf file compiled by llvm .c->.o
+ * . parses 'maps' section and creates maps via BPF syscall
+ * . parses 'license' section and passes it to syscall
+ * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
+ * storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
+ * . loads eBPF program via BPF syscall
+ * . attaches program FD to tracepoint events
+ *
+ * One ELF file can contain multiple BPF programs attached to multiple
+ * tracepoint events
+ *
+ * returns zero on success
+ */
+int load_bpf_file(char *path);
+
+/* forever reads /sys/.../trace_pipe */
+void read_trace_pipe(void);
+
+#endif
--
1.7.9.5
Brendan Gregg
2014-08-14 19:29:09 UTC
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org> wrote:
[...]
Post by Alexei Starovoitov
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+ int fd, event_fd, err;
+ char fmt[32];
+ char path[256] = DEBUGFS;
+
+ fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog, size, license);
+
+ if (fd < 0) {
+ printf("err %d errno %d\n", fd, errno);
+ return fd;
+ }
Minor suggestion: since this is sample code, I'd always print the bpf
log after this printf() error message:

printf("%s", bpf_log_buf);

Which has helped me debug my eBPF programs, as will be the case for
anyone hacking on the examples. Or have a function for logdie(), if
the log buffer may be populated with useful messages from other error
paths as well.

Brendan
Alexei Starovoitov
2014-08-15 05:56:07 UTC
On Thu, Aug 14, 2014 at 12:29 PM, Brendan Gregg
Post by Brendan Gregg
[...]
Post by Alexei Starovoitov
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+ int fd, event_fd, err;
+ char fmt[32];
+ char path[256] = DEBUGFS;
+
+ fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog, size, license);
+
+ if (fd < 0) {
+ printf("err %d errno %d\n", fd, errno);
+ return fd;
+ }
Minor suggestion: since this is sample code, I'd always print the bpf
log after this printf() error message:
printf("%s", bpf_log_buf);
Which has helped me debug my eBPF programs, as will be the case for
anyone hacking on the examples.
Good point. Will do in V5.
Post by Brendan Gregg
Or have a function for logdie(), if
the log buffer may be populated with useful messages from other error
paths as well.
This log buffer is an optional buffer that eBPF verifier is using to
store its messages. Mainly for humans to understand why verifier
rejected the program. It's also used by verifier testsuite to check
that reject reason actually matches the test intent.
Alexei Starovoitov
2014-08-13 07:57:25 UTC
Since the verifier walks all possible paths, it's important to recognize
equivalent verifier states to speed up verification.

If one of the old states is stricter than the current state, the current
state doesn't need to be explored further, since the verifier has already
concluded that the stricter state leads to a valid bpf_exit.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
kernel/bpf/verifier.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 134 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2dcfa5f76418..2c825c3a01cb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -219,6 +219,7 @@ struct verifier_env {
struct verifier_stack_elem *head; /* stack of verifier states to be processed */
int stack_size; /* number of states to be processed */
struct verifier_state cur_state; /* current verifier state */
+ struct verifier_state_list **branch_landing; /* search prunning optimization */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */
u32 used_map_cnt; /* number of used maps */
};
@@ -1207,6 +1208,8 @@ enum {
BRANCH = 2,
};

+#define STATE_END ((struct verifier_state_list *) -1L)
+
#define PUSH_INT(I) \
do { \
if (cur_stack >= insn_cnt) { \
@@ -1248,6 +1251,9 @@ enum {
ret = -EINVAL; \
goto free_st; \
} \
+ if (E == BRANCH) \
+ /* mark branch target for state pruning */ \
+ env->branch_landing[w] = STATE_END; \
if (st[w] == 0) { \
/* tree-edge */ \
st[T] = DISCOVERED | E; \
@@ -1315,6 +1321,10 @@ peek_stack:
PUSH_INSN(t, t + 1, FALLTHROUGH);
PUSH_INSN(t, t + insns[t].off + 1, BRANCH);
}
+ /* tell verifier to check for equivalent verifier states
+ * after every call and jump
+ */
+ env->branch_landing[t + 1] = STATE_END;
} else {
/* all other non-branch instructions with single
* fall-through edge
@@ -1346,6 +1356,88 @@ free_st:
return ret;
}

+/* compare two verifier states
+ *
+ * all states stored in state_list are known to be valid, since
+ * verifier reached 'bpf_exit' instruction through them
+ *
+ * this function is called when verifier exploring different branches of
+ * execution popped from the state stack. If it sees an old state that has
+ * more strict register state and more strict stack state then this execution
+ * branch doesn't need to be explored further, since verifier already
+ * concluded that more strict state leads to valid finish.
+ *
+ * Therefore two states are equivalent if register state is more conservative
+ * and explored stack state is more conservative than the current one.
+ * Example:
+ * explored current
+ * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
+ * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
+ *
+ * In other words if current stack state (one being explored) has more
+ * valid slots than old one that already passed validation, it means
+ * the verifier can stop exploring and conclude that current state is valid too
+ *
+ * Similarly with registers. If explored state has register type as invalid
+ * whereas register type in current state is meaningful, it means that
+ * the current state will reach 'bpf_exit' instruction safely
+ */
+static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
+{
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ if (memcmp(&old->regs[i], &cur->regs[i],
+ sizeof(old->regs[0])) != 0) {
+ if (old->regs[i].type == NOT_INIT ||
+ old->regs[i].type == UNKNOWN_VALUE)
+ continue;
+ return false;
+ }
+ }
+
+ for (i = 0; i < MAX_BPF_STACK; i++) {
+ if (memcmp(&old->stack[i], &cur->stack[i],
+ sizeof(old->stack[0])) != 0) {
+ if (old->stack[i].stype == STACK_INVALID)
+ continue;
+ return false;
+ }
+ }
+ return true;
+}
+
+static int is_state_visited(struct verifier_env *env, int insn_idx)
+{
+ struct verifier_state_list *new_sl;
+ struct verifier_state_list *sl;
+
+ sl = env->branch_landing[insn_idx];
+ if (!sl)
+ /* no branch jump to this insn, ignore it */
+ return 0;
+
+ while (sl != STATE_END) {
+ if (states_equal(&sl->state, &env->cur_state))
+ /* reached equivalent register/stack state,
+ * prune the search
+ */
+ return 1;
+ sl = sl->next;
+ }
+ new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
+
+ if (!new_sl)
+ /* ignore ENOMEM, it doesn't affect correctness */
+ return 0;
+
+ /* add new state to the head of linked list */
+ memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
+ new_sl->next = env->branch_landing[insn_idx];
+ env->branch_landing[insn_idx] = new_sl;
+ return 0;
+}
+
static int do_check(struct verifier_env *env)
{
struct verifier_state *state = &env->cur_state;
@@ -1377,6 +1469,17 @@ static int do_check(struct verifier_env *env)
return -E2BIG;
}

+ if (is_state_visited(env, insn_idx)) {
+ if (log_level) {
+ if (do_print_state)
+ verbose("\nfrom %d to %d: safe\n",
+ prev_insn_idx, insn_idx);
+ else
+ verbose("%d: safe\n", insn_idx);
+ }
+ goto process_bpf_exit;
+ }
+
if (log_level && do_print_state) {
verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
print_verifier_state(env);
@@ -1464,6 +1567,7 @@ static int do_check(struct verifier_env *env)
* something into it earlier
*/
_(check_reg_arg(regs, BPF_REG_0, 1));
+process_bpf_exit:
insn_idx = pop_stack(env, &prev_insn_idx);
if (insn_idx < 0) {
break;
@@ -1597,6 +1701,28 @@ static void convert_pseudo_ld_imm64(struct verifier_env *env)
insn->src_reg = 0;
}

+static void free_states(struct verifier_env *env)
+{
+ struct verifier_state_list *sl, *sln;
+ int i;
+
+ if (!env->branch_landing)
+ return;
+
+ for (i = 0; i < env->prog->len; i++) {
+ sl = env->branch_landing[i];
+
+ if (sl)
+ while (sl != STATE_END) {
+ sln = sl->next;
+ kfree(sl);
+ sl = sln;
+ }
+ }
+
+ kfree(env->branch_landing);
+}
+
int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
{
void __user *log_ubuf = NULL;
@@ -1646,6 +1772,13 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
if (ret < 0)
goto skip_full_check;

+ env->branch_landing = kcalloc(prog->len,
+ sizeof(struct verifier_state_list *),
+ GFP_KERNEL);
+ ret = -ENOMEM;
+ if (!env->branch_landing)
+ goto skip_full_check;
+
ret = check_cfg(env);
if (ret < 0)
goto skip_full_check;
@@ -1654,6 +1787,7 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])

skip_full_check:
while (pop_stack(env, NULL) >= 0);
+ free_states(env);

if (log_level && log_len >= log_size - 1) {
BUG_ON(log_len >= log_size);
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:21 UTC
add optional attributes for BPF_PROG_LOAD syscall:
BPF_PROG_LOG_LEVEL, /* verbosity level of eBPF verifier */
BPF_PROG_LOG_BUF, /* user supplied buffer */
BPF_PROG_LOG_SIZE, /* size of user buffer */

In that case the verifier returns its verification log in the
user-supplied buffer, which can be used to analyze why the verifier
rejected a given program.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/uapi/linux/bpf.h | 4 +
kernel/bpf/syscall.c | 3 +
kernel/bpf/verifier.c | 236 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 243 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa09ba084ebc..72bed3950bf1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -372,6 +372,10 @@ enum bpf_prog_attributes {
BPF_PROG_UNSPEC,
BPF_PROG_TEXT, /* array of eBPF instructions */
BPF_PROG_LICENSE, /* license string */
+ /* optional program attributes */
+ BPF_PROG_LOG_LEVEL, /* verbosity level of eBPF verifier */
+ BPF_PROG_LOG_BUF, /* user supplied buffer */
+ BPF_PROG_LOG_SIZE, /* size of user buffer */
__BPF_PROG_ATTR_MAX,
};
#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a3581646ee11..60cb760cb423 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -443,6 +443,9 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
static const struct nla_policy prog_policy[BPF_PROG_ATTR_MAX + 1] = {
[BPF_PROG_TEXT] = { .type = NLA_BINARY },
[BPF_PROG_LICENSE] = { .type = NLA_NUL_STRING },
+ [BPF_PROG_LOG_LEVEL] = { .type = NLA_U32 },
+ [BPF_PROG_LOG_BUF] = { .len = sizeof(void *) },
+ [BPF_PROG_LOG_SIZE] = { .type = NLA_U32 },
};

static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cf8a0131cd91..27b34e1c8fbf 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -143,9 +143,245 @@
* load/store to bpf_context are checked against known fields
*/

+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct verifier_env {
+};
+
+/* verbose verifier prints what it's seeing
+ * bpf_check() is called under lock, so no race to access these global vars
+ */
+static u32 log_level, log_size, log_len;
+static void *log_buf;
+
+static DEFINE_MUTEX(bpf_verifier_lock);
+
+/* log_level controls verbosity level of eBPF verifier.
+ * verbose() is used to dump the verification trace to the log, so the user
+ * can figure out what's wrong with the program
+ */
+static void verbose(const char *fmt, ...)
+{
+ va_list args;
+
+ if (log_level == 0 || log_len >= log_size - 1)
+ return;
+
+ va_start(args, fmt);
+ log_len += vscnprintf(log_buf + log_len, log_size - log_len, fmt, args);
+ va_end(args);
+}
+
+static const char *const bpf_class_string[] = {
+ [BPF_LD] = "ld",
+ [BPF_LDX] = "ldx",
+ [BPF_ST] = "st",
+ [BPF_STX] = "stx",
+ [BPF_ALU] = "alu",
+ [BPF_JMP] = "jmp",
+ [BPF_RET] = "BUG",
+ [BPF_ALU64] = "alu64",
+};
+
+static const char *const bpf_alu_string[] = {
+ [BPF_ADD >> 4] = "+=",
+ [BPF_SUB >> 4] = "-=",
+ [BPF_MUL >> 4] = "*=",
+ [BPF_DIV >> 4] = "/=",
+ [BPF_OR >> 4] = "|=",
+ [BPF_AND >> 4] = "&=",
+ [BPF_LSH >> 4] = "<<=",
+ [BPF_RSH >> 4] = ">>=",
+ [BPF_NEG >> 4] = "neg",
+ [BPF_MOD >> 4] = "%=",
+ [BPF_XOR >> 4] = "^=",
+ [BPF_MOV >> 4] = "=",
+ [BPF_ARSH >> 4] = "s>>=",
+ [BPF_END >> 4] = "endian",
+};
+
+static const char *const bpf_ldst_string[] = {
+ [BPF_W >> 3] = "u32",
+ [BPF_H >> 3] = "u16",
+ [BPF_B >> 3] = "u8",
+ [BPF_DW >> 3] = "u64",
+};
+
+static const char *const bpf_jmp_string[] = {
+ [BPF_JA >> 4] = "jmp",
+ [BPF_JEQ >> 4] = "==",
+ [BPF_JGT >> 4] = ">",
+ [BPF_JGE >> 4] = ">=",
+ [BPF_JSET >> 4] = "&",
+ [BPF_JNE >> 4] = "!=",
+ [BPF_JSGT >> 4] = "s>",
+ [BPF_JSGE >> 4] = "s>=",
+ [BPF_CALL >> 4] = "call",
+ [BPF_EXIT >> 4] = "exit",
+};
+
+static void print_bpf_insn(struct bpf_insn *insn)
+{
+ u8 class = BPF_CLASS(insn->code);
+
+ if (class == BPF_ALU || class == BPF_ALU64) {
+ if (BPF_SRC(insn->code) == BPF_X)
+ verbose("(%02x) %sr%d %s %sr%d\n",
+ insn->code, class == BPF_ALU ? "(u32) " : "",
+ insn->dst_reg,
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ class == BPF_ALU ? "(u32) " : "",
+ insn->src_reg);
+ else
+ verbose("(%02x) %sr%d %s %s%d\n",
+ insn->code, class == BPF_ALU ? "(u32) " : "",
+ insn->dst_reg,
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ class == BPF_ALU ? "(u32) " : "",
+ insn->imm);
+ } else if (class == BPF_STX) {
+ if (BPF_MODE(insn->code) == BPF_MEM)
+ verbose("(%02x) *(%s *)(r%d %+d) = r%d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg,
+ insn->off, insn->src_reg);
+ else if (BPF_MODE(insn->code) == BPF_XADD)
+ verbose("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg, insn->off,
+ insn->src_reg);
+ else
+ verbose("BUG_%02x\n", insn->code);
+ } else if (class == BPF_ST) {
+ if (BPF_MODE(insn->code) != BPF_MEM) {
+ verbose("BUG_st_%02x\n", insn->code);
+ return;
+ }
+ verbose("(%02x) *(%s *)(r%d %+d) = %d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg,
+ insn->off, insn->imm);
+ } else if (class == BPF_LDX) {
+ if (BPF_MODE(insn->code) != BPF_MEM) {
+ verbose("BUG_ldx_%02x\n", insn->code);
+ return;
+ }
+ verbose("(%02x) r%d = *(%s *)(r%d %+d)\n",
+ insn->code, insn->dst_reg,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->src_reg, insn->off);
+ } else if (class == BPF_LD) {
+ if (BPF_MODE(insn->code) == BPF_ABS) {
+ verbose("(%02x) r0 = *(%s *)skb[%d]\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->imm);
+ } else if (BPF_MODE(insn->code) == BPF_IND) {
+ verbose("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->src_reg, insn->imm);
+ } else if (BPF_MODE(insn->code) == BPF_IMM) {
+ verbose("(%02x) r%d = 0x%x\n",
+ insn->code, insn->dst_reg, insn->imm);
+ } else {
+ verbose("BUG_ld_%02x\n", insn->code);
+ return;
+ }
+ } else if (class == BPF_JMP) {
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_CALL) {
+ verbose("(%02x) call %d\n", insn->code, insn->imm);
+ } else if (insn->code == (BPF_JMP | BPF_JA)) {
+ verbose("(%02x) goto pc%+d\n",
+ insn->code, insn->off);
+ } else if (insn->code == (BPF_JMP | BPF_EXIT)) {
+ verbose("(%02x) exit\n", insn->code);
+ } else if (BPF_SRC(insn->code) == BPF_X) {
+ verbose("(%02x) if r%d %s r%d goto pc%+d\n",
+ insn->code, insn->dst_reg,
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->src_reg, insn->off);
+ } else {
+ verbose("(%02x) if r%d %s 0x%x goto pc%+d\n",
+ insn->code, insn->dst_reg,
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->imm, insn->off);
+ }
+ } else {
+ verbose("(%02x) %s\n", insn->code, bpf_class_string[class]);
+ }
+}
+
int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
{
+ void __user *log_ubuf = NULL;
+ struct verifier_env *env;
int ret = -EINVAL;

+ if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
+ return -E2BIG;
+
+ /* 'struct verifier_env' can be global, but since it's not small,
+ * allocate/free it every time bpf_check() is called
+ */
+ env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
+ if (!env)
+ return -ENOMEM;
+
+ /* grab the mutex to protect few globals used by verifier */
+ mutex_lock(&bpf_verifier_lock);
+
+ if (tb[BPF_PROG_LOG_LEVEL] && tb[BPF_PROG_LOG_BUF] &&
+ tb[BPF_PROG_LOG_SIZE]) {
+ /* user requested verbose verifier output
+ * and supplied buffer to store the verification trace
+ */
+ log_level = nla_get_u32(tb[BPF_PROG_LOG_LEVEL]);
+ log_ubuf = *(void __user **) nla_data(tb[BPF_PROG_LOG_BUF]);
+ log_size = nla_get_u32(tb[BPF_PROG_LOG_SIZE]);
+ log_len = 0;
+
+ ret = -EINVAL;
+ /* log_* values have to be sane */
+ if (log_size < 128 || log_size > UINT_MAX >> 8 ||
+ log_level == 0 || log_ubuf == NULL)
+ goto free_env;
+
+ ret = -ENOMEM;
+ log_buf = vmalloc(log_size);
+ if (!log_buf)
+ goto free_env;
+ } else {
+ log_level = 0;
+ }
+
+ /* ret = do_check(env); */
+
+ if (log_level && log_len >= log_size - 1) {
+ BUG_ON(log_len >= log_size);
+ /* verifier log exceeded user supplied buffer */
+ ret = -ENOSPC;
+ /* fall through to return what was recorded */
+ }
+
+ /* copy verifier log back to user space including trailing zero */
+ if (log_level && copy_to_user(log_ubuf, log_buf, log_len + 1) != 0) {
+ ret = -EFAULT;
+ goto free_log_buf;
+ }
+
+
+free_log_buf:
+ if (log_level)
+ vfree(log_buf);
+free_env:
+ kfree(env);
+ mutex_unlock(&bpf_verifier_lock);
return ret;
}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:36 UTC
This example has two probes in C that use two different maps.

The 1st probe is similar to dropmon.c. It attaches to the kfree_skb
tracepoint and counts the number of packet drops at different locations.

The 2nd probe attaches to kprobe/sys_write and computes a histogram of
different write sizes.

Usage:
$ sudo ex2

Should see:
writing bpf-5 -> /sys/kernel/debug/tracing/events/skb/kfree_skb/filter
writing bpf-8 -> /sys/kernel/debug/tracing/events/kprobes/sys_write/filter
location 0xffffffff816efc67 count 1

location 0xffffffff815d8030 count 1
location 0xffffffff816efc67 count 3

location 0xffffffff815d8030 count 4
location 0xffffffff816efc67 count 9

syscall write() stats
byte_size : count distribution
1 -> 1 : 3141 |**** |
2 -> 3 : 2 | |
4 -> 7 : 14 | |
8 -> 15 : 3268 |***** |
16 -> 31 : 732 | |
32 -> 63 : 20042 |************************************* |
64 -> 127 : 12154 |********************** |
128 -> 255 : 2215 |*** |
256 -> 511 : 9 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 1 | |

Ctrl-C at any time. The kernel will automatically clean up the maps and programs.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/Makefile | 6 ++--
samples/bpf/ex2_kern.c | 73 +++++++++++++++++++++++++++++++++++++
samples/bpf/ex2_user.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 171 insertions(+), 2 deletions(-)
create mode 100644 samples/bpf/ex2_kern.c
create mode 100644 samples/bpf/ex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index c97f408fcd6d..b865a5df5c60 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,19 +2,21 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example dropmon ex1
+hostprogs-y := sock_example dropmon ex1 ex2

sock_example-objs := sock_example.o libbpf.o
dropmon-objs := dropmon.o libbpf.o
ex1-objs := bpf_load.o libbpf.o ex1_user.o
+ex2-objs := bpf_load.o libbpf.o ex2_user.o

# Tell kbuild to always build the programs
-always := $(hostprogs-y) ex1_kern.o
+always := $(hostprogs-y) ex1_kern.o ex2_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include

HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_ex1 += -lelf
+HOSTLOADLIBES_ex2 += -lelf

LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc

diff --git a/samples/bpf/ex2_kern.c b/samples/bpf/ex2_kern.c
new file mode 100644
index 000000000000..2daa50b27ce5
--- /dev/null
+++ b/samples/bpf/ex2_kern.c
@@ -0,0 +1,73 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(long),
+ .max_entries = 1024,
+};
+
+SEC("events/skb/kfree_skb")
+int bpf_prog2(struct bpf_context *ctx)
+{
+ long loc = ctx->arg2;
+ long init_val = 1;
+ void *value;
+
+ value = bpf_map_lookup_elem(&my_map, &loc);
+ if (value)
+ (*(long *) value) += 1;
+ else
+ bpf_map_update_elem(&my_map, &loc, &init_val);
+ return 0;
+}
+
+static unsigned int log2(unsigned int v)
+{
+ unsigned int r;
+ unsigned int shift;
+
+ r = (v > 0xFFFF) << 4; v >>= r;
+ shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+ shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+ shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+ r |= (v >> 1);
+ return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+ unsigned int hi = v >> 32;
+ if (hi)
+ return log2(hi) + 32;
+ else
+ return log2(v);
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(long),
+ .max_entries = 64,
+};
+
+SEC("events/kprobes/sys_write")
+int bpf_prog3(struct bpf_context *ctx)
+{
+ long write_size = ctx->arg3;
+ long init_val = 1;
+ void *value;
+ u32 index = log2l(write_size);
+
+ value = bpf_map_lookup_elem(&my_hist_map, &index);
+ if (value)
+ __sync_fetch_and_add((long *)value, 1);
+ else
+ bpf_map_update_elem(&my_hist_map, &index, &init_val);
+ return 0;
+}
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex2_user.c b/samples/bpf/ex2_user.c
new file mode 100644
index 000000000000..fd5ce21ae60a
--- /dev/null
+++ b/samples/bpf/ex2_user.c
@@ -0,0 +1,94 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+ int key, next_key;
+ long value;
+ long data[MAX_INDEX] = {};
+ char starstr[MAX_STARS];
+ int i;
+ int max_ind = -1;
+ long max_value = 0;
+
+ key = -1; /* some unknown key */
+ while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+ bpf_lookup_elem(fd, &next_key, &value);
+ if (next_key >= MAX_INDEX) {
+ printf("BUG: invalid index %d\n", next_key);
+ } else {
+ data[next_key] = value;
+ if (next_key > max_ind)
+ max_ind = next_key;
+ if (value > max_value)
+ max_value = value;
+ }
+ key = next_key;
+ }
+
+ printf(" syscall write() stats\n");
+ printf(" byte_size : count distribution\n");
+ for (i = 1; i <= max_ind + 1; i++) {
+ stars(starstr, data[i - 1], max_value, MAX_STARS);
+ printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+ (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+ MAX_STARS, starstr);
+ }
+}
+static void int_exit(int sig)
+{
+ print_hist(map_fd[1]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ long key, next_key, value;
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ signal(SIGINT, int_exit);
+
+ i = system("echo 'p:sys_write sys_write' > /sys/kernel/debug/tracing/kprobe_events");
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ for (i = 0; i < 5; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd[0], &next_key, &value);
+ printf("location 0x%lx count %ld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+ print_hist(map_fd[1]);
+
+ return 0;
+}
--
1.7.9.5
Brendan Gregg
2014-08-14 22:13:53 UTC
Permalink
Post by Alexei Starovoitov
this example has two probes in C that use two different maps.
1st probe is the similar to dropmon.c. It attaches to kfree_skb tracepoint and
count number of packet drops at different locations
2nd probe attaches to kprobe/sys_write and computes a histogram of different
write sizes
This is pretty awesome.

Given that this is tracing two tracepoints at once, I'd like to see a
similar example where time is stored on the first tracepoint,
retrieved on the second for a delta calculation, then presented with a
similar histogram as seen above.

Brendan
Alexei Starovoitov
2014-08-15 06:19:57 UTC
Permalink
On Thu, Aug 14, 2014 at 3:13 PM, Brendan Gregg
Post by Brendan Gregg
This is pretty awesome.
Given that this is tracing two tracepoints at once, I'd like to see a
similar example where time is stored on the first tracepoint,
retrieved on the second for a delta calculation, then presented with a
similar histogram as seen above.
Very good point. The time-related helpers are missing. In V5 I'm
thinking of adding something like bpf_ktime_get_ns().
To associate begin and end events, I think bpf_gettid() would be
needed, but that doesn't feel generic enough for a helper function,
so I'm leaning toward a 'bpf_get_current()' helper that will return
the 'current' task pointer. An eBPF program can use this pointer for
correlation of events, or can go exploring task fields with the
bpf_fetch_() helpers...

Thank you very much for trying things out and for your feedback!
Alexei Starovoitov
2014-08-13 07:57:37 UTC
Permalink
A simple verifier test from user space. It tests valid and invalid programs
and expects predefined error-log messages from the kernel.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/Makefile | 3 +-
samples/bpf/test_verifier.c | 354 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 356 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/test_verifier.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b865a5df5c60..e39cb4f13be9 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,13 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example dropmon ex1 ex2
+hostprogs-y := sock_example dropmon ex1 ex2 test_verifier

sock_example-objs := sock_example.o libbpf.o
dropmon-objs := dropmon.o libbpf.o
ex1-objs := bpf_load.o libbpf.o ex1_user.o
ex2-objs := bpf_load.o libbpf.o ex2_user.o
+test_verifier-objs := test_verifier.o libbpf.o

# Tell kbuild to always build the programs
always := $(hostprogs-y) ex1_kern.o ex2_kern.o
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
new file mode 100644
index 000000000000..46cef16425e4
--- /dev/null
+++ b/samples/bpf/test_verifier.c
@@ -0,0 +1,354 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include "libbpf.h"
+
+#define MAX_INSNS 512
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+struct bpf_test {
+ const char *descr;
+ struct bpf_insn insns[MAX_INSNS];
+ int fixup[32];
+ const char *errstr;
+ enum {
+ ACCEPT,
+ REJECT
+ } result;
+};
+
+static struct bpf_test tests[] = {
+ {
+ "add+sub+mul",
+ .insns = {
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 2),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, 3),
+ BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -1),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_1, 3),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_1),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ },
+ {
+ "dropmon",
+ .insns = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+ BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {4, 16},
+ .result = ACCEPT,
+ },
+ {
+ "dropmon2",
+ .insns = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+ BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+ BPF_JMP_IMM(BPF_JA, 0, 0, -10),
+ },
+ .fixup = {4, 16},
+ .result = ACCEPT,
+ },
+ {
+ "unreachable",
+ .insns = {
+ BPF_EXIT_INSN(),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "unreachable",
+ .result = REJECT,
+ },
+ {
+ "unreachable2",
+ .insns = {
+ BPF_JMP_IMM(BPF_JA, 0, 0, 1),
+ BPF_JMP_IMM(BPF_JA, 0, 0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "unreachable",
+ .result = REJECT,
+ },
+ {
+ "out of range jump",
+ .insns = {
+ BPF_JMP_IMM(BPF_JA, 0, 0, 1),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "jump out of range",
+ .result = REJECT,
+ },
+ {
+ "out of range jump2",
+ .insns = {
+ BPF_JMP_IMM(BPF_JA, 0, 0, -2),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "jump out of range",
+ .result = REJECT,
+ },
+ {
+ "no bpf_exit",
+ .insns = {
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
+ },
+ .errstr = "jump out of range",
+ .result = REJECT,
+ },
+ {
+ "loop (back-edge)",
+ .insns = {
+ BPF_JMP_IMM(BPF_JA, 0, 0, -1),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "back-edge",
+ .result = REJECT,
+ },
+ {
+ "loop2 (back-edge)",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+ BPF_JMP_IMM(BPF_JA, 0, 0, -4),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "back-edge",
+ .result = REJECT,
+ },
+ {
+ "conditional loop",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -3),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "back-edge",
+ .result = REJECT,
+ },
+ {
+ "read uninitialized register",
+ .insns = {
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "R2 !read_ok",
+ .result = REJECT,
+ },
+ {
+ "program doesn't init R0 before exit",
+ .insns = {
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "R0 !read_ok",
+ .result = REJECT,
+ },
+ {
+ "stack out of bounds",
+ .insns = {
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "invalid stack",
+ .result = REJECT,
+ },
+ {
+ "uninitialized stack",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {2},
+ .errstr = "invalid indirect read from stack",
+ .result = REJECT,
+ },
+ {
+ "invalid map_fd for function call",
+ .insns = {
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "fd 0 is not pointing to valid bpf_map",
+ .result = REJECT,
+ },
+ {
+ "don't check return value before access",
+ .insns = {
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {3},
+ .errstr = "R0 invalid mem access 'map_value_or_null'",
+ .result = REJECT,
+ },
+ {
+ "access memory with incorrect alignment",
+ .insns = {
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {3},
+ .errstr = "misaligned access",
+ .result = REJECT,
+ },
+ {
+ "sometimes access memory with incorrect alignment",
+ .insns = {
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {3},
+ .errstr = "R0 invalid mem access",
+ .result = REJECT,
+ },
+};
+
+static int probe_filter_length(struct bpf_insn *fp)
+{
+ int len = 0;
+
+ for (len = MAX_INSNS - 1; len > 0; --len)
+ if (fp[len].code != 0)
+ break;
+
+ return len + 1;
+}
+
+static int create_map(void)
+{
+ long long key, value = 0;
+ int map_fd;
+
+ map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
+ if (map_fd < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ }
+
+ return map_fd;
+}
+
+static int test(void)
+{
+ int prog_fd, i;
+
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ struct bpf_insn *prog = tests[i].insns;
+ int prog_len = probe_filter_length(prog);
+ int *fixup = tests[i].fixup;
+ int map_fd = -1;
+
+ if (*fixup) {
+
+ map_fd = create_map();
+
+ do {
+ prog[*fixup].imm = map_fd;
+ fixup++;
+ } while (*fixup);
+ }
+ printf("#%d %s ", i, tests[i].descr);
+
+ prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
+ prog_len * sizeof(struct bpf_insn),
+ "GPL");
+
+ if (tests[i].result == ACCEPT) {
+ if (prog_fd < 0) {
+ printf("FAIL\nfailed to load prog '%s'\n",
+ strerror(errno));
+ printf("%s", bpf_log_buf);
+ goto fail;
+ }
+ } else {
+ if (prog_fd >= 0) {
+ printf("FAIL\nunexpected success to load\n");
+ printf("%s", bpf_log_buf);
+ goto fail;
+ }
+ if (strstr(bpf_log_buf, tests[i].errstr) == 0) {
+ printf("FAIL\nunexpected error message: %s",
+ bpf_log_buf);
+ goto fail;
+ }
+ }
+
+ printf("OK\n");
+fail:
+ if (map_fd >= 0)
+ close(map_fd);
+ close(prog_fd);
+
+ }
+
+ return 0;
+}
+
+int main(void)
+{
+ return test();
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:35 UTC
Permalink
ex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb->dev->name == "lo"

ex1_user.c - corresponding user space component that
forever reads /sys/.../trace_pipe

Usage:
$ sudo ex1

should see:
writing bpf-4 -> /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
ping-25476 [001] ..s3 5639.718218: __netif_receive_skb_core: skb 4d51700 dev b9e6000
ping-25476 [001] ..s3 5639.718262: __netif_receive_skb_core: skb 4d51400 dev b9e6000

ping-25476 [002] ..s3 5640.716233: __netif_receive_skb_core: skb 5d06500 dev b9e6000
ping-25476 [002] ..s3 5640.716272: __netif_receive_skb_core: skb 5d06300 dev b9e6000

Ctrl-C at any time; the kernel will clean up automatically

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/Makefile | 15 +++++++++++++--
samples/bpf/ex1_kern.c | 27 +++++++++++++++++++++++++++
samples/bpf/ex1_user.c | 24 ++++++++++++++++++++++++
3 files changed, 64 insertions(+), 2 deletions(-)
create mode 100644 samples/bpf/ex1_kern.c
create mode 100644 samples/bpf/ex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index caf1ab93b37c..c97f408fcd6d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,23 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example dropmon
+hostprogs-y := sock_example dropmon ex1

sock_example-objs := sock_example.o libbpf.o
dropmon-objs := dropmon.o libbpf.o
+ex1-objs := bpf_load.o libbpf.o ex1_user.o

# Tell kbuild to always build the programs
-always := $(hostprogs-y)
+always := $(hostprogs-y) ex1_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include
+
+HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_ex1 += -lelf
+
+LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+
+%.o: %.c
+ clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+ -O2 -emit-llvm -c $< -o -| $(LLC) -o $@
diff --git a/samples/bpf/ex1_kern.c b/samples/bpf/ex1_kern.c
new file mode 100644
index 000000000000..05a06cccaadb
--- /dev/null
+++ b/samples/bpf/ex1_kern.c
@@ -0,0 +1,27 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+SEC("events/net/netif_receive_skb")
+int bpf_prog1(struct bpf_context *ctx)
+{
+ /*
+ * attaches to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+ * prints events for the loopback device only
+ */
+ char devname[] = "lo";
+ struct net_device *dev;
+ struct sk_buff *skb = 0;
+
+ skb = (struct sk_buff *) ctx->arg1;
+ dev = bpf_fetch_ptr(&skb->dev);
+ if (bpf_memcmp(dev->name, devname, 2) == 0) {
+ char fmt[] = "skb %x dev %x\n";
+ bpf_printk(fmt, sizeof(fmt), skb, dev);
+ }
+ return 0;
+}
+
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex1_user.c b/samples/bpf/ex1_user.c
new file mode 100644
index 000000000000..e85c1b483f57
--- /dev/null
+++ b/samples/bpf/ex1_user.c
@@ -0,0 +1,24 @@
+#include <stdio.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+ FILE *f;
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ f = popen("ping -c5 localhost", "r");
+ (void) f;
+
+ read_trace_pipe();
+
+ return 0;
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:33 UTC
Permalink
LLVM backend that generates eBPF and emits either
binary ELF or human-readable eBPF assembler

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---

the body of the patch is removed to prevent spam.
I doubt many folks are interested in reading LLVM diffs on lkml.
The backend is available in the tree:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=4afc9f0b5890c0020a7d70c42bd4f0fa55d33cb0

tools/bpf/llvm/.gitignore | 54 ++
tools/bpf/llvm/LICENSE.TXT | 70 ++
tools/bpf/llvm/Makefile.rules | 641 ++++++++++++++++++
tools/bpf/llvm/README.txt | 23 +
tools/bpf/llvm/bld/Makefile | 27 +
tools/bpf/llvm/bld/Makefile.common | 14 +
tools/bpf/llvm/bld/Makefile.config | 124 ++++
.../llvm/bld/include/llvm/Config/AsmParsers.def | 8 +
.../llvm/bld/include/llvm/Config/AsmPrinters.def | 9 +
.../llvm/bld/include/llvm/Config/Disassemblers.def | 8 +
tools/bpf/llvm/bld/include/llvm/Config/Targets.def | 9 +
.../bpf/llvm/bld/include/llvm/Support/DataTypes.h | 96 +++
tools/bpf/llvm/bld/lib/Makefile | 11 +
.../llvm/bld/lib/Target/BPF/InstPrinter/Makefile | 10 +
.../llvm/bld/lib/Target/BPF/MCTargetDesc/Makefile | 11 +
tools/bpf/llvm/bld/lib/Target/BPF/Makefile | 17 +
.../llvm/bld/lib/Target/BPF/TargetInfo/Makefile | 10 +
tools/bpf/llvm/bld/lib/Target/Makefile | 11 +
tools/bpf/llvm/bld/tools/Makefile | 12 +
tools/bpf/llvm/bld/tools/llc/Makefile | 15 +
tools/bpf/llvm/lib/Target/BPF/BPF.h | 28 +
tools/bpf/llvm/lib/Target/BPF/BPF.td | 29 +
tools/bpf/llvm/lib/Target/BPF/BPFAsmPrinter.cpp | 100 +++
tools/bpf/llvm/lib/Target/BPF/BPFCallingConv.td | 24 +
tools/bpf/llvm/lib/Target/BPF/BPFFrameLowering.cpp | 36 ++
tools/bpf/llvm/lib/Target/BPF/BPFFrameLowering.h | 35 +
tools/bpf/llvm/lib/Target/BPF/BPFISelDAGToDAG.cpp | 182 ++++++
tools/bpf/llvm/lib/Target/BPF/BPFISelLowering.cpp | 683 ++++++++++++++++++++
tools/bpf/llvm/lib/Target/BPF/BPFISelLowering.h | 105 +++
tools/bpf/llvm/lib/Target/BPF/BPFInstrFormats.td | 29 +
tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.cpp | 162 +++++
tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.h | 53 ++
tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.td | 498 ++++++++++++++
tools/bpf/llvm/lib/Target/BPF/BPFMCInstLower.cpp | 77 +++
tools/bpf/llvm/lib/Target/BPF/BPFMCInstLower.h | 40 ++
tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.cpp | 122 ++++
tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.h | 65 ++
tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.td | 39 ++
tools/bpf/llvm/lib/Target/BPF/BPFSubtarget.cpp | 23 +
tools/bpf/llvm/lib/Target/BPF/BPFSubtarget.h | 33 +
tools/bpf/llvm/lib/Target/BPF/BPFTargetMachine.cpp | 66 ++
tools/bpf/llvm/lib/Target/BPF/BPFTargetMachine.h | 69 ++
.../lib/Target/BPF/InstPrinter/BPFInstPrinter.cpp | 81 +++
.../lib/Target/BPF/InstPrinter/BPFInstPrinter.h | 34 +
.../lib/Target/BPF/MCTargetDesc/BPFAsmBackend.cpp | 89 +++
.../llvm/lib/Target/BPF/MCTargetDesc/BPFBaseInfo.h | 33 +
.../Target/BPF/MCTargetDesc/BPFELFObjectWriter.cpp | 56 ++
.../lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h | 34 +
.../Target/BPF/MCTargetDesc/BPFMCCodeEmitter.cpp | 167 +++++
.../Target/BPF/MCTargetDesc/BPFMCTargetDesc.cpp | 115 ++++
.../lib/Target/BPF/MCTargetDesc/BPFMCTargetDesc.h | 56 ++
.../lib/Target/BPF/TargetInfo/BPFTargetInfo.cpp | 13 +
tools/bpf/llvm/tools/llc/llc.cpp | 381 +++++++++++
53 files changed, 4737 insertions(+)
create mode 100644 tools/bpf/llvm/.gitignore
create mode 100644 tools/bpf/llvm/LICENSE.TXT
create mode 100644 tools/bpf/llvm/Makefile.rules
create mode 100644 tools/bpf/llvm/README.txt
create mode 100644 tools/bpf/llvm/bld/Makefile
create mode 100644 tools/bpf/llvm/bld/Makefile.common
create mode 100644 tools/bpf/llvm/bld/Makefile.config
create mode 100644 tools/bpf/llvm/bld/include/llvm/Config/AsmParsers.def
create mode 100644 tools/bpf/llvm/bld/include/llvm/Config/AsmPrinters.def
create mode 100644 tools/bpf/llvm/bld/include/llvm/Config/Disassemblers.def
create mode 100644 tools/bpf/llvm/bld/include/llvm/Config/Targets.def
create mode 100644 tools/bpf/llvm/bld/include/llvm/Support/DataTypes.h
create mode 100644 tools/bpf/llvm/bld/lib/Makefile
create mode 100644 tools/bpf/llvm/bld/lib/Target/BPF/InstPrinter/Makefile
create mode 100644 tools/bpf/llvm/bld/lib/Target/BPF/MCTargetDesc/Makefile
create mode 100644 tools/bpf/llvm/bld/lib/Target/BPF/Makefile
create mode 100644 tools/bpf/llvm/bld/lib/Target/BPF/TargetInfo/Makefile
create mode 100644 tools/bpf/llvm/bld/lib/Target/Makefile
create mode 100644 tools/bpf/llvm/bld/tools/Makefile
create mode 100644 tools/bpf/llvm/bld/tools/llc/Makefile
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPF.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPF.td
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFAsmPrinter.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFCallingConv.td
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFFrameLowering.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFFrameLowering.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFISelDAGToDAG.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFISelLowering.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFISelLowering.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFInstrFormats.td
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFInstrInfo.td
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFMCInstLower.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFMCInstLower.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFRegisterInfo.td
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFSubtarget.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFSubtarget.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFTargetMachine.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/BPFTargetMachine.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/InstPrinter/BPFInstPrinter.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/InstPrinter/BPFInstPrinter.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFAsmBackend.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFBaseInfo.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFELFObjectWriter.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFMCCodeEmitter.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFMCTargetDesc.cpp
create mode 100644 tools/bpf/llvm/lib/Target/BPF/MCTargetDesc/BPFMCTargetDesc.h
create mode 100644 tools/bpf/llvm/lib/Target/BPF/TargetInfo/BPFTargetInfo.cpp
create mode 100644 tools/bpf/llvm/tools/llc/llc.cpp
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:32 UTC
Permalink
a simple packet drop monitor:
- the in-kernel eBPF program attaches to the kfree_skb() event and records
the number of packet drops at a given location
- userspace iterates over the map every second and prints stats

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/Makefile | 3 +-
samples/bpf/dropmon.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 133 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/dropmon.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 63c65e5faf58..caf1ab93b37c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,9 +2,10 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example
+hostprogs-y := sock_example dropmon

sock_example-objs := sock_example.o libbpf.o
+dropmon-objs := dropmon.o libbpf.o

# Tell kbuild to always build the programs
always := $(hostprogs-y)
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index 000000000000..475a075bf38a
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,131 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ * of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include "libbpf.h"
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
+
+static int write_to_file(const char *file, const char *str, bool keep_open)
+{
+ int fd, err;
+
+ fd = open(file, O_WRONLY);
+ err = write(fd, str, strlen(str));
+ (void) err;
+
+ if (keep_open) {
+ return fd;
+ } else {
+ close(fd);
+ return -1;
+ }
+}
+
+static int dropmon(void)
+{
+ /* the following eBPF program is equivalent to C:
+ * void filter(struct bpf_context *ctx)
+ * {
+ * long loc = ctx->arg2;
+ * long init_val = 1;
+ * void *value;
+ *
+ * value = bpf_map_lookup_elem(MAP_ID, &loc);
+ * if (value) {
+ * (*(long *) value) += 1;
+ * } else {
+ * bpf_map_update_elem(MAP_ID, &loc, &init_val);
+ * }
+ * }
+ */
+ static struct bpf_insn prog[] = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+ BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+ BPF_EXIT_INSN(),
+ };
+
+ long long key, next_key, value = 0;
+ int prog_fd, map_fd, i;
+ char fmt[32];
+
+ map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
+ if (map_fd < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ goto cleanup;
+ }
+
+ prog[4].imm = map_fd;
+ prog[16].imm = map_fd;
+
+ prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
+ sizeof(prog), "GPL");
+ if (prog_fd < 0) {
+ printf("failed to load prog '%s'\n", strerror(errno));
+ return -1;
+ }
+
+ sprintf(fmt, "bpf_%d", prog_fd);
+
+ write_to_file(TRACEPOINT "filter", fmt, true);
+
+ for (i = 0; i < 10; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd, &next_key, &value);
+ printf("location 0x%llx count %lld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+
+cleanup:
+ /* maps, programs, tracepoint filters will auto cleanup on process exit */
+
+ return 0;
+}
+
+int main(void)
+{
+ dropmon();
+ return 0;
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:30 UTC
Permalink
the library includes a trivial set of BPF syscall wrappers:

int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
int max_entries);

int bpf_update_elem(int fd, void *key, void *value);

int bpf_lookup_elem(int fd, void *key, void *value);

int bpf_delete_elem(int fd, void *key);

int bpf_get_next_key(int fd, void *key, void *next_key);

int bpf_prog_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int insn_len,
const char *license);

bpf_prog_load() stores the verifier log in the global bpf_log_buf[] array

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/libbpf.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/libbpf.h | 21 ++++++++
2 files changed, 159 insertions(+)
create mode 100644 samples/bpf/libbpf.c
create mode 100644 samples/bpf/libbpf.h

diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
new file mode 100644
index 000000000000..81396dab5caa
--- /dev/null
+++ b/samples/bpf/libbpf.c
@@ -0,0 +1,138 @@
+/* eBPF mini library */
+#include <stdlib.h>
+#include <stdio.h>
+#include <linux/unistd.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/netlink.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include "libbpf.h"
+
+struct nlattr_u32 {
+ __u16 nla_len;
+ __u16 nla_type;
+ __u32 val;
+};
+
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+ int max_entries)
+{
+ struct nlattr_u32 attr[] = {
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_KEY_SIZE,
+ .val = key_size,
+ },
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_VALUE_SIZE,
+ .val = value_size,
+ },
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_MAX_ENTRIES,
+ .val = max_entries,
+ },
+ };
+
+ return syscall(__NR_bpf, BPF_MAP_CREATE, map_type, attr, sizeof(attr), 0);
+}
+
+
+int bpf_update_elem(int fd, void *key, void *value)
+{
+ return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, fd, key, value, 0);
+}
+
+int bpf_lookup_elem(int fd, void *key, void *value)
+{
+ return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, fd, key, value, 0);
+}
+
+int bpf_delete_elem(int fd, void *key)
+{
+ return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, fd, key, 0, 0);
+}
+
+int bpf_get_next_key(int fd, void *key, void *next_key)
+{
+ return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, fd, key, next_key, 0);
+}
+
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+
+char bpf_log_buf[LOG_BUF_SIZE];
+
+int bpf_prog_load(enum bpf_prog_type prog_type,
+ const struct bpf_insn *insns, int prog_len,
+ const char *license)
+{
+ int nlattr_size, license_len, err;
+ void *nlattr, *ptr;
+ char *log_buf = bpf_log_buf;
+ int log_size = LOG_BUF_SIZE;
+ int log_level = 1;
+
+ log_buf[0] = 0;
+
+ license_len = strlen(license) + 1;
+ nlattr_size = sizeof(struct nlattr) + prog_len + sizeof(struct nlattr) +
+ ROUND_UP(license_len, 4) +
+ sizeof(struct nlattr) + sizeof(log_level) +
+ sizeof(struct nlattr) + sizeof(log_buf) +
+ sizeof(struct nlattr) + sizeof(log_size);
+
+ ptr = nlattr = malloc(nlattr_size);
+ if (!ptr) {
+ errno = ENOMEM;
+ return -1;
+ }
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = prog_len + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_TEXT,
+ };
+ ptr += sizeof(struct nlattr);
+
+ memcpy(ptr, insns, prog_len);
+ ptr += prog_len;
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = ROUND_UP(license_len, 4) + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_LICENSE,
+ };
+ ptr += sizeof(struct nlattr);
+
+ memcpy(ptr, license, license_len);
+ ptr += ROUND_UP(license_len, 4);
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = sizeof(log_level) + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_LOG_LEVEL,
+ };
+ ptr += sizeof(struct nlattr);
+ memcpy(ptr, &log_level, sizeof(log_level));
+ ptr += sizeof(log_level);
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = sizeof(log_buf) + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_LOG_BUF,
+ };
+ ptr += sizeof(struct nlattr);
+ memcpy(ptr, &log_buf, sizeof(log_buf));
+ ptr += sizeof(log_buf);
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = sizeof(log_size) + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_LOG_SIZE,
+ };
+ ptr += sizeof(struct nlattr);
+ memcpy(ptr, &log_size, sizeof(log_size));
+ ptr += sizeof(log_size);
+
+ err = syscall(__NR_bpf, BPF_PROG_LOAD, prog_type, nlattr, nlattr_size, 0);
+
+ free(nlattr);
+ return err;
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
new file mode 100644
index 000000000000..b19e39794291
--- /dev/null
+++ b/samples/bpf/libbpf.h
@@ -0,0 +1,21 @@
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+struct bpf_insn;
+
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+ int max_entries);
+int bpf_update_elem(int fd, void *key, void *value);
+int bpf_lookup_elem(int fd, void *key, void *value);
+int bpf_delete_elem(int fd, void *key);
+int bpf_get_next_key(int fd, void *key, void *next_key);
+
+int bpf_prog_load(enum bpf_prog_type prog_type,
+ const struct bpf_insn *insns, int insn_len,
+ const char *license);
+
+#define LOG_BUF_SIZE 8192
+extern char bpf_log_buf[LOG_BUF_SIZE];
+
+#endif
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:28 UTC
User interface:
fd = open("/sys/kernel/debug/tracing/__event__/filter")

write(fd, "bpf_123")

where 123 is a process-local FD associated with a previously loaded eBPF program.
__event__ is a static tracepoint event or syscall.
(kprobe support is in the next patch)
Once the program is successfully attached to a tracepoint event, the tracepoint
will be auto-enabled

close(fd)
auto-disables the tracepoint event and detaches the eBPF program from it

eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- trace_printk
- dump_stack
- fetch_ptr/u64/u32/u16/u8 values from an unsafe address via probe_kernel_read(),
so that eBPF programs can walk arbitrary kernel data structures

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/linux/ftrace_event.h | 5 +
include/trace/bpf_trace.h | 30 +++++
include/trace/ftrace.h | 10 ++
include/uapi/linux/bpf.h | 9 ++
kernel/trace/Kconfig | 1 +
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 241 ++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 3 +
kernel/trace/trace_events.c | 36 +++++-
kernel/trace/trace_events_filter.c | 72 ++++++++++-
kernel/trace/trace_syscalls.c | 18 +++
11 files changed, 424 insertions(+), 2 deletions(-)
create mode 100644 include/trace/bpf_trace.h
create mode 100644 kernel/trace/bpf_trace.c

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 06c6faa9e5cc..94b896ef6b31 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -237,6 +237,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED_BIT,
TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
TRACE_EVENT_FL_TRACEPOINT_BIT,
+ TRACE_EVENT_FL_BPF_BIT,
};

/*
@@ -259,6 +260,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
TRACE_EVENT_FL_USE_CALL_FILTER = (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+ TRACE_EVENT_FL_BPF = (1 << TRACE_EVENT_FL_BPF_BIT),
};

struct ftrace_event_call {
@@ -533,6 +535,9 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
event_triggers_post_call(file, tt);
}

+struct bpf_context;
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx);
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..4dfdf738bd12
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,30 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be split across two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+ unsigned long arg1;
+ unsigned long arg2;
+ unsigned long arg3;
+ unsigned long arg4;
+ unsigned long arg5;
+ unsigned long arg6;
+ unsigned long ret;
+};
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...);
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..ad4987ac68bb 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
*/

#include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>

/*
* DECLARE_EVENT_CLASS can be used to add a generic function
@@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto) \
if (ftrace_trigger_soft_disabled(ftrace_file)) \
return; \
\
+ if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
+ unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+ struct bpf_context __ctx; \
+ \
+ populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
+ return; \
+ } \
+ \
__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
\
entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file, \
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c18ac0c1e3e5..76d7196e518a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -396,6 +396,7 @@ enum bpf_prog_attributes {
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
BPF_PROG_TYPE_SOCKET_FILTER,
+ BPF_PROG_TYPE_TRACING_FILTER,
};

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -406,6 +407,14 @@ enum bpf_func_id {
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
+ BPF_FUNC_fetch_ptr, /* void *bpf_fetch_ptr(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u64, /* u64 bpf_fetch_u64(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u32, /* u32 bpf_fetch_u32(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u16, /* u16 bpf_fetch_u16(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u8, /* u8 bpf_fetch_u8(void *unsafe_ptr) */
+ BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
+ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */
+ BPF_FUNC_printk, /* int bpf_printk(const char *fmt, int fmt_size, ...) */
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c899dd..c816cd779697 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -75,6 +75,7 @@ config FTRACE_NMI_ENTER

config EVENT_TRACING
select CONTEXT_SWITCH_TRACER
+ depends on NET
bool

config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 67d6369ddf83..fe897168a19e 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
endif
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o
obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_PM_RUNTIME),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..eac13a14dd26
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,241 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...)
+{
+ va_list args;
+
+ va_start(args, ctx);
+
+ ctx->arg1 = va_arg(args, unsigned long);
+ ctx->arg2 = va_arg(args, unsigned long);
+ ctx->arg3 = va_arg(args, unsigned long);
+ ctx->arg4 = va_arg(args, unsigned long);
+ ctx->arg5 = va_arg(args, unsigned long);
+ ctx->arg6 = va_arg(args, unsigned long);
+
+ va_end(args);
+}
+EXPORT_SYMBOL_GPL(populate_bpf_context);
+
+static u64 bpf_fetch_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) r1;
+ void *ptr = NULL;
+
+ probe_kernel_read(&ptr, unsafe_ptr, sizeof(ptr));
+ return (u64) (unsigned long) ptr;
+}
+
+#define FETCH(SIZE) \
+static u64 bpf_fetch_##SIZE(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) \
+{ \
+ void *unsafe_ptr = (void *) r1; \
+ SIZE val = 0; \
+ \
+ probe_kernel_read(&val, unsafe_ptr, sizeof(val)); \
+ return (u64) (SIZE) val; \
+}
+FETCH(u64)
+FETCH(u32)
+FETCH(u16)
+FETCH(u8)
+#undef FETCH
+
+static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) r1;
+ void *safe_ptr = (void *) r2;
+ u32 size = (u32) r3;
+ char buf[64];
+ int err;
+
+ if (size < 64) {
+ err = probe_kernel_read(buf, unsafe_ptr, size);
+ if (err)
+ return err;
+ return memcmp(buf, safe_ptr, size);
+ }
+ return -1;
+}
+
+static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ trace_dump_stack(0);
+ return 0;
+}
+
+/* limited printk()
+ * only %d %u %x conversion specifiers allowed
+ */
+static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+ char *fmt = (char *) r1;
+ int fmt_cnt = 0;
+ int i;
+
+ /* bpf_check() guarantees that fmt points to bpf program stack and
+ * fmt_size bytes of it were initialized by bpf program
+ */
+ if (fmt[fmt_size - 1] != 0)
+ return -EINVAL;
+
+ /* check format string for allowed specifiers */
+ for (i = 0; i < fmt_size; i++)
+ if (fmt[i] == '%') {
+ if (i + 1 >= fmt_size)
+ return -EINVAL;
+ if (fmt[i + 1] != 'd' && fmt[i + 1] != 'u' &&
+ fmt[i + 1] != 'x')
+ return -EINVAL;
+ fmt_cnt++;
+ }
+
+ if (fmt_cnt > 3)
+ return -EINVAL;
+
+ return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
+ (u32) r3, (u32) r4, (u32) r5);
+}
+
+static struct bpf_func_proto tracing_filter_funcs[] = {
+#define FETCH(SIZE) \
+ [BPF_FUNC_fetch_##SIZE] = { \
+ .func = bpf_fetch_##SIZE, \
+ .gpl_only = false, \
+ .ret_type = RET_INTEGER, \
+ },
+ FETCH(ptr)
+ FETCH(u64)
+ FETCH(u32)
+ FETCH(u16)
+ FETCH(u8)
+#undef FETCH
+ [BPF_FUNC_memcmp] = {
+ .func = bpf_memcmp,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_PTR_TO_STACK,
+ .arg3_type = ARG_CONST_STACK_SIZE,
+ },
+ [BPF_FUNC_dump_stack] = {
+ .func = bpf_dump_stack,
+ .gpl_only = false,
+ .ret_type = RET_VOID,
+ },
+ [BPF_FUNC_printk] = {
+ .func = bpf_printk,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+ },
+ [BPF_FUNC_map_lookup_elem] = {
+ .func = bpf_map_lookup_elem,
+ .gpl_only = false,
+ .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ },
+ [BPF_FUNC_map_update_elem] = {
+ .func = bpf_map_update_elem,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ .arg3_type = ARG_PTR_TO_MAP_VALUE,
+ },
+ [BPF_FUNC_map_delete_elem] = {
+ .func = bpf_map_delete_elem,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ },
+};
+
+static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
+{
+ if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
+ return NULL;
+ return &tracing_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+ int size;
+ enum bpf_access_type type;
+} tracing_filter_ctx_access[] = {
+ [offsetof(struct bpf_context, arg1)] = {
+ FIELD_SIZEOF(struct bpf_context, arg1),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg2)] = {
+ FIELD_SIZEOF(struct bpf_context, arg2),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg3)] = {
+ FIELD_SIZEOF(struct bpf_context, arg3),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg4)] = {
+ FIELD_SIZEOF(struct bpf_context, arg4),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg5)] = {
+ FIELD_SIZEOF(struct bpf_context, arg5),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg6)] = {
+ FIELD_SIZEOF(struct bpf_context, arg6),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, ret)] = {
+ FIELD_SIZEOF(struct bpf_context, ret),
+ BPF_READ
+ },
+};
+
+static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ const struct bpf_context_access *access;
+
+ if (off < 0 || off >= ARRAY_SIZE(tracing_filter_ctx_access))
+ return false;
+
+ access = &tracing_filter_ctx_access[off];
+ if (access->size == size && (access->type & type))
+ return true;
+
+ return false;
+}
+
+static struct bpf_verifier_ops tracing_filter_ops = {
+ .get_func_proto = tracing_filter_func_proto,
+ .is_valid_access = tracing_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+ .ops = &tracing_filter_ops,
+ .type = BPF_PROG_TYPE_TRACING_FILTER,
+};
+
+static int __init register_tracing_filter_ops(void)
+{
+ bpf_register_prog_type(&tl);
+ return 0;
+}
+late_initcall(register_tracing_filter_ops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 385391fb1d3b..f0b7caa71b9d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -986,12 +986,15 @@ struct ftrace_event_field {
int is_signed;
};

+struct bpf_prog;
+
struct event_filter {
int n_preds; /* Number assigned */
int a_preds; /* allocated */
struct filter_pred *preds;
struct filter_pred *root;
char *filter_string;
+ struct bpf_prog *prog;
};

struct event_subsystem {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index ef06ce7e9cf8..d79f0ee98881 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1051,6 +1051,26 @@ event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
return r;
}

+static int event_filter_release(struct inode *inode, struct file *filp)
+{
+ struct ftrace_event_file *file;
+ char buf[2] = "0";
+
+ mutex_lock(&event_mutex);
+ file = event_file_data(filp);
+ if (file) {
+ if (file->event_call->flags & TRACE_EVENT_FL_BPF) {
+ /* auto-disable the filter */
+ ftrace_event_enable_disable(file, 0);
+
+ /* if BPF filter was used, clear it on fd close */
+ apply_event_filter(file, buf);
+ }
+ }
+ mutex_unlock(&event_mutex);
+ return 0;
+}
+
static ssize_t
event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
loff_t *ppos)
@@ -1074,10 +1094,23 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,

mutex_lock(&event_mutex);
file = event_file_data(filp);
- if (file)
+ if (file) {
err = apply_event_filter(file, buf);
+ if (!err && file->event_call->flags & TRACE_EVENT_FL_BPF)
+ /* once filter is applied, auto-enable it */
+ ftrace_event_enable_disable(file, 1);
+ }
+
mutex_unlock(&event_mutex);

+ if (file && file->event_call->flags & TRACE_EVENT_FL_BPF) {
+ /*
+ * allocate per-cpu printk buffers, since eBPF program
+ * might be calling bpf_trace_printk
+ */
+ trace_printk_init_buffers();
+ }
+
free_page((unsigned long) buf);
if (err < 0)
return err;
@@ -1328,6 +1361,7 @@ static const struct file_operations ftrace_event_filter_fops = {
.open = tracing_open_generic,
.read = event_filter_read,
.write = event_filter_write,
+ .release = event_filter_release,
.llseek = default_llseek,
};

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 7a8c1528e141..401fca436054 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,9 @@
#include <linux/mutex.h>
#include <linux/perf_event.h>
#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <linux/filter.h>

#include "trace.h"
#include "trace_output.h"
@@ -535,6 +538,16 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
return WALK_PRED_DEFAULT;
}

+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx)
+{
+ BUG_ON(!filter || !filter->prog);
+
+ rcu_read_lock();
+ BPF_PROG_RUN(filter->prog, (void *) ctx);
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
+
/* return 1 if event matches, 0 otherwise (discard) */
int filter_match_preds(struct event_filter *filter, void *rec)
{
@@ -789,6 +802,8 @@ static void __free_filter(struct event_filter *filter)
if (!filter)
return;

+ if (filter->prog)
+ bpf_prog_put(filter->prog);
__free_preds(filter);
kfree(filter->filter_string);
kfree(filter);
@@ -1857,6 +1872,48 @@ static int create_filter_start(char *filter_str, bool set_str,
return err;
}

+static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+{
+ struct event_filter *filter;
+ struct bpf_prog *prog;
+ long ufd;
+ int err = 0;
+
+ *filterp = NULL;
+
+ filter = __alloc_filter();
+ if (!filter)
+ return -ENOMEM;
+
+ err = replace_filter_string(filter, filter_str);
+ if (err)
+ goto free_filter;
+
+ err = kstrtol(filter_str + 4, 0, &ufd);
+ if (err)
+ goto free_filter;
+
+ err = -ESRCH;
+ prog = bpf_prog_get(ufd);
+ if (!prog)
+ goto free_filter;
+
+ filter->prog = prog;
+
+ err = -EINVAL;
+ if (prog->info->prog_type != BPF_PROG_TYPE_TRACING_FILTER)
+ /* prog_id is valid, but it's not a tracing filter program */
+ goto free_filter;
+
+ *filterp = filter;
+
+ return 0;
+
+free_filter:
+ __free_filter(filter);
+ return err;
+}
+
static void create_filter_finish(struct filter_parse_state *ps)
{
if (ps) {
@@ -1966,7 +2023,20 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
return 0;
}

- err = create_filter(call, filter_string, true, &filter);
+ /*
+ * 'bpf_123' string is a request to attach eBPF program with id == 123
+ * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
+ */
+ if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
+ filter_string[4] != 0) {
+ err = create_filter_bpf(filter_string, &filter);
+ if (!err)
+ call->flags |= TRACE_EVENT_FL_BPF;
+ } else {
+ err = create_filter(call, filter_string, true, &filter);
+ if (!err)
+ call->flags &= ~TRACE_EVENT_FL_BPF;
+ }

/*
* Always swap the call filter with the new filter
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 759d5e004517..a8d61a685480 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -7,6 +7,7 @@
#include <linux/ftrace.h>
#include <linux/perf_event.h>
#include <asm/syscall.h>
+#include <trace/bpf_trace.h>

#include "trace_output.h"
#include "trace.h"
@@ -328,6 +329,14 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (!sys_data)
return;

+ if (ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF) {
+ struct bpf_context __ctx;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+ &__ctx.arg1);
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+ return;
+ }
+
size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;

local_save_flags(irq_flags);
@@ -375,6 +384,15 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
if (!sys_data)
return;

+ if (ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF) {
+ struct bpf_context __ctx;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+ &__ctx.arg1);
+ __ctx.ret = syscall_get_return_value(current, regs);
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+ return;
+ }
+
local_save_flags(irq_flags);
pc = preempt_count();
--
1.7.9.5
Brendan Gregg
2014-08-14 21:20:05 UTC
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <***@plumgrid.com> wrote:
[...]
Post by Alexei Starovoitov
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be seen into two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+ unsigned long arg1;
+ unsigned long arg2;
+ unsigned long arg3;
+ unsigned long arg4;
+ unsigned long arg5;
+ unsigned long arg6;
+ unsigned long ret;
+};
While this works, the argN+1 shift for 32-bit is a gotcha to learn.
Let's say arg1 was 64-bit, and my program only examined arg2. I'd need
two programs, one for 64-bit (using arg2) and one for 32-bit (using arg3).
If there were a way not to shift arguments, I could have one program for both.
Eg, additional arg1hi, arg2hi, ... for the higher order u32s.

Brendan
Alexei Starovoitov
2014-08-15 06:08:10 UTC
On Thu, Aug 14, 2014 at 2:20 PM, Brendan Gregg
Post by Brendan Gregg
[...]
Post by Alexei Starovoitov
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be seen into two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+ unsigned long arg1;
+ unsigned long arg2;
+ unsigned long arg3;
+ unsigned long arg4;
+ unsigned long arg5;
+ unsigned long arg6;
+ unsigned long ret;
+};
While this works, the argN+1 shift for 32-bit is a gotcha to learn.
Lets say arg1 was 64-bit, and my program only examined arg2. I'd need
two programs, one for 64-bit (using arg2) and 32-bit (arg3). If there
correct.
I've picked 'long' type for these tracepoint 'arguments' to match
what is going on at assembler level.
32-bit archs are passing 64-bit values in two consecutive registers
or two stack slots. So it's partially exposing architectural details.
I've tried to use u64 here, but it complicated tracepoint+ebpf patch
a lot, since I need per-architecture support for moving C arguments
into u64 variables and hacking tracepoint event definitions in a nasty
ways. This 'long' type approach is the least intrusive I could find.
Also out of 1842 total tracepoint fields, only 144 fields are 64-bit,
so rarely one would need to deal with u64. Most of the tracepoint
arguments are either longs, ints or pointers, which fits this approach
the best.
In general the eBPF design approach is to keep kernel bits as simple
as possible and move complexity to user space.
In this case some higher language than C for writing scripts can
hide this oddity.
Andy Lutomirski
2014-08-15 17:20:25 UTC
Post by Alexei Starovoitov
On Thu, Aug 14, 2014 at 2:20 PM, Brendan Gregg
Post by Brendan Gregg
[...]
Post by Alexei Starovoitov
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be seen into two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+ unsigned long arg1;
+ unsigned long arg2;
+ unsigned long arg3;
+ unsigned long arg4;
+ unsigned long arg5;
+ unsigned long arg6;
+ unsigned long ret;
+};
While this works, the argN+1 shift for 32-bit is a gotcha to learn.
Lets say arg1 was 64-bit, and my program only examined arg2. I'd need
two programs, one for 64-bit (using arg2) and 32-bit (arg3). If there
correct.
I've picked 'long' type for these tracepoint 'arguments' to match
what is going on at assembler level.
32-bit archs are passing 64-bit values in two consecutive registers
or two stack slots. So it's partially exposing architectural details.
I've tried to use u64 here, but it complicated tracepoint+ebpf patch
a lot, since I need per-architecture support for moving C arguments
into u64 variables and hacking tracepoint event definitions in a nasty
ways. This 'long' type approach is the least intrusive I could find.
Also out of 1842 total tracepoint fields, only 144 fields are 64-bit,
so rarely one would need to deal with u64. Most of the tracepoint
arguments are either longs, ints or pointers, which fits this approach
the best.
In general the eBPF design approach is to keep kernel bits as simple
as possible and move complexity to user space.
In this case some higher language than C for writing scripts can
hide this oddity.
The downside of this approach is that compat support might be
difficult or impossible.

--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
Alexei Starovoitov
2014-08-15 17:36:17 UTC
Post by Andy Lutomirski
The downside of this approach is that compat support might be
difficult or impossible.
What do you mean by compat? 32-bit programs on 64-bit kernels?
There is no such concept for eBPF. All eBPF programs are always
operating on 64-bit registers.
Andy Lutomirski
2014-08-15 18:50:49 UTC
Post by Alexei Starovoitov
Post by Andy Lutomirski
The downside of this approach is that compat support might be
difficult or impossible.
Would do you mean by compat? 32-bit programs on 64-bit kernels?
There is no such concept for eBPF. All eBPF programs are always
operating on 64-bit registers.
Doesn't the eBPF program need to know sizeof(long) to read these
fields correctly? Or am I misunderstanding what the code does?

--Andy
Alexei Starovoitov
2014-08-15 18:56:12 UTC
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
The downside of this approach is that compat support might be
difficult or impossible.
Would do you mean by compat? 32-bit programs on 64-bit kernels?
There is no such concept for eBPF. All eBPF programs are always
operating on 64-bit registers.
Doesn't the eBPF program need to know sizeof(long) to read these
fields correctly? Or am I misunderstanding what the code does?
correct. eBPF program would be using 8-byte read on 64-bit kernel
and 4-byte read on 32-bit kernel. Same with access to ptrace fields
and pretty much all other fields in the kernel. The program will be
different on different kernels.
Say, this bpf_context struct doesn't exist at all. The programs would
still need to be different to walk in-kernel data structures...
Andy Lutomirski
2014-08-15 19:02:51 UTC
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
The downside of this approach is that compat support might be
difficult or impossible.
Would do you mean by compat? 32-bit programs on 64-bit kernels?
There is no such concept for eBPF. All eBPF programs are always
operating on 64-bit registers.
Doesn't the eBPF program need to know sizeof(long) to read these
fields correctly? Or am I misunderstanding what the code does?
correct. eBPF program would be using 8-byte read on 64-bit kernel
and 4-byte read on 32-bit kernel. Same with access to ptrace fields
and pretty much all other fields in the kernel. The program will be
different on different kernels.
Say, this bpf_context struct doesn't exist at all. The programs would
still need to be different to walk in-kernel data structures...
Hmm. I guess this isn't so bad.

What's the actual difficulty with using u64? ISTM that, if the clang
front-end can't deal with u64, there's a bigger problem. Or is it
something else I don't understand.

--Andy
Alexei Starovoitov
2014-08-15 19:16:11 UTC
Post by Andy Lutomirski
Post by Alexei Starovoitov
correct. eBPF program would be using 8-byte read on 64-bit kernel
and 4-byte read on 32-bit kernel. Same with access to ptrace fields
and pretty much all other fields in the kernel. The program will be
different on different kernels.
Say, this bpf_context struct doesn't exist at all. The programs would
still need to be different to walk in-kernel data structures...
Hmm. I guess this isn't so bad.
What's the actual difficulty with using u64? ISTM that, if the clang
front-end can't deal with u64, there's a bigger problem. Or is it
something else I don't understand.
clang/llvm has no problem with u64 :)
This bpf_context struct for tracing is trying to answer the question:
'what's the most convenient way to access tracepoint arguments
from a script'.
When kernel code has something like:
trace_kfree_skb(skb, net_tx_action);
the script needs to be able to access this 'skb' and 'net_tx_action'
values through _single_ data structure.
In this proposal they are ctx->arg1 and ctx->arg2.
I've considered having a different bpf_context for every event, but
the complexity explodes: I'd need to hack all event definitions and so on.
imo it's better to move the complexity to userspace, so the program author
or a high-level language abstracts these details.
Andy Lutomirski
2014-08-15 19:18:53 UTC
Permalink
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
correct. eBPF program would be using 8-byte read on 64-bit kernel
and 4-byte read on 32-bit kernel. Same with access to ptrace fields
and pretty much all other fields in the kernel. The program will be
different on different kernels.
Say, this bpf_context struct doesn't exist at all. The programs would
still need to be different to walk in-kernel data structures...
Hmm. I guess this isn't so bad.
What's the actual difficulty with using u64? ISTM that, if the clang
front-end can't deal with u64, there's a bigger problem. Or is it
something else I don't understand.
clang/llvm has no problem with u64 :)
'what's the most convenient way to access tracepoint arguments
from a script'.
trace_kfree_skb(skb, net_tx_action);
the script needs to be able to access this 'skb' and 'net_tx_action'
values through _single_ data structure.
In this proposal they are ctx->arg1 and ctx->arg2.
I've considered having different bpf_context's for every event, but
the complexity explodes. I need to hack all event definitions and so on.
imo it's better to move complexity to userspace, so program author
or high level language abstracts these details.
I still don't understand why making them long instead of u64 is
helpful, though. I feel like I'm missing something obvious here.
--
Andy Lutomirski
AMA Capital Management, LLC
Alexei Starovoitov
2014-08-15 19:35:20 UTC
Permalink
Post by Andy Lutomirski
Post by Alexei Starovoitov
clang/llvm has no problem with u64 :)
'what's the most convenient way to access tracepoint arguments
from a script'.
trace_kfree_skb(skb, net_tx_action);
the script needs to be able to access this 'skb' and 'net_tx_action'
values through _single_ data structure.
In this proposal they are ctx->arg1 and ctx->arg2.
I've considered having different bpf_context's for every event, but
the complexity explodes. I need to hack all event definitions and so on.
imo it's better to move complexity to userspace, so program author
or high level language abstracts these details.
I still don't understand why making them long instead of u64 is
helpful, though. I feel like I'm missing something obvious here.
I promise to come back to this... Have to go off grid...
will think of it in the meantime... Appreciate this discussion!!
Alexei Starovoitov
2014-08-19 18:39:09 UTC
Permalink
Post by Andy Lutomirski
Post by Alexei Starovoitov
'what's the most convenient way to access tracepoint arguments
from a script'.
trace_kfree_skb(skb, net_tx_action);
the script needs to be able to access this 'skb' and 'net_tx_action'
values through _single_ data structure.
In this proposal they are ctx->arg1 and ctx->arg2.
I've considered having different bpf_context's for every event, but
the complexity explodes. I need to hack all event definitions and so on.
imo it's better to move complexity to userspace, so program author
or high level language abstracts these details.
I still don't understand why making them long instead of u64 is
helpful, though. I feel like I'm missing something obvious here.
The problem statement:
- tracepoint events are defined as:
TRACE_EVENT(sock_exceed_buf_limit,
TP_PROTO(struct sock *sk, struct proto *prot, long allocated),
TP_ARGS(sk, prot, allocated),

- from eBPF C program (or higher level language) I would like
to access these tracepoint arguments as
ctx->arg1, ctx->arg2, ctx->arg3

- accessing tracepoint fields after buffer copy is not an option,
since it would mean an unnecessary alloc/copy/free of a lot of values
(including strings) per event, most of which programs will ignore.
If needed, programs can fetch them on demand.

Bad approach #1
- define a different bpf_context per event and customize the eBPF verifier
to have different program types per event, so a particular program
can be attached to one particular event only
Cons: quite complex, requires trace/events/*.h hacking, and
one eBPF program cannot be attached to multiple events.
So #1 is a no-go.

Approach #2
- define bpf_context once for all tracepoint events as:
struct bpf_context {
unsigned long arg1, arg2, arg3, ...
};
and main ftrace.h macro as:
struct bpf_context ctx;
populate_bpf_context(&ctx, args, 0, 0, 0, 0, 0);
trace_filter_call_bpf(ftrace_file->filter, &ctx);
where 'args' is a macro taken from TP_ARGS above and
/* called from ftrace_raw_event_*() to copy args */
void populate_bpf_context(struct bpf_context *ctx, ...)
{
va_start(args, ctx);
ctx->arg1 = va_arg(args, unsigned long);
ctx->arg2 = va_arg(args, unsigned long);
This approach relies on default argument promotion when args are passed
into a vararg function.
On 64-bit arch our tracepoint arguments 'sk, prot, allocated' will
get stored into arg1, arg2, arg3 and the rest of argX will be zeros.
On 32-bit 'u64' types will be passed in two 'long' slots of vararg.
Obviously changing 'long' to 'u64' in bpf_context and in
populate_bpf_context() will not work, because types
are promoted to 'long'.
Disadvantage of this approach is that 32 vs 64 bit archs need to
deal with argX differently.
That's what you saw in this patch.

New approach #3
just discovered __MAPx() macro used by syscalls.h which can
be massaged for this use case, so define:
struct bpf_context {
u64 arg1, arg2, arg3,...
};
and argument casting macro as:
#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
..
#define __BPF_CAST(a,...) __CAST_TO_U64(a), __BPF_CAST6(__VA_ARGS__)

so main ftrace.h macro becomes:
struct bpf_context __ctx = ((struct bpf_context) {
__BPF_CAST(args, 0, 0, 0, 0, 0, 0)
});
trace_filter_call_bpf(ftrace_file->filter, &__ctx);

where 'args' is still the same 'sk, prot, allocated' from TP_ARGS.
The casting macro will cast 'sk' to u64 and assign it into arg1,
'prot' will be cast to u64 and assigned to arg2, etc.

All good, but the tricky part is how to cast all arguments passed
into tracepoint events into u64 without warnings on
32-bit and 64-bit architectures.
The following:
#define __CAST_TO_U64(expr) (u64) expr
will not work, since the compiler will spew warnings about
casting pointers to integers...
The following
#define __CAST_TO_U64(expr) (u64) (long) expr
will work fine on 64-bit architectures, since all integer and
pointer types will be cast warning-free and stored
in arg1, arg2, arg3, ...
but it won't work on 32-bit architectures, since full u64
tracepoint arguments will be chopped to 'long'.

It took a lot of macro wizardry to come up with the following:
/* cast any integer or pointer type to u64 without warnings */
#define __CAST_TO_U64(expr) \
__builtin_choose_expr(sizeof(long) < sizeof(expr), \
(u64) (expr - ((typeof(expr))0)), \
(u64) (long) expr)
The tricky part is that GCC syntax-parses, and warns about,
both sides of __builtin_choose_expr(), so the u64 case for 32-bit
archs needs to be fancy enough that all types can go through it
warning-free.
Though it's tricky, the end result is nice.
The disadvantages of approach #2 are solved and tracepoint
arguments are stored into 'u64 argX' on both 32 and 64-bit archs.
The extra benefit is that this casting macro is way faster than
vararg approach #2.
So in V5 of this series I'm planning to use this new approach
unless there are better ideas.
Full diff of this approach:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=235d87dd7afd8d5262556cba7882d6efb25d8305
Andy Lutomirski
2014-08-15 17:25:44 UTC
Permalink
Post by Alexei Starovoitov
fd = open("/sys/kernel/debug/tracing/__event__/filter")
write(fd, "bpf_123")
I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem. In general, write(2) should never do anything that involves
any security-relevant context of the caller.

Ideally, you would look up fd 123 in the file table of whomever called
open. If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.

This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.

--Andy
Alexei Starovoitov
2014-08-15 17:51:27 UTC
Permalink
Post by Andy Lutomirski
Post by Alexei Starovoitov
fd = open("/sys/kernel/debug/tracing/__event__/filter")
write(fd, "bpf_123")
I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem. In general, write(2) should never do anything that involves
any security-relevant context of the caller.
Ideally, you would look up fd 123 in the file table of whomever called
open. If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.
This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.
hmm, you need to be root to open the events anyway.
Pretty much all of tracing is root-only, since any kernel data
structure can be printed, stored into maps and so on.
So I don't quite follow your security concern here.

Even say root opens a tracepoint and does exec() of another
app that uploads an eBPF program, gets a program_fd and does
a write into the tracepoint fd. The root app that did this open() is
doing exec() on purpose. It's not like it's exec-ing something
it doesn't know about.

Remember, FDs were your idea in the first place ;)
I had global ids and everything root initially.
Andy Lutomirski
2014-08-15 18:53:16 UTC
Permalink
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
fd = open("/sys/kernel/debug/tracing/__event__/filter")
write(fd, "bpf_123")
I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem. In general, write(2) should never do anything that involves
any security-relevant context of the caller.
Ideally, you would look up fd 123 in the file table of whomever called
open. If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.
This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.
hmm. you need to be root to open the events anyway.
pretty much the whole tracing for root only, since any kernel data
structures can be printed, stored into maps and so on.
So I don't quite follow your security concern here.
Even say root opens a tracepoint and does exec() of another
app that uploads ebpf program, gets program_fd and does
write into tracepoint fd. The root app that did this open() is
doing exec() on purpose. It's not like it's exec-ing something
it doesn't know about.
As long as everyone who can access debugfs/tracing/whatever has all
privileges, this is fine.

If not, then it's a minor capability or MAC bypass. Suppose you only
have one capability or, more realistically, limited MAC permissions.
You can still open the tracing file, pass it to an unwitting program
with elevated permission (e.g. using selinux's entrypoint mechanism),
and trick that program into writing bpf_123.

Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
attack is possible.

I don't think that fixing this should be a prerequisite for merging,
since the risk is so small. Nonetheless, it would be nice. (This
family of attacks has led to several root vulnerabilities in the
past.)

--Andy
Post by Alexei Starovoitov
Remember, FDs were your idea in the first place ;)
I had global ids and everything root initially.
--
Andy Lutomirski
AMA Capital Management, LLC
Alexei Starovoitov
2014-08-15 19:07:42 UTC
Permalink
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
fd = open("/sys/kernel/debug/tracing/__event__/filter")
write(fd, "bpf_123")
I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem. In general, write(2) should never do anything that involves
any security-relevant context of the caller.
Ideally, you would look up fd 123 in the file table of whomever called
open. If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.
This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.
hmm. you need to be root to open the events anyway.
pretty much the whole tracing for root only, since any kernel data
structures can be printed, stored into maps and so on.
So I don't quite follow your security concern here.
Even say root opens a tracepoint and does exec() of another
app that uploads ebpf program, gets program_fd and does
write into tracepoint fd. The root app that did this open() is
doing exec() on purpose. It's not like it's exec-ing something
it doesn't know about.
As long as everyone who can debugfs/tracing/whatever has all
privileges, then this is fine.
If not, then it's a minor capability or MAC bypass. Suppose you only
have one capability or, more realistically, limited MAC permissions.
Hard to think of the MAC abbreviation other than in the networking way... ;)
MAC bypass... kinda sounds like L3 networking without L2... ;)
Post by Andy Lutomirski
You can still open the tracing file, pass it to an unwitting program
with elevated permission (e.g. using selinux's entrypoint mechanism),
and trick that program into writing bpf_123.
hmm, but to open a tracing file you'd need to be root already...
otherwise yeah, if non-root could open it and pass it, then it
would be nasty.
Post by Andy Lutomirski
Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
attack is possible.
I don't think that fixing this should be a prerequisite for merging,
since the risk is so small. Nonetheless, it would be nice. (This
family of attacks has led to several root vulnerabilities in the
past.)
Ok. I think keeping track of the pid between open and write is kinda
ugly. Should we add a new CAP flag and check it for all file
ops? Another option is to conditionally make open() of tracing
files cloexec...
Andy Lutomirski
2014-08-15 19:20:04 UTC
Permalink
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
fd = open("/sys/kernel/debug/tracing/__event__/filter")
write(fd, "bpf_123")
I didn't follow all the code flow leading to parsing the "bpf_123"
string, but if it works the way I imagine it does, it's a security
problem. In general, write(2) should never do anything that involves
any security-relevant context of the caller.
Ideally, you would look up fd 123 in the file table of whomever called
open. If that's difficult to implement efficiently, then it would be
nice to have some check that the callers of write(2) and open(2) are
the same task and that exec wasn't called in between.
This isn't a very severe security issue because you need privilege to
open the thing in the first place, but it would still be nice to
address.
hmm. you need to be root to open the events anyway.
pretty much the whole tracing for root only, since any kernel data
structures can be printed, stored into maps and so on.
So I don't quite follow your security concern here.
Even say root opens a tracepoint and does exec() of another
app that uploads ebpf program, gets program_fd and does
write into tracepoint fd. The root app that did this open() is
doing exec() on purpose. It's not like it's exec-ing something
it doesn't know about.
As long as everyone who can debugfs/tracing/whatever has all
privileges, then this is fine.
If not, then it's a minor capability or MAC bypass. Suppose you only
have one capability or, more realistically, limited MAC permissions.
Hard to think of MAC abbreviation other than in networking way... ;)
MAC bypass... kinda sounds like L3 networking without L2... ;)
Post by Andy Lutomirski
You can still open the tracing file, pass it to an unwitting program
with elevated permission (e.g. using selinux's entrypoint mechanism),
and trick that program into writing bpf_123.
hmm, but to open tracing file you'd need to be root already...
otherwise yeah, if non-root could open it and pass it, then it
would be nasty.
Post by Andy Lutomirski
Admittedly, it's unlikely that fd 123 will be an *eBPF* fd, but the
attack is possible.
I don't think that fixing this should be a prerequisite for merging,
since the risk is so small. Nonetheless, it would be nice. (This
family of attacks has led to several root vulnerabilities in the
past.)
Ok. I think keeping track of the pid between open and write is kinda
ugly.
Agreed.

TBH, I would just add a comment to the open implementation saying
that, if unprivileged or less privileged open is allowed, then this
needs to be fixed.
Post by Alexei Starovoitov
Should we add some new CAP flag and check it for all file
ops? Another option is to conditionally make open() of tracing
files as cloexec...
That won't help. The same attack can be done with SCM_RIGHTS, and
cloexec can be cleared.
--
Andy Lutomirski
AMA Capital Management, LLC
Alexei Starovoitov
2014-08-15 19:29:38 UTC
Permalink
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
I don't think that fixing this should be a prerequisite for merging,
since the risk is so small. Nonetheless, it would be nice. (This
family of attacks has led to several root vulnerabilities in the
past.)
Ok. I think keeping track of the pid between open and write is kinda
ugly.
Agreed.
TBH, I would just add a comment to the open implementation saying
that, if unprivileged or less privileged open is allowed, then this
needs to be fixed.
ok. will do.
Post by Andy Lutomirski
Post by Alexei Starovoitov
Should we add some new CAP flag and check it for all file
ops? Another option is to conditionally make open() of tracing
files as cloexec...
That won't help. The same attack can be done with SCM_RIGHTS, and
cloexec can be cleared.
ouch, can we then make eBPF FDs and maybe debugfs FDs
not passable at all? Otherwise it feels like the generality and
flexibility of FDs is becoming a burden.
Andy Lutomirski
2014-08-15 19:32:41 UTC
Permalink
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by Alexei Starovoitov
Post by Andy Lutomirski
I don't think that fixing this should be a prerequisite for merging,
since the risk is so small. Nonetheless, it would be nice. (This
family of attacks has lead to several root vulnerabilities in the
past.)
Ok. I think keeping track of the pid between open and write is kinda
ugly.
Agreed.
TBH, I would just add a comment to the open implementation saying
that, if unprivileged or less privileged open is allowed, then this
needs to be fixed.
ok. will do.
Post by Andy Lutomirski
Post by Alexei Starovoitov
Should we add some new CAP flag and check it for all file
ops? Another option is to conditionally make open() of tracing
files as cloexec...
That won't help. The same attack can be done with SCM_RIGHTS, and
cloexec can be cleared.
ouch, can we then make eBPF FDs and maybe debugfs FDs
not passable at all? Otherwise it feels like the generality and
flexibility of FDs is becoming a burden.
I'm not sure there's much of a general problem. The issue is when
there's an fd for which write(2) (or other
assumed-to-not-check-permissions calls like read, pread, pwrite, etc.)
depends on context. This has historically been an issue for netlink and
various /proc files.

--Andy
Alexei Starovoitov
2014-08-13 07:57:31 UTC
Permalink
this socket filter example does:

- creates a hashtable in kernel with key 4 bytes and value 8 bytes

- populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto

- loads eBPF program:
r0 = skb[14 + 9]; // load one byte of ip->proto
*(u32*)(fp - 4) = r0;
value = bpf_map_lookup_elem(map_fd, fp - 4);
if (value)
(*(u64*)value) += 1;

- attaches this program to eth0 raw socket

- every second user space reads map[6] and map[17] to see how many
TCP and UDP packets were seen on eth0

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 12 ++++
samples/bpf/sock_example.c | 158 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 171 insertions(+)
create mode 100644 samples/bpf/.gitignore
create mode 100644 samples/bpf/Makefile
create mode 100644 samples/bpf/sock_example.c

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index 000000000000..5465c6e92a00
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1 @@
+sock_example
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
new file mode 100644
index 000000000000..63c65e5faf58
--- /dev/null
+++ b/samples/bpf/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := sock_example
+
+sock_example-objs := sock_example.o libbpf.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS += -I$(objtree)/usr/include
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index 000000000000..a577ee64de5b
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,158 @@
+/* eBPF example program:
+ * - creates a hashtable in kernel with key 4 bytes and value 8 bytes
+ *
+ * - populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto
+ *
+ * - loads eBPF program:
+ * r0 = skb[14 + 9]; // load one byte of ip->proto
+ * *(u32*)(fp - 4) = r0;
+ * value = bpf_map_lookup_elem(map_fd, fp - 4);
+ * if (value)
+ * (*(u64*)value) += 1;
+ *
+ * - attaches this program to eth0 raw socket
+ *
+ * - every second user space reads map[6] and map[17] to see how many
+ * TCP and UDP packets were seen on eth0
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include "libbpf.h"
+
+static int open_raw_sock(const char *name)
+{
+ struct sockaddr_ll sll;
+ struct packet_mreq mr;
+ struct ifreq ifr;
+ int sock;
+
+ sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, htons(ETH_P_ALL));
+ if (sock < 0) {
+ printf("cannot open socket!\n");
+ return -1;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strncpy((char *)ifr.ifr_name, name, IFNAMSIZ);
+ if (ioctl(sock, SIOCGIFINDEX, &ifr) < 0) {
+ printf("ioctl: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&sll, 0, sizeof(sll));
+ sll.sll_family = AF_PACKET;
+ sll.sll_ifindex = ifr.ifr_ifindex;
+ sll.sll_protocol = htons(ETH_P_ALL);
+ if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+ printf("bind: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&mr, 0, sizeof(mr));
+ mr.mr_ifindex = ifr.ifr_ifindex;
+ mr.mr_type = PACKET_MR_PROMISC;
+ if (setsockopt(sock, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr)) < 0) {
+ printf("set_promisc: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+ return sock;
+}
+
+static int test_sock(void)
+{
+ static struct bpf_insn prog[] = {
+ BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+ BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+ BPF_EXIT_INSN(),
+ };
+ int sock = -1, map_fd, prog_fd, i, key;
+ long long value = 0, tcp_cnt, udp_cnt;
+
+ map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
+ if (map_fd < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ /* must have been left from previous aborted run, delete it */
+ goto cleanup;
+ }
+
+ key = 6; /* tcp */
+ if (bpf_update_elem(map_fd, &key, &value) < 0) {
+ printf("update err key=%d\n", key);
+ goto cleanup;
+ }
+
+ key = 17; /* udp */
+ if (bpf_update_elem(map_fd, &key, &value) < 0) {
+ printf("update err key=%d\n", key);
+ goto cleanup;
+ }
+
+ prog[5].imm = map_fd;
+
+ prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
+ "GPL");
+ if (prog_fd < 0) {
+ printf("failed to load prog '%s'\n", strerror(errno));
+ goto cleanup;
+ }
+
+ sock = open_raw_sock("eth0");
+
+ if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_fd, sizeof(prog_fd)) < 0) {
+ printf("setsockopt %d\n", errno);
+ goto cleanup;
+ }
+
+ for (i = 0; i < 10; i++) {
+ key = 6;
+ if (bpf_lookup_elem(map_fd, &key, &tcp_cnt) < 0) {
+ printf("lookup err\n");
+ break;
+ }
+ key = 17;
+ if (bpf_lookup_elem(map_fd, &key, &udp_cnt) < 0) {
+ printf("lookup err\n");
+ break;
+ }
+ printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
+ sleep(1);
+ }
+
+cleanup:
+ /* maps, programs, raw sockets will auto cleanup on process exit */
+
+ return 0;
+}
+
+int main(void)
+{
+ test_sock();
+ return 0;
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:27 UTC
Permalink
introduce new setsockopt() command:

int fd;
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &fd, sizeof(fd))

fd is associated with an eBPF program previously loaded via:

fd = syscall(__NR_bpf, BPF_PROG_LOAD, BPF_PROG_TYPE_SOCKET_FILTER,
&prog, sizeof(prog));

setsockopt() calls bpf_prog_get(), which increments the refcnt of the
program, so it doesn't get unloaded while the socket is using it.

The same eBPF program can be attached to different sockets.

Process exit automatically closes the socket, which calls
sk_filter_uncharge(), which decrements the refcnt of the eBPF program.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/cris/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/linux/filter.h | 1 +
include/uapi/asm-generic/socket.h | 2 +
net/core/filter.c | 135 +++++++++++++++++++++++++++++++-
net/core/sock.c | 13 +++
17 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 3de1394bcab8..8c83c376b5ba 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6e6cd159924b..498ef7220466 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index ed94e5ed0a23..0d5120724780 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -82,6 +82,8 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */


diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index ca2c6e6f31c6..81fba267c285 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -80,5 +80,7 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */

diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index a1b49bac7951..9cbb2e82fa7c 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 6c9a24b3aefa..587ac2fb4106 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index a14baa218c76..ab1aed2306db 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 6aa3ce1854aa..1c4f916d0ef1 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index fe35ceacf0e7..d189bb79ca07 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -79,4 +79,6 @@

#define SO_BPF_EXTENSIONS 0x4029

+#define SO_ATTACH_FILTER_EBPF 0x402a
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index a9c3e2e18c05..88488f24ae7f 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index e031332096d7..c5f26af90366 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -86,4 +86,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 54d9608681b6..667ed3fa63f2 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -76,6 +76,8 @@

#define SO_BPF_EXTENSIONS 0x0032

+#define SO_ATTACH_FILTER_EBPF 0x0033
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 39acec0cf0b1..24f3e4434979 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f06913b29861..9b3f4173d1fd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -78,6 +78,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
void bpf_prog_destroy(struct bpf_prog *fp);

int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_attach_filter_ebpf(u32 ufd, struct sock *sk);
int sk_detach_filter(struct sock *sk);

int bpf_check_classic(const struct sock_filter *filter, unsigned int flen);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index ea0796bdcf88..f41844e9ac07 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -82,4 +82,6 @@

#define SO_BPF_EXTENSIONS 48

+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index ed15874a9beb..0c8531aa6302 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -44,6 +44,7 @@
#include <linux/ratelimit.h>
#include <linux/seccomp.h>
#include <linux/if_vlan.h>
+#include <linux/bpf.h>

/**
* sk_filter - run a packet through a socket filter
@@ -844,8 +845,12 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)

static void __bpf_prog_release(struct bpf_prog *prog)
{
- bpf_release_orig_filter(prog);
- bpf_prog_free(prog);
+ if (prog->has_info) {
+ bpf_prog_put(prog);
+ } else {
+ bpf_release_orig_filter(prog);
+ bpf_prog_free(prog);
+ }
}

static void __sk_filter_release(struct sk_filter *fp)
@@ -1120,6 +1125,132 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_attach_filter);

+int sk_attach_filter_ebpf(u32 ufd, struct sock *sk)
+{
+ struct sk_filter *fp, *old_fp;
+ struct bpf_prog *prog;
+
+ if (sock_flag(sk, SOCK_FILTER_LOCKED))
+ return -EPERM;
+
+ prog = bpf_prog_get(ufd);
+ if (!prog)
+ return -EINVAL;
+
+ if (prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+ /* valid fd, but invalid program type */
+ bpf_prog_put(prog);
+ return -EINVAL;
+ }
+
+ fp = kmalloc(sizeof(*fp), GFP_KERNEL);
+ if (!fp) {
+ bpf_prog_put(prog);
+ return -ENOMEM;
+ }
+ fp->prog = prog;
+
+ atomic_set(&fp->refcnt, 0);
+
+ if (!sk_filter_charge(sk, fp)) {
+ __sk_filter_release(fp);
+ return -ENOMEM;
+ }
+
+ old_fp = rcu_dereference_protected(sk->sk_filter,
+ sock_owned_by_user(sk));
+ rcu_assign_pointer(sk->sk_filter, fp);
+
+ if (old_fp)
+ sk_filter_uncharge(sk, old_fp);
+
+ return 0;
+}
+
+static struct bpf_func_proto sock_filter_funcs[] = {
+ [BPF_FUNC_map_lookup_elem] = {
+ .func = bpf_map_lookup_elem,
+ .gpl_only = false,
+ .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ },
+ [BPF_FUNC_map_update_elem] = {
+ .func = bpf_map_update_elem,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ .arg3_type = ARG_PTR_TO_MAP_VALUE,
+ },
+ [BPF_FUNC_map_delete_elem] = {
+ .func = bpf_map_delete_elem,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_KEY,
+ },
+};
+
+/* allow socket filters to call
+ * bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
+ */
+static const struct bpf_func_proto *sock_filter_func_proto(enum bpf_func_id func_id)
+{
+ if (func_id < 0 || func_id >= ARRAY_SIZE(sock_filter_funcs))
+ return NULL;
+ return &sock_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+ int size;
+ enum bpf_access_type type;
+} sock_filter_ctx_access[] = {
+ [offsetof(struct sk_buff, mark)] = {
+ FIELD_SIZEOF(struct sk_buff, mark), BPF_READ
+ },
+ [offsetof(struct sk_buff, protocol)] = {
+ FIELD_SIZEOF(struct sk_buff, protocol), BPF_READ
+ },
+ [offsetof(struct sk_buff, queue_mapping)] = {
+ FIELD_SIZEOF(struct sk_buff, queue_mapping), BPF_READ
+ },
+};
+
+/* allow socket filters to access the 'mark', 'protocol' and 'queue_mapping'
+ * fields of 'struct sk_buff'
+ */
+static bool sock_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ const struct bpf_context_access *access;
+
+ if (off < 0 || off >= ARRAY_SIZE(sock_filter_ctx_access))
+ return false;
+
+ access = &sock_filter_ctx_access[off];
+ if (access->size == size && (access->type & type))
+ return true;
+
+ return false;
+}
+
+static struct bpf_verifier_ops sock_filter_ops = {
+ .get_func_proto = sock_filter_func_proto,
+ .is_valid_access = sock_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+ .ops = &sock_filter_ops,
+ .type = BPF_PROG_TYPE_SOCKET_FILTER,
+};
+
+static int __init register_sock_filter_ops(void)
+{
+ bpf_register_prog_type(&tl);
+ return 0;
+}
+late_initcall(register_sock_filter_ops);
+
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 2714811afbd8..9603769accfb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -896,6 +896,19 @@ set_rcvbuf:
}
break;

+ case SO_ATTACH_FILTER_EBPF:
+ ret = -EINVAL;
+ if (optlen == sizeof(u32)) {
+ u32 ufd;
+
+ ret = -EFAULT;
+ if (copy_from_user(&ufd, optval, sizeof(ufd)))
+ break;
+
+ ret = sk_attach_filter_ebpf(ufd, sk);
+ }
+ break;
+
case SO_DETACH_FILTER:
ret = sk_detach_filter(sk);
break;
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:29 UTC
Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
kernel/trace/trace_kprobe.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 282f6e4e5539..a71e3d521938 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,6 +19,7 @@

#include <linux/module.h>
#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>

#include "trace_probe.h"

@@ -930,6 +931,18 @@ __kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs,
if (ftrace_trigger_soft_disabled(ftrace_file))
return;

+ if (call->flags & TRACE_EVENT_FL_BPF) {
+ struct bpf_context __ctx = {};
+ /* get first 3 arguments of the function. x64 syscall ABI uses
+ * the same 3 registers as x64 calling convention.
+ * todo: implement it cleanly via arch specific
+ * regs_get_argument_nth() helper
+ */
+ syscall_get_arguments(current, regs, 0, 3, &__ctx.arg1);
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+ return;
+ }
+
local_save_flags(irq_flags);
pc = preempt_count();

@@ -978,6 +991,17 @@ __kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
if (ftrace_trigger_soft_disabled(ftrace_file))
return;

+ if (call->flags & TRACE_EVENT_FL_BPF) {
+ struct bpf_context __ctx = {};
+ /* assume that register used to return a value from syscall is
+ * the same as register used to return a value from a function
+ * todo: provide arch specific helper
+ */
+ __ctx.ret = syscall_get_return_value(current, regs);
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx);
+ return;
+ }
+
local_save_flags(irq_flags);
pc = preempt_count();
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:26 UTC
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/linux/bpf.h | 5 ++++
include/uapi/linux/bpf.h | 3 ++
kernel/bpf/syscall.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4d99c62c6cea..e811e294b59a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -132,4 +132,9 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
/* verify correctness of eBPF program */
int bpf_check(struct bpf_prog *fp, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1]);

+/* in-kernel helper functions called from eBPF programs */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7e025ba49b70..c18ac0c1e3e5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -403,6 +403,9 @@ enum bpf_prog_type {
*/
enum bpf_func_id {
BPF_FUNC_unspec,
+ BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
+ BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value) */
+ BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 60cb760cb423..009fe8c77a0b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -588,3 +588,71 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
}
}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ * .arg1_type = ARG_CONST_MAP_PTR,
+ * .arg2_type = ARG_PTR_TO_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+ void *key = (void *) (unsigned long) r2;
+ void *value;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ value = map->ops->map_lookup_elem(map, key);
+
+ return (unsigned long) value;
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = RET_INTEGER,
+ * .arg1_type = ARG_CONST_MAP_PTR,
+ * .arg2_type = ARG_PTR_TO_MAP_KEY,
+ * .arg3_type = ARG_PTR_TO_MAP_VALUE,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+ void *key = (void *) (unsigned long) r2;
+ void *value = (void *) (unsigned long) r3;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ return map->ops->map_update_elem(map, key, value);
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = RET_INTEGER,
+ * .arg1_type = ARG_CONST_MAP_PTR,
+ * .arg2_type = ARG_PTR_TO_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+ void *key = (void *) (unsigned long) r2;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ return map->ops->map_delete_elem(map, key);
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:24 UTC
This patch adds verifier core which simulates execution of every insn and
records the state of registers and program stack. Every branch instruction seen
during simulation is pushed onto a state stack. When the verifier reaches BPF_EXIT,
it pops a saved state from that stack and continues until it reaches BPF_EXIT again.
For program:
1: bpf_mov r1, xxx
2: if (r1 == 0) goto 5
3: bpf_mov r0, 1
4: goto 6
5: bpf_mov r0, 2
6: bpf_exit
The verifier will walk insns: 1, 2, 3, 4, 6
then it will pop the state recorded at insn#2 and will continue: 5, 6

This way it walks all possible paths through the program and checks all
possible values of registers. While doing so, it checks for:
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
- BPF_LD_ABS|IND instructions are only used in socket filters
- instruction encoding is not using reserved fields

Kernel subsystem configures the verifier with two callbacks:

- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
that tells the verifier which fields of 'ctx'
are accessible (remember 'ctx' is the first argument to an eBPF program)

- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
returns the argument constraints of kernel helper functions that an eBPF program
may call, so that the verifier can check that R1-R5 types match the prototype

More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---
include/linux/bpf.h | 47 +++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/verifier.c | 990 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 1037 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d818e473d12c..4d99c62c6cea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,6 +47,31 @@ void bpf_register_map_type(struct bpf_map_type_list *tl);
void bpf_map_put(struct bpf_map *map);
struct bpf_map *bpf_map_get(struct fd f);

+/* function argument constraints */
+enum bpf_arg_type {
+ ARG_ANYTHING = 0, /* any argument is ok */
+
+ /* the following constraints used to prototype
+ * bpf_map_lookup/update/delete_elem() functions
+ */
+ ARG_CONST_MAP_PTR, /* const argument used as pointer to bpf_map */
+ ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */
+ ARG_PTR_TO_MAP_VALUE, /* pointer to stack used as map value */
+
+ /* the following constraints used to prototype bpf_memcmp() and other
+ * functions that access data on eBPF program stack
+ */
+ ARG_PTR_TO_STACK, /* any pointer to eBPF program stack */
+ ARG_CONST_STACK_SIZE, /* number of bytes accessed from stack */
+};
+
+/* type of values returned from helper functions */
+enum bpf_return_type {
+ RET_INTEGER, /* function returns integer */
+ RET_VOID, /* function doesn't return anything */
+ RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value or NULL */
+};
+
/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
* to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
* instructions after verifying
@@ -54,11 +79,33 @@ struct bpf_map *bpf_map_get(struct fd f);
struct bpf_func_proto {
u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
bool gpl_only;
+ enum bpf_return_type ret_type;
+ enum bpf_arg_type arg1_type;
+ enum bpf_arg_type arg2_type;
+ enum bpf_arg_type arg3_type;
+ enum bpf_arg_type arg4_type;
+ enum bpf_arg_type arg5_type;
+};
+
+/* bpf_context is intentionally undefined structure. Pointer to bpf_context is
+ * the first argument to eBPF programs.
+ * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
+ */
+struct bpf_context;
+
+enum bpf_access_type {
+ BPF_READ = 1,
+ BPF_WRITE = 2
};

struct bpf_verifier_ops {
/* return eBPF function prototype for verification */
const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+
+ /* return true if 'size' wide access at offset 'off' within bpf_context
+ * with 'type' (read or write) is allowed
+ */
+ bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
};

struct bpf_prog_type_list {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 68822671ab5e..7e025ba49b70 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -395,6 +395,7 @@ enum bpf_prog_attributes {

enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
+ BPF_PROG_TYPE_SOCKET_FILTER,
};

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 19c853f68c30..2dcfa5f76418 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -143,6 +143,72 @@
* load/store to bpf_context are checked against known fields
*/

+#define _(OP) ({ int ret = (OP); if (ret < 0) return ret; })
+
+/* types of values stored in eBPF registers */
+enum bpf_reg_type {
+ NOT_INIT = 0, /* nothing was written into register */
+ UNKNOWN_VALUE, /* reg doesn't contain a valid pointer */
+ PTR_TO_CTX, /* reg points to bpf_context */
+ CONST_PTR_TO_MAP, /* reg points to struct bpf_map */
+ PTR_TO_MAP_VALUE, /* reg points to map element value */
+ PTR_TO_MAP_VALUE_OR_NULL,/* points to map elem value or NULL */
+ FRAME_PTR, /* reg == frame_pointer */
+ PTR_TO_STACK, /* reg == frame_pointer + imm */
+ CONST_IMM, /* constant integer value */
+};
+
+struct reg_state {
+ enum bpf_reg_type type;
+ union {
+ /* valid when type == CONST_IMM | PTR_TO_STACK */
+ int imm;
+
+ /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
+ * PTR_TO_MAP_VALUE_OR_NULL
+ */
+ struct bpf_map *map_ptr;
+ };
+};
+
+enum bpf_stack_slot_type {
+ STACK_INVALID, /* nothing was stored in this stack slot */
+ STACK_SPILL, /* 1st byte of register spilled into stack */
+ STACK_SPILL_PART, /* other 7 bytes of register spill */
+ STACK_MISC /* BPF program wrote some data into this slot */
+};
+
+struct bpf_stack_slot {
+ enum bpf_stack_slot_type stype;
+ struct reg_state reg_st;
+};
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct verifier_state {
+ struct reg_state regs[MAX_BPF_REG];
+ struct bpf_stack_slot stack[MAX_BPF_STACK];
+};
+
+/* linked list of verifier states used to prune search */
+struct verifier_state_list {
+ struct verifier_state state;
+ struct verifier_state_list *next;
+};
+
+/* verifier_state + insn_idx are pushed to stack when branch is encountered */
+struct verifier_stack_elem {
+ /* verifier state is 'st'
+ * before processing instruction 'insn_idx'
+ * and after processing instruction 'prev_insn_idx'
+ */
+ struct verifier_state st;
+ int insn_idx;
+ int prev_insn_idx;
+ struct verifier_stack_elem *next;
+};
+
#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */

/* single container for all structs
@@ -150,6 +216,9 @@
*/
struct verifier_env {
struct bpf_prog *prog; /* eBPF program being verified */
+ struct verifier_stack_elem *head; /* stack of verifier states to be processed */
+ int stack_size; /* number of states to be processed */
+ struct verifier_state cur_state; /* current verifier state */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */
u32 used_map_cnt; /* number of used maps */
};
@@ -178,6 +247,45 @@ static void verbose(const char *fmt, ...)
va_end(args);
}

+/* string representation of 'enum bpf_reg_type' */
+static const char * const reg_type_str[] = {
+ [NOT_INIT] = "?",
+ [UNKNOWN_VALUE] = "inv",
+ [PTR_TO_CTX] = "ctx",
+ [CONST_PTR_TO_MAP] = "map_ptr",
+ [PTR_TO_MAP_VALUE] = "map_value",
+ [PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
+ [FRAME_PTR] = "fp",
+ [PTR_TO_STACK] = "fp",
+ [CONST_IMM] = "imm",
+};
+
+static void print_verifier_state(struct verifier_env *env)
+{
+ enum bpf_reg_type t;
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ t = env->cur_state.regs[i].type;
+ if (t == NOT_INIT)
+ continue;
+ verbose(" R%d=%s", i, reg_type_str[t]);
+ if (t == CONST_IMM || t == PTR_TO_STACK)
+ verbose("%d", env->cur_state.regs[i].imm);
+ else if (t == CONST_PTR_TO_MAP || t == PTR_TO_MAP_VALUE ||
+ t == PTR_TO_MAP_VALUE_OR_NULL)
+ verbose("(ks=%d,vs=%d)",
+ env->cur_state.regs[i].map_ptr->key_size,
+ env->cur_state.regs[i].map_ptr->value_size);
+ }
+ for (i = 0; i < MAX_BPF_STACK; i++) {
+ if (env->cur_state.stack[i].stype == STACK_SPILL)
+ verbose(" fp%d=%s", -MAX_BPF_STACK + i,
+ reg_type_str[env->cur_state.stack[i].reg_st.type]);
+ }
+ verbose("\n");
+}
+
static const char *const bpf_class_string[] = {
[BPF_LD] = "ld",
[BPF_LDX] = "ldx",
@@ -323,6 +431,647 @@ static void print_bpf_insn(struct bpf_insn *insn)
}
}

+static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
+{
+ struct verifier_stack_elem *elem;
+ int insn_idx;
+
+ if (env->head == NULL)
+ return -1;
+
+ memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
+ insn_idx = env->head->insn_idx;
+ if (prev_insn_idx)
+ *prev_insn_idx = env->head->prev_insn_idx;
+ elem = env->head->next;
+ kfree(env->head);
+ env->head = elem;
+ env->stack_size--;
+ return insn_idx;
+}
+
+static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
+ int prev_insn_idx)
+{
+ struct verifier_stack_elem *elem;
+
+ elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
+ if (!elem)
+ goto err;
+
+ memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
+ elem->insn_idx = insn_idx;
+ elem->prev_insn_idx = prev_insn_idx;
+ elem->next = env->head;
+ env->head = elem;
+ env->stack_size++;
+ if (env->stack_size > 1024) {
+ verbose("BPF program is too complex\n");
+ goto err;
+ }
+ return &elem->st;
+err:
+ /* pop all elements and return */
+ while (pop_stack(env, NULL) >= 0);
+ return NULL;
+}
+
+#define CALLER_SAVED_REGS 6
+static const int caller_saved[CALLER_SAVED_REGS] = {
+ BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
+};
+
+static void init_reg_state(struct reg_state *regs)
+{
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ regs[i].type = NOT_INIT;
+ regs[i].imm = 0;
+ regs[i].map_ptr = NULL;
+ }
+
+ /* frame pointer */
+ regs[BPF_REG_FP].type = FRAME_PTR;
+
+ /* 1st arg to a function */
+ regs[BPF_REG_1].type = PTR_TO_CTX;
+}
+
+static void mark_reg_unknown_value(struct reg_state *regs, int regno)
+{
+ regs[regno].type = UNKNOWN_VALUE;
+ regs[regno].imm = 0;
+ regs[regno].map_ptr = NULL;
+}
+
+static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
+{
+ if (regno >= MAX_BPF_REG) {
+ verbose("R%d is invalid\n", regno);
+ return -EINVAL;
+ }
+
+ if (is_src) {
+ if (regs[regno].type == NOT_INIT) {
+ verbose("R%d !read_ok\n", regno);
+ return -EACCES;
+ }
+ } else {
+ if (regno == BPF_REG_FP) {
+ verbose("frame pointer is read only\n");
+ return -EACCES;
+ }
+ mark_reg_unknown_value(regs, regno);
+ }
+ return 0;
+}
+
+static int bpf_size_to_bytes(int bpf_size)
+{
+ if (bpf_size == BPF_W)
+ return 4;
+ else if (bpf_size == BPF_H)
+ return 2;
+ else if (bpf_size == BPF_B)
+ return 1;
+ else if (bpf_size == BPF_DW)
+ return 8;
+ else
+ return -EACCES;
+}
+
+static int check_stack_write(struct verifier_state *state, int off, int size,
+ int value_regno)
+{
+ struct bpf_stack_slot *slot;
+ int i;
+
+ if (value_regno >= 0 &&
+ (state->regs[value_regno].type == PTR_TO_MAP_VALUE ||
+ state->regs[value_regno].type == PTR_TO_STACK ||
+ state->regs[value_regno].type == PTR_TO_CTX)) {
+
+ /* register containing pointer is being spilled into stack */
+ if (size != 8) {
+ verbose("invalid size of register spill\n");
+ return -EACCES;
+ }
+
+ slot = &state->stack[MAX_BPF_STACK + off];
+ slot->stype = STACK_SPILL;
+ /* save register state */
+ slot->reg_st = state->regs[value_regno];
+ for (i = 1; i < 8; i++) {
+ slot = &state->stack[MAX_BPF_STACK + off + i];
+ slot->stype = STACK_SPILL_PART;
+ slot->reg_st.type = UNKNOWN_VALUE;
+ slot->reg_st.map_ptr = NULL;
+ }
+ } else {
+
+ /* regular write of data into stack */
+ for (i = 0; i < size; i++) {
+ slot = &state->stack[MAX_BPF_STACK + off + i];
+ slot->stype = STACK_MISC;
+ slot->reg_st.type = UNKNOWN_VALUE;
+ slot->reg_st.map_ptr = NULL;
+ }
+ }
+ return 0;
+}
+
+static int check_stack_read(struct verifier_state *state, int off, int size,
+ int value_regno)
+{
+ int i;
+ struct bpf_stack_slot *slot;
+
+ slot = &state->stack[MAX_BPF_STACK + off];
+
+ if (slot->stype == STACK_SPILL) {
+ if (size != 8) {
+ verbose("invalid size of register spill\n");
+ return -EACCES;
+ }
+ for (i = 1; i < 8; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].stype !=
+ STACK_SPILL_PART) {
+ verbose("corrupted spill memory\n");
+ return -EACCES;
+ }
+ }
+
+ /* restore register state from stack */
+ state->regs[value_regno] = slot->reg_st;
+ return 0;
+ } else {
+ for (i = 0; i < size; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].stype !=
+ STACK_MISC) {
+ verbose("invalid read from stack off %d+%d size %d\n",
+ off, i, size);
+ return -EACCES;
+ }
+ }
+ /* have read misc data from the stack */
+ mark_reg_unknown_value(state->regs, value_regno);
+ return 0;
+ }
+}
+
+/* check read/write into map element returned by bpf_map_lookup_elem() */
+static int check_map_access(struct verifier_env *env, int regno, int off,
+ int size)
+{
+ struct bpf_map *map = env->cur_state.regs[regno].map_ptr;
+
+ if (off < 0 || off + size > map->value_size) {
+ verbose("invalid access to map value, value_size=%d off=%d size=%d\n",
+ map->value_size, off, size);
+ return -EACCES;
+ }
+ return 0;
+}
+
+/* check access to 'struct bpf_context' fields */
+static int check_ctx_access(struct verifier_env *env, int off, int size,
+ enum bpf_access_type t)
+{
+ if (env->prog->info->ops->is_valid_access &&
+ env->prog->info->ops->is_valid_access(off, size, t))
+ return 0;
+
+ verbose("invalid bpf_context access off=%d size=%d\n", off, size);
+ return -EACCES;
+}
+
+static int check_mem_access(struct verifier_env *env, int regno, int off,
+ int bpf_size, enum bpf_access_type t,
+ int value_regno)
+{
+ struct verifier_state *state = &env->cur_state;
+ int size;
+
+ _(size = bpf_size_to_bytes(bpf_size));
+
+ if (off % size != 0) {
+ verbose("misaligned access off %d size %d\n", off, size);
+ return -EACCES;
+ }
+
+ if (state->regs[regno].type == PTR_TO_MAP_VALUE) {
+ _(check_map_access(env, regno, off, size));
+ if (t == BPF_READ)
+ mark_reg_unknown_value(state->regs, value_regno);
+ } else if (state->regs[regno].type == PTR_TO_CTX) {
+ _(check_ctx_access(env, off, size, t));
+ if (t == BPF_READ)
+ mark_reg_unknown_value(state->regs, value_regno);
+ } else if (state->regs[regno].type == FRAME_PTR) {
+ if (off >= 0 || off < -MAX_BPF_STACK) {
+ verbose("invalid stack off=%d size=%d\n", off, size);
+ return -EACCES;
+ }
+ if (t == BPF_WRITE)
+ _(check_stack_write(state, off, size, value_regno));
+ else
+ _(check_stack_read(state, off, size, value_regno));
+ } else {
+ verbose("R%d invalid mem access '%s'\n",
+ regno, reg_type_str[state->regs[regno].type]);
+ return -EACCES;
+ }
+ return 0;
+}
+
+/* when register 'regno' is passed into function that will read 'access_size'
+ * bytes from that pointer, make sure that it's within stack boundary
+ * and all elements of stack are initialized
+ */
+static int check_stack_boundary(struct verifier_env *env,
+ int regno, int access_size)
+{
+ struct verifier_state *state = &env->cur_state;
+ struct reg_state *regs = state->regs;
+ int off, i;
+
+ if (regs[regno].type != PTR_TO_STACK)
+ return -EACCES;
+
+ off = regs[regno].imm;
+ if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
+ access_size <= 0) {
+ verbose("invalid stack type R%d off=%d access_size=%d\n",
+ regno, off, access_size);
+ return -EACCES;
+ }
+
+ for (i = 0; i < access_size; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].stype != STACK_MISC) {
+ verbose("invalid indirect read from stack off %d+%d size %d\n",
+ off, i, access_size);
+ return -EACCES;
+ }
+ }
+ return 0;
+}
+
+static int check_func_arg(struct verifier_env *env, u32 regno,
+ enum bpf_arg_type arg_type, struct bpf_map **mapp)
+{
+ struct reg_state *reg = env->cur_state.regs + regno;
+ enum bpf_reg_type expected_type;
+
+ if (arg_type == ARG_ANYTHING)
+ return 0;
+
+ if (reg->type == NOT_INIT) {
+ verbose("R%d !read_ok\n", regno);
+ return -EACCES;
+ }
+
+ if (arg_type == ARG_PTR_TO_STACK || arg_type == ARG_PTR_TO_MAP_KEY ||
+ arg_type == ARG_PTR_TO_MAP_VALUE) {
+ expected_type = PTR_TO_STACK;
+ } else if (arg_type == ARG_CONST_STACK_SIZE) {
+ expected_type = CONST_IMM;
+ } else if (arg_type == ARG_CONST_MAP_PTR) {
+ expected_type = CONST_PTR_TO_MAP;
+ } else {
+ verbose("unsupported arg_type %d\n", arg_type);
+ return -EFAULT;
+ }
+
+ if (reg->type != expected_type) {
+ verbose("R%d type=%s expected=%s\n", regno,
+ reg_type_str[reg->type], reg_type_str[expected_type]);
+ return -EACCES;
+ }
+
+ if (arg_type == ARG_CONST_MAP_PTR) {
+ /* bpf_map_xxx(map_ptr) call: remember that map_ptr */
+ *mapp = reg->map_ptr;
+
+ } else if (arg_type == ARG_PTR_TO_MAP_KEY) {
+ /* bpf_map_xxx(..., map_ptr, ..., key) call:
+ * check that [key, key + map->key_size) are within
+ * stack limits and initialized
+ */
+ if (!*mapp) {
+ /* in function declaration map_ptr must come before
+ * map_key or map_elem, so that it's verified
+ * and known before we have to check map_key here.
+ * It means that kernel subsystem misconfigured verifier
+ */
+ verbose("invalid map_ptr to access map->key\n");
+ return -EACCES;
+ }
+ _(check_stack_boundary(env, regno, (*mapp)->key_size));
+
+ } else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
+ /* bpf_map_xxx(..., map_ptr, ..., value) call:
+ * check [value, value + map->value_size) validity
+ */
+ if (!*mapp) {
+ /* kernel subsystem misconfigured verifier */
+ verbose("invalid map_ptr to access map->elem\n");
+ return -EACCES;
+ }
+ _(check_stack_boundary(env, regno, (*mapp)->value_size));
+
+ } else if (arg_type == ARG_CONST_STACK_SIZE) {
+ /* bpf_xxx(..., buf, len) call will access 'len' bytes
+ * from stack pointer 'buf'. Check it
+ * note: regno == len, regno - 1 == buf
+ */
+ if (regno == 0) {
+ /* kernel subsystem misconfigured verifier */
+ verbose("ARG_CONST_STACK_SIZE cannot be first argument\n");
+ return -EACCES;
+ }
+ _(check_stack_boundary(env, regno - 1, reg->imm));
+ }
+
+ return 0;
+}
+
+static int check_call(struct verifier_env *env, int func_id)
+{
+ struct verifier_state *state = &env->cur_state;
+ const struct bpf_func_proto *fn = NULL;
+ struct reg_state *regs = state->regs;
+ struct bpf_map *map = NULL;
+ struct reg_state *reg;
+ int i;
+
+ /* find function prototype */
+ if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
+ verbose("invalid func %d\n", func_id);
+ return -EINVAL;
+ }
+
+ if (env->prog->info->ops->get_func_proto)
+ fn = env->prog->info->ops->get_func_proto(func_id);
+
+ if (!fn) {
+ verbose("unknown func %d\n", func_id);
+ return -EINVAL;
+ }
+
+ /* eBPF programs must be GPL compatible to use GPL-ed functions */
+ if (!env->prog->info->is_gpl_compatible && fn->gpl_only) {
+ verbose("cannot call GPL only function from proprietary program\n");
+ return -EINVAL;
+ }
+
+ /* check args */
+ _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map));
+ _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map));
+ _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map));
+ _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map));
+ _(check_func_arg(env, BPF_REG_5, fn->arg5_type, &map));
+
+ /* reset caller saved regs */
+ for (i = 0; i < CALLER_SAVED_REGS; i++) {
+ reg = regs + caller_saved[i];
+ reg->type = NOT_INIT;
+ reg->imm = 0;
+ }
+
+ /* update return register */
+ if (fn->ret_type == RET_INTEGER) {
+ regs[BPF_REG_0].type = UNKNOWN_VALUE;
+ } else if (fn->ret_type == RET_VOID) {
+ regs[BPF_REG_0].type = NOT_INIT;
+ } else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL) {
+ regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
+ /*
+ * remember map_ptr, so that check_map_access()
+ * can check 'value_size' boundary of memory access
+ * to map element returned from bpf_map_lookup_elem()
+ */
+ if (map == NULL) {
+ verbose("kernel subsystem misconfigured verifier\n");
+ return -EINVAL;
+ }
+ regs[BPF_REG_0].map_ptr = map;
+ } else {
+ verbose("unknown return type %d of func %d\n",
+ fn->ret_type, func_id);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+/* check validity of 32-bit and 64-bit arithmetic operations */
+static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
+{
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_END || opcode == BPF_NEG) {
+ if (opcode == BPF_NEG) {
+ if (BPF_SRC(insn->code) != 0 ||
+ insn->src_reg != BPF_REG_0 ||
+ insn->off != 0 || insn->imm != 0) {
+ verbose("BPF_NEG uses reserved fields\n");
+ return -EINVAL;
+ }
+ } else {
+ if (insn->src_reg != BPF_REG_0 || insn->off != 0 ||
+ (insn->imm != 16 && insn->imm != 32 && insn->imm != 64)) {
+ verbose("BPF_END uses reserved fields\n");
+ return -EINVAL;
+ }
+ }
+
+ /* check src operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ } else if (opcode == BPF_MOV) {
+
+ if (BPF_SRC(insn->code) == BPF_X) {
+ if (insn->imm != 0 || insn->off != 0) {
+ verbose("BPF_MOV uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ /* check src operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+ } else {
+ if (insn->src_reg != BPF_REG_0 || insn->off != 0) {
+ verbose("BPF_MOV uses reserved fields\n");
+ return -EINVAL;
+ }
+ }
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ if (BPF_SRC(insn->code) == BPF_X) {
+ if (BPF_CLASS(insn->code) == BPF_ALU64) {
+ /* case: R1 = R2
+ * copy register state to dest reg
+ */
+ regs[insn->dst_reg].type = regs[insn->src_reg].type;
+ regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
+ } else {
+ regs[insn->dst_reg].type = UNKNOWN_VALUE;
+ regs[insn->dst_reg].imm = 0;
+ }
+ } else {
+ /* case: R = imm
+ * remember the value we stored into this reg
+ */
+ regs[insn->dst_reg].type = CONST_IMM;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+
+ } else if (opcode > BPF_END) {
+ verbose("invalid BPF_ALU opcode %x\n", opcode);
+ return -EINVAL;
+
+ } else { /* all other ALU ops: and, sub, xor, add, ... */
+
+ int stack_relative = 0;
+
+ if (BPF_SRC(insn->code) == BPF_X) {
+ if (insn->imm != 0 || insn->off != 0) {
+ verbose("BPF_ALU uses reserved fields\n");
+ return -EINVAL;
+ }
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+ } else {
+ if (insn->src_reg != BPF_REG_0 || insn->off != 0) {
+ verbose("BPF_ALU uses reserved fields\n");
+ return -EINVAL;
+ }
+ }
+
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ if ((opcode == BPF_MOD || opcode == BPF_DIV) &&
+ BPF_SRC(insn->code) == BPF_K && insn->imm == 0) {
+ verbose("div by zero\n");
+ return -EINVAL;
+ }
+
+ if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
+ regs[insn->dst_reg].type == FRAME_PTR &&
+ BPF_SRC(insn->code) == BPF_K)
+ stack_relative = 1;
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ if (stack_relative) {
+ regs[insn->dst_reg].type = PTR_TO_STACK;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+ }
+
+ return 0;
+}
+
+static int check_cond_jmp_op(struct verifier_env *env,
+ struct bpf_insn *insn, int *insn_idx)
+{
+ struct reg_state *regs = env->cur_state.regs;
+ struct verifier_state *other_branch;
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode > BPF_EXIT) {
+ verbose("invalid BPF_JMP opcode %x\n", opcode);
+ return -EINVAL;
+ }
+
+ if (BPF_SRC(insn->code) == BPF_X) {
+ if (insn->imm != 0) {
+ verbose("BPF_JMP uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+ } else {
+ if (insn->src_reg != BPF_REG_0) {
+ verbose("BPF_JMP uses reserved fields\n");
+ return -EINVAL;
+ }
+ }
+
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ /* detect if R == 0 where R was initialized to zero earlier */
+ if (BPF_SRC(insn->code) == BPF_K &&
+ (opcode == BPF_JEQ || opcode == BPF_JNE) &&
+ regs[insn->dst_reg].type == CONST_IMM &&
+ regs[insn->dst_reg].imm == insn->imm) {
+ if (opcode == BPF_JEQ) {
+ /* if (imm == imm) goto pc+off;
+ * only follow the goto, ignore fall-through
+ */
+ *insn_idx += insn->off;
+ return 0;
+ } else {
+ /* if (imm != imm) goto pc+off;
+ * only follow fall-through branch, since
+ * that's where the program will go
+ */
+ return 0;
+ }
+ }
+
+ other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
+ if (!other_branch)
+ return -EFAULT;
+
+ /* detect if R == 0 where R is returned value from bpf_map_lookup_elem() */
+ if (BPF_SRC(insn->code) == BPF_K &&
+ insn->imm == 0 && (opcode == BPF_JEQ ||
+ opcode == BPF_JNE) &&
+ regs[insn->dst_reg].type == PTR_TO_MAP_VALUE_OR_NULL) {
+ if (opcode == BPF_JEQ) {
+ /* next fallthrough insn can access memory via
+ * this register
+ */
+ regs[insn->dst_reg].type = PTR_TO_MAP_VALUE;
+ /* branch target cannot access it, since reg == 0 */
+ other_branch->regs[insn->dst_reg].type = CONST_IMM;
+ other_branch->regs[insn->dst_reg].imm = 0;
+ } else {
+ other_branch->regs[insn->dst_reg].type = PTR_TO_MAP_VALUE;
+ regs[insn->dst_reg].type = CONST_IMM;
+ regs[insn->dst_reg].imm = 0;
+ }
+ } else if (BPF_SRC(insn->code) == BPF_K &&
+ (opcode == BPF_JEQ || opcode == BPF_JNE)) {
+
+ if (opcode == BPF_JEQ) {
+ /* detect if (R == imm) goto
+ * and in the target state recognize that R = imm
+ */
+ other_branch->regs[insn->dst_reg].type = CONST_IMM;
+ other_branch->regs[insn->dst_reg].imm = insn->imm;
+ } else {
+ /* detect if (R != imm) goto
+ * and in the fall-through state recognize that R = imm
+ */
+ regs[insn->dst_reg].type = CONST_IMM;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+ }
+ if (log_level)
+ print_verifier_state(env);
+ return 0;
+}
+
/* return the map pointer stored inside BPF_LD_IMM64 instruction */
static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn)
{
@@ -331,6 +1080,93 @@ static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn)
return (struct bpf_map *) (unsigned long) imm64;
}

+/* verify BPF_LD_IMM64 instruction */
+static int check_ld_imm(struct verifier_env *env, struct bpf_insn *insn)
+{
+ struct reg_state *regs = env->cur_state.regs;
+
+ if (BPF_SIZE(insn->code) != BPF_DW) {
+ verbose("invalid BPF_LD_IMM insn\n");
+ return -EINVAL;
+ }
+ if (insn->off != 0) {
+ verbose("BPF_LD_IMM64 uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ if (insn->src_reg == 0)
+ /* generic move 64-bit immediate into a register */
+ return 0;
+
+ /* replace_map_fd_with_map_ptr() should have caught bad ld_imm64 */
+ BUG_ON(insn->src_reg != BPF_PSEUDO_MAP_FD);
+
+ regs[insn->dst_reg].type = CONST_PTR_TO_MAP;
+ regs[insn->dst_reg].map_ptr = ld_imm64_to_map_ptr(insn);
+ return 0;
+}
+
+/* verify safety of LD_ABS|LD_IND instructions:
+ * - they can only appear in the programs where ctx == skb
+ * - since they are wrappers of function calls, they scratch R1-R5 registers,
+ * preserve R6-R9, and store return value into R0
+ *
+ * Implicit input:
+ * ctx == skb == R6 == CTX
+ *
+ * Explicit input:
+ * SRC == any register
+ * IMM == 32-bit immediate
+ *
+ * Output:
+ * R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
+{
+ struct reg_state *regs = env->cur_state.regs;
+ u8 mode = BPF_MODE(insn->code);
+ struct reg_state *reg;
+ int i;
+
+ if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+ verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
+ return -EINVAL;
+ }
+
+ if (insn->dst_reg != BPF_REG_0 || insn->off != 0 ||
+ (mode == BPF_ABS && insn->src_reg != BPF_REG_0)) {
+ verbose("BPF_LD_ABS uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ /* check whether implicit source operand (register R6) is readable */
+ _(check_reg_arg(regs, BPF_REG_6, 1));
+
+ if (regs[BPF_REG_6].type != PTR_TO_CTX) {
+ verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
+ return -EINVAL;
+ }
+
+ if (mode == BPF_IND)
+ /* check explicit source operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ /* reset caller saved regs to unreadable */
+ for (i = 0; i < CALLER_SAVED_REGS; i++) {
+ reg = regs + caller_saved[i];
+ reg->type = NOT_INIT;
+ reg->imm = 0;
+ }
+
+ /* mark destination R0 register as readable, since it contains
+ * the value fetched from the packet
+ */
+ regs[BPF_REG_0].type = UNKNOWN_VALUE;
+ return 0;
+}
+
/* non-recursive DFS pseudo code
* 1 procedure DFS-iterative(G,v):
* 2 label v as discovered
@@ -510,6 +1346,157 @@ free_st:
return ret;
}

+static int do_check(struct verifier_env *env)
+{
+ struct verifier_state *state = &env->cur_state;
+ struct bpf_insn *insns = env->prog->insnsi;
+ struct reg_state *regs = state->regs;
+ int insn_cnt = env->prog->len;
+ int insn_idx, prev_insn_idx = 0;
+ int insn_processed = 0;
+ bool do_print_state = false;
+
+ init_reg_state(regs);
+ insn_idx = 0;
+ for (;;) {
+ struct bpf_insn *insn;
+ u8 class;
+
+ if (insn_idx >= insn_cnt) {
+ verbose("invalid insn idx %d insn_cnt %d\n",
+ insn_idx, insn_cnt);
+ return -EFAULT;
+ }
+
+ insn = &insns[insn_idx];
+ class = BPF_CLASS(insn->code);
+
+ if (++insn_processed > 32768) {
+ verbose("BPF program is too large. Processed %d insn\n",
+ insn_processed);
+ return -E2BIG;
+ }
+
+ if (log_level && do_print_state) {
+ verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
+ print_verifier_state(env);
+ do_print_state = false;
+ }
+
+ if (log_level) {
+ verbose("%d: ", insn_idx);
+ print_bpf_insn(insn);
+ }
+
+ if (class == BPF_ALU || class == BPF_ALU64) {
+ _(check_alu_op(regs, insn));
+
+ } else if (class == BPF_LDX) {
+ if (BPF_MODE(insn->code) != BPF_MEM)
+ return -EINVAL;
+
+ /* check src operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ _(check_mem_access(env, insn->src_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_READ,
+ insn->dst_reg));
+
+ /* dest reg state will be updated by mem_access */
+
+ } else if (class == BPF_STX) {
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+ _(check_mem_access(env, insn->dst_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_WRITE,
+ insn->src_reg));
+
+ } else if (class == BPF_ST) {
+ if (BPF_MODE(insn->code) != BPF_MEM)
+ return -EINVAL;
+ /* check src operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+ _(check_mem_access(env, insn->dst_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_WRITE,
+ -1));
+
+ } else if (class == BPF_JMP) {
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_CALL) {
+ if (BPF_SRC(insn->code) != BPF_K ||
+ insn->off != 0 ||
+ insn->src_reg != BPF_REG_0 ||
+ insn->dst_reg != BPF_REG_0) {
+ verbose("BPF_CALL uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ _(check_call(env, insn->imm));
+
+ } else if (opcode == BPF_JA) {
+ if (BPF_SRC(insn->code) != BPF_K ||
+ insn->imm != 0 ||
+ insn->src_reg != BPF_REG_0 ||
+ insn->dst_reg != BPF_REG_0) {
+ verbose("BPF_JA uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ insn_idx += insn->off + 1;
+ continue;
+
+ } else if (opcode == BPF_EXIT) {
+ if (BPF_SRC(insn->code) != BPF_K ||
+ insn->imm != 0 ||
+ insn->src_reg != BPF_REG_0 ||
+ insn->dst_reg != BPF_REG_0) {
+ verbose("BPF_EXIT uses reserved fields\n");
+ return -EINVAL;
+ }
+
+ /* eBPF calling convention is such that R0 is used
+ * to return the value from eBPF program.
+ * Make sure that it's readable at this time
+ * of bpf_exit, which means that program wrote
+ * something into it earlier
+ */
+ _(check_reg_arg(regs, BPF_REG_0, 1));
+ insn_idx = pop_stack(env, &prev_insn_idx);
+ if (insn_idx < 0) {
+ break;
+ } else {
+ do_print_state = true;
+ continue;
+ }
+ } else {
+ _(check_cond_jmp_op(env, insn, &insn_idx));
+ }
+ } else if (class == BPF_LD) {
+ u8 mode = BPF_MODE(insn->code);
+
+ if (mode == BPF_ABS || mode == BPF_IND) {
+ _(check_ld_abs(env, insn));
+ } else if (mode == BPF_IMM) {
+ _(check_ld_imm(env, insn));
+ insn_idx++;
+ } else {
+ verbose("invalid BPF_LD mode\n");
+ return -EINVAL;
+ }
+ } else {
+ verbose("unknown insn class %d\n", class);
+ return -EINVAL;
+ }
+
+ insn_idx++;
+ }
+
+ return 0;
+}
+
/* look for pseudo eBPF instructions that access map FDs and
* replace them with actual map pointers
*/
@@ -663,9 +1650,10 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
if (ret < 0)
goto skip_full_check;

- /* ret = do_check(env); */
+ ret = do_check(env);

skip_full_check:
+ while (pop_stack(env, NULL) >= 0);

if (log_level && log_len >= log_size - 1) {
BUG_ON(log_len >= log_size);
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:23 UTC
check that control flow graph of eBPF program is a directed acyclic graph

check_cfg() does:
- detect loops
- detect unreachable instructions
- check that program terminates with BPF_EXIT insn
- check that all branches are within program boundary

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---
kernel/bpf/verifier.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 183 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f71d4c494aec..19c853f68c30 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -331,6 +331,185 @@ static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn)
return (struct bpf_map *) (unsigned long) imm64;
}

+/* non-recursive DFS pseudo code
+ * 1 procedure DFS-iterative(G,v):
+ * 2 label v as discovered
+ * 3 let S be a stack
+ * 4 S.push(v)
+ * 5 while S is not empty
+ * 6 t <- S.pop()
+ * 7 if t is what we're looking for:
+ * 8 return t
+ * 9 for all edges e in G.adjacentEdges(t) do
+ * 10 if edge e is already labelled
+ * 11 continue with the next edge
+ * 12 w <- G.adjacentVertex(t,e)
+ * 13 if vertex w is not discovered and not explored
+ * 14 label e as tree-edge
+ * 15 label w as discovered
+ * 16 S.push(w)
+ * 17 continue at 5
+ * 18 else if vertex w is discovered
+ * 19 label e as back-edge
+ * 20 else
+ * 21 // vertex w is explored
+ * 22 label e as forward- or cross-edge
+ * 23 label t as explored
+ * 24 S.pop()
+ *
+ * convention:
+ * 0x10 - discovered
+ * 0x11 - discovered and fall-through edge labelled
+ * 0x12 - discovered and fall-through and branch edges labelled
+ * 0x20 - explored
+ */
+
+enum {
+ DISCOVERED = 0x10,
+ EXPLORED = 0x20,
+ FALLTHROUGH = 1,
+ BRANCH = 2,
+};
+
+#define PUSH_INT(I) \
+ do { \
+ if (cur_stack >= insn_cnt) { \
+ ret = -E2BIG; \
+ goto free_st; \
+ } \
+ stack[cur_stack++] = I; \
+ } while (0)
+
+#define PEEK_INT() \
+ ({ \
+ int _ret; \
+ if (cur_stack == 0) \
+ _ret = -1; \
+ else \
+ _ret = stack[cur_stack - 1]; \
+ _ret; \
+ })
+
+#define POP_INT() \
+ ({ \
+ int _ret; \
+ if (cur_stack == 0) \
+ _ret = -1; \
+ else \
+ _ret = stack[--cur_stack]; \
+ _ret; \
+ })
+
+#define PUSH_INSN(T, W, E) \
+ do { \
+ int w = W; \
+ if (E == FALLTHROUGH && st[T] >= (DISCOVERED | FALLTHROUGH)) \
+ break; \
+ if (E == BRANCH && st[T] >= (DISCOVERED | BRANCH)) \
+ break; \
+ if (w < 0 || w >= insn_cnt) { \
+ verbose("jump out of range from insn %d to %d\n", T, w); \
+ ret = -EINVAL; \
+ goto free_st; \
+ } \
+ if (st[w] == 0) { \
+ /* tree-edge */ \
+ st[T] = DISCOVERED | E; \
+ st[w] = DISCOVERED; \
+ PUSH_INT(w); \
+ goto peek_stack; \
+ } else if ((st[w] & 0xF0) == DISCOVERED) { \
+ verbose("back-edge from insn %d to %d\n", T, w); \
+ ret = -EINVAL; \
+ goto free_st; \
+ } else if (st[w] == EXPLORED) { \
+ /* forward- or cross-edge */ \
+ st[T] = DISCOVERED | E; \
+ } else { \
+ verbose("insn state internal bug\n"); \
+ ret = -EFAULT; \
+ goto free_st; \
+ } \
+ } while (0)
+
+/* non-recursive depth-first-search to detect loops in BPF program
+ * loop == back-edge in directed graph
+ */
+static int check_cfg(struct verifier_env *env)
+{
+ struct bpf_insn *insns = env->prog->insnsi;
+ int insn_cnt = env->prog->len;
+ int cur_stack = 0;
+ int *stack;
+ int ret = 0;
+ int *st;
+ int i, t;
+
+ st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+ if (!st)
+ return -ENOMEM;
+
+ stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+ if (!stack) {
+ kfree(st);
+ return -ENOMEM;
+ }
+
+ st[0] = DISCOVERED; /* mark 1st insn as discovered */
+ PUSH_INT(0);
+
+peek_stack:
+ while ((t = PEEK_INT()) != -1) {
+ if (BPF_CLASS(insns[t].code) == BPF_JMP) {
+ u8 opcode = BPF_OP(insns[t].code);
+
+ if (opcode == BPF_EXIT) {
+ goto mark_explored;
+ } else if (opcode == BPF_CALL) {
+ PUSH_INSN(t, t + 1, FALLTHROUGH);
+ } else if (opcode == BPF_JA) {
+ if (BPF_SRC(insns[t].code) != BPF_K) {
+ ret = -EINVAL;
+ goto free_st;
+ }
+ /* unconditional jump with single edge */
+ PUSH_INSN(t, t + insns[t].off + 1, FALLTHROUGH);
+ } else {
+ /* conditional jump with two edges */
+ PUSH_INSN(t, t + 1, FALLTHROUGH);
+ PUSH_INSN(t, t + insns[t].off + 1, BRANCH);
+ }
+ } else {
+ /* all other non-branch instructions with single
+ * fall-through edge
+ */
+ PUSH_INSN(t, t + 1, FALLTHROUGH);
+ }
+
+mark_explored:
+ st[t] = EXPLORED;
+ if (POP_INT() == -1) {
+ verbose("pop_int internal bug\n");
+ ret = -EFAULT;
+ goto free_st;
+ }
+ }
+
+
+ for (i = 0; i < insn_cnt; i++) {
+ if (st[i] != EXPLORED) {
+ verbose("unreachable insn %d\n", i);
+ ret = -EINVAL;
+ goto free_st;
+ }
+ }
+
+free_st:
+ kfree(st);
+ kfree(stack);
+ return ret;
+}
+
/* look for pseudo eBPF instructions that access map FDs and
* replace them with actual map pointers
*/
@@ -480,6 +659,10 @@ int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
if (ret < 0)
goto skip_full_check;

+ ret = check_cfg(env);
+ if (ret < 0)
+ goto skip_full_check;
+
/* ret = do_check(env); */

skip_full_check:
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:20 UTC
this patch adds all of the eBPF verifier documentation and an empty bpf_check()

The end goal for the verifier is to statically check the safety of the program.

Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

More details in Documentation/networking/filter.txt

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
Documentation/networking/filter.txt | 230 +++++++++++++++++++++++++++++++++++
include/linux/bpf.h | 2 +
kernel/bpf/Makefile | 2 +-
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 151 +++++++++++++++++++++++
5 files changed, 385 insertions(+), 2 deletions(-)
create mode 100644 kernel/bpf/verifier.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 27a0a6c6acb4..b121b01c3af4 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1001,6 +1001,105 @@ instruction that loads 64-bit immediate value into a dst_reg.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
32-bit immediate value into a register.

+eBPF verifier
+-------------
+The safety of the eBPF program is determined in two steps.
+
+The first step does a DAG check to disallow loops and perform other CFG
+validation. In particular, it will detect programs that have unreachable
+instructions (though the classic BPF checker allows them).
+
+The second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If the verifier sees an insn that does R2=R1, then R2 now has type
+PTR_TO_CTX as well and can be used on the right-hand side of an expression.
+If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
+since the addition of two valid pointers makes an invalid pointer.
+(In 'secure' mode the verifier will reject any type of pointer arithmetic
+to make sure that kernel addresses don't leak to unprivileged users.)
+
+If a register was never written to, it's not readable:
+ bpf_mov R0 = R2
+ bpf_exit
+will be rejected, since R2 is unreadable at the start of the program.
+
+After a kernel function call, R1-R5 are reset to unreadable and
+R0 has the return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+ bpf_mov R6 = 1
+ bpf_call foo
+ bpf_mov R0 = R6
+ bpf_exit
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
+so that its state is preserved across calls.
+
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. Such accesses are bounds- and alignment-checked.
+For example:
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1 contains a pointer to ctx and R1's type is PTR_TO_CTX.
+ctx is generic. The verifier is configured to know what the context is for a
+particular class of bpf programs. For example, ctx == skb (for socket filters)
+and ctx == seccomp_data for seccomp filters.
+A callback is used to customize the verifier to restrict eBPF program access to
+only certain fields within the ctx structure with specified size and alignment.
+
+For example, the following insn:
+ bpf_ld R0 = *(u32 *)(R6 + 8)
+intends to load a word from address R6 + 8 and store it into R0.
+If R6=PTR_TO_CTX, then via the is_valid_access() callback the verifier will know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise
+the verifier will reject the program.
+If R6=FRAME_PTR, then access should be aligned and be within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow an eBPF program to read data from the stack only after
+the program has written into it.
+Classic BPF verifier does similar check with M[0-15] memory slots.
+For example:
+ bpf_ld R0 = *(u32 *)(R10 - 4)
+ bpf_exit
+is an invalid program.
+Though R10 is a valid read-only register of type FRAME_PTR
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is tracked as well, since four (R6-R9)
+callee saved registers may not be enough for some programs.
+
+Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
+The eBPF verifier will check that registers match argument constraints.
+After the call, register R0 will be set to the return type of the function.
+
+Function calls are the main mechanism to extend the functionality of eBPF
+programs. Socket filters may let programs call one set of functions, whereas
+tracing filters may allow a completely different set.
+
+If a function is made accessible to eBPF programs, it needs to be thought
+through from a security point of view. The verifier will guarantee that the
+function is called with valid arguments.
+
+seccomp and socket filters have different security restrictions for classic
+BPF. Seccomp solves this with a two-stage verifier: the classic BPF verifier
+is followed by the seccomp verifier. In the case of eBPF, one configurable
+verifier is shared for all use cases.
+
+See details of eBPF verifier in kernel/bpf/verifier.c
+
eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
@@ -1072,6 +1171,137 @@ size. It will not let programs pass junk values to bpf_map_*_elem() functions,
so these functions (implemented in C inside kernel) can safely access
the pointers in all cases.

+Understanding eBPF verifier messages
+------------------------------------
+
+The following are a few examples of invalid eBPF programs and the verifier error
+messages as seen in the log:
+
+Program with unreachable instructions:
+static struct bpf_insn prog[] = {
+ BPF_EXIT_INSN(),
+ BPF_EXIT_INSN(),
+};
+Error:
+ unreachable insn 1
+
+Program that reads uninitialized register:
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r0 = r2
+ R2 !read_ok
+
+Program that doesn't initialize R0 before exiting:
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r1
+ 1: (95) exit
+ R0 !read_ok
+
+Program that accesses stack out of bounds:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 +8) = 0
+ invalid stack off=8 size=8
+
+Program that doesn't initialize stack before passing its address into function:
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r10
+ 1: (07) r2 += -8
+ 2: (b7) r1 = 0x0
+ 3: (85) call 1
+ invalid indirect read from stack off -8+0 size 8
+
+Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ fd 0 is not pointing to valid bpf_map
+
+Program that doesn't check return value of map_lookup_elem() before accessing
+map element:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ 5: (7a) *(u64 *)(r0 +0) = 0
+ R0 invalid mem access 'map_value_or_null'
+
+Program that correctly checks map_lookup_elem() returned value for NULL, but
+accesses the memory with incorrect alignment:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+1
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +4) = 0
+ misaligned access off 4 size 8
+
+Program that correctly checks map_lookup_elem() returned value for NULL and
+accesses memory with correct alignment in one side of 'if' branch, but fails
+to do so in the other side of 'if' branch:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+2
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +0) = 0
+ 7: (95) exit
+
+ from 5 to 8: R0=imm0 R10=fp
+ 8: (7a) *(u64 *)(r0 +0) = 1
+ R0 invalid mem access 'imm'
+
Testing
-------

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ac6320f44812..d818e473d12c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -82,5 +82,7 @@ struct bpf_prog;

void bpf_prog_put(struct bpf_prog *prog);
struct bpf_prog *bpf_prog_get(u32 ufd);
+/* verify correctness of eBPF program */
+int bpf_check(struct bpf_prog *fp, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1]);

#endif /* _LINUX_BPF_H */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 558e12712ebc..95a9035e0f29 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o hashtab.o
+obj-y := core.o syscall.o hashtab.o verifier.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5a336af61858..a3581646ee11 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -514,7 +514,7 @@ static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
goto free_prog_info;

/* run eBPF verifier */
- /* err = bpf_check(prog, tb); */
+ err = bpf_check(prog, tb);

if (err < 0)
goto free_prog_info;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
new file mode 100644
index 000000000000..cf8a0131cd91
--- /dev/null
+++ b/kernel/bpf/verifier.c
@@ -0,0 +1,151 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <net/netlink.h>
+#include <linux/file.h>
+
+/* bpf_check() is a static code analyzer that walks eBPF program
+ * instruction by instruction and updates register/stack state.
+ * All paths of conditional branches are analyzed until 'bpf_exit' insn.
+ *
+ * At the first pass depth-first-search verifies that the BPF program is a DAG.
+ * It rejects the following programs:
+ * - larger than BPF_MAXINSNS insns
+ * - if loop is present (detected via back-edge)
+ * - unreachable insns exist (shouldn't be a forest. program = one function)
+ * - out of bounds or malformed jumps
+ * The second pass is all possible path descent from the 1st insn.
+ * Conditional branch target insns keep a linked list of verifier states.
+ * If the state was already visited, this path can be pruned.
+ * If it wasn't a DAG, such state pruning would be incorrect, since it would
+ * skip cycles. Since it's analyzing all paths through the program,
+ * the length of the analysis is limited to 32k insn, which may be hit even
+ * if insn_cnt < 4K, but there are too many branches that change stack/regs.
+ * Number of 'branches to be analyzed' is limited to 1k
+ *
+ * On entry to each instruction, each register has a type, and the instruction
+ * changes the types of the registers depending on instruction semantics.
+ * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
+ * copied to R1.
+ *
+ * All registers are 64-bit (even on 32-bit arch)
+ * R0 - return register
+ * R1-R5 argument passing registers
+ * R6-R9 callee saved registers
+ * R10 - frame pointer read-only
+ *
+ * At the start of BPF program the register R1 contains a pointer to bpf_context
+ * and has type PTR_TO_CTX.
+ *
+ * Most of the time the registers have UNKNOWN_VALUE type, which
+ * means the register has some value, but it's not a valid pointer.
+ * Verifier doesn't attempt to track all arithmetic operations on pointers.
+ * The only special case is the sequence:
+ * BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+ * BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
+ * 1st insn copies R10 (which has FRAME_PTR) type into R1
+ * and 2nd arithmetic instruction is pattern matched to recognize
+ * that it wants to construct a pointer to some element within stack.
+ * So after 2nd insn, the register R1 has type PTR_TO_STACK
+ * (and -20 constant is saved for further stack bounds checking).
+ * Meaning that this reg is a pointer to stack plus known immediate constant.
+ *
+ * When program is doing load or store insns the type of base register can be:
+ * PTR_TO_MAP_VALUE, PTR_TO_CTX, FRAME_PTR. These are three pointer types recognized
+ * by check_mem_access() function.
+ *
+ * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
+ * and the range of [ptr, ptr + map's value_size) is accessible.
+ *
+ * registers used to pass pointers to function calls are verified against
+ * function prototypes
+ *
+ * ARG_PTR_TO_MAP_KEY is a function argument constraint.
+ * It means that the register type passed to this function must be
+ * PTR_TO_STACK and it will be used inside the function as
+ * 'pointer to map element key'
+ *
+ * For example the argument constraints for bpf_map_lookup_elem():
+ * .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ * .arg1_type = ARG_CONST_MAP_ID,
+ * .arg2_type = ARG_PTR_TO_MAP_KEY,
+ *
+ * ret_type says that this function returns 'pointer to map elem value or null'
+ * 1st argument is a 'const immediate' value which must be one of the valid map_ids.
+ * 2nd argument is a pointer to stack, which will be used inside the function as
+ * a pointer to map element key.
+ *
+ * On the kernel side the helper function looks like:
+ * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+ * {
+ * struct bpf_map *map;
+ * int map_id = r1;
+ * void *key = (void *) (unsigned long) r2;
+ * void *value;
+ *
+ * here the kernel can access the 'key' pointer safely, knowing that
+ * [key, key + map->key_size) bytes are valid and were initialized on
+ * the stack of the eBPF program.
+ * }
+ *
+ * The corresponding eBPF program looks like:
+ * BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), // after this insn R2 type is FRAME_PTR
+ * BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
+ * BPF_MOV64_IMM(BPF_REG_1, MAP_ID), // after this insn R1 type is CONST_ARG
+ * BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ * here the verifier looks up the prototype of map_lookup_elem and sees:
+ * .arg1_type == ARG_CONST_MAP_ID and R1->type == CONST_ARG, which is ok so far,
+ * then it finds the map whose map_id equals the R1->imm value.
+ * Now the verifier knows that this map has a key of key_size bytes
+ *
+ * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
+ * Now the verifier checks that [R2, R2 + map's key_size) are within stack limits
+ * and were initialized prior to this call.
+ * If that's ok, the verifier allows this BPF_CALL insn and looks at
+ * .ret_type, which is RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets
+ * R0->type = PTR_TO_MAP_VALUE_OR_NULL, which means bpf_map_lookup_elem()
+ * returns either a pointer to a map value or NULL.
+ *
+ * When type PTR_TO_MAP_VALUE_OR_NULL passes through 'if (reg != 0) goto +off'
+ * insn, the register holding that pointer in the true branch changes state to
+ * PTR_TO_MAP_VALUE and the same register changes state to CONST_IMM in the false
+ * branch. See check_cond_jmp_op().
+ *
+ * After the call R0 is set to return type of the function and registers R1-R5
+ * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * load/store alignment is checked:
+ * BPF_STX_MEM(BPF_DW, dest_reg, src_reg, 3)
+ * is rejected, because it's misaligned
+ *
+ * load/store to stack are bounds checked and register spill is tracked
+ * BPF_STX_MEM(BPF_B, BPF_REG_10, src_reg, 0)
+ * is rejected, because it's out of bounds
+ *
+ * load/store to map are bounds checked:
+ * BPF_STX_MEM(BPF_H, dest_reg, src_reg, 8)
+ * is ok, if dest_reg->type == PTR_TO_MAP_VALUE and
+ * 8 + sizeof(u16) <= map_info->value_size
+ *
+ * load/store to bpf_context are checked against known fields
+ */
+
+int bpf_check(struct bpf_prog *prog, struct nlattr *tb[BPF_PROG_ATTR_MAX + 1])
+{
+ int ret = -EINVAL;
+
+ return ret;
+}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:18 UTC
eBPF programs are safe run-to-completion functions with load/unload
methods from userspace similar to kernel modules.

User space API:

- load eBPF program
fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)

where 'prog' is a sequence of sections (TEXT, LICENSE)
TEXT - array of eBPF instructions
LICENSE - must be GPL compatible to call helper functions marked gpl_only

- unload eBPF program
close(fd)

User space example of syscall(__NR_bpf, BPF_PROG_LOAD, prog_type, ...)
follows in later patches

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org>
---
include/linux/bpf.h | 36 +++++++++
include/linux/filter.h | 9 ++-
include/uapi/linux/bpf.h | 28 +++++++
kernel/bpf/syscall.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 2 +
5 files changed, 269 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fd1ac4b5ba8b..ac6320f44812 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,4 +47,40 @@ void bpf_register_map_type(struct bpf_map_type_list *tl);
void bpf_map_put(struct bpf_map *map);
struct bpf_map *bpf_map_get(struct fd f);

+/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
+ * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
+ * instructions after verifying
+ */
+struct bpf_func_proto {
+ u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+ bool gpl_only;
+};
+
+struct bpf_verifier_ops {
+ /* return eBPF function prototype for verification */
+ const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+};
+
+struct bpf_prog_type_list {
+ struct list_head list_node;
+ struct bpf_verifier_ops *ops;
+ enum bpf_prog_type type;
+};
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl);
+
+struct bpf_prog_info {
+ atomic_t refcnt;
+ bool is_gpl_compatible;
+ enum bpf_prog_type prog_type;
+ struct bpf_verifier_ops *ops;
+ struct bpf_map **used_maps;
+ u32 used_map_cnt;
+};
+
+struct bpf_prog;
+
+void bpf_prog_put(struct bpf_prog *prog);
+struct bpf_prog *bpf_prog_get(u32 ufd);
+
#endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f04793474d16..f06913b29861 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -31,11 +31,16 @@ struct sock_fprog_kern {
struct sk_buff;
struct sock;
struct seccomp_data;
+struct bpf_prog_info;

struct bpf_prog {
u32 jited:1, /* Is our filter JIT'ed? */
- len:31; /* Number of filter blocks */
- struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ has_info:1, /* whether 'info' is valid */
+ len:30; /* Number of filter blocks */
+ union {
+ struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ struct bpf_prog_info *info;
+ };
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 828e873fa435..aa09ba084ebc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -344,6 +344,13 @@ enum bpf_cmd {
* returns zero and stores next key or negative error
*/
BPF_MAP_GET_NEXT_KEY,
+
+ /* verify and load eBPF program
+ * fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)
+ * prog is a sequence of sections
+ * returns fd or negative error
+ */
+ BPF_PROG_LOAD,
};

enum bpf_map_attributes {
@@ -361,4 +368,25 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
};

+enum bpf_prog_attributes {
+ BPF_PROG_UNSPEC,
+ BPF_PROG_TEXT, /* array of eBPF instructions */
+ BPF_PROG_LICENSE, /* license string */
+ __BPF_PROG_ATTR_MAX,
+};
+#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
+#define BPF_PROG_MAX_ATTR_SIZE 65535
+
+enum bpf_prog_type {
+ BPF_PROG_TYPE_UNSPEC,
+};
+
+/* integer value in 'imm' field of BPF_CALL instruction selects which helper
+ * function eBPF program intends to call
+ */
+enum bpf_func_id {
+ BPF_FUNC_unspec,
+ __BPF_FUNC_MAX_ID,
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 45e100ece1b7..4c5f5169f6fc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -14,6 +14,8 @@
#include <net/netlink.h>
#include <linux/anon_inodes.h>
#include <linux/file.h>
+#include <linux/license.h>
+#include <linux/filter.h>

static LIST_HEAD(bpf_map_types);

@@ -315,6 +317,197 @@ err_put:
return err;
}

+static LIST_HEAD(bpf_prog_types);
+
+static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog)
+{
+ struct bpf_prog_type_list *tl;
+
+ list_for_each_entry(tl, &bpf_prog_types, list_node) {
+ if (tl->type == type) {
+ prog->info->ops = tl->ops;
+ prog->info->prog_type = type;
+ return 0;
+ }
+ }
+ return -EINVAL;
+}
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl)
+{
+ list_add(&tl->list_node, &bpf_prog_types);
+}
+
+/* drop refcnt on maps used by eBPF program and free auxiliary data */
+static void free_bpf_prog_info(struct bpf_prog_info *info)
+{
+ int i;
+
+ for (i = 0; i < info->used_map_cnt; i++)
+ bpf_map_put(info->used_maps[i]);
+
+ kfree(info->used_maps);
+ kfree(info);
+}
+
+void bpf_prog_put(struct bpf_prog *prog)
+{
+ BUG_ON(!prog->has_info);
+ if (atomic_dec_and_test(&prog->info->refcnt)) {
+ free_bpf_prog_info(prog->info);
+ bpf_prog_free(prog);
+ }
+}
+
+static int bpf_prog_release(struct inode *inode, struct file *filp)
+{
+ struct bpf_prog *prog = filp->private_data;
+
+ bpf_prog_put(prog);
+ return 0;
+}
+
+static const struct file_operations bpf_prog_fops = {
+ .release = bpf_prog_release,
+};
+
+static struct bpf_prog *get_prog(struct fd f)
+{
+ struct bpf_prog *prog;
+
+ if (!f.file)
+ return ERR_PTR(-EBADF);
+
+ if (f.file->f_op != &bpf_prog_fops) {
+ fdput(f);
+ return ERR_PTR(-EINVAL);
+ }
+
+ prog = f.file->private_data;
+
+ return prog;
+}
+
+/* called by sockets/tracing/seccomp before attaching program to an event
+ * pairs with bpf_prog_put()
+ */
+struct bpf_prog *bpf_prog_get(u32 ufd)
+{
+ struct fd f = fdget(ufd);
+ struct bpf_prog *prog;
+
+ prog = get_prog(f);
+
+ if (IS_ERR(prog))
+ return prog;
+
+ atomic_inc(&prog->info->refcnt);
+ fdput(f);
+ return prog;
+}
+
+static const struct nla_policy prog_policy[BPF_PROG_ATTR_MAX + 1] = {
+ [BPF_PROG_TEXT] = { .type = NLA_BINARY },
+ [BPF_PROG_LICENSE] = { .type = NLA_NUL_STRING },
+};
+
+static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
+ int len)
+{
+ struct nlattr *tb[BPF_PROG_ATTR_MAX + 1];
+ struct bpf_prog *prog;
+ struct nlattr *attr;
+ size_t insn_len;
+ int err;
+ bool is_gpl;
+
+ if (len <= 0 || len > BPF_PROG_MAX_ATTR_SIZE)
+ return -EINVAL;
+
+ attr = kmalloc(len, GFP_USER);
+ if (!attr)
+ return -ENOMEM;
+
+ /* copy eBPF program from user space */
+ err = -EFAULT;
+ if (copy_from_user(attr, uattr, len) != 0)
+ goto free_attr;
+
+ /* perform basic validation */
+ err = nla_parse(tb, BPF_PROG_ATTR_MAX, attr, len, prog_policy);
+ if (err < 0)
+ goto free_attr;
+
+ err = -EINVAL;
+ /* look for mandatory license string */
+ if (!tb[BPF_PROG_LICENSE])
+ goto free_attr;
+
+ /* eBPF programs must be GPL compatible to use GPL-ed functions */
+ is_gpl = license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE]));
+
+ /* look for mandatory array of eBPF instructions */
+ if (!tb[BPF_PROG_TEXT])
+ goto free_attr;
+
+ insn_len = nla_len(tb[BPF_PROG_TEXT]);
+ if (insn_len % sizeof(struct bpf_insn) != 0 || insn_len <= 0)
+ goto free_attr;
+
+ /* plain bpf_prog allocation */
+ err = -ENOMEM;
+ prog = kmalloc(bpf_prog_size(insn_len), GFP_USER);
+ if (!prog)
+ goto free_attr;
+
+ prog->len = insn_len / sizeof(struct bpf_insn);
+ memcpy(prog->insns, nla_data(tb[BPF_PROG_TEXT]), insn_len);
+ prog->orig_prog = NULL;
+ prog->jited = 0;
+ prog->has_info = 0;
+
+ /* allocate eBPF-related auxiliary data */
+ prog->info = kzalloc(sizeof(struct bpf_prog_info), GFP_USER);
+ if (!prog->info)
+ goto free_prog;
+ prog->has_info = 1;
+ atomic_set(&prog->info->refcnt, 1);
+ prog->info->is_gpl_compatible = is_gpl;
+
+ /* find program type: socket_filter vs tracing_filter */
+ err = find_prog_type(type, prog);
+ if (err < 0)
+ goto free_prog_info;
+
+ /* run eBPF verifier */
+ /* err = bpf_check(prog, tb); */
+
+ if (err < 0)
+ goto free_prog_info;
+
+ /* eBPF program is ready to be JITed */
+ bpf_prog_select_runtime(prog);
+
+ err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC);
+
+ if (err < 0)
+ /* failed to allocate fd */
+ goto free_prog_info;
+
+ /* user supplied eBPF prog attributes are no longer needed */
+ kfree(attr);
+
+ return err;
+
+free_prog_info:
+ free_bpf_prog_info(prog->info);
+free_prog:
+ bpf_prog_free(prog);
+free_attr:
+ kfree(attr);
+ return err;
+}
+
SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -348,6 +541,9 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
case BPF_MAP_GET_NEXT_KEY:
return map_get_next_key((int) arg2, (void __user *) arg3,
(void __user *) arg4);
+ case BPF_PROG_LOAD:
+ return bpf_prog_load((enum bpf_prog_type) arg2,
+ (struct nlattr __user *) arg3, (int) arg4);
default:
return -EINVAL;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index d814b8a89d0f..ed15874a9beb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -835,6 +835,7 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
{
struct sock_fprog_kern *fprog = fp->orig_prog;

+ BUG_ON(fp->has_info);
if (fprog) {
kfree(fprog->filter);
kfree(fprog);
@@ -973,6 +974,7 @@ static struct bpf_prog *bpf_prepare_filter(struct bpf_prog *fp)

fp->bpf_func = NULL;
fp->jited = 0;
+ fp->has_info = 0;

err = bpf_check_classic(fp->insns, fp->len);
if (err) {
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:16 UTC
'maps' is a generic storage mechanism of different types for sharing data between
the kernel and user space.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
fd = bpf_map_create(map_type, struct nlattr *attr, int len)
returns fd or negative error

- lookup key in a given map referenced by fd
err = bpf_map_lookup_elem(int fd, void *key, void *value)
returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
err = bpf_map_update_elem(int fd, void *key, void *value)
returns zero or negative error

- find and delete element by key in a given map
err = bpf_map_delete_elem(int fd, void *key)

- iterate map elements (based on input key return next_key)
err = bpf_map_get_next_key(int fd, void *key, void *next_key)

- close(fd) deletes the map

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/linux/bpf.h | 8 ++
include/uapi/linux/bpf.h | 25 ++++++
kernel/bpf/syscall.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 231 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 607ca53fe2af..fd1ac4b5ba8b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -9,6 +9,7 @@

#include <uapi/linux/bpf.h>
#include <linux/workqueue.h>
+#include <linux/file.h>

struct bpf_map;
struct nlattr;
@@ -18,6 +19,12 @@ struct bpf_map_ops {
/* funcs callable from userspace (via syscall) */
struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
void (*map_free)(struct bpf_map *);
+ int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+
+ /* funcs callable from userspace and from eBPF programs */
+ void *(*map_lookup_elem)(struct bpf_map *map, void *key);
+ int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+ int (*map_delete_elem)(struct bpf_map *map, void *key);
};

struct bpf_map {
@@ -38,5 +45,6 @@ struct bpf_map_type_list {

void bpf_register_map_type(struct bpf_map_type_list *tl);
void bpf_map_put(struct bpf_map *map);
+struct bpf_map *bpf_map_get(struct fd f);

#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 88b703d59b8c..804dd8b2ca19 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -319,6 +319,31 @@ enum bpf_cmd {
* map is deleted when fd is closed
*/
BPF_MAP_CREATE,
+
+ /* lookup key in a given map
+ * err = bpf_map_lookup_elem(int fd, void *key, void *value)
+ * returns zero and stores found elem into value
+ * or negative error
+ */
+ BPF_MAP_LOOKUP_ELEM,
+
+ /* create or update key/value pair in a given map
+ * err = bpf_map_update_elem(int fd, void *key, void *value)
+ * returns zero or negative error
+ */
+ BPF_MAP_UPDATE_ELEM,
+
+ /* find and delete elem by key in a given map
+ * err = bpf_map_delete_elem(int fd, void *key)
+ * returns zero or negative error
+ */
+ BPF_MAP_DELETE_ELEM,
+
+ /* lookup key in a given map and return next key
+ * err = bpf_map_get_next_key(int fd, void *key, void *next_key)
+ * returns zero and stores next key or negative error
+ */
+ BPF_MAP_GET_NEXT_KEY,
};

enum bpf_map_attributes {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 04cdf7948f8f..45e100ece1b7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -13,6 +13,7 @@
#include <linux/syscalls.h>
#include <net/netlink.h>
#include <linux/anon_inodes.h>
+#include <linux/file.h>

static LIST_HEAD(bpf_map_types);

@@ -131,6 +132,189 @@ free_attr:
return err;
}

+/* if error is returned, fd is released.
+ * On success caller should complete fd access with matching fdput()
+ */
+struct bpf_map *bpf_map_get(struct fd f)
+{
+ struct bpf_map *map;
+
+ if (!f.file)
+ return ERR_PTR(-EBADF);
+
+ if (f.file->f_op != &bpf_map_fops) {
+ fdput(f);
+ return ERR_PTR(-EINVAL);
+ }
+
+ map = f.file->private_data;
+
+ return map;
+}
+
+static int map_lookup_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+ struct fd f = fdget(ufd);
+ struct bpf_map *map;
+ void *key, *value;
+ int err;
+
+ map = bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_USER);
+ if (!key)
+ goto err_put;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ESRCH;
+ rcu_read_lock();
+ value = map->ops->map_lookup_elem(map, key);
+ if (!value)
+ goto err_unlock;
+
+ err = -EFAULT;
+ if (copy_to_user(uvalue, value, map->value_size) != 0)
+ goto err_unlock;
+
+ err = 0;
+
+err_unlock:
+ rcu_read_unlock();
+free_key:
+ kfree(key);
+err_put:
+ fdput(f);
+ return err;
+}
+
+static int map_update_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+ struct fd f = fdget(ufd);
+ struct bpf_map *map;
+ void *key, *value;
+ int err;
+
+ map = bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_USER);
+ if (!key)
+ goto err_put;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ENOMEM;
+ value = kmalloc(map->value_size, GFP_USER);
+ if (!value)
+ goto free_key;
+
+ err = -EFAULT;
+ if (copy_from_user(value, uvalue, map->value_size) != 0)
+ goto free_value;
+
+ /* eBPF programs that use maps run under rcu_read_lock(),
+ * therefore all map accessors rely on this fact, so do the same here
+ */
+ rcu_read_lock();
+ err = map->ops->map_update_elem(map, key, value);
+ rcu_read_unlock();
+
+free_value:
+ kfree(value);
+free_key:
+ kfree(key);
+err_put:
+ fdput(f);
+ return err;
+}
+
+static int map_delete_elem(int ufd, void __user *ukey)
+{
+ struct fd f = fdget(ufd);
+ struct bpf_map *map;
+ void *key;
+ int err;
+
+ map = bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_USER);
+ if (!key)
+ goto err_put;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ rcu_read_lock();
+ err = map->ops->map_delete_elem(map, key);
+ rcu_read_unlock();
+
+free_key:
+ kfree(key);
+err_put:
+ fdput(f);
+ return err;
+}
+
+static int map_get_next_key(int ufd, void __user *ukey, void __user *unext_key)
+{
+ struct fd f = fdget(ufd);
+ struct bpf_map *map;
+ void *key, *next_key;
+ int err;
+
+ map = bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_USER);
+ if (!key)
+ goto err_put;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ENOMEM;
+ next_key = kmalloc(map->key_size, GFP_USER);
+ if (!next_key)
+ goto free_key;
+
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, key, next_key);
+ rcu_read_unlock();
+ if (err)
+ goto free_next_key;
+
+ err = -EFAULT;
+ if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+ goto free_next_key;
+
+ err = 0;
+
+free_next_key:
+ kfree(next_key);
+free_key:
+ kfree(key);
+err_put:
+ fdput(f);
+ return err;
+}
+
SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -150,6 +334,20 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
case BPF_MAP_CREATE:
return map_create((enum bpf_map_type) arg2,
(struct nlattr __user *) arg3, (int) arg4);
+ case BPF_MAP_LOOKUP_ELEM:
+ return map_lookup_elem((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
+ case BPF_MAP_UPDATE_ELEM:
+ return map_update_elem((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
+ case BPF_MAP_DELETE_ELEM:
+ if (arg4 != 0)
+ return -EINVAL;
+ return map_delete_elem((int) arg2, (void __user *) arg3);
+
+ case BPF_MAP_GET_NEXT_KEY:
+ return map_get_next_key((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
default:
return -EINVAL;
}
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:15 UTC
done as separate commit to ease conflict resolution

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 3 ++-
include/uapi/asm-generic/unistd.h | 4 +++-
kernel/sys_ni.c | 3 +++
4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 252c804bb1aa..1416e32f20f1 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,6 +325,7 @@
316 common renameat2 sys_renameat2
317 common seccomp sys_seccomp
318 common getrandom sys_getrandom
+319 common bpf sys_bpf

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 701daff5d899..cc2e197e0875 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -870,5 +870,6 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
const char __user *uargs);
asmlinkage long sys_getrandom(char __user *buf, size_t count,
unsigned int flags);
-
+asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f1afd607f043..3b625b281804 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -703,9 +703,11 @@ __SYSCALL(__NR_renameat2, sys_renameat2)
__SYSCALL(__NR_seccomp, sys_seccomp)
#define __NR_getrandom 278
__SYSCALL(__NR_getrandom, sys_getrandom)
+#define __NR_bpf 279
+__SYSCALL(__NR_bpf, sys_bpf)

#undef __NR_syscalls
-#define __NR_syscalls 279
+#define __NR_syscalls 280

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2904a2105914..d67ab4e618f7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -216,3 +216,6 @@ cond_syscall(sys_kcmp);

/* operate on Secure Computing state */
cond_syscall(sys_seccomp);
+
+/* access BPF programs and maps */
+cond_syscall(sys_bpf);
--
1.7.9.5
Alexei Starovoitov
2014-08-13 07:57:13 UTC
eBPF can be used from user space.

uapi/linux/bpf.h: eBPF instruction set definition

linux/filter.h: the rest

This patch only moves macro definitions, but practically it freezes the existing
eBPF instruction set, though new instructions can still be added in the future.

These eBPF definitions cannot go into uapi/linux/filter.h, since the names
may conflict with existing applications.

Signed-off-by: Alexei Starovoitov <***@plumgrid.com>
---
include/linux/filter.h | 305 +------------------------------------------
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/bpf.h | 314 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 316 insertions(+), 304 deletions(-)
create mode 100644 include/uapi/linux/bpf.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73a6d505e729..f04793474d16 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -9,315 +9,12 @@
#include <linux/skbuff.h>
#include <linux/workqueue.h>
#include <uapi/linux/filter.h>
-
-/* Internally used and optimized filter representation with extended
- * instruction set based on top of classic BPF.
- */
-
-/* instruction classes */
-#define BPF_ALU64 0x07 /* alu mode in double word width */
-
-/* ld/ldx fields */
-#define BPF_DW 0x18 /* double word */
-#define BPF_XADD 0xc0 /* exclusive add */
-
-/* alu/jmp fields */
-#define BPF_MOV 0xb0 /* mov reg to reg */
-#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
-
-/* change endianness of a register */
-#define BPF_END 0xd0 /* flags for endianness conversion: */
-#define BPF_TO_LE 0x00 /* convert to little-endian */
-#define BPF_TO_BE 0x08 /* convert to big-endian */
-#define BPF_FROM_LE BPF_TO_LE
-#define BPF_FROM_BE BPF_TO_BE
-
-#define BPF_JNE 0x50 /* jump != */
-#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
-#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
-#define BPF_CALL 0x80 /* function call */
-#define BPF_EXIT 0x90 /* function return */
-
-/* Register numbers */
-enum {
- BPF_REG_0 = 0,
- BPF_REG_1,
- BPF_REG_2,
- BPF_REG_3,
- BPF_REG_4,
- BPF_REG_5,
- BPF_REG_6,
- BPF_REG_7,
- BPF_REG_8,
- BPF_REG_9,
- BPF_REG_10,
- __MAX_BPF_REG,
-};
-
-/* BPF has 10 general purpose 64-bit registers and stack frame. */
-#define MAX_BPF_REG __MAX_BPF_REG
-
-/* ArgX, context and stack frame pointer register positions. Note,
- * Arg1, Arg2, Arg3, etc are used as argument mappings of function
- * calls in BPF_CALL instruction.
- */
-#define BPF_REG_ARG1 BPF_REG_1
-#define BPF_REG_ARG2 BPF_REG_2
-#define BPF_REG_ARG3 BPF_REG_3
-#define BPF_REG_ARG4 BPF_REG_4
-#define BPF_REG_ARG5 BPF_REG_5
-#define BPF_REG_CTX BPF_REG_6
-#define BPF_REG_FP BPF_REG_10
-
-/* Additional register mappings for converted user programs. */
-#define BPF_REG_A BPF_REG_0
-#define BPF_REG_X BPF_REG_7
-#define BPF_REG_TMP BPF_REG_8
-
-/* BPF program can access up to 512 bytes of stack space. */
-#define MAX_BPF_STACK 512
-
-/* Helper macros for filter block array initializers. */
-
-/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
-
-#define BPF_ALU64_REG(OP, DST, SRC) \
- ((struct bpf_insn) { \
- .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-#define BPF_ALU32_REG(OP, DST, SRC) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
-
-#define BPF_ALU64_IMM(OP, DST, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_ALU32_IMM(OP, DST, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
-
-#define BPF_ENDIAN(TYPE, DST, LEN) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = LEN })
-
-/* Short form of mov, dst_reg = src_reg */
-
-#define BPF_MOV64_REG(DST, SRC) \
- ((struct bpf_insn) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-#define BPF_MOV32_REG(DST, SRC) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_MOV | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-/* Short form of mov, dst_reg = imm32 */
-
-#define BPF_MOV64_IMM(DST, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_MOV32_IMM(DST, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_MOV | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* use two of BPF_LD_IMM64 to encode single move 64-bit insn
- * first macro to carry lower 32-bits and second for higher 32-bits
- */
-#define BPF_LD_IMM64(DST, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_LD | BPF_DW | BPF_IMM, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
-
-#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
-
-#define BPF_LD_ABS(SIZE, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
-
-#define BPF_LD_IND(SIZE, SRC, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
- .dst_reg = 0, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
-
-#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
- ((struct bpf_insn) { \
- .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
-
-#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
- ((struct bpf_insn) { \
- .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
-
-#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
- ((struct bpf_insn) { \
- .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = OFF, \
- .imm = IMM })
-
-/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
-
-#define BPF_JMP_REG(OP, DST, SRC, OFF) \
- ((struct bpf_insn) { \
- .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
-
-#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
- ((struct bpf_insn) { \
- .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = OFF, \
- .imm = IMM })
-
-/* Function call */
-
-#define BPF_EMIT_CALL(FUNC) \
- ((struct bpf_insn) { \
- .code = BPF_JMP | BPF_CALL, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = ((FUNC) - __bpf_call_base) })
-
-/* Raw code statement block */
-
-#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
- ((struct bpf_insn) { \
- .code = CODE, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = IMM })
-
-/* Program exit */
-
-#define BPF_EXIT_INSN() \
- ((struct bpf_insn) { \
- .code = BPF_JMP | BPF_EXIT, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = 0 })
-
-#define bytes_to_bpf_size(bytes) \
-({ \
- int bpf_size = -EINVAL; \
- \
- if (bytes == sizeof(u8)) \
- bpf_size = BPF_B; \
- else if (bytes == sizeof(u16)) \
- bpf_size = BPF_H; \
- else if (bytes == sizeof(u32)) \
- bpf_size = BPF_W; \
- else if (bytes == sizeof(u64)) \
- bpf_size = BPF_DW; \
- \
- bpf_size; \
-})
+#include <uapi/linux/bpf.h>

/* Macro to invoke filter function. */
#define SK_RUN_FILTER(filter, ctx) \
(*filter->prog->bpf_func)(ctx, filter->prog->insnsi)

-struct bpf_insn {
- __u8 code; /* opcode */
- __u8 dst_reg:4; /* dest register */
- __u8 src_reg:4; /* source register */
- __s16 off; /* signed offset */
- __s32 imm; /* signed immediate constant */
-};
-
#ifdef CONFIG_COMPAT
/* A struct sock_filter is architecture independent. */
struct compat_sock_fprog {
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 24e9033f8b3f..fb3f7b675229 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -67,6 +67,7 @@ header-y += bfs_fs.h
header-y += binfmts.h
header-y += blkpg.h
header-y += blktrace_api.h
+header-y += bpf.h
header-y += bpqether.h
header-y += bsg.h
header-y += btrfs.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
new file mode 100644
index 000000000000..6f6e10875e95
--- /dev/null
+++ b/include/uapi/linux/bpf.h
@@ -0,0 +1,314 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_H__
+#define _UAPI__LINUX_BPF_H__
+
+#include <linux/types.h>
+
+/* Extended instruction set based on top of classic BPF */
+
+/* instruction classes */
+#define BPF_ALU64 0x07 /* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW 0x18 /* double word */
+#define BPF_XADD 0xc0 /* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV 0xb0 /* mov reg to reg */
+#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END 0xd0 /* flags for endianness conversion: */
+#define BPF_TO_LE 0x00 /* convert to little-endian */
+#define BPF_TO_BE 0x08 /* convert to big-endian */
+#define BPF_FROM_LE BPF_TO_LE
+#define BPF_FROM_BE BPF_TO_BE
+
+#define BPF_JNE 0x50 /* jump != */
+#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
+#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
+#define BPF_CALL 0x80 /* function call */
+#define BPF_EXIT 0x90 /* function return */
+
+/* Register numbers */
+enum {
+ BPF_REG_0 = 0,
+ BPF_REG_1,
+ BPF_REG_2,
+ BPF_REG_3,
+ BPF_REG_4,
+ BPF_REG_5,
+ BPF_REG_6,
+ BPF_REG_7,
+ BPF_REG_8,
+ BPF_REG_9,
+ BPF_REG_10,
+ __MAX_BPF_REG,
+};
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG __MAX_BPF_REG
+
+/* ArgX, context and stack frame pointer register positions. Note,
+ * Arg1, Arg2, Arg3, etc are used as argument mappings of function
+ * calls in BPF_CALL instruction.
+ */
+#define BPF_REG_ARG1 BPF_REG_1
+#define BPF_REG_ARG2 BPF_REG_2
+#define BPF_REG_ARG3 BPF_REG_3
+#define BPF_REG_ARG4 BPF_REG_4
+#define BPF_REG_ARG5 BPF_REG_5
+#define BPF_REG_CTX BPF_REG_6
+#define BPF_REG_FP BPF_REG_10
+
+/* Additional register mappings for converted user programs. */
+#define BPF_REG_A BPF_REG_0
+#define BPF_REG_X BPF_REG_7
+#define BPF_REG_TMP BPF_REG_8
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK 512
+
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_MOV32_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* use two of BPF_LD_IMM64 to encode single move 64-bit insn
+ * first macro to carry lower 32-bits and second for higher 32-bits
+ */
+#define BPF_LD_IMM64(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_DW | BPF_IMM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
+ .dst_reg = 0, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_CALL, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = CODE, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN() \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_EXIT, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = 0 })
+
+#define bytes_to_bpf_size(bytes) \
+({ \
+ int bpf_size = -EINVAL; \
+ \
+ if (bytes == sizeof(u8)) \
+ bpf_size = BPF_B; \
+ else if (bytes == sizeof(u16)) \
+ bpf_size = BPF_H; \
+ else if (bytes == sizeof(u32)) \
+ bpf_size = BPF_W; \
+ else if (bytes == sizeof(u64)) \
+ bpf_size = BPF_DW; \
+ \
+ bpf_size; \
+})
+
+struct bpf_insn {
+ __u8 code; /* opcode */
+ __u8 dst_reg:4; /* dest register */
+ __u8 src_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
+
+#endif /* _UAPI__LINUX_BPF_H__ */
--
1.7.9.5
David Laight
2014-08-13 08:52:30 UTC
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
There is nothing to stop any JIT software detecting the instruction pair.

David
Alexei Starovoitov
2014-08-13 17:30:54 UTC
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
that is what I used before in the pred_tree_walker->ebpf patch
(4 existing instructions (2 movs, shift, or) to load the 'pred' pointer).
It's slower in the interpreter than a single instruction.
Post by David Laight
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
The size of the instruction is not important. 99% of instructions are 8 bytes long
and one is 16 bytes. Big deal. It doesn't affect interpreter performance,
it's easy for the verifier, and it was straightforward to do in LLVM as well.
Post by David Laight
There is nothing to stop any JIT software detecting the instruction pair.
well, it's actually very complicated to detect a sequence of
instructions that computes a single 64-bit value.
Patch #11 detects and patches pseudo BPF_LD_IMM64 in
a single 'for' loop (see replace_map_fd_with_map_ptr), because
it's _single_ instruction. Any sequence of insns would require
building control and data flow graphs for verifier and JIT.
If you remember I resisted initially when Chema proposed
'load 64-bit immediate' equivalent, since back then the use cases
didn't require it. With maps done via FDs, the need has arisen.
Andy Lutomirski
2014-08-13 17:40:55 UTC
Post by Alexei Starovoitov
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
that was what I used before in pred_tree_walker->ebpf patch
(4 existing instructions (2 movs, shift, or) to load 'pred' pointer)
It's slower in interpreter than single instruction.
Post by David Laight
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
size of instruction is not important. 99% of instructions are 8 byte long
and one is 16 byte. Big deal. It doesn't affect interpreter performance,
easy for verifier and was straightforward to do in LLVM as well.
Post by David Laight
There is nothing to stop any JIT software detecting the instruction pair.
well, it's actually very complicated to detect a sequence of
instructions that compute single 64-bit value.
Patch #11 detects and patches pseudo BPF_LD_IMM64 in
a single 'for' loop (see replace_map_fd_with_map_ptr), because
it's _single_ instruction. Any sequence of insns would require
building control and data flow graphs for verifier and JIT.
If you remember I resisted initially when Chema proposed
'load 64-bit immediate' equivalent, since back then the use cases
didn't require it. With maps done via FDs, the need has arisen.
But don't you need some kind of detection anyway to handle the case
where something jumps to the middle of the "load 64-bit immediate"? I
think it would be fine to require a particular sequence to load 64-bit
immediates if you want the JIT to optimize it well, though.

--Andy
Alexei Starovoitov
2014-08-13 18:00:24 UTC
Post by Andy Lutomirski
But don't you need some kind of detection anyway to handle the case
where something jumps to the middle of the "load 64-bit immediate"? I
I added a few test cases to test_verifier to see that this case is caught,
and it looks ok, but you got me worried. Maybe a few more checks are needed.
Thanks!
David Miller
2014-08-13 23:25:16 UTC
From: David Laight <David.Laight-ZS65k/***@public.gmane.org>
Date: Wed, 13 Aug 2014 08:52:30 +0000
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
There is nothing to stop any JIT software detecting the instruction pair.
The opposite argument is that JITs can expand the IMM64 load into whatever
sequence of instructions is most optimal.

My only real gripe with IMM64 loads is that it's not mainly for
loading an immediate, it's for loading a pointer. And this
distinction is important for some JITs.

For example, on sparc64 all symbol based addresses are actually 32-bit
because of the code model we use to compile the kernel and all modules.
So if we knew this is a pointer load and it's to a symbol in a kernel
or module image, we could do a 32-bit load.
Andy Lutomirski
2014-08-13 23:34:59 UTC
Post by David Miller
Date: Wed, 13 Aug 2014 08:52:30 +0000
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
There is nothing to stop any JIT software detecting the instruction pair.
The opposite argument is that JITs can expand the IMM64 load into whatever
sequence of instructions is most optimal.
My only real gripe with IMM64 loads is that it's not mainly for
loading an immediate, it's for loading a pointer. And this
distinction is important for some JITs.
For example, on sparc64 all symbol based addresses are actually 32-bit
because of the code model we use to compile the kernel and all modules.
So if we knew this is a pointer load and it's to a symbol in a kernel
or module image, we could do a 32-bit load.
This is true for x86_64 as well, I think.

(Almost. For x86_64 we have a choice between a sign-extended load of
a value in the top 2GB of the address space and lea reg,offset(%rip).)

--Andy
Alexei Starovoitov
2014-08-13 23:46:41 UTC
Post by Andy Lutomirski
Post by David Miller
Date: Wed, 13 Aug 2014 08:52:30 +0000
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
There is nothing to stop any JIT software detecting the instruction pair.
The opposite argument is that JITs can expand the IMM64 load into whatever
sequence of instructions is most optimal.
My only real gripe with IMM64 loads is that it's not mainly for
loading an immediate, it's for loading a pointer. And this
distinction is important for some JITs.
For example, on sparc64 all symbol based addresses are actually 32-bit
because of the code model we use to compile the kernel and all modules.
So if we knew this is a pointer load and it's to a symbol in a kernel
or module image, we could do a 32-bit load.
This is true for x86_64 as well, I think.
(Almost. For x86_64 we have a choice between a sign-extended load of
a value in the top 2GB of the address space and lea reg,offset(%rip).)
That would be an interesting optimization. I did movabsq just
because it was straightforward. JITs can play interesting tricks here.
Since it's really a constant value, there is no difference whether
it's a pointer or a constant. If a JIT can use the $rip trick on x64 or reduce
the number of sethi insns on sparc, it should try to do it regardless of
how the value in dst_reg will be used later on by the program.
JITs can also allocate some read-only area for constants and
do a relative load from there. Not sure that it will be faster though.
JITs can get more complex and smarter as time goes by. They can
even randomly do some ld_imm64 via movabsq and some via a
sequence of mov, shift, or. That will thwart JIT spraying attacks.
If the JITed code itself is random, that would be a nice defense.
Andy Lutomirski
2014-08-13 23:53:02 UTC
Post by Alexei Starovoitov
Post by Andy Lutomirski
Post by David Miller
Date: Wed, 13 Aug 2014 08:52:30 +0000
Post by David Laight
From: Of Alexei Starovoitov
Post by Alexei Starovoitov
one more RFC...
Major difference vs previous set is a new 'load 64-bit immediate' eBPF insn.
Which is first 16-byte instruction. It shows how eBPF ISA can be extended
while maintaining backward compatibility, but mainly it cleans up eBPF
program access to maps and improves run-time performance.
Wouldn't it be more sensible to follow the scheme used by a lot of cpus
and add a 'load high' instruction (follow with 'add' or 'or').
It still takes 16 bytes to load a 64bit immediate value, but the instruction
size remains constant.
There is nothing to stop any JIT software detecting the instruction pair.
The opposite argument is that JITs can expand the IMM64 load into whatever
sequence of instructions is most optimal.
My only real gripe with IMM64 loads is that it's not mainly for
loading an immediate, it's for loading a pointer. And this
distinction is important for some JITs.
For example, on sparc64 all symbol based addresses are actually 32-bit
because of the code model we use to compile the kernel and all modules.
So if we knew this is a pointer load and it's to a symbol in a kernel
or module image, we could do a 32-bit load.
This is true for x86_64 as well, I think.
(Almost. For x86_64 we have a choice between a sign-extended load of
a value in the top 2GB of the address space and lea reg,offset(%rip).)
That would be an interesting optimization. I did movabsq just
because it was straightforward. JITs can play interesting tricks here.
Since it's really a constant value, there is no difference whether
it's a pointer or a constant. If JIT can use $rip trick on x64 or reduce
number of sethi insns on sparc, it should try to do it regardless of
how value in dst_reg will be used later on by the program.
JITs can also allocate some read-only area for constants and
do a relative load from there. Not sure that it will be faster though.
JITs can get more complex and smarter as time goes by. They can
even randomly do some ld_imm64 via movabsq and some via a
sequence of mov, shift, or. That will through away JIT spraying attacks.
If JITed code itself is random, that would be nice defense.
You can be even fancier on x86_64: if the JIT code ends up being
allocated within 2GB of the maps, then you can access kernel code
using absolute addresses and the maps using rip-relative addresses.

Depending on exactly what's going on, though, the best option may be
to use x86's fancy addressing modes for calls and loads. That will be
harder.

--Andy
Brendan Gregg
2014-08-14 19:17:39 UTC
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/***@public.gmane.org> wrote:
[...]
Post by Alexei Starovoitov
Tracing use case got some improvements as well. Now eBPF programs can be
ex1_kern.c - demonstrate how programs can walk in-kernel data structures
ex2_kern.c - in-kernel event accounting and user space histograms
See patch #25
This is great, thanks! I've been using this new support, and
successfully ported an older tool of mine (bitesize) to eBPF. I was
using the block:block_rq_issue tracepoint, and performing a custom
in-kernel histogram, like in the ex2_kern.c example, for I/O size.

I also did some quick overhead testing and found eBPF with JIT to be
relatively fast. (I'd share numbers but it's platform-specific.) The
syscall tracepoints were a bit slower than hoped, for what I think is
a well-known issue.

Are there thoughts in general for how this might be used for embedded
devices, where installing clang/llvm might be prohibitive? Compile on
another system and move the binaries over? thanks,

Brendan