If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
is automatically set ‘y’ on x86/x86-64, non-preemptive kernel) and
the “debug.kprobes_optimization” kernel parameter is set to 1 (see
sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
instruction instead of a breakpoint instruction at each probepoint.
1.4.1 Init a Kprobe
When a probe is registered, before attempting this optimization,
Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
address. So, even if it’s not possible to optimize this particular
probepoint, there’ll be a probe there.
1.4.2 Safety Check
Before optimizing a probe, Kprobes performs the following safety checks:
Kprobes verifies that the region that will be replaced by the jump
instruction (the “optimized region”) lies entirely within one function.
(A jump instruction is multiple bytes, and so may overlay multiple
Kprobes analyzes the entire function and verifies that there is no
jump into the optimized region. Specifically:
the function contains no indirect jump;
the function contains no instruction that causes an exception (since
the fixup code triggered by the exception could jump back into the
optimized region – Kprobes checks the exception tables to verify this);
there is no near jump to the optimized region (other than to the first
For each instruction in the optimized region, Kprobes verifies that
the instruction can be executed out of line.
1.4.3 Preparing Detour Buffer
Next, Kprobes prepares a “detour” buffer, which contains the following
- code to push the CPU’s registers (emulating a breakpoint trap)
- a call to the trampoline code which calls user’s probe handlers.
- code to restore registers
- the instructions from the optimized region
- a jump back to the original execution path.
After preparing the detour buffer, Kprobes verifies that none of the
following situations exist:
- The probe has either a break_handler (i.e., it’s a jprobe) or a post_handler.
- Other instructions in the optimized region are probed.
- The probe is disabled.
In any of the above cases, Kprobes won’t start optimizing the probe.
Since these are temporary situations, Kprobes tries to start
optimizing it again if the situation is changed.
If the kprobe can be optimized, Kprobes enqueues the kprobe to an
optimizing list, and kicks the kprobe-optimizer workqueue to optimize
it. If the to-be-optimized probepoint is hit before being optimized,
Kprobes returns control to the original instruction path by setting
the CPU’s instruction pointer to the copied code in the detour buffer
– thus at least avoiding the single-step.
The Kprobe-optimizer doesn’t insert the jump instruction immediately;
rather, it calls synchronize_sched() for safety first, because it’s
possible for a CPU to be interrupted in the middle of executing the
optimized region(*). As you know, synchronize_sched() can ensure
that all interruptions that were active when synchronize_sched()
was called are done, but only if CONFIG_PREEMPT=n. So, this version
of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**)
After that, the Kprobe-optimizer calls stop_machine() to replace
the optimized region with a jump instruction to the detour buffer,
When an optimized kprobe is unregistered, disabled, or blocked by
another kprobe, it will be unoptimized. If this happens before
the optimization is complete, the kprobe is just dequeued from the
optimized list. If the optimization has been done, the jump is
replaced with the original code (except for an int3 breakpoint in
the first byte) by using text_poke_smp().
()Please imagine that the 2nd instruction is interrupted and then
the optimizer replaces the 2nd instruction with the jump address*
while the interrupt handler is running. When the interrupt
returns to original address, there is no valid instruction,
and it causes an unexpected result.
(**)This optimization-safety checking may be replaced with the
stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
NOTE for geeks: The jump optimization changes the kprobe’s pre_handler behavior. Without optimization, the pre_handler can change the kernel’s execution path by changing regs->ip and returning 1. However, when the probe is optimized, that modification is ignored. Thus, if you want to tweak the kernel’s execution path, you need to suppress optimization, using one of the following techniques:
- Specify an empty function for the kprobe’s post_handler or break_handler.
- Execute ‘sysctl -w debug.kprobes_optimization=n’
5. Kprobes Features and Limitations
Kprobes allows multiple probes at the same address. Currently,
however, there cannot be multiple jprobes on the same function at
the same time. Also, a probepoint for which there is a jprobe or
a post_handler cannot be optimized. So if you install a jprobe,
or a kprobe with a post_handler, at an optimized probepoint, the
probepoint will be unoptimized automatically.
In general, you can install a probe anywhere in the kernel.
In particular, you can probe interrupt handlers. Known exceptions
are discussed in this section.
The register_probe functions will return -EINVAL if you attempt
to install a probe in the code that implements Kprobes (mostly
kernel/kprobes.c and arch//kernel/kprobes.c, but also functions such
as do_page_fault and notifier_call_chain).
If you install a probe in an inline-able function, Kprobes makes
no attempt to chase down all inline instances of the function and
install probes there. gcc may inline a function without being asked,
so keep this in mind if you’re not seeing the probe hits you expect.
A probe handler can modify the environment of the probed function
– e.g., by modifying kernel data structures, or by modifying the
contents of the pt_regs struct (which are restored to the registers
upon return from the breakpoint). So Kprobes can be used, for example,
to install a bug fix or to inject faults for testing. Kprobes, of
course, has no way to distinguish the deliberately injected faults
from the accidental ones. Don’t drink and probe.
Kprobes makes no attempt to prevent probe handlers from stepping on
each other – e.g., probing printk() and then calling printk() from a
probe handler. If a probe handler hits a probe, that second probe’s
handlers won’t be run in that instance, and the kprobe.nmissed member
of the second probe will be incremented.
As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
the same handler) may run concurrently on different CPUs.
Kprobes does not use mutexes or allocate memory except during
registration and unregistration.
Probe handlers are run with preemption disabled. Depending on the
architecture and optimization state, handlers may also run with
interrupts disabled (e.g., kretprobe handlers and optimized kprobe
handlers run without interrupt disabled on x86/x86-64). In any case,
your handler should not yield the CPU (e.g., by attempting to acquire
Since a return probe is implemented by replacing the return
address with the trampoline’s address, stack backtraces and calls
to __builtin_return_address() will typically yield the trampoline’s
address instead of the real return address for kretprobed functions.
(As far as we can tell, __builtin_return_address() is used only
for instrumentation and error reporting.)
If the number of times a function is called does not match the number
of times it returns, registering a return probe on that function may
produce undesirable results. In such a case, a line:
kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
gets printed. With this information, one will be able to correlate the
exact instance of the kretprobe that caused the problem. We have the
do_exit() case covered. do_execve() and do_fork() are not an issue.
We’re unaware of other specific cases where this could be a problem.
If, upon entry to or exit from a function, the CPU is running on
a stack other than that of the current task, registering a return
probe on that function may produce undesirable results. For this
reason, Kprobes doesn’t support return probes (or kprobes or jprobes)
on the x86_64 version of __switch_to(); the registration functions
On x86/x86-64, since the Jump Optimization of Kprobes modifies
instructions widely, there are some limitations to optimization. To
explain it, we introduce some terminology. Imagine a 3-instruction
sequence consisting of a two 2-byte instructions and one 3-byte
The instructions in DCR are copied to the out-of-line buffer
of the kprobe, because the bytes in DCR are replaced by
a 5-byte jump instruction. So there are several limitations.
a) The instructions in DCR must be relocatable.
b) The instructions in DCR must not include a call instruction.
c) JTPR must not be targeted by any jump or call instruction.
d) DCR must not straddle the border between functions.
Anyway, these limitations are checked by the in-kernel instruction
decoder, so you don’t need to worry about that.
6. Probe Overhead
On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
microseconds to process. Specifically, a benchmark that hits the same
probepoint repeatedly, firing a simple handler each time, reports 1-2
million hits per second, depending on the architecture. A jprobe or
return-probe hit typically takes 50-75% longer than a kprobe hit.
When you have a return probe set on a function, adding a kprobe at
the entry to that function adds essentially no overhead.
Here are sample overhead figures (in usec) for different architectures.
k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe
on same function; jr = jprobe + return probe on same function
i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40
x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07
ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99
6.1 Optimized Probe Overhead
Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
process. Here are sample overhead figures (in usec) for x86 architectures.
k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
i386: Intel® Xeon® E5410, 2.33GHz, 4656.90 bogomips
k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
x86-64: Intel® Xeon® E5410, 2.33GHz, 4656.90 bogomips k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
Application binaries are a result of compile and build operations performed on a single or a set of source files. Program Source files contain functions, data variables of various types (local, global, register, static), and abstract data objects, all written and neatly indented with nested control structures as per high level programming language syntax (C/C++). Compilers translate code in each source file into machine instructions (1’s and 0’s) as per target processors Instruction set Architecture and bury that code into object files. Further, Linkers integrate compiled object files with other pre-compiled objects files (libraries, runtime binaries) to create end application binary image called executable.
Source debuggers are tools used to trace execution of an application executable binary. Most amazing feature of a source debugger is its ability to list source code of the program being debugged; it can show the line or expression in the source code that resulted in a particular machine code instruction of a running program loaded in memory. This helps the programmer to analyze a program’s behavior in the high-level terms like source-level flow control constructs, procedure calls, named variables, etc, instead of machine instructions and memory locations. Source-level debugging also makes it possible to step through execution a line at a time and set source-level breakpoints. (If you do not have any prior hands on experience with source debuggers I suggest you to look at this before continuing with following.)
lets explore how source debuggers like gnu gdb work ? So how does a debugger know where to stop when you ask it to break at the entry to some function? How does it manage to find what to show you when you ask it for the value of a variable? The answer is – debug information. All modern compilers are designed to generate Debug information together with the machine code of the source file. It is a representation of the relationship between the executable program and the original source code. This information is encoded as per a pre-defined format and stored alongside the machine code. Many such formats were invented over the years for different platforms and executable files (aim of this article isn’t to survey the history of these formats, but rather to show how they work). Gnu compiler and ELF executable on Linux/ UNIX platforms use DWARF, which is widely used today as default debugging information format.
Word of Advice : Does an Application/ Kernel programmer need to know Dwarf?
Obvious answer to this question is a big NO. It is purely subject matter for developers involved in implementation of a Debugger tool. A normal Application developer using debugger tools would never need to learn or dig into binary files for debug information. This in no way adds any edge to your debugging skills nor adds any new skills into your armory. However, if you are a developer using debuggers for years and curious about how debuggers work read this document for an outline into debug information. If you are a beginner to systems programming or fresher’s learning programming I would suggest not to waste your time as you can safely ignore this.
ELF -DWARF sections
Gnu compiler generates debug information which is organized into various sections of the ELF object file. Let’s use the following source file for compiling and observing DWARF sections
All of the sections with naming debug_xxx are debugging information sections. Information in these sections is interpreted by source debugger like gdb. Each debug_ section holds specific information like
.debug_info core DWARF data containing DIEs
.debug_line Line Number Program
.debug_frame Call Frame Information
.debug_macinfo lookup table for global objects and functions
.debug_pubnames lookup table for global objects and functions
.debug_pubtypes lookup table for global types
.debug_loc Macro descriptions
.debug_abbrev Abbreviations used in the .debug_info section
.debug_aranges mapping between memory address and compilation
.debug_ranges Address ranges referenced by DIEs
.debug_str String table used by .debug_info
Debugging Information Entry (DIE)
Dwarf format organizes debug data in all of the above sections using special objects (program descriptive entities) called Debugging Information Entry (DIE). Each DIE has a tag filed whose value specifies its type, and a set of attributes. DIEs are interlinked via sibling and child links, and values of attributes can point at other DIEs. Now let’s dig into ELF file to view how a DIE looks like. We will begin our exploration with .debug_info section of the ELF file since core DIE’s are listed in it.
root@techveda:~# objdump --dwarf=info . /sample
Above operation shows long list of DIE’s. Let’s limit ourselves to relevant information
./sample: file format elf32-i386
Contents of the .debug_info section:
Compilation Unit @ offset 0x0:
Length: 0x110 (32-bit)
Abbrev Offset: 0
Pointer Size: 4
<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
< c> DW_AT_producer : (indirect string, offset: 0xe): GNU C 4.5.2
<10> DW_AT_language : 1 (ANSI C)
<11> DW_AT_name : (indirect string, offset: 0x44): sample.c
<15> DW_AT_comp_dir : (indirect string, offset: 0x71): /root
<19> DW_AT_low_pc : 0x80483c4
<1d> DW_AT_high_pc : 0x8048423
<21> DW_AT_stmt_list : 0x0
Each source file in the application is referred in dwarf terminology as a “compilation unit”. Dwarf data for each compilation unit (source file) starts with a compilation unit DIE. Above dump shows the first DIE’s and tag value “DW_TAG_compile_unit “. This DIE provides general information about compilation unit like source file name (DW_AT_name : (indirect string, offset: 0×44): sample.c), high level programming language used to write source file(DW_AT_language : 1 (ANSI C)) , directory of the source file(DW_AT_comp_dir : (indirect string, offset: 0×71): /root) , compiler and producer of dwarf data( DW_AT_producer : (indirect string, offset: 0xe): GNU C 4.5.2) , start virtual address of the compilation unit (DW_AT_low_pc : 0x80483c4), end virtual address of the unit (DW_AT_high_pc : 0×8048423).
Compilation Unit DIE is the parent for all the other DIE’s that describe elements of source file. Generally, the list of DIE’s that follow will describe data types, followed by global data, then the functions that make up the source file. The DIEs for variables and functions are in the same order in which they appear in the source file.
How does debugger locate Function Information ?
While using source debuggers we often instruct debugger to insert or place break point at some function, expecting the debugger to pause program execution at functions. To be able to perform this task, debugger must have some mapping between a function name in the high-level code and the address in the machine code where the instructions for this function begin. For this mapping information debuggers rely on DIE’s that describes specified function. DIE’s describing functions in a compilation unit are assigned tag value “DW_TAG_subprogram” subprogram as per dwarf terminology is a function.
In our sample application source we have two functions (main, add), dwarf should generate a “DW_TAG_subprogram” DIE’s for each function, these DIE attributes would define function mapping information that debugger needs for resolving machine code addresses with function name.
Each “DW_TAG_subprogam” DIE contains
source file or compilation unit in which function is located
line no in the source file where the function starts
Functions return type
Start address of the fucntion
End address of the function
Frame information of the function.
We now have accessed DIE description of function’s main and add. Let’s analyze attribute information of add fucntions DIE.
Function scope: DW_AT_external : 1 (scope external)
Function name: DW_AT_name : add
Source file or compilation unit in which function is located: DW_AT_decl_file : 1 (indicates 1st compilation unit which is sample.c)
line no in the source file where the function starts: DW_AT_decl_line : 4 ( indicates line no 4 in source file)
4int add(int x, int y)
6 return x + y;
10 int a = 10, b = 20;
11 int result;
12 int (*fp) (int, int);
14 fp = add;
15 result = (*fp) (a, b);
16 printf(" %dn", result);
Function’s source line no matched with DIE description of line no. let’s continue with rest of the attribute values
Functions return type: DW_AT_type : <0x4f>
As we have already understood that values of attributes can point to other DIE , here is an example of it. Value DW_AT_type : <0x4f> indicates that return type description is stored in other DIE at offset 0x4f.
How does debugger find program data (variables…) Information?
When the program hits assigned break point in a function, debugger pauses the program execution, at this time we can instruct debugger to show or print values of variables, by using debugger commands like print or display followed by variable name (ex: print a) How does debugger know where to find memory location of the variable ? Variables can be located in global storage, on the stack, and even in registers.The debugging information has to be able to reflect all these variations, and indeed DWARF does. As an example let’s take a look at complete DIE information set for main function.
Note the first number inside the angle brackets in each entry. This is the nesting level – in this example entries with <2> are children of the entry with <1>. main function has three integer variables a,b and result each of these variables are described with DW_TAG_variable nested DIE’s (0xc4, 0xd0, 0xdc). main function also has a function pointer fp described in DIE 0xea . Variable DIE attributes specify variable name (DW_AT_name), declaration line no in source function (DW_AT_decl_line ), pointer to address of DIE describing variables data type (DW_AT_type) and relative location of the variable within function’s frame (DW_AT_location).
To locate the variable in the memory image of the executing process, the debugger will look at the DW_AT_location attribute of DIE. For a its value is DW_OP_fbreg4 (esp):28. This means that the variable is stored at offset 28 from the top in the frame of containing function. The DW_AT_frame_base attribute of main has the value 0×38(location list), which means that this value actually has to be looked up in the location list section. Let’s look at it:
Offset column 0×38 values are the entries for main function variables. Each entry here describes possible frame base address with respect to where debugger may be paused by break point within function instructions; it specifies the current frame base from which offsets to variables are to be computed as an offset from a register. For x86, bpreg4 refers to esp and bpreg5 refers to ebp. Before analyzing further lets look at disassemble dump for main function
First two instructions deal with function’s preamble, function’s stack frame base pointer is determined after pre-amble instructions are executed. Ebp remains constant throughout function’s execution and esp keeps changing with data being pushed and popped from the stack frame. From the above dump instructions at offset 80483db and 80483e3 are assigning values 10 and 20 to variables a, and b. These variables are being accessed their offset in stack frame relative to location of current esp( variable a: 0x1c(%esp), variable b: 0×18(%esp)). Now let’s assume that break point was set after initializing a and b variables and program paused, and we have run print command to view conents of a or b variables. Debugger would access 3rd record of the main function’s dwarf debug.loc table since our break point falls between 080483d5 – 08048422 region of the function code.
Now as per records debugger will locate a with esp + 28 , b with esp +24 and so on…
Looking up line number information
We can set breakpoints mentioning line no’s Lets now look at how debuggers resolve line no’s to machine instruction’s? DWARF encodes a full mapping between lines in the C source code and machine code addresses in the executable. This information is contained in the .debug_line section and can be extracted using objdump
root@techveda:~# objdump --dwarf=decodedline ./sample
./sample: file format elf32-i386
Decoded dump of debug contents of section .debug_line:
File name Line number Starting address
sample.c 5 0x80483c4
sample.c 6 0x80483c7
sample.c 7 0x80483d0
sample.c 9 0x80483d2
sample.c 10 0x80483db
sample.c 14 0x80483eb
sample.c 15 0x80483f3
sample.c 16 0x804840c
sample.c 17 0x8048421
This dump shows machine instruction line no’s for our program. It is quite obvious that line no’s for non-executable statements of the source file need not be tracked by dwarf .we can map the above dump to our source code for analysis.
1 #include <stdio.h>
2 #include <stdlib.h>
4 int add(int x, int y)
6 return x + y;
8 int main()
10 int a = 10, b = 20;
11 int result;
12 int (*fp) (int, int);
14 fp = add;
15 result = (*fp) (a, b);
16 printf(" %dn", result);
From the above it should be clear of what debugger does it is instructed to set breakpoint at entry into function add, it would insert break point at line no 6 and pause after pre-amble of function add is executed.
if you are into implementation of debugging tools or involved in writing programs/tools that simulate debugger facilities/ read binary files , you may be interested in specific programming libraries libbfd or libdwarf.
Binary File Descriptor library (BFD) or libbfd as it is called provides ready to use functions to read into ELF and other popular binary files. BFD works by presenting a common abstract view of object files. An object file has a “header” with descriptive info; a variable number of “sections” that each has a name, some attributes, and a block of data; a symbol table; relocation entries; and so forth. Gnu binutils package tools like objdump, readelf and others have been written using these libraries.
Libdwarf is a C library intended to simplify reading (and writing) applications built with DWARF2, DWARF3 debug information.
Dwarfdump is an application written using libdwarf to print dwarf information in a human readable format. It is also open sourced and is copyrighted GPL. It provides an example of using libdwarf to read DWARF2/3 information as well as providing readable text output.