kk Blog: General Fundamentals

date [-d @int|str] [+%s|"+%F %T"]

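Instruction Encoding in Depth, Part 3: 64-bit Computing

http://www.pediy.com/kssd/pediy10/77824.html

How did AMD extend the 32-bit computing of the x86 architecture to 64-bit computing? How was it designed, and what are the details? That is what this part explains.

I. Hardware Programming Resources

  Knowing the programming resources that current processors provide is important, and it gives you material for further study. The x86 and x64 programming resources are described in turn below.

1. 32-bit programming resources of x86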
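●  Eight 32-bit general-purpose registers: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
   Parts of these registers are also addressable as eight 8-bit registers: AL, CL, DL, BL, AH, CH, DH, BH
   and as eight 16-bit registers: AX, CX, DX, BX, SP, BP, SI, DI
●  Six segment registers: ES, CS, SS, DS, FS, GS
●  The 32-bit EFLAGS flags register
●  The 32-bit instruction pointer register EIP
●  Eight 64-bit MMX registers
●  Eight 128-bit XMM registers
●  A 32-bit virtual address space
2. 64-bit programming resources of x64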
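●  The 32-bit general-purpose registers are extended to 64 bits, and eight new registers join the original eight, for 16 general-purpose registers in total: RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8, R9, R10, R11, R12, R13, R14, R15
●  The original six segment registers are kept, but their role is restricted
●  The 32-bit flags register is extended to the 64-bit flags register RFLAGS
●  The eight 64-bit MMX registers are unchanged
●  Eight XMM registers are added, for 16 XMM registers in total
●  A 64-bit virtual address space

II. Register Encodings (Register ID Values)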

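●  The sixteen 64-bit general-purpose registers are encoded as 0000 ~ 1111, that is, 0 ~ 15
    The eight 32-bit general-purpose registers are encoded as 000 ~ 111, that is, 0 ~ 7
●  The six segment registers are encoded as 000 ~ 101, that is, 0 ~ 5
●  The MMX registers are encoded as 000 ~ 111, that is, 0 ~ 7
●  The sixteen XMM registers are encoded as 0000 ~ 1111, that is, 0 ~ 15
    The eight XMM registers are encoded as 000 ~ 111, that is, 0 ~ 7

A register encoding is the binary number that corresponds to a register. The numbers are assigned in order, as the table below shows: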

RAX/ES/MMX0/XMM0  ->  0000
RCX/CS/MMX1/XMM1  ->  0001
RDX/SS/MMX2/XMM2  ->  0010
RBX/DS/MMX3/XMM3  ->  0011
RSP/FS/MMX4/XMM4  ->  0100
RBP/GS/MMX5/XMM5  ->  0101
RSI/MMX6/XMM6     ->  0110
RDI/MMX7/XMM7     ->  0111
R8/XMM8           ->  1000
R9/XMM9           ->  1001
R10/XMM10         ->  1010
R11/XMM11         ->  1011
R12/XMM12         ->  1100
R13/XMM13         ->  1101
R14/XMM14         ->  1110
R15/XMM15         ->  1111

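RAX ~ RDI share their encodings with EAX ~ EDI. One point stands out: the encodings of EAX ~ EDI are 3 bits wide, so why are the encodings of RAX ~ RDI 4 bits wide? The answer is the REX prefix described next, which extends the register encoding.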
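The relationship between a 4-bit register ID and the classic 3-bit field can be sketched in C. This is only an illustration and not part of the original article; the names reg_fields, split_reg_id and join_reg_id are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* A 4-bit register ID split into the 3 bits that fit the classic
 * encoding fields and the 1 extension bit carried by a REX prefix. */
typedef struct {
    uint8_t low3; /* bits 0..2, stored in ModRM, SIB or the opcode */
    uint8_t ext;  /* bit 3, stored in REX.R, REX.X or REX.B */
} reg_fields;

/* Split a register ID (0..15) into its two parts. */
reg_fields split_reg_id(uint8_t id)
{
    assert(id < 16);
    reg_fields f;
    f.low3 = id & 7;
    f.ext = (id >> 3) & 1;
    return f;
}

/* Recombine the extension bit and the 3-bit field into a register ID. */
uint8_t join_reg_id(uint8_t ext, uint8_t low3)
{
    assert(ext < 2 && low3 < 8);
    return (uint8_t)((ext << 3) | low3);
}
```

For r14 (ID 1110) the split gives an extension bit of 1 and a 3-bit field of 110, while rax (ID 0000) needs no extension bit at all.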

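III. The Cornerstone of 64-bit Computing (the REX Prefix)

  The 64-bit computing of the AMD64 architecture is designed as follows: the default operand size is 32 bits, while the address size defaults to 64 bits. This raises three problems that have to be solved: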

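●  Problem 1: to access a 64-bit register, there must be a mechanism that enables, or confirms, that the register being accessed is 64 bits wide.
●  Problem 2: when a memory operand is addressed through registers, there must likewise be a way to state that those registers are 64 bits wide, and to use the new registers for addressing.
●  Problem 3: how are the newly added registers accessed? There must be a way to select them as well.

Why, then, is the default operand size not 64 bits in 64-bit long mode? The reason is an architectural constraint: AMD64 was extended to 64 bits on top of x86, and x86 was never designed with a 64-bit extension in mind. The segment descriptor therefore has no flag that could select a 64-bit default operand size. The D bit of the code-segment descriptor (CS.D) selects 32 bits when set and 16 bits when clear, and those are the only two cases.

To stay compatible, AMD had to find another way. Its solution was to add a prefix that exists only in 64-bit mode and provides the ability to reach 64-bit operands. This is the REX prefix.

1. Format and meaning of the REX prefix

The REX prefix takes the values 40 ~ 4F (0100 0000 ~ 0100 1111). Consider which instructions the opcodes 40 ~ 4F originally encoded:
In x86, opcodes 40 ~ 47 are the instructions inc eax ~ inc edi, and 48 ~ 4F are dec eax ~ dec edi.
In 64-bit mode, 40 ~ 4F are no longer instructions; they have become prefixes.

1.1 Fields of the REX prefix byte: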
●  bit0:REX.B
●  bit1:REX.X
●  bit2:REX.R
●  bit3:REX.W
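●  bit4 ~ bit7: this field is fixed at 0100, so the high nibble is 4.

★ The REX.W field sets the operand size. When REX.W is 1 the operand is 64 bits wide; when it is 0 the operand has the default operand size. This solves the problem of accessing 64-bit registers.

★ The REX.R field extends the Reg field of the ModRM byte. Besides serving as an opcode extension, the Reg field of ModRM holds a register encoding, that is, a register ID. REX.R widens the original 3-bit register ID (000 ~ 111) to 4 bits (0000 ~ 1111), which solves the problem of accessing the new registers.

★ The REX.X field extends the Index field of the SIB byte, which holds the encoding (ID) of the index register. This makes the new registers usable as index registers in memory addressing.

★ The REX.B field extends the r/m field of the ModRM byte and the Base field of the SIB byte, which holds the encoding (ID) of the base register. This makes the new registers usable as base registers in memory addressing.

★ REX.B has one more role: when an instruction has no ModRM or SIB byte and the register ID is given directly in the opcode, REX.B extends that register ID.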
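The layout of the prefix byte can be written down as a small C helper. This is a sketch for illustration, with hypothetical function names rex_byte and is_rex; it only restates the bit positions listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Build a REX prefix byte with the layout 0100WRXB.
 * Each argument must be 0 or 1. */
uint8_t rex_byte(unsigned w, unsigned r, unsigned x, unsigned b)
{
    assert(w < 2 && r < 2 && x < 2 && b < 2);
    return (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
}

/* In 64-bit mode a byte in the range 40..4F, found where a prefix
 * may appear, is a REX prefix. The caller must know that the byte
 * is in prefix position; this only checks the high nibble. */
int is_rex(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}
```

With these helpers, rex_byte(1, 0, 0, 0) yields 48 and rex_byte(1, 1, 0, 0) yields 4c, the two prefix values used in the examples that follow.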

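1.2 A few examples:

Example 1: the instruction mov eax, 1 has a default operand size of 32 bits, and in 32-bit mode its machine code is b8 01 00 00 00 (5 bytes in total). Rewritten as a 64-bit instruction it becomes mov rax, 1.
  Its machine code is then 48 b8 01 00 00 00 00 00 00 00 (10 bytes in total)
Here 48 is the REX prefix byte, 0100 1000. Its fields are REX.W = 1, which makes the operand 64 bits wide, and REX.R = 0, REX.X = 0, REX.B = 0. This instruction needs no ModRM or SIB byte, and rax is not a new register, so the R, X and B fields are all 0.
  One point is worth thinking about: when REX.W is 0, the operand of this instruction is 32 bits wide. In other words, the machine code 40 b8 01 00 00 00 (6 bytes in total) gives the same result as b8 01 00 00 00; both are mov eax, 1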

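Example 2: the instruction mov rax, r14
  This is an ordinary 64-bit instruction whose source register is r14 and whose destination register is rax. Its machine code is:
   4c 89 f0 (3 bytes in total)
In this encoding 4c is the REX prefix, 89 is the opcode and f0 is the ModRM byte.
The REX prefix is 4c (0100 1100): REX.W = 1, REX.R = 1, and X and B are both 0.
The ModRM byte is F0 (11-110-000): Mod = 11, Reg = 110, R/M = 000. The meaning of ModRM is left for a later part. In this instruction, Reg holds the ID of the source operand r14.
r14 is one of the new registers, so REX.R is needed to extend the field and obtain the final register ID: 1 prepended to 110 gives 1110, the ID of r14, which yields the correct encoding.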
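Example 2 can be reproduced with a few lines of C. The sketch below is an illustration under stated assumptions: the function name encode_mov_r64_r64 is hypothetical, and it handles only the register-to-register form with opcode 89 (Mod = 11), nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "mov dst, src" for two 64-bit registers using opcode 89.
 * dst and src are 4-bit register IDs (0..15).
 * Writes 3 bytes into out and returns the length. */
size_t encode_mov_r64_r64(uint8_t dst, uint8_t src, uint8_t out[3])
{
    assert(dst < 16 && src < 16);
    uint8_t rex_r = (src >> 3) & 1; /* extends ModRM.Reg (source) */
    uint8_t rex_b = (dst >> 3) & 1; /* extends ModRM.r/m (destination) */

    out[0] = (uint8_t)(0x48 | (rex_r << 2) | rex_b);         /* REX.W = 1 */
    out[1] = 0x89;                                           /* opcode */
    out[2] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7)); /* Mod = 11 */
    return 3;
}
```

Calling encode_mov_r64_r64(0, 14, buf) fills buf with 4c 89 f0, matching the bytes derived by hand above. Swapping the operands moves the extension bit from REX.R to REX.B.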

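Example 3: back to the example from the preface: mov word ptr es:[eax + ecx * 8 + 0x11223344], 0x12345678
As an example, I rewrite it as a 64-bit instruction:
mov qword ptr [rax + rcx * 8 + 0x11223344], 0x12345678
The operand size becomes 64 bits, and the base and index registers become 64-bit registers, while the disp (offset) and the immediate stay unchanged. Why they do not change is covered in detail in a later part.

Now let us see how the instruction is encoded: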

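(1)  REX.W: set to 1 to select the 64-bit operand size.
(2)  REX.B: the base register is not one of the new registers, so it is 0.
(3)  REX.X: the index register is not one of the new registers either, so it is 0.
(4)  REX.R: neither the source nor the destination operand is a register, so it is 0.

So the REX prefix is 48 (0100 1000)
The complete encoding is therefore: 48 c7 84 c8 44 33 22 11 78 56 34 12 (12 bytes in total)

Example 4: changing the example above once more gives: mov qword ptr [r8 + r9 * 8 + 0x11223344], 0x12345678
Here is how this instruction is encoded: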

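(1)  REX.W: set to 1 to use the 64-bit operand size.
(2)  REX.B: the base register is r8, a new register, so it is 1.
(3)  REX.X: the index register is r9, a new register, so it is 1.
(4)  REX.R: there is no register operand, so it is 0.

So the REX prefix is 4b (0100 1011)
The complete encoding is therefore: 4b c7 84 c8 44 33 22 11 78 56 34 12 (12 bytes in total)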
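Examples 3 and 4 differ only in the REX.X and REX.B bits, which a short C sketch makes visible. The function name encode_mov_m64_imm32 and the helper put_le32 are hypothetical; the sketch covers only this one addressing form (opcode C7 /0, Mod = 10, SIB byte, 32-bit displacement) and is not a general encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Encode "mov qword ptr [base + index*scale + disp32], imm32".
 * base and index are 4-bit register IDs; ID 4 (rsp) cannot serve
 * as an index register. Writes 12 bytes and returns the length. */
size_t encode_mov_m64_imm32(uint8_t base, uint8_t index, unsigned scale,
                            uint32_t disp, uint32_t imm, uint8_t out[12])
{
    assert(base < 16 && index < 16 && index != 4);
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    uint8_t ss = (scale == 1) ? 0 : (scale == 2) ? 1 : (scale == 4) ? 2 : 3;
    uint8_t rex_x = (index >> 3) & 1; /* extends SIB.Index */
    uint8_t rex_b = (base >> 3) & 1;  /* extends SIB.Base */

    out[0] = (uint8_t)(0x48 | (rex_x << 1) | rex_b); /* REX.W = 1, REX.R = 0 */
    out[1] = 0xC7;                                   /* opcode */
    out[2] = 0x84; /* ModRM: Mod = 10, Reg = 000 (/0), R/M = 100 (SIB) */
    out[3] = (uint8_t)((ss << 6) | ((index & 7) << 3) | (base & 7));
    put_le32(out + 4, disp);
    put_le32(out + 8, imm);
    return 12;
}
```

With base = 0 (rax) and index = 1 (rcx) the first byte is 48; with base = 8 (r8) and index = 9 (r9) it is 4b. The remaining eleven bytes are identical, because the low three bits of the register IDs are the same in both cases.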

Example 5: consider the instruction mov r8, 1

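(1)  REX.W: set to 1
(2)  REX.B: the register ID is given in the opcode and belongs to a new register, so it is 1
(3)  REX.X: set to 0
(4)  REX.R: set to 0

So the REX prefix is 49 (0100 1001)
The complete encoding is therefore: 49 b8 01 00 00 00 00 00 00 00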
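Examples 1 and 5 use the form in which the register ID sits in the opcode byte itself. A minimal C sketch of that form follows; the function name encode_mov_r64_imm64 is hypothetical and the code covers only opcode B8+r with a 64-bit immediate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "mov reg, imm64" with opcode B8+r.
 * reg is a 4-bit register ID (0..15).
 * Writes 10 bytes into out and returns the length. */
size_t encode_mov_r64_imm64(uint8_t reg, uint64_t imm, uint8_t out[10])
{
    assert(reg < 16);
    out[0] = (uint8_t)(0x48 | ((reg >> 3) & 1)); /* REX.W = 1, REX.B = bit 3 */
    out[1] = (uint8_t)(0xB8 | (reg & 7));        /* low 3 bits in the opcode */
    for (int i = 0; i < 8; i++)                  /* little-endian immediate */
        out[2 + i] = (uint8_t)(imm >> (8 * i));
    return 10;
}
```

For reg = 0 (rax) this produces 48 b8 followed by the immediate, and for reg = 8 (r8) it produces 49 b8: the opcode byte is the same, and only REX.B tells the two registers apart.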

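2. Additional notes on the REX prefix

(1) Order: the REX prefix must come after any x86 prefixes and before the opcode.
(2) Conflicts: when an x86 prefix and a REX prefix appear together and conflict, the REX prefix takes priority over the x86 prefix.
For example, take the instruction mov r8, 1
What happens with the encoding 66 49 b8 01 00 00 00 00 00 00 00, which contains both 66 and 49? The 66 is ignored, so the result equals 49 b8 01 00 00 00 00 00 00 00.
The encoding 66 b8 01 00 00 00 00 00 00 00, by contrast, is decoded as mov ax, 1, which consumes only the first four bytes, 66 b8 01 00.
With the REX prefix 49 removed, the operand size is adjusted to 16 bits.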
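The priority rule can be stated as a tiny decision function. This is a sketch with a hypothetical name, operand_size_bits, and it applies only to instructions whose default operand size is 32 bits; passing 0 for rex means that no REX prefix is present.

```c
#include <assert.h>
#include <stdint.h>

/* Effective operand size, in bits, of a 64-bit-mode instruction whose
 * default operand size is 32 bits.
 * has_66_prefix: nonzero if the 66 operand-size prefix is present.
 * rex: the REX prefix byte (40..4F), or 0 if there is none. */
unsigned operand_size_bits(int has_66_prefix, uint8_t rex)
{
    assert(rex == 0 || (rex & 0xF0) == 0x40);
    if (rex != 0 && (rex & 0x08))
        return 64;          /* REX.W = 1 takes priority over 66 */
    if (has_66_prefix)
        return 16;
    return 32;
}
```

The two encodings discussed above map to operand_size_bits(1, 0x49), which gives 64, and operand_size_bits(1, 0), which gives 16.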
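(3) The original opcodes: because 40 ~ 4F now serve as REX prefixes, the original inc reg / dec reg instructions can only use the two opcodes FF /0 and FF /1.
(4) Default operand size
In 64-bit mode most instructions have a 32-bit default operand size, but some default to 64 bits: instructions that address memory through rsp, and near branches (near jmp / near call).
Take the instruction push r8:
The REX value is 41 (0100 0001), so REX.W is 0 and the default operand size is used.
Its encoding is 41 ff f0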

How do source debuggers work?

http://blog.techveda.org/howsourcedebuggerswork/

Application binaries are the result of compile and build operations performed on one or more source files. Program source files contain functions, data variables of various types (local, global, register, static), and abstract data objects, all written and indented with nested control structures according to the syntax of a high-level programming language (C/C++). Compilers translate the code in each source file into machine instructions (1s and 0s) for the target processor's instruction set architecture and store that code in object files. Linkers then combine the compiled object files with other precompiled object files (libraries, runtime binaries) to create the final application binary image, called an executable.

Source debuggers are tools used to trace the execution of an application executable. The most striking feature of a source debugger is its ability to list the source code of the program being debugged; it can show the line or expression in the source code that produced a particular machine instruction of a running program loaded in memory. This helps the programmer analyze a program's behavior in high-level terms such as source-level flow control constructs, procedure calls and named variables, instead of machine instructions and memory locations. Source-level debugging also makes it possible to step through execution one line at a time and to set source-level breakpoints. (If you have no prior hands-on experience with source debuggers, I suggest you look at this before continuing.)

Let's explore how source debuggers like GNU gdb work. How does a debugger know where to stop when you ask it to break at the entry to some function? How does it find what to show you when you ask it for the value of a variable? The answer is debug information. All modern compilers are designed to generate debug information together with the machine code of the source file. It represents the relationship between the executable program and the original source code. This information is encoded in a predefined format and stored alongside the machine code. Many such formats were invented over the years for different platforms and executable file formats (the aim of this article is not to survey the history of these formats, but to show how they work). The GNU compiler and ELF executables on Linux/UNIX platforms use DWARF, which is widely used today as the default debugging information format.

Word of advice: does an application/kernel programmer need to know DWARF?

The obvious answer to this question is a big NO. It is purely subject matter for developers who implement debugger tools. A typical application developer using debugger tools never needs to learn the format or dig into binary files for debug information. Doing so neither gives your debugging skills an edge nor adds new skills to your armory. However, if you are a developer who has used debuggers for years and are curious about how they work, read this document for an outline of debug information. If you are a beginner in systems programming or a fresher learning programming, I suggest you not spend time on it; you can safely ignore this.

ELF DWARF sections

The GNU compiler generates debug information that is organized into various sections of the ELF object file. Let's use the following source file to compile and observe the DWARF sections.

root@techveda:~# vim sample.c

#include <stdio.h>
#include <stdlib.h>
 
int add(int x, int y)
{
	return x + y;
}
int main()
{
	int a = 10, b = 20;
	int result;
	int (*fp) (int, int);
 
	fp = add;
	result = (*fp) (a, b);
	printf(" %d\n", result);
}

root@techveda:~# gcc -c -g sample.c -o sample.o
root@techveda:~# objdump -h sample.o | more

sample.o:     file format elf32-i386
 
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0000005f  00000000  00000000  00000034  2**2
              CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000094  2**2
              CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000094  2**2
              ALLOC
  3 .debug_abbrev 000000a2  00000000  00000000  00000094  2**0
              CONTENTS, READONLY, DEBUGGING
  4 .debug_info   00000114  00000000  00000000  00000136  2**0
              CONTENTS, RELOC, READONLY, DEBUGGING
  5 .debug_line   00000040  00000000  00000000  0000024a  2**0
              CONTENTS, RELOC, READONLY, DEBUGGING
  6 .rodata       00000005  00000000  00000000  0000028a  2**0
              CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 .debug_loc    00000070  00000000  00000000  0000028f  2**0
              CONTENTS, READONLY, DEBUGGING
  8 .debug_pubnames 00000023  00000000  00000000  000002ff  2**0
              CONTENTS, RELOC, READONLY, DEBUGGING
  9 .debug_pubtypes 00000012  00000000  00000000  00000322  2**0
              CONTENTS, RELOC, READONLY, DEBUGGING
 10 .debug_aranges 00000020  00000000  00000000  00000334  2**0
              CONTENTS, RELOC, READONLY, DEBUGGING
 11 .debug_str    000000b0  00000000  00000000  00000354  2**0
              CONTENTS, READONLY, DEBUGGING
 12 .comment      0000002b  00000000  00000000  00000404  2**0
              CONTENTS, READONLY
 13 .note.GNU-stack 00000000  00000000  00000000  0000042f  2**0
              CONTENTS, READONLY
 14 .debug_frame  00000054  00000000  00000000  00000430  2**2
              CONTENTS, RELOC, READONLY, DEBUGGING

All of the sections named .debug_xxx are debugging information sections. The information in these sections is interpreted by a source debugger like gdb. Each .debug_ section holds specific information:

.debug_info               core DWARF data containing DIEs
.debug_line               Line Number Program
.debug_frame              Call Frame Information
.debug_macinfo            Macro descriptions
.debug_pubnames           lookup table for global objects and functions
.debug_pubtypes           lookup table for global types
.debug_loc                Location lists referenced by location attributes
.debug_abbrev             Abbreviations used in the .debug_info section
.debug_aranges            mapping between memory address ranges and compilation units
.debug_ranges             Address ranges referenced by DIEs
.debug_str                String table used by .debug_info

Debugging Information Entry (DIE)

The DWARF format organizes debug data in the above sections using special objects (program descriptive entities) called Debugging Information Entries (DIEs). Each DIE has a tag field whose value specifies its type, and a set of attributes. DIEs are interlinked via sibling and child links, and attribute values can point at other DIEs. Now let's dig into the ELF file to see what a DIE looks like. We begin with the .debug_info section of the ELF file, since the core DIEs are listed in it.

root@techveda:~# objdump --dwarf=info ./sample

The above command prints a long list of DIEs. Let's limit ourselves to the relevant information.

./sample:     file format elf32-i386
 
Contents of the .debug_info section:
 
  Compilation Unit @ offset 0x0:
   Length:        0x110 (32-bit)
   Version:       2
   Abbrev Offset: 0
   Pointer Size:  4
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    < c>   DW_AT_producer    : (indirect string, offset: 0xe): GNU C 4.5.2   
    <10>   DW_AT_language    : 1  (ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x44): sample.c 
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0x71): /root
    <19>   DW_AT_low_pc      : 0x80483c4 
    <1d>   DW_AT_high_pc     : 0x8048423 
    <21>   DW_AT_stmt_list   : 0x0

Each source file in the application is referred to in DWARF terminology as a “compilation unit”. The DWARF data for each compilation unit (source file) starts with a compilation unit DIE. The dump above shows this first DIE, with the tag value “DW_TAG_compile_unit”. This DIE provides general information about the compilation unit, such as the source file name (DW_AT_name : (indirect string, offset: 0x44): sample.c), the high-level programming language of the source file (DW_AT_language : 1 (ANSI C)), the directory of the source file (DW_AT_comp_dir : (indirect string, offset: 0x71): /root), the compiler that produced the DWARF data (DW_AT_producer : (indirect string, offset: 0xe): GNU C 4.5.2), the start virtual address of the compilation unit (DW_AT_low_pc : 0x80483c4), and the end virtual address of the unit (DW_AT_high_pc : 0x8048423).

The compilation unit DIE is the parent of all the other DIEs that describe elements of the source file. Generally, the DIEs that follow describe data types, then global data, then the functions that make up the source file. The DIEs for variables and functions appear in the same order as in the source file.

How does the debugger locate function information?

While using source debuggers we often instruct the debugger to place a breakpoint at some function, expecting it to pause program execution on entry to that function. To perform this task, the debugger needs a mapping between a function name in the high-level code and the address in the machine code where the instructions for this function begin. For this mapping, debuggers rely on the DIEs that describe the function. DIEs describing functions in a compilation unit have the tag value “DW_TAG_subprogram”; a subprogram, in DWARF terminology, is a function.

Our sample source has two functions (main and add), so the compiler should generate one “DW_TAG_subprogram” DIE for each of them. The attributes of these DIEs define the mapping information that the debugger needs to resolve function names to machine code addresses.

Each “DW_TAG_subprogram” DIE contains

    function scope
    function name
    source file or compilation unit in which the function is located
    line number in the source file where the function starts
    function's return type
    start address of the function
    end address of the function
    frame information of the function
root@techveda:~# objdump --dwarf=info ./sample | grep "DW_TAG_subprogram"
 <1><72>: Abbrev Number: 4 (DW_TAG_subprogram)
 <1><a8>: Abbrev Number: 6 (DW_TAG_subprogram)
 
<1><72>: Abbrev Number: 4 (DW_TAG_subprogram)
    <73>   DW_AT_external    : 1 
    <74>   DW_AT_name        : add   
    <78>   DW_AT_decl_file   : 1 
    <79>   DW_AT_decl_line   : 4 
    <7a>   DW_AT_prototyped  : 1 
    <7b>   DW_AT_type        : <0x4f>  
    <7f>   DW_AT_low_pc      : 0x80483c4 
    <83>   DW_AT_high_pc     : 0x80483d1 
    <87>   DW_AT_frame_base  : 0x0    (location list)
    <8b>   DW_AT_sibling     : <0xa8>  
 
<1><a8>: Abbrev Number: 6 (DW_TAG_subprogram)
    <a9>   DW_AT_external    : 1 
    <aa>   DW_AT_name        : (indirect string, offset: 0x1a): main 
    <ae>   DW_AT_decl_file   : 1 
    <af>   DW_AT_decl_line   : 8 
    <b0>   DW_AT_type        : <0x4f>  
    <b4>   DW_AT_low_pc      : 0x80483d2 
    <b8>   DW_AT_high_pc     : 0x8048423 
    <bc>   DW_AT_frame_base  : 0x38   (location list)
    <c0>   DW_AT_sibling     : <0xf8>

We have now accessed the DIEs that describe the functions main and add. Let's analyze the attributes of the add function's DIE.

Function scope: DW_AT_external : 1 (scope external)

Function name: DW_AT_name : add

Source file in which the function is located: DW_AT_decl_file : 1 (file number 1 of the compilation unit, which is sample.c)

Line number in the source file where the function starts: DW_AT_decl_line : 4 (line 4 of the source file)

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  int add(int x, int y)
 5  {
 6          return x + y;
 7  }
 8  int main()
 9  {
10          int a = 10, b = 20;
11          int result;
12          int (*fp) (int, int);
13
14          fp = add;
15          result = (*fp) (a, b);
16          printf(" %d\n", result);
17  }

The function's source line number matches the line number in the DIE. Let's continue with the rest of the attribute values.

Function's return type: DW_AT_type : <0x4f>

As noted earlier, attribute values can point to other DIEs, and here is an example. The value DW_AT_type : <0x4f> indicates that the return type is described by another DIE at offset 0x4f.

<4f>: Abbrev Number: 3 (DW_TAG_base_type)
    <50>   DW_AT_byte_size   : 4 
    <51>   DW_AT_encoding    : 5  (signed)
    <52>   DW_AT_name        : int

This DIE describes the return type of the function add: according to its attribute values, the return type is a signed int of size 4 bytes.

Start address of the function: DW_AT_low_pc : 0x80483c4

End address of the function: DW_AT_high_pc : 0x80483d1

These values give the start and end virtual addresses of the machine instructions of the add function. We can verify them against the disassembly of the function:

080483c4 <add>:
 80483c4:   55                      push   %ebp
 80483c5:   89 e5                   mov    %esp,%ebp
 80483c7:   8b 45 0c                mov    0xc(%ebp),%eax
 80483ca:   8b 55 08                mov    0x8(%ebp),%edx
 80483cd:   8d 04 02                lea    (%edx,%eax,1),%eax
 80483d0:   5d                      pop    %ebp
 80483d1:   c3                      ret
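The name-to-address mapping a debugger builds from these DIEs can be sketched as a small lookup table in C. This is an illustration only: the subprogram struct and find_subprogram function are hypothetical names, and the table is filled by hand with the values from the objdump output instead of being parsed from .debug_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The parts of a DW_TAG_subprogram DIE needed to place a breakpoint. */
typedef struct {
    const char *name;  /* DW_AT_name */
    uint32_t low_pc;   /* DW_AT_low_pc */
    uint32_t high_pc;  /* DW_AT_high_pc */
} subprogram;

/* Values copied by hand from the dump of the sample program. */
static const subprogram subprograms[] = {
    { "add",  0x80483c4, 0x80483d1 },
    { "main", 0x80483d2, 0x8048423 },
};

/* Return the entry for a function name, or NULL if it is unknown. */
const subprogram *find_subprogram(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof subprograms / sizeof subprograms[0]; i++)
        if (strcmp(subprograms[i].name, name) == 0)
            return &subprograms[i];
    return NULL;
}
```

A request such as "break add" then reduces to find_subprogram("add"), whose low_pc field gives the address at which the function's instructions begin.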

How does the debugger find program data (variables…) information?

When the program hits a breakpoint in a function, the debugger pauses program execution. At that point we can ask the debugger to show the values of variables with commands like print or display followed by a variable name (e.g. print a). How does the debugger know where to find the memory location of the variable? Variables can be located in global storage, on the stack, and even in registers. The debugging information has to be able to reflect all these variations, and indeed DWARF does. As an example, let's look at the complete set of DIEs for the main function.

<1><a8>: Abbrev Number: 6 (DW_TAG_subprogram)
    <a9>   DW_AT_external    : 1 
    <aa>   DW_AT_name        : (indirect string, offset: 0x1a): main 
    <ae>   DW_AT_decl_file   : 1 
    <af>   DW_AT_decl_line   : 8 
    <b0>   DW_AT_type        : <0x4f>  
    <b4>   DW_AT_low_pc      : 0x80483d2 
    <b8>   DW_AT_high_pc     : 0x8048423 
    <bc>   DW_AT_frame_base  : 0x38   (location list)
    <c0>   DW_AT_sibling     : <0xf8>  
 <2><c4>: Abbrev Number: 7 (DW_TAG_variable)
    <c5>   DW_AT_name        : a 
    <c7>   DW_AT_decl_file   : 1 
    <c8>   DW_AT_decl_line   : 10
    <c9>   DW_AT_type        : <0x4f>  
    <cd>   DW_AT_location    : 2 byte block: 74 1c    (DW_OP_breg4 (esp): 28)
 <2><d0>: Abbrev Number: 7 (DW_TAG_variable)
    <d1>   DW_AT_name        : b 
    <d3>   DW_AT_decl_file   : 1 
    <d4>   DW_AT_decl_line   : 10
    <d5>   DW_AT_type        : <0x4f>  
    <d9>   DW_AT_location    : 2 byte block: 74 18    (DW_OP_breg4 (esp): 24)
 <2><dc>: Abbrev Number: 8 (DW_TAG_variable)
    <dd>   DW_AT_name        : (indirect string, offset: 0x6a): result 
    <e1>   DW_AT_decl_file   : 1 
    <e2>   DW_AT_decl_line   : 11
    <e3>   DW_AT_type        : <0x4f>  
    <e7>   DW_AT_location    : 2 byte block: 74 10    (DW_OP_breg4 (esp): 16)
 <2><ea>: Abbrev Number: 7 (DW_TAG_variable)
    <eb>   DW_AT_name        : fp
    <ee>   DW_AT_decl_file   : 1 
    <ef>   DW_AT_decl_line   : 12
    <f0>   DW_AT_type        : <0x10d> 
    <f4>   DW_AT_location    : 2 byte block: 74 14    (DW_OP_breg4 (esp): 20)

Note the first number inside the angle brackets in each entry. This is the nesting level: in this example, entries with <2> are children of the entry with <1>. The main function has three integer variables, a, b and result, each described by a nested DW_TAG_variable DIE (0xc4, 0xd0, 0xdc). main also has a function pointer fp, described by the DIE at 0xea. The attributes of a variable DIE specify the variable name (DW_AT_name), the declaration line number in the source file (DW_AT_decl_line), a reference to the DIE describing the variable's data type (DW_AT_type), and the location of the variable within the function's frame (DW_AT_location).

To locate the variable in the memory image of the executing process, the debugger looks at the DW_AT_location attribute of the DIE. For a, its value is DW_OP_breg4 (esp): 28. This means that the variable is stored at offset 28 from the address held in esp, within the frame of the containing function. The DW_AT_frame_base attribute of main has the value 0x38 (location list), which means that this value has to be looked up in the location list section. Let's look at it:

root@techveda:~# objdump --dwarf=loc sample
 
sample:     file format elf32-i386
 
Contents of the .debug_loc section:
 
    Offset   Begin    End      Expression
    00000000 080483c4 080483c5 (DW_OP_breg4 (esp): 4)
    00000000 080483c5 080483c7 (DW_OP_breg4 (esp): 8)
    00000000 080483c7 080483d1 (DW_OP_breg5 (ebp): 8)
    00000000 080483d1 080483d2 (DW_OP_breg4 (esp): 4)
    00000000 <End of list>
    00000038 080483d2 080483d3 (DW_OP_breg4 (esp): 4)
    00000038 080483d3 080483d5 (DW_OP_breg4 (esp): 8)
    00000038 080483d5 08048422 (DW_OP_breg5 (ebp): 8)
    00000038 08048422 08048423 (DW_OP_breg4 (esp): 4)
    00000038 <End of list>

The entries with 0x38 in the Offset column belong to the main function. Each entry describes the frame base for one range of addresses at which the debugger may be paused by a breakpoint within the function's instructions; it specifies the current frame base, from which offsets to variables are computed, as an offset from a register. For x86, DW_OP_breg4 refers to esp and DW_OP_breg5 refers to ebp. Before analyzing further, let's look at the disassembly of the main function.

080483d2 <main>:
 80483d2:   55                      push   %ebp
 80483d3:   89 e5                   mov    %esp,%ebp
 80483d5:   83 e4 f0                and    $0xfffffff0,%esp
 80483d8:   83 ec 20                sub    $0x20,%esp
 80483db:   c7 44 24 1c 0a 00 00    movl   $0xa,0x1c(%esp)
 80483e2:   00
 80483e3:   c7 44 24 18 14 00 00    movl   $0x14,0x18(%esp)
 80483ea:   00
 80483eb:   c7 44 24 14 c4 83 04    movl   $0x80483c4,0x14(%esp)
 80483f2:   08
 80483f3:   8b 44 24 18             mov    0x18(%esp),%eax
 80483f7:   89 44 24 04             mov    %eax,0x4(%esp)
 80483fb:   8b 44 24 1c             mov    0x1c(%esp),%eax
 80483ff:   89 04 24                mov    %eax,(%esp)
 8048402:   8b 44 24 14             mov    0x14(%esp),%eax
 8048406:   ff d0                   call   *%eax
 8048408:   89 44 24 10             mov    %eax,0x10(%esp)
 804840c:   b8 f0 84 04 08          mov    $0x80484f0,%eax
 8048411:   8b 54 24 10             mov    0x10(%esp),%edx
 8048415:   89 54 24 04             mov    %edx,0x4(%esp)
 8048419:   89 04 24                mov    %eax,(%esp)
 804841c:   e8 d3 fe ff ff          call   80482f4 <printf@plt>
 8048421:   c9                      leave 
 8048422:   c3                      ret

The first two instructions form the function's preamble; the function's stack frame base pointer is established once the preamble has executed. ebp remains constant throughout the function's execution, while esp keeps changing as data is pushed onto and popped from the stack frame. In the dump above, the instructions at addresses 80483db and 80483e3 assign the values 10 and 20 to the variables a and b. These variables are accessed by their offsets in the stack frame relative to the current esp (variable a: 0x1c(%esp), variable b: 0x18(%esp)). Now assume that a breakpoint was set after the initialization of a and b, the program has paused, and we have run the print command to view the contents of a or b. The debugger would use the 3rd record of the main function's .debug_loc list, since our breakpoint falls within the 080483d5 to 08048422 region of the function code.

00000038 080483d2 080483d3 (DW_OP_breg4 (esp): 4)
00000038 080483d3 080483d5 (DW_OP_breg4 (esp): 8)
00000038 080483d5 08048422 (DW_OP_breg5 (ebp): 8)--- 3rd record
00000038 08048422 08048423 (DW_OP_breg4 (esp): 4)
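Selecting the record that covers the current program counter is a simple range search. Here is a minimal sketch in C, assuming the list for main is held in a hand-filled array; the loc_entry struct and find_loc function are hypothetical names. Begin is treated as inclusive and End as exclusive, which is how DWARF location lists define their ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One location list entry: an address range plus "register + offset". */
typedef struct {
    uint32_t begin; /* first address covered by the entry */
    uint32_t end;   /* first address past the covered range */
    int reg;        /* DWARF register number: 4 = esp, 5 = ebp */
    int offset;     /* offset added to the register value */
} loc_entry;

/* The list at offset 0x38, copied by hand from the dump for main. */
static const loc_entry main_frame_base[] = {
    { 0x080483d2, 0x080483d3, 4, 4 },
    { 0x080483d3, 0x080483d5, 4, 8 },
    { 0x080483d5, 0x08048422, 5, 8 },
    { 0x08048422, 0x08048423, 4, 4 },
};

/* Return the entry whose range contains pc, or NULL if none does. */
const loc_entry *find_loc(const loc_entry *list, size_t n, uint32_t pc)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (pc >= list[i].begin && pc < list[i].end)
            return &list[i];
    return NULL;
}
```

For a program counter of 0x80483eb, which lies after the initialization of a and b, find_loc returns the third entry, the one based on ebp.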

Now, as per the records, the debugger will locate a at esp + 28, b at esp + 24, and so on.
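The 2-byte blocks in the DW_AT_location attributes (74 1c, 74 18 and so on) are tiny DWARF expressions: the opcode byte 0x74 is DW_OP_breg4, and the byte that follows is a signed LEB128 offset. The sketch below decodes only this single-operation case; the function names read_sleb128 and eval_breg are hypothetical, and the register values must be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one signed LEB128 number; *len receives the bytes consumed. */
int64_t read_sleb128(const uint8_t *p, size_t *len)
{
    uint64_t result = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte;

    do {
        assert(shift < 64);
        byte = p[i++];
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);

    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift; /* sign-extend negative values */
    *len = i;
    return (int64_t)result;
}

/* Evaluate a location block that consists of a single DW_OP_bregN
 * operation (opcodes 0x70..0x8f): address = regs[N] + offset. */
uint32_t eval_breg(const uint8_t *block, size_t size, const uint32_t regs[32])
{
    assert(size >= 2 && block[0] >= 0x70 && block[0] <= 0x8f);
    size_t len;
    int64_t offset = read_sleb128(block + 1, &len);
    assert(1 + len == size);
    return (uint32_t)(regs[block[0] - 0x70] + offset);
}
```

Given a register array in which entry 4 holds the current esp, evaluating the block 74 1c yields esp + 28, the address of a, and 74 18 yields esp + 24, the address of b.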

Looking up line number information

We can also set breakpoints by line number. How do debuggers resolve line numbers to machine instructions? DWARF encodes a full mapping between lines in the C source code and machine code addresses in the executable. This information is contained in the .debug_line section and can be extracted using objdump:

root@techveda:~# objdump --dwarf=decodedline ./sample
 
./sample:     file format elf32-i386
 
Decoded dump of debug contents of section .debug_line:
 
CU: sample.c:
File name                            Line number    Starting address                   
sample.c                                       5          0x80483c4
sample.c                                       6          0x80483c7
sample.c                                       7          0x80483d0
sample.c                                       9          0x80483d2
sample.c                                      10          0x80483db
sample.c                                      14          0x80483eb
sample.c                                      15          0x80483f3
sample.c                                      16          0x804840c
sample.c                                      17          0x8048421

This dump shows the source line numbers that have machine instructions in our program. Lines that hold no executable statements need not be tracked by DWARF. We can map the above dump onto our source code for analysis.

 1 #include <stdio.h>
 2 #include <stdlib.h>
 3
 4 int add(int x, int y)
 5 {
 6         return x + y;
 7 }
 8 int main()
 9 {
10         int a = 10, b = 20;
11         int result;
12         int (*fp) (int, int);
13
14         fp = add;
15         result = (*fp) (a, b);
16         printf(" %d\n", result);
17 }

From the above it should be clear what the debugger does when it is instructed to set a breakpoint at the entry to function add: it inserts the breakpoint at line 6 and pauses after the preamble of add has executed.
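Both directions of the line mapping can be sketched over the decoded table. The code below is an illustration with hypothetical names (line_row, addr_for_line, line_for_addr); it copies the rows from the objdump output by hand and assumes they are sorted by address, which they are in this dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned line; /* source line number */
    uint32_t addr; /* address of the first instruction for that line */
} line_row;

/* Rows copied from the decoded .debug_line dump of sample.c. */
static const line_row line_table[] = {
    { 5, 0x80483c4 },  { 6, 0x80483c7 },  { 7, 0x80483d0 },
    { 9, 0x80483d2 },  { 10, 0x80483db }, { 14, 0x80483eb },
    { 15, 0x80483f3 }, { 16, 0x804840c }, { 17, 0x8048421 },
};
static const size_t line_rows = sizeof line_table / sizeof line_table[0];

/* Address where a source line starts, or 0 if the line has no row
 * (for example a declaration that produces no instructions). */
uint32_t addr_for_line(unsigned line)
{
    assert(line > 0);
    for (size_t i = 0; i < line_rows; i++)
        if (line_table[i].line == line)
            return line_table[i].addr;
    return 0;
}

/* Source line for an address: the last row that starts at or before
 * pc. Returns 0 if pc lies before the first row. */
unsigned line_for_addr(uint32_t pc)
{
    unsigned line = 0;
    for (size_t i = 0; i < line_rows && line_table[i].addr <= pc; i++)
        line = line_table[i].line;
    return line;
}
```

A breakpoint on line 6 resolves to addr_for_line(6), which is 0x80483c7, the first instruction after the preamble of add. Going the other way, a paused program counter of 0x80483e3 maps back to line 10.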

What’s next?

If you implement debugging tools, or write programs or tools that simulate debugger facilities or read binary files, you may be interested in the programming libraries libbfd and libdwarf.

The Binary File Descriptor library (BFD), or libbfd as it is called, provides ready-to-use functions for reading ELF and other popular binary file formats. BFD works by presenting a common abstract view of object files. An object file has a “header” with descriptive information; a variable number of “sections”, each with a name, some attributes, and a block of data; a symbol table; relocation entries; and so forth. GNU binutils tools such as objdump have been written using this library.

libdwarf is a C library intended to simplify reading (and writing) the DWARF2 and DWARF3 debug information of applications.

dwarfdump is an application written using libdwarf that prints DWARF information in a human-readable format. It is also open source and licensed under the GPL. It serves as an example of using libdwarf to read DWARF2/3 information, as well as providing readable text output.

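GDB MI Interface Notes

GDB MI is the GNU Debugger Machine-Interface, an interaction protocol that GNU designed for use by other front ends.

Frankly, this interface is not especially well designed; it is merely usable. Its commands largely correspond to those of GDB/CLI, the GDB Command Line Interface.
To make machine interaction easier, it allows every command to carry a prefix, for example 901-stack-list-frames 0 99. The result that GDB returns then begins with the same prefix, 901, so commands and results can be matched by this prefix. It also keeps the result format uniform, unlike the wide variety of output formats the CLI produces. If you have installed GDB's Python plugins for variable formatting, however, exceptions can occur.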
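Matching results to commands starts with splitting the numeric prefix off each output line. Here is a minimal sketch in C; the function name mi_token is hypothetical, and the sample lines in the usage note are illustrative, not captured GDB output.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Split the optional numeric prefix off the front of a GDB/MI line.
 * Returns the prefix value, or -1 if the line has none; *rest is set
 * to the first character after the prefix. */
long mi_token(const char *line, const char **rest)
{
    assert(line != NULL && rest != NULL);
    if (!isdigit((unsigned char)line[0])) {
        *rest = line; /* no prefix: the record starts immediately */
        return -1;
    }
    char *end;
    long token = strtol(line, &end, 10);
    *rest = end;
    return token;
}
```

For a line such as 901^done,stack=[] the function returns 901 and leaves rest pointing at the ^ character, so a front end can look up the pending command with that number. Lines without a prefix, such as asynchronous records, return -1.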

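Some of its commands are synchronous and some are asynchronous, which is a major obstacle for front-end design. Besides command results, there are also many asynchronous messages and events. With all of these mixed together, handling them becomes quite troublesome.

This is the official GDB/MI documentation; study it closely if you are interested.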