kk Blog —— 通用基础


date [-d @int|str] [+%s|"+%F %T"]
netstat -ltunp
sar -n DEV 1

尾调用 尾递归

尾调用

尾调用大概是这么一种情况:
函数A里面调用了函数B。
函数B执行后,函数A马上返回。
也就是说调用函数B(并返回执行结果)是函数A所做的最后一件事。
相当于执行完函数B后,函数A也就执行完。
因此在执行函数B时,函数A的栈帧其实是已经大部分没用了,可以被修改或覆盖。编译器可以利用这一点进行优化,函数B执行后直接返回到函数A的调用者。

这里有一点需要注意:它是来自于编译器的优化。 这一点点的优化对于普通的尾调用来说可能意义不大,但是对于尾递归来说就很重要了。

尾递归

尾递归是一种基于尾调用形式的递归,相当于前面说的函数B就是函数A本身。

普通递归会存在的一些问题是,每递归一层就会消耗一定的栈空间,递归过深还可能导致栈溢出,同时又是函数调用,免不了push来pop去的,消耗时间。

采用尾调用的形式来实现递归,即尾递归,理论上可以解决普通递归的存在的问题。因为下一层递归所用的栈帧可以与上一层有重叠(利用jmp来实现),局部变量可重复利用,不需要额外消耗栈空间,也没有push和pop。

再次提一下,它的实际效果是来自于编译器的优化(目前的理解)。在使用尾递归的情况下,编译器觉得合适就会将递归调用优化成循环。目前大多数编译器 都是支持尾递归优化的。有一些语言甚至十分依赖于尾递归(尾调用),比如erlang或其他函数式语言(传说它们为了更好的处理 continuation-passing style)。

假如不存在优化,大家真刀真枪进行函数调用,那尾递归是毫无优势可言的,甚至还有缺点——代码写起来不直观。

现代编译器的优化能力很强悍,很多情况下编译器优化起来毫不手软(于是有了volatile)。但有时编译器又很傲娇,你需要主动给它一点信号,它才肯优化。尾递归就相当于传递一个信号给编译器,尽情优化我吧!

测试

为了验证尾递归优化,可以写个小程序进行测试。在VS2010下将使用/O1或以上的优化选项,一般就会尾递归优化。Gcc3.2.2(这版本好旧)下一般需要使用-O2优化选项。
先看看普通递归:

1
2
3
4
5
6
7
8
9
10
11
12
// 递归
int factorial(int n)
{
	if(n <= 2)
	{
		return 1;
	}
	else
	{
		return factorial(n-1) + factorial(n-2);
	}
}

其汇编代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
0100401371    push   %ebp
0200401372    mov    %esp,%ebp
0300401374    push   %ebx
0400401375    sub    $0x14,%esp
0500401378    cmpl   $0x2,0x8(%ebp)
060040137C    jg     0x401385 <factorial+20>
070040137E    mov    $0x1,%eax
0800401383    jmp    0x4013a4 <factorial+51>
0900401385    mov    0x8(%ebp),%eax
1000401388    dec    %eax
1100401389    mov    %eax,(%esp)
120040138C    call   0x401371 <factorial>
1300401391    mov    %eax,%ebx
1400401393    mov    0x8(%ebp),%eax
1500401396    sub    $0x2,%eax
1600401399    mov    %eax,(%esp)
170040139C    call   0x401371 <factorial>
18004013A1    lea    (%ebx,%eax,1),%eax
19004013A4    add    $0x14,%esp
20004013A7    pop    %ebx
21004013A8    leave
22004013A9    ret

在0040138C,0040139C这些位置,我们看到递归仍然是使用call指令,那么尾递归在汇编角度是怎么处理的呢?
尾递归:

1
2
3
4
5
6
7
8
9
10
11
int factorial_tail(int n,int acc1,int acc2)
{
  if (n < 2)
  {
      return acc1;
  }
  else
  {
      return factorial_tail(n-1,acc2,acc1+acc2);
  }
}

其汇编代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0100401381    push   %ebp
0200401382    mov    %esp,%ebp
0300401384    sub    $0x18,%esp
0400401387    cmpl   $0x1,0x8(%ebp)
050040138B    jg     0x401392 <factorial_tail+17>
060040138D    mov    0xc(%ebp),%eax
0700401390    jmp    0x4013b2 <factorial_tail+49>
0800401392    mov    0x10(%ebp),%eax
0900401395    mov    0xc(%ebp),%edx
1000401398    lea    (%edx,%eax,1),%eax
110040139B    mov    0x8(%ebp),%edx
120040139E    dec    %edx
130040139F    mov    %eax,0x8(%esp)
14004013A3    mov    0x10(%ebp),%eax
15004013A6    mov    %eax,0x4(%esp)
16004013AA    mov    %edx,(%esp)
17004013AD    call   0x401381 <factorial_tail>
18004013B2    leave
19004013B3    ret

在00401390位置上,尾递归是直接使用jmp实现循环跳转。

kprobes Documentation

https://www.kernel.org/doc/Documentation/kprobes.txt

Documentation/kprobes.txt

1.4 How Does Jump Optimization Work?

If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
is automatically set ‘y’ on x86/x86-64, non-preemptive kernel) and
the “debug.kprobes_optimization” kernel parameter is set to 1 (see
sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
instruction instead of a breakpoint instruction at each probepoint.

1.4.1 Init a Kprobe

When a probe is registered, before attempting this optimization,
Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
address. So, even if it’s not possible to optimize this particular
probepoint, there’ll be a probe there.

1.4.2 Safety Check

Before optimizing a probe, Kprobes performs the following safety checks:

  • Kprobes verifies that the region that will be replaced by the jump
    instruction (the “optimized region”) lies entirely within one function.
    (A jump instruction is multiple bytes, and so may overlay multiple
    instructions.)

  • Kprobes analyzes the entire function and verifies that there is no
    jump into the optimized region. Specifically:

    • the function contains no indirect jump;
    • the function contains no instruction that causes an exception (since
      the fixup code triggered by the exception could jump back into the
      optimized region – Kprobes checks the exception tables to verify this);
      and
    • there is no near jump to the optimized region (other than to the first
      byte).
  • For each instruction in the optimized region, Kprobes verifies that
    the instruction can be executed out of line.

1.4.3 Preparing Detour Buffer

Next, Kprobes prepares a “detour” buffer, which contains the following
instruction sequence:
- code to push the CPU’s registers (emulating a breakpoint trap)
- a call to the trampoline code which calls user’s probe handlers.
- code to restore registers
- the instructions from the optimized region
- a jump back to the original execution path.

1.4.4 Pre-optimization

After preparing the detour buffer, Kprobes verifies that none of the
following situations exist:
- The probe has either a break_handler (i.e., it’s a jprobe) or a post_handler.
- Other instructions in the optimized region are probed.
- The probe is disabled.

In any of the above cases, Kprobes won’t start optimizing the probe.
Since these are temporary situations, Kprobes tries to start
optimizing it again if the situation is changed.

If the kprobe can be optimized, Kprobes enqueues the kprobe to an
optimizing list, and kicks the kprobe-optimizer workqueue to optimize
it. If the to-be-optimized probepoint is hit before being optimized,
Kprobes returns control to the original instruction path by setting
the CPU’s instruction pointer to the copied code in the detour buffer
– thus at least avoiding the single-step.

1.4.5 Optimization

The Kprobe-optimizer doesn’t insert the jump instruction immediately;
rather, it calls synchronize_sched() for safety first, because it’s
possible for a CPU to be interrupted in the middle of executing the
optimized region(*). As you know, synchronize_sched() can ensure
that all interruptions that were active when synchronize_sched()
was called are done, but only if CONFIG_PREEMPT=n. So, this version
of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**)

After that, the Kprobe-optimizer calls stop_machine() to replace
the optimized region with a jump instruction to the detour buffer,
using text_poke_smp().

1.4.6 Unoptimization

When an optimized kprobe is unregistered, disabled, or blocked by
another kprobe, it will be unoptimized. If this happens before
the optimization is complete, the kprobe is just dequeued from the
optimized list. If the optimization has been done, the jump is
replaced with the original code (except for an int3 breakpoint in
the first byte) by using text_poke_smp().

()Please imagine that the 2nd instruction is interrupted and then
the optimizer replaces the 2nd instruction with the jump
address*
while the interrupt handler is running. When the interrupt
returns to original address, there is no valid instruction,
and it causes an unexpected result.

(**)This optimization-safety checking may be replaced with the
stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
kernel.

NOTE for geeks:
The jump optimization changes the kprobe’s pre_handler behavior.
Without optimization, the pre_handler can change the kernel’s execution
path by changing regs->ip and returning 1. However, when the probe
is optimized, that modification is ignored. Thus, if you want to
tweak the kernel’s execution path, you need to suppress optimization,
using one of the following techniques:
- Specify an empty function for the kprobe’s post_handler or break_handler.
or
- Execute ‘sysctl -w debug.kprobes_optimization=n’

………………..

5. Kprobes Features and Limitations

Kprobes allows multiple probes at the same address. Currently,
however, there cannot be multiple jprobes on the same function at
the same time. Also, a probepoint for which there is a jprobe or
a post_handler cannot be optimized. So if you install a jprobe,
or a kprobe with a post_handler, at an optimized probepoint, the
probepoint will be unoptimized automatically.

In general, you can install a probe anywhere in the kernel.
In particular, you can probe interrupt handlers. Known exceptions
are discussed in this section.

The register_probe functions will return -EINVAL if you attempt
to install a probe in the code that implements Kprobes (mostly
kernel/kprobes.c and arch/
/kernel/kprobes.c, but also functions such
as do_page_fault and notifier_call_chain).

If you install a probe in an inline-able function, Kprobes makes
no attempt to chase down all inline instances of the function and
install probes there. gcc may inline a function without being asked,
so keep this in mind if you’re not seeing the probe hits you expect.

A probe handler can modify the environment of the probed function
– e.g., by modifying kernel data structures, or by modifying the
contents of the pt_regs struct (which are restored to the registers
upon return from the breakpoint). So Kprobes can be used, for example,
to install a bug fix or to inject faults for testing. Kprobes, of
course, has no way to distinguish the deliberately injected faults
from the accidental ones. Don’t drink and probe.

Kprobes makes no attempt to prevent probe handlers from stepping on
each other – e.g., probing printk() and then calling printk() from a
probe handler. If a probe handler hits a probe, that second probe’s
handlers won’t be run in that instance, and the kprobe.nmissed member
of the second probe will be incremented.

As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
the same handler) may run concurrently on different CPUs.

Kprobes does not use mutexes or allocate memory except during
registration and unregistration.

Probe handlers are run with preemption disabled. Depending on the
architecture and optimization state, handlers may also run with
interrupts disabled (e.g., kretprobe handlers and optimized kprobe
handlers run without interrupt disabled on x86/x86-64). In any case,
your handler should not yield the CPU (e.g., by attempting to acquire
a semaphore).

Since a return probe is implemented by replacing the return
address with the trampoline’s address, stack backtraces and calls
to __builtin_return_address() will typically yield the trampoline’s
address instead of the real return address for kretprobed functions.
(As far as we can tell, __builtin_return_address() is used only
for instrumentation and error reporting.)

If the number of times a function is called does not match the number
of times it returns, registering a return probe on that function may
produce undesirable results. In such a case, a line:
kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
gets printed. With this information, one will be able to correlate the
exact instance of the kretprobe that caused the problem. We have the
do_exit() case covered. do_execve() and do_fork() are not an issue.
We’re unaware of other specific cases where this could be a problem.

If, upon entry to or exit from a function, the CPU is running on
a stack other than that of the current task, registering a return
probe on that function may produce undesirable results. For this
reason, Kprobes doesn’t support return probes (or kprobes or jprobes)
on the x86_64 version of __switch_to(); the registration functions
return -EINVAL.

On x86/x86-64, since the Jump Optimization of Kprobes modifies
instructions widely, there are some limitations to optimization. To
explain it, we introduce some terminology. Imagine a 3-instruction
sequence consisting of a two 2-byte instructions and one 3-byte
instruction.

1
2
3
4
5
6
        IA  
         |  
[-2][-1][0][1][2][3][4][5][6][7]  
        [ins1][ins2][  ins3 ]  
        [<-     DCR       ->]  
           [<- JTPR ->]  

ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA: Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region

The instructions in DCR are copied to the out-of-line buffer
of the kprobe, because the bytes in DCR are replaced by
a 5-byte jump instruction. So there are several limitations.

a) The instructions in DCR must be relocatable.
b) The instructions in DCR must not include a call instruction.
c) JTPR must not be targeted by any jump or call instruction.
d) DCR must not straddle the border between functions.

Anyway, these limitations are checked by the in-kernel instruction
decoder, so you don’t need to worry about that.

6. Probe Overhead

On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
microseconds to process. Specifically, a benchmark that hits the same
probepoint repeatedly, firing a simple handler each time, reports 1-2
million hits per second, depending on the architecture. A jprobe or
return-probe hit typically takes 50-75% longer than a kprobe hit.
When you have a return probe set on a function, adding a kprobe at
the entry to that function adds essentially no overhead.

Here are sample overhead figures (in usec) for different architectures.
k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe
on same function; jr = jprobe + return probe on same function

i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40

x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07

ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99

6.1 Optimized Probe Overhead

Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
process. Here are sample overhead figures (in usec) for x86 architectures.
k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.

i386: Intel® Xeon® E5410, 2.33GHz, 4656.90 bogomips
k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33

x86-64: Intel® Xeon® E5410, 2.33GHz, 4656.90 bogomips
k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30

深入浅出指令编码之三:64位计算

http://www.pediy.com/kssd/pediy10/77824.html

AMD 在x86体系的32位计算扩展为64位计算,这是通过什么来实现的?它是怎样设计的?具体细节是什么?这就是这一节要讲解的。

一、硬件编程资源

  了解现在processor提供编程资源是很重要的,对要进一步学习提供材料,下面分别讲解x86的编程资源和x64的编程资源。

1、x86的32位编程资源
1
2
3
4
5
6
7
8
9
●  8个32位通用寄存器:EAX、ECX、EDX、EBX、ESP、EBP、ESI、EDI
   这些寄存器还可分解为8个8位寄存器:AL、CL、DL、BL、AH、CH、DH、BH
   和8个16位寄存器:AX、CX、DX、BX、SP、BP、SI、DI
●  6个段寄存器:ES、CS、SS、DS、FS、GS
●  32位的EFLAGS 标志位寄存器
●  32位的指令指针寄存器EIP
●  8个64位MMX寄存器
●  8个128位XMM寄存器
●  还有就是32位的寻址空间(Virtual Address Space)
2、x64的64位编程资源
1
2
3
4
5
6
●  32位通用寄存器被扩展至64位,除了原有的8个寄存器,新增8个寄存器,共16个通用寄存器:RAX、RCX、RDX、RBX、RSP、RBP、RSI、RDI、R8、R9、R10、R11、R12、R13、R14、R15
●  保留了原有的6个寄存器,但是作用被限制
●  32位的标志寄存器被扩展为64位的标志寄存器RELAGS
●  8个64位MMX寄存器不变
●  新增8个XMM寄存器,共16个XMM寄存器
●  还有就是64位的寻址空间(Virtaul Address Space)

二、寄存器编码(或者说ID值)

1
2
3
4
5
6
●  16个64位通用寄存器是: 0000 ~ 1111,也就是:0 ~ 15
    8个32位通用寄存器是: 000 ~ 111 也就是:0 ~ 7
●  6个段寄存器的编码是:000 ~ 101 也就是:0 ~ 5
●  MMX寄存器编码是: 000 ~ 111 也就是:0 ~ 7
●  16个XMM寄存器编码是: 0000 ~ 1111 也就是:0 ~ 15
    8个XMM寄存器编码是:000 ~ 111 也就是:0 ~ 7

所谓寄存器编码是寄存器对应的二进制编码,按顺序来定义,看下面的表格:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
RAX/ES/MMX0/XMM0 ->  0000
RCX/CS/MMX1/XMM1  ->  0001
RDX/SS/MMX2/XMM2  ->  0010
RBX/DS/MMX3/XMM3  ->  0011
RSP/FS/MMX4/XMM4   ->  0100
RBP/GS/MMX5/XMM5  ->  0101
RSI/MMX6/XMM6      ->  0110
RDI/MMX7/XMM7     ->  0111
R8/XMM8   ->  1000
R9/XMM9   ->  1001
R10/XMM10  ->  1010
R11/XMM11  ->  1011
R12/XMM12  ->  1100
R13/XMM13  ->  1101
R14/XMM14  ->  1110
R15/XMM15  ->  1111

RAX ~ RDI 与 EAX ~ EDI 的编码是相同的,这里有一个情况是,EAX ~ EDI的编码是3位,为什么RAX~RDI的编码却是4位呢?这就是下面要讲到的REX prefix会将寄存器编码进行扩展。

三、 开启64位计算的基石(REX prefix)

  AMD64体系的64位计算是这样设计:操作数的Default Operand-Size是32位,而Address-Size是固定为64位的,这里就引发3个问题要解决的:

1
2
3
●  问题1:当要访问是64位的寄存器时,那么必须要有一种机制去开启或者说确认访问的寄存器是64位的。
●  问题2:而要访问的内存操作数寄存器寻址的话,那么也必须要去开启或确认寄存器是64位的以及访问新增寄存的问题。
●  问题3:如何去访问新增加的几个寄存器呢?那么也必须要有方法去访问增加的寄存器?

那么在64位Long模式下,为什么不将操作数的Default Operand-Size设计为64位呢?那是由于体系限制,本来AMD64就是在 x86的基础上扩展为64位的。x86体系当初设计时就没想到有会被扩展到64位的时候。所以在Segment-Descriptor(段描述符)里就没 有可以扩展为64位的标志位。DS.D位只有置1时是32位,清0时为16位,这两种情况。

AMD在保持兼容的大提前下,只好令谋计策,AMD的解决方案是:增加一个64位模式下特有Prefix,以起到扩展访问64位的能力。这就是 REX prefix。

1、REX prefix 的具体格式及含义

REX prefix的取值范围是:40 ~ 4F(0100 0000 ~ 0100 1111),来看下原来opcode取值范围的40 ~ 4F的是什么指令:
Opcode为40 ~ 47在x86下是inc eax ~ inc edi 指令,48 ~ 4F在x86下是dec eax ~ dec edi 指令。
在64位模式下,40 ~ 4F 就已经不是指令而变身为 prefix了。

1.1 REX prefix字节的组成部分:
1
2
3
4
5
●  bit0:REX.B
●  bit1:REX.X
●  bit2:REX.R
●  bit3:REX.W
●  bit4 ~ bit7:此域固定为0100,也就是高半字节为4。

★ REX.W域是设定操作数的大小(Operand-Size),当REX.W为1时,操作数是64位,为0时,操作数的大小是缺省大小(Default Opeand-Size)。这就解决了访问64位寄存器的问题。

★ REX.R域是用于扩展ModRM字节中的R(Reg)域,ModRM中的Reg域除了对Opcode的补充外,是用来定义寄存器的编码,即寄存器 值。REX.R将原来3位的寄存器ID(000 ~ 111)扩展为4位(0000 ~ 1111),这就解决了访新增寄存器的问题。

★ REX.X域是用于扩展SIB字节中的Index域,SIB中的Index域是指明Index 寄存器的编码,即ID值。这就解决了寄存器寻址内存中使用新增寄存器的问题。

★ REX.B域是用于扩展ModRM字节中的r/m域和SIB中的Base域,SIB中的Base域指明Base寄存器编码即ID值。这就解决了寄存器寻址内存中使用新增寄存器的问题。

★ REX.B域的另一个作用是:若指令中没有ModRM和SIB,也就是在Opcode中直接给出寄存器ID值,REX.B起到扩展寄存器的作用。

1.2、下面使用几个例子来说明问题:

例1:指令 mov eax, 1   这条指令的Default Operand-Size是32位,在32位下它的机器编码是:b8 01 00 00 00(其5个字节)若改成64位编码时,变成 mov rax, 1。
  此时,它的机器编码是 48 b8 01 00 00 00 00 00 00 00 (共10个字节)
在这里48 就是 REX prefix字节,即:0100 1000 它的各个域值是:REX.W = 1,定义操作数是64位的,REX.R = 0、REX.X = 0、 REX.B = 0 这条指令不需要ModRM和SIB字节,所以RXB域都为0。
  这里有个值得思考的地方,若 REX.W域为0时,这条指令的操作数是32位的,也就是说,机器编码:40 b8 01 00 00 00(其6个字节)是与 b8 01 00 00 00结果一样的,都是mov eax, 1

例2:指令:mov rax, r14
  这是一条常见64位指令,源寄存器是r14,目标寄存器是rax 它的机器编码是:
   4c 89 f0(共3个字节)
在这个编码里4c是REX prefix,89是opcode,f0是ModRM。
REX Prefix的值是4c (0100 1100),其中REX.W = 1,REX.R = 1,XB都为0。
ModRM的值是F0(11-110-000),Mod=11,Reg=110, R/M = 000,在这里先不讲ModRM的含义,在后面的章节再详述。在这条指令里,Reg表示源操作数r14的ID值。
r14是新增加寄存器,所以需要REX.R进行扩展,得出最终寄存器的ID值,1+110 = 1110,这是r14寄存器的ID值,从而得出正确的编码。

例3:回到序言里的例子:mov word ptr es:[eax + ecx * 8 + 0x11223344], 0x12345678
作为例子,我将它改为64位指令,如下:
mov qword ptr [rax + rcx * 8 + 0x11223344], 0x12345678
操作数大小变为64位,而base 寄存器和index寄存器都改为64位,disp(offset)和imme(值不变),为啥不变?在以后的章节会有详述。

好,现在来看看指令怎么译:

1
2
3
4
(1)  REX.W: 要置为 1 以扩展64位大小。
(2)  REX.B:  由于base不是新增的寄存器,所以置为 0
(3)  REX.X: 由于index 也不是新增的寄存器,所以置为 0
(4)  REX.R: 源操作数和目标作数不是寄存器,所以置为 0

所以,REX prefix就等于 48(0100 1000)
故,整条指令编码是:48 c7 84 c8 44 33 22 11 78 56 34 12(共12个字节)

例4:我将上面的例子再改一改,变为:mov qword ptr [r8 + r9 * 8 + 0x11223344], 0x12345678
那么,看看这指令怎么译:

1
2
3
4
(1)  REX.W:置1,使用64位大小
(2)  REX.B:base寄存器是r8,是新增寄存器,所以置为1
(3)  REX.X:index寄存器是r9,是新增寄存器,所以置为1
(4)  REX.R:操作数中没有寄存器,所在置为0

所以,REX prefix就等于(0100 1011)4b
故,整条指令编码是:4b c7 84 c8 44 33 22 11 78 56 34 12(共12个字节)

例5:看看这条指令 mov r8, 1

1
2
3
4
(1)  REX.W:置1
(2)  REX.B:访问Opcode中的寄存器ID值,它是新增寄存器,所为置1
(3)  REX.X:置0
(4)  REX.R:置0

所以,REX是 49(0100 1001)
故整条指令编码是:49 b8 01 00 00 00 00 00 00 00

2、REX prefix补充说明

(1)关于顺序:REX一定是在x86 prefix之后,而在Opcode之前。
(2)关于冲突:当x86 prefix和 REX prefix同时出现,而又出现冲突时,REX的优先权要优于 x86 prefix,
举个例子:指令 mov r8, 1
若出现以下编码怎么办:66 49 b8 01 00 00 00 00 00 00 00 既有66 又有49,那么结果66会被忽略,也就等于:49 b8 01 00 00 00 00 00 00 00。
而对于 66 b8 01 00 00 00 00 00 00 00 这个编码来说:会被解析为:mov ax, 1
去掉了49这个REX prefix操作数被调整为 16 位。
(3)关于原来Opcode码,由于40 ~ 4F被作为 REX prefix,那么原指令inc reg/dec reg,只能使用 FF/0 和 FF/1 这两个Opcode了。
(4)缺省操作数大小(Default Operand-Size)
64位绝大部分缺省操作数是32位的,但有一部分是64位的,依赖于rsp的寻址和短跳转(near jmp/near call)是64位的。
如下指令:push r8
REX值是41(0100 0001),即REX.W为0,使用default opearnd-size
它的编码是 41 ff f0