kk Blog: General Fundamentals

An Analysis of the CC_STACKPROTECTOR Patch for Preventing Kernel Stack Overflows

2015-11-17 16:01:00

http://blog.aliyun.com/1126

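A kernel stack overflow usually takes one of two forms. In the first, the function call chain grows past the THREAD_SIZE limit of the kernel stack; this is an overrun at the bottom of the stack. In the second, a buffer on the stack is accessed out of bounds; this is an overrun at the top of the stack.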

Detecting overruns at the bottom of the stack

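Take the arm platform as an example, where the kernel stack size THREAD_SIZE is 8 KB. If the call chain gets too deep, or a single frame allocates too much space on the stack, the stack overruns its bounds. The struct thread_info may then be corrupted. In the mild case the kernel panics; in the severe case kernel data is overwritten and the kernel keeps running.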
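The text does not show how such an overrun is actually caught. One common approach, sketched here as a user-space model, is to keep a known guard word just above thread_info and check later whether it is still intact. The guard value is the STACK_END_MAGIC constant that Linux uses for this purpose; the structure and function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define THREAD_SIZE     8192u        /* 8 KB kernel stack, as on arm in the text */
#define STACK_END_MAGIC 0x57AC6E9Du  /* guard word placed at the end of the stack */

/* Hypothetical model: thread_info sits at the lowest address and the
 * stack grows down toward it. */
struct thread_info_model {
	uint32_t flags;
	uint32_t preempt_count;
};

union thread_union_model {
	struct thread_info_model info;
	uint32_t stack[THREAD_SIZE / sizeof(uint32_t)];
};

/* Index of the first stack word just above thread_info. */
size_t stack_end_index(void)
{
	return (sizeof(struct thread_info_model) + sizeof(uint32_t) - 1) /
	       sizeof(uint32_t);
}

/* Clear the whole area and plant the guard word. */
void init_stack_model(union thread_union_model *t)
{
	memset(t, 0, sizeof(*t));
	t->stack[stack_end_index()] = STACK_END_MAGIC;
}

/* Returns 1 if something has overwritten the guard word. */
int stack_end_corrupted(const union thread_union_model *t)
{
	return t->stack[stack_end_index()] != STACK_END_MAGIC;
}
```

A check like this only notices the damage after the fact, and only if the overrun actually touched the guard word.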

Detecting overruns at the top of the stack

For overruns at the top of the stack, gcc provides support. Enabling the kernel option CONFIG_CC_STACKPROTECTOR turns on the compiler flag -fstack-protector.

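CC_STACKPROTECTOR is a patch that Tejun Heo submitted to the mainline kernel in 2009 to guard against kernel stack overflows. The default config leaves the option off; to enable it, set CONFIG_CC_STACKPROTECTOR=y in the .config file when building the kernel. In the future the Feitian kernel could turn this option on to defend against 0day attacks that exploit kernel stack overflows.

The patch works as follows: the kernel prepares a value ahead of time, the stack canary, and a copy of it is placed right after the buffers on the stack. You can think of it as a sentinel: when a buffer overflows, the overflow inevitably clobbers the stack canary, and once the canary is found to be damaged the kernel goes down immediately. So how is an overwritten canary detected? That part is actually done by gcc: the kernel passes -fstack-protector to gcc at build time, so let's first look at what this flag does.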

First, a simple program with an overflow:

[wzt@localhost csaw]$ cat test.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void test(void)
{
	char buff[64];

	memset(buff, 0x41, 128);     // writes 128 bytes into a 64-byte buffer, so it is certain to overflow
}

int main(void)
{
	test();

	return 0;
}

[wzt@localhost csaw]$ gcc -o test test.c
[wzt@localhost csaw]$ ./test
Segmentation fault

Let's disassemble it:

[wzt@localhost csaw]$ objdump -d test > hex

08048384 <test>:
 8048384:       55                      push   %ebp
 8048385:       89 e5                   mov    %esp,%ebp
 8048387:       83 ec 58                sub    $0x58,%esp
 804838a:       c7 44 24 08 80 00 00    movl   $0x80,0x8(%esp)
 8048391:       00
 8048392:       c7 44 24 04 41 00 00    movl   $0x41,0x4(%esp)
 8048399:       00
 804839a:       8d 45 c0                lea    0xffffffc0(%ebp),%eax
 804839d:       89 04 24                mov    %eax,(%esp)
 80483a0:       e8 e3 fe ff ff          call   8048288 <memset@plt>
 80483a5:       c9                      leave
 80483a6:       c3                      ret

Nothing special. Now let's add -fstack-protector and look again:

[wzt@localhost csaw]$ gcc -o test test.c -fstack-protector
[wzt@localhost csaw]$ ./test
*** stack smashing detected ***: ./test terminated
Aborted

This time the program prints a message saying the stack was smashed and then exits on its own.

Disassemble it again:

[wzt@localhost csaw]$ objdump -d test > hex1

080483d4 <test>:
 80483d4:       55                      push   %ebp
 80483d5:       89 e5                   mov    %esp,%ebp
 80483d7:       83 ec 68                sub    $0x68,%esp
 80483da:       65 a1 14 00 00 00       mov    %gs:0x14,%eax
 80483e0:       89 45 fc                mov    %eax,0xfffffffc(%ebp)
 80483e3:       31 c0                   xor    %eax,%eax
 80483e5:       c7 44 24 08 80 00 00    movl   $0x80,0x8(%esp)
 80483ec:       00
 80483ed:       c7 44 24 04 41 00 00    movl   $0x41,0x4(%esp)
 80483f4:       00
 80483f5:       8d 45 bc                lea    0xffffffbc(%ebp),%eax
 80483f8:       89 04 24                mov    %eax,(%esp)
 80483fb:       e8 cc fe ff ff          call   80482cc <memset@plt>
 8048400:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
 8048403:       65 33 05 14 00 00 00    xor    %gs:0x14,%eax
 804840a:       74 05                   je     8048411 <test+0x3d>
 804840c:       e8 db fe ff ff          call   80482ec <__stack_chk_fail@plt>
 8048411:       c9                      leave
 8048412:       c3                      ret

With -fstack-protector, gcc places a few extra instructions at the start of the function:

 80483d7:       83 ec 68                sub    $0x68,%esp
 80483da:       65 a1 14 00 00 00       mov    %gs:0x14,%eax
 80483e0:       89 45 fc                mov    %eax,0xfffffffc(%ebp)

This copies the value at offset 0x14 in the gs segment to ebp-4, that is, the slot right after the first local variable.

After the call to memset comes the following assembly:

 80483fb:       e8 cc fe ff ff          call   80482cc <memset@plt>
 8048400:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
 8048403:       65 33 05 14 00 00 00    xor    %gs:0x14,%eax
 804840a:       74 05                   je     8048411 <test+0x3d>
 804840c:       e8 db fe ff ff          call   80482ec <__stack_chk_fail@plt>

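After memset returns, gcc has to check whether the operation overflowed the stack. It compares the value saved at ebp-4 with the original value at %gs:0x14; if they differ, the stack has overflowed and __stack_chk_fail is called. That function is implemented by glibc: it prints the message we saw above, and then the process exits.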

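This example shows that with -fstack-protector gcc checks for stack overflows automatically, but on one condition: the kernel must have set up a reference value for each process in advance at %gs:0x14. This value is called the stack canary. So stack overflow protection is a joint effort between the kernel and gcc.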

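gcc's job is just to emit a few instructions and compare against the value at %gs:0x14. The main work is how the kernel sets up the stack canary, which is exactly what the CC_STACKPROTECTOR patch implements. Let's look closely at how the patch does it.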
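The prologue and epilogue checks described above can be mimicked by hand. The following is only a simulation under stated assumptions: the "frame" is a plain byte array with a 64-byte buffer followed by the canary slot, and all names (frame_model, frame_enter, frame_fill, frame_leave_ok) are hypothetical. It is not what gcc generates, just the same idea written out in C.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64

/* Hypothetical frame layout: the buffer first, the canary right after it. */
struct frame_model {
	unsigned char bytes[BUF_SIZE + sizeof(unsigned long)];
};

/* "Prologue": store the reference canary just past the buffer. */
void frame_enter(struct frame_model *f, unsigned long canary)
{
	memcpy(f->bytes + BUF_SIZE, &canary, sizeof(canary));
}

/* The function body: fill n bytes starting at the buffer, without
 * checking n against BUF_SIZE. */
void frame_fill(struct frame_model *f, int value, size_t n)
{
	if (n > sizeof(f->bytes))
		n = sizeof(f->bytes);  /* keep the simulation inside its own memory */
	memset(f->bytes, value, n);
}

/* "Epilogue": returns 1 if the stored canary still matches, 0 if not. */
int frame_leave_ok(const struct frame_model *f, unsigned long canary)
{
	unsigned long stored;

	memcpy(&stored, f->bytes + BUF_SIZE, sizeof(stored));
	return stored == canary;
}
```

Filling 64 bytes leaves the canary alone; filling 128 bytes runs over it, and frame_leave_ok reports the mismatch. That mismatch is the point at which the real code calls __stack_chk_fail.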

Since gcc hard-codes the stack canary at a fixed offset from %gs, the kernel has to follow that convention.

The gs register serves different purposes on 32-bit and 64-bit kernels.

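On a 64-bit kernel, gcc requires the stack canary to be at offset 40 in the gs segment, and gs is shared with the per-CPU variables. The per-CPU variable irq_stack_union is laid out as follows: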

arch/x86/include/asm/processor.h

union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];
	/*
	 * GCC hardcodes the stack canary as %gs:40.  Since the
	 * irq_stack is the object at %gs:0, we reserve the bottom
	 * 48 bytes of the irq stack for the canary. 
	 */
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union);

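gs_base is just a 40-byte placeholder, with stack_canary immediately after it. When an application enters or leaves the kernel, the kernel uses the swapgs instruction to swap the contents of the gs register.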
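The layout requirement is easy to check in isolation. Below is a small model of the union; the type name and the stack size are assumptions made for this sketch, and the compile-time check plays the role that BUILD_BUG_ON plays in the kernel.

```c
#include <stddef.h>

#define IRQ_STACK_SIZE_MODEL 16384  /* assumed size, only for this sketch */

union irq_stack_union_model {
	char irq_stack[IRQ_STACK_SIZE_MODEL];
	struct {
		char gs_base[40];            /* placeholder so the next field lands at 40 */
		unsigned long stack_canary;  /* what a load from %gs:40 would read */
	};
};

/* Fails the build if the canary ever moves away from offset 40. */
_Static_assert(offsetof(union irq_stack_union_model, stack_canary) == 40,
	       "canary must sit at offset 40");

size_t canary_offset(void)
{
	return offsetof(union irq_stack_union_model, stack_canary);
}
```

Because the anonymous struct overlays the start of irq_stack, the canary shares storage with the lowest bytes of the IRQ stack, which is why the kernel comment talks about reserving the bottom of that stack.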

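The 32-bit case is a bit more complicated. Because some processors are slow at loading different segment registers, the kernel uses the fs segment register in place of gs. But gcc still uses gs with -fstack-protector, so the kernel has to manage gs as well: CONFIG_X86_32_LAZY_GS has to be turned off, and gs then changes only on a process switch. On 32-bit, the per-CPU variable stack_canary holds the stack canary.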

struct stack_canary {
	char __pad[20];         /* canary at %gs:20 */
	unsigned long canary;
};      
DECLARE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);

The kernel runs in protected mode, so the gs register holds a protected-mode segment selector, and the GDT needs a matching entry:

diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 1dc1b51..14e0ed8 100644 (file)
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -61,7 +61,7 @@
  *
  *  26 - ESPFIX small SS
  *  27 - per-cpu                       [ offset to per-cpu data area ]
- *  28 - unused
+ *  28 - stack_canary-20               [ for stack protector ]
  *  29 - unused
  *  30 - unused
  *  31 - TSS for double fault handler
@@ -95,6 +95,13 @@
 #define __KERNEL_PERCPU 0
 #endif

+#define GDT_ENTRY_STACK_CANARY         (GDT_ENTRY_KERNEL_BASE + 16)
+#ifdef CONFIG_CC_STACKPROTECTOR
+#define __KERNEL_STACK_CANARY          (GDT_ENTRY_STACK_CANARY * 8)
+#else
+#define __KERNEL_STACK_CANARY          0
+#endif
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS      31

Entry 28 of the GDT is used to locate the segment that holds the stack canary.

#define GDT_STACK_CANARY_INIT                                           \
        [GDT_ENTRY_STACK_CANARY] = GDT_ENTRY_INIT(0x4090, 0, 0x18),

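GDT_STACK_CANARY_INIT is used right after entering protected mode. The descriptor is set up with base address 0 and a segment size of 24: only a 4-byte stack canary at offset 0x14 from the base is needed, so 24 bytes is exactly enough. If this is unclear, look at the Intel protected-mode manual and go through the segment descriptor layout field by field.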
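To make the descriptor fields concrete, here is a sketch that packs flags, base and limit into the two 32-bit halves of an x86 segment descriptor and reads them back. The bit positions follow the architectural descriptor layout; the type and helper names (desc_model, desc_pack and so on) are hypothetical and are not the kernel's.

```c
#include <stdint.h>

/* Two 32-bit halves of an 8-byte segment descriptor. */
struct desc_model {
	uint32_t a;  /* limit[15:0] and base[15:0] */
	uint32_t b;  /* base[23:16], access byte, limit[19:16], flags, base[31:24] */
};

struct desc_model desc_pack(uint32_t flags, uint32_t base, uint32_t limit)
{
	struct desc_model d;

	d.a = (limit & 0xffffu) | ((base & 0xffffu) << 16);
	d.b = ((base & 0xff0000u) >> 16) | ((flags & 0xf0ffu) << 8) |
	      (limit & 0xf0000u) | (base & 0xff000000u);
	return d;
}

uint32_t desc_base(struct desc_model d)
{
	return (d.a >> 16) | ((d.b & 0xffu) << 16) | (d.b & 0xff000000u);
}

uint32_t desc_limit(struct desc_model d)
{
	return (d.a & 0xffffu) | (d.b & 0xf0000u);
}

/* Rewrite only the base, leaving limit and flags untouched. */
void desc_set_base(struct desc_model *d, uint32_t base)
{
	d->a = (d->a & 0x0000ffffu) | ((base & 0xffffu) << 16);
	d->b = (d->b & 0x00ffff00u) | ((base >> 16) & 0xffu) |
	       (base & 0xff000000u);
}
```

With the values from the text, desc_pack(0x4090, 0, 0x18) gives a descriptor whose base decodes to 0 and whose limit decodes to 0x18. Changing the base later keeps the other fields as they were.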

After entering protected mode, start_kernel() calls boot_init_stack_canary() to initialize a stack canary.

/*      
 * Initialize the stackprotector canary value.
 *
 * NOTE: this must only be called from functions that never return,
 * and it must always be inlined.
 */
static __always_inline void boot_init_stack_canary(void)
{
	u64 canary;
	u64 tsc;

#ifdef CONFIG_X86_64
	BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
	/*
	 * We both use the random pool and the current TSC as a source
	 * of randomness. The TSC only matters for very early init,
	 * there it already has some randomness on most systems. Later
	 * on during the bootup the random pool has true entropy too.
	 */
	get_random_bytes(&canary, sizeof(canary));
	tsc = __native_read_tsc();
	canary += tsc + (tsc << 32UL);

	current->stack_canary = canary;
#ifdef CONFIG_X86_64
	percpu_write(irq_stack_union.stack_canary, canary);
#else
	percpu_write(stack_canary.canary, canary);
#endif
}

A random value is generated and assigned to the per-CPU variable: stack_canary on 32-bit, irq_stack_union on 64-bit.
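The mixing step in boot_init_stack_canary() is plain 64-bit arithmetic and can be tried on its own. The sketch below takes the two inputs as parameters, since get_random_bytes() and the TSC read are not available outside the kernel; the function name mix_canary is hypothetical.

```c
#include <stdint.h>

/* Combine random bytes with a timestamp the same way the kernel snippet
 * does: canary += tsc + (tsc << 32). Unsigned overflow simply wraps. */
uint64_t mix_canary(uint64_t random_bytes, uint64_t tsc)
{
	return random_bytes + tsc + (tsc << 32);
}
```

Adding the TSC into both halves means even a weak early-boot random pool still yields a value that differs from boot to boot.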

When the kernel goes on to initialize the CPUs, it calls setup_stack_canary_segment() to set each CPU's stack canary descriptor in the GDT:

start_kernel()->setup_per_cpu_areas()->setup_stack_canary_segment：

static inline void setup_stack_canary_segment(int cpu)
{
#ifdef CONFIG_X86_32
	unsigned long canary = (unsigned long)&per_cpu(stack_canary, cpu);
	struct desc_struct *gdt_table = get_cpu_gdt_table(cpu);
	struct desc_struct desc;

	desc = gdt_table[GDT_ENTRY_STACK_CANARY];
	set_desc_base(&desc, canary);
	write_gdt_entry(gdt_table, GDT_ENTRY_STACK_CANARY, &desc, DESCTYPE_S);
#endif
}

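When the kernel first enters protected mode, the base address of the stack canary descriptor is initialized to 0. During CPU initialization it is reset to the address of the per-CPU variable stack_canary (the address, not the value stored in it). With this in place, kernel code that accesses %gs:0x14 reads the stored stack canary value. Note that setup_stack_canary_segment only does something on 32-bit kernels: on 64-bit kernels irq_stack_union is already shared through the per-CPU area, so no separate per-CPU setup is needed. After that, switch_to_new_gdt(cpu); can be called to load the GDT and the gs register.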
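Why the descriptor base must be the address of the variable rather than its value becomes clear if a segment-relative load is modelled as "read at base + offset". This is a user-space sketch with hypothetical names; fixed-width types are used so the 32-bit layout from the text also holds on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 32-bit per-CPU layout from the text: 20 bytes of padding, then the canary. */
struct stack_canary_model {
	char     pad[20];
	uint32_t canary;
};

/* Model of a load from %gs:offset, where segment_base is the descriptor base. */
uint32_t seg_read32(const unsigned char *segment_base, size_t offset)
{
	uint32_t v;

	memcpy(&v, segment_base + offset, sizeof(v));
	return v;
}
```

If each CPU's descriptor base points at that CPU's own stack_canary_model, the same instruction (a load from offset 0x14) returns a different canary on each CPU.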

After the initialization above, kernel code can find the stack canary value at %gs:0x14. So when is each process's stack canary set?

When the kernel starts a process, it sets the gs register value to __KERNEL_STACK_CANARY:

--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -212,6 +212,7 @@ int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
	    regs.ds = __USER_DS;
	    regs.es = __USER_DS;
	    regs.fs = __KERNEL_PERCPU;
+       regs.gs = __KERNEL_STACK_CANARY;
	    regs.orig_ax = -1;
	    regs.ip = (unsigned long) kernel_thread_helper;
	    regs.cs = __KERNEL_CS | get_kernel_rpl();

When the kernel forks a process, it does the following:

static struct task_struct *dup_task_struct(struct task_struct *orig)
{
#ifdef CONFIG_CC_STACKPROTECTOR
	tsk->stack_canary = get_random_int();
#endif
}

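A random stack_canary is generated and stored in the stack_canary field of task_struct. On a process switch, the __switch_canary macro copies the new process's stack canary into the per-CPU variable stack_canary, so the per-CPU variable always holds the canary of whichever process is current. That completes the canary switch.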
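Stripped of the inline assembly, the switch amounts to two assignments. The model below uses hypothetical structures (task_model, cpu_model) standing in for task_struct and the per-CPU data; it only illustrates the data flow, not the real context switch.

```c
/* Each task carries its own canary, chosen when the task is created. */
struct task_model {
	unsigned long stack_canary;
};

/* Per-CPU state: which task is running, and the canary gcc's checks will read. */
struct cpu_model {
	struct task_model *current_task;
	unsigned long      stack_canary;
};

/* After switching, load the new task's canary into the per-CPU slot,
 * mirroring the three movl instructions in __switch_canary. */
void switch_to_model(struct cpu_model *cpu, struct task_model *next)
{
	cpu->current_task = next;
	cpu->stack_canary = cpu->current_task->stack_canary;
}
```

Frames that a task pushed before being switched out hold that task's canary, so they still compare equal when the task is switched back in.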

diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index 79b98e5..2692ee8 100644 (file)
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -23,6 +23,22 @@ struct task_struct *__switch_to(struct task_struct *prev,

 #ifdef CONFIG_X86_32

+#ifdef CONFIG_CC_STACKPROTECTOR
+#define __switch_canary                                                \
+       "movl "__percpu_arg([current_task])",%%ebx\n\t"                 \
+       "movl %P[task_canary](%%ebx),%%ebx\n\t"                         \
+       "movl %%ebx,"__percpu_arg([stack_canary])"\n\t"
+#define __switch_canary_oparam                                         \
+       , [stack_canary] "=m" (per_cpu_var(stack_canary))
+#define __switch_canary_iparam                                         \
+       , [current_task] "m" (per_cpu_var(current_task))                \
+       , [task_canary] "i" (offsetof(struct task_struct, stack_canary))
+#else  /* CC_STACKPROTECTOR */
+#define __switch_canary
+#define __switch_canary_oparam
+#define __switch_canary_iparam
+#endif /* CC_STACKPROTECTOR */
+
 /*
  * Saving eflags is important. It switches not only IOPL between tasks,
  * it also protects other tasks from NT leaking through sysenter etc.
@@ -46,6 +62,7 @@ do {                                                  \
	                 "pushl %[next_ip]\n\t"     /* restore EIP   */     \
	                 "jmp __switch_to\n"        /* regparm call  */     \
	                 "1:\t"                                             \
+                    __switch_canary                                    \
	                 "popl %%ebp\n\t"           /* restore EBP   */     \
	                 "popfl\n"                  /* restore flags */     \
	                                                                    \
@@ -58,6 +75,8 @@ do {                                                  \
	                   "=b" (ebx), "=c" (ecx), "=d" (edx),              \
	                   "=S" (esi), "=D" (edi)                           \
	                                                                    \
+                      __switch_canary_oparam                           \
+                                                                       \
	                   /* input parameters: */                          \
	                 : [next_sp]  "m" (next->thread.sp),                \
	                   [next_ip]  "m" (next->thread.ip),                \
@@ -66,6 +85,8 @@ do {                                                  \
	                   [prev]     "a" (prev),                           \
	                   [next]     "d" (next)                            \
	                                                                    \
+                      __switch_canary_iparam                           \
+                                                                       \
	                 : /* reloaded segment registers */                 \
	                    "memory");                                      \
 } while (0)

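As mentioned earlier, when gcc detects a stack overflow it calls glibc's __stack_chk_fail. But when the kernel stack overflows, glibc functions are not available, so the kernel implements its own __stack_chk_fail: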

kernel/panic.c

#ifdef CONFIG_CC_STACKPROTECTOR

/*
 * Called when gcc's -fstack-protector feature is used, and
 * gcc detects corruption of the on-stack canary value
 */
void __stack_chk_fail(void)
{
	panic("stack-protector: Kernel stack is corrupted in: %p\n",
		 __builtin_return_address(0));
}
EXPORT_SYMBOL(__stack_chk_fail);

#endif

When the kernel stack overflows, __stack_chk_fail runs and the kernel panics.

That is how the patch works. For more detail, see:

http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commitdiff;h=60a5317ff0f42dd313094b88f809f63041568b08

ixgbe

2015-11-17 15:16:00

http://www.pagefault.info/?p=403

The driver code analyzed here is based on linux kernel 3.4.4.

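The corresponding files are under drivers/net/ethernet/intel. This analysis does not go into fine detail; the main goal is to understand how data moves between the protocol stack and the driver.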

First, network cards are PCI devices, so every NIC driver is really a PCI driver. Intel bundles the drivers for several 10-gigabit NICs (82599/82598/x540) together here.

Let's start with the pci_driver structure. Every PCI driver is a pci_driver structure, and here several 10-gigabit NICs share the single structure ixgbe_driver.

static struct pci_driver ixgbe_driver = {
	.name     = ixgbe_driver_name,
	.id_table = ixgbe_pci_tbl,
	.probe    = ixgbe_probe,
	.remove   = __devexit_p(ixgbe_remove),
#ifdef CONFIG_PM
	.suspend  = ixgbe_suspend,
	.resume   = ixgbe_resume,
#endif
	.shutdown = ixgbe_shutdown,
	.err_handler = &ixgbe_err_handler
};

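Next is the module init function. It is very simple: it calls the PCI driver registration function to hook ixgbe into the PCI device chain. I won't say much about PCI device initialization here; earlier posts on this blog cover it. All we need to know is that the kernel eventually calls the probe callback to initialize ixgbe.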

char ixgbe_driver_name[] = "ixgbe";
static const char ixgbe_driver_string[] =
				"Intel(R) 10 Gigabit PCI Express Network Driver";
 
static int __init ixgbe_init_module(void)
{
	int ret;
	pr_info("%s - version %s\n", ixgbe_driver_string, ixgbe_driver_version);
	pr_info("%s\n", ixgbe_copyright);
 
#ifdef CONFIG_IXGBE_DCA
	dca_register_notify(&dca_notifier);
#endif
 
	ret = pci_register_driver(&ixgbe_driver);
	return ret;
}

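We won't trace how probe ends up being called and go straight to the probe function. It uses the hardware information to decide which driver variant to initialize (82598/82599/x540); the core driver structures are kept in the array below.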

static const struct ixgbe_info *ixgbe_info_tbl[] = {
	[board_82598] = &ixgbe_82598_info,
	[board_82599] = &ixgbe_82599_info,
	[board_X540] = &ixgbe_X540_info,
};

ixgbe_probe is long and we won't analyze it in detail, since this part just initializes the NIC. A few fragments are worth a look, though.

First, the driver info is looked up from the hardware parameters:

const struct ixgbe_info *ii = ixgbe_info_tbl[ent->driver_data];

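Next is how the different NIC variants get hooked up to the right callbacks. This is simple: the adapter is obtained from the netdev structure, all the core operations are stored in the adapter, and finally the callbacks from ii are copied into the adapter. Here is the code: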

	struct net_device *netdev;
	struct ixgbe_adapter *adapter = NULL;
	struct ixgbe_hw *hw;
	.....................................
 
	adapter = netdev_priv(netdev);
	pci_set_drvdata(pdev, adapter);
 
	adapter->netdev = netdev;
	adapter->pdev = pdev;
	hw = &adapter->hw;
	hw->back = adapter;
	.......................................
	memcpy(&hw->mac.ops, ii->mac_ops, sizeof(hw->mac.ops));
	hw->mac.type  = ii->mac;
 
	/* EEPROM */
	memcpy(&hw->eeprom.ops, ii->eeprom_ops, sizeof(hw->eeprom.ops));
	.....................................
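The pattern in this fragment (look up a per-board info entry, then memcpy its operations table into the device's own structure) can be shown with a much smaller model. Everything below is hypothetical: two made-up board types, one callback per table, and helper names chosen for the sketch.

```c
#include <string.h>

enum board_model { BOARD_A, BOARD_B, BOARD_COUNT };

/* One callback stands in for the whole mac operations table. */
struct mac_ops_model {
	int (*reset_hw)(void);
};

struct info_model {
	int mac_type;
	const struct mac_ops_model *mac_ops;
};

struct hw_model {
	int mac_type;
	struct mac_ops_model ops;  /* private copy, filled in at probe time */
};

static int reset_a(void) { return 1; }
static int reset_b(void) { return 2; }

static const struct mac_ops_model ops_a = { reset_a };
static const struct mac_ops_model ops_b = { reset_b };
static const struct info_model info_a = { 100, &ops_a };
static const struct info_model info_b = { 200, &ops_b };

/* Indexed by board type, like ixgbe_info_tbl is indexed by driver_data. */
static const struct info_model *const info_tbl[BOARD_COUNT] = {
	[BOARD_A] = &info_a,
	[BOARD_B] = &info_b,
};

/* Returns 0 on success, -1 for an unknown board index. */
int hw_init_model(struct hw_model *hw, unsigned int board)
{
	const struct info_model *ii;

	if (board >= BOARD_COUNT)
		return -1;
	ii = info_tbl[board];
	memcpy(&hw->ops, ii->mac_ops, sizeof(hw->ops));
	hw->mac_type = ii->mac_type;
	return 0;
}
```

Copying the table instead of storing a pointer gives each device a private set of callbacks, so one could be overridden for a single device without touching the shared const tables.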

The last thing to note is setting the NIC features. These are generally the features that can be set with ethtool (tso, checksum and so on). Here is an excerpt:

	netdev->features = NETIF_F_SG |
			   NETIF_F_IP_CSUM |
			   NETIF_F_IPV6_CSUM |
			   NETIF_F_HW_VLAN_TX |
			   NETIF_F_HW_VLAN_RX |
			   NETIF_F_HW_VLAN_FILTER |
			   NETIF_F_TSO |
			   NETIF_F_TSO6 |
			   NETIF_F_RXHASH |
			   NETIF_F_RXCSUM;
 
	netdev->hw_features = netdev->features;
 
	switch (adapter->hw.mac.type) {
	case ixgbe_mac_82599EB:
	case ixgbe_mac_X540:
		netdev->features |= NETIF_F_SCTP_CSUM;
		netdev->hw_features |= NETIF_F_SCTP_CSUM |
					   NETIF_F_NTUPLE;
		break;
	default:
		break;
	}
 
	netdev->hw_features |= NETIF_F_RXALL;
	..................................................
 
	netdev->priv_flags |= IFF_UNICAST_FLT;
	netdev->priv_flags |= IFF_SUPP_NOFCS;
 
	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
		adapter->flags &= ~(IXGBE_FLAG_RSS_ENABLED |
					IXGBE_FLAG_DCB_ENABLED);
	...................................................................
	if (pci_using_dac) {
		netdev->features |= NETIF_F_HIGHDMA;
		netdev->vlan_features |= NETIF_F_HIGHDMA;
	}
 
	if (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE)
		netdev->hw_features |= NETIF_F_LRO;
	if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED)
		netdev->features |= NETIF_F_LRO;

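Next, interrupt registration. Most 10-gigabit NICs are multi-queue (combined with MSI-X), so to the upper layers it looks as if there were several NICs whose data is independent of each other. Receiving is mainly handled by the driver's NAPI poll method, which we analyze later.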

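At this point you may ask how the NIC exposes callbacks to the upper layer so that it can send data. Every network device has a table of callbacks (ndo_start_xmit, for example) for the upper layer to call; in ixgbe that table is ixgbe_netdev_ops. The structure is shown below, trimmed to the parts we care about.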

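Note that the receive callback is not in it. That is because sending is initiated by software while receiving is initiated by hardware. ixgbe uses NAPI, so its poll callback is ixgbe_poll, which is added through netif_napi_add when the interrupts are set up.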

static const struct net_device_ops ixgbe_netdev_ops = {
	.ndo_open       = ixgbe_open,
	.ndo_stop       = ixgbe_close,
	.ndo_start_xmit     = ixgbe_xmit_frame,
	.ndo_select_queue   = ixgbe_select_queue,
	.ndo_set_rx_mode    = ixgbe_set_rx_mode,
	.ndo_validate_addr  = eth_validate_addr,
	.ndo_set_mac_address    = ixgbe_set_mac,
	.ndo_change_mtu     = ixgbe_change_mtu,
	.ndo_tx_timeout     = ixgbe_tx_timeout,
	.................................................
	.ndo_set_features = ixgbe_set_features,
	.ndo_fix_features = ixgbe_fix_features,
};

What we care about most is the ndo_start_xmit callback, the transmit interface the driver provides to the protocol stack. Let's look at this function.

Its implementation is simple: pick the right queue, then call ixgbe_xmit_frame_ring to send the data.

static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
					struct net_device *netdev)
{
	struct ixgbe_adapter *adapter = netdev_priv(netdev);
	struct ixgbe_ring *tx_ring;
 
	if (skb->len <= 0) {
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}
 
	/*
	 * The minimum packet size for olinfo paylen is 17 so pad the skb
	 * in order to meet this minimum size requirement.
	 */
	if (skb->len < 17) {
		if (skb_padto(skb, 17))
			return NETDEV_TX_OK;
		skb->len = 17;
	}
	// pick the tx ring that matches the skb's queue
	tx_ring = adapter->tx_ring[skb->queue_mapping];
	// transmit the data
	return ixgbe_xmit_frame_ring(skb, adapter, tx_ring);
}

In ixgbe_xmit_frame_ring we look at just two things: TSO (look it up if the term is unfamiliar) and how the data is actually sent.

	tso = ixgbe_tso(tx_ring, first, &hdr_len);
	if (tso < 0)
		goto out_drop;
	else if (!tso)
		ixgbe_tx_csum(tx_ring, first);
 
	/* add the ATR filter if ATR is on */
	if (test_bit(__IXGBE_TX_FDIR_INIT_DONE, &tx_ring->state))
		ixgbe_atr(tx_ring, first);
 
#ifdef IXGBE_FCOE
xmit_fcoe:
#endif /* IXGBE_FCOE */
	ixgbe_tx_map(tx_ring, first, hdr_len);

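Once ixgbe_tso has handled TSO, ixgbe_tx_map is called to send the data. ixgbe_tx_map does two main things: first it sets up the DMA mapping, then it writes a register to tell the NIC to send.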

	dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);
	if (dma_mapping_error(tx_ring->dev, dma))
		goto dma_error;
 
	/* record length, and DMA address */
	dma_unmap_len_set(first, len, size);
	dma_unmap_addr_set(first, dma, dma);
 
	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
	for (;;) {
		while (unlikely(size > IXGBE_MAX_DATA_PER_TXD)) {
			tx_desc->read.cmd_type_len =
				cmd_type | cpu_to_le32(IXGBE_MAX_DATA_PER_TXD);
 
			i++;
			tx_desc++;
			if (i == tx_ring->count) {
				tx_desc = IXGBE_TX_DESC(tx_ring, 0);
				i = 0;
			}
 
			dma += IXGBE_MAX_DATA_PER_TXD;
			size -= IXGBE_MAX_DATA_PER_TXD;
 
			tx_desc->read.buffer_addr = cpu_to_le64(dma);
			tx_desc->read.olinfo_status = 0;
		}
 
		...................................................
		data_len -= size;
 
		dma = skb_frag_dma_map(tx_ring->dev, frag, 0, size,
					   DMA_TO_DEVICE);
		..........................................................
 
		frag++;
	}
	.................................
	tx_ring->next_to_use = i;
 
	/* notify HW of packet */
	writel(i, tx_ring->tail);
	.................
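The inner loop above splits one mapped buffer across several descriptors and wraps around at the end of the ring. Here is a reduced model of just that part. The ring size, the per-descriptor limit and all names are assumptions for this sketch, and it assumes the ring has enough free descriptors, which the real driver checks before it gets here.

```c
#include <stdint.h>

#define RING_COUNT        8u      /* hypothetical ring size */
#define MAX_DATA_PER_DESC 16384u  /* assumed per-descriptor data limit */

struct tx_desc_model {
	uint64_t addr;
	uint32_t len;
};

struct tx_ring_model {
	struct tx_desc_model desc[RING_COUNT];
	uint32_t next_to_use;
	uint32_t tail;  /* stands in for the hardware tail register */
};

/* Spread one DMA-contiguous buffer over descriptors, wrapping at the end
 * of the ring. Returns the number of descriptors used. */
uint32_t tx_map_model(struct tx_ring_model *r, uint64_t dma, uint32_t size)
{
	uint32_t i = r->next_to_use;
	uint32_t used = 0;

	while (size > 0) {
		uint32_t chunk = size > MAX_DATA_PER_DESC ? MAX_DATA_PER_DESC : size;

		r->desc[i].addr = dma;
		r->desc[i].len = chunk;
		dma += chunk;
		size -= chunk;
		used++;
		if (++i == RING_COUNT)
			i = 0;  /* wrap to the first descriptor */
	}

	r->next_to_use = i;
	r->tail = i;  /* "notify HW of packet": the device reads up to tail */
	return used;
}
```

Starting at index 6 of an 8-entry ring, a 40000-byte buffer takes three descriptors (6, 7 and 0) and leaves the tail at 1.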

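The operation above is asynchronous: the kernel cannot free the skb yet. Only after the NIC hardware has sent the data does it raise another interrupt to notify the kernel, and only then can the kernel free the memory. Let's look at that code next.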

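First, the interrupt registration code. Assuming MSI-X is enabled, the NIC's interrupts are registered in ixgbe_request_msix_irqs. It calls request_irq to register the handler, and each queue gets its own interrupt number.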

static int ixgbe_request_msix_irqs(struct ixgbe_adapter *adapter)
{
	struct net_device *netdev = adapter->netdev;
	int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
	int vector, err;
	int ri = 0, ti = 0;
 
	for (vector = 0; vector < q_vectors; vector++) {
		struct ixgbe_q_vector *q_vector = adapter->q_vector[vector];
		struct msix_entry *entry = &adapter->msix_entries[vector];
		.......................................................................
		err = request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
				  q_vector->name, q_vector);
		if (err) {
			e_err(probe, "request_irq failed for MSIX interrupt "
				  "Error: %d\n", err);
			goto free_queue_irqs;
		}
		/* If Flow Director is enabled, set interrupt affinity */
		if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
			/* assign the mask for this irq */
			irq_set_affinity_hint(entry->vector,
						  &q_vector->affinity_mask);
		}
	}
 
	..............................................
 
	return 0;
 
free_queue_irqs:
	...............................
	return err;
}

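The interrupt handler is ixgbe_msix_clean_rings. What it does is simple (you need to know how NAPI works; earlier posts on this blog cover it): it calls napi_schedule to hand the work over to softirq processing.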

static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
	struct ixgbe_q_vector *q_vector = data;
 
	/* EIAM disabled interrupts (on this vector) for us */
 
	if (q_vector->rx.ring || q_vector->tx.ring)
		napi_schedule(&q_vector->napi);
 
	return IRQ_HANDLED;
}

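As we know, a NAPI driver eventually has its registered poll callback invoked. In ixgbe that callback is ixgbe_poll, which therefore has two jobs: handle receive, and clean up after completed transmits.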

int ixgbe_poll(struct napi_struct *napi, int budget)
{
	struct ixgbe_q_vector *q_vector =
				container_of(napi, struct ixgbe_q_vector, napi);
	struct ixgbe_adapter *adapter = q_vector->adapter;
	struct ixgbe_ring *ring;
	int per_ring_budget;
	bool clean_complete = true;
 
#ifdef CONFIG_IXGBE_DCA
	if (adapter->flags & IXGBE_FLAG_DCA_ENABLED)
		ixgbe_update_dca(q_vector);
#endif
	// clean up completed transmits
	ixgbe_for_each_ring(ring, q_vector->tx)
		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);
 
	/* attempt to distribute budget to each queue fairly, but don't allow
	 * the budget to go below 1 because we'll exit polling */
	if (q_vector->rx.count > 1)
		per_ring_budget = max(budget/q_vector->rx.count, 1);
	else
		per_ring_budget = budget;
	// receive data and clean up what has completed
	ixgbe_for_each_ring(ring, q_vector->rx)
		clean_complete &= ixgbe_clean_rx_irq(q_vector, ring,
							 per_ring_budget);
 
	/* If all work not completed, return budget and keep polling */
	if (!clean_complete)
		return budget;
 
	/* all work done, exit the polling mode */
	napi_complete(napi);
	if (adapter->rx_itr_setting & 1)
		ixgbe_set_itr(q_vector);
	if (!test_bit(__IXGBE_DOWN, &adapter->state))
		ixgbe_irq_enable_queues(adapter, ((u64)1 << q_vector->v_idx));
 
	return 0;
}
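Two small rules inside ixgbe_poll are worth isolating: how the budget is shared between receive rings, and what the function returns. The helpers below restate them as a sketch with hypothetical names, independent of any kernel type.

```c
/* Share the NAPI budget fairly between rx rings, never dropping below 1. */
int per_ring_budget_model(int budget, int rx_ring_count)
{
	if (rx_ring_count > 1) {
		int share = budget / rx_ring_count;

		return share > 1 ? share : 1;
	}
	return budget;
}

/* Return convention used by the poll function above: the full budget
 * means "keep polling me", 0 means all work is done. */
int poll_result_model(int budget, int clean_complete)
{
	return clean_complete ? 0 : budget;
}
```

With a budget of 64 and four rx rings each ring gets 16. With a budget of 2 and four rings each ring still gets 1, since a ring budget of 0 would make the ring look finished without doing any work.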

cubic

2015-11-17 15:08:00

http://www.pagefault.info/?p=145

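This time we look at the implementation of cubic, a congestion control algorithm in the kernel. The linux kernel implements many congestion control algorithms, but newer kernels (since 2.6.19) default to cubic (the algorithm in use can be read from /proc/sys/net/ipv4/tcp_congestion_control). Below is the congestion control algorithm on the latest redhat 6 (rh5 still used bic):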

[root@rhel6 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control 
cubic

The paper for this algorithm is here:

http://netsrv.csc.ncsu.edu/export/cubic_a_new_tcp_2008.pdf

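The congestion control algorithm is invoked from tcp_ack; a normal ack (for example not a duplicate, not a sack, and so on) goes into the congestion control algorithm.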

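cubic calls tcp_slow_start (nearly every congestion control algorithm does), which handles slow start. Slow start in the kernel works like this: each ack received adds 1 to snd_cwnd, and once cwnd exceeds the slow start threshold snd_ssthresh, the connection enters congestion avoidance.

When sending, the kernel checks in_flight against snd_cwnd. in_flight can be thought of as the packets sent but not yet acknowledged; it equals the packets sent and unacknowledged, minus the sacked segments, minus the lost segments, plus the retransmitted segments (an earlier post on this blog explains these counters in detail). If in_flight is greater than or equal to snd_cwnd no data is sent; only if it is smaller does sending continue.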
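The in_flight formula and the send check translate directly into code. The field names below follow the counters named in the text; the struct itself and the function names are a hypothetical stand-in, not the kernel's struct tcp_sock.

```c
/* Per-connection counters, all in units of segments. */
struct tcp_counts_model {
	unsigned int packets_out;  /* sent and not yet cumulatively acked */
	unsigned int sacked_out;   /* of those, selectively acked */
	unsigned int lost_out;     /* of those, considered lost */
	unsigned int retrans_out;  /* retransmitted and still outstanding */
};

/* in_flight = sent-unacked - sacked - lost + retransmitted */
unsigned int packets_in_flight(const struct tcp_counts_model *c)
{
	return c->packets_out - (c->sacked_out + c->lost_out) + c->retrans_out;
}

/* Sending may continue only while in_flight is below the congestion window. */
int may_send(const struct tcp_counts_model *c, unsigned int snd_cwnd)
{
	return packets_in_flight(c) < snd_cwnd;
}
```

For example, with 10 packets out, 2 sacked, 1 lost and 1 retransmitted, in_flight is 8, so a window of 8 blocks further sending while a window of 9 allows one more segment.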

Once in congestion avoidance, the window grows more slowly.

Here is data I collected by hooking tcp_slow_start (the slow start handler) and tcp_cong_avoid_ai (the congestion avoidance handler) with jprobe.

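In the data below, tcp_sock is the address of the current socket, in_flight packets is the number of packets sent but not yet acknowledged, and snd_cwnd is the send congestion window. The two values after count need more explanation. The first is snd_cwnd_cnt, the number of segments already counted within the current congestion window. The second is cnt, a field of struct bictcp. It is the heart of the cubic algorithm and controls when the congestion window may grow during congestion avoidance: it is compared against snd_cwnd_cnt to decide whether to enlarge the window. I won't analyze how this value is computed; see the cubic paper for that.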

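Another thing to note is ssthresh. It is initialized to a maximum value at the start, and is set to the previous congestion window size on entering congestion avoidance. See this passage of rfc2581: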

The initial value of ssthresh may be arbitrarily high (i.e., the size of the advertised window), but it may be reduced in response to congestion. When cwnd < ssthresh, the slow-start algorithm is used and when cwnd > ssthresh, the congestion avoidance algorithm is used. When cwnd and ssthresh are equal, the sender may use either of them.

Later we will see how this value is set in cubic.

//Entering slow start: the congestion window starts at 3 here, and each ack received adds 1.
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 2, snd_cwnd is 3, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 3, snd_cwnd is 4, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 4, snd_cwnd is 5, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 5, snd_cwnd is 6, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 4, snd_cwnd is 7, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 7, snd_cwnd is 8, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 6, snd_cwnd is 9, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 9, snd_cwnd is 10, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 10, snd_cwnd is 11, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 11, snd_cwnd is 12, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 10, snd_cwnd is 13, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 9, snd_cwnd is 14, ssthresh is 2147483647, count is [0:0]
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 14, snd_cwnd is 15, ssthresh is 2147483647, count is [0:0]
//ssthresh is updated to the current congestion window size; we will see later why it is 16
enter [slow start state] tcp_sock is 4129562112 in_flight packets is 13, snd_cwnd is 16, ssthresh is 16, count is [0:0]
//Entering congestion avoidance: the congestion window is now clearly above the threshold ssthresh.
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 16, snd_cwnd is 17, ssthresh is 16, count is [0:877]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 15, snd_cwnd is 17, ssthresh is 16, count is [1:877]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 16, snd_cwnd is 17, ssthresh is 16, count is [2:877]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 15, snd_cwnd is 17, ssthresh is 16, count is [3:877]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 14, snd_cwnd is 17, ssthresh is 16, count is [4:877]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 13, snd_cwnd is 17, ssthresh is 16, count is [5:877]
//Note that the first count value keeps growing linearly: roughly 80 log lines are omitted below, and over those 80-odd calls the congestion window stays at 17
....................................................................................................................
//cnt has now dropped to 3, which means the window will be enlarged once this congestion avoidance call completes.
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 13, snd_cwnd is 17, ssthresh is 16, count is [91:3]
//The window grows, and snd_cwnd_cnt is reset to 0.
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 11, snd_cwnd is 18, ssthresh is 16, count is [0:6]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 16, snd_cwnd is 18, ssthresh is 16, count is [1:6]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 14, snd_cwnd is 18, ssthresh is 16, count is [2:6]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 12, snd_cwnd is 18, ssthresh is 16, count is [3:6]
enter [cong avoid state] tcp_sock is 4129562112 in_flight packets is 12, snd_cwnd is 18, ssthresh is 16, count is [4:6]

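As the log shows, in slow start the send congestion window simply grows by 1 each time, while after entering congestion avoidance it clearly grows much more slowly.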

Now let's see how the code implements this.

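First, bictcp_cong_avoid, the handler of the cubic congestion control algorithm (normally called from tcp_ack). It takes 3 arguments: the sock, the ack sequence number, and, most importantly, the number of packets sent but not yet acked (these variables were covered in detail in the earlier post "linux kernel tcp congestion handling, part 1"). This last variable is central to congestion control.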

static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct bictcp *ca = inet_csk_ca(sk);
	//if the sender is not currently limited by the congestion window, return without changing the window.
	if (!tcp_is_cwnd_limited(sk, in_flight))
		return;
	//decide between slow start and congestion avoidance
	if (tp->snd_cwnd <= tp->snd_ssthresh) {
		//reset the hystart state in bictcp if needed
		if (hystart && after(ack, ca->end_seq))
			bictcp_hystart_reset(sk);
		//slow start
		tcp_slow_start(tp);
	} else {
		//congestion avoidance: first update ca->cnt.
		bictcp_update(ca, tp->snd_cwnd);
		//then run congestion avoidance
		tcp_cong_avoid_ai(tp, ca->cnt);
	}
}

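Next is tcp_is_cwnd_limited, which implements the congestion window validation of RFC2861. A return value of 1 means we are limited by the congestion window and it should be enlarged; otherwise it does not need to be.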

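There are two further checks in it. The first involves GSO, short for Generic Segmentation Offload. Its main job is to delay the segmentation of a packet as long as possible so that it happens at the most suitable moment; the mechanism sits between the point where the packet leaves the protocol stack and the point where it enters the driver. For example, if the driver supports TSO, gso passes the unsegmented data down to the driver. TSO stands for TCP Segmentation Offload: the driver lets the protocol stack hand it segments larger than the MTU, and the hardware cuts them into packets and sends them, which raises system throughput. I will analyze these (and GRO) in detail in a later post; for now a rough idea of what they do is enough.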

When gso is supported, the packets may be held back by tso deferral, so a few related checks decide whether the congestion window needs to grow.

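Then there is the notion of burst, which limits sudden surges of traffic: when left (the number of segments that can still be sent) is larger than the burst value, we hold off on enlarging the window, because we may be sending too fast.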

int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	u32 left;
	//compare the unacknowledged in-flight packets with the send congestion window
	if (in_flight >= tp->snd_cwnd)
		return 1;
	//number of segments that can still be sent
	left = tp->snd_cwnd - in_flight;
	if (sk_can_gso(sk) &&
		left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
		left * tp->mss_cache < sk->sk_gso_max_size)
		return 1;
	//is the number of segments that can still be sent no larger than burst
	return left <= tcp_max_burst(tp);
}
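The three decisions in tcp_is_cwnd_limited can be tried out with plain numbers. In the model below the socket state, the sysctl and the burst limit are all passed in through one hypothetical struct; the values in the example are made up for illustration.

```c
/* Inputs the real function reads from the socket and from sysctls. */
struct cwnd_state_model {
	unsigned int snd_cwnd;         /* congestion window, in segments */
	unsigned int mss;              /* bytes per segment */
	unsigned int gso_max_size;     /* largest GSO packet, in bytes */
	unsigned int tso_win_divisor;  /* fraction of cwnd one TSO frame may use */
	unsigned int max_burst;        /* allowed burst, in segments */
	int can_gso;                   /* non-zero if GSO is usable */
};

/* Returns 1 if the sender is limited by cwnd (so cwnd may grow), else 0. */
int is_cwnd_limited_model(const struct cwnd_state_model *s, unsigned int in_flight)
{
	unsigned int left;

	if (in_flight >= s->snd_cwnd)
		return 1;

	/* room left in the window, in segments */
	left = s->snd_cwnd - in_flight;

	/* with GSO, a small remaining gap may just be TSO deferral at work */
	if (s->can_gso &&
	    left * s->tso_win_divisor < s->snd_cwnd &&
	    left * s->mss < s->gso_max_size)
		return 1;

	return left <= s->max_burst;
}
```

With a window of 20 and a burst of 3, having 15 segments in flight leaves a gap of 5: without GSO that counts as "not limited", while with GSO (divisor 3, mss 1460, 64 KB GSO size) the same gap is treated as limited.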

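Next, how snd_ssthresh gets set. A value of our choosing can be passed in when the cubic module is loaded, but the default is very large (2147483647 here). The value is then adjusted while acks are being received during slow start; in cubic the default outcome is 16 (typically, when the congestion window reaches 16, snd_ssthresh is set to 16).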

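cubic sets snd_ssthresh in two places: hystart_update and bictcp_recalc_ssthresh. I won't cover the latter here; it will be covered in a later post on the congestion state machine. For now it is enough to know that bictcp_recalc_ssthresh is called only when congestion occurs and snd_ssthresh has to be adjusted.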

hystart_update is called from bictcp_acked, which runs on basically every ack received. Here is the condition under which bictcp_acked calls hystart_update:

/* hystart triggers when cwnd is larger than some threshold */
if (hystart && tp->snd_cwnd <= tp->snd_ssthresh &&
	tp->snd_cwnd >= hystart_low_window)
	hystart_update(sk, delay);

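hystart is the flag that enables hybrid slow start and is on by default. hystart_low_window is the minimum congestion window at which snd_ssthresh may be set, 16 by default. Since tp->snd_ssthresh defaults to a very large value, we enter hystart_update to update snd_ssthresh once the congestion window has grown to 16. In other words, hystart_update mainly decides whether to leave slow start.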

static void hystart_update(struct sock *sk, u32 delay)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct bictcp *ca = inet_csk_ca(sk);

	if (!(ca->found & hystart_detect)) {
		.................................................................
		/*
		 * Either one of two conditions are met,
		 * we exit from slow start immediately.
		 */
		//found flags whether to leave slow start
		if (ca->found & hystart_detect)
			//set snd_ssthresh
			tp->snd_ssthresh = tp->snd_cwnd;
	}
}

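Then comes the slow start handling. The abc-related parts are well commented so I won't explain them; we mainly look at the path with abc turned off. cnt is used here mostly to support slow start when abc is on.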

This is the rfc for abc (Appropriate Byte Counting):

http://www.faqs.org/rfcs/rfc3465.html

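Appropriate Byte Counting makes the congestion control algorithm much more aggressive: with it on, slow start does not necessarily run on every ack, and the window also grows a lot faster.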

void tcp_slow_start(struct tcp_sock *tp)
{
	int cnt; /* increase in packets */

	/* RFC3465: ABC Slow start
	 * Increase only after a full MSS of bytes is acked
	 *
	 * TCP sender SHOULD increase cwnd by the number of
	 * previously unacknowledged bytes ACKed by each incoming
	 * acknowledgment, provided the increase is not more than L
	 */
	if (sysctl_tcp_abc && tp->bytes_acked < tp->mss_cache)
		return;
	//limit cnt during slow start
	if (sysctl_tcp_max_ssthresh > 0 && tp->snd_cwnd > sysctl_tcp_max_ssthresh)
		cnt = sysctl_tcp_max_ssthresh >> 1; /* limited slow start */
	else
		cnt = tp->snd_cwnd;            /* exponential increase */

	/* RFC3465: ABC
	 * We MAY increase by 2 if discovered delayed ack
	 */
	if (sysctl_tcp_abc > 1 && tp->bytes_acked >= 2*tp->mss_cache)
		cnt <<= 1;
	tp->bytes_acked = 0;
	//update snd_cwnd_cnt, the count of segments received under the current congestion window.
	tp->snd_cwnd_cnt += cnt;
	while (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
		//the congestion window grows by however many whole multiples of snd_cwnd fit in snd_cwnd_cnt.
		tp->snd_cwnd_cnt -= tp->snd_cwnd;
		//add one if the congestion window is still below its maximum
		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
			tp->snd_cwnd++;
	}
}
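With abc off and no max_ssthresh limit, the function reduces to a few lines, and it becomes easy to see why the log earlier shows the window growing by exactly 1 per ack. The sketch below keeps only that path; the struct is a hypothetical stand-in for the three tcp_sock fields involved, and snd_cwnd is assumed to be at least 1.

```c
/* The three fields of the connection state that slow start touches. */
struct ss_model {
	unsigned int snd_cwnd;        /* congestion window, must be >= 1 */
	unsigned int snd_cwnd_cnt;    /* progress counter toward the next increase */
	unsigned int snd_cwnd_clamp;  /* upper bound for snd_cwnd */
};

/* One call per ack, abc disabled, no limited slow start. */
void slow_start_model(struct ss_model *tp)
{
	/* cnt = snd_cwnd, so the counter always reaches the window size */
	tp->snd_cwnd_cnt += tp->snd_cwnd;
	while (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
		tp->snd_cwnd_cnt -= tp->snd_cwnd;
		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
			tp->snd_cwnd++;
	}
}
```

Starting from a window of 3, thirteen acks bring it to 16, which matches the run of slow start lines in the log above.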

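Last is the congestion avoidance handling. The main step is to check whether the number of segments counted under the current congestion window has reached the value w computed by the algorithm. If it has, the congestion window is enlarged; otherwise only snd_cwnd_cnt is incremented.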

void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
{
	//has the counter reached our threshold
	if (tp->snd_cwnd_cnt >= w) {
		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
			tp->snd_cwnd++;
		tp->snd_cwnd_cnt = 0;
	} else {
		//bump the counter
		tp->snd_cwnd_cnt++;
	}
}
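The same function can be run outside the kernel to see how w paces the growth. This is a direct restatement over a hypothetical three-field struct; in cubic, w would be the ca->cnt value produced by bictcp_update, here it is just a number passed in.

```c
struct ca_model {
	unsigned int snd_cwnd;
	unsigned int snd_cwnd_cnt;
	unsigned int snd_cwnd_clamp;
};

/* One call per ack during congestion avoidance. */
void cong_avoid_ai_model(struct ca_model *tp, unsigned int w)
{
	if (tp->snd_cwnd_cnt >= w) {
		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
			tp->snd_cwnd++;
		tp->snd_cwnd_cnt = 0;
	} else {
		tp->snd_cwnd_cnt++;
	}
}
```

With w fixed at 6, six calls only advance the counter and the seventh raises the window from 17 to 18 and resets the counter. A larger w therefore means slower growth, which is how cubic steers the window through ca->cnt.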
