kk Blog — General Fundamentals


SKB allocation and release

http://book.51cto.com/art/201206/345040.htm


I. The SKB cache pools

The networking code has two slab caches for allocating SKB descriptors; both are created in skb_init(), the SKB module's initialization function.

2048 void __init skb_init(void)  
2049 {  
2050     skbuff_head_cache = kmem_cache_create("skbuff_head_cache",  
2051                           sizeof(struct sk_buff),  
2052                           0,  
2053                           SLAB_HWCACHE_ALIGN|SLAB_PANIC,  
2054                           NULL, NULL);  
2055     skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",  
2056                         (2*sizeof(struct sk_buff)) +  
2057                         sizeof(atomic_t),  
2058                         0,  
2059                         SLAB_HWCACHE_ALIGN|SLAB_PANIC,  
2060                         NULL, NULL);  
2061 }

2050-2054 Create the skbuff_head_cache cache. In the common case, SKBs are allocated from this cache.

2055-2060 Create the skbuff_fclone_cache cache, whose objects are twice the size of an SKB descriptor (plus a reference count). If, at allocation time, it is already known that the SKB is likely to be cloned, it should be allocated from this cache: each allocation also carries a spare SKB reserved for a future clone, so cloning does not require a second allocation; the spare SKB is used directly. The point of this is efficiency.

The two caches differ only in the object size specified at creation. An object in skbuff_head_cache is sizeof(struct sk_buff) bytes, while an object in skbuff_fclone_cache is 2*sizeof(struct sk_buff)+sizeof(atomic_t) bytes: a pair of SKBs plus one reference count. The pair can be thought of as parent and child; they point at the same data buffer, and the reference count takes the value 0, 1 or 2 to indicate how many of the pair are in use, as shown in figure 3-12.


II. Allocating an SKB

1. alloc_skb()

alloc_skb() allocates an SKB. The data buffer and the SKB descriptor are two separate entities, which means that allocating one SKB requires two memory allocations: one for the data buffer and one for the SKB descriptor. __alloc_skb() calls kmem_cache_alloc_node() to get an sk_buff structure from a slab cache, then calls kmalloc_node_track_caller() to allocate the data buffer. The parameters are:

size: the length of the linear data area of the SKB being allocated.

gfp_mask: how the memory is allocated, see table 25-3.

fclone: whether cloning is anticipated; determines which slab cache the SKB is allocated from.

node: on NUMA (Non-Uniform Memory Access) systems, determines which node the SKB is allocated from. See the NUMA literature for details.

144 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,  
145                 int fclone, int node)  
146 {  
147     struct kmem_cache *cache;  
148     struct skb_shared_info *shinfo;  
149     struct sk_buff *skb;  
150     u8 *data;  
151  
152     cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;  
153  
154     /* Get the HEAD */  
155     skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);  
156     if (!skb)  
157         goto out;  
158  
159     /* Get the DATA. Size must match skb_add_mtu(). */  
160     size = SKB_DATA_ALIGN(size);  
161     data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),  
162             gfp_mask, node);  
163     if (!data)  
164         goto nodata;  
165  
166     memset(skb, 0, offsetof(struct sk_buff, truesize));  
167     skb->truesize = size + sizeof(struct sk_buff);  
168     atomic_set(&skb->users, 1);  
169     skb->head = data;  
170     skb->data = data;  
171     skb->tail = data;  
172     skb->end  = data + size;  
173     /* make sure we initialize shinfo sequentially */  
174     shinfo = skb_shinfo(skb);  
175     atomic_set(&shinfo->dataref, 1);  
176     shinfo->nr_frags  = 0;  
177     shinfo->gso_size = 0;  
178     shinfo->gso_segs = 0;  
179     shinfo->gso_type = 0;  
180     shinfo->ip6_frag_id = 0;  
181     shinfo->frag_list = NULL;  
182  
183     if (fclone) {  
184         struct sk_buff *child = skb + 1;  
185         atomic_t *fclone_ref = (atomic_t *) (child + 1);  
186  
187         skb->fclone = SKB_FCLONE_ORIG;  
188         atomic_set(fclone_ref, 1);  
189  
190         child->fclone = SKB_FCLONE_UNAVAILABLE;  
191     }  
192 out:  
193     return skb;  
194 nodata:  
195     kmem_cache_free(cache, skb);  
196     skb = NULL;  
197     goto out;  
198 }

152 Choose the slab cache to allocate the SKB from, according to the fclone parameter.

155 Call kmem_cache_alloc_node() to allocate an SKB from the chosen cache. __GFP_DMA is masked out of the allocation flags here so that the SKB descriptor is not allocated from the DMA zone: that zone is small and reserved for specific uses, so there is no reason to spend it on descriptors. The later data-buffer allocation does not strip GFP_DMA, because the data buffer may well need to live in the DMA zone so that the hardware can DMA into it directly; see lines 161-162.

160 Before allocating the data buffer, force the requested buffer size to be aligned.

161-165 Call kmalloc_node_track_caller() to allocate the data buffer. Its length is size plus sizeof(struct skb_shared_info), because an skb_shared_info structure sits immediately after the data area.

168-181 Initialize the freshly allocated SKB descriptor and its skb_shared_info structure.

183-191 If the descriptor was allocated from skbuff_fclone_cache, additionally set the parent descriptor's fclone to SKB_FCLONE_ORIG, meaning it can be cloned; set the child descriptor's fclone to SKB_FCLONE_UNAVAILABLE, meaning the clone has not been created yet; and set the shared reference count to 1.

The resulting SKB structure is shown in figure 3-13; in the middle of the memory block on the right you can see the padding introduced by the alignment. Note that __alloc_skb() is rarely called directly; it is normally reached through wrappers such as __netdev_alloc_skb(), alloc_skb() and alloc_skb_fclone().

2. dev_alloc_skb()

dev_alloc_skb() is another buffer allocation function, typically used by device drivers in interrupt context. It is a wrapper around alloc_skb(); since it is called from interrupt handlers, the allocation must be atomic (GFP_ATOMIC).

1124 static inline struct sk_buff *dev_alloc_skb(unsigned int length)  
1125 {  
1126     return __dev_alloc_skb(length, GFP_ATOMIC);  
1127 }  
... ...  
1103 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,  
1104                           gfp_t gfp_mask)  
1105 {  
1106     struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);  
1107     if (likely(skb))  
1108         skb_reserve(skb, NET_SKB_PAD);  
1109     return skb;  
1110 }

1108 Call skb_reserve() to reserve NET_SKB_PAD bytes between skb->head and skb->data. NET_SKB_PAD is defined in skbuff.h with the value 16. This space will be filled with the hardware frame header, such as the 14-byte Ethernet header.

1126 Allocate with GFP_ATOMIC priority: the allocation is an atomic operation and must not sleep.


III. Releasing an SKB

dev_kfree_skb() and kfree_skb() release an SKB, returning it to its slab cache. kfree_skb() can be called directly or through the wrapper dev_kfree_skb(), which is simply a macro around kfree_skb(), generally used by device drivers; its counterpart is dev_alloc_skb(). These functions free memory only when skb->users is 1; otherwise they just decrement skb->users. So if an SKB has three users, memory is freed only on the third call to dev_kfree_skb() or kfree_skb(). The kfree_skb() flow is shown in figure 3-14.

Figure 3-14 shows the steps of releasing an SKB:

1) kfree_skb() checks the sk_buff reference count users. If it is not 1, other users still hold the SKB after this release, so the count is decremented and the function returns; otherwise no other user holds the sk_buff, and __kfree_skb() is called to free it.

2) The SKB descriptor holds a reference to a dst_entry structure; when releasing the SKB, dst_release() is called to decrement the dst_entry reference count.

3) If the SKB's destructor was set, the corresponding function is invoked.

4) An SKB descriptor is associated with a memory block that holds the actual data, i.e. the data area. If scatter-gather I/O data exists, the skb_shared_info structure at the bottom of that data area also holds pointers to the scatter-gather fragments, whose memory must be released as well. Finally the SKB descriptor itself must be returned to the skbuff_head_cache cache. The freeing is handled by kfree_skbmem(), as follows:

If the SKB was not cloned, or the payload is not separately referenced, free the SKB's data buffer, including the buffers holding scatter-gather I/O data, and the SKB descriptor.

If releasing a parent SKB descriptor allocated from skbuff_fclone_cache and the clone count is 1, free the parent descriptor.

If releasing a child SKB descriptor allocated from skbuff_fclone_cache, set the parent SKB's fclone field to SKB_FCLONE_UNAVAILABLE, and free the child descriptor when the clone count is 1.


IV. Data reservation and alignment

Data reservation and alignment are handled mainly by skb_reserve(), skb_put(), skb_push() and skb_pull().

1. skb_reserve()

skb_reserve() reserves space at the head of the data buffer, typically to make room for a protocol header or to align on some boundary. It moves no data into or out of the buffer; it merely updates the buffer's two pointers, data and tail, which mark the start and end of the payload. Figure 3-15 shows these two pointers before and after a call to skb_reserve().

Note that skb_reserve() may only be used on an empty SKB; it is normally called right after allocation, while data and tail still both point at the start of the data area, as in figure 3-15a. For example, the receive routine of an Ethernet driver will issue skb_reserve(skb, 2) after allocating the SKB and before filling the buffer with data: the Ethernet header is 14 bytes long, and adding 2 bytes makes exactly 16, a 16-byte boundary, which is why most Ethernet drivers reserve 2 bytes ahead of the packet.

As the SKB is passed down the protocol stack, each protocol layer moves the skb->data pointer up toward the buffer head, copies in its own header, and updates skb->len. These operations all use the functions shown in figure 3-15.

2. skb_push()

skb_push() prepends a block of data at the front of the data area. Like skb_reserve(), it does not actually copy any data into the buffer; it only moves the data pointer down and updates the length (the tail pointer is untouched). The data itself is copied into the buffer by other functions.

The steps are as follows:

1) When TCP sends data, it allocates an SKB based on conditions such as the TCP maximum segment size (MSS) and whether scatter-gather I/O is supported.

2) TCP reserves enough headroom in the data buffer for all the headers. MAX_TCP_HEADER is the sum of the header lengths of every layer, assuming the worst case: since the TCP layer does not know which interface will transmit the packet, it reserves the maximum header length for each layer, and even allows for multiple IP headers, which occur when the kernel is built with IP-over-IP support.

3) The TCP payload is copied into the data buffer. Note that figure 3-16 is just one example; the payload may be organized differently, for instance as fragments. Later chapters show what a fragmented data buffer looks like.

4) The TCP layer adds the TCP header.

5) The SKB is passed to the IP layer, which adds the IP header to the packet.

6) The SKB is passed to the link layer, which adds the link-layer header.

3. skb_put()

skb_put() moves the tail pointer, which marks the end of the data area, down by len bytes, growing the data area by len bytes at the end, and updates the data length len. Figure 3-17 shows the SKB before and after a call to skb_put().

4. skb_pull()

skb_pull() moves the data pointer down, discarding len bytes at the front of the data area. It is typically used on the receive path, as a packet is passed up through the layers, so that each layer can skip the header of the layer below. Figure 3-18 shows the SKB before and after a call to skb_pull().


V. Cloning and copying SKBs

1. skb_clone()

When an SKB will be manipulated independently by several users who only modify fields of the SKB descriptor, such as h and nh, the kernel need not duplicate the full SKB descriptor and data buffer for each user; for performance it clones instead. Cloning copies only the SKB descriptor and increments the data buffer's reference count, so that the shared data is not freed prematurely. This is done by skb_clone(). One scenario that uses cloning is a receive routine that must deliver a packet to multiple recipients, such as packet handlers or one or more network modules. The cloned value of both the original and the cloned descriptor is set to 1; the clone's users count is set to 1, so it is freed on its first release; and the data buffer reference count dataref is incremented, since one more descriptor now points at it.
Figure 3-19 shows a cloned SKB.

Figure 3-19 is an example with a scatter-gather I/O buffer: some of the data is kept in the fragment array frags. skb_share_check() checks the users reference count and, if that field shows the SKB is shared, clones a new SKB. Once an SKB has been cloned, the contents of its data buffer must no longer be modified, which also means that functions accessing the data need no locking. skb_cloned() can be used to test an skb's clone state.

432 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)  
433 {  
434     struct sk_buff *n;  
435  
436     n = skb + 1;  
437     if (skb->fclone == SKB_FCLONE_ORIG &&  
438         n->fclone == SKB_FCLONE_UNAVAILABLE) {  
439         atomic_t *fclone_ref = (atomic_t *) (n + 1);  
440         n->fclone = SKB_FCLONE_CLONE;  
441         atomic_inc(fclone_ref);  
442     } else {  
443         n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);  
444         if (!n)  
445             return NULL;  
446         n->fclone = SKB_FCLONE_UNAVAILABLE;  
447     }  
448  
449 #define C(x) n->x = skb->x  
450  
451     n->next = n->prev = NULL;  
452     n->sk = NULL;  
453     C(tstamp);  
454     C(dev);  
455     C(h);  
456     C(nh);  
457     C(mac);  
458     C(dst);  
459     dst_clone(skb->dst);  
460     C(sp);  
461 #ifdef CONFIG_INET  
462     secpath_get(skb->sp);  
463 #endif  
464     memcpy(n->cb, skb->cb, sizeof(skb->cb));  
465     C(len);  
466     C(data_len);  
467     C(csum);  
468     C(local_df);  
469     n->cloned = 1;  
470     n->nohdr = 0;  
471     C(pkt_type);  
472     C(ip_summed);  
473     C(priority);  
474 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)  
475     C(ipvs_property);  
476 #endif  
477     C(protocol);  
478     n->destructor = NULL;  
479     C(mark);  
480 #ifdef CONFIG_NETFILTER  
481     C(nfct);  
482     nf_conntrack_get(skb->nfct);  
483     C(nfctinfo);  
484 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)  
485     C(nfct_reasm);  
486     nf_conntrack_get_reasm(skb->nfct_reasm);  
487 #endif  
488 #ifdef CONFIG_BRIDGE_NETFILTER  
489     C(nf_bridge);  
490     nf_bridge_get(skb->nf_bridge);  
491 #endif  
492 #endif /*CONFIG_NETFILTER*/  
493 #ifdef CONFIG_NET_SCHED  
494     C(tc_index);  
495 #ifdef CONFIG_NET_CLS_ACT  
496     n->tc_verd = SET_TC_VERD(skb->tc_verd,0);  
497     n->tc_verd = CLR_TC_OK2MUNGE(n->tc_verd);  
498     n->tc_verd = CLR_TC_MUNGED(n->tc_verd);  
499     C(input_dev);  
500 #endif  
501     skb_copy_secmark(n, skb);  
502 #endif  
503     C(truesize);  
504     atomic_set(&n->users, 1);  
505     C(head);  
506     C(data);  
507     C(tail);  
508     C(end);  
509  
510     atomic_inc(&(skb_shinfo(skb)->dataref));  
511     skb->cloned = 1;  
512  
513     return n;  
514 }

436-438 The fclone flags decide which pool the descriptor came from. If two adjacent parent/child SKB descriptors have fclone set to SKB_FCLONE_ORIG and SKB_FCLONE_UNAVAILABLE respectively, they were allocated from the skbuff_fclone_cache pool and the parent has not been cloned yet, i.e. the child slot is still free, so the child is used. Otherwise a new SKB descriptor is allocated from the skbuff_head_cache pool for the clone.

451-508 Copy each field of the parent SKB descriptor into the corresponding field of the child.

504 Set the child descriptor's reference count users to 1.

510 Increment the data-area reference count, dataref, in the parent descriptor's skb_shared_info structure.

511 Set the parent descriptor's cloned member to 1, marking the SKB as cloned.

2. pskb_copy()

When a function needs to modify not only the SKB descriptor but also the data in the buffer, the data buffer must be copied as well. Here the programmer has two options. If the data to be modified lies between skb->head and skb->end, pskb_copy() can be used to copy just that part, as shown in figure 3-20.

3. skb_copy()

If the data in the scatter-gather I/O area must be modified too, skb_copy() must be used instead, as shown in figure 3-21. As seen in earlier sections, the skb_shared_info structure also contains an SKB list, frag_list; both pskb_copy() and skb_copy() treat this list the same way they treat the frags array.

587 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)  
588 {  
589     int headerlen = skb->data - skb->head;  
590     /*  
591      *    Allocate the copy buffer  
592      */  
593     struct sk_buff *n = alloc_skb(skb->end - skb->head + skb->data_len,  
594                       gfp_mask);  
595     if (!n)  
596         return NULL;  
597  
598     /* Set the data pointer */  
599     skb_reserve(n, headerlen);  
600     /* Set the tail pointer and length */  
601     skb_put(n, skb->len);  
602     n->csum         = skb->csum;  
603     n->ip_summed = skb->ip_summed;  
604  
605     if (skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len))  
606         BUG();  
607  
608     copy_skb_header(n, skb);  
609     return n;  
610 }

589-599 Allocate a new SKB, descriptor and data buffer included, then reserve as much headroom between the head and data pointers as the source data buffer has.

601 Set the new SKB's tail pointer and data length len to match the source SKB's.

605-608 Copy the data.

In discussing the various topics of this book, it is sometimes stressed that a particular function must clone or copy an SKB. When deciding whether to clone or copy, a subsystem's programmer cannot predict whether other kernel components will need the SKB's original data. The kernel is modular and its state changes unpredictably; no subsystem knows how the others manipulate a data buffer. Kernel programmers must therefore keep track of each subsystem's modifications to the data buffer and copy a new data buffer before modifying it, so that other subsystems that need the original data do not break.

Registration and initialization of kernel network devices (eth0...)

Finding the data structure of a device such as eth0:

# crash vmlinux

p init_net
which shows:
  dev_base_head = {
	next = 0xffff88003e48b070, 
	prev = 0xffff880037582070
  },
next is the dev->dev_list field inside a struct net_device *dev;
the offset of dev_list within net_device works out to 0x50 (may differ)

struct net_device 0xffff88003e48b020
then follow dev_list.next inside dev to the next net_device:
  dev_list = {
	next = 0xffff880037582070, 
	prev = 0xffffffff81b185b0
  },

Finding the XXX_adapter (e.g. ixgbe_adapter) behind a net_device

When the ixgbe module allocates its net_device, it passes the size of the ixgbe_adapter area that must be reserved to alloc_etherdev:

netdev = alloc_etherdev(sizeof(struct ixgbe_adapter));

adapter = netdev_priv(netdev);

static inline void *netdev_priv(const struct net_device *dev)
{
	return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}

So ixgbe_adapter sits right after the net_device structure, rounded up to the NETDEV_ALIGN boundary, at offset 0x6c0 (kernel-dependent).


http://blog.csdn.net/sfrysh/article/details/5736752

First, how memory is allocated for a network device.

The kernel allocates memory for a given network device with alloc_netdev:

#define alloc_netdev(sizeof_priv, name, setup) \
	alloc_netdev_mq(sizeof_priv, name, setup, 1)   
  
struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,   
		void (*setup)(struct net_device *), unsigned int queue_count)  

The first argument of alloc_netdev_mq is the size of the device's private data (mostly hardware parameters, such as interrupt numbers), i.e. the size of the priv area in net_device. The second argument is the device name, usually passed as a format string such as "eth%d", so that multiple NICs of the same type become eth0, eth1, ... (the kernel fills this in via dev_alloc_name). The third argument, setup, is a callback that initializes the net_device structure.

Normally, though, we do not call alloc_netdev directly; the kernel provides wrapper functions:

Here we only look at alloc_etherdev:

#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)   
struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count)   
{   
	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);   
}  

The wrapping here is essentially by NIC type, similar to a base class in OO terms; ether_setup() initializes the fields that are common to all network devices of the Ethernet type:

void ether_setup(struct net_device *dev)   
{   
	dev->header_ops      = &eth_header_ops;   
  
	dev->change_mtu      = eth_change_mtu;   
	dev->set_mac_address     = eth_mac_addr;   
	dev->validate_addr   = eth_validate_addr;   
  
	dev->type        = ARPHRD_ETHER;   
	dev->hard_header_len     = ETH_HLEN;   
	dev->mtu     = ETH_DATA_LEN;   
	dev->addr_len        = ETH_ALEN;   
	dev->tx_queue_len    = 1000; /* Ethernet wants good queues */  
	dev->flags       = IFF_BROADCAST|IFF_MULTICAST;   
  
	memset(dev->broadcast, 0xFF, ETH_ALEN);   
  
}  

Next, some details of registering a network device.

int register_netdev(struct net_device *dev)   
{   
	int err;   
  
	rtnl_lock();   
  
	/*  
	 * If the name is a format string the caller wants us to do a  
	 * name allocation.  
	 */  
	if (strchr(dev->name, '%')) {   
		// dev_alloc_name() picks and sets the final device name here.   
		err = dev_alloc_name(dev, dev->name);   
		if (err < 0)   
			goto out;   
	}   
	// Register this device on the global network device list; examined in detail below.   
	err = register_netdevice(dev);   
out:   
	rtnl_unlock();   
	return err;   
}  

The set of network devices forms a list that must support both convenient traversal of all devices and fast lookup of a specific one. For this, net_device contains the following three list heads (for the kernel data structures involved, search the usual references):

// Locate a device by ifindex   
struct hlist_node   index_hlist;   
// Locate a device by name   
struct hlist_node   name_hlist;   
// dev_list links the device into the global dev_base_head, described below.   
struct list_head    dev_list;  

Once a device registers successfully, the kernel's other components must be notified. This is done through the notifier chain of type netdev_chain, with the event NETDEV_REGISTER. Other components use register_netdevice_notifier to subscribe to the events they care about on this notifier chain.

Network device events (such as a device being opened or closed) reach user space through the rtmsg_ifinfo function, i.e. netlink with RTMGRP_LINK.

Each device also carries two kinds of state: the state field, a bitmap describing the queueing state, and the registration state.

The packet queueing policy is, in other words, QoS.

int register_netdevice(struct net_device *dev)   
{   
	struct hlist_head *head;   
	struct hlist_node *p;   
	int ret;   
	struct net *net;   
  
	BUG_ON(dev_boot_phase);   
	ASSERT_RTNL();   
  
	might_sleep();   
  
	/* When net_device's are persistent, this will be fatal. */  
	BUG_ON(dev->reg_state != NETREG_UNINITIALIZED);   
	BUG_ON(!dev_net(dev));   
	net = dev_net(dev);   
  
	// Initialize the related locks   
	spin_lock_init(&dev->addr_list_lock);   
	netdev_set_addr_lockdep_class(dev);   
	netdev_init_queue_locks(dev);   
  
	dev->iflink = -1;   
  
	/* Init, if this function is available */  
	if (dev->init) {   
		ret = dev->init(dev);   
		if (ret) {   
			if (ret > 0)   
				ret = -EIO;   
			goto out;   
		}   
	}   
  
	if (!dev_valid_name(dev->name)) {   
		ret = -EINVAL;   
		goto err_uninit;   
	}   
	// Assign the device a unique identifier.   
	dev->ifindex = dev_new_index(net);   
	if (dev->iflink == -1)   
		dev->iflink = dev->ifindex;   
  
	// Check the global hash list for a duplicate name   
	head = dev_name_hash(net, dev->name);   
	hlist_for_each(p, head) {   
		struct net_device *d   
			= hlist_entry(p, struct net_device, name_hlist);   
		if (!strncmp(d->name, dev->name, IFNAMSIZ)) {   
			ret = -EEXIST;   
			goto err_uninit;   
		}   
	}   
	// The checks below validate certain feature-flag combinations.   
	/* Fix illegal checksum combinations */  
	if ((dev->features & NETIF_F_HW_CSUM) &&   
		(dev->features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {   
		printk(KERN_NOTICE "%s: mixed HW and IP checksum settings.\n",   
			   dev->name);   
		dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);   
	}   
  
	if ((dev->features & NETIF_F_NO_CSUM) &&   
		(dev->features & (NETIF_F_HW_CSUM|NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {   
		printk(KERN_NOTICE "%s: mixed no checksumming and other settings.\n",   
			   dev->name);   
		dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM|NETIF_F_HW_CSUM);   
	}   
  
  
	/* Fix illegal SG+CSUM combinations. */  
	if ((dev->features & NETIF_F_SG) &&   
		!(dev->features & NETIF_F_ALL_CSUM)) {   
		printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",   
			   dev->name);   
		dev->features &= ~NETIF_F_SG;   
	}   
  
	/* TSO requires that SG is present as well. */  
	if ((dev->features & NETIF_F_TSO) &&   
		!(dev->features & NETIF_F_SG)) {   
		printk(KERN_NOTICE "%s: Dropping NETIF_F_TSO since no SG feature.\n",   
			   dev->name);   
		dev->features &= ~NETIF_F_TSO;   
	}   
	if (dev->features & NETIF_F_UFO) {   
		if (!(dev->features & NETIF_F_HW_CSUM)) {   
			printk(KERN_ERR "%s: Dropping NETIF_F_UFO since no "  
					"NETIF_F_HW_CSUM feature.\n",   
							dev->name);   
			dev->features &= ~NETIF_F_UFO;   
		}   
		if (!(dev->features & NETIF_F_SG)) {   
			printk(KERN_ERR "%s: Dropping NETIF_F_UFO since no "  
					"NETIF_F_SG feature.\n",   
					dev->name);   
			dev->features &= ~NETIF_F_UFO;   
		}   
	}   
  
	/* Enable software GSO if SG is supported. */  
	if (dev->features & NETIF_F_SG)   
		dev->features |= NETIF_F_GSO;   
  
	// Initialize the device's kobject and create the related sysfs entries   
	netdev_initialize_kobject(dev);   
	ret = netdev_register_kobject(dev);   
	if (ret)   
		goto err_uninit;   
	// Set the registration state.   
	dev->reg_state = NETREG_REGISTERED;   
  
	/*  
	 *  Default initial state at registry is that the  
	 *  device is present.  
	 */  
  
	// Set the queueing state bit.   
	set_bit(__LINK_STATE_PRESENT, &dev->state);   
	// Initialize the queueing discipline   
	dev_init_scheduler(dev);   
	dev_hold(dev);   
	// Link the device into the global lists; this function is described next   
	list_netdevice(dev);   
  
	/* Notify protocols, that a new device appeared. */  
	// Notify the kernel's other subsystems via the netdev_chain.   
	ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);   
	ret = notifier_to_errno(ret);   
	if (ret) {   
		rollback_registered(dev);   
		dev->reg_state = NETREG_UNREGISTERED;   
	}   
  
out:   
	return ret;   
  
err_uninit:   
	if (dev->uninit)   
		dev->uninit(dev);   
	goto out;   
}  

Note the global variable struct net init_net; it holds the global name hlist, the index hlist, and the global network device list.

From the net structure we only need three lists here:

// Device list   
struct list_head    dev_base_head;   
// hlist indexed by name   
struct hlist_head   *dev_name_head;   
// hlist indexed by ifindex   
struct hlist_head   *dev_index_head;  
static int list_netdevice(struct net_device *dev)   
{   
	struct net *net = dev_net(dev);   
  
	ASSERT_RTNL();   
  
	write_lock_bh(&dev_base_lock);   
	// Add to the global device list   
	list_add_tail(&dev->dev_list, &net->dev_base_head);   
	// Add to the global name hlist and index hlist   
	hlist_add_head(&dev->name_hlist, dev_name_hash(net, dev->name));   
	hlist_add_head(&dev->index_hlist, dev_index_hash(net, dev->ifindex));   
	write_unlock_bh(&dev_base_lock);   
	return 0;   
}  

When everything else is done, the registration function calls rtnl_unlock(), which in turn runs netdev_run_todo(), completing the registration. (Note that unregistering a device also goes through this function to finish the teardown.)

There is a global net_todo_list:

static LIST_HEAD(net_todo_list);  

The unregistration path calls this function:

static void net_set_todo(struct net_device *dev)   
{   
	list_add_tail(&dev->todo_list, &net_todo_list);   
}  

That is, the device about to be unregistered is appended to the todo_list.

void netdev_run_todo(void)   
{   
	struct list_head list;   
  
	/* Snapshot list, allow later requests */  
	// Replace net_todo_list with the local list (take a snapshot).   
	list_replace_init(&net_todo_list, &list);   
  
	__rtnl_unlock();   
	// Registration never calls net_set_todo(), so on register the list is empty and the loop is skipped.   
	while (!list_empty(&list)) {   
		// Get the current device object via its todo_list entry.   
		struct net_device *dev   
			= list_entry(list.next, struct net_device, todo_list);   
		// Remove it from the todo list.   
		list_del(&dev->todo_list);   
  
  
		if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {   
			printk(KERN_ERR "network todo '%s' but state %d\n",   
				   dev->name, dev->reg_state);   
			dump_stack();   
			continue;   
		}   
		// Set the registration state to NETREG_UNREGISTERED.   
		dev->reg_state = NETREG_UNREGISTERED;   
		// Run the flush function on every CPU.   
		on_each_cpu(flush_backlog, dev, 1);   
  
		// Wait until every subsystem referencing this device releases it, i.e. the refcount drops to 0.   
		netdev_wait_allrefs(dev);   
  
		/* paranoia */  
		BUG_ON(atomic_read(&dev->refcnt));   
		WARN_ON(dev->ip_ptr);   
		WARN_ON(dev->ip6_ptr);   
		WARN_ON(dev->dn_ptr);   
  
		if (dev->destructor)   
			dev->destructor(dev);   
  
		/* Free network device */  
		kobject_put(&dev->dev.kobj);   
	}   
}  

Now look at netdev_wait_allrefs(); first its call flow:

static void netdev_wait_allrefs(struct net_device *dev)   
{   
	unsigned long rebroadcast_time, warning_time;   
  
	rebroadcast_time = warning_time = jiffies;   
	while (atomic_read(&dev->refcnt) != 0) {   
		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {   
			rtnl_lock();   
  
			// Send NETDEV_UNREGISTER on netdev_chain so each submodule can release its resources   
			/* Rebroadcast unregister notification */  
			call_netdevice_notifiers(NETDEV_UNREGISTER, dev);   
  
			if (test_bit(__LINK_STATE_LINKWATCH_PENDING,   
					 &dev->state)) {   
				/* We must not have linkwatch events  
				 * pending on unregister. If this  
				 * happens, we simply run the queue  
				 * unscheduled, resulting in a noop  
				 * for this device.  
				 */  
				linkwatch_run_queue();   
			}   
  
			__rtnl_unlock();   
  
			rebroadcast_time = jiffies;   
		}   
  
		msleep(250);   
  
		if (time_after(jiffies, warning_time + 10 * HZ)) {   
			printk(KERN_EMERG "unregister_netdevice: "  
				   "waiting for %s to become free. Usage "  
				   "count = %d\n",   
				   dev->name, atomic_read(&dev->refcnt));   
			warning_time = jiffies;   
		}   
	}   
}  

TCP TSO/GSO handling (part 2)

http://book.51cto.com/art/201206/345021.htm

Some network device hardware can take over tasks traditionally done by the CPU, the most common example being the computation of layer-3 and layer-4 checksums. Some devices can even maintain the layer-4 protocol state machine and perform segmentation or fragmentation in hardware, so what the transport layer hands down through the network layer to the device may be a GSO segment; see section 1.3.1. This section covers the SKB members used to support GSO.

unsigned short gso_size 

The MSS in effect when the GSO segment was generated; the length of a GSO segment is an integer multiple of the MSS appropriate to the socket sending it.

unsigned short gso_segs 

The number of segments the large GSO segment splits into, i.e. its length divided (rounding up) by gso_size.

unsigned short gso_type 

The GSO types supported by the data in this SKB, see table 3-5.

Table 3-5 gso_type values

gso_type          Description
SKB_GSO_TCPV4     TCP segmentation offload over IPv4
SKB_GSO_UDP       UDP fragmentation offload over IPv4
SKB_GSO_DODGY     The datagram comes from an untrusted source
SKB_GSO_TCP_ECN   TCP segmentation offload over IPv4 with the CWR bit set in the TCP header; see section 29.4 for CWR
SKB_GSO_TCPV6     TCP segmentation offload over IPv6


http://blog.csdn.net/majieyue/article/details/11881325

GSO extends the earlier TSO and has been merged into the upstream kernel. TSO supports only TCP, while GSO covers TCPv4, TCPv6, UDP and other protocols. Before GSO, skb_shinfo(skb) had two members, ufo_size and tso_size, giving the fragment length supported by UDP fragmentation offload and the segment length supported by TCP segmentation offload; both are now replaced by skb_shinfo(skb)->gso_size. Likewise, skb_shinfo(skb)->ufo_segs and skb_shinfo(skb)->tso_segs were replaced by skb_shinfo(skb)->gso_segs, the number of segments.

skb_shinfo(skb)->gso_type includes SKB_GSO_TCPV4, SKB_GSO_UDP and so on, and the NETIF_F_XXX flags gained matching bits indicating whether a device supports TSO or GSO, e.g.

NETIF_F_TSO = SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT
NETIF_F_UFO = SKB_GSO_UDPV4 << NETIF_F_GSO_SHIFT
#define NETIF_F_GSO_SHIFT 16

Before calling the driver's transmit routine, dev_hard_start_xmit uses netif_needs_gso to decide whether GSO must be done in software; if so, it calls dev_gso_segment:

/**
 *  dev_gso_segment - Perform emulated hardware segmentation on skb.
 *  @skb: buffer to segment
 *
 *  This function segments the given skb and stores the list of segments
 *  in skb->next.
 */
static int dev_gso_segment(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct sk_buff *segs;
	int features = dev->features & ~(illegal_highdma(dev, skb) ?
	                 NETIF_F_SG : 0);

	segs = skb_gso_segment(skb, features);

	/* Verifying header integrity only. */
	if (!segs)
	    return 0;

	if (IS_ERR(segs))
	    return PTR_ERR(segs);

	skb->next = segs;
	DEV_GSO_CB(skb)->destructor = skb->destructor;
	skb->destructor = dev_gso_skb_destructor;

	return 0;
}

Before analyzing skb_gso_segment, look at the destructor handling. At this point the segmented skb is already a list of skbs chained through skb->next; the original skb->destructor is stashed in skb->cb, and skb->destructor is switched to dev_gso_skb_destructor.

dev_gso_skb_destructor releases each skb on skb->next with kfree_skb, and finally calls DEV_GSO_CB(skb)->destructor, the skb's original destructor, for the last cleanup.

skb_gso_segment is the function that emulates the NIC's segmentation in software.

struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
	struct packet_type *ptype;
	__be16 type = skb->protocol;
	int err;

	skb_reset_mac_header(skb);
	skb->mac_len = skb->network_header - skb->mac_header;
	__skb_pull(skb, skb->mac_len);

	if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
	    struct net_device *dev = skb->dev;
	    struct ethtool_drvinfo info = {};

	    if (dev && dev->ethtool_ops && dev->ethtool_ops->get_drvinfo)
	        dev->ethtool_ops->get_drvinfo(dev, &info);

	    WARN(1, "%s: caps=(0x%lx, 0x%lx) len=%d data_len=%d "
	        "ip_summed=%d",
	         info.driver, dev ? dev->features : 0L,
	         skb->sk ? skb->sk->sk_route_caps : 0L,
	         skb->len, skb->data_len, skb->ip_summed);

	    if (skb_header_cloned(skb) &&
	        (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
	        return ERR_PTR(err);

If the skb header is cloned, unshare it here
	}

If skb->ip_summed is not CHECKSUM_PARTIAL, print a warning: a GSO skb normally has ip_summed set to CHECKSUM_PARTIAL

	rcu_read_lock();
	list_for_each_entry_rcu(ptype,
	        &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
	    if (ptype->type == type && !ptype->dev && ptype->gso_segment) {
	        if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
	            err = ptype->gso_send_check(skb);
	            segs = ERR_PTR(err);
	            if (err || skb_gso_ok(skb, features))
	                break;
	            __skb_push(skb, (skb->data -
	                     skb_network_header(skb)));
	        }
	        segs = ptype->gso_segment(skb, features);
	        break;

Point skb->data at the network header, then call the matching gso_segment; for IPv4 this is inet_gso_segment, and the layer-4 gso_segment is invoked from inside inet_gso_segment
	    }
	}
	rcu_read_unlock();

	__skb_push(skb, skb->data - skb_mac_header(skb));

Point skb->data back at the MAC header

	return segs;
}


static struct sk_buff *inet_gso_segment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	struct iphdr *iph;
	const struct net_protocol *ops;
	int proto;
	int ihl;
	int id;
	unsigned int offset = 0;

	if (!(features & NETIF_F_V4_CSUM))
	    features &= ~NETIF_F_SG;
If the device lacks NETIF_F_V4_CSUM, treat it as lacking SG as well

	if (unlikely(skb_shinfo(skb)->gso_type &
	         ~(SKB_GSO_TCPV4 |
	           SKB_GSO_UDP |
	           SKB_GSO_DODGY |
	           SKB_GSO_TCP_ECN |
	           0)))
	    goto out;
Invalid gso_type: return an error

	if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
	    goto out;
The 20-byte IP header cannot be pulled: return an error

	iph = ip_hdr(skb);
	ihl = iph->ihl * 4;
	if (ihl < sizeof(*iph))
	    goto out;

	if (unlikely(!pskb_may_pull(skb, ihl)))
	    goto out;
The full IP header cannot be pulled: return an error

	__skb_pull(skb, ihl);
	skb_reset_transport_header(skb);
	iph = ip_hdr(skb);

OK, now we have the IP header


	id = ntohs(iph->id);

The IP packet's id


	proto = iph->protocol & (MAX_INET_PROTOS - 1);
	segs = ERR_PTR(-EPROTONOSUPPORT);

	rcu_read_lock();
	ops = rcu_dereference(inet_protos[proto]);
	if (likely(ops && ops->gso_segment))
	    segs = ops->gso_segment(skb, features);

For TCP this calls tcp_tso_segment; for UDP it calls udp4_ufo_fragment


	rcu_read_unlock();

	if (!segs || IS_ERR(segs))
	    goto out;

	skb = segs;
	do {
	    iph = ip_hdr(skb);
	    if (proto == IPPROTO_UDP) {
	        iph->id = htons(id);
	        iph->frag_off = htons(offset >> 3);
	        if (skb->next != NULL)
	            iph->frag_off |= htons(IP_MF);
	        offset += (skb->len - skb->mac_len - iph->ihl * 4);
	    } else
	        iph->id = htons(id++);
	    iph->tot_len = htons(skb->len - skb->mac_len);
	    iph->check = 0;
	    iph->check = ip_fast_csum(skb_network_header(skb), iph->ihl);
	} while ((skb = skb->next));

For each skb segment, fill in the IP header and compute the IP checksum. For TCP segmentation, the IP header's id increments per segment. For UDP fragmentation, the id stays the same and the growing offset is written out each time, which amounts to performing IP fragmentation

out:
	return segs;
}

Now look at the TCP segmentation function, tcp_tso_segment:

struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	struct tcphdr *th;
	unsigned thlen;
	unsigned int seq;
	__be32 delta;
	unsigned int oldlen;
	unsigned int mss;

	if (!pskb_may_pull(skb, sizeof(*th)))
	    goto out;

	th = tcp_hdr(skb);
	thlen = th->doff * 4;
	if (thlen < sizeof(*th))
	    goto out;

	if (!pskb_may_pull(skb, thlen))
	    goto out;

	oldlen = (u16)~skb->len;
	__skb_pull(skb, thlen);
Pull the TCP header off into the skb header area; oldlen keeps the ones' complement of the old skb->len, and skb->len now covers only the TCP payload

	mss = skb_shinfo(skb)->gso_size;
	if (unlikely(skb->len <= mss))
	    goto out;

	if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
	    /* Packet is from an untrusted source, reset gso_segs. */
	    int type = skb_shinfo(skb)->gso_type;

	    if (unlikely(type &
	             ~(SKB_GSO_TCPV4 |
	               SKB_GSO_DODGY |
	               SKB_GSO_TCP_ECN |
	               SKB_GSO_TCPV6 |
	               0) ||
	             !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))))
	        goto out;

	    skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
Recompute skb_shinfo(skb)->gso_segs from skb->len and the MSS value

	    segs = NULL;
	    goto out;
	}


	segs = skb_segment(skb, features);
	if (IS_ERR(segs))
	    goto out;
skb_segment is the real segmentation implementation; it is analyzed later.

	delta = htonl(oldlen + (thlen + mss));

	skb = segs;
	th = tcp_hdr(skb);
	seq = ntohl(th->seq);

	do {
	    th->fin = th->psh = 0;

	    th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
	                   (__force u32)delta));
	    if (skb->ip_summed != CHECKSUM_PARTIAL)
	        th->check =
	             csum_fold(csum_partial(skb_transport_header(skb),
	                        thlen, skb->csum));
The TCP checksum is computed for every segment.

	    seq += mss;
	    skb = skb->next;
	    th = tcp_hdr(skb);

	    th->seq = htonl(seq);

The sequence number is recomputed for every segment.


	    th->cwr = 0;
	} while (skb->next);

	delta = htonl(oldlen + (skb->tail - skb->transport_header) +
	          skb->data_len);
	th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
	            (__force u32)delta));
	if (skb->ip_summed != CHECKSUM_PARTIAL)
	    th->check = csum_fold(csum_partial(skb_transport_header(skb),
	                       thlen, skb->csum));

out:
	return segs;
}

The UDP fragmentation function is udp4_ufo_fragment.

struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	unsigned int mss;
	int offset;
	__wsum csum;

	mss = skb_shinfo(skb)->gso_size;
	if (unlikely(skb->len <= mss))
	    goto out;

	if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
	    /* Packet is from an untrusted source, reset gso_segs. */
	    int type = skb_shinfo(skb)->gso_type;

	    if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY) ||
	             !(type & (SKB_GSO_UDP))))
	        goto out;

	    skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);

	    segs = NULL;
	    goto out;
	}

	/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
	 * do checksum of UDP packets sent as multiple IP fragments.
	 */
	offset = skb->csum_start - skb_headroom(skb);
	csum = skb_checksum(skb, offset, skb->len - offset, 0);
	offset += skb->csum_offset;
	*(__sum16 *)(skb->data + offset) = csum_fold(csum);
	skb->ip_summed = CHECKSUM_NONE;

Compute the UDP checksum.

	/* Fragment the skb. IP headers of the fragments are updated in
	 * inet_gso_segment()
	 */
	segs = skb_segment(skb, features);
out:
	return segs;
}

UDP segmentation is essentially the same as IP fragmentation, with one extra step: computing the checksum.

Finally, let's analyze skb_segment.

struct sk_buff *skb_segment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = NULL;
	struct sk_buff *tail = NULL;
	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
	unsigned int mss = skb_shinfo(skb)->gso_size;
	unsigned int doffset = skb->data - skb_mac_header(skb);
	unsigned int offset = doffset;
	unsigned int headroom;
	unsigned int len;
	int sg = features & NETIF_F_SG;
	int nfrags = skb_shinfo(skb)->nr_frags;
	int err = -ENOMEM;
	int i = 0;
	int pos;

	__skb_push(skb, doffset);
	headroom = skb_headroom(skb);
	pos = skb_headlen(skb);

__skb_push makes skb->data point at the mac header; then compute the headroom and the linear length skb_headlen.

	do {
	    struct sk_buff *nskb;
	    skb_frag_t *frag;
	    int hsize;
	    int size;

	    len = skb->len - offset;
	    if (len > mss)
	        len = mss;
len is skb->len minus everything up to offset. Initially offset is just the length of the mac header + ip header + tcp header, so len is the tcp payload length. As segments are produced, offset grows by mss each time, so len is the payload length of each segment (the last segment's payload may be smaller than one mss).

	    hsize = skb_headlen(skb) - offset;

hsize is the linear area (skb_headlen) minus offset. If hsize is negative, the payload lives in the skb's frags and frag_list. Since offset keeps growing, hsize will eventually stay negative, unless the skb is completely linearized.


	    if (hsize < 0)
	        hsize = 0;

In this case the linear area holds no tcp payload; the segment's data must come from the frags/frag_list.


	    if (hsize > len || !sg)
	        hsize = len;

If hsize is larger than len, or scatter-gather is not supported, hsize is clamped to len; the segment's payload then still fits in the linear area.


	    if (!hsize && i >= nfrags) {
	        BUG_ON(fskb->len != len);

	        pos += len;
	        nskb = skb_clone(fskb, GFP_ATOMIC);
	        fskb = fskb->next;

	        if (unlikely(!nskb))
	            goto err;

	        hsize = skb_end_pointer(nskb) - nskb->head;
	        if (skb_cow_head(nskb, doffset + headroom)) {
	            kfree_skb(nskb);
	            goto err;
	        }

	        nskb->truesize += skb_end_pointer(nskb) - nskb->head -
	                  hsize;
	        skb_release_head_state(nskb);
	        __skb_push(nskb, doffset);
	    } else {

	        nskb = alloc_skb(hsize + doffset + headroom,
	                 GFP_ATOMIC);

	        if (unlikely(!nskb))
	            goto err;

	        skb_reserve(nskb, headroom);
	        __skb_put(nskb, doffset);

Allocate a new skb with a linear area of hsize + doffset + headroom bytes. Reserve headroom between skb->head and skb->data, then __skb_put makes room for doffset (mac header + ip header + tcp header) between skb->data and skb->tail; the hsize bytes of payload are put in later.


	    }

	    if (segs)
	        tail->next = nskb;
	    else
	        segs = nskb;
	    tail = nskb;


	    __copy_skb_header(nskb, skb);
	    nskb->mac_len = skb->mac_len;
Copy the sk_buff header fields of the old skb into the new one.

	    /* nskb and skb might have different headroom */
	    if (nskb->ip_summed == CHECKSUM_PARTIAL)
	        nskb->csum_start += skb_headroom(nskb) - headroom;
Adjust the position used for checksum calculation.

	    skb_reset_mac_header(nskb);
	    skb_set_network_header(nskb, skb->mac_len);
	    nskb->transport_header = (nskb->network_header +
	                  skb_network_header_len(skb));
	    skb_copy_from_linear_data(skb, nskb->data, doffset);

Copy doffset bytes starting at skb->data into nskb->data, i.e. copy over the mac, ip and tcp headers.


	    if (fskb != skb_shinfo(skb)->frag_list)
	        continue;

	    if (!sg) {
	        nskb->ip_summed = CHECKSUM_NONE;
	        nskb->csum = skb_copy_and_csum_bits(skb, offset,
	                            skb_put(nskb, len),
	                            len, 0);
	        continue;
	    }

	    frag = skb_shinfo(nskb)->frags;

	    skb_copy_from_linear_data_offset(skb, offset,
	                     skb_put(nskb, hsize), hsize);

If hsize is non-zero, copy hsize bytes into the nskb linear area.


	    while (pos < offset + len && i < nfrags) {

offset + len goes beyond pos, i.e. beyond the nskb linear data, so the frags are needed.


	        *frag = skb_shinfo(skb)->frags[i];
	        get_page(frag->page);
	        size = frag->size;

	        if (pos < offset) {
	            frag->page_offset += offset - pos;
	            frag->size -= offset - pos;
	        }


	        skb_shinfo(nskb)->nr_frags++;

	        if (pos + size <= offset + len) {
	            i++;
	            pos += size;
	        } else {
	            frag->size -= pos + size - (offset + len);
	            goto skip_fraglist;
	        }

	        frag++;
	    }

If the linear area is not enough, copy one mss worth of data into nskb's frags via the frag descriptors.


	    if (pos < offset + len) {
	        struct sk_buff *fskb2 = fskb;

	        BUG_ON(pos + fskb->len != offset + len);

	        pos += fskb->len;
	        fskb = fskb->next;

	        if (fskb2->next) {
	            fskb2 = skb_clone(fskb2, GFP_ATOMIC);
	            if (!fskb2)
	                goto err;
	        } else
	            skb_get(fskb2);

	        SKB_FRAG_ASSERT(nskb);
	        skb_shinfo(nskb)->frag_list = fskb2;
	    }

If the frags are exhausted and an mss still cannot be filled, the frag_list must be used. This code path is skipped here, since it is almost never taken.


skip_fraglist:
	    nskb->data_len = len - hsize;
	    nskb->len += nskb->data_len;
	    nskb->truesize += nskb->data_len;
	} while ((offset += len) < skb->len);

After finishing one nskb, continue with the next segment, until offset >= skb->len.

	return segs;

err:
	while ((skb = segs)) {
	    segs = skb->next;
	    kfree_skb(skb);
	}
	return ERR_PTR(err);
}

F-RTO — analyzing spurious timeouts

http://blog.csdn.net/zhangskd/article/details/7446441

F-RTO (Forward RTO-Recovery) is a way for a TCP sender to recover after a retransmission timeout. The main motivation of the algorithm is to recover efficiently from a spurious RTO.

The basic idea of F-RTO

The guideline behind F-RTO is, that an RTO either indicates a loss, or it is caused by an excessive delay in packet delivery while there still are outstanding segments in flight. If the RTO was due to delay, i.e. the RTO was spurious, acknowledgements for non-retransmitted segments sent before the RTO should arrive at the sender after the RTO occurred. If no such segments arrive, the RTO is concluded to be non-spurious and the conventional RTO recovery with go-back-N retransmissions should take place at the TCP sender.

To implement the principle described above, an F-RTO sender acts as follows: if the first ACK arriving after a RTO-triggered retransmission advances the window, transmit two new segments instead of continuing retransmissions. If also the second incoming acknowledgement advances the window, RTO is likely to be spurious, because the second ACK is triggered by an originally transmitted segment that has not been retransmitted after the RTO. If either one of the two acknowledgements after RTO is a duplicate ACK, the sender continues retransmissions similarly to the conventional RTO recovery algorithm.

When the retransmission timer expires, the F-RTO algorithm takes the following steps at the TCP sender. In the algorithm description below we use SND.UNA to indicate the first unacknowledged segment.

1.When the retransmission timer expires, retransmit the segment that triggered the timeout. As required by the TCP congestion control specifications, the ssthresh is adjusted to half of the number of currently outstanding segments. However, the congestion window is not yet set to one segment, but the sender waits for the next two acknowledgements before deciding on what to do with the congestion window.

2.When the first acknowledgement after RTO arrives at the sender, the sender chooses the following actions depending on whether the ACK advances the window or whether it is a duplicate ACK.

(a)If the acknowledgement advances SND.UNA, transmit up to two new (previously unsent) segments. This is the main point in which the F-RTO algorithm differs from the conventional way of recovering from RTO. After transmitting the two new segments, the congestion window size is set to have the same value as ssthresh. In effect this reduces the transmission rate of the sender to half of the transmission rate before the RTO. At this point the TCP sender has transmitted a total of three segments after the RTO, similarly to the conventional recovery algorithm. If transmitting two new segments is not possible due to advertised window limitation, or because there is no more data to send, the sender may transmit only one segment. If no new data can be transmitted, the TCP sender follows the conventional RTO recovery algorithm and starts retransmitting the unacknowledged data using slow start.

(b)If the acknowledgement is duplicate ACK, set the congestion window to one segment and proceed with the conventional RTO recovery. Two new segments are not transmitted in this case, because the conventional RTO recovery algorithm would not transmit anything at this point either. Instead, the F-RTO sender continues with slow start and performs similarly to the conventional TCP sender in retransmitting the unacknowledged segments. Step 3 of the F-RTO algorithm is not entered in this case. A common reason for executing this branch is the loss of a segment, in which case the segments injected by the sender before the RTO may still trigger duplicate ACKs that arrive at the sender after the RTO.

3.When the second acknowledgement after the RTO arrives, either continue transmitting new data, or start retransmitting with the slow start algorithm, depending on whether new data was acknowledged.

(a)If the acknowledgement advances SND.UNA, continue transmitting new data following the congestion avoidance algorithm. Because the TCP sender has retransmitted only one segment after the RTO, this acknowledgement indicates that an originally transmitted segment has arrived at the receiver. This is regarded as a strong indication of a spurious RTO. However, since the TCP sender cannot surely know at this point whether the segment that triggered the RTO was actually lost, adjusting the congestion control parameters after the RTO is the conservative action. From this point on, the TCP sender continues as in the normal congestion avoidance.

If this algorithm branch is taken, the TCP sender ignores the send_high variable that indicates the highest sequence number transmitted so far. The send_high variable was proposed as a bugfix for avoiding unnecessary multiple fast retransmits when RTO expires during fast recovery with NewReno TCP. As the sender has not retransmitted other segments but the one that triggered RTO, the problem addressed by the bugfix cannot occur. Therefore, if there are duplicate ACKs arriving at the sender after the RTO, they are likely to indicate a packet loss, hence fast retransmit should be used to allow efficient recovery. Alternatively, if there are not enough duplicate ACKs arriving at the sender after a packet loss, the retransmission timer expires another time and the sender enters step 1 of this algorithm to detect whether the new RTO is spurious.

(b)If the acknowledgement is duplicate ACK, set the congestion window to three segments, continue with the slow start algorithm retransmitting unacknowledged segments. The duplicate ACK indicates that at least one segment other than the segment that triggered RTO is lost in the last window of data. There is no sufficient evidence that any of the segments was delayed. Therefore the sender proceeds with retransmissions similarly to the conventional RTO recovery algorithm, with the send_high variable stored when the retransmission timer expired to avoid unnecessary fast retransmits.

Main causes of RTOs:

(1)Sudden delays
The primary motivation of the F-RTO algorithm is to improve the TCP performance when sudden delays cause spurious retransmission timeouts.

(2)Packet losses
These timeouts occur mainly when retransmissions are lost, since lost original packets are usually recovered by fast retransmit.

(3)Bursty losses
Losses of several successive packets can result in a retransmission timeout.

Other causes of spurious RTOs include:

Wireless links may also suffer from link outages that cause persistent data loss for a period of time.
Other potential reasons for sudden delays that have been reported to trigger spurious RTOs include a delay due to tedious actions required to complete a hand-off or re-routing of packets to the new serving access point after the hand-off, arrival of competing traffic on a shared link with low bandwidth, and a sudden bandwidth degradation due to reduced resources on a wireless channel.

Causes of genuine RTOs:

A RTO-triggered retransmission is needed when a retransmission is lost, or when nearly a whole window of data is lost, thus making it impossible for the receiver to generate enough duplicate ACKs for triggering TCP fast retransmit.

Consequences of a spurious RTO

If no segments were lost but the retransmission timer expires spuriously, the segments retransmitted in the slow-start are sent unnecessarily. This phenomenon is particularly likely with the various wireless access network technologies that are prone to sudden delay spikes. The retransmission timer expires because of the delay, spuriously triggering the RTO recovery and the unnecessary retransmission of all unacknowledged segments. This happens because after the delay the ACKs for the original segments arrive at the sender one at a time but too late, because the TCP sender has already entered the RTO recovery. Therefore, each of the ACKs triggers the retransmission of segments for which the original ACKs will arrive after a while. This continues until the whole window of segments is eventually unnecessarily retransmitted. Furthermore, because a full window of retransmitted segments arrives unnecessarily at the receiver, it generates duplicate ACKs for these out-of-order segments. Later on, the duplicate ACKs unnecessarily trigger fast retransmit at the sender.

TCP uses the fast retransmit mechanism to trigger retransmissions after receiving three successive duplicate acknowledgements (ACKs). If for a certain time period the TCP sender does not receive ACKs that acknowledge new data, the TCP retransmission timer expires as a backoff mechanism. When the retransmission timer expires, the TCP sender retransmits the first unacknowledged segment assuming it was lost in the network. Because a retransmission timeout (RTO) can be an indication of severe congestion in the network, the TCP sender resets its congestion window to one segment and starts increasing it according to the slow start algorithm. However, if the RTO occurs spuriously and there still are segments outstanding in the network, a false slow start is harmful for the potentially congested network as it injects extra segments to the network at increasing rate.

A spurious RTO not only lowers throughput: because slow start is used after the presumed loss, packets are injected into the network quickly while the originally sent packets are still in flight, which may cause real congestion!

How about a reliable link-layer protocol? Since wireless networks are often subject to high packet loss rate due to corruption or hand-offs, reliable link-layer protocols are widely employed with wireless links. The link-layer receiver often aims to deliver the packets to the upper protocol layers in order, which implies that the later arriving packets are blocked until the head of the queue arrives successfully. Due to the strict link-layer ordering, the communication end points observe a pause in packet delivery that can cause a spurious TCP RTO, instead of getting out-of-order packets that could result in a false fast retransmit. Either way, interaction between TCP retransmission mechanisms and link-layer recovery can cause poor performance.

DSACK cannot solve this problem. If the unnecessary retransmissions occurred due to spurious RTO caused by a sudden delay, the acknowledgements with the DSACK information arrive at the sender only after the acknowledgements of the original segments. Therefore, the unnecessary retransmissions following the spurious RTO cannot be avoided by using DSACK. Instead, the suggested recovery algorithm using DSACK can only revert the congestion control parameters to the state preceding the spurious retransmissions.

F-RTO implementation

F-RTO is implemented (mainly) in four functions:
(1)tcp_use_frto() is used to determine if TCP can use F-RTO.

(2)tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is called when tcp_use_frto() showed green light.

(3)tcp_process_frto() handles incoming ACKs during F-RTO algorithm.

(4)tcp_enter_frto_loss() is called if there is not enough evidence to prove that the RTO is indeed spurious. It transfers the control from F-RTO to the conventional RTO recovery.

Deciding whether F-RTO can be used

When it is called: after a TCP segment times out and is about to be retransmitted, the retransmission-timer handler decides whether the F-RTO algorithm can be used.

void tcp_retransmit_timer (struct sock *sk)  
{  
	....  
  
	if (tcp_use_frto(sk)) {  
		tcp_enter_frto(sk);  
	} else {  
		tcp_enter_loss(sk);  
	}  
  
	....  
}

Conditions for using F-RTO:
(1) The tcp_frto sysctl is non-zero.
(2) MTU probing is not in use, since it conflicts with F-RTO.
(3) a. If sackfrto is enabled, F-RTO may be used.
b. If sackfrto is not enabled, nothing other than the head may have been retransmitted.

/* F-RTO can only be used if TCP has never retransmitted anything other than 
 * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here) 
 */  
int tcp_use_frto(struct sock *sk)  
{  
	const struct tcp_sock *tp = tcp_sk(sk);  
	const struct inet_connection_sock *icsk = inet_csk(sk);  
	struct sk_buff *skb;  
  
	if (! sysctl_tcp_frto)  
		return 0;  
  
	/* MTU probe and F-RTO won't really play nicely along currently */  
	if (icsk->icsk_mtup.probe_size)  
		return 0;  
  
	if (tcp_is_sackfrto(tp))  
		return 1;  
  
	/* Avoid expensive walking of rexmit queue if possible */  
	if (tp->retrans_out > 1)  
		return 0; /* must not have retransmitted anything other than the head */  
  
	skb = tcp_write_queue_head(sk);  
	if (tcp_skb_is_last(sk, skb))  
		return 1;  
	skb = tcp_write_queue_next(sk, skb); /* Skips head */  
	tcp_for_write_queue_from(skb, sk) {  
		if (skb == tcp_send_head(sk))  
			break;  
  
		if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)  
			return 0; /* nothing other than the head may have been retransmitted */  
  
		/* Short-circuit when first non-SACKed skb has been checked */  
		if (! (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))  
		break;  
	}  
	return 1;  
}  
  
static int tcp_is_sackfrto(const struct tcp_sock *tp)  
{  
	return (sysctl_tcp_frto == 0x2) && ! tcp_is_reno(tp);  
}

Entering the F-RTO state

With F-RTO enabled, a transmission timeout does not immediately enter the Loss state; instead, the connection first enters the Disorder state. The slow-start threshold is reduced while snd_cwnd is left unchanged for now. This happens before the head packet is retransmitted.

/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO recovery 
 * a bit and use heuristics in tcp_process_frto() to detect if the RTO was  
 * spurious. 
 */  
  
void tcp_enter_frto (struct sock *sk)  
{  
	const struct inet_connection_sock *icsk = inet_csk(sk);  
	struct tcp_sock *tp = tcp_sk(sk);  
	struct sk_buff *skb;  
  
	/* Do like tcp_enter_loss() would*/  
	if ((! tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) ||  
		tp->snd_una == tp->high_seq ||   
		((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) &&  
		! icsk->icsk_retransmits)) {  
  
		tp->prior_ssthresh = tcp_current_ssthresh(sk); /* save the old threshold */  
  
		if (tp->frto_counter) {   
			u32 stored_cwnd;  
			stored_cwnd = tp->snd_cwnd;  
			tp->snd_cwnd = 2;  
			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);  
			tp->snd_cwnd = stored_cwnd;  
		} else {  
			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* reduce the threshold */  
		}  
  
		tcp_ca_event(sk, CA_EVENT_FRTO); /* trigger the FRTO event */  
	}  
  
	tp->undo_marker = tp->snd_una;  
	tp->undo_retrans = 0;  
  
	skb = tcp_write_queue_head(sk);  
	if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)  
		tp->undo_marker = 0;  
  
	/* Clear the head's retransmission-related flags */  
	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) {  
		TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;  
		tp->retrans_out -= tcp_skb_pcount(skb);  
	}  
  
	tcp_verify_left_out(tp);  
  
	/* Too bad if TCP was application limited */  
	tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1);  
  
	/* Earlier loss recovery underway */  
	if (tcp_is_sackfrto(tp) && (tp->frto_counter ||   
		((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery | TCPF_CA_Loss))) &&  
		after(tp->high_seq, tp->snd_una)) {  
  
		tp->frto_highmark = tp->high_seq;  
  
	} else {  
		tp->frto_highmark = tp->snd_nxt;  
	}  
  
	tcp_set_ca_state(sk, TCP_CA_Disorder); /* set the congestion state */  
	tp->high_seq = tp->snd_nxt;  
	tp->frto_counter = 1; /* we have just entered the F-RTO state! */  
}

F-RTO algorithm processing

The F-RTO processing mainly happens after the timed-out packet has been retransmitted. When the sender receives an ACK, the ACK-processing path checks whether the connection is in the F-RTO phase; if so, tcp_process_frto() is called to do the F-RTO-phase handling.

static int tcp_ack (struct sock *sk, const struct sk_buff *skb, int flag)  
{  
	....  
  
	if (tp->frto_counter )  
		frto_cwnd = tcp_process_frto(sk, flag);  
  
	....  
}

F-RTO in 2.6.20

tcp_process_frto() decides whether the RTO was spurious, mainly based on the two ACKs following the RTO.

static void tcp_process_frto (struct sock *sk, u32 prior_snd_una)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	tcp_sync_left_out(tp);  
  
	/* RTO was caused by loss, start retransmitting in 
	 * go-back-N slow start. 
	 * Two cases are covered: 
	 * (1) this ACK is a dupack 
	 * (2) this ACK acknowledges the whole window 
	 * Either case indicates packet loss, so the conventional method 
	 * must be used. 
	 */  
	if (tp->snd_una == prior_snd_una ||   
		! before(tp->snd_una, tp->frto_highmark)) {  
  
		tcp_enter_frto_loss(sk);  
		return;  
	}  
  
	/* First ACK after RTO advances the window: allow two new  
	 * segments out. 
	 * frto_counter == 1 means the first valid ACK has arrived, so 
	 * reset the congestion window to let two more packets out during 
	 * the F-RTO phase; the Loss state has not been entered yet, so 
	 * new data may still be sent. 
	 */  
	if (tp->frto_counter == 1) {  
  
		tp->snd_cwnd = tcp_packets_in_flight(tp) + 2;  
  
	} else {  
  
		/* Also the second ACK after RTO advances the window. 
		 * The RTO was likely spurious. Reduce cwnd and continue 
		 * in congestion avoidance. 
		 * The second ACK is valid too: adjust the congestion window 
		 * and go straight into congestion avoidance, without 
		 * retransmitting any packets. 
		 */  
		tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);  
		tcp_moderate_cwnd(tp);  
	}  
 
	/* F-RTO affects the two new ACKs following the RTO. 
	 * At latest on the third ACK the TCP behavior is back to normal. 
	 * If two consecutive ACKs both acknowledge new data, the RTO was 
	 * spurious, so exit F-RTO. 
	 */  
	tp->frto_counter = (tp->frto_counter + 1) % 3;  
}

If the RTO is judged genuine rather than spurious, tcp_enter_frto_loss() is called to enter the RTO recovery phase and start slow start.

/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO, which 
 * indicates that we should follow the traditional RTO recovery, i.e. mark  
 * everything lost and do go-back-N retransmission. 
 */  
static void tcp_enter_frto_loss (struct sock *sk)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	struct sk_buff *skb;  
	int cnt = 0;  
  
	/* On entering the Loss state, zero the sacked/lost/fackets counters */  
	tp->sacked_out = 0;  
	tp->lost_out = 0;  
	tp->fackets_out = 0;  
  
	/* Walk the retransmit queue and re-mark LOST. Data transmitted 
	 * after the RTO occurred need not be marked LOST. 
	 */  
	sk_stream_for_retrans_queue(skb, sk) {  
		cnt += tcp_skb_pcount(skb);  
		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;  
  
		/* Packets that have not been SACKed must be marked LOST. */  
		if (! (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {  
			/* Do not mark those segments lost that were forward 
			 * transmitted after RTO. 
			 */  
			 if (! after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark))  
			 {  
				TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;  
				tp->lost_out += tcp_skb_pcount(skb);  
			 }  
  
		} else { /* already-SACKed packets need not be marked LOST */  
			tp->sacked_out += tcp_skb_pcount(skb);  
			tp->fackets_out = cnt;  
		}  
	}  
	tcp_sync_left_out(tp);  
  
	tp->snd_cwnd = tp->frto_counter + tcp_packets_in_flight(tp) + 1;  
	tp->snd_cwnd_cnt = 0;  
	tp->snd_cwnd_stamp = tcp_time_stamp;  
	tp->undo_marker = 0; /* no undo marker needed */  
	tp->frto_counter = 0; /* F-RTO is over */  
  
	/* Cap the reordering metric */  
	tp->reordering = min_t(unsigned int, tp->reordering, sysctl_tcp_reordering);  
	tcp_set_ca_state(sk, TCP_CA_Loss); /* enter the Loss state */  
	tp->high_seq = tp->frto_highmark; /* highest sequence sent at RTO time */  
	TCP_ECN_queue_cwr(tp); /* queue the explicit-congestion CWR signal */  
	clear_all_retrans_hints(tp);  
}

F-RTO in 3.2.12

F-RTO spurious RTO detection algorithm (RFC4138)
F-RTO affects during two new ACKs following RTO (well, almost, see inline comments). State (ACK number) is kept in frto_counter. When ACK advances window (but not to or beyond highest sequence sent before RTO) :
On First ACK, send two new segments out.
On second ACK, RTO was likely spurious. Do spurious response (response algorithm is not part of the F-RTO detection algorithm given in RFC4138 but can be selected separately).

Otherwise (basically on duplicate ACK), RTO was (likely) caused by a loss and TCP falls back to conventional RTO recovery. F-RTO allows overriding of Nagle, this is done using frto_counter states 2 and 3, when a new data segment of any size sent during F-RTO, state 2 is upgraded to 3.

Rationale: if the RTO was spurious, new ACKs should arrive from the original window even after we transmit two new data segments.

SACK version:
on first step, wait until first cumulative ACK arrives, then move to the second step. In second step, the next ACK decides.

static int tcp_process_frto(struct sock *sk, int flag)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	tcp_verify_left_out(tp);  
   
	/* Duplicate the behavior from Loss state (fastretrans_alert) */  
	if (flag & FLAG_DATA_ACKED)  
		inet_csk(sk)->icsk_retransmits = 0; /* reset the retransmission counter */  
   
	if ((flag & FLAG_NONHEAD_RETRANS_ACKED) ||  
		((tp->frto_counter >= 2) && (flag & FLAG_RETRANS_DATA_ACKED)))  
		tp->undo_marker = 0;  
   
	/* An ACK covering the whole pre-RTO window indicates packet loss */  
	if (! before(tp->snd_una, tp->frto_highmark)) {  
		tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 2 : 3), flag) ;  
		return 1;  
	}  
  
	/* Reno (non-SACK) handling */  
	if (! tcp_is_sackfrto(tp)) {   
		/* RFC4138 shortcoming in step2; should also have case c): 
		 * ACK isn't duplicate nor advances window, e.g., opposite dir 
		 * data, winupdate 
		 */  
		if (! (flag & FLAG_ANY_PROGRESS) && (flag & FLAG_NOT_DUP))  
			return 1; /* take no action, ignore */  
  
		if (! (flag & FLAG_DATA_ACKED)) { /* no new data acknowledged */  
			tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3), flag);  
			return 1;  
		}  
  
	} else { /* SACK handling */  
		/* Prevent sending of new data: the first ACK did not 
		 * acknowledge new data, so no new data may be sent; return 
		 * directly. 
		 */  
		if (! (flag & FLAG_DATA_ACKED) && (tp->frto_counter == 1)) {  
			tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp));  
			return 1;  
		}  
  
		/* If the second ACK acknowledges no new data either, the RTO 
		 * is judged genuine and F-RTO is abandoned. */  
		if ( (tp->frto_counter >= 2) &&   
			(! (flag & FLAG_FORWARD_PROGRESS) ||  
			((flag & FLAG_DATA_SACKED) && ! (flag & FLAG_ONLY_ORIG_SACKED))) {  
			/* RFC4138 shortcoming (see comment above) */  
  
			if (! (flag & FLAG_FORWARD_PROGRESS) &&   
				(flag & FLAG_NOT_DUP))  
				return 1;  
   
			tcp_enter_frto_loss(sk, 3, flag);  
			return 1;  
		}  
	}  
  
	if (tp->frto_counter == 1) {  
		/* tcp_may_send_now needs to see updated state */  
		tp->snd_cwnd = tcp_packets_in_flight(tp) + 2;  
		tp->frto_counter = 2;  
		  
		if (! tcp_may_send_now(sk))  
			tcp_enter_frto_loss(sk, 2, flag);  
		return 1;  
  
	} else {  
		switch (sysctl_tcp_frto_response) {  
		case 2: /* aggressive: restore the pre-RTO cwnd and ssthresh */  
			tcp_undo_spur_to_response(sk, flag);  
			break;  
  
		case 1: /* very conservative: ssthresh becomes (1-B)C, and cwnd shrinks again to (1-B)(1-B)C */  
			tcp_conservative_spur_to_response(sk);  
			break;  
  
		default:  
			/* conservative */  
			tcp_ratehalving_spur_to_response(sk);  
			break;  
		}  
  
		tp->frto_counter = 0; /* marks the end of the F-RTO algorithm */  
		tp->undo_marker = 0; /* clear the undo marker */  
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSPURIOUSRTOS);  
	}  
	return 0;   
}  
  
#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */  
#define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmit data was ACKed. */  
#define FLAG_RETRANS_DATA_ACKED 0x08 /* some of which was retransmitted.*/  
  
#define FLAG_ACKED (FLAG_DATA_ACKED | FLAG_SYN_ACKED)  
#define FLAG_FORWARD_PROGRESS (FLAG_ACKED | FLAG_DATA_SACKED)  
#define FLAG_ANY_PROGRESS (FLAG_FORWARD_PROGRESS | FLAG_SND_UNA_ADVANCED)  
   
#define FLAG_NOT_DUP (FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED)

The tcp_frto_response option

tcp_frto_response selects which function TCP uses to adjust ssthresh and the congestion window after a spurious RTO is detected. It takes three values:

(1) Value 2

Uses tcp_undo_spur_to_response(), a relatively aggressive response that restores both ssthresh and the congestion window to their pre-RTO values.

(2) Value 1

Uses tcp_conservative_spur_to_response(), a very conservative response.
With a reduction factor B and a pre-RTO window C, in the common case (the threshold adjustment varies by congestion-control algorithm)
we end up with ssthresh = (1-B)C and cwnd = (1-B)(1-B)C.

(3) Value 0 or anything else (the default is 0)

Uses the default tcp_ratehalving_spur_to_response(), also a conservative response.

static void tcp_undo_spur_to_response (struct sock *sk, int flag)  
{  
	/* With ECE set, enter the CWR state: ssthresh ends up unchanged, cwnd halved */  
	if (flag & FLAG_ECE)  
		tcp_ratehalving_spur_to_response(sk);  
	else  
	/* Undo both the ssthresh and cwnd adjustments, restoring the pre-RTO state */  
		tcp_undo_cwr(sk, true);  
}  
  
/* A conservative spurious RTO response algorithm: reduce cwnd 
 * using rate halving and continue in congestion_avoidance. 
 */  
static void tcp_ratehalving_spur_to_response(struct sock *sk)  
{  
	tcp_enter_cwr(sk, 0);  
}  
  
/* A very conservative spurious RTO response algorithm: reduce cwnd 
 * and continue in congestion avoidance. 
 */  
static void tcp_conservative_spur_to_response(struct tcp_sock *tp)  
{  
	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);  
	tp->snd_cwnd_cnt = 0;  
	tp->bytes_acked = 0;  
	/* 竟然又设置了显示拥塞标志,那窗口就还要减小到阈值的(1-B)! 
	 * 果然是非常保守。 
	 */  
	TCP_ECN_queue_cwr(tp);   
	tcp_moderate_cwnd(tp);  
}

If the RTO is judged to be genuine, tcp_enter_frto_loss() is called to handle it.

/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO, 
 * which indicates that we should follow the traditional RTO recovery, 
 * i.e. mark everything lost and do go-back-N retransmission. 
 */  
static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	struct sk_buff *skb;  
  
	tp->lost_out = 0;  
	tp->retrans_out = 0;  
  
	if (tcp_is_reno(tp))  
		tcp_reset_reno_sack(tp);  
  
	tcp_for_write_queue(skb, sk) {  
		if (skb == tcp_send_head(sk))  
			break;  
  
		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;  
		/*  
		 * Count the retransmission made on RTO correctly (only when 
		 * waiting for the first ACK and did not get it). 
		 */  
		if ((tp->frto_counter == 1) && !(flag & FLAG_DATA_ACKED)) {  
			/* For some reason this R-bit might get cleared? */  
			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)  
				tp->retrans_out += tcp_skb_pcount(skb);  
  
			/* enter this if branch just for the first segment */  
			flag |= FLAG_DATA_ACKED;  
		} else {  
  
			if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)  
				tp->undo_marker = 0;  
			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;  
		}  
  
		/* Marking forward transmissions that were made after RTO lost 
		 * can cause unnecessary retransmissions in some scenarios, 
		 * SACK blocks will mitigate that in some but not in all cases. 
		 * We used to not mark them but it was causing break-ups with 
		 * receivers that do only in-order receival. 
		 *  
		 * TODO: we could detect presence of such receiver and select 
		 * different behavior per flow. 
		 */  
		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {  
			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;  
			tp->lost_out += tcp_skb_pcount(skb);  
			tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;  
		}  
	}  
	tcp_verify_left_out(tp);  
  
	/* allowed_segments should be no more than 3 */  
	tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments;  
	tp->snd_cwnd_cnt = 0;  
	tp->snd_cwnd_stamp = tcp_time_stamp;  
	tp->frto_counter = 0; /* F-RTO is over */  
	tp->bytes_acked = 0;  
  
	/* Update the maximum length of the reordering queue. */  
	tp->reordering = min_t(unsigned int, tp->reordering,  
						sysctl_tcp_reordering);  
  
	tcp_set_ca_state(sk, TCP_CA_Loss); /* enter the Loss state */  
	tp->high_seq = tp->snd_nxt;  
	TCP_ECN_queue_cwr(tp); /* queue the explicit congestion (CWR) flag */  
	tcp_clear_all_retrans_hints(tp);  
}

Summary

Current kernels (3.2.12) use the F-RTO algorithm by default:
tcp_frto defaults to 2 and tcp_frto_response defaults to 0.

Implementing the TCP congestion state machine: tcp_fastretrans_alert

Implementation of the TCP Congestion State Machine (Part 1)
Implementation of the TCP Congestion State Machine (Part 2)
Implementation of the TCP Congestion State Machine (Part 3)


Implementation of the TCP Congestion State Machine (Part 1)

Scope: this article analyzes tcp_fastretrans_alert(), the main function implementing the TCP congestion state machine. Follow-up articles examine the important parts in more detail.

Kernel version: 2.6.37

Background

First, the relevant background.

Congestion states:

(1) Open: Normal state, no dubious events, fast path.
(2) Disorder: In all respects it is Open, but requires a bit more attention.
It is entered when we see some SACKs or dupacks. It is split off Open mainly to move some processing from the fast path to the slow one.
(3)CWR:cwnd was reduced due to some Congestion Notification event.
It can be ECN, ICMP source quench, local device congestion.
(4)Recovery:cwnd was reduced, we are fast-retransmitting.
(5)Loss:cwnd was reduced due to RTO timeout or SACK reneging.

tcp_fastretrans_alert() is entered:

(1)each incoming ACK, if state is not Open
(2)when arrived ACK is unusual, namely:
SACK
Duplicate ACK
ECN ECE

Counting packets in flight is pretty simple.

(1)in_flight = packets_out - left_out + retrans_out
packets_out is SND.NXT - SND.UNA counted in packets.
retrans_out is number of retransmitted segments.
left_out is number of segments left network, but not ACKed yet.

(2)left_out = sacked_out + lost_out
sacked_out:Packets, which arrived to receiver out of order and hence not ACKed. With SACK this number is simply amount of SACKed data. Even without SACKs it is easy to give pretty reliable estimate of this number, counting duplicate ACKs.

(3)lost_out:Packets lost by network. TCP has no explicit loss notification feedback from network(for now). It means that this number can be only guessed. Actually, it is the heuristics to predict lossage that distinguishes different algorithms.
F.e. after RTO, when all the queue is considered as lost, lost_out = packets_out and in_flight = retrans_out.

Essentially, we have now two algorithms counting lost packets.

1)FACK:It is the simplest heuristics. As soon as we decided that something is lost, we decide that all not SACKed packets until the most forward SACK are lost. I.e.
lost_out = fackets_out - sacked_out and left_out = fackets_out
It is absolutely correct estimate, if network does not reorder packets. And it loses any connection to reality when reordering takes place. We use FACK by default until reordering is suspected on the path to this destination.

2)NewReno:when Recovery is entered, we assume that one segment is lost (classic Reno). While we are in Recovery and a partial ACK arrives, we assume that one more packet is lost (NewReno).
This heuristics are the same in NewReno and SACK.
Imagine, that’s all! Forget about all this shamanism about CWND inflation deflation etc. CWND is real congestion window, never inflated, changes only according to classic VJ rules.

Really tricky (and requiring careful tuning) part of algorithm is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().

tcp_time_to_recover()

It determines the moment when we should reduce cwnd and, hence, slow down forward transmission. In fact, it determines the moment when we decide that hole is caused by loss, rather than by a reorder.

tcp_xmit_retransmit_queue()

It decides what we should retransmit to fill holes, caused by lost packets.

undo heuristics

And the most logically complicated part of algorithm is undo heuristics. We detect false retransmits due to both too early fast retransmit (reordering) and underestimated RTO, analyzing timestamps and D-SACKs. When we detect that some segments were retransmitted by mistake and CWND reduction was wrong, we undo window reduction and abort recovery phase. This logic is hidden inside several functions named tcp_try_undo_.

The main function

The TCP congestion state machine is implemented mainly in tcp_fastretrans_alert(), which is called from tcp_ack().

The function is divided into several stages:
A. FLAG_ECE: an ACK carrying the ECE flag was received.
B. Reneging SACKs: the ACK points into already-SACKed data. If so, handle it like a timeout and return.
C. State is not Open: loss was detected, so the lost packets must be marked; this tells us which packets to retransmit.
D. Consistency check (left_out > packets_out would be an error).
E. How each state is exited, once snd_una >= high_seq.
F. Per-state processing and state entry.

The discussion below goes through these stages in detail.

/* Process an event, which can update packets-in-flight not trivially.
 * Main goal of this function is to calculate new estimate for left_out,
 * taking into account both packets sitting in receiver's buffer and
 * packets lost by network. 
 * 
 * Besides that it does CWND reduction, when packet loss is detected
 * and changes state of machine.
 *
 * It does not decide what to send, it is made in function
 * tcp_xmit_retransmit_queue().
 */

/* This function is entered:
 * (1) on each incoming ACK, if state is not Open
 * (2) when an arrived ACK is unusual, namely:
 *       SACK
 *       Duplicate ACK
 *       ECN ECE
 */

static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	/* Is this a duplicate ACK? */
	int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));

	/* tcp_fackets_out() returns the size of the hole; if it exceeds
	 * reordering, assume packets were lost. */
	int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) && 
				(tcp_fackets_out(tp) > tp->reordering));

	int fast_rexmit = 0, mib_idx;

	/* If packets_out is 0, there cannot be any sacked_out. */
	if (WARN_ON(!tp->packets_out && tp->sacked_out))
		tp->sacked_out = 0;

	/* The FACK count requires at least one SACKed segment. */
	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
		tp->fackets_out = 0;
 
	/* Now state machine starts.
	 * A. ECE, hence prohibit cwnd undoing, the reduction is required. 
	 * Forbid congestion-window undo and start reducing the window.
	 */
	if (flag & FLAG_ECE)
		tp->prior_ssthresh = 0;
	
	/* B. In all the states check for reneging SACKs. 
	 * Check for spurious SACKs: does the ACK acknowledge data that
	 * was already SACKed?
	 */
	if (tcp_check_sack_reneging(sk, flag))
		return;
	 
	/* C. Process data loss notification, provided it is valid. 
	 * (Why all of these conditions are needed is not obvious.)
	 * Not in the Open state and loss was detected, so the lost
	 * packets must be marked.
	 */
	if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOST) &&
	    before(tp->snd_una, tp->high_seq) &&
	    icsk->icsk_ca_state != TCP_CA_Open &&
	    tp->fackets_out > tp->reordering) {
		tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
	}

	/* D. Check consistency of the current state:
	 * ensure left_out <= packets_out.
	 */
	tcp_verify_left_out(tp); 

	/* E. Check state exit conditions. State can be terminated 
	 * when high_seq is ACKed. */
	if (icsk->icsk_ca_state == TCP_CA_Open) {
		/* In the Open state there can be no retransmitted,
		 * not yet acknowledged segments. */
		WARN_ON(tp->retrans_out != 0);
		/* Clear the send time of the first retransmission of the
		 * last retransmission phase. */
		tp->retrans_stamp = 0;

	} else if (!before(tp->snd_una, tp->high_seq)) { /* high_seq was ACKed */
		switch (icsk->icsk_ca_state) {
		case TCP_CA_Loss:
			icsk->icsk_retransmits = 0; /* reset the RTO retransmit count */

			/* Whether or not the undo succeeds, return to Open,
			 * unless SACK is not in use. */
			if (tcp_try_undo_recovery(sk)) 
				return;
			break;
 
		case TCP_CA_CWR:
			/* CWR is to be held until something *above* high_seq
			 * is ACKed, for the CWR bit to reach the receiver;
			 * i.e. undo requires snd_una > high_seq.
			 */
			if (tp->snd_una != tp->high_seq) {
				tcp_complete_cwr(sk);
				tcp_set_ca_state(sk, TCP_CA_Open);
			}
			break;

		case TCP_CA_Disorder:
			tcp_try_undo_dsack(sk);
			/* For the SACK case do not enter Open, to allow the
			 * undo logic to keep catching duplicate ACKs. */
			if (!tp->undo_marker || tcp_is_reno(tp) || 
			    tp->snd_una != tp->high_seq) {
				tp->undo_marker = 0;
				tcp_set_ca_state(sk, TCP_CA_Open);
			}
			break;

		case TCP_CA_Recovery:
			if (tcp_is_reno(tp))
				tcp_reset_reno_sack(tp); /* zero sacked_out */

			if (tcp_try_undo_recovery(sk))
				return;

			tcp_complete_cwr(sk);
			break;
		}
	}

	/* F. Process state. */
	switch (icsk->icsk_ca_state) {
	case TCP_CA_Recovery:
		if (!(flag & FLAG_SND_UNA_ADVANCED)) {
			if (tcp_is_reno(tp) && is_dupack)
				/* Increase sacked_out; check for reordering. */
				tcp_add_reno_sack(sk);
		} else 
			do_lost = tcp_try_undo_partial(sk, pkts_acked);
		break;

	case TCP_CA_Loss:
		/* A partial ACK arrived: reset the RTO retransmit count. */
		if (flag & FLAG_DATA_ACKED)
			icsk->icsk_retransmits = 0;

		if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
			tcp_reset_reno_sack(tp); /* zero sacked_out */

		if (!tcp_try_undo_loss(sk)) { /* try to undo the adjustment and return to Open */
			/* The undo failed: keep retransmitting the packets
			 * marked as lost. */
			tcp_moderate_cwnd(tp);
			tcp_xmit_retransmit_queue(sk);
			return;
		}

		if (icsk->icsk_ca_state != TCP_CA_Open)
			return;
 
		/* Loss is undone; fall through to process in Open state. */
	default:
		if (tcp_is_reno(tp)) {
			if (flag & FLAG_SND_UNA_ADVANCED)
				tcp_reset_reno_sack(tp);

			if (is_dupack)
				tcp_add_reno_sack(sk);
		}

		if (icsk->icsk_ca_state == TCP_CA_Disorder)
			tcp_try_undo_dsack(sk); /* D-SACK acknowledged all retransmitted segments */
		 
		/* Should we enter the Recovery state? */
		if (!tcp_time_to_recover(sk)) {
			/* This also decides whether to enter the Open,
			 * Disorder or CWR state. */
			tcp_try_to_open(sk, flag); 
			return;
		}

		/* MTU probe failure: don't reduce cwnd */
		/* (MTU probing is skipped here.) */
		......

		/* Otherwise enter Recovery state */
		if (tcp_is_reno(tp))
			mib_idx = LINUX_MIB_TCPRENORECOVERY;
		else
			mib_idx = LINUX_MIB_TCPSACKRECOVERY;

		NET_INC_STATS_BH(sock_net(sk), mib_idx);

		/* Before entering Recovery, save the data needed for undo. */
		tp->high_seq = tp->snd_nxt; /* used to decide when to exit */
		tp->prior_ssthresh = 0;
		tp->undo_marker = tp->snd_una;
		tp->undo_retrans = tp->retrans_out;
 
		if (icsk->icsk_ca_state < TCP_CA_CWR) {
			if (!(flag & FLAG_ECE))
				tp->prior_ssthresh = tcp_current_ssthresh(sk); /* save the old threshold */
			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* update the threshold */
			TCP_ECN_queue_cwr(tp);
		}

		tp->bytes_acked = 0;
		tp->snd_cwnd_cnt = 0;

		tcp_set_ca_state(sk, TCP_CA_Recovery); /* enter the Recovery state */
		fast_rexmit = 1; /* fast retransmit flag */
	}

	if (do_lost || (tcp_is_fack(tp) && tcp_head_timeout(sk)))
		/* Update the scoreboard: mark lost and timed-out packets,
		 * increasing lost_out. */
		tcp_update_scoreboard(sk, fast_rexmit); 

	/* Reduce snd_cwnd. */
	tcp_cwnd_down(sk, flag);
	tcp_xmit_retransmit_queue(sk);
}

The flag bits

#define FLAG_DATA 0x01  /* Incoming frame contained data. */  
#define FLAG_WIN_UPDATE 0x02  /* Incoming ACK was a window update. */  
#define FLAG_SND_UNA_ADVANCED 0x400  /* snd_una was changed (!= FLAG_DATA_ACKED) */  
#define FLAG_DATA_SACKED 0x20  /* New SACK. */  
#define FLAG_ECE 0x40  /* ECE in this ACK */  
#define FLAG_SACK_RENEGING 0x2000  /* snd_una advanced to a sacked seq */  
#define FLAG_DATA_LOST 0x80  /* SACK detected data lossage. */  
   
#define FLAG_DATA_ACKED 0x04  /* This ACK acknowledged new data. */  
#define FLAG_SYN_ACKED 0x10    /* This ACK acknowledged SYN. */  
#define FLAG_ACKED (FLAG_DATA_ACKED | FLAG_SYN_ACKED)  
   
#define FLAG_NOT_DUP (FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED)  /* defines a non-duplicate ACK */  
   
#define FLAG_FORWARD_PROGRESS (FLAG_ACKED | FLAG_DATA_SACKED)  
#define FLAG_ANY_PROGRESS (FLAG_FORWARD_PROGRESS | FLAG_SND_UNA_ADVANCED)  
#define FLAG_DSACKING_ACK 0x800  /* SACK blocks contained D-SACK info */  
  
struct tcp_sock {  
	...  
	u32 retrans_out; /* retransmitted but not yet acknowledged segments */  
	u32 retrans_stamp; /* send time of the first segment of the last 
			    * retransmission phase; used to decide whether a 
			    * congestion adjustment can be undone */  
  
	struct sk_buff *highest_sack; /* highest skb with SACK received,  
					*(validity guaranteed only if sacked_out > 0)  
					*/  
   ...  
}  
   
struct inet_connection_sock {  
	...  
	__u8 icsk_retransmits; /* number of RTO retransmissions */  
	...  
}

Whether SACK / Reno / FACK is in use

/* These functions determine how the current flow behaves in respect of SACK 
 * handling. SACK is negotiated with the peer, and therefore it can vary between 
 * different flows. 
 * 
 * tcp_is_sack - SACK enabled 
 * tcp_is_reno - No SACK 
 * tcp_is_fack - FACK enabled, implies SACK enabled 
 */  
  
static inline int tcp_is_sack (const struct tcp_sock *tp)  
{  
		return tp->rx_opt.sack_ok; /* SACK seen on SYN packet */  
}  
  
static inline int tcp_is_reno (const struct tcp_sock *tp)  
{  
		return ! tcp_is_sack(tp);  
}  
  
static inline int tcp_is_fack (const struct tcp_sock *tp)  
{  
		return tp->rx_opt.sack_ok & 2;  
}  
   
static inline void tcp_enable_fack(struct tcp_sock *tp)  
{  
		tp->rx_opt.sack_ok |= 2;  
}  
   
static inline int tcp_fackets_out(const struct tcp_sock *tp)  
{  
		return tcp_is_reno(tp) ? tp->sacked_out +1 : tp->fackets_out;  
}

(1) With FACK enabled, fackets_out = left_out, and
fackets_out = sacked_out + lost_out,
so: lost_out = fackets_out - sacked_out.
This is the aggressive loss estimate, i.e. FACK.

(2) Without FACK, assume only one packet was lost, so left_out = sacked_out + 1.
This is the conservative approach; it runs into trouble when many packets are lost.


Implementation of the TCP Congestion State Machine (Part 2)

Scope: this article analyzes, within the TCP congestion state machine, the handling of reneging SACKs and the detailed process of marking lost packets.
Kernel version: 2.6.37

Reneging SACKs

state B

If the received ACK points into an already-recorded SACK, the recorded SACKs do not reflect the receiver's real state: the receiver is either heavily congested or has a bug. In that case, proceed as for a retransmission timeout. Under normal logic an incoming ACK should not point into recorded SACK ranges, but past them, to data not yet received. Typically this means the receiver has dropped the segments it had saved in its out-of-order queue.

/* If ACK arrived pointing to a remembered SACK, it means that our remembered 
 * SACKs do not reflect real state of receiver i.e. receiver host is heavily congested 
 * or buggy. 
 * 
 * Do processing similar to RTO timeout. 
 */  
  
static int tcp_check_sack_reneging(struct sock *sk, int flag)  
{  
	if (flag & FLAG_SACK_RENEGING) {  
		struct inet_connection_sock *icsk = inet_csk(sk);  
		/* Record MIB information for SNMP. */  
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);  
		  
		/* Enter the Loss state; 1 means clear the SACKED flags.
		 * (This function was analyzed in an earlier post.) */  
		tcp_enter_loss(sk, 1);  
		  
		icsk->icsk_retransmits++; /* one more unrecovered RTO */  
   
		/* Retransmit the first packet on the send queue. */  
		tcp_retransmit_skb(sk, tcp_write_queue_head(sk));   
   
		/* Reset the retransmission timer. */  
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,   
						icsk->icsk_rto, TCP_RTO_MAX);  
		return 1;  
	}  
	return 0;  
}  
  
/** Returns the first packet on the send queue, or NULL. 
 * skb_peek - peek at the head of an &sk_buff_head 
 * @list_ : list to peek at  
 * 
 * Peek an &sk_buff. Unlike most other operations you must 
 * be careful with this one. A peek leaves the buffer on the 
 * list and someone else may run off with it. You must hold 
 * the appropriate locks or have a private queue to do this. 
 * 
 * Returns %NULL for an empty list or a pointer to the head element. 
 * The reference count is not incremented and the reference is therefore 
 * volatile. Use with caution. 
 */  
  
static inline struct sk_buff *skb_peek (const struct sk_buff_head *list_)  
{  
	struct sk_buff *list = ((const struct sk_buff *) list_)->next;  
	if (list == (struct sk_buff *) list_)  
		list = NULL;  
	return list;  
}  
  
static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)  
{  
	return skb_peek(&sk->sk_write_queue);  
}

tcp_retransmit_skb() retransmits a single packet; it eventually calls tcp_transmit_skb() to send it. That function is analyzed in a later post.

Resetting the retransmission timer

state B

/** inet_connection_sock - INET connection oriented sock 
 * 
 * @icsk_timeout: Timeout 
 * @icsk_retransmit_timer: Resend (no ack) 
 * @icsk_rto: Retransmission timeout 
 * @icsk_ca_ops: Pluggable congestion control hook 
 * @icsk_ca_state: Congestion control state 
 * @icsk_retransmits: Number of unrecovered [RTO] timeouts 
 * @icsk_pending: scheduled timer event 
 * @icsk_ack: Delayed ACK control data 
 */  
  
struct inet_connection_sock {  
	...  
	unsigned long icsk_timeout; /* packet timeout */  
	struct timer_list icsk_retransmit_timer; /* retransmission timer */  
	struct timer_list icsk_delack_timer; /* delayed-ACK timer */  
	__u32 icsk_rto; /* retransmission timeout */  
	const struct tcp_congestion_ops *icsk_ca_ops; /* congestion control algorithm */  
	__u8 icsk_ca_state; /* current congestion state */  
	__u8 icsk_retransmits; /* number of unrecovered timeouts */  
	__u8 icsk_pending; /* scheduled timer event */  
	...  
	struct {  
	   ...  
		__u8 pending; /* ACK is pending */  
		unsigned long timeout; /* Currently scheduled timeout */  
		...  
	} icsk_ack; /* delayed-ACK control block */  
	...  
	u32 icsk_ca_priv[16]; /* private data of the congestion control algorithm */  
	...  
#define ICSK_CA_PRIV_SIZE (16*sizeof(u32))  
}  
   
#define ICSK_TIME_RETRANS 1 /* Retransmit timer */  
#define ICSK_TIME_DACK 2 /* Delayed ack timer */  
#define ICSK_TIME_PROBE0 3 /* Zero window probe timer */  
  
/* 
 * Reset the retransmission timer 
 */  
static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,  
						unsigned long when,  
						const unsigned long max_when)  
{  
	struct inet_connection_sock *icsk = inet_csk(sk);  
  
	if (when > max_when) {  
#ifdef INET_CSK_DEBUG  
		pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",  
					sk, what, when, current_text_addr());  
#endif  
		when = max_when;  
	}  
	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {  
		icsk->icsk_pending = what;  
		icsk->icsk_timeout = jiffies + when; /*数据包超时时刻*/  
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);  
	} else if (what == ICSK_TIME_DACK) {  
		icsk->icsk_ack.pending |= ICSK_ACK_TIMER;  
		icsk->icsk_ack.timeout = jiffies + when; /*Delay ACK定时器超时时刻*/  
		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);  
	}  
#ifdef INET_CSK_DEBUG  
	else {  
		pr_debug("%s", inet_csk_timer_bug_msg);  
	}    
#endif       
}

Adding the LOST flag

state C

Q: We know packets were lost; how do we know which ones to retransmit?
A: tcp_mark_head_lost() marks lost packets with TCPCB_LOST, identifying the packets that need retransmission.
When SACK reveals lost segments, start from the head of the retransmit queue (or from where the last marking pass stopped) and add the LOST flag to every segment whose scoreboard is clear, until the number of LOST-marked segments reaches packets or the marked sequence numbers pass high_seq.

/* Mark head of queue up as lost. With RFC3517 SACK, the packets are counted 
 * against the sacked cnt, otherwise against the facked cnt. 
 * packets = fackets_out - reordering, i.e. the sum of sacked_out and lost_out, 
 * so the number of segments marked LOST must not exceed packets. 
 * high_seq: the highest sequence number that may be marked LOST. 
 * mark_head: if 1, only the first segment of the send queue is marked. 
 */  
  
static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	struct sk_buff *skb;  
	int cnt, oldcnt;  
	int err;  
	unsigned int mss;  
  
	/* No more segments can be marked lost than were sent. */  
	WARN_ON(packets > tp->packets_out);  
  
	/* Some segments were already marked lost. */  
	if (tp->lost_skb_hint) {  
		skb = tp->lost_skb_hint; /* the next segment to mark */  
		cnt = tp->lost_cnt_hint; /* how many segments are marked already */  
  
		/* Head already handled? If the first packet of the send queue
		 * was already marked, return. */  
		if (mark_head && skb != tcp_write_queue_head(sk))  
			return;  
  
	} else {  
		skb = tcp_write_queue_head(sk);  
		cnt = 0;  
	}  
  
	tcp_for_write_queue_from(skb, sk) {  
		if (skb == tcp_send_head(sk))  
			break; /* stop when we reach snd_nxt */  
  
		/* Update the loss hints. */  
		tp->lost_skb_hint = skb;  
		tp->lost_cnt_hint = cnt;  
  
		/* Sequence numbers marked LOST must not pass high_seq. */  
		if (after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))  
			break;  
  
		oldcnt = cnt;  
  
		if (tcp_is_fack(tp) || tcp_is_reno(tp) ||   
			(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))  
			cnt += tcp_skb_pcount(skb); /* count this (possibly SACKed) segment */  
				 
		/* Mainly decides when to stop. */  
		if (cnt > packets) {  
			if ((tcp_is_sack(tp) && !tcp_is_fack(tp)) ||   
				(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||  
				(oldcnt >= packets))  
  
				break;  
  
			 mss = skb_shinfo(skb)->gso_size;  
			 err = tcp_fragment(sk, skb, (packets - oldcnt) * mss, mss);  
			 if (err < 0)  
				 break;  
			 cnt = packets;  
		}  
  
		/* The marking itself: mark one segment as LOST. */  
		tcp_skb_mark_lost(tp, skb);  
		if (mark_head)  
			break;  
	}  
	tcp_verify_left_out(tp);  
}

Related fields

struct tcp_sock {  
	/* Hints caching the next segment to mark in the retransmit queue,
	 * to speed up the marking pass. */  
	struct sk_buff *lost_skb_hint; /* the next segment to mark */  
	int lost_cnt_hint; /* how many segments are marked already */  
  
	struct sk_buff *retransmit_skb_hint; /* the first packet to retransmit */  
	u32 retransmit_high; /* highest sequence number in the retransmit queue */  
	struct sk_buff *scoreboard_skb_hint; /* tracks timed-out packets, highest seq */  
}

The TCP segmentation function tcp_fragment

/* Function to create two new TCP segments. shrinks the given segment 
 * to the specified size and appends a new segment with the rest of the 
 * packet to the list. This won't be called frequently, I hope. 
 * Remember, these are still headerless SKBs at this point. 
 */  
  
int tcp_fragment (struct sock *sk, struct sk_buff *skb, u32 len,  
				unsigned int mss_now) {}  

Adding the LOST flag to a single segment

static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)  
{  
	if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST | TCPCB_SACKED_ACKED))) {  
		tcp_verify_retransmit_hint(tp, skb); /* update the retransmit hints */  
		tp->lost_out += tcp_skb_pcount(skb); /* one more LOST segment */  
		TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; /* add the LOST flag */  
	}  
}  
  
/* This must be called before lost_out is incremented */  
static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)  
{  
	if ((tp->retransmit_skb_hint == NULL) ||  
		 before(TCP_SKB_CB(skb)->seq,  
					   TCP_SKB_CB(tp->retransmit_skb_hint)->seq))  
		tp->retransmit_skb_hint = skb;   
   
	if (!tp->lost_out ||  
		after(TCP_SKB_CB(skb)->end_seq, tp->retransmit_high))  
		tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;  
}

Implementation of the TCP Congestion State Machine (Part 3)

Scope: this article analyzes, within the TCP congestion state machine, how each congestion state is entered, processed and exited.
Kernel version: 2.6.37

Exiting each state

state E

Exit condition for every state: tp->snd_una >= tp->high_seq

(1) Open

Open is the normal state; there is nothing to exit, so it stays as it is.

(2) Loss

icsk->icsk_retransmits = 0; /* reset the RTO retransmit count */
tcp_try_undo_recovery(sk);

Check whether an undo is needed; whether or not it succeeds, return to Open.

(3) CWR

If seq number greater than high_seq is acked, it indicates that the CWR indication has reached the peer TCP, call tcp_complete_cwr() to bring down the cwnd to ssthresh value.

In tcp_complete_cwr(sk):
tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);

(4) Disorder

With SACK enabled, tcp_try_undo_dsack(sk) handles it. Otherwise, tp->undo_marker = 0;

(5) Recovery

tcp_try_undo_recovery(sk);
then in tcp_complete_cwr(sk):
tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);

/* Called when the CWR or Recovery state ends; reduces cwnd. */   
  
static inline void tcp_complete_cwr(struct sock *sk)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);  
	tp->snd_cwnd_stamp = tcp_time_stamp;  
	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);  
}

Recovery state processing

state F

(1) A dupack arrives

If the received ACK did not advance snd_una, is a duplicate ACK, and SACK is not in use, then:
sacked_out++, one more SACKed packet.
Check for reordering; if reordering is found:
correct sacked_out,
disable FACK (an aside: this is redundant, since without SACK there is no FACK to disable),
update tp->reordering.

/* Emulate SACKs for SACKless connection: account for a new dupack.*/  
static void tcp_add_reno_sack(struct sock *sk)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	tp->sacked_out++; /* one more SACKed packet */  
	tcp_check_reno_reordering(sk, 0); /* check for reordering */  
	tcp_verify_left_out(tp);  
}  
   
/* If we receive more dupacks than we expected counting segments in  
 * assumption of absent reordering, interpret this as reordering. 
 * The only another reason could be bug in receiver TCP. 
 * tcp_limit_reno_sacked() decides whether reordering occurred. 
 */  
static void tcp_check_reno_reordering(struct sock *sk, const int addend)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	if (tcp_limit_reno_sacked(tp)) /* too many SACKed packets? */  
		/* Reordering detected: update the reordering metric. */  
		tcp_update_reordering(sk, tp->packets_out + addend, 0);  
}  
   
/* Limit sacked_out so that sum with lost_out isn't ever larger than packets_out. 
 * Returns zero if sacked_out adjustment wasn't necessary. 
 * If sacked_out is too large, clamp it and return 1 to report reordering. 
 * Q: How do we know there is reordering? 
 * A: A dupack can be caused either by loss or by reordering. If 
 *    sacked_out + lost_out > packets_out, then sacked_out is too large: 
 *    dupacks caused by reordering were wrongly counted as SACKs from 
 *    the receiver. 
 */  
static int tcp_limit_reno_sacked(struct tcp_sock *tp)  
{  
	u32 holes;  
	holes = max(tp->lost_out, 1U);  
	holes = min(holes, tp->packets_out);  
	if ((tp->sacked_out + holes) > tp->packets_out) {  
		tp->sacked_out = tp->packets_out - holes;  
		return 1;  
	}  
	return 0;  
}

Updating the reordering metric

static void tcp_update_reordering(struct sock *sk, const int metric,  
					const int ts)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
  
	if (metric > tp->reordering) {  
		int mib_idx;  
		/* Update the reordering metric, capped at TCP_MAX_REORDERING. */  
		tp->reordering = min(TCP_MAX_REORDERING, metric);  
		  
		if (ts)  
			mib_idx = LINUX_MIB_TCPTSREORDER;  
		else if (tcp_is_reno(tp))  
			mib_idx = LINUX_MIB_TCPRENOREORDER;  
		else if (tcp_is_fack(tp))  
			mib_idx = LINUX_MIB_TCPFACKREORDER;  
		else   
			mib_idx = LINUX_MIB_TCPSACKREORDER;  
  
		NET_INC_STATS_BH(sock_net(sk), mib_idx);  
#if FASTRETRANS_DEBUG > 1  
		printk(KERN_DEBUG "Disorder%d %d %u f%u s%u rr%d\n",  
				   tp->rx_opt.sack_ok, inet_csk(sk)->icsk_ca_state,  
				   tp->reordering, tp->fackets_out, tp->sacked_out,  
				   tp->undo_marker ? tp->undo_retrans : 0);  
#endif  
		tcp_disable_fack(tp); /* reordering was seen, so FACK would now be too aggressive */  
	}  
}  
/* Packet counting of FACK is based on in-order assumptions, therefore 
 * TCP disables it when reordering is detected. 
 */  
  
static void tcp_disable_fack(struct tcp_sock *tp)  
{  
	/* RFC3517 uses different metric in lost marker => reset on change */  
	if (tcp_is_fack(tp))  
		tp->lost_skb_hint = NULL;  
	tp->rx_opt.sack_ok &= ~2; /* clear the FACK option */  
}
(2) A partial ACK arrives

do_lost = tcp_try_undo_partial(sk, pkts_acked);
do_lost is normally true, unless an undo is required.
See the earlier post on undoing congestion window adjustments (《TCP拥塞窗口调整撤销剖析》) for details.

(3) Falling out of stage F: marking lost segments

After (1) or (2), execution falls out of stage F.
If packets were lost, or the first packet on the send queue has timed out, tcp_update_scoreboard() updates the scoreboard: lost segments get the TCPCB_LOST flag and lost_out is increased.

Checking whether the first packet on the send queue has timed out.

/* Check whether the first packet on the send queue has timed out. */  
static inline int tcp_head_timeout(const struct sock *sk)  
{  
	const struct tcp_sock *tp = tcp_sk(sk);  
	return tp->packets_out &&   
				tcp_skb_timeout(sk, tcp_write_queue_head(sk));  
}  
  
/* Check whether a given packet on the send queue has timed out. */  
static inline int tcp_skb_timeout(const struct sock *sk,  
				 const struct sk_buff *skb)  
{  
	return tcp_time_stamp - TCP_SKB_CB(skb)->when > inet_csk(sk)->icsk_rto;  
}

The scoreboard being updated is the sacked field of struct tcp_skb_cb, which holds a packet's state information. Marking newly detected losses works as follows:

(1) Without SACK, each dupack or partial ACK may mark only one packet as lost.

(2) With FACK, each dupack or partial ACK gives two cases:
if lost = fackets_out - reordering <= 0, reordering cannot be ruled out, but FACK is aggressive, so one packet is still marked lost;
if lost > 0, packets were certainly lost, and lost packets can be marked at once.

(3) With SACK but without FACK:
if sacked_upto = sacked_out - reordering <= 0, reordering cannot be ruled out; one packet is marked lost only when the fast-retransmit flag fast_rexmit is set;
if sacked_upto > 0, packets were certainly lost, and sacked_upto packets can be marked at once.

The kernel defaults to (2).

/* Account newly detected lost packet(s) */  
  
static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	if (tcp_is_reno(tp)) {  
		/* Mark only the first packet lost; Reno marks one at a time. */  
		tcp_mark_head_lost(sk, 1, 1);  
  
	} else if (tcp_is_fack(tp)) {  
		/* Reordering is still considered: for the part that may be
		 * caused by reordering, mark one packet at a time. */  
		int lost = tp->fackets_out - tp->reordering;  
		if (lost <= 0)  
			lost = 1;  
  
		/* With FACK, several packets can be marked lost at once. */  
		tcp_mark_head_lost(sk, lost, 0);  
  
	} else {  
		int sacked_upto = tp->sacked_out - tp->reordering;  
		if (sacked_upto >= 0)  
			tcp_mark_head_lost(sk, sacked_upto, 0);  
  
		else if (fast_rexmit)  
			tcp_mark_head_lost(sk, 1, 1);  
	}  
  
	/* Mark timed-out packets on the send queue as lost. */  
	tcp_timeout_skbs(sk);  
}

Finding timed-out packets on the send queue and marking them lost

static void tcp_timeout_skbs(struct sock *sk)  
{  
	struct tcp_sock *tp = tcp_sk(sk);  
	struct sk_buff *skb;  
  
	if (!tcp_is_fack(tp) || !tcp_head_timeout(sk))  
		return;  
  
	skb = tp->scoreboard_skb_hint;  
  
	if (tp->scoreboard_skb_hint == NULL)  
		skb = tcp_write_queue_head(sk);  
  
	tcp_for_write_queue_from(skb, sk) {  
		if (skb == tcp_send_head(sk)) /* stop at snd_nxt */  
			break;  
  
		if (!tcp_skb_timeout(sk, skb)) /* stop at the first packet that has not timed out */  
			break;  
  
		tcp_skb_mark_lost(tp, skb); /* mark LOST and increase lost_out */  
	}  
  
	tp->scoreboard_skb_hint = skb;  
	tcp_verify_left_out(tp);  
}
(4) Reducing snd_cwnd

The congestion window is reduced by one segment for every other ACK, i.e. every two ACKs shrink the window by one segment, until the window reaches the slow-start threshold.

/* Decrease cwnd each second ack. */
static void tcp_cwnd_down(struct sock *sk, int flag)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int decr = tp->snd_cwnd_cnt + 1;

	if ((flag & (FLAG_ANY_PROGRESS | FLAG_DSACKING_ACK)) ||
	    (tcp_is_reno(tp) && !(flag & FLAG_NOT_DUP))) {
		tp->snd_cwnd_cnt = decr & 1; /* 0=>1, 1=>0 */

		decr >>= 1; /* same as the previous snd_cwnd_cnt: 0 or 1 */

		/* decrease cwnd */
		if (decr && tp->snd_cwnd > tcp_cwnd_min(sk))
			tp->snd_cwnd -= decr;

		/* Cap cwnd at packets in flight + 1, so that at most one
		 * new segment can leave per incoming ACK. */
		tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1);
		tp->snd_cwnd_stamp = tcp_time_stamp;
	}
}

/* Lower bound on congestion window is slow start threshold
 * unless congestion avoidance choice decides to override it.
 */
static inline u32 tcp_cwnd_min(const struct sock *sk)
{
	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
	return ca_ops->min_cwnd ? ca_ops->min_cwnd(sk) : tcp_sk(sk)->snd_ssthresh;
}
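The snd_cwnd_cnt toggle above can be isolated into a small self-contained sketch (hypothetical struct and names, not kernel code) showing that the window shrinks by 1 on every second qualifying ACK and never drops below its floor:

```c
/* Hypothetical model of the per-ACK step in tcp_cwnd_down(). */
struct cwr_state {
	unsigned cwnd;      /* congestion window, in segments */
	unsigned cwnd_cnt;  /* toggles 0 -> 1 -> 0 -> ... */
	unsigned cwnd_min;  /* floor, typically the slow start threshold */
};

static void cwr_on_ack(struct cwr_state *s)
{
	unsigned decr = s->cwnd_cnt + 1;

	s->cwnd_cnt = decr & 1; /* 0=>1, 1=>0 */
	decr >>= 1;             /* becomes 1 only on every second ACK */

	if (decr && s->cwnd > s->cwnd_min)
		s->cwnd -= decr;
}
```

Starting from cwnd = 10 with a floor of 5, ten ACKs bring the window down to 5, after which further ACKs leave it unchanged.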
(5) Retransmitting segments marked as lost
/* This gets called after a retransmit timeout, and the initially retransmitted data is
 * acknowledged. It tries to continue resending the rest of the retransmit queue, until
 * either we've sent it all or the congestion window limit is reached. If doing SACK,
 * the first ACK which comes back for a timeout based retransmit packet might feed us
 * FACK information again. If so, we use it to avoid unnecessary retransmissions.
 */
void tcp_xmit_retransmit_queue(struct sock *sk) {}

This function decides which packets get sent. It is fairly complex and will be analyzed separately in a later post.

(6) When to enter the Recovery state

tcp_time_to_recover() is an important function: it decides when to enter the Recovery state.

/* This function decides, when we should leave Disordered state and enter Recovery
 * phase, reducing congestion window.
 */
static int tcp_time_to_recover(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	__u32 packets_out;

	/* Do not perform any recovery during F-RTO algorithm.
	 * This means the Recovery state cannot interrupt the Loss state.
	 */
	if (tp->frto_counter)
		return 0;

	/* Trick#1: The loss is proven.
	 * If segments were lost in transit, we can enter the Recovery state.
	 */
	if (tp->lost_out)
		return 1;

	/* Not-A-Trick#2: Classic rule...
	 * If the duplicate-ACK count exceeds the reordering threshold,
	 * packets have probably been lost, so we can enter Recovery.
	 */
	if (tcp_dupack_heuristics(tp) > tp->reordering)
		return 1;

	/* Trick#3: when we use RFC2988 timer restart, fast
	 * retransmit can be triggered by timeout of queue head.
	 * If the first packet in the write queue has timed out, enter Recovery.
	 */
	if (tcp_is_fack(tp) && tcp_head_timeout(sk))
		return 1;

	/* Trick#4: It is still not OK... But will it be useful to delay recovery more?
	 * If we cannot send because of the application or the receive window, yet
	 * have received many duplicate ACKs, stop waiting: assume loss occurred
	 * and enter Recovery immediately.
	 */
	packets_out = tp->packets_out;
	if (packets_out <= tp->reordering &&
	    tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
	    !tcp_may_send_now(sk)) {
		/* We have nothing to send. This connection is limited
		 * either by receiver window or by application.
		 */
		return 1;
	}

	/* If a thin stream is detected, retransmit after first received
	 * dupack. Employ only if SACK is supported in order to avoid
	 * possible corner-case series of spurious retransmissions.
	 * Use only if there are no unsent data.
	 */
	if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
	    tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
	    tcp_is_sack(tp) && !tcp_send_head(sk))
		return 1;

	return 0; /* false */
}
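As a minimal illustration, the first two checks ("Trick#1" and "Trick#2") can be modeled as a pure predicate. This is a sketch with illustrative field names, not the kernel's struct tcp_sock:

```c
#include <stdbool.h>

/* Hypothetical inputs for the first two tcp_time_to_recover() checks. */
struct recover_in {
	unsigned lost_out;          /* packets already marked lost */
	unsigned dupack_heuristic;  /* fackets_out, or sacked_out + 1 */
	unsigned reordering;        /* reordering metric */
};

static bool time_to_recover(const struct recover_in *in)
{
	if (in->lost_out)                          /* Trick#1: loss is proven */
		return true;
	if (in->dupack_heuristic > in->reordering) /* Trick#2: classic dupACK rule */
		return true;
	return false;
}
```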
/* Heuristics to calculate number of duplicate ACKs. There's no
 * dupACKs counter when SACK is enabled (without SACK, sacked_out
 * is used for that purpose).
 * Instead, with FACK TCP uses fackets_out that includes both SACKed
 * segments up to the highest received SACK block so far and holes in
 * between them.
 *
 * With reordering, holes may still be in flight, so RFC3517 recovery uses
 * pure sacked_out (total number of SACKed segments) even though it
 * violates the RFC that uses duplicate ACKs; often these are equal, but
 * when e.g. out-of-window ACKs or packet duplication occurs, they differ.
 * Since neither occurs due to loss, TCP should really ignore them.
 */
static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
{
	return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
}


/* Determines whether this is a thin stream (which may suffer from increased
 * latency). Used to trigger latency-reducing mechanisms.
 */
static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
{
	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
}

#define TCP_INFINITE_SSTHRESH 0x7fffffff

static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
{
	return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
}
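The same logic can be sketched on plain integers (hypothetical wrappers for illustration; the kernel versions read struct tcp_sock directly):

```c
#include <stdbool.h>

#define TCP_INFINITE_SSTHRESH 0x7fffffff

/* dupACK estimate: fackets_out with FACK, otherwise sacked_out + 1 */
static inline int dupack_heuristics(bool is_fack, int fackets_out, int sacked_out)
{
	return is_fack ? fackets_out : sacked_out + 1;
}

/* Thin stream: fewer than 4 packets in flight, and no longer in the
 * initial slow start (i.e. ssthresh has already been lowered). */
static inline bool stream_is_thin(int packets_out, unsigned snd_ssthresh)
{
	return packets_out < 4 && snd_ssthresh < TCP_INFINITE_SSTHRESH;
}
```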

This function examines various parameters (like the number of packets lost) for the TCP connection to decide whether it is the right time to move to the Recovery state. It is time to recover when TCP heuristics suggest a strong possibility of packet loss in the network. The following checks are made.

In short: once loss is confirmed, or loss is very likely, TCP can enter the Recovery state to recover the lost packets.

The conditions for entering the Recovery state are:
(1) Some packets are lost (lost_out is non-zero). Loss has been detected.

(2) SACK is an acknowledgement for out-of-order packets. If the number of packets SACKed is greater than the
reordering metric of the network, then loss is assumed to have happened.
If the FACKed data, or the count of duplicate ACKs received, exceeds the reordering threshold, packet loss is very likely.

(3) If the first packet waiting to be acked (head of the write queue) has waited for a time equivalent to the retransmission
timeout, the packet is assumed to have been lost. The first segment in the write queue has timed out, indicating it was probably lost.

(4) If the following three conditions are true, the TCP sender is in a state where no more data can be transmitted
and the number of packets SACKed is big enough to assume that the rest of the packets are lost in the network:
A: The number of packets in flight is less than the reordering metric.
B: More than half of the packets in flight have been SACKed by the receiver, or the number of packets SACKed is more
than the fast retransmit threshold. (The fast retransmit threshold is the number of dupacks the sender awaits before
fast retransmission.)
C: The sender cannot send any more packets because it is either bound by the sliding window or the application
has not delivered any more data to it in anticipation of ACKs for already provided data.
That is: many duplicate ACKs have arrived, so segments were very likely lost; if the receive window or the application prevents us from sending, there is no point waiting any longer, and Recovery is entered directly.

(5) When the flow is detected to be thin (packets_out < 4), Recovery is also entered if all of the following hold:
A: tp->thin_dupack == 1 (fast retransmit on the first dupack),
or sysctl_tcp_thin_dupack is 1, meaning retransmission is allowed as soon as the first duplicate ACK arrives.
B: SACK is enabled, and the FACK/SACK data amount is greater than 1.
C: There is no unsent data: tcp_send_head(sk) == NULL.
This is a special case, used only when the flow is very small.

(7) Setup on entering Recovery
Save the data needed for undo:
tp->prior_ssthresh = tp->snd_ssthresh; /* save the old threshold */
tp->undo_marker = tp->snd_una; /* tracking retrans started here */
tp->undo_retrans = tp->retrans_out; /* retransmitted packets out */

Save the exit point:
tp->high_seq = tp->snd_nxt;

Reset variables:
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
tp->bytes_acked = 0;
tp->snd_cwnd_cnt = 0;

Enter the Recovery state:
tcp_set_ca_state(sk, TCP_CA_Recovery);

Handling the Loss state

(1) Receiving a partial ACK

icsk->icsk_retransmits = 0; /* reset the retransmission-timeout counter */
If reno is used (no SACK), tp->sacked_out is reset to zero.

(2) Attempting undo

tcp_try_undo_loss() is called. When the timestamp option shows that a retransmission was unnecessary:
remove the Loss mark from all segments on the scoreboard, so new data is sent instead of retransmissions;
call tcp_undo_cwr() to undo the adjustment of the congestion window and threshold.

Otherwise:
tcp_moderate_cwnd() adjusts the congestion window to prevent a burst of retransmissions;
tcp_xmit_retransmit_queue() continues retransmitting the lost segments.
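The undo step can be sketched roughly as follows. This is a simplified, hypothetical model of what tcp_undo_cwr() does to the window and threshold; the real function also updates counters and re-moderates the window afterwards:

```c
/* Hypothetical snapshot of the fields involved in the undo. */
struct undo_state {
	unsigned snd_cwnd;       /* current (reduced) congestion window */
	unsigned snd_ssthresh;   /* current (reduced) slow start threshold */
	unsigned prior_ssthresh; /* threshold saved before the reduction */
};

static void undo_cwnd_reduction(struct undo_state *s)
{
	if (s->prior_ssthresh) {
		/* Grow cwnd back to at least twice the reduced threshold. */
		if (s->snd_cwnd < s->snd_ssthresh * 2)
			s->snd_cwnd = s->snd_ssthresh * 2;

		/* Restore the pre-congestion slow start threshold. */
		if (s->prior_ssthresh > s->snd_ssthresh)
			s->snd_ssthresh = s->prior_ssthresh;
	}
}
```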

Handling other states

If tcp_time_to_recover(sk) returns false, i.e. we cannot enter the Recovery state, then the CWR, Disorder, or Open state is handled instead.

static void tcp_try_to_open(struct sock *sk, int flag)
{
	struct tcp_sock *tp = tcp_sk(sk);
	tcp_verify_left_out(tp);

	if (!tp->frto_counter && !tcp_any_retrans_done(sk))
		tp->retrans_stamp = 0; /* reset: no undo is needed */

	/* Decide whether to enter the CWR state */
	if (flag & FLAG_ECE)
		tcp_enter_cwr(sk, 1);

	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) { /* not in CWR */
		tcp_try_keep_open(sk); /* try to stay in the Open state */
		tcp_moderate_cwnd(tp);

	} else { /* we are in the CWR state */
		tcp_cwnd_down(sk, flag); /* decrease cwnd every 2 ACKs */
	}
}

static void tcp_try_keep_open(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int state = TCP_CA_Open;

	/* Decide whether to enter the Disorder state */
	if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
		state = TCP_CA_Disorder;

	if (inet_csk(sk)->icsk_ca_state != state) {
		tcp_set_ca_state(sk, state);
		tp->high_seq = tp->snd_nxt;
	}
}
(1) The CWR state

Q: When do we enter the CWR state?
A: When an incoming ACK carries the ECE flag, meaning the receiver is asking the sender to perform explicit congestion control.

 @tcp_try_to_open():
 if (flag & FLAG_ECE)
	 tcp_enter_cwr(sk, 1);

tcp_enter_cwr() was analyzed in an earlier blog post, 《TCP拥塞状态变迁》.
Its main steps are:
1. Reset the slow start threshold.
2. Clear the flags needed for undo; undo is not allowed.
3. Record the current highest sequence number (high_seq = snd_nxt), used to decide when to exit.
4. Set the CWR flag to notify the receiver that the sender has reacted.
5. Set the state to TCP_CA_CWR.

Q: What is done while in CWR?
A: The congestion window is decreased by one segment for every other ACK, i.e. every 2 ACKs received shrink the window by 1, until it equals the slow start threshold.
This is done by calling tcp_cwnd_down().

(2) The Disorder state

Q: When do we enter the Disorder state?
A: When SACKed or retransmitted packets are detected.
Of course, only after we have already ruled out the Loss and Recovery states.
Condition: any of sacked_out, lost_out, retrans_out, or undo_marker is non-zero.

Q: What is done while in Disorder?
A: 1. Set the CA state to TCP_CA_Disorder.
2. Record the current highest sequence number (high_seq = snd_nxt), used to decide when to exit.
3. Moderate the congestion window to prevent bursty transmission.

In the Disorder state TCP is still unsure whether the loss is genuine. After receiving ACKs with SACK blocks, a clearing ACK may arrive that unambiguously acknowledges many packets in one go. Such a clearing ACK could cause a packet burst in the network; to avoid this, the cwnd is reduced so that no more than max_burst (usually 3) packets can be sent at once.
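The clamp itself is simple. As a sketch (simplified, hypothetical signature; the kernel's tcp_moderate_cwnd() reads the values from struct tcp_sock):

```c
/* Cap cwnd at packets still in flight plus a small burst budget,
 * so at most max_burst new segments can leave at once after a
 * clearing ACK. */
static unsigned moderate_cwnd(unsigned snd_cwnd, unsigned in_flight,
			      unsigned max_burst)
{
	unsigned cap = in_flight + max_burst;

	return snd_cwnd < cap ? snd_cwnd : cap;
}
```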

(3) The Open state

Open is the normal state and the final goal of all the state handling, so no extra processing is needed.