
TSO Handling in TCP (Part 1)

http://blog.csdn.net/zhangskd/article/details/7699081

Overview

In computer networking, large segment offload (LSO) is a technique for increasing outbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by queuing up large buffers and letting the network interface card (NIC) split them into separate packets. The technique is also called TCP segmentation offload (TSO) when applied to TCP, or generic segmentation offload (GSO).

The inbound counterpart of large segment offload is large receive offload (LRO).

When large chunks of data are to be sent over a computer network, they need to be first broken down to smaller segments that can pass through all the network elements like routers and switches between the source and destination computers. This process is referred to as segmentation. Segmentation is often done by the TCP protocol in the host computer. Offloading this work to the NIC is called TCP segmentation offload (TSO).

For example, a unit of 64KB (65,536 bytes) of data is usually segmented to 46 segments of 1448 bytes each before it is sent over the network through the NIC. With some intelligence in the NIC, the host CPU can hand over the 64KB of data to the NIC in a single transmit request; the NIC can break that data down into smaller segments of 1448 bytes, add the TCP, IP, and data link layer protocol headers (according to a template provided by the host's TCP/IP stack) to each segment, and send the resulting frames over the network. This significantly reduces the work done by the CPU. Many new NICs on the market today support TSO. [1]
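As a sanity check on the arithmetic, the segment count is just a ceiling division; 1448 bytes is the usual TCP payload with a 1500-byte MTU, 40 bytes of IP/TCP headers, and 12 bytes of timestamp options. A minimal user-space sketch:

#include <stdio.h>

/* Same rounding rule as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int data = 65536; /* 64KB handed over in one transmit request */
	unsigned int mss = 1448;   /* 1500 MTU - 40 IP/TCP - 12 timestamp option */

	printf("%u segments\n", DIV_ROUND_UP(data, mss)); /* prints "46 segments" */
	return 0;
}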

Details

It is a method to reduce the CPU workload of cutting packets into 1500-byte units by asking the hardware to perform that work instead.

1. The TSO feature is implemented using hardware support. This means the hardware must be able to segment packets into a maximum size of 1500 bytes and reattach the headers to every packet.

2. Every piece of network hardware is represented by a netdevice structure in the kernel. If the hardware supports TSO, it enables the segmentation offload features in the netdevice, mainly represented by NETIF_F_TSO and related fields. [2]

TCP Segmentation Offload is supported in Linux by the network device layer. A driver that wants to offer TSO needs to set the NETIF_F_TSO bit in the network device structure. In order for a device to support TSO, it needs to also support Net : TCP Checksum Offloading and Net : Scatter Gather.

The driver will then receive super-sized skbs. These are indicated to the driver by skb_shinfo(skb)->gso_size being non-zero. gso_size is the size into which the hardware should fragment the TCP data. TSO may change how and when TCP decides to send data. [3]
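To make the driver's side concrete, here is a minimal sketch of what a TSO-capable transmit routine looks like; foo_start_xmit and the two descriptor helpers are hypothetical names, not from any real driver:

/* Hypothetical driver transmit routine (foo_start_xmit and the two
 * descriptor helpers are made up): the only point is how a TSO-capable
 * driver recognizes a super-sized skb and where gso_size goes.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_is_gso(skb)) {
		/* Super-sized skb: gso_size is non-zero and tells the
		 * hardware what payload size to cut the TCP data into.
		 */
		foo_fill_tso_descriptor(dev, skb, skb_shinfo(skb)->gso_size);
	} else {
		/* Ordinary skb: carries at most one MSS of payload. */
		foo_fill_descriptor(dev, skb);
	}
	return NETDEV_TX_OK;
}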

Implementation

/* This data is invariant across clones and lives at the end of the
 * header data, ie. at skb->end.
 */
struct skb_shared_info {
	...
	unsigned short gso_size; /* size of each data segment */
	unsigned short gso_segs; /* number of segments the skb is split into */
	unsigned short gso_type;
	struct sk_buff *frag_list; /* list of the segmented packets */
	...
};
/* Initialize TSO state of skb.
 * This must be invoked the first time we consider transmitting
 * SKB onto the wire.
 */
static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
					unsigned int mss_now)
{
	int tso_segs = tcp_skb_pcount(skb);

	/* If the skb has not been segmented yet, or it has multiple segments
	 * whose size differs from the current MSS, redo the segmentation */
	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
		tcp_set_skb_tso_segs(sk, skb, mss_now);

		tso_segs = tcp_skb_pcount(skb); /* re-read the segment count */
	}
	return tso_segs;
}

/* Initialize TSO segments for a packet. */
static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
					unsigned int mss_now)
{
	/* No segmentation is needed in any of these cases:
	 * 1. the data does not exceed the allowed maximum (MSS)
	 * 2. the NIC does not support GSO
	 * 3. the NIC cannot recompute the checksum
	 */
	if (skb->len <= mss_now || !sk_can_gso(sk) ||
		skb->ip_summed == CHECKSUM_NONE) {

		/* Avoid the costly divide in the normal non-TSO case. */
		skb_shinfo(skb)->gso_segs = 1;
		skb_shinfo(skb)->gso_size = 0;
		skb_shinfo(skb)->gso_type = 0;
	} else {

		/* Compute how many segments are needed, rounding up */
		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
		skb_shinfo(skb)->gso_size = mss_now; /* size of each segment */
		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
	}
}

/* Due to TSO, an SKB can be composed of multiple actual packets.
 * To keep these tracked properly, we use this.
 */
static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->gso_segs;
}

/* This is valid if tcp_skb_pcount() > 1 */
static inline int tcp_skb_mss(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->gso_size;
}

static inline int sk_can_gso(const struct sock *sk)
{
	/* sk_route_caps holds the NIC driver's capability flags;
	 * sk_gso_type is the GSO type, set to SKB_GSO_TCPV4
	 */
	return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}

static inline int net_gso_ok(int features, int gso_type)
{
	int feature = gso_type << NETIF_F_GSO_SHIFT;
	return (features & feature) == feature;
}
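The mask test in net_gso_ok() works because each NETIF_F_GSO_* feature flag is defined as the matching SKB_GSO_* type shifted into the feature word; for example, in kernels of this era:

/* include/linux/netdevice.h (kernels of this era): the TSO feature bit
 * is literally the TCPv4 GSO type shifted into the feature word, so
 * sk_can_gso() for a TCPv4 socket reduces to testing whether the
 * driver set NETIF_F_TSO.
 */
#define NETIF_F_GSO_SHIFT	16
#define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)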
sk_gso_max_size

The NIC also specifies the maximum segment size it can handle, in the sk_gso_max_size field. It is usually set to 64KB, which means that if TCP has more than 64KB of data to send, it must first split the data into 64KB chunks and then push them to the interface.

The relevant field, in struct sock: unsigned int sk_gso_max_size.

/* RFC2861 Check whether we are limited by application or congestion window
 * This is the inverse of cwnd check in tcp_tso_should_defer
 * Returns 1: limited by the congestion window, which should therefore grow;
 * returns 0: limited by the application, no need to grow the congestion window.
 */

int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	u32 left;

	if (in_flight >= tp->snd_cwnd)
		return 1;

	/* left is the amount of data that can still be sent */
	left = tp->snd_cwnd - in_flight;

	/* With GSO, if the conditions below hold, the flow is considered
	 * limited by the congestion window, which may therefore grow.
	 */
	if (sk_can_gso(sk) &&
		left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
		left * tp->mss_cache < sk->sk_gso_max_size)
		return 1;

	/* If left exceeds the allowed burst, the congestion window is
	 * already growing fast enough and must not grow further.
	 */
	return left <= tcp_max_burst(tp);
}
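To see the GSO branch in action, here is a small user-space sketch with made-up numbers (the snd_cwnd, in_flight, mss and gso max values are assumptions for illustration, not from the post):

#include <stdio.h>

int main(void)
{
	unsigned int snd_cwnd = 100, in_flight = 98;    /* assumed values  */
	unsigned int mss_cache = 1448, gso_max = 65536; /* typical values  */
	unsigned int divisor = 3; /* sysctl_tcp_tso_win_divisor default    */
	unsigned int left = snd_cwnd - in_flight;       /* 2 packets left  */

	/* Mirrors the GSO branch of tcp_is_cwnd_limited(): with only 2 of
	 * 100 packets' worth of window free, 2 * 3 < 100 and
	 * 2 * 1448 < 65536, so the flow counts as cwnd-limited.
	 */
	if (left * divisor < snd_cwnd && left * mss_cache < gso_max)
		printf("cwnd limited: ok to grow the congestion window\n");
	return 0;
}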

TSO Nagle

GSO (Generic Segmentation Offload) is a strategy the protocol stack uses to improve efficiency.

It postpones segmentation as long as possible, ideally all the way into the NIC driver, where the super-packet is split apart into a scatter-gather list, or its segments are reassembled in a preallocated chunk of memory, before being handed to the NIC.

The idea behind GSO seems to be that many of the performance benefits of LSO (TSO/UFO) can be obtained in a hardware-independent way, by passing large "superpackets" around for as long as possible, and deferring segmentation to the last possible moment - for devices without hardware segmentation/fragmentation support, this would be when data is actually handed to the device driver; for devices with hardware support, it could even be done in hardware.

Try to defer sending, if possible, in order to minimize the amount of TSO splitting we do. View it as a kind of TSO Nagle test.

By delaying transmission we reduce the number of TSO splits that have to be performed, and with it the CPU load.

struct tcp_sock {
	...
	u32 tso_deferred; /* timestamp of the last TSO deferral */
	...
};
/** This algorithm is from John Heffner.
 * Returns 0: send now; 1: defer.
 */
static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	const struct inet_connection_sock *icsk = inet_csk(sk);
	u32 in_flight, send_win, cong_win, limit;
	int win_divisor;

	/* If this skb carries a FIN, send it right away */
	if (TCP_SKB_CB(skb)->flags & TCPHDR_FIN)
		goto send_now;

	/* If we are not in the Open congestion state, send right away */
	if (icsk->icsk_ca_state != TCP_CA_Open)
		goto send_now;

	/* Defer for less than two clock ticks.
	 * If the previous skb was deferred and that was more than 1 ms ago,
	 * do not defer again; in other words, a TSO deferral never exceeds
	 * about 2 ms. (tso_deferred stores jiffies shifted left by one with
	 * the low bit forced to 1, so 0 always means "no deferral pending";
	 * both sides are shifted right by one before the timestamps are
	 * compared.)
	 */
	if (tp->tso_deferred && (((u32)jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
		goto send_now;

	in_flight = tcp_packets_in_flight(tp);
	/* It is a bug if this segment needs no splitting, or if the
	 * congestion window does not allow sending at all */
	BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
	/* Remaining space in the advertised window */
	send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
	/* Remaining space in the congestion window */
	cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;
	/* The smaller of the two is the final sending limit */
	limit = min(send_win, cong_win);

	/* If a full-sized TSO skb can be sent, do it.
	 * Generally this is 64KB.
	 */
	if (limit >= sk->sk_gso_max_size)
		goto send_now;

	/* Middle in queue won't get any more data, full sendable already? */
	if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
		goto send_now;

	win_divisor = ACCESS_ONCE(sysctl_tcp_tso_win_divisor);
	if (win_divisor) {
		/* Maximum number of bytes that may be sent in one RTT */
		u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);
		chunk /= win_divisor; /* amount a single TSO segment may consume */

		/* If at least some fraction of a window is available, just use it. */
		if (limit >= chunk)
			goto send_now;
	} else {
		/* Different approach, try not to defer past a single ACK.
		 * Receiver should ACK every other full sized frame, so if we have space for
		 * more than 3 frames then send now.
		 */
		if (limit > tcp_max_burst(tp) * tp->mss_cache)
			goto send_now;
	}

	/* OK, it looks like it is advisable to defer. */
	tp->tso_deferred = 1 | (jiffies << 1); /* record the timestamp of this deferral */

	return 1;

send_now:
	tp->tso_deferred = 0;
	return 0;
}

/* Returns end sequence number of the receiver's advertised window */
static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
{
	/* snd_wnd is in bytes */
	return tp->snd_una + tp->snd_wnd;
}

tcp_tso_win_divisor: the fraction of the congestion window that a single TSO segment may consume; the default value is 3.

If any of the following conditions holds, there is no TSO deferral and the skb is sent immediately:

(1) The packet carries a FIN flag. The transfer is about to end, so deferring is inappropriate.
(2) The sender is not in the Open congestion state. Deferring is inappropriate in an abnormal state.
(3) The previous skb was deferred, and 2 ms or more have passed since then. A deferral must not exceed 2 ms.
(4) min(send_win, cong_win) >= the size of a full TSO skb. The amount we may send already exceeds what TSO can handle in one go, so there is no point in deferring.
(5) The skb sits in the middle of the send queue and may be sent in full. An skb in the middle of the queue can receive no more data, so there is no point in deferring.
(6) With tcp_tso_win_divisor set: limit >= the amount a single TSO segment may consume, i.e. min(snd_wnd, snd_cwnd * mss_cache) / tcp_tso_win_divisor.
(7) Without tcp_tso_win_divisor set: limit > tcp_max_burst(tp) * mss_cache, normally three packets.

Conditions 4, 5, and 6/7 all take the form "limit exceeds some threshold, so send immediately". These conditions establish that sending is being limited by the application rather than by the advertised or congestion window. When the application produces only a little data, TSO Nagle should not be applied, since it would hurt exactly that kind of application.

Note the comment in tcp_is_cwnd_limited():
"This is the inverse of cwnd check in tcp_tso_should_defer", so tcp_tso_should_defer() can be regarded as containing a tcp_is_not_cwnd_limited (or tcp_is_application_limited) test.

Only when all of the following conditions hold does TSO deferral take place:

(1) The packet does not carry a FIN flag.
(2) The sender is in the Open congestion state.
(3) Less than 2 ms have passed since the previous deferral.
(4) The amount of data that may be sent is smaller than sk_gso_max_size.
(5) The skb is at the tail of the send queue, or it cannot be sent in full.
(6) With tcp_tso_win_divisor set: the sendable amount does not exceed what a single TSO segment may consume.
(7) Without tcp_tso_win_divisor set: the sendable amount does not exceed three packets.

As you can see, the conditions that trigger TSO deferral are not strict, which is why the call is not annotated with unlikely(), as the call-site sketch below shows.
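For reference, this is roughly how the result is consumed, abridged from tcp_write_xmit() in kernels of this era (the surrounding window and cwnd checks are omitted); the Nagle test for single-segment skbs is wrapped in unlikely() while the defer test is not:

		tso_segs = tcp_init_tso_segs(sk, skb, mss_now);

		if (tso_segs == 1) {
			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
						     (tcp_skb_is_last(sk, skb) ?
						      nonagle : TCP_NAGLE_PUSH))))
				break;
		} else {
			if (!push_one && tcp_tso_should_defer(sk, skb))
				break;
		}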

Usage

(1) Disable TSO:

ethtool -K ethX tso off

(2) Enable TSO (it is enabled by default):

ethtool -K ethX tso on

The current setting can be checked with ethtool -k ethX (lowercase -k), which lists tcp-segmentation-offload among the offload parameters.

Reference

[1] http://en.wikipedia.org/wiki/Large_segment_offload

[2] http://tejparkash.wordpress.com/2010/03/06/tso-explained/

[3] http://www.linuxfoundation.org/collaborate/workgroups/networking/tso

TSO/GSO

http://book.51cto.com/art/201206/344985.htm

TSO is a technique that improves network performance by having the network device perform TCP segmentation. Larger packets (frames beyond the standard 1518 bytes) can use this technique, letting the operating system process less data and thus perform better. Normally, when a large amount of data is requested, the TCP sender must split it into MSS-sized chunks and encapsulate each into a packet so it can eventually be transmitted across the network. With TSO enabled, the TCP sender can instead split the data into chunks that are an integer multiple of the MSS and hand those large chunks directly to the network device for segmentation; the operating system then has to create and transmit far fewer packets, so performance improves considerably. Figure 1-3 compares standard frames with the TSO technique.

[Figure 1-3: processing of standard frames vs. TSO: a) without TSO support, b) with TSO enabled]

As the discussion of TSO above shows, TSO applies only to the TCP protocol, giving TCP strong hardware support. The same concept can in fact be applied to other transport-layer protocols, such as TCPv6, UDP, and even DCCP; this is GSO (Generic Segmentation Offload).

The key to the performance gain is postponing segmentation as long as possible, which is what keeps the cost down. Ideally segmentation would happen in the network device driver: the driver would split the large packet into a segment list, or reassemble the segments in a preallocated chunk of memory, and then hand them to the device. That, however, would mean implementing the mechanism in, and thus modifying, every network device driver, which is not realistic.

There is, however, an easier way to support GSO: perform the gather/scatter operation just before the packet is handed to the network device driver. The Linux GSO framework already supports the other transport-layer protocols this way. The GSO code is covered in later chapters.

User space can turn TSO off or on for a network device that supports it with the command ethtool -K eth0 tso off|on.

Understanding the congestion window (cwnd)

http://blog.csdn.net/linweixuan/article/details/4353015

At the start the congestion window is 1: one packet is sent, and when its ACK comes back, cwnd++ makes it 2. Two packets can then be sent almost back to back, and their ACKs reach the sender almost simultaneously; after one RTT round trip the window has doubled (cwnd++, cwnd++ makes it 4), and so on. cwnd thus grows exponentially.

We can ignore the snd_cwnd_clamp variable here and assume it is large. Once the window reaches the configured threshold, snd_cwnd stops growing directly; instead snd_cwnd_cnt counts upward, and only when it exceeds the cwnd value is cwnd incremented by 1, after which snd_cwnd_cnt starts counting again. snd_cwnd_cnt thus slows cwnd growth down. Since TCP segments have a fixed size, each increment of snd_cwnd stands for one more segment, while snd_cwnd_cnt can be thought of as counting in bytes (one step per ACKed segment).

void tcp_cong_avoid(struct send_queue *sq)
{
	/* In safe area (slow start), increase exponentially */
	if (sq->snd_cwnd <= sq->snd_ssthresh) {
		if (sq->snd_cwnd < sq->snd_cwnd_clamp)
			sq->snd_cwnd++;
	} else {
		/* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd */
		if (sq->snd_cwnd_cnt >= sq->snd_cwnd) {
			if (sq->snd_cwnd < sq->snd_cwnd_clamp)
				sq->snd_cwnd++;
			sq->snd_cwnd_cnt = 0;
		} else
			sq->snd_cwnd_cnt++;
	}
}
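A small user-space simulation of the routine above makes the two phases visible (the initial values are made up): below the threshold cwnd grows by one per ACK, which doubles it every RTT, while above it snd_cwnd_cnt must count a full window of ACKs before cwnd grows by one.

#include <stdio.h>

/* Simplified mirror of tcp_cong_avoid() above, driven one ACK at a time. */
int main(void)
{
	unsigned int cwnd = 1, cnt = 0;
	unsigned int ssthresh = 8, clamp = 64; /* assumed values */

	for (int ack = 1; ack <= 40; ack++) {
		if (cwnd <= ssthresh) {       /* slow start: +1 per ACK   */
			if (cwnd < clamp)
				cwnd++;
		} else if (cnt >= cwnd) {     /* congestion avoidance:    */
			if (cwnd < clamp)     /* +1 per cwnd ACKs, i.e.   */
				cwnd++;       /* roughly +1 per RTT       */
			cnt = 0;
		} else {
			cnt++;
		}
		printf("ack %2d: cwnd=%u\n", ack, cwnd);
	}
	return 0;
}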

While snd_cwnd has not yet reached the threshold, it keeps growing (snd_cwnd++ per ACK). Once it reaches snd_ssthresh, the connection enters congestion avoidance, which is governed by snd_cwnd_cnt: the counter starts from 0 at the point where snd_ssthresh was crossed, and each time it counts up to the current value of snd_cwnd (while cwnd is still below the snd_cwnd_clamp cap), cwnd itself is incremented, as sketched below.

     slow start                    congestion avoidance
     snd_cwnd++      | <-- snd_ssthresh
                     v
                                   snd_cwnd_cnt: 0 ---> snd_cwnd, then
                                   snd_cwnd++ (capped by snd_cwnd_clamp)

     <------                  time                            ------->

TCP Receive Window Adjustment Algorithm

TCP Receive Window Adjustment Algorithm (Part 1)
TCP Receive Window Adjustment Algorithm (Part 2)
TCP Receive Window Adjustment Algorithm (Part 3)


TCP Receive Window Adjustment Algorithm (Part 1)

The TCP header contains a 16-bit receive window field that tells the peer: this is how much data I can currently accept. TCP flow control works mainly by adjusting the size of this receive window.

This article analyzes the TCP receive window adjustment algorithm, including some related background and the choice of the initial receive window.

Kernel version: 3.2.12

Data structures

The data structures involved are listed below.

struct tcp_sock {
	...
	/* Sequence number of the earliest received but not yet acknowledged
	 * segment, i.e. the left edge of the current receive window */
	u32 rcv_wup;      /* rcv_nxt on last window update sent */
	u16 advmss;       /* Advertised MSS: upper bound on the MSS we can receive,
	                   * announced to the peer at connection setup */
	u32 rcv_ssthresh; /* Current window clamp: threshold for the current receive window size */
	u32 rcv_wnd;      /* Current receiver window */
	u32 window_clamp; /* Maximum receive window size; also adjusted dynamically */
	...
};
struct tcp_options_received {
	...
		snd_wscale : 4, /* Window scaling received from sender: the peer's receive window scale factor */
		rcv_wscale : 4; /* Window scaling to send to receiver: our receive window scale factor */
	u16 user_mss;   /* mss requested by user in ioctl */
	u16 mss_clamp;  /* Maximal mss, negotiated at connection setup: the peer's maximum MSS */
};
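The wscale factors exist because the header's window field is only 16 bits (at most 65535 bytes): the receiver advertises rcv_wnd >> rcv_wscale, and the peer shifts the on-wire value back. A tiny user-space sketch with assumed values:

#include <stdio.h>

int main(void)
{
	unsigned int rcv_wnd = 262144; /* 256KB of real receive window       */
	unsigned int rcv_wscale = 7;   /* scale factor negotiated on the SYN */

	/* What goes into the 16-bit header field... */
	unsigned short on_wire = rcv_wnd >> rcv_wscale;          /* 2048   */
	/* ...and what the peer reconstructs from it. */
	unsigned int seen = (unsigned int)on_wire << rcv_wscale; /* 262144 */

	printf("on wire: %u, reconstructed: %u\n", on_wire, seen);
	return 0;
}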
/**
 * struct sock - network layer representation of sockets
 * @sk_rcvbuf: size of receive buffer in bytes
 * @sk_receive_queue: incoming packets
 * @sk_write_queue: packet sending queue
 * @sk_sndbuf: size of send buffer in bytes
 */
struct sock {
	...
	struct sk_buff_head sk_receive_queue;
	/* Total length of the data in all segments on sk_receive_queue */
#define sk_rmem_alloc sk_backlog.rmem_alloc

	int sk_rcvbuf; /* upper bound on the receive buffer size */
	int sk_sndbuf; /* upper bound on the send buffer size */

	struct sk_buff_head sk_write_queue;
	...
};

struct sk_buff_head {
	/* These two members must be first. */
	struct sk_buff *next;
	struct sk_buff *prev;
	__u32 qlen;
	spinlock_t lock;
};
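These fields tie flow control to memory: roughly speaking, when the data queued on sk_receive_queue (tracked by sk_rmem_alloc) approaches sk_rcvbuf, the stack starts shrinking the advertised window. A schematic of the pressure test; the real receive-path checks are more involved:

/* Schematic only: the kernel's actual checks (e.g. in tcp_data_queue())
 * combine this comparison with global TCP memory accounting.
 */
static int rcvbuf_under_pressure(const struct sock *sk)
{
	return atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf;
}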

TCP Core Series: The Implementation of SACK and DSACK

TCP Core Series: The Implementation of SACK and DSACK (Part 1)
TCP Core Series: The Implementation of SACK and DSACK (Part 2)
TCP Core Series: The Implementation of SACK and DSACK (Part 3)
TCP Core Series: The Implementation of SACK and DSACK (Part 4)
TCP Core Series: The Implementation of SACK and DSACK (Part 5)
TCP Core Series: The Implementation of SACK and DSACK (Part 6)
TCP Core Series: The Implementation of SACK and DSACK (Part 7)


TCP Core Series: The Implementation of SACK and DSACK (Part 1)

SACK and DSACK are a fairly important part of the TCP implementation.

The SACK and DSACK handling code is maintained by Ilpo Järvinen (ilpo.jarvinen@helsinki.fi).

When tcp_ack() processes an incoming segment carrying the ACK flag, if the ACK takes the slow path and its scoreboard is not empty, it calls tcp_sacktag_write_queue() to mark the scoreboard state of the skbs in the send queue according to the SACK option.

This series mainly analyzes two versions of the implementation, 18 and 37. Roughly speaking, version 18 has clearer logic but lower efficiency, while version 37 has more complex logic but higher efficiency.

This article covers the version-18 implementation of tcp_sacktag_write_queue(), that is, the version-18 implementation of SACK and DSACK.

Version-18 data structures

/* A single SACK block */
struct tcp_sack_block {
	u32 start_seq;  /* start sequence number */
	u32 end_seq;    /* end sequence number */
};
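On the wire, a SACK option is kind 5 and a length byte followed by up to four (start_seq, end_seq) pairs of 32-bit sequence numbers in network byte order, each mapping onto one tcp_sack_block. A user-space parsing sketch (not the kernel's actual parser):

#include <arpa/inet.h>  /* ntohl, htonl */
#include <stdio.h>
#include <string.h>

struct tcp_sack_block {
	unsigned int start_seq; /* start sequence number */
	unsigned int end_seq;   /* end sequence number */
};

/* Parse the body of a SACK option (after the kind/length bytes);
 * 'len' is the option length minus 2, so every 8 bytes is one block.
 */
static int parse_sack(const unsigned char *p, int len,
		      struct tcp_sack_block *sp)
{
	int i, n = len / 8;

	for (i = 0; i < n; i++, p += 8) {
		unsigned int start, end;

		memcpy(&start, p, 4);
		memcpy(&end, p + 4, 4);
		sp[i].start_seq = ntohl(start);
		sp[i].end_seq = ntohl(end);
	}
	return n;
}

int main(void)
{
	/* One SACK block covering sequence numbers 1000..2448 */
	unsigned int raw[2] = { htonl(1000), htonl(2448) };
	struct tcp_sack_block sb[4];
	int n = parse_sack((const unsigned char *)raw, sizeof(raw), sb);

	printf("%d block(s): %u-%u\n", n, sb[0].start_seq, sb[0].end_seq);
	return 0;
}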
struct tcp_sock {
	...
	/* Options received (usually on last packet, some only on SYN packets). */
	struct tcp_options_received rx_opt;
	...
	struct tcp_sack_block recv_sack_cache[4]; /* cache of received SACK blocks, kept for efficiency */
	...
	/* Fast path: where the first SACK block ended last time;
	 * processing now starts directly from here */
	struct sk_buff *fastpath_skb_hint;
	int fastpath_cnt_hint; /* fast path: the fack_count recorded last time, which we keep accumulating */
	...
};
struct tcp_options_received {
	...
	u16 saw_tstamp : 1,  /* Saw TIMESTAMP on last packet */
		tstamp_ok : 1,   /* TIMESTAMP seen on SYN packet */
		dsack : 1,       /* D-SACK is scheduled: the next segment to send carries a D-SACK */
		sack_ok : 4,     /* SACK seen on SYN packet: the receiver supports SACK */
		...
	u8 num_sacks;        /* Number of SACK blocks in the next segment to send */
	...
};
};