kk Blog —— General Basics


tcp_collapse: do not copy headers

commit b3d6cb92fd190d720a01075c4d20cdca896663fc
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Sep 15 04:19:53 2014 -0700

    tcp: do not copy headers in tcp_collapse()

    tcp_collapse() wants to shrink skb so that the overhead is minimal.

    Now we store tcp flags into TCP_SKB_CB(skb)->tcp_flags, we no longer
    need to keep around full headers.
    Whole available space is dedicated to the payload.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 228bf0c..ea92f23 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4535,26 +4535,13 @@ restart:
      return;
 
  while (before(start, end)) {
+     int copy = min_t(int, SKB_MAX_ORDER(0, 0), end - start);
      struct sk_buff *nskb;
-     unsigned int header = skb_headroom(skb);
-     int copy = SKB_MAX_ORDER(header, 0);
 
-     /* Too big header? This can happen with IPv6. */
-     if (copy < 0)
-         return;
-     if (end - start < copy)
-         copy = end - start;
-     nskb = alloc_skb(copy + header, GFP_ATOMIC);
+     nskb = alloc_skb(copy, GFP_ATOMIC);
      if (!nskb)
          return;
 
-     skb_set_mac_header(nskb, skb_mac_header(skb) - skb->head);
-     skb_set_network_header(nskb, (skb_network_header(skb) -
-                       skb->head));
-     skb_set_transport_header(nskb, (skb_transport_header(skb) -
-                     skb->head));
-     skb_reserve(nskb, header);
-     memcpy(nskb->head, skb->head, header);
      memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
      TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
      __skb_queue_before(list, skb, nskb);

This change incidentally fixed a bug, although the bug does not trigger under normal conditions: it only fires if some modification of ours to the skb results in skb->data - skb->head = 4k, memory is tight at that moment, and the tcp_collapse merge conditions are all met.

The bug: tcp_collapse contains:

	while (before(start, end)) {
		struct sk_buff *nskb;
		unsigned int header = skb_headroom(skb);
		int copy = SKB_MAX_ORDER(header, 0);

		/* Too big header? This can happen with IPv6. */
		if (copy < 0) 
			return;

		......

		/* Copy data, releasing collapsed skbs. */
		while (copy > 0) { 
			int offset = start - TCP_SKB_CB(skb)->seq;
			int size = TCP_SKB_CB(skb)->end_seq - start;

In other words, if header = 4k then copy = 0: the `copy < 0` check does not fire, start never advances, and the loop keeps allocating len = 0 skbs and inserting them into the receive queue until alloc_skb finally fails. That in turn breaks tcp_recvmsg.

		skb_queue_walk(&sk->sk_receive_queue, skb) {
			/* Now that we have two receive queues this
			 * shouldn't happen.
			 */
			if (WARN(before(*seq, TCP_SKB_CB(skb)->seq),
				 KERN_INFO "recvmsg bug: copied %X "
					   "seq %X rcvnxt %X fl %X\n", *seq,
					   TCP_SKB_CB(skb)->seq, tp->rcv_nxt,
					   flags))
				break;

			offset = *seq - TCP_SKB_CB(skb)->seq;
			if (tcp_hdr(skb)->syn)
				offset--;
			if (offset < skb->len)
				goto found_ok_skb;
			if (tcp_hdr(skb)->fin)
				goto found_fin_ok;
			WARN(!(flags & MSG_PEEK), KERN_INFO "recvmsg bug 2: "
					"copied %X seq %X rcvnxt %X fl %X\n",
					*seq, TCP_SKB_CB(skb)->seq,
					tp->rcv_nxt, flags);
		}

Because offset = 0 and skb->len = 0, the `if (offset < skb->len)` test fails and the WARN fires. Worse, if many len = 0 skbs were allocated, we keep spinning in this loop for a long time, since each WARN prints a stack trace and is therefore very slow.

The errors look like this:

WARNING: at net/ipv4/tcp.c:1457 tcp_recvmsg+0x96a/0xc20() (Tainted: G    W  ---------------   )
Hardware name: PowerEdge R620
Modules linked in: sha256_generic ws_st_tcp_cubic(U) ws_st(U) autofs4 i2c_dev i2c_core bonding 8021q garp stp llc be2iscsi iscsi_boot_sysfs ib]
Pid: 6964, comm: squid Tainted: G        W  ---------------    2.6.32-358.6.1.x86_64 #1
Call Trace:
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff81069aa8>] ? warn_slowpath_common+0x98/0xc0
 [<ffffffff81069bce>] ? warn_slowpath_fmt+0x6e/0x70
 [<ffffffff814ce08e>] ? _spin_lock_bh+0x2e/0x40
 [<ffffffff813fea53>] ? skb_release_data+0xb3/0x100
 [<ffffffff813feb56>] ? __kfree_skb+0x46/0xa0
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff813f93c7>] ? sock_common_recvmsg+0x37/0x50
 [<ffffffff813f6b05>] ? sock_aio_read+0x185/0x190
 [<ffffffff81171912>] ? do_sync_read+0xf2/0x130
 [<ffffffff81090e60>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811b4a2c>] ? sys_epoll_wait+0x21c/0x3f0
 [<ffffffff8120b3b6>] ? security_file_permission+0x16/0x20
 [<ffffffff81171bab>] ? vfs_read+0x18b/0x1a0
 [<ffffffff81172df5>] ? sys_read+0x55/0x90
 [<ffffffff8100af72>] ? system_call_fastpath+0x16/0x1b
---[ end trace ef9663ba0fc61730 ]---
------------[ cut here ]------------
WARNING: at net/ipv4/tcp.c:1457 tcp_recvmsg+0x96a/0xc20() (Tainted: G        W  ---------------   )
Hardware name: PowerEdge R620
Modules linked in: sha256_generic ws_st_tcp_cubic(U) ws_st(U) autofs4 i2c_dev i2c_core bonding 8021q garp stp llc be2iscsi iscsi_boot_sysfs ib]
Pid: 6964, comm: squid Tainted: G        W  ---------------    2.6.32-358.6.1.x86_64 #1
Call Trace:
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff81069aa8>] ? warn_slowpath_common+0x98/0xc0
 [<ffffffff81069bce>] ? warn_slowpath_fmt+0x6e/0x70
 [<ffffffff814ce08e>] ? _spin_lock_bh+0x2e/0x40
 [<ffffffff813fea53>] ? skb_release_data+0xb3/0x100
 [<ffffffff813feb56>] ? __kfree_skb+0x46/0xa0
 [<ffffffff8144f1ca>] ? tcp_recvmsg+0x96a/0xc20
 [<ffffffff813f93c7>] ? sock_common_recvmsg+0x37/0x50
 [<ffffffff813f6b05>] ? sock_aio_read+0x185/0x190
 [<ffffffff81171912>] ? do_sync_read+0xf2/0x130
 [<ffffffff81090e60>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811b4a2c>] ? sys_epoll_wait+0x21c/0x3f0
 [<ffffffff8120b3b6>] ? security_file_permission+0x16/0x20
 [<ffffffff81171bab>] ? vfs_read+0x18b/0x1a0
 [<ffffffff81172df5>] ? sys_read+0x55/0x90
 [<ffffffff8100af72>] ? system_call_fastpath+0x16/0x1b
---[ end trace ef9663ba0fc61731 ]---
------------[ cut here ]------------

.......

If not too many skbs were allocated, the WARN in tcp_cleanup_rbuf shows up soon afterwards. Look closely and you will see that the end_seq printed here is the same value as the seq printed above.

void tcp_cleanup_rbuf(struct sock *sk, int copied)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int time_to_ack = 0;

#if TCP_DEBUG
	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);

	WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
		 KERN_INFO "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
		 tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
#endif

The three TCP receive queues

http://www.cnblogs.com/alreadyskb/p/4386565.html

Three receive queues

  • The TCP receive path implements three receive buffers: prequeue, sk_receive_queue, and sk_backlog.

These three buffers exist because, when the TCP stack receives a packet, the struct sock *sk may be held by either process context or interrupt context:

1. In process context, sk_lock.owned = 1, so the softirq, seeing sk_lock.owned = 1, can only stash the data on the backlog queue (sk_backlog). When the process-context logic finishes, it calls back into tcp_v4_do_rcv to drain the backlog as compensation; see how release_sock is implemented, e.g. its call in tcp_sendmsg.

2. In interrupt context, sk_lock.owned = 0, and the data can be placed on receive_queue or prequeue, with prequeue preferred; if prequeue is full, the data goes to receive_queue. In theory a single queue would do, so why does the TCP stack maintain two? To finish the softirq path as quickly as possible: a softirq handler runs with preemption and other softirqs disabled, which is costly, so if the data can simply be dropped onto prequeue the softirq ends almost immediately, whereas queueing onto receive_queue involves much more complex processing. receive_queue is processed in softirq context, while prequeue is processed in process context. In short, it is all about TCP stack efficiency.

How the backlog queue is processed

1. When is the backlog used?

The TCP stack keeps two locks on a struct sock: sk_lock.slock and sk_lock.owned. sk_lock.slock guards modification of the sock's members; sk_lock.owned records whether process context currently owns the sock: it is set to 1 in process context and 0 in interrupt context.

To modify sk you must first take sk_lock.slock, then check whether you are running in softirq or process context. If process context owns the sock, a received skb can only be placed on the backlog (sk_backlog); in softirq context it can go onto prequeue or sk_receive_queue.

The code fragment:

	bh_lock_sock_nested(sk);               // take the first lock
	ret = 0;
	if (!sock_owned_by_user(sk)) {         // check the second lock: process context vs softirq context
#ifdef CONFIG_NET_DMA
		struct tcp_sock *tp = tcp_sk(sk);
		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
			tp->ucopy.dma_chan = dma_find_channel(DMA_MEMCPY);
		if (tp->ucopy.dma_chan)
			ret = tcp_v4_do_rcv(sk, skb);
		else
#endif
		{
			if (!tcp_prequeue(sk, skb))    // softirq context: prefer prequeue; if it cannot take the skb, process it into sk_receive_queue
				ret = tcp_v4_do_rcv(sk, skb);
		}
	} else if (unlikely(sk_add_backlog(sk, skb,  // process context owns the sock: queue straight onto the backlog (sk_backlog)
						sk->sk_rcvbuf + sk->sk_sndbuf))) {
		bh_unlock_sock(sk);
		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
		goto discard_and_relse;
	}
	bh_unlock_sock(sk);

2. How an skb is added to sk_backlog

sk_add_backlog() is what adds an skb to sk_backlog, so let's analyze that function.

/* The per-socket spinlock must be held here. */
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb,
						   unsigned int limit)
{
	if (sk_rcvqueues_full(sk, skb, limit))  // is the receive buffer exhausted? note that sk_backlog's size counts toward the total receive budget
		return -ENOBUFS;

	__sk_add_backlog(sk, skb);              // append skb to the sk_backlog queue
	sk_extended(sk)->sk_backlog.len += skb->truesize;  // account the newly queued data in sk_backlog.len
	return 0;
}
/* OOB backlog add */
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
	if (!sk->sk_backlog.tail) {   // empty backlog: head and tail both point at skb
		sk->sk_backlog.head = sk->sk_backlog.tail = skb;
	} else {                      // backlog already has data: link skb after tail, then move tail to skb
		sk->sk_backlog.tail->next = skb;
		sk->sk_backlog.tail = skb;
	}
	skb->next = NULL;             // important: backlog draining uses next == NULL to detect the end of the chain
}

3. Processing the skbs on sk_backlog

Clearly the backlog must be processed in process context. On the receive side the process-context entry point is tcp_recvmsg, so sk_backlog has to be drained from tcp_recvmsg.

The relevant fragment in tcp_recvmsg:

tcp_cleanup_rbuf(sk, copied);
TCP_CHECK_TIMER(sk);
release_sock(sk);

release_sock(sk) is where the sk_backlog processing happens.

void release_sock(struct sock *sk)
{
	/*
	* The sk_lock has mutex_unlock() semantics:
	*/
	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);

	spin_lock_bh(&sk->sk_lock.slock);   // take the first lock
	if (sk->sk_backlog.tail)            // backlog non-empty: drain it
		__release_sock(sk);

	if (proto_has_rhel_ext(sk->sk_prot, RHEL_PROTO_HAS_RELEASE_CB) &&
			sk->sk_prot->release_cb)
		sk->sk_prot->release_cb(sk);

	sk->sk_lock.owned = 0;              // process-context work is done: release the second lock
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock); // release the first lock
}

__release_sock(sk) does the actual backlog processing.

static void __release_sock(struct sock *sk)
{
	struct sk_buff *skb = sk->sk_backlog.head;

	do {
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		bh_unlock_sock(sk);

		do {
			struct sk_buff *next = skb->next;

			skb->next = NULL;
			sk_backlog_rcv(sk, skb);    // per-skb handler; for TCP this ends up calling tcp_v4_do_rcv

			/*
			 * We are in process context here with softirqs
			 * disabled, use cond_resched_softirq() to preempt.
			 * This is safe to do because we've taken the backlog
			 * queue private:
			 */
			cond_resched_softirq();

			skb = next;
		} while (skb != NULL);          // skb == NULL: the detached chain has been fully processed

		bh_lock_sock(sk);
	} while ((skb = sk->sk_backlog.head) != NULL); // a softirq may have built a new backlog while we processed the old one; drain that too

	/*
	* Doing the zeroing here guarantee we can not loop forever
	* while a wild producer attempts to flood us.
	*/
	sk_extended(sk)->sk_backlog.len = 0;
}

It starts by resetting sk->sk_backlog.head and sk->sk_backlog.tail to NULL. sk_backlog is a linked list whose head points at the first skb and whose tail points at the last. NULLing head and tail here is safe because `struct sk_buff *skb = sk->sk_backlog.head` has already grabbed the head skb; from then on the chain is walked via skb->next, terminating when skb->next == NULL, which __sk_add_backlog arranged. In other words, once the chain is detached, the head and tail pointers are no longer needed for processing it.

So why NULL out sk->sk_backlog.head and sk->sk_backlog.tail at all? The obvious guess is that they are about to be reused. Under what circumstances? We are in process context and sk->sk_lock.slock is not held, so a softirq can interrupt us; when it does, it will want to receive data (for efficiency the TCP stack certainly does not drop it), and as analyzed above that data must go onto the backlog (sk_backlog). So the pointers are cleared precisely so that, while the previous chain is still being processed, a fresh sk_backlog chain can be built. One might ask why not simply append to the tail of the old chain; I have not fully worked that out, perhaps the synchronization would be hard to get right.

4. Where do the skbs end up?

Obviously all received data is eventually delivered to the application, and before that the data in the three receive queues must be kept in order. So how do these three queues hand an ordered byte stream to the application layer? All three go through tcp_v4_do_rcv. For prequeue and sk_backlog, tcp_v4_do_rcv is invoked from tcp_recvmsg, i.e. in process context, but with softirqs disabled via local_bh_disable. Inside tcp_rcv_established and tcp_data_queue, if the data happens to be copyable straight into user space, softirqs are briefly re-enabled with local_bh_enable.

Re-enabling softirqs inside tcp_checksum_complete_user, tcp_rcv_established, and tcp_data_queue is fragile, though. Entering a softirq does softirq_count() += 1 while local_bh_enable does softirq_count() -= 2, so for now the softirq count is merely inaccurate inside softirq context and still correct in process context. But if future code in a softirq ever adds softirq_count() += 1 before this local_bh_enable, the softirq could be preempted mid-execution and never be switched back to. If tcp_checksum_complete_user is switched away from, the packet is never received; if tcp_rcv_established or tcp_data_queue is switched away from right after tp->copied_seq += chunk, we end up with tp->copied_seq > tp->rcv_nxt, and the next receive can then hit anomalies such as tp->copied_seq > sk_receive_queue.first.end_seq.

A careful read of tcp_v4_do_rcv shows that it keeps the data correctly ordered, so whether we are draining sk_backlog or prequeue, everything ultimately goes through tcp_v4_do_rcv, which inserts the data in order into sk_receive_queue, where the application finally picks it up.

tcp_read_sock BUG

commit baff42ab1494528907bf4d5870359e31711746ae
Author: Steven J. Magnani <steve@digidescorp.com>
Date:   Tue Mar 30 13:56:01 2010 -0700

	net: Fix oops from tcp_collapse() when using splice()

	tcp_read_sock() can have a eat skbs without immediately advancing copied_seq.
	This can cause a panic in tcp_collapse() if it is called as a result
	of the recv_actor dropping the socket lock.

	A userspace program that splices data from a socket to either another
	socket or to a file can trigger this bug.

	Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6afb6d8..2c75f89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1368,6 +1368,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
      sk_eat_skb(sk, skb, 0);
      if (!desc->count)
          break;
+     tp->copied_seq = seq;
  }
  tp->copied_seq = seq;
 

If sk_eat_skb in tcp_read_sock consumes an skb without updating copied_seq at the same time, copied_seq can end up smaller than the seq of the first packet in sk_receive_queue. The function that tcp_read_sock's recv_actor points at (tcp_splice_data_recv, for example) may release the socket lock; if the receive softirq then runs while memory is tight and calls tcp_collapse, inside tcp_collapse we have:

start = copied_seq
...
int offset = start - TCP_SKB_CB(skb)->seq;

BUG_ON(offset < 0);

tcp_match_skb_to_sack BUG

commit 2cd0d743b05e87445c54ca124a9916f22f16742e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jun 18 21:15:03 2014 -0400

	tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb

	If there is an MSS change (or misbehaving receiver) that causes a SACK
	to arrive that covers the end of an skb but is less than one MSS, then
	tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
	the skb ("Round if necessary..."), then chopping all bytes off the skb
	and creating a zero-byte skb in the write queue.

	This was visible now because the recently simplified TLP logic in
	bef1909ee3ed1c ("tcp: fixing TLP's FIN recovery") could find that 0-byte
	skb at the end of the write queue, and now that we do not check that
	skb's length we could send it as a TLP probe.

	Consider the following example scenario:

	 mss: 1000
	 skb: seq: 0 end_seq: 4000  len: 4000
	 SACK: start_seq: 3999 end_seq: 4000

	The tcp_match_skb_to_sack() code will compute:

	 in_sack = false
	 pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
	 new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
	 new_len += mss = 4000

	Previously we would find the new_len > skb->len check failing, so we
	would fall through and set pkt_len = new_len = 4000 and chop off
	pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
	afterward in the write queue.

	With this new commit, we notice that the new new_len >= skb->len check
	succeeds, so that we return without trying to fragment.

	Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
	Reported-by: Eric Dumazet <edumazet@google.com>
	Signed-off-by: Neal Cardwell <ncardwell@google.com>
	Cc: Eric Dumazet <edumazet@google.com>
	Cc: Yuchung Cheng <ycheng@google.com>
	Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
	Acked-by: Eric Dumazet <edumazet@google.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..b5c2375 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1162,7 +1162,7 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
          unsigned int new_len = (pkt_len / mss) * mss;
          if (!in_sack && new_len < pkt_len) {
              new_len += mss;
-             if (new_len > skb->len)
+             if (new_len >= skb->len)
                  return 0;
          }
          pkt_len = new_len;

GRO receive

Linux kernel networking: GRO (Generic Receive Offload)

GRO can merge packets with different gso_size values, setting the merged skb's gso_size to that of the first packet.

If such a packet is later transmitted, it violates the invariant: skb->gso_size * (gso_segs - 1) < skb->len <= skb->gso_size * gso_segs,

so the three functions below can then go wrong.

1. tcp_shift_skb_data

mss = skb->gso_size
len = len/mss * mss

|---|-------|-------|
 mss    |
        V
|---|---|

2. tcp_mark_head_lost

len = (packets - cnt) * mss

|--------|--|--|
   mss   |
         V
|--------|--------|

3. tcp_match_skb_to_sack

new_len = (pkt_len/mss)*mss
in_sack = 1
pkt_len = new_len

|---|-------|-------|
 mss    |
        V
|---|---|

The fix

Before the skb joins the transmit queue:

skb_shinfo(skb)->gso_size = 0;
skb_shinfo(skb)->gso_segs = 0;
skb_shinfo(skb)->gso_type = 0;