kk Blog —— General Fundamentals


date [-d @int|str] [+%s|"+%F %T"]
netstat -ltunp
sar -n DEV 1
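
For example, a few typical conversions with date (the values are only illustrative):

date +%s                              # current time as epoch seconds
date -d @1500000000 "+%F %T"          # epoch seconds -> "YYYY-MM-DD HH:MM:SS" in local time
date -d "2017-07-14 10:00:00" +%s     # date string -> epoch seconds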

Routing: reverse path filtering (rp_filter)

find /proc/ -name rp_filter -exec cat {} \;

find /proc/ -name rp_filter -exec sh -c "echo 0 > {}" \;
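
The same checks and changes can also be made through the sysctl interface instead of writing to /proc directly; a minimal sketch:

sysctl -a 2>/dev/null | grep '\.rp_filter'     # inspect every rp_filter knob
sysctl -w net.ipv4.conf.all.rp_filter=0        # disable globally
sysctl -w net.ipv4.conf.default.rp_filter=0    # default for interfaces created later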

http://blog.chinaunix.net/uid-20417916-id-3050031.html

Reverse path filtering – reverse path filter

1. How it works

First, the concept of an asymmetric route.

See Understanding Linux Network Internals, Chapter 30,

30.2. Essential Elements of Routing

Symmetric routes and asymmetric routes

Usually, the route taken from Host A to Host B is the same as the route used to get back from Host B to Host A; the route is then called symmetric. In complex setups, the route back may be different; in this case, it is asymmetric.

For reverse path filtering itself, see Understanding Linux Network Internals, Chapter 31,

31.7. Reverse Path Filtering

We saw what an asymmetric route is in the section “Essential Elements of Routing” in Chapter 30. Asymmetric routes are not common, but may be necessary in certain cases. The default behavior of Linux is to consider asymmetric routing suspicious and therefore to drop any packet whose source IP address is not reachable through the device the packet was received from, according to the routing table.

However, this behavior can be tuned via /proc on a per-device basis, as we will see in Chapter 36. See also the section “Input Routing” in Chapter 35.

2. How the check is performed

If a host (or router) receives a packet on interface A whose source and destination addresses are 10.3.0.2 and 10.2.0.2 respectively, then, with reverse path filtering enabled, it looks up the routing table using the packet's source address (10.3.0.2) as the key. If the output interface returned by that lookup is not A, the reverse path check is considered to have failed and the packet is dropped.
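
You can ask the kernel directly what the strict check would conclude. The addresses and the interface name ethA below are just the hypothetical values from the example above; on a box where the check fails, the second command typically reports an error such as "invalid cross-device link" (-EXDEV, which matches the source code below):

ip route get 10.3.0.2                          # which interface would we use to reach the source?
ip route get 10.2.0.2 from 10.3.0.2 iif ethA   # simulate input routing, including source validation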

IPv4 has an rp_filter parameter for this; it is documented in Documentation/networking/ip-sysctl.txt.

rp_filter - INTEGER
    0 - No source validation.
    1 - Strict mode as defined in RFC3704 Strict Reverse Path
        Each incoming packet is tested against the FIB and if the interface
        is not the best reverse path the packet check will fail.
        By default failed packets are discarded.
    2 - Loose mode as defined in RFC3704 Loose Reverse Path
        Each incoming packet's source address is also tested against the FIB
        and if the source address is not reachable via any interface
        the packet check will fail.

    Current recommended practice in RFC3704 is to enable strict mode
    to prevent IP spoofing from DDos attacks. If using asymmetric routing
    or other complicated routing, then loose mode is recommended.

    The max value from conf/{all,interface}/rp_filter is used
    when doing source validation on the {interface}.

    Default value is 0. Note that some distributions enable it
    in startup scripts.

3. Source code analysis

git commit 373da0a2a33018d560afcb2c77f8842985d79594

net/ipv4/fib_frontend.c
 192 int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
 193                         int oif, struct net_device *dev, __be32 *spec_dst,
 194                         u32 *itag)
 195 {
             // is reverse path filtering enabled?
 216         /* Ignore rp_filter for packets protected by IPsec. */
 217         rpf = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(in_dev);
 
             // look up the routing table
             // note that the source and destination addresses here are swapped;
             // look at how other functions call fib_validate_source() and it becomes clear.
 227         if (fib_lookup(net, &fl4, &res))
 228                 goto last_resort;

             // reaching this point means the reverse route is reachable;
             // below, two cases check whether the output device is the input device
 237 #ifdef CONFIG_IP_ROUTE_MULTIPATH
             // with multipath enabled, any next hop that matches is good enough
 238         for (ret = 0; ret < res.fi->fib_nhs; ret++) {
 239                 struct fib_nh *nh = &res.fi->fib_nh[ret];
 240
 241                 if (nh->nh_dev == dev) {
 242                         dev_match = true;
 243                         break;
 244                 }
 245         }
 246 #else
 247         if (FIB_RES_DEV(res) == dev)
 248                 dev_match = true;
 249 #endif
 250         if (dev_match) {
             // the reverse path check succeeded, return
 251                 ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 252                 return ret;
 253         }
 254         if (no_addr)
 255                 goto last_resort;
             // reaching this point means the reverse path check has failed;
             // if rpf is 1, the check must succeed for a normal return,
             // so all we can do is return an error.
 256         if (rpf == 1)
 257                 goto e_rpf;
 278 e_rpf:
 279         return -EXDEV;

5. How to fix it

There are two approaches:

1. On R2:

ip route add 10.3.0.0/16 via 10.2.0.2

This adds a route for the 10.3.0.0/16 subnet.

2. On R2:

/etc/sysctl.conf

net.ipv4.conf.default.rp_filter = 0

This disables reverse path filtering. Note that, as the documentation quoted above says, source validation uses the maximum of conf/all and conf/{interface}/rp_filter, so conf.default only affects interfaces created afterwards; on a running system you usually also need to clear the "all" and per-interface values.
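
A sketch of applying the change to a live system (eth1 is a hypothetical interface name):

sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth1.rp_filter=0
sysctl -p        # reload /etc/sysctl.conf after editing it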

The ip_early_demux parameter

ip_early_demux defaults to 1 (enabled) in the kernel, so on the path from ip_rcv_finish to tcp_v4_rcv the kernel sets skb->destructor = sock_edemux;

and, in most cases, skb->_skb_refdst = (unsigned long)dst | SKB_DST_NOREF; // #define SKB_DST_NOREF 1UL

For a NOREF dst, if the skb is going to be queued (tcp_prequeue() or sk_add_backlog()), skb_dst_force(skb) has to be called first:

static inline void skb_dst_force(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		WARN_ON(!rcu_read_lock_held());
		skb->_skb_refdst &= ~SKB_DST_NOREF;
		dst_clone(skb_dst(skb));
	}   
}

http://blog.chinaunix.net/uid-20662820-id-4935075.html

The routing cache was removed in Linux 3.6 after a two-year effort by David Miller and the other Linux kernel developers. The global cache is gone, and some of the information it stored has been moved into more specific places, such as the socket.

Metrics used to be stored in the routing cache entry, which has disappeared, so it was necessary to introduce a separate TCP metrics cache. A netlink interface is available to add, update, and delete entries in that cache.

In short, starting with Linux 3.6 the kernel dropped the global route cache entirely and replaced it with caches internal to the individual subsystems (for example the TCP stack), each maintained by that subsystem.
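
The TCP metrics cache mentioned above can be inspected and managed with iproute2 (the address below is just an example):

ip tcp_metrics show                      # list cached per-destination TCP metrics
ip tcp_metrics delete address 10.2.0.2   # drop the entry for one destination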

When the kernel receives a TCP packet, it first looks up the route for the skb and then looks up the socket for the skb. David Miller noticed that this is wasteful: for packets belonging to the same socket (considering only the ESTABLISHED case) the route is the same, so if the route can be cached in the socket (skb->sk), a single lookup of the skb's socket also finds the route. He therefore submitted the patch ipv4: Early TCP socket demux

(commit 41063e9dd11956f2d285e12e4342e1d232ba0ea2)
ipv4: Early TCP socket demux.

	Input packet processing for local sockets involves two major demuxes.
	One for the route and one for the socket.

	But we can optimize this down to one demux for certain kinds of local
	sockets.

	Currently we only do this for established TCP sockets, but it could
	at least in theory be expanded to other kinds of connections.

	If a TCP socket is established then it's identity is fully specified.

	This means that whatever input route was used during the three-way
	handshake must work equally well for the rest of the connection since
	the keys will not change.

	Once we move to established state, we cache the receive packet's input
	route to use later.

	Like the existing cached route in sk->sk_dst_cache used for output
	packets, we have to check for route invalidations using dst->obsolete
	and dst->ops->check().

	Early demux occurs outside of a socket locked section, so when a route
	invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
	actually inside of established state packet processing and thus have
	the socket locked.

However, the patch Davem added has a downside: for forwarded packets it adds a socket lookup before the route lookup, which can reduce forwarding performance. Alexander Duyck therefore proposed adding an ip_early_demux parameter to control whether the feature is enabled.

This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%. For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added. It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.
static int ip_rcv_finish(struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);
	struct rtable *rt;

	if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) {
		const struct net_protocol *ipprot;
		int protocol = iph->protocol;

		ipprot = rcu_dereference(inet_protos[protocol]);
		if (ipprot && ipprot->early_demux) {
			ipprot->early_demux(skb);
			/* must reload iph, skb->head might have changed */
			iph = ip_hdr(skb);
		}
	}

	/*
	 * Initialise the virtual path cache for the packet. It describes
	 * how the packet travels inside Linux networking.
	 */
	if (!skb_dst(skb)) {
		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
						   iph->tos, skb->dev);
		if (unlikely(err)) {
			if (err == -EXDEV)
				NET_INC_STATS_BH(dev_net(skb->dev),
					 LINUX_MIB_IPRPFILTER);
			goto drop;
		}
	}
	......

And so ip_early_demux was born. It currently defaults to 1 (enabled) in the kernel, but if more than 60% of your traffic is forwarded packets, you should turn this feature off.
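
A sketch of toggling it at runtime and persisting the change (the sysctl key is net.ipv4.ip_early_demux):

sysctl net.ipv4.ip_early_demux                           # check the current value
sysctl -w net.ipv4.ip_early_demux=0                      # disable on a mostly-forwarding box
echo 'net.ipv4.ip_early_demux = 0' >> /etc/sysctl.conf   # make it persistent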

void tcp_v4_early_demux(struct sk_buff *skb)
{
	const struct iphdr *iph;
	const struct tcphdr *th;
	struct sock *sk;

	if (skb->pkt_type != PACKET_HOST)
		return;

	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct tcphdr)))
		return;

	iph = ip_hdr(skb);
	th = tcp_hdr(skb);

	if (th->doff < sizeof(struct tcphdr) / 4)
		return;

	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
				       iph->saddr, th->source,
				       iph->daddr, ntohs(th->dest),
				       skb->skb_iif);
	if (sk) {
		skb->sk = sk;
		skb->destructor = sock_edemux;
		if (sk->sk_state != TCP_TIME_WAIT) {
			struct dst_entry *dst = sk->sk_rx_dst;

			if (dst)
				dst = dst_check(dst, 0);
			if (dst &&
			    inet_sk(sk)->rx_dst_ifindex == skb->skb_iif)
				skb_dst_set_noref(skb, dst);
		}
	}
}

The TCP option TCP_DEFER_ACCEPT

http://www.pagefault.info/?p=346

Most people have probably heard of the TCP_DEFER_ACCEPT option, but here I will analyze it in detail from the source code and from packet captures. Note that the kernel version I am using is 3.0.

First, the description from the man page (man 7 tcp):

TCP_DEFER_ACCEPT (since Linux 2.4)
Allow a listener to be awakened only when data arrives on the socket. Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection. This option should not be used in code intended to be portable. 

A quick overview first. This option mainly targets the server side. In a normal three-way handshake, the client sends a SYN, the server receives it and replies with SYN+ACK, the client receives the SYN+ACK and sends the final ACK (the client enters ESTABLISHED), and finally the server receives that last ACK and enters ESTABLISHED as well.

When TCP_DEFER_ACCEPT is set correctly, the server does not enter ESTABLISHED after receiving the final ACK. It simply marks the socket as acked and drops the ACK. The socket remains in SYN_RECV, and the server waits for the client to send data. Because the socket is still in SYN_RECV, it stays under the control of the SYN-ACK timer, which retransmits the SYN+ACK. The number of retransmissions is determined jointly by the value passed to TCP_DEFER_ACCEPT, the TCP_SYNCNT option, and the tcp_synack_retries setting in proc (the source analysis below shows how this value is computed). Remember that the value passed to TCP_DEFER_ACCEPT is in seconds, and the kernel converts it into a retransmission count.

Note that once the retransmission count exceeds the limit, if the final ACK has already arrived and the next SYN-ACK timer expiry has not yet fired, the deferred connection is added to the established (accept) queue and the application is notified. This matches what the man page says ("Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection."), i.e. the connection is eventually completed.
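
For reference, a minimal user-space sketch of enabling the option on a listening socket (the 30-second value is only an illustration; as we will see below, the kernel converts it into a SYN-ACK retransmission count):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: ask the kernel to defer accept until data arrives
 * (or the converted retransmission budget is exhausted). */
static int enable_defer_accept(int listen_fd)
{
	int defer_secs = 30;    /* illustrative value, in seconds */
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
			  &defer_secs, sizeof(defer_secs));
}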

Let's look at a packet capture. Here the server sets defer accept and the client connects but never sends data; let's see what happens:

//the client sends a syn
19:38:20.631611 IP T-diaoliang.60277 > T-diaoliang.sunproxyadmin: Flags [S], seq 2500439144, win 32792, options [mss 16396,sackOK,TS val 9008384 ecr 0,nop,wscale 4], length 0
//the server replies with syn+ack
19:38:20.631622 IP T-diaoliang.sunproxyadmin > T-diaoliang.60277: Flags [S.], seq 1342179593, ack 2500439145, win 32768, options [mss 16396,sackOK,TS val 9008384 ecr 9008384,nop,wscale 4], length 0
 
//the client sends the final ack
19:38:20.631629 IP T-diaoliang.60277 > T-diaoliang.sunproxyadmin: Flags [.], ack 1, win 2050, options [nop,nop,TS val 9008384 ecr 9008384], length 0
 
//note the timestamps: roughly a minute and a half later, the server retransmits the syn+ack
19:39:55.035893 IP T-diaoliang.sunproxyadmin > T-diaoliang.60277: Flags [S.], seq 1342179593, ack 2500439145, win 32768, options [mss 16396,sackOK,TS val 9036706 ecr 9008384,nop,wscale 4], length 0
19:39:55.035899 IP T-diaoliang.60277 > T-diaoliang.sunproxyadmin: Flags [.], ack 1, win 2050, options [nop,nop,TS val 9036706 ecr 9036706,nop,nop,sack 1 {0:1}], length 0
 
//about another minute later, the server closes the connection.
19:40:55.063435 IP T-diaoliang.sunproxyadmin > T-diaoliang.60277: Flags [F.], seq 1, ack 1, win 2048, options [nop,nop,TS val 9054714 ecr 9036706], length 0
 
19:40:55.063692 IP T-diaoliang.60277 > T-diaoliang.sunproxyadmin: Flags [F.], seq 1, ack 2, win 2050, options [nop,nop,TS val 9054714 ecr 9054714], length 0
 
19:40:55.063701 IP T-diaoliang.sunproxyadmin > T-diaoliang.60277: Flags [.], ack 2, win 2048, options [nop,nop,TS val 9054714 ecr 9054714], length 0

Note that the close happens because, after the SYN-ACK retries are exhausted, nginx finally accepts the connection; its read-event timer then expires, and it ends up closing the connection.

Now let's look at the kernel code.

Start with setting TCP_DEFER_ACCEPT. It is set via setsockopt, and the value passed to the kernel is in seconds. Below is the relevant part of do_tcp_setsockopt, the kernel function that handles TCP-related options; you can see that it essentially converts the passed-in val into the number of retransmissions to allow.

case TCP_DEFER_ACCEPT:
	/* Translate value in seconds to number of retransmits */
	// note the arguments: the initial RTO and the maximum RTO, in seconds
	icsk->icsk_accept_queue.rskq_defer_accept =
		secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
	break;

You can see that secs_to_retrans converts the seconds into a retransmission count. Let's look at that function. It takes three arguments: the number of seconds to convert, the initial RTO value, and the maximum RTO value. Everything is computed in terms of the RTO because this count is the number of SYN-ACK retransmissions.

The implementation is simple: it is just an exponential timer-backoff calculation (see my earlier blog posts for an introduction to timer backoff), doubling the timeout each round to work out how many retransmissions fit into the requested number of seconds.

static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
{
	u8 res = 0;
 
	if (seconds > 0) {
		int period = timeout;
		// retransmission count
		res = 1;
		// start iterating
		while (seconds > period && res < 255) {
			res++;
			// timer backoff
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			// total seconds covered by the timer so far
			period += timeout;
		}
	}
	return res;
}
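
As a worked example, assume TCP_TIMEOUT_INIT/HZ = 3 and TCP_RTO_MAX/HZ = 120 (the values I believe apply to the 3.0-era kernel discussed here) and that 30 seconds were passed to TCP_DEFER_ACCEPT:

secs_to_retrans(30, 3, 120):
    res = 1, period = 3
    30 > 3   ->  res = 2, timeout = 6,  period = 9
    30 > 9   ->  res = 3, timeout = 12, period = 21
    30 > 21  ->  res = 4, timeout = 24, period = 45
    30 > 45 fails -> return 4, i.e. up to 4 SYN-ACK retransmissions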

Next, let's look at how the server handles the final ACK; here we only care about the defer_accept part. The function is tcp_check_req, which validates packets received while the socket is in the SYN_RECV state.

req->retrans is the number of retransmissions performed so far.

The acked flag exists mainly for the SYN-ACK timer.

//two conditions, both of which must hold: the retransmission count is less than defer_accept, and the sequence number shows this is a bare ACK carrying no data.
if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
	TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
	//mark the request as acked here.
	inet_rsk(req)->acked = 1;
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
	return NULL;
}

After tcp_check_req returns, tcp_v4_do_rcv drops this packet, so the socket stays in the half-open (SYN) queue.

Next, the SYN-ACK timer. I have analyzed this timer before (http://simohayha.iteye.com/admin/blogs/481989), so I will only go over it briefly here; see that link for a more detailed analysis. The timer calls inet_csk_reqsk_queue_prune, which does the relevant processing.

Here we mainly care about the retry count. icsk_syn_retries is set by the TCP_SYNCNT option and takes precedence over sysctl_tcp_synack_retries; rskq_defer_accept in turn takes precedence over icsk_syn_retries.

void inet_csk_reqsk_queue_prune(struct sock *parent,
				const unsigned long interval,
				const unsigned long timeout,
				const unsigned long max_rto)
{
	........................
	// maximum number of retries
	int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
	int thresh = max_retries;
	unsigned long now = jiffies;
	struct request_sock **reqp, *req;
	int i, budget;
 
	....................................
	// update the maximum number of retries.
	if (queue->rskq_defer_accept)
		max_retries = queue->rskq_defer_accept;
 
	budget = 2 * (lopt->nr_table_entries / (timeout / interval));
	i = lopt->clock_hand;
 
	do {
		reqp=&lopt->syn_table[i];
		while ((req = *reqp) != NULL) {
			if (time_after_eq(now, req->expires)) {
				int expire = 0, resend = 0;
				// this function mainly decides whether the request has expired and whether to resend the syn ack, storing the results in the expire and resend variables.
				syn_ack_recalc(req, thresh, max_retries,
						   queue->rskq_defer_accept,
						   &expire, &resend);
				....................................................
				if (!expire &&
					(!resend ||
					 !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
					 inet_rsk(req)->acked)) {
					unsigned long timeo;
					// update the retransmission count.
					if (req->retrans++ == 0)
						lopt->qlen_young--;
					timeo = min((timeout << req->retrans), max_rto);
					req->expires = now + timeo;
					reqp = &req->dl_next;
					continue;
				}
				// if it has expired, drop this request and close the corresponding connection.
				/* Drop this request */
				inet_csk_reqsk_queue_unlink(parent, req, reqp);
				reqsk_queue_removed(queue, req);
				reqsk_free(req);
				continue;
			}
			reqp = &req->dl_next;
		}
 
		i = (i + 1) & (lopt->nr_table_entries - 1);
 
	} while (--budget > 0);
	...............................................
}