NAME
ip-tcp_metrics - management for TCP Metrics
SYNOPSIS
ip [ OPTIONS ] tcp_metrics { COMMAND | help }
ip tcp_metrics { show | flush } SELECTOR
ip tcp_metrics delete [ address ] ADDRESS
SELECTOR := [ [ address ] PREFIX ]
EXAMPLES
1234567891011121314151617181920
ip tcp_metrics show address 192.168.0.0/24
Shows the entries for destinations from subnet
ip tcp_metrics show 192.168.0.0/24
The same but address keyword is optional
ip tcp_metrics
Show all is the default action
ip tcp_metrics delete 192.168.0.1
Removes the entry for 192.168.0.1 from cache.
ip tcp_metrics flush 192.168.0.0/24
Removes entries for destinations from subnet
ip tcp_metrics flush all
Removes all entries from cache
ip -6 tcp_metrics flush all
Removes all IPv6 entries from cache keeping the IPv4 entries.
struct rtable
{
union
{
struct dst_entry dst;
} u;
...
// 以下为记录二元组的信息
__be32 rt_dst; /* Path destination */
__be32 rt_src; /* Path source */
int rt_iif;
/* Info on neighbour */
__be32 rt_gateway;
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
// peer很重要!
struct inet_peer *peer; /* long-living peer info */
};
注意这个peer字段,很重要!peer结构体记如下:
123456789101112131415
struct inet_peer
{
/* group together avl_left,avl_right,v4daddr to speedup lookups */
struct inet_peer *avl_left, *avl_right;
__be32 v4daddr; /* peer's address */
__u16 avl_height;
__u16 ip_id_count; /* IP ID for the next packet */
struct list_head unused;
__u32 dtime; /* the time of last use of not
* referenced entries */
atomic_t refcnt;
atomic_t rid; /* Frag reception counter */
__u32 tcp_ts;
unsigned long tcp_ts_stamp;
};
struct inet_peer {
/* group together avl_left,avl_right,v4daddr to speedup lookups */
struct inet_peer __rcu *avl_left, *avl_right;
struct inetpeer_addr daddr;
__u32 avl_height;
// 此为关键!
u32 metrics[RTAX_MAX];
u32 rate_tokens; /* rate limiting for ICMP */
unsigned long rate_last;
union {
struct list_head gc_list;
struct rcu_head gc_rcu;
};
/*
* Once inet_peer is queued for deletion (refcnt == -1), following fields
* are not available: rid, ip_id_count
* We can share memory with rcu_head to help keep inet_peer small.
*/
union {
struct {
atomic_t rid; /* Frag reception counter */
atomic_t ip_id_count; /* IP ID for the next packet */
};
struct rcu_head rcu;
struct inet_peer *gc_next;
};
/* following fields might be frequently dirtied */
__u32 dtime; /* the time of last use of not referenced entries */
atomic_t refcnt;
};
ip tun add lxT mode ipip remote 192.168.9.6 local 192.168.9.5
ip link set lxT up
ip add add 192.168.200.1 brd 255.255.255.255 peer 192.168.200.2 dev lxT
ip ro add 192.168.200.0/24 via 192.168.200.1
B:
1234
ip tun add lxT mode ipip remote 192.168.9.5 local 192.168.9.6
ip link set lxT up
ip add add 192.168.200.2 brd 255.255.255.255 peer 192.168.200.1 dev lxT
ip ro add 192.168.200.0/24 via 192.168.200.2
在A机器添加路由信息,指定到192.168.10.6通过lxT
1
ip ro add 192.168.10.6/32 dev lxT
在B机器添加路由信息,指定到192.168.8.5通过lxT
1
ip ro add 192.168.8.5/32 dev lxT
这样 192.168.8.5 和 192.168.10.6 就可以相互ping通了
部分参数
ttl N 设置进入通道数据包的TTL为N。N是一个1—255之间的数字。0是一个特殊的值,表示这个数据包的TTL值是继承(inherit)的。ttl参数的缺省值是:inherit。
tos T或者dsfield T 设置进入通道数据包的TOS域,缺省是inherit。
mode MODE 设置通道模式。有效的模式包括:ipip、sit和gre。
nopmtudisc 在这个通道上禁止路径最大传输单元发现( Path MTU Discovery)。默认情况下,这个功能是打开的。注意:这个选项和固定的ttl是不兼容的,如果使用了固定的ttl参数,系统会打开路径最大传输单元发现( Path MTU Discovery)功能。
ifconfig eth0 192.168.1.191/24
modprobe ipip # ip tun add tunl0 mode ipip remote any local any
ip link set tunl0 mtu 1480 up
ifconfig tunl0:0 14.0.0.1/24
ip tun add tunl1 mode ipip remote 12.0.0.1 local 12.0.0.102 dev eth1 # better
#ip tun add tunl1 mode ipip remote 192.168.1.102 local 192.168.1.191 dev eth0 # upload err
ip link set tunl1 mtu 1480 up
ip rule add from 14.0.0.1 table 1
ip route add table 1 default dev tunl1
find /proc/ -name rp_filter -exec cat {} \;
find /proc/ -name rp_filter -exec sh -c "echo 0 > {}" \;
GW:
1234567891011121314
find /proc/ -name rp_filter -exec cat {} \;
find /proc/ -name rp_filter -exec sh -c "echo 0 > {}" \;
echo 1 > /proc/sys/net/ipv4/ip_forward
modprobe ipip # ip tun add tunl0 mode ipip remote any local any
ip link set tunl0 mtu 1480 up
ip tun add tunl1 mode ipip remote 192.168.1.191 local 192.168.1.102 dev enp3s0 # better
#ip tun add tunl1 mode ipip remote 12.0.0.102 local 12.0.0.1 dev enx00e04b367c0c # upload err?
ip link set tunl1 mtu 1480 up
ip rule add from 11.0.0.20 table 1
ip route add table 1 default dev tunl1
router># ip rule
0: from all lookup local
1500 from 192.168.3.112/32 [tos 0x10] lookup 2
32766: from all lookup main
32767: from all lookup default
32800: from all lookup 1
Usually, the route taken from Host A to Host B is the same as the route used to get back from Host B to Host A; the route is then called symmetric . In complex setups, the route back may be different; in this case, it is asymmetric.
关于反向路径过滤,参考《Understanding Linux Network Internals》三十一章,
31.7. Reverse Path Filtering
We saw what an asymmetric route is in the section “Essential Elements of Routing in Chapter 30. Asymmetric routes are not common, but may be necessary in certain cases. The default behavior of Linux is to consider asymmetric routing suspicious and therefore to drop any packet whose source IP address is not reachable through the device the packet was received from, according to the routing table.
However, this behavior can be tuned via /proc on a per-device basis, as we will see in Chapter 36. See also the section “Input Routing” in Chapter 35.
rp_filter - INTEGER
0 - No source validation.
1 - Strict mode as defined in RFC3704 Strict Reverse Path
Each incoming packet is tested against the FIB and if the interface
is not the best reverse path the packet check will fail.
By default failed packets are discarded.
2 - Loose mode as defined in RFC3704 Loose Reverse Path
Each incoming packet's source address is also tested against the FIB
and if the source address is not reachable via any interface
the packet check will fail.
Current recommended practice in RFC3704 is to enable strict mode
to prevent IP spoofing from DDos attacks. If using asymmetric routing
or other complicated routing, then loose mode is recommended.
The max value from conf/{all,interface}/rp_filter is used
when doing source validation on the {interface}.
Default value is 0. Note that some distributions enable it
in startup scripts.
The routing cache has been suppressed in Linux 3.6 after a 2 years effort by David and the other Linux kernel developers.
The global cache has been suppressed and some stored information have been moved to more separate resources like socket.
Metrics were stored in the routing cache entry which has disappeared. So it has been necessary to introduce a separate TCP metrics cache.
A netlink interface is available to update/delete/add entry to the cache.
当内核接收到一个TCP数据包来说,首先需要查找skb对应的路由,然后查找skb对应的socket。David Miller 发现这样做是一种浪费,对于属于同一个socket(只考虑ESTABLISHED情况)的路由是相同的,那么如果能将skb的路由缓存到socket(skb->sk)中,就可以只查找查找一次skb所属的socket,就可以顺便把路由找到了,于是David Miller提交了一个patch ipv4: Early TCP socket demux
1234567891011121314151617181920212223242526272829
(commit 41063e9dd11956f2d285e12e4342e1d232ba0ea2)
ipv4: Early TCP socket demux.
Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.
But we can optimize this down to one demux for certain kinds of local
sockets.
Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.
If a TCP socket is established then it's identity is fully specified.
This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.
Once we move to established state, we cache the receive packet's input
route to use later.
Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().
Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.
然而Davem添加的这个patch是有局限的,因为这个处理对于转发的数据包,增加了一个在查找路由之前查找socket的逻辑,可能导致转发效率的降低。
Alexander Duyck提出增加一个ip_early_demux参数来控制是否启动这个特性。
12345678910
This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%. For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added. It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.
By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.