(1)Open:Normal state, no dubious events, fast path.
(2)Disorder:In all respects it is Open, but requires a bit more attention.
It is entered when we see some SACKs or dupacks. It is split off from Open mainly to move some processing from the fast path to the slow one.
(3)CWR:cwnd was reduced due to some Congestion Notification event.
It can be ECN, ICMP source quench, local device congestion.
(4)Recovery:cwnd was reduced, we are fast-retransmitting.
(5)Loss:cwnd was reduced due to RTO timeout or SACK reneging.
tcp_fastretrans_alert() is entered:
(1)each incoming ACK, if state is not Open
(2)when arrived ACK is unusual, namely:
SACK
Duplicate ACK
ECN ECE
Counting packets in flight is pretty simple.
(1)in_flight = packets_out - left_out + retrans_out
packets_out is SND.NXT - SND.UNA counted in packets.
retrans_out is number of retransmitted segments.
left_out is number of segments left network, but not ACKed yet.
(2)left_out = sacked_out + lost_out
sacked_out:Packets, which arrived to receiver out of order and hence not ACKed. With SACK this number is simply amount of SACKed data. Even without SACKs it is easy to give pretty reliable estimate of this number, counting duplicate ACKs.
(3)lost_out:Packets lost by network. TCP has no explicit loss notification feedback from network(for now). It means that this number can be only guessed. Actually, it is the heuristics to predict lossage that distinguishes different algorithms.
F.e. after RTO, when all the queue is considered as lost, lost_out = packets_out and in_flight = retrans_out.
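These two formulas appear essentially verbatim as small helpers in include/net/tcp.h; shown here for reference:

static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
{
	return tp->sacked_out + tp->lost_out;
}

static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
{
	return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
}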
Essentially, we have now two algorithms counting lost packets.
1)FACK:It is the simplest heuristics. As soon as we decided that something is lost, we decide that all not SACKed packets until the most forward SACK are lost. I.e.
lost_out = fackets_out - sacked_out and left_out = fackets_out
It is an absolutely correct estimate, if network does not reorder packets. And it loses any connection to reality when reordering takes place. We use FACK by default until reordering is suspected on the path to this destination.
2)NewReno:when Recovery is entered, we assume that one segment is lost (classic Reno). While we are in Recovery and a partial ACK arrives, we assume that one more packet is lost (NewReno).
This heuristic is the same in NewReno and SACK.
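As a hypothetical one-liner (not an actual kernel function), the FACK estimate above reduces to:

/* Sketch only: assuming no reordering, every un-SACKed segment below
 * the highest SACK (fackets_out counts that far) is presumed lost. */
static inline u32 fack_lost_estimate(const struct tcp_sock *tp)
{
	return tp->fackets_out - tp->sacked_out;	/* = lost_out */
}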
Imagine, that’s all! Forget about all this shamanism about CWND inflation deflation etc. CWND is real congestion window, never inflated, changes only according to classic VJ rules.
Really tricky (and requiring careful tuning) part of algorithm is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
tcp_time_to_recover()
It determines the moment when we should reduce cwnd and, hence, slow down forward transmission. In fact, it determines the moment when we decide that hole is caused by loss, rather than by a reorder.
tcp_xmit_retransmit_queue()
It decides what we should retransmit to fill holes, caused by lost packets.
undo heuristics
And the most logically complicated part of algorithm is undo heuristics. We detect false retransmits due to both too early fast retransmit (reordering) and underestimated RTO, analyzing timestamps and D-SACKs. When we detect that some segments were retransmitted by mistake and CWND reduction was wrong, we undo window reduction and abort recovery phase. This logic is hidden inside several functions named tcp_try_undo_.
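The core test shared by those tcp_try_undo_* helpers is small; roughly (the exact form varies across kernel versions):

/* Timestamps or D-SACKs prove the retransmission was unnecessary:
 * either all retransmits were D-SACKed away, or the echoed timestamp
 * shows the ACK covers the original transmission. */
static inline int tcp_packet_delayed(struct tcp_sock *tp)
{
	return !tp->retrans_stamp ||
	       (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
	        before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
}

static inline int tcp_may_undo(struct tcp_sock *tp)
{
	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
}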
tcp_fastretrans_alert() is divided into several phases:
A. FLAG_ECE: an ACK carrying the ECE flag was received.
B. Reneging SACKs: the ACK points to data that was already SACKed. If so, run the timeout handling, then return.
C. State is not Open: packets were lost, so mark the lost segments; then we know which ones to retransmit.
D. Sanity check for errors (left_out > packets_out would be a bug).
E. How each state is exited, once snd_una >= high_seq.
F. Processing and entry of each state.
/* Process an event, which can update packets-in-flight not trivially.
* Main goal of this function is to calculate new estimate for left_out,
* taking into account both packets sitting in receiver's buffer and
* packets lost by network.
*
* Besides that it does CWND reduction, when packet loss is detected
* and changes state of machine.
*
* It does not decide what to send, it is made in function
* tcp_xmit_retransmit_queue().
*/
/* Conditions under which this function is called:
* (1) each incoming ACK, if state is not Open
* (2) when arrived ACK is unusual, namely:
* SACK
* Duplicate ACK
* ECN ECE
*/
static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
/* Is this ACK a dupack? */
int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
/* tcp_fackets_out() returns the size of the hole; if it exceeds the
reordering threshold, we assume packets were lost. */
int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
(tcp_fackets_out(tp) > tp->reordering));
int fast_rexmit = 0, mib_idx;
/* If packets_out is 0, there cannot be any sacked_out. */
if (WARN_ON(!tp->packets_out && tp->sacked_out))
tp->sacked_out = 0;
/* The FACK count requires at least one SACKed segment. */
if (WARN_ON(!tp->sacked_out && tp->fackets_out))
tp->fackets_out = 0;
/* Now state machine starts.
* A. ECE, hence prohibit cwnd undoing, the reduction is required.
* Forbid congestion-window undo and begin reducing the window.
*/
if (flag & FLAG_ECE)
tp->prior_ssthresh = 0;
/* B. In all the states check for reneging SACKs.
* I.e. does this ACK acknowledge data that was already SACKed?
*/
if (tcp_check_sack_reneging(sk, flag))
return;
/* C. Process data loss notification, provided it is valid.
* We are not in Open and loss was detected: mark the lost segments
* so we know which ones to retransmit.
* (Why all of these conditions are needed is not entirely obvious.)
*/
if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOSS) &&
before(tp->snd_una, tp->high_seq) &&
icsk->icsk_ca_state != TCP_CA_Open &&
tp->fackets_out > tp->reordering) {
tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
}
/* D. Check consistency of the current state.
* Make sure that left_out <= packets_out.
*/
tcp_verify_left_out(tp);
/* E. Check state exit conditions. State can be terminated
* when high_seq is ACKed. */
if (icsk->icsk_ca_state == TCP_CA_Open) {
/* In the Open state there can be no retransmitted-but-unacked segments. */
WARN_ON(tp->retrans_out != 0);
/* Clear the send time of the first retransmission of the last recovery. */
tp->retrans_stamp = 0;
} else if (!before(tp->snd_una, tp->high_seq)) { /* high_seq has been ACKed */
switch (icsk->icsk_ca_state) {
case TCP_CA_Loss:
icsk->icsk_retransmits = 0; /* reset the RTO retransmission count */
/* Whether or not the undo succeeds, we return to Open, unless SACK is not in use. */
if (tcp_try_undo_recovery(sk))
return;
break;
case TCP_CA_CWR:
/* CWR is to be held something *above* high_seq is ACKed
* for CWR bit to reach receiver.
* I.e. we need snd_una > high_seq before we may leave CWR.
*/
if (tp->snd_una != tp->high_seq) {
tcp_complete_cwr(sk);
tcp_set_ca_state(sk, TCP_CA_Open);
}
break;
case TCP_CA_Disorder:
tcp_try_undo_dsack(sk);
/* For SACK case do not Open to allow to undo
* catching for all duplicate ACKs. */
if (!tp->undo_marker || tcp_is_reno(tp) ||
tp->snd_una != tp->high_seq) {
tp->undo_marker = 0;
tcp_set_ca_state(sk, TCP_CA_Open);
}
break;
case TCP_CA_Recovery:
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp); /* zero sacked_out */
if (tcp_try_undo_recovery(sk))
return;
tcp_complete_cwr(sk);
break;
}
}
/* F. Process state. */
switch (icsk->icsk_ca_state) {
case TCP_CA_Recovery:
if (!(flag & FLAG_SND_UNA_ADVANCED)) {
if (tcp_is_reno(tp) && is_dupack)
tcp_add_reno_sack(sk); /* increase sacked_out; check for reordering */
} else
do_lost = tcp_try_undo_partial(sk, pkts_acked);
break;
case TCP_CA_Loss:
/* A partial ACK arrived: reset the RTO retransmission count. */
if (flag & FLAG_DATA_ACKED)
icsk->icsk_retransmits = 0;
if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
tcp_reset_reno_sack(tp); /* zero sacked_out */
if (!tcp_try_undo_loss(sk)) { /* try to undo the cwnd reduction and enter Open */
/* Cannot undo: keep retransmitting the segments marked lost. */
tcp_moderate_cwnd(tp);
tcp_xmit_retransmit_queue(sk); /* discussed below */
return;
}
if (icsk->icsk_ca_state != TCP_CA_Open)
return;
/* Loss is undone; fall through to process in Open state. */
default:
if (tcp_is_reno(tp)) {
if (flag & FLAG_SND_UNA_ADVANCED)
tcp_reset_reno_sack(tp);
if (is_dupack)
tcp_add_reno_sack(sk);
}
if (icsk->icsk_ca_state == TCP_CA_Disorder)
tcp_try_undo_dsack(sk); /* D-SACK acknowledged all retransmitted segments */
/* Is it time to enter the Recovery state? */
if (!tcp_time_to_recover(sk)) {
/* This decides whether to enter the Open, Disorder or CWR state. */
tcp_try_to_open(sk, flag);
return;
}
/* MTU probe failure: don't reduce cwnd */
/* (the MTU probing part is omitted here) */
......
/* Otherwise enter Recovery state */
if (tcp_is_reno(tp))
mib_idx = LINUX_MIB_TCPRENORECOVERY;
else
mib_idx = LINUX_MIB_TCPSACKRECOVERY;
NET_INC_STATS_BH(sock_net(sk), mib_idx);
/* Before entering Recovery, save the data needed for a later undo. */
tp->high_seq = tp->snd_nxt; /* used to decide when to exit the state */
tp->prior_ssthresh = 0;
tp->undo_marker = tp->snd_una;
tp->undo_retrans = tp->retrans_out;
if (icsk->icsk_ca_state < TCP_CA_CWR) {
if (!(flag & FLAG_ECE))
tp->prior_ssthresh = tcp_current_ssthresh(sk); /* save the old threshold */
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* update the threshold */
TCP_ECN_queue_cwr(tp);
}
tp->bytes_acked = 0;
tp->snd_cwnd_cnt = 0;
tcp_set_ca_state(sk, TCP_CA_Recovery); /* enter the Recovery state */
fast_rexmit = 1; /* fast-retransmit flag */
}
if (do_lost || (tcp_is_fack(tp) && tcp_head_timeout(sk)))
/* Update the scoreboard: mark lost and timed-out packets, increasing lost_out. */
tcp_update_scoreboard(sk, fast_rexmit);
/* Reduce snd_cwnd. */
tcp_cwnd_down(sk, flag);
tcp_xmit_retransmit_queue(sk);
}
/* If ACK arrived pointing to a remembered SACK, it means that our remembered
* SACKs do not reflect real state of receiver i.e. receiver host is heavily congested
* or buggy.
*
* Do processing similar to RTO timeout.
*/
static int tcp_check_sack_reneging(struct sock *sk, int flag)
{
if (flag & FLAG_SACK_RENEGING) {
struct inet_connection_sock *icsk = inet_csk(sk);
/* Record MIB information for SNMP. */
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
/* Enter the Loss state; the 1 means clearing the SACKED flags. */
tcp_enter_loss(sk, 1); /* analyzed in an earlier post :) */
icsk->icsk_retransmits++; /* one more unrecovered RTO */
/* Retransmit the first packet in the write queue. */
tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
/* Reset the retransmission timer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
icsk->icsk_rto, TCP_RTO_MAX);
return 1;
}
return 0;
}
/** Returns the first packet in the send queue, or NULL if the queue is empty.
* skb_peek - peek at the head of an &sk_buff_head
* @list_ : list to peek at
*
* Peek an &sk_buff. Unlike most other operations you must
* be careful with this one. A peek leaves the buffer on the
* list and someone else may run off with it. You must hold
* the appropriate locks or have a private queue to do this.
*
* Returns %NULL for an empty list or a pointer to the head element.
* The reference count is not incremented and the reference is therefore
* volatile. Use with caution.
*/
static inline struct sk_buff *skb_peek(const struct sk_buff_head *list_)
{
struct sk_buff *list = ((const struct sk_buff *) list_)->next;
if (list == (struct sk_buff *) list_)
list = NULL;
return list;
}
static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)
{
return skb_peek(&sk->sk_write_queue);
}
/* Function to create two new TCP segments. Shrinks the given segment
 * to the specified size and appends a new segment with the rest of the
 * packet to the list. This won't be called frequently, I hope.
 * Remember, these are still headerless SKBs at this point.
 */
int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
unsigned int mss_now) {}
Marking a segment with the LOST flag:
static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
{
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST | TCPCB_SACKED_ACKED))) {
tcp_verify_retransmit_hint(tp, skb); /* update the retransmit hint */
tp->lost_out += tcp_skb_pcount(skb); /* one more LOST segment */
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; /* add the LOST flag */
}
}
/* This must be called before lost_out is incremented */
static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
{
if ((tp->retransmit_skb_hint == NULL) ||
before(TCP_SKB_CB(skb)->seq,
TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
tp->retransmit_skb_hint = skb;
if (!tp->lost_out ||
after(TCP_SKB_CB(skb)->end_seq, tp->retransmit_high))
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
}
If a sequence number greater than high_seq is ACKed, the CWR indication has reached the peer TCP, and tcp_complete_cwr() is called to bring cwnd down to the ssthresh value.
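For reference, in the kernel version discussed here tcp_complete_cwr() amounts to roughly the following (later versions differ):

static inline void tcp_complete_cwr(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* Finish the cwnd reduction: clamp cwnd to the slow-start threshold. */
	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
	tp->snd_cwnd_stamp = tcp_time_stamp;
}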
/* This gets called after a retransmit timeout, and the initially retransmitted data is
* acknowledged. It tries to continue resending the rest of the retransmit queue, until
* either we've sent it all or the congestion window limit is reached. If doing SACK,
* the first ACK which comes back for a timeout based retransmit packet might feed us
FACK information again. If so, we use it to avoid unnecessary retransmissions.
*/
void tcp_xmit_retransmit_queue(struct sock *sk) {}
/* This function decides when we should leave the Disordered state
 * and enter the Recovery phase, reducing the congestion window.
 */
static int tcp_time_to_recover(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
__u32 packets_out;
/* Do not perform any recovery during F-RTO algorithm.
 * This implies that Recovery cannot interrupt the Loss state. */
if (tp->frto_counter)
return 0;
/* Trick#1: The loss is proven.
 * If segments in flight were lost, we may enter Recovery. */
if (tp->lost_out)
return 1;
/* Not-A-Trick#2: Classic rule...
 * If the duplicate-ACK count exceeds the reordering threshold,
 * packets are presumed lost and we may enter Recovery. */
if (tcp_dupack_heuristics(tp) > tp->reordering)
return 1;
/* Trick#3: when we use RFC2988 timer restart, fast
 * retransmit can be triggered by timeout of queue head.
 * If the first packet in the write queue has timed out, enter Recovery. */
if (tcp_is_fack(tp) && tcp_head_timeout(sk))
return 1;
/* Trick#4: It is still not OK... But will it be useful to delay recovery more?
 * If we cannot send new data (limited by the application or the receive
 * window) and many duplicate ACKs have arrived, there is no point in
 * waiting longer: presume loss and enter Recovery right away. */
packets_out = tp->packets_out;
if (packets_out <= tp->reordering &&
tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
!tcp_may_send_now(sk)) {
/* We have nothing to send. This connection is limited
 * either by receiver window or by application.
 */
return 1;
}
/* If a thin stream is detected, retransmit after first received
 * dupack. Employ only if SACK is supported in order to avoid
 * possible corner-case series of spurious retransmissions.
 * Use only if there are no unsent data.
 */
if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
tcp_is_sack(tp) && !tcp_send_head(sk))
return 1;
return 0; /* not yet time to recover */
}
/* Heuristics to calculate the number of duplicate ACKs. There's no
 * dupACKs counter when SACK is enabled (without SACK, sacked_out
 * is used for that purpose).
 * Instead, with FACK TCP uses fackets_out that includes both SACKed
 * segments up to the highest received SACK block so far and holes in
 * between them.
 *
 * With reordering, holes may still be in flight, so RFC3517 recovery
 * uses pure sacked_out (total number of SACKed segments) even though
 * it violates the RFC that uses duplicate ACKs. Often these are equal,
 * but when e.g. out-of-window ACKs or packet duplication occurs, they
 * differ. Since neither occurs due to loss, TCP should really ignore them.
 */
static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
{
return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
}
/* Determines whether this is a thin stream (which may suffer from increased
* latency). Used to trigger latency-reducing mechanisms.
*/
static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
{
return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
}
#define TCP_INFINITE_SSTHRESH 0x7fffffff
static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
{
return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
}
This function examines various parameters (such as the number of packets lost) to decide whether it is the right time to move the connection to the Recovery state. It is time to recover when the heuristics suggest a strong possibility of packet loss in the network; the following checks are made.
In short, as soon as loss is certain, or very likely, we may enter the Recovery state and start repairing it.
The conditions for entering the Recovery state are:
(1) Some packets are lost (lost_out is non-zero).
(2) SACK is an acknowledgement of out-of-order packets: if the number of packets SACKed is greater than the reordering metric of the network, loss is assumed to have happened. That is, the FACKed count (or the duplicate-ACK count) exceeds the reordering threshold, which strongly suggests loss.
(3) If the first packet waiting to be ACKed (the head of the write queue) has waited for a time equivalent to the retransmission timeout, that packet is assumed to have been lost.
(4) If the following three conditions all hold, the sender is in a state where no more data can be transmitted, and the number of packets SACKed is big enough to assume that the rest of the packets were lost in the network:
A: The number of packets in flight is less than the reordering metric.
B: More than half of the packets in flight have been SACKed by the receiver, or the number of packets SACKed is more than the fast-retransmit threshold (the number of dupacks the sender awaits before fast retransmitting).
C: The sender cannot send any more packets, either because it is bound by the sliding window or because the application has not delivered any more data in anticipation of ACKs for the data already provided.
In other words, many duplicate ACKs have arrived, so some segment was very likely lost; since the receive window or the application keeps us from sending anything new, there is no point in waiting, and we enter Recovery immediately.
(5) The flow is detected to be thin (packets_out < 4), and the following also hold:
A: tp->thin_dupack == 1 (fast retransmit on the first dupack), or sysctl_tcp_thin_dupack is set, allowing retransmission on the first duplicate ACK.
B: SACK is enabled, and the FACK or SACK count is greater than 1.
C: There is no unsent data, i.e. tcp_send_head(sk) == NULL.
This special case is applied only when the flow is very small.
In the Disorder state TCP is still unsure of the genuineness of the loss. After receiving ACKs with SACK, there may be a clearing ACK that acknowledges many packets non-dubiously in one go. Such a clearing ACK could cause a packet burst in the network; to avoid this, cwnd is reduced so that no more than max_burst (usually 3) additional packets may be sent.
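The clamp described here is tcp_moderate_cwnd(), already seen in the Loss branch of tcp_fastretrans_alert() above:

/* CWND moderation, preventing bursts due to too big ACKs
 * in dubious situations.
 */
static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
{
	/* Never allow more than in_flight + max_burst packets out. */
	tp->snd_cwnd = min(tp->snd_cwnd,
			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
	tp->snd_cwnd_stamp = tcp_time_stamp;
}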
Congestion Control: a mechanism to prevent a TCP sender from overwhelming the network.
Flow Control: a mechanism to prevent a TCP sender from overwhelming a TCP receiver.
Here is a brief description of how flow control works:
“The basic flow control algorithm works as follows: The receiver communicates to the sender the maximum amount of data it can accept using the rwnd protocol field. This is called the receive window. The TCP sender then sends no more than this amount of data across the network. The TCP sender then stops and waits for acknowledgements back from the receiver. When acknowledgement of the previously sent data is returned to the sender, the sender then resumes sending new data. It’s essentially the old maxim hurry up and wait. ”
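A hypothetical helper (not kernel code) expressing that rule with the fields used elsewhere in this post:

/* Sketch only: bytes the sender may still put on the wire before
 * running into the receiver's advertised window. */
static inline u32 usable_send_window(const struct tcp_sock *tp)
{
	u32 wnd_end = tp->snd_una + tp->snd_wnd;	/* cf. tcp_wnd_end() below */

	return after(wnd_end, tp->snd_nxt) ? wnd_end - tp->snd_nxt : 0;
}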
It has been demonstrated that this method can successfully grow the receiver's advertised window at a pace sufficient to avoid constraining the sender's throughput. As a result, systems can avoid the network performance problems that result from either the under-utilization or over-utilization of buffer space.
Here is an assessment of this method:
If the sender is being throttled by the network, this estimate will be valid. However, if the sending application did not have any data to send, the measured time could be much larger than the actual round-trip time. Thus this measurement acts only as an upper-bound on the round-trip time.
/* win_dep indicates whether to fine-tune the RTT sample: 1 = no fine-tuning, 0 = fine-tune. */
static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
{
u32 new_sample = tp->rcv_rtt_est.rtt;
long m = sample;
if (m == 0)
m = 1; /* the minimum delay is 1 ms */
if (new_sample != 0) { /* not the first sample */
/* If we sample in larger samples in the non-timestamp case, we could grossly
* overestimate the RTT especially with chatty applications or bulk transfer apps
* which are stalled on filesystem I/O.
*
* Also, since we are only going for a minimum in the non-timestamp case, we do
* not smooth things out else with timestamps disabled convergence takes too long.
*/
/* Fine-tune the RTT: the new sample contributes only 1/8 of the final RTT. */
if (!win_dep) {
m -= (new_sample >> 3);
new_sample += m;
} else if (m < new_sample)
/* No fine-tuning: just take the minimum; see the comment above. */
new_sample = m << 3;
} else {
/* No previous measure; this is the first sample. */
new_sample = m << 3;
}
if (tp->rcv_rtt_est.rtt != new_sample)
tp->rcv_rtt_est.rtt = new_sample; /* update the RTT */
}
For the RTT measurement method that does not use the timestamp option, no fine-tuning is applied: samples obtained this way are already biased high and converge slowly, so the minimum RTT sample is taken directly as the final measurement.
For the method that uses the timestamp option, fine-tuning is applied: the new sample contributes 1/8 of the final RTT, i.e. rtt = 7/8 × old + 1/8 × new. For example, with an old RTT of 40 ms and a new sample of 80 ms, the updated RTT is 7/8 × 40 + 1/8 × 80 = 45 ms.
When tcp_moderate_rcvbuf is enabled, this parameter fine-tunes the values from which the receive buffer and receive window are computed; its default value is 2.
This means that the application buffer is ¼th of the total buffer space specified in the tcp_rmem variable.
In order to keep pace with the growth of the sender’s congestion window during slow-start, the receiver should use the same doubling factor. Thus the receiver should advertise a window that is twice the size of the last measured window size.
This ensures that the cap on the receive window grows no slower than the congestion window, preventing the receive window from becoming the transmission bottleneck.
(2) What is the impact of receiving out-of-order packets?
Packets that are received out of order may have lowered the goodput during this measurement, but will increase the goodput of the following measurement which, if larger, will supersede this measurement.
That is, out-of-order packets make the current goodput measurement smaller and the next one larger.
Reference
[1] Mike Fisk, Wu-chun Feng, “Dynamic Right-Sizing in TCP”.
In computer networking, large segment offload (LSO) is a technique for increasing outbound
throughput of high-bandwidth network connections by reducing CPU overhead. It works by queuing
up large buffers and letting the network interface card (NIC) split them into separate packets.
The technique is also called TCP segmentation offload (TSO) when applied to TCP, or generic
segmentation offload (GSO).
The inbound counterpart of large segment offload is large receive offload (LRO).
When large chunks of data are to be sent over a computer network, they need to be first broken
down to smaller segments that can pass through all the network elements like routers and
switches between the source and destination computers. This process is referred to as
segmentation. Segmentation is often done by the TCP protocol in the host computer. Offloading
this work to the NIC is called TCP segmentation offload (TSO).
For example, a unit of 64KB (65,536 bytes) of data is usually segmented to 46 segments of 1448
bytes each before it is sent over the network through the NIC. With some intelligence in the NIC,
the host CPU can hand over the 64KB of data to the NIC in a single transmit request, the NIC can
break that data down into smaller segments of 1448 bytes, add the TCP, IP, and data link layer
protocol headers (according to a template provided by the host's TCP/IP stack) to each
segment, and send the resulting frames over the network. This significantly reduces the work
done by the CPU. Many new NICs on the market today support TSO. [1]
Details
It is a method to reduce the CPU workload of cutting packets into 1500-byte units by
asking the hardware to perform the same functionality.
1. The TSO feature is implemented with hardware support. This means the hardware should
be able to segment the packet into segments of at most 1500 bytes and reattach the
header to every packet.
2. Every network device is represented by a netdevice structure in the kernel. If the
hardware supports TSO, it enables the segmentation-offload features in the netdevice,
represented mainly by NETIF_F_TSO and related fields. [2]
TCP Segmentation Offload is supported in Linux by the network device layer. A driver that wants
to offer TSO needs to set the NETIF_F_TSO bit in the network device structure. In order for a
device to support TSO, it needs to also support Net : TCP Checksum Offloading and
Net : Scatter Gather.
The driver will then receive super-sized skbs. These are indicated to the driver by
skb_shinfo(skb)->gso_size being non-zero. gso_size is the size at which the hardware
should fragment the TCP data. TSO may change how and when TCP decides to send data. [3]
Implementation
/* This data is invariant across clones and lives at the end of the
 * header data, ie. at skb->end.
 */
struct skb_shared_info {
...
unsigned short gso_size; // size of each segment
unsigned short gso_segs; // number of segments the skb is split into
unsigned short gso_type;
struct sk_buff *frag_list; // list of the fragmented packets
...
}
/* Initialize TSO state of skb.
* This must be invoked the first time we consider transmitting
* SKB onto the wire.
*/
static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
unsigned int mss_now)
{
int tso_segs = tcp_skb_pcount(skb);
/* If not yet segmented, or segmented with a segment size different
 * from the current MSS, (re)compute the TSO segmentation. */
if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
tcp_set_skb_tso_segs(sk, skb, mss_now);
tso_segs = tcp_skb_pcount(skb); /* re-read the segment count */
}
return tso_segs;
}
/* Initialize TSO segments for a packet. */
static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
unsigned int mss_now)
{
/* No segmentation is needed when:
 * 1. the data does not exceed the maximum allowed length (MSS),
 * 2. the NIC does not support GSO, or
 * 3. the NIC cannot recompute the checksum.
 */
if (skb->len <= mss_now || !sk_can_gso(sk) ||
skb->ip_summed == CHECKSUM_NONE) {
/* Avoid the costly divide in the normal non-TSO case.*/
skb_shinfo(skb)->gso_segs = 1;
skb_shinfo(skb)->gso_size = 0;
skb_shinfo(skb)->gso_type = 0;
} else {
/* Compute how many segments are needed, rounding up. */
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
skb_shinfo(skb)->gso_size = mss_now; /* size of each segment */
skb_shinfo(skb)->gso_type = sk->sk_gso_type;
}
}
/* Due to TSO, an SKB can be composed of multiple actual packets.
* To keep these tracked properly, we use this.
*/
static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
return skb_shinfo(skb)->gso_segs;
}
/* This is valid if tcp_skb_pcount() > 1 */
static inline int tcp_skb_mss(const struct sk_buff *skb)
{
return skb_shinfo(skb)->gso_size;
}
static inline int sk_can_gso(const struct sock *sk)
{
/* sk_route_caps holds the NIC driver's capabilities; sk_gso_type is
 * the GSO type, set to SKB_GSO_TCPV4.
 */
return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}
static inline int net_gso_ok(int features, int gso_type)
{
int feature = gso_type << NETIF_F_GSO_SHIFT;
return (features & feature) == feature;
}
sk_gso_max_size
The NIC also specifies the maximum segment size it can handle, in the sk_gso_max_size
field; mostly it will be set to 64K. This means that if TCP has more than 64K of data
to send, it has to cut the data into 64K chunks before pushing them to the interface.
The idea behind GSO seems to be that many of the performance benefits of LSO (TSO/UFO)
can be obtained in a hardware-independent way, by passing large “superpackets” around for
as long as possible, and deferring segmentation to the last possible moment - for devices
without hardware segmentation/fragmentation support, this would be when data is actually
handed to the device driver; for devices with hardware support, it could even be done in hardware.
Try to defer sending, if possible, in order to minimize the amount of TSO splitting we do.
View it as a kind of TSO Nagle test.
/** This algorithm is from John Heffner.
* Returns 0: send now; 1: defer.
*/
static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
u32 in_flight, send_win, cong_win, limit;
int win_divisor;
/* If this skb carries a FIN, send it now. */
if (TCP_SKB_CB(skb)->flags & TCPHDR_FIN)
goto send_now;
/* If we are not in the Open state, send now. */
if (icsk->icsk_ca_state != TCP_CA_Open)
goto send_now;
/* Defer for less than two clock ticks.
 * If the previous skb was deferred, and that was more than one jiffy ago,
 * do not defer again; a TSO deferral never lasts more than two jiffies. */
if (tp->tso_deferred && (((u32)jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
goto send_now;
in_flight = tcp_packets_in_flight(tp);
/* BUG if this segment needs no splitting, or if the congestion
 * window does not allow sending at all. */
BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
/* Remaining space in the advertised window. */
send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
/* Remaining space in the congestion window. */
cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;
/* The smaller of the two is the final send limit. */
limit = min(send_win, cong_win);
/* If a full-sized TSO skb can be sent, do it; typically 64 KB. */
if (limit >= sk->sk_gso_max_size)
goto send_now;
/* Middle in queue won't get any more data, full sendable already ? */
if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
goto send_now;
win_divisor = ACCESS_ONCE(sysctl_tcp_tso_win_divisor);
if (win_divisor) {
/* The maximum number of bytes that may be sent in one RTT. */
u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);
chunk /= win_divisor; /* the share a single TSO skb may consume */
/* If at least some fraction of a window is available, just use it. */
if (limit >= chunk)
goto send_now;
} else {
/* Different approach, try not to defer past a single ACK.
* Receiver should ACK every other full sized frame, so if we have space for
* more than 3 frames then send now.
*/
if (limit > tcp_max_burst(tp) * tp->mss_cache)
goto send_now;
}
/* OK, it looks like it is advisable to defer. */
tp->tso_deferred = 1 | (jiffies << 1); /* record the timestamp of this deferral */
return 1;
send_now:
tp->tso_deferred = 0;
return 0;
}
/* Returns the end sequence number of the receiver's advertised window. */
static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
{
/* snd_wnd is in bytes */
return tp->snd_una + tp->snd_wnd;
}
Note that the comment in tcp_is_cwnd_limited() says:
"This is the inverse of cwnd check in tcp_tso_should_defer", so tcp_tso_should_defer() can be regarded as embodying the converse condition, i.e. a check that the sender is not cwnd-limited (or is application-limited).
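For completeness, that function looks roughly like this in the same kernel era (the exact form is version-dependent):

/* RFC2861: check whether we are limited by application or congestion window.
 * This is the inverse of cwnd check in tcp_tso_should_defer().
 */
static int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	u32 left;

	if (in_flight >= tp->snd_cwnd)
		return 1;	/* cwnd fully used: cwnd-limited */

	left = tp->snd_cwnd - in_flight;
	/* With GSO, a small remainder may still count as cwnd-limited. */
	if (sk_can_gso(sk) &&
	    left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
	    left * tp->mss_cache < sk->sk_gso_max_size)
		return 1;
	return left <= tcp_max_burst(tp);
}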