struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, int features)
{
struct sk_buff *segs = ERR_PTR(-EINVAL);
unsigned int mss;
int offset;
__wsum csum;
mss = skb_shinfo(skb)->gso_size;
if (unlikely(skb->len <= mss))
goto out;
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
int type = skb_shinfo(skb)->gso_type;
if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY) ||
!(type & (SKB_GSO_UDP))))
goto out;
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
segs = NULL;
goto out;
}
/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
* do checksum of UDP packets sent as multiple IP fragments.
*/
offset = skb->csum_start - skb_headroom(skb);
csum = skb_checksum(skb, offset, skb->len - offset, 0);
offset += skb->csum_offset;
*(__sum16 *)(skb->data + offset) = csum_fold(csum);
skb->ip_summed = CHECKSUM_NONE;
/* The statements above compute the UDP checksum in software. */
/* Fragment the skb. IP headers of the fragments are updated in
* inet_gso_segment()
*/
segs = skb_segment(skb, features);
out:
return segs;
}
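The two arithmetic steps in the function above, counting the segments with DIV_ROUND_UP() and folding the 32-bit partial sum into the 16-bit UDP checksum field, can be tried in user space. This is a minimal sketch under stated assumptions: div_round_up() and fold_csum() are hypothetical stand-ins for the kernel's DIV_ROUND_UP() and csum_fold(), and byte-order handling is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's DIV_ROUND_UP() macro. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Hypothetical stand-in for csum_fold(): fold a 32-bit partial sum
 * into a 16-bit one's-complement checksum. */
static uint16_t fold_csum(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16); /* add the carries back in */
	sum = (sum & 0xffff) + (sum >> 16); /* the first add may carry again */
	return (uint16_t)~sum;
}
```

For instance, a 3000-byte payload with an mss of 1472 gives div_round_up(3000, 1472) == 3 segments.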
F-RTO (Forward RTO-Recovery) is an algorithm that lets a TCP sender recover after a retransmission timeout (RTO).
Main purpose of F-RTO: the main motivation of the algorithm is to recover efficiently from a spurious
RTO.
The basic idea of F-RTO
The guideline behind F-RTO is, that an RTO either indicates a loss, or it is caused by an
excessive delay in packet delivery while there still are outstanding segments in flight. If the
RTO was due to delay, i.e. the RTO was spurious, acknowledgements for non-retransmitted
segments sent before the RTO should arrive at the sender after the RTO occurred. If no such
segments arrive, the RTO is concluded to be non-spurious and the conventional RTO
recovery with go-back-N retransmissions should take place at the TCP sender.
To implement the principle described above, an F-RTO sender acts as follows: if the first ACK
arriving after an RTO-triggered retransmission advances the window, transmit two new segments
instead of continuing retransmissions. If also the second incoming acknowledgement advances
the window, RTO is likely to be spurious, because the second ACK is triggered by an originally
transmitted segment that has not been retransmitted after the RTO. If either one of the two
acknowledgements after RTO is a duplicate ACK, the sender continues retransmissions similarly
to the conventional RTO recovery algorithm.
When the retransmission timer expires, the F-RTO algorithm takes the following steps at the TCP
sender. In the algorithm description below we use SND.UNA to indicate the first unacknowledged
segment.
1. When the retransmission timer expires, retransmit the segment that triggered the timeout. As
required by the TCP congestion control specifications, the ssthresh is adjusted to half of the
number of currently outstanding segments. However, the congestion window is not yet set to one
segment, but the sender waits for the next two acknowledgements before deciding on what to do
with the congestion window.
2. When the first acknowledgement after RTO arrives at the sender, the sender chooses the
following actions depending on whether the ACK advances the window or whether it is a duplicate
ACK.
(a) If the acknowledgement advances SND.UNA, transmit up to two new (previously unsent)
segments. This is the main point in which the F-RTO algorithm differs from the conventional way
of recovering from RTO. After transmitting the two new segments, the congestion window size
is set to have the same value as ssthresh. In effect this reduces the transmission rate of the
sender to half of the transmission rate before the RTO. At this point the TCP sender has transmitted
a total of three segments after the RTO, similarly to the conventional recovery algorithm. If
transmitting two new segments is not possible due to advertised window limitation, or because
there is no more data to send, the sender may transmit only one segment. If no new data can
be transmitted, the TCP sender follows the conventional RTO recovery algorithm and starts
retransmitting the unacknowledged data using slow start.
(b) If the acknowledgement is a duplicate ACK, set the congestion window to one segment and
proceed with the conventional RTO recovery. Two new segments are not transmitted in this case,
because the conventional RTO recovery algorithm would not transmit anything at this point either.
Instead, the F-RTO sender continues with slow start and performs similarly to the conventional
TCP sender in retransmitting the unacknowledged segments. Step 3 of the F-RTO algorithm is
not entered in this case. A common reason for executing this branch is the loss of a segment,
in which case the segments injected by the sender before the RTO may still trigger duplicate
ACKs that arrive at the sender after the RTO.
3. When the second acknowledgement after the RTO arrives, either continue transmitting new
data, or start retransmitting with the slow start algorithm, depending on whether new data was
acknowledged.
(a) If the acknowledgement advances SND.UNA, continue transmitting new data following
the congestion avoidance algorithm. Because the TCP sender has retransmitted only one
segment after the RTO, this acknowledgement indicates that an originally transmitted
segment has arrived at the receiver. This is regarded as a strong indication of a spurious
RTO. However, since the TCP sender cannot surely know at this point whether the segment
that triggered the RTO was actually lost, adjusting the congestion control parameters after
the RTO is the conservative action. From this point on, the TCP sender continues as in the
normal congestion avoidance.
If this algorithm branch is taken, the TCP sender ignores the send_high variable that indicates
the highest sequence number transmitted so far. The send_high variable was proposed as a
bugfix for avoiding unnecessary multiple fast retransmits when RTO expires during fast recovery
with NewReno TCP. As the sender has not retransmitted other segments but the one that
triggered RTO, the problem addressed by the bugfix cannot occur. Therefore, if there are
duplicate ACKs arriving at the sender after the RTO, they are likely to indicate a packet loss,
hence fast retransmit should be used to allow efficient recovery. Alternatively, if there are not
enough duplicate ACKs arriving at the sender after a packet loss, the retransmission timer
expires another time and the sender enters step 1 of this algorithm to detect whether the
new RTO is spurious.
(b) If the acknowledgement is a duplicate ACK, set the congestion window to three segments,
continue with the slow start algorithm retransmitting unacknowledged segments. The duplicate
ACK indicates that at least one segment other than the segment that triggered RTO is lost in the
last window of data. There is no sufficient evidence that any of the segments was delayed.
Therefore the sender proceeds with retransmissions similarly to the conventional RTO recovery
algorithm, with the send_high variable stored when the retransmission timer expired to avoid
unnecessary fast retransmits.
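The three steps above can be sketched as a small state machine. This is a minimal illustration, not the kernel code: struct frto_state, frto_on_rto() and frto_on_ack() are hypothetical names, windows are counted in segments, and the floor of two segments on ssthresh is an assumption borrowed from standard TCP congestion control rather than something stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

enum frto_verdict {
	FRTO_SEND_NEW,     /* step 2(a): send up to two new segments */
	FRTO_SPURIOUS,     /* step 3(a): the RTO is declared spurious */
	FRTO_CONVENTIONAL  /* steps 2(b) and 3(b): conventional RTO recovery */
};

struct frto_state {
	int acks_seen;         /* ACKs processed since the RTO */
	unsigned int cwnd;     /* congestion window, in segments */
	unsigned int ssthresh; /* slow start threshold, in segments */
};

/* Step 1: the retransmission timer expired. ssthresh is halved,
 * but cwnd is deliberately left untouched for now. */
static void frto_on_rto(struct frto_state *s, unsigned int flight_size)
{
	unsigned int half = flight_size / 2;

	s->ssthresh = half > 2 ? half : 2; /* assumed floor of two segments */
	s->acks_seen = 0;
}

/* Steps 2 and 3: classify one of the first two ACKs after the RTO. */
static enum frto_verdict frto_on_ack(struct frto_state *s, bool advances_una)
{
	assert(s->acks_seen < 2); /* only the first two ACKs are covered */
	s->acks_seen++;

	if (!advances_una) {
		/* Duplicate ACK: one segment in step 2(b), three in step 3(b). */
		s->cwnd = (s->acks_seen == 1) ? 1 : 3;
		return FRTO_CONVENTIONAL;
	}
	if (s->acks_seen == 1) {
		s->cwnd = s->ssthresh; /* applied after the two new segments */
		return FRTO_SEND_NEW;
	}
	return FRTO_SPURIOUS; /* continue in congestion avoidance */
}
```

With ten segments outstanding at the timeout, two window-advancing ACKs in a row lead to FRTO_SPURIOUS with cwnd equal to ssthresh (5), whereas a duplicate ACK as the first ACK drops cwnd to one segment.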
Main causes of RTOs:
(1)Sudden delays
The primary motivation of the F-RTO algorithm is to improve the TCP performance when sudden
delays cause spurious retransmission timeouts.
(2)Packet losses
These timeouts occur mainly when retransmissions are lost, since lost original packets are
usually recovered by fast retransmit.
(3)Bursty losses
Losses of several successive packets can result in a retransmission timeout.
Other causes of spurious RTOs:
Wireless links may also suffer from link outages that cause persistent data loss for a period
of time.
Other potential reasons for sudden delays that have been reported to trigger spurious RTOs
include a delay due to tedious actions required to complete a hand-off or re-routing of packets
to the new serving access point after the hand-off, arrival of competing traffic on a shared link
with low bandwidth, and a sudden bandwidth degradation due to reduced resources on a
wireless channel.
Causes of genuine RTOs:
An RTO-triggered retransmission is needed when a retransmission is lost, or when nearly a whole
window of data is lost, thus making it impossible for the receiver to generate enough duplicate
ACKs for triggering TCP fast retransmit.
Consequences of a spurious RTO
If no segments were lost but the retransmission timer expires spuriously, the segments retransmitted
in the slow-start are sent unnecessarily. Particularly, this phenomenon is very possible with the
various wireless access network technologies that are prone to sudden delay spikes.
The retransmission timer expires because of the delay, spuriously triggering the RTO recovery and
unnecessary retransmission of all unacknowledged segments. This happens because after the
delay the ACKs for the original segments arrive at the sender one at a time but too late, because
the TCP sender has already entered the RTO recovery. Therefore, each of the ACKs triggers the
retransmission of segments for which the original ACKs will arrive after a while. This continues
until the whole window of segments is eventually unnecessarily retransmitted. Furthermore,
because a full window of retransmitted segments arrives unnecessarily at the receiver, it generates
duplicate ACKs for these out-of-order segments. Later on, the duplicate ACKs unnecessarily
trigger fast retransmit at the sender.
TCP uses the fast retransmit mechanism to trigger retransmissions after receiving three successive
duplicate acknowledgements (ACKs). If for a certain time period TCP sender does not receive ACKs
that acknowledge new data, the TCP retransmission timer expires as a backoff mechanism.
When the retransmission timer expires, the TCP sender retransmits the first unacknowledged
segment assuming it was lost in the network. Because a retransmission timeout (RTO) can be
an indication of severe congestion in the network, the TCP sender resets its congestion window
to one segment and starts increasing it according to the slow start algorithm.
However, if the RTO occurs spuriously and there still are segments outstanding in the network,
a false slow start is harmful for the potentially congested network as it injects extra segments
to the network at increasing rate.
What about reliable link-layer protocols?
Since wireless networks are often subject to high packet loss rate due to corruption or hand-offs,
reliable link-layer protocols are widely employed with wireless links. The link-layer receiver often
aims to deliver the packets to the upper protocol layers in order, which implies that the later
arriving packets are blocked until the head of the queue arrives successfully. Due to the strict
link-layer ordering, the communication end points observe a pause in packet delivery that can
cause a spurious TCP RTO instead of getting out-of-order packets that could result in a false
fast retransmit instead. Either way, interaction between TCP retransmission mechanisms
and link-layer recovery can cause poor performance.
DSACK cannot solve this problem
If the unnecessary retransmissions occurred due to spurious RTO caused by a sudden delay,
the acknowledgements with the DSACK information arrive at the sender only after the
acknowledgements of the original segments. Therefore, the unnecessary retransmissions
following the spurious RTO cannot be avoided by using DSACK. Instead, the suggested
recovery algorithm using DSACK can only revert the congestion control parameters to the
state preceding the spurious retransmissions.
F-RTO implementation
F-RTO is implemented (mainly) in four functions:
(1) tcp_use_frto() determines whether TCP can use F-RTO.
(2) tcp_enter_frto() prepares the TCP state on RTO if F-RTO is used. It is called when tcp_use_frto() gives the green light.
(3) tcp_process_frto() handles incoming ACKs during the F-RTO algorithm.
(4) tcp_enter_frto_loss() is called if there is not enough evidence to prove that the RTO is indeed spurious. It transfers control from F-RTO to the conventional RTO recovery.
/* F-RTO can only be used if TCP has never retransmitted anything other than
* head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
*/
int tcp_use_frto(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
struct sk_buff *skb;
if (! sysctl_tcp_frto)
return 0;
/* MTU probe and F-RTO won't really play nicely along currently */
if (icsk->icsk_mtup.probe_size)
return 0;
if (tcp_is_sackfrto(tp))
return 1;
/* Avoid expensive walking of rexmit queue if possible */
if (tp->retrans_out > 1)
return 0; /* Nothing other than the head may have been retransmitted */
skb = tcp_write_queue_head(sk);
if (tcp_skb_is_last(sk, skb))
return 1;
skb = tcp_write_queue_next(sk, skb); /* Skips head */
tcp_for_write_queue_from(skb, sk) {
if (skb == tcp_send_head(sk))
break;
if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
return 0; /* No segment other than the head may have been retransmitted */
/* Short-circuit when first non-SACKed skb has been checked */
if (! (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
break;
}
return 1;
}
static int tcp_is_sackfrto(const struct tcp_sock *tp)
{
return (sysctl_tcp_frto == 0x2) && ! tcp_is_reno(tp);
}
/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO, which
* indicates that we should follow the traditional RTO recovery, i.e. mark
* everything lost and do go-back-N retransmission.
*/
static void tcp_enter_frto_loss (struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int cnt = 0;
/* On entering the Loss state, reset the SACK, lost and FACK counters. */
tp->sacked_out = 0;
tp->lost_out = 0;
tp->fackets_out = 0;
/* Walk the retransmit queue and re-mark segments as LOST. Segments
 * transmitted after the RTO are not marked LOST.
 */
sk_stream_for_retrans_queue(skb, sk) {
cnt += tcp_skb_pcount(skb);
TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
/* Segments that have not been SACKed must be marked LOST. */
if (! (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
/* Do not mark those segments lost that were forward
 * transmitted after RTO.
 */
if (! after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark))
{
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
}
} else { /* Segments that were already SACKed are not marked LOST. */
tp->sacked_out += tcp_skb_pcount(skb);
tp->fackets_out = cnt;
}
}
tcp_sync_left_out(tp);
tp->snd_cwnd = tp->frto_counter + tcp_packets_in_flight(tp) + 1;
tp->snd_cwnd_cnt = 0;
tp->snd_cwnd_stamp = tcp_time_stamp;
tp->undo_marker = 0; /* No undo marker is needed */
tp->frto_counter = 0; /* F-RTO is finished */
/* Cap the reordering metric at sysctl_tcp_reordering */
tp->reordering = min_t(unsigned int, tp->reordering, sysctl_tcp_reordering);
tcp_set_ca_state(sk, TCP_CA_Loss); /* Enter the Loss state */
tp->high_seq = tp->frto_highmark; /* Highest sequence number at the time of the RTO */
TCP_ECN_queue_cwr(tp); /* Queue the ECN CWR flag */
clear_all_retrans_hints(tp);
}
F-RTO in kernel 3.2.12
F-RTO spurious RTO detection algorithm (RFC4138)
F-RTO is in effect during the two new ACKs following an RTO (well, almost, see inline
comments). State (ACK number) is kept in frto_counter. When an ACK advances the
window (but not to or beyond the highest sequence sent before the RTO):
On the first ACK, send two new segments out.
On the second ACK, the RTO was likely spurious. Do a spurious response (the response
algorithm is not part of the F-RTO detection algorithm given in RFC4138 but
can be selected separately).
Otherwise (basically on a duplicate ACK), the RTO was (likely) caused by a loss and
TCP falls back to conventional RTO recovery. F-RTO allows overriding of Nagle;
this is done using frto_counter states 2 and 3: when a new data segment of any
size is sent during F-RTO, state 2 is upgraded to 3.
Rationale: if the RTO was spurious, new ACKs should arrive from the original
window even after we transmit two new data segments.
SACK version:
In the first step, wait until the first cumulative ACK arrives, then move to the second
step. In the second step, the next ACK decides.
/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO,
* which indicates that we should follow the traditional RTO recovery,
* i.e. mark everything lost and do go-back-N retransmission.
*/
static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
tp->lost_out = 0;
tp->retrans_out = 0;
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
tcp_for_write_queue(skb, sk) {
if (skb == tcp_send_head(sk))
break;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
/*
 * Count the retransmission made on RTO correctly (only when waiting for
 * the first ACK and did not get it).
 */
if ((tp->frto_counter == 1) && !(flag & FLAG_DATA_ACKED)) {
/* For some reason this R-bit might get cleared ? */
if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)
tp->retrans_out += tcp_skb_pcount(skb);
/* enter this if branch just for the first segment */
flag |= FLAG_DATA_ACKED;
} else {
if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
tp->undo_marker = 0;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
}
/* Marking forward transmissions that were made after RTO lost can
* cause unnecessary retransmissions in some scenarios,
* SACK blocks will mitigate that in some but not in all cases.
* We used to not mark them but it was causing break-ups with
* receivers that do only in-order receival.
*
* TODO: we could detect presence of such receiver and select different
* behavior per flow.
*/
if (! (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
}
}
tcp_verify_left_out(tp);
/* allowed_segments should be no greater than 3 */
tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments;
tp->snd_cwnd_cnt = 0;
tp->snd_cwnd_stamp = tcp_time_stamp;
tp->frto_counter = 0; /* F-RTO is finished */
tp->bytes_acked = 0;
/* Cap the reordering metric at sysctl_tcp_reordering */
tp->reordering = min_t(unsigned int, tp->reordering,
sysctl_tcp_reordering);
tcp_set_ca_state(sk, TCP_CA_Loss); /* Enter the Loss state */
tp->high_seq = tp->snd_nxt;
TCP_ECN_queue_cwr(tp); /* Queue the ECN CWR flag */
tcp_clear_all_retrans_hints(tp);
}
(1) Open: Normal state, no dubious events, fast path.
(2) Disorder: In all respects it is Open, but requires a bit more attention.
It is entered when we see some SACKs or dupacks. It is split off from Open mainly to move some processing from the fast path to the slow one.
(3) CWR: cwnd was reduced due to some Congestion Notification event.
It can be ECN, ICMP source quench, local device congestion.
(4) Recovery: cwnd was reduced, we are fast-retransmitting.
(5) Loss: cwnd was reduced due to RTO timeout or SACK reneging.
tcp_fastretrans_alert() is entered:
(1)each incoming ACK, if state is not Open
(2)when arrived ACK is unusual, namely:
SACK
Duplicate ACK
ECN ECE
Counting packets in flight is pretty simple.
(1)in_flight = packets_out - left_out + retrans_out
packets_out is SND.NXT - SND.UNA counted in packets.
retrans_out is number of retransmitted segments.
left_out is number of segments left network, but not ACKed yet.
(2)left_out = sacked_out + lost_out
sacked_out:Packets, which arrived to receiver out of order and hence not ACKed. With SACK this number is simply amount of SACKed data. Even without SACKs it is easy to give pretty reliable estimate of this number, counting duplicate ACKs.
(3)lost_out:Packets lost by network. TCP has no explicit loss notification feedback from network(for now). It means that this number can be only guessed. Actually, it is the heuristics to predict lossage that distinguishes different algorithms.
For example, after an RTO, when the whole queue is considered lost, lost_out = packets_out and in_flight = retrans_out.
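The accounting above is easy to check with a few numbers. The following is a minimal sketch: struct tcp_counters and both helper names are hypothetical, and only the two formulas are taken from the text.

```c
#include <assert.h>

/* Hypothetical container for the four counters named in the text. */
struct tcp_counters {
	unsigned int packets_out; /* SND.NXT - SND.UNA, in packets */
	unsigned int sacked_out;  /* received out of order, not yet ACKed */
	unsigned int lost_out;    /* presumed lost by the network */
	unsigned int retrans_out; /* retransmitted segments */
};

/* left_out = sacked_out + lost_out */
static unsigned int left_out(const struct tcp_counters *c)
{
	return c->sacked_out + c->lost_out;
}

/* in_flight = packets_out - left_out + retrans_out */
static unsigned int packets_in_flight(const struct tcp_counters *c)
{
	assert(left_out(c) <= c->packets_out); /* consistency check */
	return c->packets_out - left_out(c) + c->retrans_out;
}
```

With packets_out = 10, sacked_out = 3, lost_out = 2 and retrans_out = 1 the result is 6. After an RTO, with lost_out = packets_out = 10 and retrans_out = 1, only the single retransmission is counted as in flight.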
Essentially, we have now two algorithms counting lost packets.
1) FACK: It is the simplest heuristic. As soon as we decide that something is lost, we decide that all packets not SACKed up to the most forward SACK are lost. I.e.
lost_out = fackets_out - sacked_out and left_out = fackets_out
It is an absolutely correct estimate if the network does not reorder packets, and it loses any connection to reality when reordering takes place. We use FACK by default until reordering is suspected on the path to this destination.
2) NewReno: when Recovery is entered, we assume that one segment is lost (classic Reno). While we are in Recovery and a partial ACK arrives, we assume that one more packet is lost (NewReno).
These heuristics are the same in NewReno and SACK.
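The two loss estimators can be put side by side. This is a minimal sketch with hypothetical function names: the FACK formula comes straight from the text, while the NewReno helper simply counts how many segments have been presumed lost so far (one on entering Recovery plus one per partial ACK), which simplifies the bookkeeping the kernel really does.

```c
#include <assert.h>

/* FACK: everything not SACKed below the most forward SACK is lost. */
static unsigned int fack_lost_out(unsigned int fackets_out,
				  unsigned int sacked_out)
{
	assert(fackets_out >= sacked_out);
	return fackets_out - sacked_out;
}

/* NewReno: one segment presumed lost on entering Recovery,
 * plus one more for every partial ACK seen since then. */
static unsigned int newreno_presumed_lost(unsigned int partial_acks)
{
	return 1 + partial_acks;
}
```

With fackets_out = 8 and sacked_out = 3, FACK marks five segments as lost at once, whereas NewReno would need four partial ACKs to reach the same count.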
Imagine, that’s all! Forget about all this shamanism about CWND inflation deflation etc. CWND is real congestion window, never inflated, changes only according to classic VJ rules.
Really tricky (and requiring careful tuning) part of algorithm is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
tcp_time_to_recover()
It determines the moment when we should reduce cwnd and, hence, slow down forward transmission. In fact, it determines the moment when we decide that hole is caused by loss, rather than by a reorder.
tcp_xmit_retransmit_queue()
It decides what we should retransmit to fill holes, caused by lost packets.
undo heuristics
The most logically complicated part of the algorithm is the undo heuristics. We detect false retransmits due to both too early fast retransmit (reordering) and underestimated RTO, analyzing timestamps and D-SACKs. When we detect that some segments were retransmitted by mistake and the CWND reduction was wrong, we undo the window reduction and abort the recovery phase. This logic is hidden inside several functions whose names start with tcp_try_undo_.
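This function proceeds in several stages:
A. FLAG_ECE: an ACK carrying the ECE flag was received.
B. Reneging SACKs: the ACK points at data that had already been SACKed. In that case, do timeout-style processing and return.
C. State is not Open: a loss was detected, so the lost segments must be marked in order to know which ones to retransmit.
D. Check for inconsistencies (left_out > packets_out).
E. How each state is exited, once snd_una >= high_seq.
F. Per-state processing and state entry.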
/* Process an event, which can update packets-in-flight not trivially.
* Main goal of this function is to calculate new estimate for left_out,
* taking into account both packets sitting in receiver's buffer and
* packets lost by network.
*
* Besides that it does CWND reduction, when packet loss is detected
* and changes state of machine.
*
* It does not decide what to send, it is made in function
* tcp_xmit_retransmit_queue().
*/
/* This function is called on:
* (1) each incoming ACK, if state is not Open
* (2) when arrived ACK is unusual, namely:
* SACK
* Duplicate ACK
* ECN ECE
*/
static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
/* Is this a duplicate ACK? */
int is_dupack = ! (flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
/* tcp_fackets_out() returns the size of the hole; if it exceeds reordering, a loss is assumed. */
int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
(tcp_fackets_out(tp) > tp->reordering ));
int fast_rexmit = 0, mib_idx;
/* If packets_out is 0, there cannot be any sacked_out */
if (WARN_ON(!tp->packets_out && tp->sacked_out))
tp->sacked_out = 0;
/* The FACK count must rest on at least one SACKed segment. */
if (WARN_ON(!tp->sacked_out && tp->fackets_out))
tp->fackets_out = 0;
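/* Now state machine starts.
 * A. ECE, hence prohibit cwnd undoing, the reduction is required.
 * Undoing the congestion window is prohibited and the cwnd reduction begins.
 */
if (flag & FLAG_ECE)
tp->prior_ssthresh = 0;
/* B. In all the states check for reneging SACKs.
 * Check for a reneged SACK, i.e. whether the ACK points at data that was already SACKed.
 */
if (tcp_check_sack_reneging(sk, flag))
return;
/* C. Process data loss notification, provided it is valid.
 * It is not obvious to the author why so many conditions are needed.
 * At this point the state is not Open and a loss has been detected, so the lost segments must be marked.
 */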
if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOSS) &&
before(tp->snd_una, tp->high_seq) &&
icsk->icsk_ca_state != TCP_CA_Open &&
tp->fackets_out > tp->reordering) {
tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
}
/* D. Check consistency of the current state.
 * Make sure that left_out <= packets_out.
 */
tcp_verify_left_out(tp);
/* E. Check state exit conditions. State can be terminated
 * when high_seq is ACKed. */
if (icsk->icsk_ca_state == TCP_CA_Open) {
/* In the Open state there can be no retransmitted, still unacknowledged segments */
WARN_ON(tp->retrans_out != 0);
/* Clear the send time of the first retransmitted segment of the previous retransmission phase */
tp->retrans_stamp = 0;
} else if (!before(tp->snd_una, tp->high_seq)) { /* high_seq has been ACKed */
switch(icsk->icsk_ca_state) {
case TCP_CA_Loss:
icsk->icsk_retransmits = 0; /* Reset the RTO retransmission count */
/* Whether or not the undo succeeds, we go back to Open, unless SACK is not in use */
if (tcp_try_undo_recovery(sk))
return;
break;
case TCP_CA_CWR:
/* CWR is to be held until something *above* high_seq is ACKed
 * for CWR bit to reach receiver.
 * snd_una must be strictly above high_seq before CWR is left.
 */
if (tp->snd_una != tp->high_seq) {
tcp_complete_cwr(sk);
tcp_set_ca_state(sk, TCP_CA_Open);
}
break;
case TCP_CA_Disorder:
tcp_try_undo_dsack(sk);
/* For SACK case do not Open to allow to undo
 * catching for all duplicate ACKs. */
if (!tp->undo_marker || tcp_is_reno(tp) ||
tp->snd_una != tp->high_seq) {
tp->undo_marker = 0;
tcp_set_ca_state(sk, TCP_CA_Open);
}
break;
case TCP_CA_Recovery:
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp); /* Zero sacked_out */
if (tcp_try_undo_recovery(sk))
return;
tcp_complete_cwr(sk);
break;
}
}
/* F. Process state. */
switch(icsk->icsk_ca_state) {
case TCP_CA_Recovery:
if (!(flag & FLAG_SND_UNA_ADVANCED)) {
if (tcp_is_reno(tp) && is_dupack)
tcp_add_reno_sack(sk); /* Increment sacked_out and check for reordering */
} else
do_lost = tcp_try_undo_partial(sk, pkts_acked);
break;
case TCP_CA_Loss:
/* A partial ACK arrived: reset the RTO retransmission count */
if (flag & FLAG_DATA_ACKED)
icsk->icsk_retransmits = 0;
if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
tcp_reset_reno_sack(tp); /* Zero sacked_out */
if (!tcp_try_undo_loss(sk)) { /* Try to undo the congestion adjustment and return to Open */
/* If the undo is not possible, keep retransmitting the segments marked lost */
tcp_moderate_cwnd(tp);
tcp_xmit_retransmit_queue(sk); /* To be examined later */
return;
}
if (icsk->icsk_ca_state != TCP_CA_Open)
return;
/* Loss is undone; fall through to process in Open state.*/
default:
if (tcp_is_reno(tp)) {
if (flag & FLAG_SND_UNA_ADVANCED)
tcp_reset_reno_sack(tp);
if (is_dupack)
tcp_add_reno_sack(sk);
}
if (icsk->icsk_ca_state == TCP_CA_Disorder)
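tcp_try_undo_dsack(sk); /* D-SACK has acknowledged all retransmitted segments */
/* Decide whether to enter the Recovery state */
if (! tcp_time_to_recover(sk)) {
/* This path decides whether to move to the Open, Disorder or CWR state */
tcp_try_to_open(sk, flag);
return;
}
/* MTU probe failure: don't reduce cwnd */
/* The MTU probing part is omitted here. */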
......
/* Otherwise enter Recovery state */
if (tcp_is_reno(tp))
mib_idx = LINUX_MIB_TCPRENORECOVERY;
else
mib_idx = LINUX_MIB_TCPSACKRECOVERY;
NET_INC_STATS_BH(sock_net(sk), mib_idx);
/* Before entering Recovery, save the data needed for a later undo */
tp->high_seq = tp->snd_nxt; /* Used to decide when to exit */
tp->prior_ssthresh = 0;
tp->undo_marker = tp->snd_una;
tp->undo_retrans = tp->retrans_out;
if (icsk->icsk_ca_state < TCP_CA_CWR) {
if (! (flag & FLAG_ECE))
tp->prior_ssthresh = tcp_current_ssthresh(sk); /* Save the old threshold */
tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* Update the threshold */
TCP_ECN_queue_cwr(tp);
}
tp->bytes_acked = 0;
tp->snd_cwnd_cnt = 0;
tcp_set_ca_state(sk, TCP_CA_Recovery); /* Enter the Recovery state */
fast_rexmit = 1; /* Fast retransmit flag */
}
if (do_lost || (tcp_is_fack(tp) && tcp_head_timeout(sk)))
/* Update the scoreboard: mark lost and timed-out segments, increasing lost_out */
tcp_update_scoreboard(sk, fast_rexmit);
/* Reduce snd_cwnd */
tcp_cwnd_down(sk, flag);
tcp_xmit_retransmit_queue(sk);
}
/* If ACK arrived pointing to a remembered SACK, it means that our remembered
* SACKs do not reflect real state of receiver i.e. receiver host is heavily congested
* or buggy.
*
* Do processing similar to RTO timeout.
*/
static int tcp_check_sack_reneging (struct sock *sk, int flag)
{
if (flag & FLAG_SACK_RENEGING) {
struct inet_connection_sock *icsk = inet_csk(sk);
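/* Record MIB statistics for SNMP */
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
/* Enter the Loss state; 1 means clear the SACKED flags */
tcp_enter_loss(sk, 1); /* This function was analyzed in an earlier post */
icsk->icsk_retransmits++; /* One more unrecovered RTO */
/* Retransmit the first segment in the send queue */
tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
/* Re-arm the retransmission timer */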
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
icsk->icsk_rto, TCP_RTO_MAX);
return 1;
}
return 0;
}
/** Returns the first packet in the send queue, or NULL.
* skb_peek - peek at the head of an &sk_buff_head
* @list_ : list to peek at
*
* Peek an &sk_buff. Unlike most other operations you must
* be careful with this one. A peek leaves the buffer on the
* list and someone else may run off with it. You must hold
* the appropriate locks or have a private queue to do this.
*
* Returns %NULL for an empty list or a pointer to the head element.
* The reference count is not incremented and the reference is therefore
* volatile. Use with caution.
*/
static inline struct sk_buff *skb_peek (const struct sk_buff_head *list_)
{
struct sk_buff *list = ((const struct sk_buff *) list_)->next;
if (list == (struct sk_buff *) list_)
list = NULL;
return list;
}
static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)
{
return skb_peek(&sk->sk_write_queue);
}
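skb_peek() relies on the queue being a circular doubly linked list whose head acts as a sentinel: an empty list is one whose head points back at itself. The sketch below shows the same idea in portable user-space C. struct node, list_init(), list_add_tail() and list_peek() are hypothetical names, and unlike the kernel the sentinel here is an ordinary node rather than a separate head structure cast to the element type.

```c
#include <stddef.h>

struct node {
	struct node *next;
	struct node *prev;
	int value;
};

/* An empty list is a sentinel that points at itself. */
static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

/* Link n in just before the sentinel, i.e. at the tail. */
static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Return the first element without unlinking it, or NULL if empty. */
static struct node *list_peek(const struct node *head)
{
	struct node *first = head->next;

	return first == head ? NULL : first;
}
```

As with skb_peek(), the returned element stays on the list, so the caller must hold whatever lock protects the list while using it.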
/* Function to create two new TCP segments. Shrinks the given segment
* to the specified size and appends a new segment with the rest of the
* packet to the list. This won't be called frequently, I hope.
* Remember, these are still headerless SKBs at this point.
*/
int tcp_fragment (struct sock *sk, struct sk_buff *skb, u32 len,
unsigned int mss_now) {}
Mark a segment with the LOST flag:
static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
{
if (! (TCP_SKB_CB(skb)->sacked & (TCPCB_LOST | TCPCB_SACKED_ACKED))) {
tcp_verify_retransmit_hint(tp, skb); /* Update the retransmit hints */
tp->lost_out += tcp_skb_pcount(skb); /* Increase the count of LOST segments */
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; /* Set the LOST flag */
}
}
/* This must be called before lost_out is incremented */
static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
{
if ((tp->retransmit_skb_hint == NULL) ||
before(TCP_SKB_CB(skb)->seq,
TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
tp->retransmit_skb_hint = skb;
if (! tp->lost_out ||
after(TCP_SKB_CB(skb)->end_seq, tp->retransmit_high))
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
}
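The before() and after() helpers used throughout these functions compare 32-bit sequence numbers modulo 2^32, so the comparison stays correct when the sequence space wraps. A minimal user-space sketch follows. seq_before(), seq_after() and update_retransmit_high() are hypothetical names, and the last helper only mirrors the second half of tcp_verify_retransmit_hint() on plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* a is before b if the signed 32-bit difference is negative.
 * Assumes the usual two's-complement conversion to int32_t. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* Track the highest end_seq among segments marked lost. */
static uint32_t update_retransmit_high(uint32_t high, uint32_t end_seq,
				       unsigned int lost_out)
{
	if (lost_out == 0 || seq_after(end_seq, high))
		return end_seq;
	return high;
}
```

Note that 0x10 counts as being after 0xfffffff0 here, which is exactly the wrap-around case a plain `>` comparison would get wrong.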
If seq number greater than high_seq is acked, it indicates that the CWR indication has reached the peer TCP, call tcp_complete_cwr() to bring down the cwnd to ssthresh value.
/* This gets called after a retransmit timeout, and the initially retransmitted data is
* acknowledged. It tries to continue resending the rest of the retransmit queue, until
* either we've sent it all or the congestion window limit is reached. If doing SACK,
* the first ACK which comes back for a timeout based retransmit packet might feed us
* FACK information again. If so, we use it to avoid unnecessary retransmissions.
*/
void tcp_xmit_retransmit_queue (struct sock *sk) {}
/* This function decides, when we should leave Disordered state and enter Recovery
* phase, reducing congestion window.
* In other words: when to move from the Disorder state to the Recovery state.
*/
static int tcp_time_to_recover(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
__u32 packets_out;
/* Do not perform any recovery during F-RTO algorithm.
 * This shows that Recovery cannot interrupt the Loss state.
 */
if (tp->frto_counter)
return 0;
/* Trick#1: The loss is proven.
 * If segments are known to be lost in transit, Recovery can be entered.
 */
if (tp->lost_out)
return 1;
/* Not-A-Trick#2 : Classic rule...
 * If the number of duplicate ACKs received exceeds the reordering threshold,
 * a segment has been lost and Recovery can be entered.
 */
if (tcp_dupack_heuristics(tp) > tp->reordering)
return 1;
/* Trick#3 : when we use RFC2988 timer restart, fast
 * retransmit can be triggered by timeout of queue head.
 * If the first segment in the send queue has timed out, enter Recovery.
 */
if (tcp_is_fack(tp) && tcp_head_timeout(sk))
return 1;
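/* Trick#4 : It is still not OK... But will it be useful to delay recovery more?
 * If the sender currently cannot transmit because it is limited by the application or
 * by the receive window, and many duplicate ACKs have arrived, there is no point in
 * waiting longer: assume a loss and enter Recovery immediately.
 */
packets_out = tp->packets_out;
if (packets_out <= tp->reordering &&
tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering)
&& ! tcp_may_send_now(sk) ) {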
/* We have nothing to send. This connection is limited
* either by receiver window or by application.
*/
return 1;
}
/* If a thin stream is detected, retransmit after first received
* dupack. Employ only if SACK is supported in order to avoid
* possible corner-case series of spurious retransmissions
* Use only if there are no unsent data.
*/
if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
tcp_is_sack(tp) && ! tcp_send_head(sk))
return 1;
return 0; /* false: not yet time to recover */
}
/* Heuristics to calculate number of duplicate ACKs. There's no
 * dupACKs counter when SACK is enabled (without SACK, sacked_out
 * is used for that purpose).
 * Instead, with FACK TCP uses fackets_out that includes both SACKed
 * segments up to the highest received SACK block so far and holes in
 * between them.
 *
 * With reordering, holes may still be in flight, so RFC3517 recovery uses
 * pure sacked_out (total number of SACKed segment) even though it
 * violates the RFC that uses duplicate ACKs, often these are equal but
 * when e.g. out-of-window ACKs or packet duplication occurs, they differ.
 * Since neither occurs due to loss, TCP should really ignore them.
*/
static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
{
return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
}
/* Determines whether this is a thin stream (which may suffer from increased
* latency). Used to trigger latency-reducing mechanisms.
*/
static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
{
return tp->packets_out < 4 && ! tcp_in_initial_slowstart(tp);
}
#define TCP_INFINITE_SSTHRESH 0x7fffffff
static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
{
return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
}
This function examines various parameters (such as the number of lost packets) of a TCP connection to decide whether it is the right time to move to the Recovery state. It is time to recover when the TCP heuristics suggest a strong possibility of packet loss in the network. The following checks are made.
In short, once a loss is certain or very likely, TCP can enter the Recovery state to repair it.
The conditions for entering the Recovery state are:
(1) Some packets are lost (lost_out is non-zero), i.e. a loss has been detected.
(2) SACK is an acknowledgement for out-of-order packets. If the number of SACKed packets is greater than the
reordering metric of the network, then a loss is assumed to have happened.
That is, the amount of FACKed data, or the number of duplicate ACKs received, exceeds the reordering threshold, so a loss is very likely.
(3) If the first packet waiting to be acked (head of the write queue) has waited for a time equivalent to the retransmission
timeout, the packet is assumed to have been lost. The first segment in the send queue has timed out, so it has probably been lost.
(4) If the following three conditions are true, the TCP sender is in a state where no more data can be transmitted
and the number of packets acked is big enough to assume that the rest of the packets are lost in the network:
A: The number of packets in flight is no greater than the reordering metric.
B: More than half of the packets in flight have been SACKed by the receiver, or the number of SACKed packets is more
than the Fast Retransmit threshold. (The Fast Retransmit threshold is the number of dupacks that the sender awaits before
fast retransmission.)
C: The sender cannot send any more packets because either it is bound by the sliding window or the application
has not delivered any more data to it in anticipation of an ACK for already provided data.
We have received many duplicate ACKs, so a segment has very likely been lost. If at this point no data can be sent because of the receive window or the application, we do not wait any longer and enter the Recovery state directly.
(5) When the stream is detected to be thin (packets_out < 4) and the following conditions also hold:
A: tp->thin_dupack == 1 /* Fast retransmit on first dupack */,
or sysctl_tcp_thin_dupack is 1, which allows retransmitting on the first duplicate ACK.
B: SACK is enabled and the dupack estimate (FACKed or SACKed segments) is greater than 1.
C: There is no unsent data, i.e. tcp_send_head(sk) == NULL.
This is a special case that applies only when traffic is very light.
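Conditions (1) to (4) can be exercised outside the kernel with a plain struct. This is a minimal sketch under stated assumptions: struct loss_view and its field names are hypothetical, the two booleans head_timed_out and can_send_now stand in for tcp_head_timeout() and tcp_may_send_now(), and the thin-stream case (5) is left out.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the fields the decision depends on. */
struct loss_view {
	bool frto_active;               /* F-RTO is in progress */
	bool fack;                      /* FACK is in use */
	unsigned int packets_out;
	unsigned int sacked_out;
	unsigned int fackets_out;
	unsigned int lost_out;
	unsigned int reordering;        /* per-connection reordering metric */
	unsigned int sysctl_reordering; /* system-wide default */
	bool head_timed_out;            /* head of the queue has timed out */
	bool can_send_now;              /* not limited by window or application */
};

static unsigned int dupack_estimate(const struct loss_view *v)
{
	return v->fack ? v->fackets_out : v->sacked_out + 1;
}

static bool time_to_recover(const struct loss_view *v)
{
	unsigned int half = v->packets_out / 2;
	unsigned int thresh = half > v->sysctl_reordering ?
			      half : v->sysctl_reordering;

	if (v->frto_active)
		return false;                /* no recovery during F-RTO */
	if (v->lost_out)
		return true;                 /* (1) the loss is proven */
	if (dupack_estimate(v) > v->reordering)
		return true;                 /* (2) classic dupack rule */
	if (v->fack && v->head_timed_out)
		return true;                 /* (3) head of queue timed out */
	if (v->packets_out <= v->reordering &&
	    v->sacked_out >= thresh && !v->can_send_now)
		return true;                 /* (4) nothing more to send */
	return false;
}
```

A connection with reordering = 5, packets_out = 4 and sacked_out = 3 does not satisfy rule (2), yet it enters Recovery through rule (4) as soon as it can no longer send new data.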
In the Disorder state TCP is still unsure whether the loss is genuine. After receiving ACKs that carry SACK blocks there may be a clearing ACK that acknowledges many packets unambiguously in one go. Such a clearing ACK may cause a packet burst in the network; to avoid this, cwnd is reduced so that no more than max_burst (usually 3) packets can be sent.
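The burst limit described here amounts to clamping cwnd to the packets already in flight plus max_burst. A minimal sketch, assuming plain integers in place of the socket state; moderate_cwnd() is a hypothetical name modelled on what tcp_moderate_cwnd() is described as doing.

```c
/* Clamp cwnd so that at most max_burst new packets can leave at once. */
static unsigned int moderate_cwnd(unsigned int cwnd,
				  unsigned int in_flight,
				  unsigned int max_burst)
{
	unsigned int limit = in_flight + max_burst;

	return cwnd < limit ? cwnd : limit;
}
```

With cwnd = 20, four packets in flight and max_burst = 3, the window is cut to 7, so a clearing ACK can release at most three new packets.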