orange723

#tcp - orange723

4 posts

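The Life Cycle of a Packet

The experiment procedure comes from: 01|一个数据包的网络之旅:网络是如何工作的? (A Packet's Journey Through the Network: How Does the Network Work?)

This article is also worth reading: life-of-a-packet-in-the-linux-kernel

Observing a packet's journey through an HTTP request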

$ sudo tcpdump -s0 -X -nn "tcp port 80" -w packet.pcap --print

packet.pcap

$ curl -o /dev/null -v http://example.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host example.com:80 was resolved.
* IPv6: 2600:1406:5e00:6::17ce:bc1b, 2600:1408:ec00:36::1736:7f24, 2600:1406:bc00:53::b81e:94ce, 2600:1408:ec00:36::1736:7f31, 2600:1406:5e00:6::17ce:bc12, 2600:1406:bc00:53::b81e:94c8
* IPv4: 23.215.0.136, 23.192.228.80, 23.220.75.232, 23.220.75.245, 23.192.228.84, 23.215.0.138
*   Trying 23.215.0.136:80...
* Connected to example.com (23.215.0.136) port 80
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.5.0
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Content-Type: text/html
< ETag: "bc2473a18e003bdb249eba5ce893033f:1760028122.592274"
< Last-Modified: Thu, 09 Oct 2025 16:42:02 GMT
< Cache-Control: max-age=86000
< Date: Fri, 28 Nov 2025 08:31:14 GMT
< Content-Length: 513
< Connection: keep-alive
<
{ [513 bytes data]
100   513  100   513    0     0    418      0  0:00:01  0:00:01 --:--:--   418
* Connection #0 to host example.com left intact

DNS resolution happens first; once the IP is known, curl opens a TCP connection to 23.215.0.136:80

* IPv6: 2600:1406:5e00:6::17ce:bc1b, 2600:1408:ec00:36::1736:7f24, 2600:1406:bc00:53::b81e:94ce, 2600:1408:ec00:36::1736:7f31, 2600:1406:5e00:6::17ce:bc12, 2600:1406:bc00:53::b81e:94c8
* IPv4: 23.215.0.136, 23.192.228.80, 23.220.75.232, 23.220.75.245, 23.192.228.84, 23.215.0.138
*   Trying 23.215.0.136:80...

Once the connection succeeds, the GET / request is sent

> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.5.0
> Accept: */*
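As a rough illustration of the request curl prints above, here is a minimal Python sketch that builds the same kind of GET request by hand and sends it over a plain TCP socket. The names `build_get_request` and `fetch` are made up for this example, the `Connection: close` header is an addition that curl did not send, and `fetch` assumes plain HTTP on port 80 with working network access.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request similar to the one curl sends."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Accept: */*",
        "Connection: close",  # assumption: ask the server to close so recv() ends
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, port: int = 80, timeout: float = 5.0) -> bytes:
    """Resolve the host, open a TCP connection, send the request, read the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed its side
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Only prints the request; call fetch("example.com") to actually send it.
    print(build_get_request("example.com").decode("ascii"))
```

`socket.create_connection` does the DNS lookup and the TCP handshake in one call, which matches the order of events in the curl output.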

example.com replies with HTTP status code 200. The capture then shows the local machine, 192.168.139.111, initiating the TCP connection close

< HTTP/1.1 200 OK
< Content-Type: text/html
< ETag: "bc2473a18e003bdb249eba5ce893033f:1760028122.592274"
< Last-Modified: Thu, 09 Oct 2025 16:42:02 GMT
< Cache-Control: max-age=86000
< Date: Fri, 28 Nov 2025 08:31:14 GMT
< Content-Length: 513
< Connection: keep-alive

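Network layering

Image from: 网络架构实战课 (Network Architecture in Practice)

Crossing the client's LAN

One-sentence summary: within the same LAN, look up the MAC via ARP and send directly; across LANs, send via routing

Let's work out whether my IP and example.com's IP are on the same LAN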

$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.139.111  netmask 255.255.255.0  broadcast 192.168.139.255
        inet6 fd07:b51a:cc66:0:a0db:deff:fea3:9cb5  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::a0db:deff:fea3:9cb5  prefixlen 64  scopeid 0x20<link>
        ether a2:db:de:a3:9c:b5  txqueuelen 1000  (Ethernet)
        RX packets 46016  bytes 17701736 (17.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 330  bytes 29500 (29.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        
# 128 64 32 16 8 4 2 1
# Apply the local subnet mask to both the destination IP and the local IP; if the network bits match, they are on the same network
        
Local host: 192.168.139.111  Subnet mask: 255.255.255.0

IP:           11000000.10101000.10001011.01101111
Subnet mask:  11111111.11111111.11111111.00000000
Bitwise AND:  11000000.10101000.10001011.00000000
Network:      192.168.139.0

example.com: 23.215.0.136

IP:           00010111.11010111.00000000.10001000
Subnet mask:  11111111.11111111.11111111.00000000
Bitwise AND:  00010111.11010111.00000000.00000000
Network:      23.215.0.0

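Since they are not on the same network, the packet must follow the routing rules. The output shows it goes via the gateway 192.168.139.1 on device eth0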

$ ip route get 23.215.0.136
23.215.0.136 via 192.168.139.1 dev eth0 src 192.168.139.111 uid 501
    cache
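The same-network check worked through by hand above can be sketched in Python with the standard `ipaddress` module. The helper names `network_of` and `next_hop` are hypothetical, and the sketch only models the simple "direct or via default gateway" decision, not a full routing table.

```python
import ipaddress


def network_of(ip: str, netmask: str) -> ipaddress.IPv4Address:
    """Bitwise-AND the address with the mask to get the network address."""
    ip_int = int(ipaddress.IPv4Address(ip))
    mask_int = int(ipaddress.IPv4Address(netmask))
    return ipaddress.IPv4Address(ip_int & mask_int)


def next_hop(local_ip: str, netmask: str, dst_ip: str, gateway: str) -> str:
    """Return the address whose MAC has to be resolved via ARP."""
    if network_of(local_ip, netmask) == network_of(dst_ip, netmask):
        return dst_ip   # same LAN: ARP for the destination itself
    return gateway      # different network: ARP for the gateway


print(network_of("192.168.139.111", "255.255.255.0"))  # 192.168.139.0
print(network_of("23.215.0.136", "255.255.255.0"))     # 23.215.0.0
print(next_hop("192.168.139.111", "255.255.255.0",
               "23.215.0.136", "192.168.139.1"))       # 192.168.139.1
```

The last line agrees with the `ip route get` output: traffic to 23.215.0.136 leaves via 192.168.139.1.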

The gateway must be on the same network as the host, so let's watch how ARP works

$ sudo arp -d 192.168.139.1

$ sudo tcpdump -s0 -X -nn "arp" -w arp.pcap --print

arp.pcap

$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.139.1            ether   da:9b:d0:54:e0:02   C                     eth0

traceroute shows the same thing; the resolved IP is different, but that does not matter

$ sudo traceroute -n -I example.com
traceroute to example.com (23.220.75.232), 30 hops max, 60 byte packets
 1  192.168.139.1  0.036 ms  0.014 ms  0.005 ms
 2  192.168.1.1  4.661 ms  4.647 ms  4.642 ms
 3  * 100.101.0.1  12.028 ms *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  23.220.75.232  186.684 ms  186.577 ms  260.319 ms

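NextTrace is recommended

When tested from a server, the intermediate devices do respond to ICMP, so the devices on the path from my home network are probably blocking ICMP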

$ traceroute -I -n -m 50 example.com
traceroute to example.com (23.220.75.245), 50 hops max, 60 byte packets
 1  10.59.252.86  1.378 ms  1.446 ms  1.442 ms
 2  11.73.60.253  1.937 ms * *
 3  26.25.187.33  1.519 ms  1.529 ms  1.630 ms
 4  10.216.220.118  3.104 ms  3.179 ms  3.160 ms
 5  10.216.229.106  3.177 ms  3.178 ms  3.232 ms
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  219.158.5.174  178.130 ms  178.125 ms  178.145 ms
13  * * *
14  154.54.77.53  162.103 ms  162.088 ms  162.185 ms
15  154.54.63.70  157.864 ms  157.916 ms  157.911 ms
16  154.54.47.165  238.095 ms  238.125 ms  243.526 ms
17  154.54.169.178  260.650 ms  260.635 ms  260.638 ms
18  154.54.29.134  249.462 ms  249.394 ms  248.589 ms
19  154.54.40.249  249.505 ms  249.495 ms *
20  154.54.165.26  247.671 ms  249.877 ms  250.234 ms
21  154.54.166.58  251.909 ms  252.616 ms  252.628 ms
22  154.54.44.86  254.282 ms  254.289 ms  254.572 ms
23  154.54.27.118  250.352 ms  252.829 ms  252.919 ms
24  38.104.84.101  236.554 ms  236.499 ms  236.548 ms
25  218.30.54.6  242.952 ms  242.911 ms  242.916 ms
26  * * *
27  * * *
28  * * *
29  23.220.75.245  239.147 ms  236.590 ms  239.235 ms

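Recommended case study

I analyzed the problem described in this article, which is quite interesting: 0.01% 的概率超时问题 (A Timeout That Occurs With 0.01% Probability)

My answer was:

There is one more difference between the two captures.

Normal: the server sends the client a ZeroWindow and then a Window Update. The server is slow to process, but the pacing is controlled by the server.

Timed out: there are no window update packets; everything is sent from the client to the server, and packets 2136 through 2149 show 15 retries.

If the server were simply slow, why is there only one timeout with everything after it normal? And if an intermediate device were at fault, why is the timeout probability only 0.01%?

The author replied:

Zero window is actually a good sign here.

The processing path for incoming data is:

NIC -> Kernel process -> tcp connection buffer -> application reads

Normal:

It is precisely because the kernel processes packets fast enough that the buffer can fill up. The application does not read fast enough, so the buffer fills, and the receiver sends a zero window to make the sender pause.

Timed out:

Because the kernel's processing bandwidth dropped (LRO was not enabled), the buffer could not be filled, so no zero window appeared. At the same time, since the NIC receives packets quickly, the kernel most likely could not keep up, which caused packet loss.

After reading this analysis I compared the two captures again, and it really did look as if NIC offload had been turned off. I had run into offload in an earlier test, described in TCP 数据的发送和接收 (Sending and Receiving TCP Data). The captures also showed that all the data was sent wrapped in a layer of VXLAN encapsulation, so my analysis concluded that the server lacked processing capacity. After the author's reply I thought again: why would the server lack capacity? In fact it did not, because everything was normal before the device was replaced. The NIC was receiving packets quickly while the kernel processed them slowly, which only made it look as if the server lacked capacity.

Drawing a conclusion directly from the zero window symptom was too hasty; it was not the root cause.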


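Sending and Receiving TCP Data

The experiment procedure comes from 知识星球 (Knowledge Planet): 程序员踩坑案例分享 (Programmers' Pitfall Case Sharing)

TCP timeout retransmission

TCP time-based retransmission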

vm-1

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-time-retries.pcap --print

vm-1-tcp-send-receive-time-retries.pcap

Once the connection is established, vm-2 drops the packets sent by vm-1

$ sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

vm-1 sends data

$ nc -k -l 192.168.139.111 9527
abc
abc
abc

$ while true;do sudo netstat -anpo|grep 9527|grep -v LISTEN; sleep 1;done
...
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (3.14/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (2.12/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (1.10/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.07/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.00/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.00/15/0)

The output shows 15 retransmissions, but the capture contains 16 retransmitted packets

tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO.

The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. If tcp_rto_max_ms is decreased, it is recommended to also change tcp_retries2.

RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.

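The first packet does not really count as a retransmission: the peer did not reply within the RTO, and that is what triggers the retransmissions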

47.086654-46.884133=0.202521

$ sudo sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15
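The 924.6-second figure in the tcp_retries2 documentation quoted above can be reproduced with a short sketch. It assumes an initial RTO equal to TCP_RTO_MIN (200 ms, the default shown for tcp_rto_min_us) and a cap of 120 seconds per RTO; the cap is an assumption here, taken to be the default behind tcp_rto_max_ms. The function name `hypothetical_timeout` is made up for this example.

```python
def hypothetical_timeout(retries: int, rto_min: float = 0.2,
                         rto_max: float = 120.0) -> float:
    """Sum the first retries+1 RTOs under exponential backoff.

    The RTO starts at rto_min, doubles after every unacknowledged
    retransmission, and is capped at rto_max (assumed to be 120 s).
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


print(round(hypothetical_timeout(15), 1))  # 924.6
print(round(hypothetical_timeout(8), 1))   # 102.2
```

With 15 retries the first ten RTOs add up to 204.6 s and the remaining six are capped at 120 s each. A value of 8 gives just over 100 s, which matches the RFC 1122 remark in the same documentation.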

The article's description of RTT and RTO

RTT (Round Trip Time): the time from sending a packet until its reply comes back
RTO (Retransmission TimeOut): the retransmission timeout

Linux has a TCP_RTO_MIN kernel parameter

tcp_rto_min_us - INTEGER
Minimal TCP retransmission timeout (in microseconds). Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN and TCP_RTO_MIN_US socket options, followed by this tcp_rto_min_us sysctl.

The recommended practice is to use a value less or equal to 200000 microseconds.

Possible Values: 1 - INT_MAX

Default: 200000
$ sudo sysctl net.ipv4.tcp_rto_min_us
net.ipv4.tcp_rto_min_us = 200000

The actual RTO is computed on top of TCP_RTO_MIN and can also be inspected

$ sudo ss -tip|grep -A 1 9527
ESTAB 0      0      192.168.139.111:9527 192.168.139.151:57060 users:(("nc",pid=303,fd=4))
	 cubic wscale:10,10 rto:201 rtt:0.078/0.039 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 segs_in:2 send 1.49Gbps lastsnd:21909 lastrcv:21909 lastack:21909 pacing_rate 2.97Gbps delivered:1 app_limited rcv_space:14480 rcv_ssthresh:64088 minrtt:0.078 snd_wnd:64512

After sending data from vm-1 to vm-2, check the RTO again; it increases with each retry

$ sudo ss -tip|grep -A 1 9527
ESTAB 0      4      192.168.139.111:9527 192.168.139.151:57060 users:(("nc",pid=303,fd=4))
	 cubic wscale:10,10 rto:25728 backoff:7 rtt:0.078/0.039 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:7 bytes_sent:36 bytes_retrans:32 bytes_received:4 segs_out:18 segs_in:11 data_segs_out:9 data_segs_in:9 send 149Mbps lastsnd:11529 lastrcv:42622 lastack:16135 pacing_rate 297Mbps delivered:1 app_limited busy:38299ms unacked:1 retrans:1/8 lost:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.078 snd_wnd:64512
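The two `ss` outputs are consistent with the RTO doubling on every backoff step: the base value was rto:201, and with backoff:7 it reads rto:25728. A minimal sketch of that relationship, assuming plain doubling and a 120-second cap (both assumptions, and the function name `backed_off_rto` is hypothetical):

```python
def backed_off_rto(base_rto_ms: int, backoff: int,
                   rto_max_ms: int = 120_000) -> int:
    """Double the base RTO once per backoff step, capped at rto_max_ms."""
    return min(base_rto_ms << backoff, rto_max_ms)


print(backed_off_rto(201, 7))  # 25728
```

201 ms shifted left by 7 is 201 * 128 = 25728 ms, the value reported in the second output.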

TCP fast retransmission

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack3-wrong.pcap --print

vm-1-tcp-send-receive-dupack3-wrong.pcap

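I do see TCP Fast Retransmission, but it differs from the article: vm-2 starts sending TCP Dup ACKs before vm-1 has finished sending its data, and 7 dup ACKs had already been sent before the fast retransmission. net.ipv4.tcp_reordering defaults to 3, meaning a fast retransmission should be triggered after 3 dup ACKs. My first guess was that transfers between my two local virtual machines are simply too fast. Here is the code for the vm-1 server and the vm-2 client

Before testing, vm-2 must drop the RST packets it sends to vm-1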

$ sudo iptables -A OUTPUT -p tcp --tcp-flags RST RST --dport 9527 -j DROP
vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) # disable Nagle's algorithm

    client.sendall(b"a" * 1460)
    time.sleep(0.1) # keeps the protocol stack from coalescing segments; not rigorous, but good enough
    client.sendall(b"b" * 1460)
    time.sleep(0.1)
    client.sendall(b"c" * 1460)
    time.sleep(0.1)
    client.sendall(b"d" * 1460)
    time.sleep(0.1)
    client.sendall(b"e" * 1460)
    time.sleep(0.1)
    client.sendall(b"f" * 1460)
    time.sleep(0.1)
    client.sendall(b"g" * 1460)

    time.sleep(10000)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

Next, comment out the time.sleep(0.1) calls in the vm-1 server, keep the iptables rule, and test again

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack3.pcap --print

vm-1-tcp-send-receive-dupack3.pcap

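All three TCP Dup ACKs carry the same acknowledgment number, 1566486280

Fast retransmission then resends the packet with seq 1566486280. This differs slightly from the article: in my capture, packet 9 is vm-2's ACK for the fourth data packet sent.

If you now stop the server on vm-1, the connection between vm-1 and vm-2 changes from ESTABLISHED to FIN_WAIT1, yet vm-1 keeps retransmitting. These are now RTO-based retransmissions and will be sent 15 times. Very TCP.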

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      698/python3          off (0.00/0/0)
tcp        0   8760 192.168.139.111:9527    192.168.139.151:9528    ESTABLISHED 698/python3          on (108.19/12/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0   8761 192.168.139.111:9527    192.168.139.151:9528    FIN_WAIT1   -                    on (102.22/12/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0   8761 192.168.139.111:9527    192.168.139.151:9528    FIN_WAIT1   -                    on (82.74/12/0)

This shows the difference between fast retransmission and time-based retransmission

Fast

1

Time-based

2

Test again, this time changing net.ipv4.tcp_reordering

$ sudo sysctl net.ipv4.tcp_reordering
net.ipv4.tcp_reordering = 3

$ sudo sysctl -w net.ipv4.tcp_reordering=1
net.ipv4.tcp_reordering = 1
$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack1.pcap --print

vm-1-tcp-send-receive-dupack1.pcap

TCP Selective Acknowledgment (SACK)

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-scak.pcap --print

vm-1-tcp-send-receive-scak.pcap

The iptables rule on vm-2 must stay in place

vm-1

$ nc -k -l 192.168.139.111 9527

vm-2

import time
from scapy.all import *
from scapy.layers.inet import *


def main():
    ip = IP(dst="192.168.139.111")

    myself_seq = 1
    tcp = TCP(sport=9528, dport=9527, flags='S', seq=myself_seq, options=[("SAckOK", '')])
    print(f"send SYN, seq={myself_seq}")
    resp = sr1(ip/tcp, timeout=2)
    if not resp:
        print("recv timeout")
        return

    resp_tcp = resp[TCP]
    if 'SA' in str(resp_tcp.flags):
        recv_seq = resp_tcp.seq
        recv_ack = resp_tcp.ack
        print(f"received SYN-ACK, seq={recv_seq}, ACK={recv_ack}")

        myself_seq += 1
        send_ack = recv_seq + 1
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        print(f"send ACK={send_ack}")
        send(ip/tcp)

        # intentionally commented out so that the sent data has a hole
        # send data
        # payload = b"a" * 10
        # tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        # send(ip/tcp/payload)
        myself_seq += 10

        payload = b"b" * 10
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        send(ip/tcp/payload)
        myself_seq += 10

        # intentionally commented out so that the sent data has a hole
        # payload = b"c" * 10
        # tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        # send(ip/tcp/payload)
        myself_seq += 10

        payload = b"d" * 10
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        send(ip/tcp/payload)


    elif 'R' in str(resp_tcp.flags):
        print(f"received RST")
    else:
        print("received different TCP flags")

    time.sleep(100)


if __name__ == "__main__":
    main()

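Here vm-2 acts as the sender. Parts of the original article were hard to follow, so I asked GPT, which explained the holes in the sent data and the SACK option that vm-1 sends

What does "leaving holes in the data" actually mean (what happens on the network)

When TCP data arrives in order, the receiver acknowledges cumulatively (cumulative ACK): the ACK value is the sequence number following the last contiguous byte received without gaps.

When you skip some middle segments (for example the commented-out a and c) but send the later segments b and d, the receiver gets "out-of-order" data: it can accept and buffer b and d, but it cannot advance the cumulative ACK, because the missing a breaks the contiguous sequence.

If the receiver supports SACK, it adds a SACK option to the ACKs it returns, telling the sender which non-contiguous blocks it has received (for example [seq_b, seq_b+9], [seq_d, seq_d+9]), so the sender can retransmit only the missing blocks (a, c) instead of the whole stream.

In short: the "holes" you create simulate middle packets being lost or never sent, in order to trigger SACK behavior (the receiver uses SACK to report the non-contiguous blocks it has received).

In the capture, packet 4 from vm-2 is an ACK with seq number 11, so the next seq number should be 21; packet 6 has seq number 32 and next seq number 41

Then look at the SACK in packets 5 and 7

In other words, the data in 0-10 and 21-30 is missing and has to be resent by the sender, vm-2.

When I later run into a real, non-simulated case, it would be good to compare it with this missing-data scenario.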
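To make the hole-finding step concrete, here is a small sketch of how a sender could derive the missing ranges from a cumulative ACK plus SACK blocks. It is a simplified model, not how the kernel implements it; `find_holes` is a hypothetical name, and the numbers are the absolute `myself_seq` values from the scapy script (after the handshake `myself_seq` is 2, b occupies [12, 22) and d occupies [32, 42)), which differ from the relative numbers Wireshark displays.

```python
def find_holes(cum_ack: int, sack_blocks: list) -> list:
    """Return the [start, end) byte ranges not yet received.

    cum_ack is the next byte the receiver expects in order; sack_blocks
    are (left_edge, right_edge) pairs of data received out of order.
    """
    holes = []
    next_expected = cum_ack
    for left, right in sorted(sack_blocks):
        if left > next_expected:
            holes.append((next_expected, left))  # gap before this block
        next_expected = max(next_expected, right)
    return holes


# Segments a and c were never sent, so two 10-byte holes remain.
print(find_holes(2, [(12, 22), (32, 42)]))  # [(2, 12), (22, 32)]
```

Each reported hole is exactly 10 bytes long, matching the two 10-byte payloads that the script deliberately skips.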

TCP window management

GSO/TSO enabled by default

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale.pcap --print

vm-1-tcp-window-scale.pcap

Code for a server that never reads and a client that sends in a loop

vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    time.sleep(10000)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)
    
-----------------------------------------------------------------
vm-2

import socket
import time

def start_client(host, port):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((host, port))
    client.setblocking(False)

    send_size = 0
    data = b"a" * 100000
    while True:
        try:
            size = client.send(data)
            if size > 0:
                send_size += size
                print(f"send_size: {send_size}")
        except BlockingIOError:
            time.sleep(0.1)
            pass

if __name__ == '__main__':
    start_client('192.168.139.111', 9527)

Most of the packets sent by vm-2 are larger than the MTU. Checking generic-segmentation-offload, generic-receive-offload, and tcp-segmentation-offload shows the same settings on vm-1 and vm-2.

$ sudo ethtool -k eth0|grep -E "generic-segmentation-offload|generic-receive-offload"
generic-segmentation-offload: on
generic-receive-offload: off

$ sudo ethtool -k eth0|grep tcp-segmentation-offload
tcp-segmentation-offload: on

GPT's explanation is below. In short, packets are merged before being sent

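Feature | Direction | Where it runs | Meaning | Hardware-dependent?
TSO (TCP Segmentation Offload) | Send | NIC hardware | The NIC splits large TCP segments | Yes (hardware)
GSO (Generic Segmentation Offload) | Send | Kernel/driver | Software emulation of TSO | No (software)
GRO (Generic Receive Offload) | Receive | Kernel/driver | Merges multiple packets into one large packet | No (software)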
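Both vm-1's Recv-Q and vm-2's Send-Q show a backlog. This ties in nicely with the nginx experiment at the end of TCP 连接的建立 (Establishing a TCP Connection): nginx in the cloud sends data to vm-2, and until vm-2 replies with an ACK, Send-Q holds a non-zero value

In this code vm-2 keeps sending while vm-1 receives but never reads the data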

vm-1

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      335/python3          off (0.00/0/0)
tcp   123976      0 192.168.139.111:9527    192.168.139.151:45010   ESTABLISHED 335/python3          off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0 706624 192.168.139.151:45010   192.168.139.111:9527    ESTABLISHED 295/python3          probe (3.53/0/0)

Disabling GSO/TSO

Disable them on both machines

$ sudo ethtool -K eth0 gso off
$ sudo ethtool -K eth0 tso off
$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale-disable-gso-tso.pcap --print

vm-1-tcp-window-scale-disable-gso-tso.pcap

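This time the capture looks normal. Data still backs up as before, but with GSO/TSO disabled the capture contains roughly 100 more packets than with them enabled, so it is less efficient.

Window changes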

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale-recv.pcap --print

vm-1-tcp-window-scale-recv.pcap

Modified server code

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) # disable Nagle's algorithm

    while True:
        for i in range(5):
            client.recv(4096)

        time.sleep(1)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

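This chart seems helpful for understanding, so I am quoting the original text here

A brief explanation of the chart: the X axis is time and the Y axis is the Sequence Number. The green line is the receiver's window size, the blue line is the packets sent (the captured packets), and the yellow line is the ACKed Sequence Number. The chart shows that whenever the receiver's window size grows, packets are sent immediately, and when the window size is 0 (the line goes flat), sending stops immediately. So the chart tells us that the receiver is the bottleneck in this transfer: the receiver throttles the sender by closing its window.

This chart has many clickable points. For example, clicking a point on the x axis jumps straight to the corresponding packet. Frame 112 is a keep-alive and the next packet, 113, is a TCP ZeroWindow. From 116 to 132 vm-2 is sending data to vm-1, and the rising line on the y axis shows data being sent. 113-116 means vm-1 had no window left to receive. This matches the code, which reads for a while and then pauses; in a real service it would mean the server is slow to process

tcp window full - the TCP window is full
tcp zerowindow - the receiver cannot accept more data
tcp window update - buffer space is available again, please continue sending

Packets 110 to 116 are the first flat stretch on the x axis, where no data is sent. In 111 vm-1 tells vm-2 that it cannot accept more data, until vm-1 sends a window update in 115, telling vm-2 that it may send again.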

wireshark-tcp-window-zero-update-full
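The read-then-pause pattern behind the ZeroWindow and Window Update packets can be imitated with a toy model of the receive buffer. This is only a sketch under simplifying assumptions (fixed buffer size, one send and one read per tick, no window scaling or delayed ACKs); `simulate_receiver` and its parameters are hypothetical.

```python
def simulate_receiver(buffer_size: int, send_per_tick: int,
                      read_per_tick: int, ticks: int) -> list:
    """Return (tick, event) pairs for a toy receive buffer.

    Each tick the sender pushes at most what the advertised window
    allows, then the application reads a fixed amount.
    """
    buffered = 0
    events = []
    for tick in range(1, ticks + 1):
        window = buffer_size - buffered          # last advertised window
        buffered += min(send_per_tick, window)   # sender is window-limited
        was_full = buffered == buffer_size
        if was_full:
            events.append((tick, "ZeroWindow"))
        buffered -= min(read_per_tick, buffered)  # application reads
        if was_full and buffered < buffer_size:
            events.append((tick, "Window Update"))
    return events


print(simulate_receiver(buffer_size=8, send_per_tick=5,
                        read_per_tick=2, ticks=3))
```

Because the sender pushes 5 units per tick while the reader drains only 2, the buffer fills on the second tick and every later tick alternates between a zero window and a window update, the same shape as the flat and rising stretches in the chart.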

TCP congestion control

No tests for this part for now; I have had little exposure to it.


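Closing a TCP Connection

The experiment procedure comes from 知识星球 (Knowledge Planet): 程序员踩坑案例分享 (Programmers' Pitfall Case Sharing)

Disconnecting

Again vm-2 connects to vm-1, and then vm-2, as the client, closes the connection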

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close.pcap --print

vm-1-tcp_close.pcap

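A perfectly normal three-way handshake followed by a four-way close; everything is as expected

Now look at the connection state on vm-2: it enters TIME_WAIT normally and waits 2 * MSL, which is 60s. Nothing is retried; it is simply waiting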

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    ESTABLISHED 34043/nc             off (0.00/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (57.43/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (46.88/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (39.88/0/0)

Short-lived connections and the TIME_WAIT state

Test with tw_buckets raised and tw_reuse disabled

$ sudo sysctl -w net.ipv4.tcp_max_tw_buckets=1000000
net.ipv4.tcp_max_tw_buckets = 1000000

$ sudo sysctl -w net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_tw_reuse = 0
$ cat loopconnect.py
import socket

def connect_and_immediately_disconnect(host, port, count):
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.connect((host, port))
            cli.close()
    except Exception as e:
        print(f"Failed to connect: {e}")

if __name__ == '__main__':
    connect_and_immediately_disconnect('192.168.139.111', 9527, 70000)

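In testing, the number of TIME_WAIT sockets on the machine stalled at around 5000, so I could not reproduce the "no available address" error. I reduced the connection count by a factor of 5 and then narrowed the local port range; the error appears once 233 TIME_WAIT sockets are reached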

$ sudo sysctl -w net.ipv4.ip_local_port_range="32768 33000"
net.ipv4.ip_local_port_range = 32768 33000

$ python3 loopconnect.py
Failed to connect: [Errno 99] Cannot assign requested address

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
233
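The count of 233 lines up with the size of the configured port range. A minimal sketch of that arithmetic, assuming every connection goes to the same destination IP and port so that each TIME_WAIT socket pins one local port (the helper name `usable_ports` is hypothetical):

```python
def usable_ports(low: int, high: int) -> int:
    """Number of local ports in an inclusive ip_local_port_range."""
    return high - low + 1


print(usable_ports(32768, 33000))  # 233
```

Once all 233 ports are held by TIME_WAIT sockets and tw_reuse is off, the next connect() has no local port left, which is the "Cannot assign requested address" error above.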

Test with tw_reuse enabled

Then enable net.ipv4.tcp_tw_reuse and widen the local port range a little

$ sudo sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 32768	38414
net.ipv4.tcp_max_tw_buckets = 200000

The TIME_WAIT count stays a little above 2000, and the script runs to completion and exits normally

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
2287
2301
2303
2303
2269
2291
2292
2322

Test with tw_reuse disabled and tw_buckets changed

$ sudo sysctl -w net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_tw_reuse = 0

$ sudo sysctl -w net.ipv4.tcp_max_tw_buckets=1000
net.ipv4.tcp_max_tw_buckets = 1000

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
1000
1000
1000

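The result is much the same: the TIME_WAIT count is mostly 1000, and after watching for a while it occasionally drops to 900-something. My guess is that once 2 * MSL has elapsed, local addresses are released and used again. Because tw_reuse is disabled they are not reused early; new connections simply wait until an address becomes available

The script makes 14000 connections, so 11353 connections were disposed of by the system without waiting out the 60s of TIME_WAIT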

$ sudo netstat -s|grep TCPTimeWaitOverflow
    TCPTimeWaitOverflow: 11353

Observing FIN_WAIT1

vm-2 connects to vm-1. After the connection is established, vm-1 drops the FIN sent by vm-2; as soon as vm-2 sends a FIN it enters FIN_WAIT1

$ sudo iptables -A INPUT -p tcp --dport 9527 --tcp-flags FIN FIN -j DROP
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-fin1.pcap --print

vm-1-tcp_close-iptables-fin1.pcap

vm-2 sends a FIN to vm-1 and vm-1 drops it. Because vm-2 never receives vm-1's ACK for the FIN, it retransmits

The connection state on vm-2:

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:37356   192.168.139.111:9527    FIN_WAIT1   -                    on (2.52/4/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:37356   192.168.139.111:9527    FIN_WAIT1   -                    on (34.57/8/0)

These 9 retransmissions are governed by net.ipv4.tcp_orphan_retries, which defaults to 8

tcp_orphan_retries - INTEGER
This value influences the timeout of a locally closed TCP connection, when RTO retransmissions remain unacknowledged. See tcp_retries2 for more details.

The default value is 8.

If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans.

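Observing FIN_WAIT2 and LAST_ACK

Again vm-2 connects to vm-1, then iptables on vm-2 drops the incoming FIN and vm-2 closes the connection. vm-2 enters FIN_WAIT1 as soon as it sends its FIN, and moves to FIN_WAIT2 when vm-1's ACK arrives. Because the FIN from vm-1 is dropped, we can observe FIN_WAIT2 on vm-2 and LAST_ACK on vm-1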

$ sudo iptables -A INPUT -p tcp --sport 9527 --tcp-flags FIN FIN -j DROP
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-fin2.pcap --print

vm-1-tcp_close-iptables-fin2.pcap

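The connection on vm-2 has entered FIN_WAIT2. The last column shows timewait (N s/0/0), where N starts at 60 and is controlled by tcp_fin_timeout. This is not a retransmission, just the FIN_WAIT2 timeout; after 60s the connection disappears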

$ sudo sysctl -a|grep tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:36456   192.168.139.111:9527    FIN_WAIT2   -                    timewait (57.45/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:36456   192.168.139.111:9527    FIN_WAIT2   -                    timewait (54.12/0/0)

LAST_ACK on vm-1, which also retransmits 9 times

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1091/nc              off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:36456   ESTABLISHED 1091/nc              off (0.00/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1091/nc              off (0.00/0/0)
tcp        0      1 192.168.139.111:9527    192.168.139.151:36456   LAST_ACK    -                    on (23.09/7/0)

Observing CLOSE_WAIT

Connect using Python

import socket

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect(('192.168.139.111', 9527))
c.shutdown(socket.SHUT_WR)
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-closewait-py.pcap --print

vm-1-tcp_close-iptables-closewait-py.pcap

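vm-2 enters FIN_WAIT1 after sending its FIN, and vm-1 replies with only an ACK. On receiving the ACK vm-2 moves to FIN_WAIT2, while vm-1, having sent only the ACK, enters CLOSE_WAIT

Likewise, after FIN_WAIT2 waits 60s the connection ends directly without entering TIME_WAIT. This does not match the article: after the request is sent, FIN_WAIT2 on vm-2 enters a 60s wait, whereas in the article it does not. For now I can see that CLOSE_WAIT on vm-1 persists indefinitely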

vm-2

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:34774   192.168.139.111:9527    FIN_WAIT2   -                    timewait (55.39/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:34774   192.168.139.111:9527    FIN_WAIT2   -                    timewait (31.64/0/0)

-----------------------------------------------------------------
vm-1

$ sudo netstat -anpo|grep 9527
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1268/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:34774   CLOSE_WAIT  -                    off (0.00/0/0)
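The half-close that produces CLOSE_WAIT can also be reproduced on a single machine over loopback. The sketch below is a simplified stand-in for the two-VM setup: the function name `half_close_demo`, the loopback address, and the 5-second timeouts are choices made for this example.

```python
import socket


def half_close_demo():
    """Show on loopback that SHUT_WR closes only one direction."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    conn, _ = server.accept()
    conn.settimeout(5)

    client.shutdown(socket.SHUT_WR)    # client sends its FIN
    eof = conn.recv(1024)              # b"" means the peer's FIN arrived
    conn.sendall(b"still open")        # server side (CLOSE_WAIT) can still send
    reply = client.recv(1024)          # client side (FIN_WAIT2) can still receive

    conn.close()                       # only now does the server send its FIN
    client.close()
    server.close()
    return eof, reply


if __name__ == "__main__":
    print(half_close_demo())
```

Until `conn.close()` runs, the accepted socket stays in CLOSE_WAIT, which is the state vm-1 is stuck in above because its server never closes the connection.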

Test with net.ipv4.tcp_fin_timeout changed

$ sudo sysctl -w net.ipv4.tcp_fin_timeout=30
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-closewait-py-30s-timeout.pcap --print

vm-1-tcp_close-iptables-closewait-py-30s-timeout.pcap

vm-1

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        2      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1268/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:37626   CLOSE_WAIT  -                    off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:34774   CLOSE_WAIT  -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37626   192.168.139.111:9527    FIN_WAIT2   -                    timewait (26.30/0/0)

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37626   192.168.139.111:9527    FIN_WAIT2   -                    timewait (0.49/0/0)

It looks like there is no exemption; the time simply follows net.ipv4.tcp_fin_timeout. This may also depend on the kernel, so here is my system's kernel

Linux vm-2 6.17.4-orbstack-00308-g195e9689a04f #1 SMP PREEMPT Fri Oct 24 07:22:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Connection keepalive

Default keepalive parameters

$ sudo sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
import socket
import time

def connect_and_hold(host, port, count):
    cli_list = []
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            cli.connect((host, port))
            cli_list.append(cli)
    except Exception as e:
        print(f"Failed to connect: {e}")

    while True:
        time.sleep(1)

if __name__ == '__main__':
    connect_and_hold('192.168.139.111', 9527, 1)
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-vm-2-keepalive.pcap --print

vm-1-tcp_close-vm-2-keepalive.pcap

After connecting, with no data transferred, vm-2 sends vm-1 a TCP Keep-Alive every 75s, following net.ipv4.tcp_keepalive_intvl

vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37118   192.168.139.111:9527    ESTABLISHED 25305/python3        keepalive (71.33/0/0)

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37118   192.168.139.111:9527    ESTABLISHED 25305/python3        keepalive (70.62/0/0)

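I asked Grok to modify the Python script

import socket
import time  # imported so the script can pause with time.sleep while the connection is inspected

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# keepalive parameters: TCP_KEEPIDLE is the idle time (equivalent to net.ipv4.tcp_keepalive_time), in seconds
# (not set in this script, so the system-wide sysctl values apply)

c.connect(('192.168.139.111', 9527))

# Pause so the connection state can be inspected with netstat -anpo or ss -anto (the keepalive timer shows up as e.g. timer:keepalive (10.000 sec))
print("Connected, press Enter to exit...")
input()  # or use time.sleep(60) to wait 60 seconds automatically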
c.close()
$ sudo sysctl -a|grep keep
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 10
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-vm-2-keepalive-10s.pcap --print

vm-1-tcp_close-vm-2-keepalive-10s.pcap

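A normal three-way handshake, then vm-2 sends a Keep-Alive and vm-1 replies. Packets 3 and 4 are 10s apart, which is net.ipv4.tcp_keepalive_time. With no data sent between vm-1 and vm-2, 20s later vm-2 sends another Keep-Alive in packet 6, which is net.ipv4.tcp_keepalive_intvl. Finally I press Enter on vm-2 to disconnect: vm-2 sends FIN+ACK, and vm-1 replies with an ACK but no FIN, so vm-1 is in CLOSE_WAIT and vm-2 is in FIN_WAIT2, which disappears after the 30s tcp_fin_timeout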

vm-1

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1692/python3         off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:33360   ESTABLISHED -                    off (0.00/0/0)

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1692/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:33360   CLOSE_WAIT  -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    ESTABLISHED 27528/python3        keepalive (7.38/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    ESTABLISHED 27528/python3        keepalive (17.59/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    FIN_WAIT2   -                    timewait (26.96/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    FIN_WAIT2   -                    timewait (23.06/0/0)
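The script above relies on the system-wide sysctl values. As a minimal sketch, assuming Linux (where Python exposes `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT`), the same timings could instead be set per socket. The helper name `enable_keepalive` is made up for illustration and is not part of the original experiment.

```python
import socket


def enable_keepalive(sock, idle=10, interval=20, probes=9):
    """Turn on TCP keepalive for one socket and override the sysctl defaults.

    idle     ~ net.ipv4.tcp_keepalive_time   (seconds of idle before the first probe)
    interval ~ net.ipv4.tcp_keepalive_intvl  (seconds between probes)
    probes   ~ net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The three TCP_KEEP* options below are Linux-specific.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    c = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # Read the idle time back to confirm the option was applied.
    print(c.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
    c.close()
```

Calling `enable_keepalive(c)` before `c.connect(...)` should give the same 10s / 20s pattern as the capture without touching sysctl, though I have not re-run the capture that way.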

TCP_CLOSE states

I could not find a suitable diagram, so I drew one myself:

tcp-close


Establishing a TCP connection

The experiment procedure comes from the 知识星球 group: 程序员踩坑案例分享

Creating a connection

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp.pcap --print

vm-1-tcp.pcap

I ran into a problem here: with "nc -k -l vm-1 9527" running on vm-1, messages sent from vm-2 never showed up in the vm-1 window.

I had added the peer's hostname and IP to the hosts file on both VMs:

vm-1
198.19.249.151 vm-2

vm-2
198.19.249.111 vm-1

Capture on vm-1:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost.pcap --print

vm-1-tcp-nc-localhost.pcap

Then check what vm-1 is listening on:

sudo netstat -anpt

It was listening on 127.0.0.1; the other network interface had no listener.

vm-2 sends a SYN to vm-1, vm-1 immediately replies with an RST, and vm-2 then keeps retrying according to net.ipv4.tcp_syn_retries:

sudo sysctl -a|grep net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6

But the capture shows 10 retransmissions, 11 packets in total.

Capture again to verify:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost-retries.pcap --print

vm-1-tcp-nc-localhost-retries.pcap

Strange: this does not match net.ipv4.tcp_syn_retries = 6. There should be only 7 packets, one original SYN plus 6 retries.

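There is another oddity. Exponential backoff should give gaps of 1, 2, 4, 8 seconds, but in the capture the first 4 retries are each 1s apart and only the 5th retry comes 2s later. Searching on this behavior together with the current kernel version led to: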

net.ipv4.tcp_syn_linear_timeouts

tcp_syn_linear_timeouts - INTEGER
The number of times for an active TCP connection to retransmit SYNs with a linear backoff timeout before defaulting to an exponential backoff timeout. This has no effect on SYNACK at the passive TCP side.

With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts, and the first exponential backoff using 2^0 * initial_RTO). Default: 4

That explains it. Next, change net.ipv4.tcp_syn_linear_timeouts and test again:

sudo sysctl -w net.ipv4.tcp_syn_retries=6 net.ipv4.tcp_syn_linear_timeouts=1

The default gives 10 retransmissions; now there should be 7.
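The arithmetic behind those two counts can be written down as a small model. This is only a sketch derived from the quoted kernel documentation and the packet counts seen in the captures, not from the kernel source; the function name `syn_retransmit_intervals` is my own.

```python
def syn_retransmit_intervals(syn_retries=6, linear_timeouts=4, initial_rto=1):
    """Model the wait (in seconds) before each SYN retransmission.

    Per the quoted doc: `linear_timeouts` waits of initial_rto come first,
    then exponential backoff starts at 2^0 * initial_rto and runs for
    `syn_retries` steps.
    """
    linear = [initial_rto] * linear_timeouts
    exponential = [initial_rto * 2 ** i for i in range(syn_retries)]
    return linear + exponential


if __name__ == "__main__":
    for linear in (4, 1):
        waits = syn_retransmit_intervals(6, linear)
        print(f"linear_timeouts={linear}: {len(waits)} retransmissions, waits={waits}")
```

With tcp_syn_retries = 6 the model yields 10 retransmissions for tcp_syn_linear_timeouts = 4 (waits 1, 1, 1, 1, 1, 2, 4, 8, 16, 32) and 7 for tcp_syn_linear_timeouts = 1 (waits 1, 1, 2, 4, 8, 16, 32), matching the 11-packet capture and the expected 7 retries.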

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost-retries-syn-linear.pcap --print

vm-1-tcp-nc-localhost-retries-syn-linear.pcap

while true;do sudo netstat -anpo|grep 9527;sleep 1;done

Correct: it stopped by itself after 7.

After switching to nc -k -l 192.168.139.111 9527 the connection worked immediately.

Observing SYN_SENT

On vm-1, use iptables to drop the SYN packets coming from vm-2:

sudo iptables -A INPUT -p tcp --dport 9527 -j DROP
sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-drop-9527.pcap --print

vm-1-tcp-iptables-drop-9527.pcap

This time the capture shows TCP retransmissions, 10 of them, still controlled by the same two parameters:

net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_linear_timeouts

vm-2 shows the connection in SYN_SENT.

Observing SYN_RECV

Here vm-2 has to drop the SYN+ACK coming from vm-1. Without the SYN+ACK, vm-2 cannot reply with an ACK, so vm-1 cannot complete the three-way handshake.

sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

Switch to nmap to test the connection:

sudo nmap -sS 192.168.139.111 -p 9527

Check the connection state on vm-1:

while true;do sudo netstat -anpo|grep 9527;sleep 1;done
sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-vm2-drop-9527.pcap --print

vm-1-tcp-iptables-vm2-drop-9527.pcap

The capture shows vm-2 >(SYN) vm-1, vm-1 >(SYN+ACK) vm-2, and then vm-1 keeps retrying, 5 times in total.

net.ipv4.tcp_synack_retries defaults to 5:

tcp_synack_retries - INTEGER
Number of times SYNACKs for a passive TCP connection attempt will be retransmitted. Should not be higher than 255. Default value is 5, which corresponds to 31 seconds till the last retransmission with the current initial RTO of 1 second. With this the final timeout for a passive TCP connection will happen after 63 seconds.
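The 31 and 63 second figures in that description follow from doubling the timeout after each retransmission. A short sketch of the calculation, assuming plain doubling from an initial RTO of 1 second (the function name `synack_timeline` is hypothetical):

```python
def synack_timeline(synack_retries=5, initial_rto=1):
    """Return (retransmit_times, give_up_time), in seconds after the first SYN-ACK.

    Assumes the timeout simply doubles after every retransmission.
    """
    times = []
    elapsed = 0
    rto = initial_rto
    for _ in range(synack_retries):
        elapsed += rto          # wait one RTO, then retransmit
        times.append(elapsed)
        rto *= 2                # back off for the next attempt
    # After the last retransmission the kernel waits one more RTO before giving up.
    return times, elapsed + rto


if __name__ == "__main__":
    times, give_up = synack_timeline()
    print("retransmissions at:", times)
    print("final timeout at:", give_up)
```

For the default of 5 retries this gives retransmissions at 1, 3, 7, 15 and 31 seconds and a final timeout at 63 seconds, in line with the documentation quoted above.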

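The article mentions SYN flood:

The client sends 1 SYN to the server; if the client never responds, the server retries 5 times. That is 5 retries for a single client, so with many clients the server's resources are consumed quickly.

The article also notes: if you only use iptables to block the second handshake packet, the source-side protocol stack retransmits its SYN, which makes it impossible to test SYN+ACK retransmission in isolation. The sender must therefore do nothing else after sending the SYN. nc cannot send a single SYN and exit, so nmap is used for the experiment instead.

To reproduce that, use nc and capture: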

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-vm2-drop-9527-nc.pcap --print

vm-1-tcp-iptables-vm2-drop-9527-nc.pcap

Sure enough, vm-2 is retransmitting.

SYN Queue

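Borrowing the diagram from the article:

(Image from: https://www.emqx.com/en/blog/emqx-performance-tuning-tcp-syn-queue-and-accept-queue)

To verify the SYN (half-open) queue length, change the relevant kernel parameters: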

sudo sysctl -w net.ipv4.tcp_syncookies=0 net.ipv4.tcp_max_syn_backlog=4 net.core.somaxconn=8

Test from vm-2:

while true;do sudo nmap -sS 192.168.139.111 -p 9527;done

Check the state on vm-1; once again it does not match the kernel parameters just set:

$ sudo netstat -anpo|grep RECV
tcp        0      0 192.168.139.111:9527    192.168.139.151:35013   SYN_RECV    -                    on (1.82/2/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:57984   SYN_RECV    -                    on (1.76/2/0)

The SYN queue length is chosen by this rule:

min(backlog, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog)

Neither tcp_max_syn_backlog nor somaxconn was set to 2, so the only remaining factor is backlog. The backlog was not changed; nc was used as is:

nc -k -l 192.168.139.111 9527
$ sudo ss -anpt
State     Recv-Q    Send-Q         Local Address:Port       Peer Address:Port   Process
LISTEN    0         1            192.168.139.111:9527            0.0.0.0:*       users:(("nc",pid=37710,fd=3))

On the meaning of Send-Q in ss:

High Send-Q means the data is put on TCP/IP send buffer, but it is not sent or it is sent but not ACKed

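That is, the data sits in the TCP/IP send buffer and has either not been sent yet or has been sent but not ACKed.

In our case, vm-2 blocked the SYN+ACK from vm-1 and never replied with an ACK.

So nc sets its backlog to 1, and the server's SYN queue only allows 1 connection to wait.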
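To make the rule concrete, here is a minimal sketch that evaluates it for a few listen backlogs under the sysctl values set above (somaxconn = 8, tcp_max_syn_backlog = 4). It encodes only the formula as stated in this post; real kernels apply further details that this ignores, and the function name `syn_queue_limit` is my own.

```python
def syn_queue_limit(backlog, somaxconn, tcp_max_syn_backlog):
    """SYN queue limit according to the rule quoted in this post."""
    return min(backlog, somaxconn, tcp_max_syn_backlog)


if __name__ == "__main__":
    somaxconn, max_syn_backlog = 8, 4   # values set with sysctl -w above
    for backlog in (1, 8, 128):         # 1 is what ss reports for nc
        limit = syn_queue_limit(backlog, somaxconn, max_syn_backlog)
        print(f"backlog={backlog}: SYN queue limit {limit}")
```

A backlog of 1 is the only input that pulls the result below 4, which is why the listening program's own backlog is the value to look at here.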

Write a server in Go:

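package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	fmt.Print("h")
	// Listen only; Accept is never called, so completed handshakes stay queued in the kernel.
	ln, err := net.Listen("tcp4", "0.0.0.0:9527")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	log.Println("listen :9527 success")

	// Keep the process alive so the listening socket can be inspected with ss.
	for {
		time.Sleep(time.Second * 10)
	}
}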
$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          8                    0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=37717,fd=4))

Send-Q is 8. By the formula min(backlog, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog), somaxconn is 8 and tcp_max_syn_backlog is 4.

Restore the kernel parameters to their defaults to see the Go server's default backlog:

$ sudo sysctl -a|grep tcp_syncookies;sudo sysctl -a|grep max_syn_backlog;sudo sysctl -a|grep net.core.somaxconn
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 512
net.core.somaxconn = 4096

$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          4096                 0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=318,fd=4))

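The only open question now is that the minimum should be 4, yet ss -anpt shows 8. Let's test with real connections. Remember to restart the service after changing kernel parameters.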

$ while true;do sudo nmap -sS 192.168.139.111 -p 9527;done

$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          8                    0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=344,fd=4))

$ sudo netstat -anpo | grep SYN_RECV | wc -l
4

$ sudo ss -anpt|grep 9527
LISTEN   0      8              0.0.0.0:9527         0.0.0.0:*     users:(("s",pid=344,fd=4))
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:53165
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:33241
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:50404
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:46060

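The queue holds 4 entries, so the 8 above comes down to which value ss displays: for a LISTEN socket, Send-Q reports the accept queue limit, min(backlog, net.core.somaxconn), not the SYN queue limit.

netstat -s shows how many SYNs were dropped: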

$ sudo netstat -s | grep -E "LISTEN|overflowed"
    85 SYNs to LISTEN sockets dropped
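The same counters can be read without netstat. As far as I know, `netstat -s` takes "times the listen queue of a socket overflowed" from the `ListenOverflows` field and "SYNs to LISTEN sockets dropped" from the `ListenDrops` field of `/proc/net/netstat`. A small sketch that parses that file (Linux only; the function name `parse_tcpext` is made up):

```python
import os


def parse_tcpext(text):
    """Parse /proc/net/netstat-style text and return the TcpExt counters as a dict.

    The file comes in line pairs: a header line with field names followed by
    a line with the matching values, both prefixed with the section name.
    """
    lines = text.splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = [int(v) for v in values.split()[1:]]
            return dict(zip(names, numbers))
    return {}


if __name__ == "__main__":
    path = "/proc/net/netstat"
    if os.path.exists(path):
        with open(path) as f:
            stats = parse_tcpext(f.read())
        for key in ("ListenOverflows", "ListenDrops"):
            print(key, stats.get(key))
```

Sampling these two values once per second during a test makes it easy to see which counter moves while the queues fill.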

Accept Queue

Maximum accept queue length:
min(backlog, net.core.somaxconn)

vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)

    # Never call accept(): completed handshakes pile up in the accept queue.
    while True:
        time.sleep(1)

if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

vm-2

import socket
import time

def connect_and_hold(host, port, count):
    cli_list = []
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.connect((host, port))
            cli_list.append(cli)
    except Exception as e:
        print(f"Failed to connect: {e}")

    while True:
        time.sleep(1)

if __name__ == '__main__':
    connect_and_hold('192.168.139.111', 9527, 10)

Clear the earlier iptables rules, then start the server and the client:

$ sudo netstat -s|grep -E "LISTEN|overflow"
    6 times the listen queue of a socket overflowed # connections dropped because the accept queue was full
    91 SYNs to LISTEN sockets dropped
vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        9      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      407/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54468   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54524   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54516   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54478   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54464   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54494   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54508   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54536   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54496   ESTABLISHED -                    off (0.00/0/0)

vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:54464   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54468   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54478   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54494   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54496   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54508   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54516   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54524   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54536   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      1 192.168.139.151:54538   192.168.139.111:9527    SYN_SENT    33260/python3        on (0.81/7/0)

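Right: the 10th connection on vm-2 is in SYN_SENT and retransmitting.

In other words, once the accept queue is full, new half-open connections are not accepted; the SYN is simply dropped.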
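The split between established and stuck connections can be estimated up front. The sketch below assumes, based on the output above (Recv-Q reached 9 with a backlog of 8), that one connection beyond min(backlog, net.core.somaxconn) still completes. That "+1" is an observation from this experiment and may differ on other kernels; both function names are hypothetical.

```python
def accept_queue_limit(backlog, somaxconn):
    """Accept queue limit according to the formula in this post."""
    return min(backlog, somaxconn)


def predict_connects(attempts, backlog, somaxconn):
    """Split `attempts` client connects into (established, stuck_in_syn_sent)
    for a server that listens but never calls accept().

    Assumption: the queue holds limit + 1 completed connections,
    as seen in the netstat output above.
    """
    capacity = accept_queue_limit(backlog, somaxconn) + 1
    established = min(attempts, capacity)
    return established, attempts - established


if __name__ == "__main__":
    print(predict_connects(attempts=10, backlog=8, somaxconn=8))
    print(predict_connects(attempts=6, backlog=8, somaxconn=8))
```

With 10 attempts against a backlog of 8 the model gives 9 established and 1 stuck in SYN_SENT; with 6 attempts all 6 are established and none are stuck.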

Now see what happens to the SYN queue when the accept queue is not full.

Change the connection count on vm-2 to 6.

vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        6      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      463/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51454   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51402   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51440   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51426   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51418   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51450   ESTABLISHED -                    off (0.00/0/0)

On vm-2, block the SYN+ACK coming from vm-1:

$ sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

$ nc 192.168.139.111 9527

vm-1 shows this SYN_RECV entry retrying, which means it entered the SYN queue. Because vm-2 blocks the packets from vm-1, vm-2 never sends an ACK and vm-1 keeps retrying.

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        6      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      463/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51454   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:40638   SYN_RECV    -                    on (12.04/4/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51402   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51440   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51426   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51418   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51450   ESTABLISHED -                    off (0.00/0/0)

So when the accept queue is not full, the SYN queue does accept new connections.

From the article:

net.ipv4.tcp_abort_on_overflow
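A value of 0 means that if the accept queue is full at the third step of the handshake, the ACK from the client is discarded. The client, however, has already sent its handshake packet and has moved to ESTABLISHED, ready to transfer data. After the server discards the ACK, the connection on its side stays in SYN_RECV (if the client sends data at this point, the server drops it and the client starts retransmitting, with the retry count governed by net.ipv4.tcp_retries2).

A value of 1 makes the server send an RST to the client and tear the connection down immediately.

To emphasize: this parameter only takes effect when a connection moves from the SYN queue to the accept queue. When the accept queue is already full, the kernel's default behavior is simply to drop new SYN packets (and there is currently no parameter to control this), which makes the client retransmit its SYN repeatedly.

net.ipv4.tcp_abort_on_overflow defaults to 0, and it is hard to test because it only applies while a connection moves from the SYN queue to the accept queue.

The third step of the handshake is vm-2 sending the ACK to vm-1. The ACK has to arrive while the accept queue is full, which means the queue was not yet full when the SYN+ACK went out, and by the time the ACK comes back some other, faster handshake has just filled vm-1's accept queue.

I tried sending a normal connection to vm-1 while its accept queue was full, to see the state on vm-1 and vm-2: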

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_abort_on_overflow.pcap --print

vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        9      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      538/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56822   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56802   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56786   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56840   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56838   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56796   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56790   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56780   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56818   ESTABLISHED -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:42424   192.168.139.111:9527    SYN_SENT    33284/nc             on (1.58/6/0)
tcp        0      0 192.168.139.151:56780   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56786   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56790   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56796   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56802   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56818   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56822   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56838   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56840   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      1 192.168.139.151:56844   192.168.139.111:9527    SYN_SENT    33283/python3        on (48.94/7/0)

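vm-1 stops at 9 established connections and shows no SYN_RECV, so with the accept queue full, new half-open requests are dropped outright.

On vm-2, the capture shows the python connection on port 56844 retransmitting its SYN (state SYN_SENT).

The nc connection on port 42424 does the same. Both retry 7 times, which is expected with my settings net.ipv4.tcp_syn_retries = 6 and net.ipv4.tcp_syn_linear_timeouts = 1.

The retransmissions reveal one more thing: vm-1 dropping vm-2's SYN with iptables and vm-1 dropping SYNs because the accept queue is full look identical in a capture. The difference is that one is a user action and the other is system behavior.

I rebooted vm-1 to restore the default kernel parameters and started an nginx; its default backlog is 511: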

$ sudo ss -lnt
State     Recv-Q    Send-Q       Local Address:Port       Peer Address:Port    Process
LISTEN    0         511                0.0.0.0:80              0.0.0.0:*
LISTEN    0         511                   [::]:80                 [::]:*

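If your nginx cannot handle connections at this point, the situation is roughly one of the following:

  1. It listens only on the lo interface, so it cannot serve external requests and connections are refused. The client falls back to TCP retries.
  2. It listens on the right interface, but iptables, a security group, or similar blocks the traffic. The client falls back to TCP retries.
  3. It listens on the right interface and iptables and security groups allow the traffic, but the accept queue is full. The system drops the connection.
  4. It listens on the right interface, iptables and security groups allow the traffic, and neither the accept queue nor the SYN queue is full. But the kernel parameters were changed as soon as the machine was provisioned, leaving the SYN queue too small. Under high concurrency, with basic system metrics all normal, will request handling break? (I was unsure about this one; it is tested below.)
  5. It listens on the right interface, iptables and security groups allow the traffic, and neither queue is full at first. But the machine's basic metrics are abnormal, for example CPU or memory at 100%, so accept is slow and the accept queue keeps piling up until the SYN queue fills as well. The machine eventually becomes unavailable.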
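From the client side, a refused connection and a silently dropped SYN look different, which helps narrow down the list above. A rough sketch of a probe that separates the two (the function name `probe` and the 3 second timeout are arbitrary choices of mine, not from the original experiments):

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify one TCP connect attempt as 'open', 'refused' or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # An RST came back: nothing is listening on that address and port.
        return "refused"
    except socket.timeout:
        # SYNs went unanswered: a firewall DROP rule or a full queue.
        return "timeout"


if __name__ == "__main__":
    print(probe("127.0.0.1", 9527, timeout=1.0))
```

A "refused" result points at the listener address (case 1), while a "timeout" cannot distinguish a firewall drop from a full queue (cases 2 and 3); for that the server-side counters from netstat -s are still needed.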

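Testing question 4: yes, it breaks. From the server I observed vm-2 sending a large number of TCP retries, while the system dropped many requests at the SYN queue.

I noticed this capture is incomplete, but that does not matter; the system dropping requests is the expected result.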

$ sudo sysctl -w net.ipv4.tcp_syncookies=0 net.ipv4.tcp_max_syn_backlog=4 net.core.somaxconn=8
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_syn_backlog = 4
net.core.somaxconn = 8
vm-2

$ wrk -t4 -c400 -d60s http://101.200.150.26
$ netstat -anpo|grep -E "Recv|80"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      28539/nginx: master  off (0.00/0/0)
tcp        0    862 172.22.7.89:80          x.x.x.x:55481    ESTABLISHED 28540/nginx: worker  on (0.31/0/0)
tcp        0    862 172.22.7.89:80          x.x.x.x:56121    ESTABLISHED 28541/nginx: worker  on (6.32/6/0)

$ netstat -s|grep -E "LISTEN|overflow"
    5517 times the listen queue of a socket overflowed
    78079 SYNs to LISTEN sockets dropped
$ netstat -s|grep -E "LISTEN|overflow"
    5517 times the listen queue of a socket overflowed
    78388 SYNs to LISTEN sockets dropped

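After TCP connections are established, nginx also retransmits, and Send-Q shows 862 bytes for some connections. The capture shows that 862 is exactly the TCP segment length of the response the server sent to vm-2. The server has sent it and is still waiting for vm-2's ACK, which is why Send-Q reads 862.

Without the kernel parameter changes, run the load test again: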

tcpdump -s0 -X -nn "tcp port 80" -w cloudserver-wrk-no-sysctl.pcap --print

This capture is complete.

Checking again for dropped requests: nothing. The netstat -s|grep -E "LISTEN|overflow" filter returns no lines at all.

TcpExt:
    2 invalid SYN cookies received
    10 resets received for embryonic SYN_RECV sockets
    42 TCP sockets finished time wait in fast timer
    273 packets rejected in established connections because of timestamp
    19 delayed acks sent
    Quick ack mode was activated 13960 times
    630 packet headers predicted
    64905 acknowledgments not containing data payload received
    16255 predicted acknowledgments
    TCPSackRecovery: 1741
    18 congestion windows recovered without slow start after partial ack