The experiment walkthrough comes from 知识星球:程序员踩坑案例分享 (Knowledge Planet: programmer pitfall case sharing)

TCP timeout retransmission

TCP time-based retransmission

vm-1

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-time-retries.pcap --print

vm-1-tcp-send-receive-time-retries.pcap

Once the connection is established, vm-2 drops the packets vm-1 sends:

$ sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

Send data from vm-1:

$ nc -k -l 192.168.139.111 9527
abc
abc
abc

$ while true;do sudo netstat -anpo|grep 9527|grep -v LISTEN; sleep 1;done
...
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (3.14/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (2.12/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (1.10/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.07/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.00/15/0)
tcp        0      4 192.168.139.111:9527    192.168.139.151:37506   ESTABLISHED 303/nc               on (0.00/15/0)

You can see 15 retransmissions, yet the capture contains 16 retransmitted packets.

tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO.

The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. If tcp_rto_max_ms is decreased, it is recommended to also change tcp_retries2.

RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.

The first of those packets is not really a retransmission: it is the original send, which the peer never acknowledged within the RTO, and that is what triggered the retransmissions.

47.086654 - 46.884133 = 0.202521 (the gap between the original send and the first retransmission, i.e. an initial RTO of about 200 ms)
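
As a sanity check on the 924.6 s figure from the docs, a minimal sketch, assuming an initial RTO of TCP_RTO_MIN = 200 ms that doubles on each retry and is capped at TCP_RTO_MAX = 120 s:

# hypothetical timeout for tcp_retries2 = N: the connection is killed at the
# (N+1)th RTO, so sum N+1 backoff periods (assumptions as in the lead-in)
def hypothetical_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))

print(hypothetical_timeout())  # ~924.6 seconds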

$ sudo sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

The article's descriptions of RTT and RTO:

RTT (Round Trip Time): the time from when a packet is sent until its reply comes back
RTO (Retransmission Timeout): how long to wait before retransmitting
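
For concreteness, the standard estimator (RFC 6298) derives the RTO from smoothed RTT samples; a minimal sketch of one update step, with the Linux-style floor of TCP_RTO_MIN applied at the end:

# RFC 6298: RTTVAR and SRTT are smoothed with gains beta=1/4 and alpha=1/8,
# and RTO = SRTT + 4 * RTTVAR, floored at rto_min (200 ms by default on Linux)
def update_rto(srtt, rttvar, sample, rto_min=0.2, alpha=1/8, beta=1/4):
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
    srtt = (1 - alpha) * srtt + alpha * sample
    return srtt, rttvar, max(srtt + 4 * rttvar, rto_min)

# with sub-millisecond RTTs like the 0.078 ms in the ss output below,
# the floor dominates, which is why ss reports rto:201
print(update_rto(0.000078, 0.000039, 0.000078)[2])  # 0.2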

Linux exposes the minimum via the tcp_rto_min_us kernel parameter:

tcp_rto_min_us - INTEGER
Minimal TCP retransmission timeout (in microseconds). Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN and TCP_RTO_MIN_US socket options, followed by this tcp_rto_min_us sysctl.

The recommended practice is to use a value less or equal to 200000 microseconds.

Possible Values: 1 - INT_MAX

Default: 200000

$ sudo sysctl net.ipv4.tcp_rto_min_us
net.ipv4.tcp_rto_min_us = 200000

The actual RTO is derived from TCP_RTO_MIN, and you can inspect it:

$ sudo ss -tip|grep -A 1 9527
ESTAB 0      0      192.168.139.111:9527 192.168.139.151:57060 users:(("nc",pid=303,fd=4))
	 cubic wscale:10,10 rto:201 rtt:0.078/0.039 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 segs_in:2 send 1.49Gbps lastsnd:21909 lastrcv:21909 lastack:21909 pacing_rate 2.97Gbps delivered:1 app_limited rcv_space:14480 rcv_ssthresh:64088 minrtt:0.078 snd_wnd:64512

After sending data from vm-1 to vm-2, check the RTO again; it increases after the retries:

$ sudo ss -tip|grep -A 1 9527
ESTAB 0      4      192.168.139.111:9527 192.168.139.151:57060 users:(("nc",pid=303,fd=4))
	 cubic wscale:10,10 rto:25728 backoff:7 rtt:0.078/0.039 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:7 bytes_sent:36 bytes_retrans:32 bytes_received:4 segs_out:18 segs_in:11 data_segs_out:9 data_segs_in:9 send 149Mbps lastsnd:11529 lastrcv:42622 lastack:16135 pacing_rate 297Mbps delivered:1 app_limited busy:38299ms unacked:1 retrans:1/8 lost:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.078 snd_wnd:64512
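
Note that the reported rto:25728 is exactly the base RTO left-shifted by the backoff counter, capped at TCP_RTO_MAX; a quick check of that reading:

# rto doubles per backoff: 201 ms * 2^7 = 25728 ms (backoff:7 in the ss output above)
assert min(201 * 2 ** 7, 120_000) == 25728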

TCP fast retransmit

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack3-wrong.pcap --print

vm-1-tcp-send-receive-dupack3-wrong.pcap

We do indeed see TCP Fast Retransmission, but not as in the article: vm-2 starts sending TCP Dup ACKs before vm-1 has finished sending, and 7 dup ACKs have gone out before the fast retransmit. Per net.ipv4.tcp_reordering (default 3), three dup ACKs should be enough to trigger a fast retransmit. My first guess was that transfers between my two local VMs are simply too fast. Here is the code for the vm-1 server and vm-2 client.

Before the test, vm-2 needs to drop the RST packets it would send to vm-1:

$ sudo iptables -A OUTPUT -p tcp --tcp-flags RST RST --dport 9527 -j DROP
vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) # disable Nagle's algorithm

    client.sendall(b"a" * 1460)
    time.sleep(0.1) # keep the stack from coalescing sends; not rigorous, but works well enough
    client.sendall(b"b" * 1460)
    time.sleep(0.1)
    client.sendall(b"c" * 1460)
    time.sleep(0.1)
    client.sendall(b"d" * 1460)
    time.sleep(0.1)
    client.sendall(b"e" * 1460)
    time.sleep(0.1)
    client.sendall(b"f" * 1460)
    time.sleep(0.1)
    client.sendall(b"g" * 1460)

    time.sleep(10000)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

Next, comment out the time.sleep(0.1) calls in the vm-1 server, keep the iptables rule, and retest:

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack3.pcap --print

vm-1-tcp-send-receive-dupack3.pcap

You can see the three TCP Dup ACKs all acknowledge the same sequence number, 1566486280.

The fast retransmit then resent the segment with seq 1566486280. This differs a bit from the article: in my capture, packet 9 is vm-2's ACK for the 4th data packet.
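
As a toy model of the trigger (the classic rule of thumb: three duplicate ACKs for the same number cause a fast retransmit; this is an illustration, not the kernel's actual logic):

def fast_retransmit_trigger(acks, dupthresh=3):
    """Return the ACK number that would trigger a fast retransmit, else None."""
    last, dup = None, 0
    for ack in acks:
        if ack == last:
            dup += 1
            if dup >= dupthresh:
                return ack  # resend the segment starting at this sequence number
        else:
            last, dup = ack, 0
    return None

# the original ACK plus three duplicates, as in the capture above
print(fast_retransmit_trigger([1566486280] * 4))  # 1566486280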

If you now kill the vm-1 server, the connection that was ESTABLISHED on vm-1 and vm-2 moves to FIN_WAIT1, but vm-1 keeps retransmitting; at this point it is RTO-based retransmission, which runs 15 times. Very TCP.

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      698/python3          off (0.00/0/0)
tcp        0   8760 192.168.139.111:9527    192.168.139.151:9528    ESTABLISHED 698/python3          on (108.19/12/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0   8761 192.168.139.111:9527    192.168.139.151:9528    FIN_WAIT1   -                    on (102.22/12/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0   8761 192.168.139.111:9527    192.168.139.151:9528    FIN_WAIT1   -                    on (82.74/12/0)

Here you can see the gap between fast retransmit and time-based retransmission:

Fast retransmit:

(screenshot 1)

Time-based:

(screenshot 2)

Test again, this time changing net.ipv4.tcp_reordering:

$ sudo sysctl net.ipv4.tcp_reordering
net.ipv4.tcp_reordering = 3

$ sudo sysctl -w net.ipv4.tcp_reordering=1
net.ipv4.tcp_reordering = 1
$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-dupack1.pcap --print

vm-1-tcp-send-receive-dupack1.pcap

TCP Selective Acknowledgment(SACK)

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-send-receive-scak.pcap --print

vm-1-tcp-send-receive-scak.pcap

The iptables rule on vm-2 still needs to stay in place.

vm-1

$ nc -k -l 192.168.139.111 9527

vm-2

import time
from scapy.all import *
from scapy.layers.inet import *


def main():
    ip = IP(dst="192.168.139.111")

    myself_seq = 1
    tcp = TCP(sport=9528, dport=9527, flags='S', seq=myself_seq, options=[("SAckOK", '')])
    print("send SYN, seq=0")
    resp = sr1(ip/tcp, timeout=2)
    if not resp:
        print("recv timeout")
        return

    resp_tcp = resp[TCP]
    if 'SA' in str(resp_tcp.flags):
        recv_seq = resp_tcp.seq
        recv_ack = resp_tcp.ack
        print(f"received SYN, seq={recv_seq}, ACK={recv_ack}")

        myself_seq += 1
        send_ack = recv_seq + 1
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        print(f"send ACK={send_ack}")
        send(ip/tcp)

        # intentionally commented out so the sent data has a hole
        # send data
        # payload = b"a" * 10
        # tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        # send(ip/tcp/payload)
        myself_seq += 10

        payload = b"b" * 10
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        send(ip/tcp/payload)
        myself_seq += 10

        # intentionally commented out so the sent data has a hole
        # payload = b"c" * 10
        # tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        # send(ip/tcp/payload)
        myself_seq += 10

        payload = b"d" * 10
        tcp = TCP(sport=9528, dport=9527, flags='A', seq=myself_seq, ack=send_ack)
        send(ip/tcp/payload)


    elif 'R' in str(resp_tcp.flags):
        print(f"received RST")
    else:
        print("received different TCP flags")

    time.sleep(100)


if __name__ == "__main__":
    main()

This time vm-2 acts as the sender. Parts of the original article were hard to follow, so I asked GPT; it explained the holes in the sent data and the SACK options vm-1 sends back.

What “leaving holes in the data” actually means (what happens on the wire)

When TCP data arrives in order, the receiver acknowledges cumulatively (cumulative ACK); the ACK value is the sequence number right after the last contiguous, gap-free byte.

When you skip some middle segments (the commented-out a and c) but send the later segments b and d, the receiver sees out-of-order data: it can accept and buffer b and d, but it cannot advance the cumulative ACK, because the missing a blocks the contiguous sequence.

If the receiver supports SACK, it attaches SACK options to its ACKs, telling the sender which non-contiguous blocks it already has (e.g. [seq_b, seq_b+9], [seq_d, seq_d+9]), so the sender can retransmit only the missing blocks (a, c) rather than the whole stream.

In short: the “holes” you create simulate middle packets being lost or never sent, in order to trigger SACK behavior (the receiver uses SACK to report the non-contiguous blocks it has received).

Looking at the capture: packet 4, sent by vm-2, is an ACK with seq number 11, so the next seq number should be 21; packet 6 has seq number 32 with next seq number 41.

Then look at the SACK blocks in packets 5 and 7:

In other words, bytes 0-10 and 21-30 are missing and need to be retransmitted by the sender.
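
A small sketch of the bookkeeping this enables on the sender side (the numbers reuse the relative ranges from this capture and are purely illustrative):

def sack_holes(cum_ack, sack_blocks):
    """Given the cumulative ACK and SACK blocks as (left, right) pairs,
    return the missing ranges the sender still needs to retransmit."""
    holes, edge = [], cum_ack
    for left, right in sorted(sack_blocks):
        if left > edge:
            holes.append((edge, left))
        edge = max(edge, right)
    return holes

# receiver holds blocks b and d, but a and c never arrived
print(sack_holes(0, [(11, 21), (31, 41)]))  # [(0, 11), (21, 31)]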

Later, if I hit a non-simulated case, comparing it against this artificial one should be instructive.

TCP window management

gso/tso on (the default)

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale.pcap --print

vm-1-tcp-window-scale.pcap

Server that never reads, and client that sends in a loop:

vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    time.sleep(10000)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)
    
-----------------------------------------------------------------
vm-2

import socket
import time

def start_client(host, port):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((host, port))
    client.setblocking(False)

    send_size = 0
    data = b"a" * 100000
    while True:
        try:
            size = client.send(data)
            if size > 0:
                send_size += size
                print(f"send_size: {send_size}")
        except BlockingIOError:
            time.sleep(0.1)
            pass

if __name__ == '__main__':
    start_client('192.168.139.111', 9527)

You can see that most packets vm-2 sends exceed the MTU. Checking generic-segmentation-offload / generic-receive-offload / tcp-segmentation-offload, vm-1 and vm-2 are configured identically.

$ sudo ethtool -k eth0|grep -E "generic-segmentation-offload|generic-receive-offload"
generic-segmentation-offload: on
generic-receive-offload: off

$ sudo ethtool -k eth0|grep tcp-segmentation-offload
tcp-segmentation-offload: on

GPT's explanation is below; put simply, packets are batched into larger units.

Feature                              Direction  Performed in     Meaning                                        Hardware?
TSO (TCP Segmentation Offload)       send       NIC hardware     TCP segmentation of large packets on the NIC  hardware
GSO (Generic Segmentation Offload)   send       kernel/driver    software emulation of TSO                      software
GRO (Generic Receive Offload)        receive    kernel/driver    merges multiple packets into one larger one    software

Meanwhile vm-1's Recv-Q and vm-2's Send-Q both build up. This ties in nicely with the nginx experiment at the end of “TCP connection establishment”: the cloud nginx sends data to vm-2, and until vm-2 returns an ACK, the Send-Q stays non-zero.

In this code, vm-2 keeps sending while vm-1 accepts the connection but never reads:

vm-1

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      335/python3          off (0.00/0/0)
tcp   123976      0 192.168.139.111:9527    192.168.139.151:45010   ESTABLISHED 335/python3          off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0 706624 192.168.139.151:45010   192.168.139.111:9527    ESTABLISHED 295/python3          probe (3.53/0/0)

gso/tso off

Disable both on both machines:

$ sudo ethtool -K eth0 gso off
$ sudo ethtool -K eth0 tso off
$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale-disable-gso-tso.pcap --print

vm-1-tcp-window-scale-disable-gso-tso.pcap

This time it looks normal; the data still piles up as expected. One note: with gso/tso off, the capture contains nearly 100 more packets than with it on, so it is less efficient.

Window changes

$ sudo tcpdump -S -s0 -X -nn "tcp port 9527" -w vm-1-tcp-window-scale-recv.pcap --print

vm-1-tcp-window-scale-recv.pcap

Modified server code:

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)
    client, _ = server.accept()
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) # disable Nagle's algorithm

    while True:
        for i in range(5):
            client.recv(4096)

        time.sleep(1)


if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

This chart helps a lot with understanding; quoting the original article:

A quick explanation of the graph: the X axis is time and the Y axis is sequence number. The green line is the receiver's window size, the blue line is the packets sent (as captured), and the yellow line is the ACKed sequence number. From the graph you can see that whenever the receiver's window size grows, packets go out immediately, and when the window size reaches 0 (the line goes flat), sending stops immediately. So the graph tells us the receiver is the bottleneck of this transfer: the receiver throttles the sender by closing the window.

This graph offers more clickable points. For example, clicking a point on the X axis jumps straight to the corresponding packet: Frame 112 is a keep-alive and the next packet, 113, is TCP ZeroWindow; packets 116 through 132 are vm-2 sending data to vm-1 (the rising stretch on the Y axis shows data going out), while 113-116 is where vm-1 has no window left to receive. This matches the code, which reads for a while and then pauses; in a real service this is a server that processes too slowly.

tcp window full - the sender has filled the receiver's window
tcp zerowindow - the receiver cannot take any more data
tcp window update - buffer space is available again, please continue sending

Packets 110 through 116 are the first flat stretch on the X axis, where no data is sent; in packet 111 vm-1 tells vm-2 it cannot accept any more data, until packet 115, where vm-1 sends a window update telling vm-2 it may resume sending.

wireshark-tcp-window-zero-update-full

TCP congestion control

Skipping experiments for this part for now; I have had little exposure to it.

The experiment walkthrough comes from 知识星球:程序员踩坑案例分享 (Knowledge Planet: programmer pitfall case sharing)

Closing a connection

As before, vm-2 connects to vm-1, and then vm-2, as the client, closes the connection.

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close.pcap --print

vm-1-tcp_close.pcap

A perfectly normal three-way handshake followed by a four-way close; everything as expected.

Looking at vm-2's connection state: it enters TIME_WAIT normally and waits 2 * MSL = 60s; the timer shows no retries, it is just waiting.

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    ESTABLISHED 34043/nc             off (0.00/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (57.43/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (46.88/0/0)

$ sudo netstat -anpo|grep Recv-Q;sudo netstat -anpo|grep 9527
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:58966   192.168.139.111:9527    TIME_WAIT   -                    timewait (39.88/0/0)

Short-lived connections and TIME_WAIT

Test: raise tw_buckets, disable tw_reuse

$ sudo sysctl -w net.ipv4.tcp_max_tw_buckets=1000000
net.ipv4.tcp_max_tw_buckets = 1000000

$ sudo sysctl -w net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_tw_reuse = 0

$ cat loopconnect.py
import socket

def connect_and_immediately_disconnect(host, port, count):
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.connect((host, port))
            cli.close()
    except Exception as e:
        print(f"Failed to connect: {e}")

if __name__ == '__main__':
    connect_and_immediately_disconnect('192.168.139.111', 9527, 70000)

Testing then showed the machine's TIME_WAIT count plateauing around 5000, and I could not reproduce the no-available-address error. I shrank the parameter five-fold and also narrowed the local port range; at 233 TIME_WAIT sockets the no-available-address error appeared:

$ sudo sysctl -w net.ipv4.ip_local_port_range="32768 33000"
net.ipv4.ip_local_port_range = 32768 33000

$ python3 loopconnect.py
Failed to connect: [Errno 99] Cannot assign requested address

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
233
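
Note the arithmetic: 33000 - 32768 + 1 = 233, exactly the size of the configured port range, so every usable local port was stuck in TIME_WAIT when the error appeared.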

Test: enable tw_reuse

Then enable net.ipv4.tcp_tw_reuse and widen the local port range a bit:

$ sudo sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 32768	38414
net.ipv4.tcp_max_tw_buckets = 200000

Watching the TIME_WAIT count: it stays a bit above 2000, and the script runs to completion and exits normally.

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
2287
2301
2303
2303
2269
2291
2292
2322

Test: disable tw_reuse, lower tw_buckets

$ sudo sysctl -w net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_tw_reuse = 0

$ sudo sysctl -w net.ipv4.tcp_max_tw_buckets=1000
net.ipv4.tcp_max_tw_buckets = 1000

$ sudo netstat -anpo|grep 9527|grep timewait|wc -l
1000
1000
1000

Largely the same: TIME_WAIT mostly reads 1000, occasionally dipping into the 900s while watching. My guess is that after 2 * MSL expires, local addresses are released and get used again; with tw_reuse off they are never reused early, the client just waits for an address to become available before continuing.

The connect script makes 14000 connections, so 11353 of them were handled by the system directly instead of waiting out TIME_WAIT's 60s:

$ sudo netstat -s|grep TCPTimeWaitOverflow
    TCPTimeWaitOverflow: 11353

Observing FIN1

vm-2 connects to vm-1; once connected, vm-1 drops the FIN that vm-2 sends, so after sending one FIN vm-2 enters the FIN_WAIT1 state.

$ sudo iptables -A INPUT -p tcp --dport 9527 --tcp-flags FIN FIN -j DROP
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-fin1.pcap --print

vm-1-tcp_close-iptables-fin1.pcap

vm-2 sends a FIN to vm-1, vm-1 drops it outright, and vm-2, never receiving vm-1's FIN+ACK, keeps retransmitting.

vm-2's network state:

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:37356   192.168.139.111:9527    FIN_WAIT1   -                    on (2.52/4/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:37356   192.168.139.111:9527    FIN_WAIT1   -                    on (34.57/8/0)

These 9 transmissions are governed by net.ipv4.tcp_orphan_retries, which defaults to 8:

tcp_orphan_retries - INTEGER
This value influences the timeout of a locally closed TCP connection, when RTO retransmissions remain unacknowledged. See tcp_retries2 for more details.

The default value is 8.

If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans.

Observing FIN2 and LAST_ACK

Again vm-2 connects to vm-1, then we use iptables on vm-2 to intercept the FIN and close vm-2's connection. vm-2 enters FIN_WAIT1 as soon as it sends its FIN; when vm-1's ACK arrives, vm-2 moves to FIN_WAIT2. Since we intercept vm-1's FIN, we can observe vm-2 in FIN_WAIT2 and vm-1 in LAST_ACK.

$ sudo iptables -A INPUT -p tcp --sport 9527 --tcp-flags FIN FIN -j DROP
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-fin2.pcap --print

vm-1-tcp_close-iptables-fin2.pcap

You can see vm-2's connection enter FIN_WAIT2, with the Timer column showing timewait (Ns/0/0); N counts down from 60, controlled by tcp_fin_timeout. This is not retransmission, just the FIN_WAIT2 timeout: after 60s the connection disappears.

$ sudo sysctl -a|grep tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:36456   192.168.139.111:9527    FIN_WAIT2   -                    timewait (57.45/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:36456   192.168.139.111:9527    FIN_WAIT2   -                    timewait (54.12/0/0)

vm-1's side in LAST_ACK; you can see it also retransmits 9 times:

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1091/nc              off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:36456   ESTABLISHED 1091/nc              off (0.00/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1091/nc              off (0.00/0/0)
tcp        0      1 192.168.139.111:9527    192.168.139.151:36456   LAST_ACK    -                    on (23.09/7/0)

Observing CLOSE_WAIT

Connect using Python:

import socket

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect(('192.168.139.111', 9527))
c.shutdown(socket.SHUT_WR)  # send our FIN; keep the read side open

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-closewait-py.pcap --print

vm-1-tcp_close-iptables-closewait-py.pcap

You can see that after vm-2 sends its FIN it enters FIN_WAIT1; vm-1 replies with an ACK and stops there. On receiving that ACK vm-2 moves to FIN_WAIT2, while vm-1, having sent only the ACK, enters CLOSE_WAIT.

As before, FIN_WAIT2 waits 60s and then the state simply ends, without passing through TIME_WAIT. This does not match the article: here vm-2's FIN_WAIT2 gets the 60s wait, whereas in the article it does not. Meanwhile vm-1's CLOSE_WAIT just stays there.

vm-2

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:34774   192.168.139.111:9527    FIN_WAIT2   -                    timewait (55.39/0/0)

$ sudo netstat -anpo|grep -E "Recv|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:34774   192.168.139.111:9527    FIN_WAIT2   -                    timewait (31.64/0/0)

-----------------------------------------------------------------
vm-1

$ sudo netstat -anpo|grep 9527
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1268/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:34774   CLOSE_WAIT  -                    off (0.00/0/0)

Test: change net.ipv4.tcp_fin_timeout

$ sudo sysctl -w net.ipv4.tcp_fin_timeout=30
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-iptables-closewait-py-30s-timeout.pcap --print

vm-1-tcp_close-iptables-closewait-py-30s-timeout.pcap

vm-1

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        2      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1268/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:37626   CLOSE_WAIT  -                    off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:34774   CLOSE_WAIT  -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37626   192.168.139.111:9527    FIN_WAIT2   -                    timewait (26.30/0/0)

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37626   192.168.139.111:9527    FIN_WAIT2   -                    timewait (0.49/0/0)

It seems there is no way around the wait; the duration only tracks net.ipv4.tcp_fin_timeout. It may also depend on the kernel, so here is mine:

Linux vm-2 6.17.4-orbstack-00308-g195e9689a04f #1 SMP PREEMPT Fri Oct 24 07:22:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Connection keepalive

Default keepalive parameters:

$ sudo sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

import socket
import time

def connect_and_hold(host, port, count):
    cli_list = []
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            cli.connect((host, port))
            cli_list.append(cli)
    except Exception as e:
        print(f"Failed to connect: {e}")

    while True:
        time.sleep(1)

if __name__ == '__main__':
    connect_and_hold('192.168.139.111', 9527, 1)
$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-vm-2-keepalive.pcap --print

vm-1-tcp_close-vm-2-keepalive.pcap

With the connection idle after setup, vm-2 sends vm-1 a TCP Keep-Alive every 75s, per net.ipv4.tcp_keepalive_intvl:

vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37118   192.168.139.111:9527    ESTABLISHED 25305/python3        keepalive (71.33/0/0)

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:37118   192.168.139.111:9527    ESTABLISHED 25305/python3        keepalive (70.62/0/0)

I had grok tweak the Python script:

import socket
import time  # imported so the script can pause while we inspect the connection state

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# keepalive parameters: TCP_KEEPIDLE is the idle time before probing (the per-socket analog of net.ipv4.tcp_keepalive_time), in seconds

c.connect(('192.168.139.111', 9527))

# pause here so the connection can be inspected with netstat -anpo or ss -anto
# (which shows the keepalive timer, e.g. timer:keepalive (10.000 sec))
print("Connection established, press Enter to exit...")
input()  # or time.sleep(60) to wait 60 seconds automatically
c.close()
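
The comment above mentions TCP_KEEPIDLE but the script never actually sets it; the sysctls below are changed system-wide instead. For reference, a per-socket version might look like this (a sketch; these constants exist in Python's socket module on Linux):

import socket

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)   # idle time before probing, like tcp_keepalive_time
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 20)  # interval between probes, like tcp_keepalive_intvl
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)     # probes before giving up, like tcp_keepalive_probes
c.connect(('192.168.139.111', 9527))
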
$ sudo sysctl -a|grep keep
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 10

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_close-vm-2-keepalive-10s.pcap --print

vm-1-tcp_close-vm-2-keepalive-10s.pcap

A normal three-way handshake, then vm-2 sends a Keep-Alive and vm-1 replies; packets 3 and 4 are 10s apart, which is net.ipv4.tcp_keepalive_time. With no data flowing between vm-1 and vm-2, packet 6 is another Keep-Alive from vm-2 20s later, which is net.ipv4.tcp_keepalive_intvl. Finally, pressing Enter on vm-2 closes the connection: vm-2 sends FIN+ACK, vm-1 replies with an ACK but no FIN, so vm-1 ends up in CLOSE_WAIT while vm-2 sits in FIN_WAIT2 and disappears once the 30s fin timeout passes.

vm-1

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1692/python3         off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:33360   ESTABLISHED -                    off (0.00/0/0)

$ sudo netstat -anpo | grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        1      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      1692/python3         off (0.00/0/0)
tcp        1      0 192.168.139.111:9527    192.168.139.151:33360   CLOSE_WAIT  -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    ESTABLISHED 27528/python3        keepalive (7.38/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    ESTABLISHED 27528/python3        keepalive (17.59/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    FIN_WAIT2   -                    timewait (26.96/0/0)

$ sudo netstat -anpo|grep 9527
tcp        0      0 192.168.139.151:33360   192.168.139.111:9527    FIN_WAIT2   -                    timewait (23.06/0/0)

The TCP_CLOSE state

I couldn't find a suitable diagram, so I drew one:

tcp-close

linux

sudo apt update
sudo apt install -y tcpdump net-tools iptables wget nmap telnet man lsof ipvsadm ipset btop git ca-certificates tmux jq curl gpg file

cloud-init

#cloud-config
package_update: true
package_upgrade: true

packages:
  - bridge-utils
  - make
  - cmake
  - pkg-config
  - build-essential
  - wget
  - net-tools
  - inetutils-ping
  - dnsutils
  - vim
  - unzip
  - zip
  - sysstat
  - git
  - iproute2
  - ca-certificates
  - telnet
  - curl
  - htop
  - tcpdump
  - gnupg
  - lsb-release
  - btrfs-progs
  - libssl-dev
  - m4
  - gcc
  - g++
  - clang
  - fuse
  - tree
  - iftop
  - jq
  - gpg
  - man
  - tmux
  - traceroute
  - ethtool
  - lz4
  - btop
  - gh
  - qemu-guest-agent
  - supervisor

bootcmd:
  - mkdir -p /opt/docker/cadvisor
  - mkdir -p /opt/docker/node_exporter

write_files:
  - path: /opt/docker/node_exporter/compose.yaml
    owner: root:root
    permissions: 0o755
    defer: true
    content: |
      services:
        node_exporter:
          image: quay.io/prometheus/node-exporter:latest
          container_name: node_exporter
          command:
            - '--web.listen-address=:10000'
            - '--path.rootfs=/host'
          network_mode: host
          pid: host
          restart: unless-stopped
          volumes:
            - '/:/host:ro,rslave'
  
  - path: /opt/docker/cadvisor/compose.yaml
    owner: root:root
    permissions: 0o755
    content: |
      services:
        cadvisor:
          image: gcr.io/cadvisor/cadvisor:latest
          container_name: cadvisor
          privileged: true
          devices:
            - /dev/kmsg:/dev/kmsg
          volumes:
            - /:/rootfs:ro
            - /var/run:/var/run:ro
            - /sys:/sys:ro
            - /var/lib/docker/:/var/lib/docker:ro
            - /dev/disk/:/dev/disk:ro
          ports:
            - "10001:8080"
          restart: unless-stopped

runcmd:
  - install -m 0755 -d /etc/apt/keyrings
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
  - chmod a+r /etc/apt/keyrings/docker.asc
  - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
  - apt update
  - apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  - systemctl enable docker
  - systemctl enable qemu-guest-agent
  - systemctl start qemu-guest-agent
  - systemctl start docker
  - docker compose -f /opt/docker/node_exporter/compose.yaml up -d
  - docker compose -f /opt/docker/cadvisor/compose.yaml up -d

users:
  - default
  - name: orange723
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    groups: sudo, sys, root, adm, docker
    shell: /bin/bash
    homedir: /home/orange723
    lock_passwd: true
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC

no_ssh_fingerprints: true

ssh:
  emit_keys_to_console: false

golang

g

# It is recommended to clear the `GOROOT`, `GOBIN`, and other environment variables before installation.
$ curl -sSL https://raw.githubusercontent.com/voidint/g/master/install.sh | bash
$ cat << 'EOF' >> ~/.bashrc
# Check if the alias 'g' exists before trying to unalias it
if [[ -n $(alias g 2>/dev/null) ]]; then
    unalias g
fi
EOF
$ source "$HOME/.g/env"

$ g ls-remote stable
$ g install stable

node

nvm

$ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

$ nvm install --lts
$ nvm use --lts

When using nvm, watch out for multiple versions: e.g. if gemini was installed under node version A but you are currently on version B, invoking it will report command not found.

The experiment walkthrough comes from 知识星球:程序员踩坑案例分享 (Knowledge Planet: programmer pitfall case sharing)

Establishing a connection

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp.pcap --print

vm-1-tcp.pcap

I hit a problem at the time: with "nc -k -l vm-1 9527" on vm-1, nothing showed up in vm-1's terminal when vm-2 connected.

The hosts file on each VM contains the peer's hostname and IP:

vm-1
198.19.249.151 vm-2

vm-2
198.19.249.111 vm-1

Capture on vm-1:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost.pcap --print

vm-1-tcp-nc-localhost.pcap

Then check what vm-1 is listening on:

sudo netstat -anpt

It went to 127.0.0.1; the other interface is not being listened on.

vm-2 sends a SYN to vm-1, vm-1 replies with an RST straight away, and then vm-2 keeps retrying per net.ipv4.tcp_syn_retries:

sudo sysctl -a|grep net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6

But my capture shows 10 retransmissions, 11 packets in total.

Capture again to verify:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost-retries.pcap --print

vm-1-tcp-nc-localhost-retries.pcap

Strange: that does not match net.ipv4.tcp_syn_retries=6, which should produce only 7 packets, one normal SYN plus 6 retries.

There is another observation: exponential backoff should normally go 1, 2, 4, 8, but the first 4 retransmissions in the capture are each 1s apart, and only the 5th retry waits 2s. Searching on this behavior and the current kernel version turned up:

net.ipv4.tcp_syn_linear_timeouts

tcp_syn_linear_timeouts - INTEGER
The number of times for an active TCP connection to retransmit SYNs with a linear backoff timeout before defaulting to an exponential backoff timeout. This has no effect on SYNACK at the passive TCP side.

With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts, and the first exponential backoff using 2^0 * initial_RTO). Default: 4

Now it adds up. Change net.ipv4.tcp_syn_linear_timeouts and keep testing:
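
To make the packet counts concrete, a small sketch, assuming the linear timeouts are spent in addition to the retries counted by tcp_syn_retries (which matches the captures here; initial RTO = 1s):

# expected SYN retransmission intervals, in seconds, for given sysctl values
def syn_rtos(linear_timeouts=4, syn_retries=6):
    return [1] * linear_timeouts + [2 ** i for i in range(syn_retries)]

print(syn_rtos(4, 6))  # [1, 1, 1, 1, 1, 2, 4, 8, 16, 32]: 10 retransmissions, 11 packets
print(syn_rtos(1, 6))  # [1, 1, 2, 4, 8, 16, 32]: 7 retransmissions, 8 packets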

sudo sysctl -w net.ipv4.tcp_syn_retries=6 net.ipv4.tcp_syn_linear_timeouts=1

It was 10 retransmissions before; now it should be 7:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-nc-localhost-retries-syn-linear.pcap --print

vm-1-tcp-nc-localhost-retries-syn-linear.pcap

while true;do sudo netstat -anpo|grep 9527;sleep 1;done

Sure enough, it stopped automatically after 7.

After switching to nc -k -l 192.168.139.111 9527, it connected straight away.

Observing SYN_SENT

On vm-1, use iptables to drop the SYN packets coming from vm-2:

sudo iptables -A INPUT -p tcp --dport 9527 -j DROP
sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-drop-9527.pcap --print

vm-1-tcp-iptables-drop-9527.pcap

You can see that this time it shows as TCP Retransmission, retransmitted 10 times, still controlled by the same two parameters:

net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_linear_timeouts

And vm-2's connection state is SYN_SENT.

Observing SYN_RECV

vm-2 needs to drop the SYN+ACK coming from vm-1: without the SYN+ACK, vm-2 cannot reply with an ACK, and vm-1 cannot complete the three-way handshake.

sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

Switch to nmap for the connection test:

sudo nmap -sS 192.168.139.111 -p 9527

Watch the connection state on vm-1:

while true;do sudo netstat -anpo|grep 9527;sleep 1;done
sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-vm2-drop-9527.pcap --print

vm-1-tcp-iptables-vm2-drop-9527.pcap

You can see vm-2 >(SYN) vm-1, then vm-1 >(SYN+ACK) vm-2, and then vm-1 retrying over and over, 5 times.

net.ipv4.tcp_synack_retries defaults to 5:

tcp_synack_retries - INTEGER
Number of times SYNACKs for a passive TCP connection attempt will be retransmitted. Should not be higher than 255. Default value is 5, which corresponds to 31 seconds till the last retransmission with the current initial RTO of 1 second. With this the final timeout for a passive TCP connection will happen after 63 seconds.
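
(The arithmetic: the retransmissions are sent after backoffs of 1 + 2 + 4 + 8 + 16 = 31 seconds; the fifth retransmission then waits its own 32 seconds before giving up, for a final timeout of 63 seconds.)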

The article mentions: SYN flood

A client sends one SYN to the server; if the client never responds, the server retries 5 times. That is 5 per client, and with enough machines the server's resources are consumed very quickly.

The article mentions: if you only use iptables to intercept the second handshake packet, the source stack will retransmit the SYN, so you cannot test SYN+ACK retransmission. The sender must therefore do nothing after sending the SYN. nc cannot send just a SYN and exit, so the experiment switches to nmap.

Reproduce it with nc and capture:

sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp-iptables-vm2-drop-9527-nc.pcap --print

vm-1-tcp-iptables-vm2-drop-9527-nc.pcap

Sure enough, vm-2 is retransmitting.

SYN Queue

Borrowing the diagram from the article:

(图片来自:https://www.emqx.com/en/blog/emqx-performance-tuning-tcp-syn-queue-and-accept-queue)

Verify the SYN (half-open) queue length by changing the relevant kernel parameters:

sudo sysctl -w net.ipv4.tcp_syncookies=0 net.ipv4.tcp_max_syn_backlog=4 net.core.somaxconn=8

Test from vm-2:

while true;do sudo nmap -sS 192.168.139.111 -p 9527;done

Check the state on vm-1; once again it does not match the parameters we set:

$ sudo netstat -anpo|grep RECV
tcp        0      0 192.168.139.111:9527    192.168.139.151:35013   SYN_RECV    -                    on (1.82/2/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:57984   SYN_RECV    -                    on (1.76/2/0)

The SYN queue length is derived as follows:

min(backlog, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog)

Neither syn_backlog nor somaxconn was set to 2; the only factor left is the backlog, which we never changed since nc was used directly:

nc -k -l 192.168.139.111 9527

$ sudo ss -anpt
State     Recv-Q    Send-Q         Local Address:Port       Peer Address:Port   Process
LISTEN    0         1            192.168.139.111:9527            0.0.0.0:*       users:(("nc",pid=37710,fd=3))

A note on ss's Send-Q:

High Send-Q means the data is put on TCP/IP send buffer, but it is not sent or it is sent but not ACKed

i.e. the data sits in the TCP/IP send buffer, either unsent or sent but not yet ACKed.

Compared with our situation: vm-2 intercepted the SYN+ACK from vm-1 and never replied with an ACK.

In other words, nc sets a backlog of 1, so the server's queue only allows one pending connection (for a LISTEN socket, ss's Send-Q column shows this backlog limit).

Let's write one in Go:

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Listen("tcp4", "0.0.0.0:9527")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	log.Println("listen :9527 success")

	for {
		time.Sleep(time.Second * 10)
	}
}

$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          8                    0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=37717,fd=4))

You can see Send-Q is 8. Per the formula min(backlog, net.core.somaxconn, net.ipv4.tcp_max_syn_backlog): somaxconn is 8 and tcp_max_syn_backlog is 4.

Restore the kernel parameters to defaults and look at the Go server's default backlog:

$ sudo sysctl -a|grep tcp_syncookies;sudo sysctl -a|grep max_syn_backlog;sudo sysctl -a|grep net.core.somaxconn
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 512
net.core.somaxconn = 4096

$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          4096                 0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=318,fd=4))

The only remaining puzzle: the minimum should be 4, yet ss -anpt shows 8. Let's test with actual connections (remember to restart the service after changing kernel parameters):

$ while true;do sudo nmap -sS 192.168.139.111 -p 9527;done

$ sudo ss -anpt
State     Recv-Q     Send-Q         Local Address:Port         Peer Address:Port    Process
LISTEN    0          8                    0.0.0.0:9527              0.0.0.0:*        users:(("s",pid=344,fd=4))

$ sudo netstat -anpo | grep SYN_RECV | wc -l
4

$ sudo ss -anpt|grep 9527
LISTEN   0      8              0.0.0.0:9527         0.0.0.0:*     users:(("s",pid=344,fd=4))
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:53165
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:33241
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:50404
SYN-RECV 0      0      192.168.139.111:9527 192.168.139.151:46060

The queue really holds 4, so the 8 above is just a matter of which limit gets displayed: ss's Send-Q on a LISTEN socket shows min(backlog, somaxconn) = 8, while the SYN_RECV count is additionally capped by tcp_max_syn_backlog = 4.

netstat -s shows how many SYNs were dropped:

$ sudo netstat -s | grep -E "LISTEN|overflowed"
    85 SYNs to LISTEN sockets dropped

Accept Queue

Maximum length of the accept (full-connection) queue:
min(backlog, net.core.somaxconn)

vm-1

import socket
import time

def start_server(host, port, backlog):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(backlog)

    while True:
        time.sleep(1)

if __name__ == '__main__':
    start_server('192.168.139.111', 9527, 8)

vm-2

import socket
import time

def connect_and_hold(host, port, count):
    cli_list = []
    try:
        for i in range(count):
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            cli.connect((host, port))
            cli_list.append(cli)
    except Exception as e:
        print(f"Failed to connect: {e}")

    while True:
        time.sleep(1)

if __name__ == '__main__':
    connect_and_hold('192.168.139.111', 9527, 10)

Clear the earlier iptables rules and start the two sides for the test:

$ sudo netstat -s|grep -E "LISTEN|overflow"
    6 times the listen queue of a socket overflowed # packets dropped from the accept queue
    91 SYNs to LISTEN sockets dropped

vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        9      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      407/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54468   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54524   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54516   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54478   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54464   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54494   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54508   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54536   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:54496   ESTABLISHED -                    off (0.00/0/0)

vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.139.151:54464   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54468   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54478   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54494   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54496   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54508   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54516   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54524   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:54536   192.168.139.111:9527    ESTABLISHED 33260/python3        off (0.00/0/0)
tcp        0      1 192.168.139.151:54538   192.168.139.111:9527    SYN_SENT    33260/python3        on (0.81/7/0)

Sure enough, vm-2's 10th connection is in SYN_SENT, retransmitting.

That is, when the accept queue is full, new connections are not admitted to the SYN queue; they are dropped outright.

Now observe what happens to the SYN queue when the accept queue is not full.

Change vm-2's connection count to 6.

vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        6      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      463/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51454   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51402   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51440   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51426   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51418   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51450   ESTABLISHED -                    off (0.00/0/0)

vm-2 intercepts the SYN+ACK coming from vm-1:

$ sudo iptables -A INPUT -p tcp --sport 9527 -j DROP

$ nc 192.168.139.111 9527

On vm-1 you can see this SYN_RECV entry retrying, meaning it did enter the SYN queue. Because vm-2 intercepts the packets from vm-1 and never sends back an ACK, vm-1 keeps retrying:

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        6      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      463/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51454   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:40638   SYN_RECV    -                    on (12.04/4/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51402   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51440   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51426   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51418   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:51450   ESTABLISHED -                    off (0.00/0/0)

So when the accept queue is not full, the SYN queue does accept new entries.

The article's description:

net.ipv4.tcp_abort_on_overflow
With a value of 0: when the third step of the handshake arrives and the accept queue is full, the client's ACK is simply thrown away. The client, though, has already sent its handshake packet and moved to ESTABLISHED, ready to transmit data. After the server drops the ACK, the connection on the server remains in SYN_RECV (if the client sends data now, the server drops that too, and the client starts retransmitting, with the retry count governed by the kernel's net.ipv4.tcp_retries2).

With a value of 1: the server sends the client an RST and tears the connection down immediately.

To emphasize: this parameter only takes effect while a connection is moving from the SYN queue to the accept queue. When the accept queue is already full, the kernel's default behavior is simply to drop new SYN packets (and there is currently no parameter to change that), which leaves the client retransmitting its SYN.

net.ipv4.tcp_abort_on_overflow defaults to 0. It is hard to test because it only applies while a connection moves from the SYN queue to the accept queue.

Also, step three of the handshake is vm-2 sending its ACK to vm-1. To test this you need the ACK to arrive exactly while the accept queue is full: the queue was not full when the SYN+ACK went out, but by the time the ACK returns, some faster handshake has just filled vm-1's accept queue.

I tried sending one normal connection to vm-1 while its accept queue was full, to see the resulting states on vm-1 and vm-2:

$ sudo tcpdump -s0 -X -nn "tcp port 9527" -w vm-1-tcp_abort_on_overflow.pcap --print

vm-1

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        9      0 192.168.139.111:9527    0.0.0.0:*               LISTEN      538/python3          off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56822   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56802   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56786   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56840   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56838   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56796   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56790   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56780   ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.139.111:9527    192.168.139.151:56818   ESTABLISHED -                    off (0.00/0/0)

-----------------------------------------------------------------
vm-2

$ sudo netstat -anpo|grep -E "Recv-Q|9527"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      1 192.168.139.151:42424   192.168.139.111:9527    SYN_SENT    33284/nc             on (1.58/6/0)
tcp        0      0 192.168.139.151:56780   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56786   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56790   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56796   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56802   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56818   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56822   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56838   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      0 192.168.139.151:56840   192.168.139.111:9527    ESTABLISHED 33283/python3        off (0.00/0/0)
tcp        0      1 192.168.139.151:56844   192.168.139.111:9527    SYN_SENT    33283/python3        on (48.94/7/0)

You can see vm-1 stops after establishing 9 connections, with no SYN_RECV entries: once the accept queue is full, new SYN requests are dropped before reaching the SYN queue.
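
(The count of 9 matches the usual kernel behavior, as far as I know: the accept-queue overflow check uses a strictly-greater comparison, so a backlog of 8 admits backlog + 1 = 9 sockets.)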

On vm-2, the capture shows port 56844 (python) in SYN_SENT, sending SYNs.

nc's port 42424 is the same; they are all retrying, 7 attempts, which is expected given my net.ipv4.tcp_syn_retries = 6 and net.ipv4.tcp_syn_linear_timeouts = 1.

One more observation on retransmission: the capture looks the same whether vm-1 rejects vm-2's SYNs via iptables or the kernel rejects them because the accept queue is full; the difference is that one is user-configured and the other is system behavior.

I rebooted vm-1 to restore the kernel defaults and started an nginx; you can see its default backlog of 511:

$ sudo ss -lnt
State     Recv-Q    Send-Q       Local Address:Port       Peer Address:Port    Process
LISTEN    0         511                0.0.0.0:80              0.0.0.0:*
LISTEN    0         511                   [::]:80                 [::]:*

At this point, if your nginx cannot handle connections, the situations break down roughly as follows:

  1. It is listening on the lo interface, so it cannot serve external requests; connections are refused and the client goes through TCP retries.
  2. It is listening on the right interface, but iptables or security-group rules block traffic; the client goes through TCP retries.
  3. Right interface, iptables/security groups open, but the accept queue is full; the system drops connections directly.
  4. Right interface, rules open, neither queue full. But someone tuned the kernel parameters on a new machine and made the SYN queue too small; under high concurrency the basic system metrics all look normal. Will requests still fail? (I had doubts here; tested below.)
  5. Right interface, rules open, neither queue full. But the machine's basic metrics are abnormal, say CPU or memory at 100%, so the accept queue keeps backing up because accept() is slow, until the SYN queue fills too; the machine ends up effectively unusable.

Testing scenario 4: yes, it misbehaves. From the server you can observe vm-2 sending large numbers of TCP retries, while the system also drops many requests out of the SYN queue.

I noticed this capture is missing packets, but no matter; the system-level drops are real.

$ sudo sysctl -w net.ipv4.tcp_syncookies=0 net.ipv4.tcp_max_syn_backlog=4 net.core.somaxconn=8
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_syn_backlog = 4
net.core.somaxconn = 8

vm-2

$ wrk -t4 -c400 -d60s http://101.200.150.26
$ netstat -anpo|grep -E "Recv|80"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      28539/nginx: master  off (0.00/0/0)
tcp        0    862 172.22.7.89:80          x.x.x.x:55481    ESTABLISHED 28540/nginx: worker  on (0.31/0/0)
tcp        0    862 172.22.7.89:80          x.x.x.x:56121    ESTABLISHED 28541/nginx: worker  on (6.32/6/0)

$ netstat -s|grep -E "LISTEN|overflow"
    5517 times the listen queue of a socket overflowed
    78079 SYNs to LISTEN sockets dropped
$ netstat -s|grep -E "LISTEN|overflow"
    5517 times the listen queue of a socket overflowed
    78388 SYNs to LISTEN sockets dropped

You can also see nginx retransmitting after the TCP connection is established, with Send-Q showing 862 bytes. Analyzing the capture, 862 is exactly the TCP segment length: it is the response the server sent to vm-2, still awaiting vm-2's ACK, which is why Send-Q reads 862.

Now run the load test again without the kernel parameter changes:

tcpdump -s0 -X -nn "tcp port 80" -w cloudserver-wrk-no-sysctl.pcap --print

This capture is complete.

Now check whether the system dropped any requests: the netstat -s|grep -E "LISTEN|overflow" filter comes back empty.

TcpExt:
    2 invalid SYN cookies received
    10 resets received for embryonic SYN_RECV sockets
    42 TCP sockets finished time wait in fast timer
    273 packets rejected in established connections because of timestamp
    19 delayed acks sent
    Quick ack mode was activated 13960 times
    630 packet headers predicted
    64905 acknowledgments not containing data payload received
    16255 predicted acknowledgments
    TCPSackRecovery: 1741
    18 congestion windows recovered without slow start after partial ack

server

Disable password login

/etc/ssh/sshd_config

UsePAM no
PasswordAuthentication no

/etc/ssh/sshd_config.d/50-cloud-init.conf

# Example: to disable password login this must be changed to no; cloudcone enables password login through a file added by cloud-init
PasswordAuthentication yes

Some providers leave sshd_config itself untouched and instead drop a new file into the sshd_config.d directory to override it.

If you edit sshd_config, restart, and nothing takes effect, check the sshd_config.d directory.

client

~/.ssh/config

# forward through a jump host
Host beszel
  Hostname localhost
  User root
  ProxyJump orange723
  LocalForward 8090 localhost:8090

Host *
  IdentityAgent agent.sock
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  ServerAliveInterval 15

After connecting to the server and leaving it idle for a long time, the session hangs and the only fix is reconnecting. From plantegg I found:

"ServerAliveInterval [seconds]" configuration in the SSH configuration so that your ssh client sends a "dummy packet" on a regular interval so that the router thinks that the connection is active even if it's particularly quiet

i.e. the client sends a packet to the server every N seconds.

Let's capture and check.

no-time

You'll find that after a long idle period, when the client tries to use the connection again it cannot reach the server; retransmissions get no response, the same situation we hit, and the only fix is to reconnect.

time

With ServerAliveInterval 15 set, you can see that after the connection is made, at 16.923s the SSH client sends a packet to the server, and the server responds.

ssh-local-forward

ssh -f -N -L 8090:localhost:8090 orange723