uncloud

Installation

quick-start

Notes:
- On macOS, when using 1Password's ssh-agent, the SSH_AUTH_SOCK variable needs to be overridden
- When logging in with a private key, give the full path; otherwise running uc will complain that it can't find the key
- Multiple clusters can be distinguished by specifying a context: -c, --context string
- Because of domain and DNS constraints, Caddy and the DNS service were not installed directly

$ export SSH_AUTH_SOCK=~/Library/Group\ Containers/com.1password/t/agent.sock

$ uc machine init root@example.com --name example-1 --no-caddy --no-dns

caddy ingress

  • Uses an IP SSL certificate
  • Caddy is installed only to be configured with the ZeroSSL-issued certificate
  • Path routing is defined in the excalidraw service below; because of the IP SSL setup this doesn't affect my personal testing
$ cat certificate.crt ca_bundle.crt > fullchain.crt
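A quick way to sanity-check the concatenated chain before handing it to Caddy is openssl (a simple check, assuming the ZeroSSL file names used above):

$ openssl x509 -in certificate.crt -noout -subject -dates   # leaf certificate subject and validity window
$ openssl verify -CAfile ca_bundle.crt certificate.crt      # verify the leaf against the bundled CA chain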

caddy compose.yaml

services:
  caddy:
    image: docker.1ms.run/library/caddy:2.11
    command: caddy run -c /config/Caddyfile
    environment:
      CADDY_ADMIN: unix//run/caddy/admin.sock
    volumes:
      - /root/caddy:/root/caddy # path on the server where the SSL certificate files are stored; change as needed
      - /var/lib/uncloud/caddy:/data
      - /var/lib/uncloud/caddy:/config
      - /run/uncloud/caddy:/run/caddy
    x-ports:
      - 80:80@host
      - 443:443@host
    deploy:
      mode: global

runme deploy

excalidraw

  • the / path serves excalidraw
  • /message is the leave-a-message service

excalidraw compose.yaml

services:
  excalidraw:
    image: docker.1ms.run/excalidraw/excalidraw
    x-caddy: |
      :443 {
          tls /root/caddy/fullchain.crt /root/caddy/private.key
          
          # excalidraw
          reverse_proxy {{upstreams}} {
              import common_proxy
          }
      
          # leave-a-message static
          handle /static/* {
              reverse_proxy {{upstreams "message" 3000}} {
                  import common_proxy
              }
          }

          # leave-a-message api
          handle /api/* {
              reverse_proxy {{upstreams "message" 3000}} {
                  import common_proxy
              }
          }
      
          # leave-a-message root pages
          handle_path /message/* {
              reverse_proxy {{upstreams "message" 3000}} {
                  import common_proxy
              }
          }
      
          log
      }

runme deploy

leave-a-message

leave-a-message compose-uncloud.yml

services:

  message:
    build:
      context: .
      platforms:
        - linux/amd64
    image: message:{{date "20060102-150405" "Local"}}
    configs:
      - source: app_config
        target: /app/.env
        mode: 0644

  db:
    image: docker.1ms.run/library/mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: 12345678
      MYSQL_DATABASE: message
    x-ports:
      - 3306:3306/tcp@host
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  mysql_data:

configs:
  app_config:
    content: |
      SERVER_HOST=0.0.0.0
      SERVER_PORT=3000

      ENABLE_LOGGER=true

      MYSQL_USER=root
      MYSQL_PASSWORD=12345678
      MYSQL_HOST=db
      MYSQL_DB=message

WireGuard networking test

I hadn't used WireGuard before using uncloud either. I read uncloud's blog post connect-docker-containers-across-hosts-wireguard; it isn't very complicated, so I wanted to test how a WireGuard mesh performs.

The official diagram is more direct.

The overall flow is ((1) means the first machine, (2) the second):

docker bridge(1) -> iptables(1) -> wireguard(1) -> wireguard(2) -> iptables(2) -> docker bridge(2)

Create a new bridge network on each machine, use iptables to forward traffic for the peer's bridge subnet into WireGuard, and set up a tunnel between the two machines' WireGuard interfaces; when the other machine's WireGuard receives the traffic, its iptables rules forward it on to the containers on its bridge network.

My environment

docker-wireguard

First ping each other in advance, so we don't waste the test on machines that can't reach each other.

# Machine 1
$ ping 74.48.78.11
PING 74.48.78.11 (74.48.78.11) 56(84) bytes of data.
64 bytes from 74.48.78.11: icmp_seq=1 ttl=48 time=189 ms
64 bytes from 74.48.78.11: icmp_seq=2 ttl=48 time=181 ms

# Machine 2
$ ping 101.200.150.26
PING 101.200.150.26 (101.200.150.26) 56(84) bytes of data.
64 bytes from 101.200.150.26: icmp_seq=1 ttl=50 time=193 ms
64 bytes from 101.200.150.26: icmp_seq=2 ttl=50 time=193 ms
# Machine 1
$ docker network create --subnet 10.200.1.0/24 -o com.docker.network.bridge.trusted_host_interfaces="wg0" multi-host
18ca4396c2977a987d3238f4a6b83d4dd3a0ceabad25c7c1f670a967c6b1cdef

# Machine 2
$ docker network create --subnet 10.200.2.0/24 -o com.docker.network.bridge.trusted_host_interfaces="wg0" multi-host
04c3ec0b2eaba837c5e974bdc6c76ed4dda8c10349fd3cb75af3506b0e2bac63
# Run on both machines; make sure the cloud provider's security group opens UDP 51820
$ iptables -I INPUT -p udp --dport 51820 -j ACCEPT
# Install wireguard and generate the key pair
$ apt update && apt install wireguard
$ umask 077
$ wg genkey > privatekey
$ wg pubkey < privatekey > publickey

PrivateKey is filled with the local machine's key; PublicKey must be the peer's.

From the original post:

PrivateKey = <replace with 'privatekey' file content from Machine 1>
PublicKey = <replace with 'publickey' file content from Machine 2>

# The generated wg configs
# Machine 1
$ cat /etc/wireguard/wg0.conf
[Interface]
ListenPort = 51820
PrivateKey = kKaGXES+s2mFVYTxEMN/KTJ0k/a0S9bv8JxeVz0bdnA=

[Peer]
PublicKey = Y8ENSMHYspa1x6Hfnj1YgZL0xSrOzBHeV7i6f6UZbhQ=
# IP ranges for which a peer will route traffic: Docker subnet on Machine 2
AllowedIPs = 10.200.2.0/24
# Public IP of Machine 2
Endpoint = 74.48.78.11:51820
# Periodically send keepalive packets to keep NAT/firewall mapping alive
PersistentKeepalive = 25

# Machine 2
$ cat /etc/wireguard/wg0.conf
[Interface]
ListenPort = 51820
PrivateKey = MBLrCQ2eVCga+unX3IIKPSnyzsyh1SwkmRIkIryX+0k=

[Peer]
PublicKey = TJOqJi0AuPP3GOOYFmq8jAsmak/hgKRfEpAH2nJNiVk=
# IP ranges for which a peer will route traffic: Docker subnet on Machine 1
AllowedIPs = 10.200.1.0/24
# Public IP of Machine 1
Endpoint = 101.200.150.26:51820
# Periodically send keepalive packets to keep NAT/firewall mapping alive
PersistentKeepalive = 25

# Bring up wg0 on both machines
$ wg-quick up wg0
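To bring the tunnel back up automatically after a reboot, the wg-quick systemd unit can be enabled as well (optional, assuming systemd hosts):

# Run on both machines
$ systemctl enable --now wg-quick@wg0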

Check the WireGuard status

# Machine 1
$ wg show
interface: wg0
  public key: TJOqJi0AuPP3GOOYFmq8jAsmak/hgKRfEpAH2nJNiVk=
  private key: (hidden)
  listening port: 51820

peer: Y8ENSMHYspa1x6Hfnj1YgZL0xSrOzBHeV7i6f6UZbhQ=
  endpoint: 74.48.78.11:51820
  allowed ips: 10.200.2.0/24
  latest handshake: 45 seconds ago
  transfer: 116.21 KiB received, 41.36 KiB sent
  persistent keepalive: every 25 seconds
  
# Machine 2
$ wg show
interface: wg0
  public key: Y8ENSMHYspa1x6Hfnj1YgZL0xSrOzBHeV7i6f6UZbhQ=
  private key: (hidden)
  listening port: 51820

peer: TJOqJi0AuPP3GOOYFmq8jAsmak/hgKRfEpAH2nJNiVk=
  endpoint: 101.200.150.26:51820
  allowed ips: 10.200.1.0/24
  latest handshake: 1 minute, 4 seconds ago
  transfer: 39.93 KiB received, 116.46 KiB sent
  persistent keepalive: every 25 seconds

Configure an iptables rule to allow traffic from wg into the bridge network

# Machine 1
$ docker network ls -f name=multi-host
NETWORK ID     NAME         DRIVER    SCOPE
18ca4396c297   multi-host   bridge    local
$ iptables -I DOCKER-USER -i wg0 -o br-18ca4396c297 -j ACCEPT

# Machine 2
$ docker network ls
NETWORK ID     NAME         DRIVER    SCOPE
e4613839f6f6   bridge       bridge    local
f1954f4aff63   host         host      local
04c3ec0b2eab   multi-host   bridge    local
7fb9f0ad47ba   none         null      local
$ iptables -I DOCKER-USER -i wg0 -o br-04c3ec0b2eab -j ACCEPT

Finally, add the rule so that container traffic leaving through the wg interface skips masquerading and keeps its original source address

# Machine 1
$ iptables -t nat -I POSTROUTING -s 10.200.1.0/24 -o wg0 -j RETURN

# Machine 2
$ iptables -t nat -I POSTROUTING -s 10.200.2.0/24 -o wg0 -j RETURN

Test connectivity

# Machine 2
$ docker run -d --name whoami --network multi-host traefik/whoami
$ docker inspect b992403e1e4f|grep 10.200
                    "Gateway": "10.200.2.1",
                    "IPAddress": "10.200.2.2",
                    
# Machine 1
$ docker run -it --rm --network multi-host busybox ping 10.200.2.2
PING 10.200.2.2 (10.200.2.2): 56 data bytes
64 bytes from 10.200.2.2: seq=0 ttl=62 time=194.198 ms
64 bytes from 10.200.2.2: seq=1 ttl=62 time=190.275 ms
64 bytes from 10.200.2.2: seq=2 ttl=62 time=198.435 ms
64 bytes from 10.200.2.2: seq=3 ttl=62 time=196.896 ms
64 bytes from 10.200.2.2: seq=4 ttl=62 time=197.444 ms
^C
--- 10.200.2.2 ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 190.275/195.449/198.435 ms
$ docker run -it --rm --network multi-host alpine/curl http://10.200.2.2
Hostname: b992403e1e4f
IP: 127.0.0.1
IP: ::1
IP: 10.200.2.2
RemoteAddr: 10.200.1.2:60960
GET / HTTP/1.1
Host: 10.200.2.2
User-Agent: curl/8.17.0
Accept: */*

The two sides can talk to each other fine; latency is just a bit higher than a direct ping between the hosts, and the two machines aren't close to each other to begin with.

Export the iptables rules on Machine 1 and have a look

# filter
$ iptables -S -v
-P INPUT ACCEPT -c 608049 168734434
-P FORWARD DROP -c 0 0
-P OUTPUT ACCEPT -c 0 0
-N DOCKER
-N DOCKER-BRIDGE
-N DOCKER-CT
-N DOCKER-FORWARD
-N DOCKER-INTERNAL
-N DOCKER-USER
-A INPUT -p udp -m udp --dport 51820 -c 2385 192148 -j ACCEPT
-A FORWARD -c 270 22213 -j DOCKER-USER
-A FORWARD -c 147 12070 -j DOCKER-FORWARD
-A DOCKER ! -i docker0 -o docker0 -c 0 0 -j DROP
-A DOCKER ! -i br-18ca4396c297 -o br-18ca4396c297 -c 0 0 -j DROP
-A DOCKER-BRIDGE -o docker0 -c 0 0 -j DOCKER
-A DOCKER-BRIDGE -o br-18ca4396c297 -c 0 0 -j DOCKER
-A DOCKER-CT -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -c 0 0 -j ACCEPT
-A DOCKER-CT -o br-18ca4396c297 -m conntrack --ctstate RELATED,ESTABLISHED -c 0 0 -j ACCEPT
-A DOCKER-FORWARD -c 147 12070 -j DOCKER-CT
-A DOCKER-FORWARD -c 147 12070 -j DOCKER-INTERNAL
-A DOCKER-FORWARD -c 147 12070 -j DOCKER-BRIDGE
-A DOCKER-FORWARD -i docker0 -c 40 3360 -j ACCEPT
-A DOCKER-FORWARD -i br-18ca4396c297 -c 107 8710 -j ACCEPT
-A DOCKER-USER -i wg0 -o br-18ca4396c297 -c 123 10143 -j ACCEPT

# nat
$ iptables -t nat -S -v
-P PREROUTING ACCEPT -c 668150 42644933
-P INPUT ACCEPT -c 0 0
-P OUTPUT ACCEPT -c 7495 569953
-P POSTROUTING ACCEPT -c 7501 570385
-N DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -c 667962 42630817 -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -c 0 0 -j DOCKER
-A POSTROUTING -s 10.200.1.0/24 -o wg0 -c 6 432 -j RETURN
-A POSTROUTING -s 10.200.1.0/24 ! -o br-18ca4396c297 -c 0 0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -c 2 168 -j MASQUERADE

The key ones are these three:

-A INPUT -p udp -m udp --dport 51820 -c 2385 192148 -j ACCEPT # allow inbound WireGuard traffic on UDP 51820
-A DOCKER-USER -i wg0 -o br-18ca4396c297 -c 123 10143 -j ACCEPT # allow traffic coming in on wg0 and going out to br-18ca4396c297
-A POSTROUTING -s 10.200.1.0/24 -o wg0 -c 6 432 -j RETURN # container traffic leaving via wg0 skips MASQUERADE and is forwarded as-is

The original post also says that rules in DOCKER-USER are evaluated before the DOCKER chain, which is why the rule is added to DOCKER-USER; something to keep in mind when writing custom container rules later. Looking it up confirms this.

Testing shows the same rule also works when placed in the DOCKER chain instead of DOCKER-USER, but manually added rules in DOCKER disappear when Docker restarts.

$ iptables -I DOCKER -i wg0 -o br-18ca4396c297 -j ACCEPT
$ iptables -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-BRIDGE
-N DOCKER-CT
-N DOCKER-FORWARD
-N DOCKER-INTERNAL
-N DOCKER-USER
-A INPUT -p udp -m udp --dport 51820 -j ACCEPT
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-FORWARD
-A DOCKER -i wg0 -o br-18ca4396c297 -j ACCEPT
-A DOCKER ! -i docker0 -o docker0 -j DROP
-A DOCKER ! -i br-18ca4396c297 -o br-18ca4396c297 -j DROP
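Note that all of these manually added iptables rules (including the DOCKER-USER and POSTROUTING ones above) live only in the running kernel and are gone after a reboot. One way to persist them on Debian/Ubuntu is iptables-persistent (a sketch, not something the original setup includes):

$ apt install iptables-persistent   # pulls in netfilter-persistent and saves the current rules
$ netfilter-persistent save         # re-save after making further changes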

The original post also mentions the limitation: containers reach each other by IP only; there is no service discovery, so they cannot address each other by name.

DNS resolution

The main limitation of this setup is that containers can't find each other by name across machines. You need to use their IP addresses directly or implement a service discovery solution like Consul or CoreDNS.

For small deployments, you can assign static IPs to containers and use those IPs in your app configuration. But service discovery is essential for larger and more dynamic deployments.
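For the static-IP workaround mentioned above, Docker can pin a container's address on a user-defined network that has an explicit subnet (a minimal sketch reusing the multi-host network from this test; the address is arbitrary within 10.200.2.0/24):

# Machine 2: give the container a fixed IP that Machine 1 can hard-code in its app config
$ docker run -d --name whoami --network multi-host --ip 10.200.2.10 traefik/whoami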

Summary

Without Kubernetes, uncloud works as a lightweight stand-in for testing and playing around (calling it a real substitute is a stretch; uncloud is just a service for managing containers across multiple machines). The machines are meshed with WireGuard, and every machine runs Caddy to receive traffic; for small traffic a single machine is enough. Adding multiple IPs as A records on the domain spreads traffic across machines, and Caddy's internal upstreams variable load-balances to all of them, so a single A record also works. If the machines sit in different regions, DNS-based geo routing can separate them; Cloudflare offers this too. For CI/CD, the build and deploy stages can also be split, i.e. build on the local CI machine and push the image to the remote machines; GitHub Actions alone can do this.

As for WireGuard: if the machines are far apart, the inherent latency doesn't change; the mesh only lets them talk to each other. From a stability point of view I still wouldn't recommend meshing servers that are far apart. It's fine for personal experiments, but for business needs it's better to use the cloud provider's peering, or something like GCP where regions can already reach each other with low latency at no extra cost; you only need to plan the IP ranges.

When updating Caddy, config synchronization can be slow: for example, after updating the Caddy config and running uc deploy again, uc caddy config may for a while show only the default Caddy content, and the routing information does not propagate quickly. A badly written Caddy config can also leave Caddy running with nothing but the default configuration. Since I only have one machine, and it uses IP SSL, I haven't tested multiple domains. Validating the Caddy config file beforehand helps avoid authoring mistakes, and since Caddy supports zero-downtime reloads, this problem probably wouldn't show up with multiple domains and services.
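For the config-file check mentioned above, Caddy's built-in validator can be run against a full config before deploying (adjust the path and adapter to your file):

$ caddy validate --config Caddyfile --adapter caddyfile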

Also, compared with uncloud I think Nomad is purer: it only handles container management, leaving service discovery to Consul, paired with an APISIX gateway; I think that is better in terms of manageability and stability.

uncloud

A few documents I have drawn on at work.

devops

roadmap devops

learning devops

The future of ops is platform engineering

On highly available systems

What is a DDoS attack?

Microservices architecture on Google Cloud

Using SSH over the HTTPS port

terraform

Directory Structure and Modules Terraform on GCP — Organize your Terraform configurations using a clear directory structure and modules for different environments(Dev, Staging&Prod).

terragrunt

network

life-of-a-packet-in-the-linux-kernel

The Layers of the OSI Model Illustrated

uncloud network

monitor

awesome-prometheus-alerts

linux

iptables-essentials-common-firewall-rules-and-commands

Scripts I wrote that I use all the time

Working with Stdin, Stdout, and Stderr

How to secure your server from abuse and prevent IP blacklisting

machine price

gcloud-compute

Interviews

Job hunting and resumes

The experiment procedure comes from: 01|一个数据包的网络之旅:网络是如何工作的?

This article is also worth reading: life-of-a-packet-in-the-linux-kernel

Observing a packet's journey through a single HTTP request

$ sudo tcpdump -s0 -X -nn "tcp port 80" -w packet.pcap --print

packet.pcap

$ curl -o /dev/null -v http://example.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host example.com:80 was resolved.
* IPv6: 2600:1406:5e00:6::17ce:bc1b, 2600:1408:ec00:36::1736:7f24, 2600:1406:bc00:53::b81e:94ce, 2600:1408:ec00:36::1736:7f31, 2600:1406:5e00:6::17ce:bc12, 2600:1406:bc00:53::b81e:94c8
* IPv4: 23.215.0.136, 23.192.228.80, 23.220.75.232, 23.220.75.245, 23.192.228.84, 23.215.0.138
*   Trying 23.215.0.136:80...
* Connected to example.com (23.215.0.136) port 80
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.5.0
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Content-Type: text/html
< ETag: "bc2473a18e003bdb249eba5ce893033f:1760028122.592274"
< Last-Modified: Thu, 09 Oct 2025 16:42:02 GMT
< Cache-Control: max-age=86000
< Date: Fri, 28 Nov 2025 08:31:14 GMT
< Content-Length: 513
< Connection: keep-alive
<
{ [513 bytes data]
100   513  100   513    0     0    418      0  0:00:01  0:00:01 --:--:--   418
* Connection #0 to host example.com left intact

First comes DNS resolution; once the IP is known, a TCP connection is made to 23.215.0.136:80

* IPv6: 2600:1406:5e00:6::17ce:bc1b, 2600:1408:ec00:36::1736:7f24, 2600:1406:bc00:53::b81e:94ce, 2600:1408:ec00:36::1736:7f31, 2600:1406:5e00:6::17ce:bc12, 2600:1406:bc00:53::b81e:94c8
* IPv4: 23.215.0.136, 23.192.228.80, 23.220.75.232, 23.220.75.245, 23.192.228.84, 23.215.0.138
*   Trying 23.215.0.136:80...
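The DNS step can also be observed on its own, e.g. with dig (output order varies by resolver):

$ dig +short example.com A   # should return the same IPv4 set curl printed above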

After the connection succeeds, the GET / request is sent

> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.5.0
> Accept: */*

example.com replies with HTTP status 200; the capture then shows the local machine 192.168.139.111 initiating the TCP connection close

< HTTP/1.1 200 OK
< Content-Type: text/html
< ETag: "bc2473a18e003bdb249eba5ce893033f:1760028122.592274"
< Last-Modified: Thu, 09 Oct 2025 16:42:02 GMT
< Cache-Control: max-age=86000
< Date: Fri, 28 Nov 2025 08:31:14 GMT
< Content-Length: 513
< Connection: keep-alive

Network layering

Image from: 网络架构实战课

Crossing the client's LAN

One-sentence summary: within the same LAN, ARP resolves the MAC and the frame is sent directly; across LANs, it is sent via the router.

Let's work out whether my IP and example.com's IP are on the same network

$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.139.111  netmask 255.255.255.0  broadcast 192.168.139.255
        inet6 fd07:b51a:cc66:0:a0db:deff:fea3:9cb5  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::a0db:deff:fea3:9cb5  prefixlen 64  scopeid 0x20<link>
        ether a2:db:de:a3:9c:b5  txqueuelen 1000  (Ethernet)
        RX packets 46016  bytes 17701736 (17.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 330  bytes 29500 (29.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        
# 128 64 32 16 8 4 2 1
# Compare using the local machine's netmask against both the destination IP and the local IP; if the network bits match, they are on the same network
        
Local machine: 192.168.139.111  netmask: 255.255.255.0

IP:          11000000.10101000.10001011.01101111
Netmask:     11111111.11111111.11111111.00000000
Bitwise AND: 11000000.10101000.10001011.00000000
Network:     192.168.139.0

example.com:23.215.0.136

IP:          00010111.11010111.00000000.10001000
Netmask:     11111111.11111111.11111111.00000000
Bitwise AND: 00010111.11010111.00000000.00000000
Network:     23.215.0.0
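The same check can be done with a one-liner, using Python's ipaddress module purely as a calculator:

$ python3 -c "import ipaddress as i; n = i.ip_network('192.168.139.111/24', strict=False); print(n, i.ip_address('23.215.0.136') in n)"
192.168.139.0/24 False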

Since they are not on the same network, the packet must follow the routing table; we can see it goes via the 192.168.139.1 gateway on device eth0

$ ip route get 23.215.0.136
23.215.0.136 via 192.168.139.1 dev eth0 src 192.168.139.111 uid 501
    cache

The gateway is necessarily on the same network as the host; let's observe how ARP works

$ sudo arp -d 192.168.139.1

$ sudo tcpdump -s0 -X -nn "arp" -w arp.pcap --print

arp.pcap

$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.139.1            ether   da:9b:d0:54:e0:02   C                     eth0

traceroute shows the same thing; it resolved a different IP this time, but that doesn't affect the point

$ sudo traceroute -n -I example.com
traceroute to example.com (23.220.75.232), 30 hops max, 60 byte packets
 1  192.168.139.1  0.036 ms  0.014 ms  0.005 ms
 2  192.168.1.1  4.661 ms  4.647 ms  4.642 ms
 3  * 100.101.0.1  12.028 ms *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  23.220.75.232  186.684 ms  186.577 ms  260.319 ms

NextTrace is recommended

Testing from a server, the intermediate devices do respond to ICMP; it's probably the devices along my home connection's path that filter ICMP

$ traceroute -I -n -m 50 example.com
traceroute to example.com (23.220.75.245), 50 hops max, 60 byte packets
 1  10.59.252.86  1.378 ms  1.446 ms  1.442 ms
 2  11.73.60.253  1.937 ms * *
 3  26.25.187.33  1.519 ms  1.529 ms  1.630 ms
 4  10.216.220.118  3.104 ms  3.179 ms  3.160 ms
 5  10.216.229.106  3.177 ms  3.178 ms  3.232 ms
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  219.158.5.174  178.130 ms  178.125 ms  178.145 ms
13  * * *
14  154.54.77.53  162.103 ms  162.088 ms  162.185 ms
15  154.54.63.70  157.864 ms  157.916 ms  157.911 ms
16  154.54.47.165  238.095 ms  238.125 ms  243.526 ms
17  154.54.169.178  260.650 ms  260.635 ms  260.638 ms
18  154.54.29.134  249.462 ms  249.394 ms  248.589 ms
19  154.54.40.249  249.505 ms  249.495 ms *
20  154.54.165.26  247.671 ms  249.877 ms  250.234 ms
21  154.54.166.58  251.909 ms  252.616 ms  252.628 ms
22  154.54.44.86  254.282 ms  254.289 ms  254.572 ms
23  154.54.27.118  250.352 ms  252.829 ms  252.919 ms
24  38.104.84.101  236.554 ms  236.499 ms  236.548 ms
25  218.30.54.6  242.952 ms  242.911 ms  242.916 ms
26  * * *
27  * * *
28  * * *
29  23.220.75.245  239.147 ms  236.590 ms  239.235 ms

Recommended case study

I analyzed the problem in this article; it's a very interesting timeout that occurs with 0.01% probability

My reply was:

There's another difference between the two captures.

Normal case: the server sends the client a zero window and then a window update; the server is slow to process, but the pacing stays under the server's control.

Timeout case: no window-update packets are seen; everything is client-to-server, and between packets 2136 and 2149 you can see 15 retransmissions.

If the server were simply slow, you'd expect more than a single timeout followed by everything going back to normal; and if a middlebox were at fault, why would the timeout probability be only 0.01%?

The author replied:

A zero window is actually a good sign here.

The receive path for incoming data is:

NIC -> kernel processing -> TCP connection buffer -> application read

Normal case:

Precisely because the kernel processes fast enough, the buffer can fill up; the application doesn't read fast enough, the buffer fills, and the receiver sends a zero window to make the sender pause.

Timeout case:

Because the kernel's processing bandwidth drops (LRO is not enabled), the buffer can't be filled, so no zero window appears. Meanwhile, since the NIC receives packets quickly, the kernel very likely can't keep up, which leads to packet loss.

After seeing that analysis I compared the two captures again, and it really does look as if the NIC offload features were turned off. I'd run into offload before while testing, here: TCP 数据的发送和接收. You can also see that the packets are all VXLAN-encapsulated data. Following my own analysis you'd conclude the server's processing capacity was insufficient, but after the author's reply I thought about it again: why would the server's capacity be insufficient? It actually isn't, since before the device was replaced everything was normal. It's the NIC receiving fast while the kernel processes slowly that makes it look as if the server can't keep up.

Drawing a conclusion directly from the zero-window symptom was too hasty; it wasn't the root cause.

Kubeadm deployment

$ kubeadm init --cri-socket=unix:///run/cri-dockerd.sock --ignore-preflight-errors=mem --pod-network-cidr=10.244.0.0/16

$ kubectl taint nodes --all node-role.kubernetes.io/control-plane-

Service NodePort

Inbound

A single-replica nginx Deployment, with a Service of type NodePort

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: docker.1ms.run/nginx:stable
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
          name: web
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 31532
  selector:
    app: nginx
$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP           NODE        NOMINATED NODE   READINESS GATES
web-5846888f49-q4f9c   1/1     Running   0          151m   10.244.0.4   orange723   <none>           <none>

$ kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP        150m
nginx        NodePort    10.106.224.38   <none>        80:31532/TCP   146m

Accessing the Service address 10.106.224.38 and the Pod address 10.244.0.4 directly both return fine

$ curl -I 10.106.224.38
HTTP/1.1 200 OK

$ curl -I 10.244.0.4
HTTP/1.1 200 OK

Test access from outside while capturing on the host's eth0 and cni0 interfaces and on the pod's eth0

$ tcpdump -i eth0 -s0 -X -nn "tcp port 31532" -w eth0.pcap --print

eth0.pcap

$ tcpdump -s0 -X -nn -i cni0 -w cni0.pcap --print

cni0.pcap

$ tcpdump -i eth0 -s0 -X -nn -w nginx.pcap --print

nginx.pcap

Look at the eth0 capture first: the local client establishes a connection with the host NIC and, right after the three-way handshake, sends GET /. In the cni0 capture it is 10.244.0.1 sending GET / to 10.244.0.4; .1 is the cni0 interface's address and .4 is the container's. Then .4 returns the content directly, and the local socket port 37475 matches up as well. Finally, the capture inside nginx shows it talking only with the cni0 interface.

Based on the captures, the iptables path is: -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES >> -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully. Since no cross-host traffic is involved, the FLANNEL-POSTRTG chain further down is simply traversed and the traffic returns.

Here are the current machine's iptables rules

$ iptables -S -t nat > rules.txt
-----------------------------------------------------------------
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-N FLANNEL-POSTRTG
-N KUBE-EXT-2CMXP7HKUVJN7L6M
-N KUBE-KUBELET-CANARY
-N KUBE-MARK-MASQ
-N KUBE-NODEPORTS
-N KUBE-POSTROUTING
-N KUBE-PROXY-CANARY
-N KUBE-SEP-6E7XQMQ4RAYOWTTM
-N KUBE-SEP-AAOCRVJBUI2XUHEI
-N KUBE-SEP-C3WRBSQHCDQ7BT6J
-N KUBE-SEP-IT2ZTR26TO4XFPTO
-N KUBE-SEP-N4G2XR5TDX7PQE7P
-N KUBE-SEP-YIL6JZP7A3QYXJU2
-N KUBE-SEP-ZP3FB6NMPNCO4VBJ
-N KUBE-SEP-ZXMNUKOKXUTL2MK2
-N KUBE-SERVICES
-N KUBE-SVC-2CMXP7HKUVJN7L6M
-N KUBE-SVC-ERIFXISQEP7F7OF4
-N KUBE-SVC-JD5MR3NA4I4DYORP
-N KUBE-SVC-NPX46M4PTMTKRN6Y
-N KUBE-SVC-TCOU7JCQXEZGVUNU
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG
-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A KUBE-EXT-2CMXP7HKUVJN7L6M -m comment --comment "masquerade traffic for default/nginx external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-2CMXP7HKUVJN7L6M -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -m nfacct --nfacct-name  localhost_nps_accepted_pkts -j KUBE-EXT-2CMXP7HKUVJN7L6M
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -j KUBE-EXT-2CMXP7HKUVJN7L6M
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A KUBE-SEP-6E7XQMQ4RAYOWTTM -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-6E7XQMQ4RAYOWTTM -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.244.0.3:53
-A KUBE-SEP-AAOCRVJBUI2XUHEI -s 10.244.0.4/32 -m comment --comment "default/nginx" -j KUBE-MARK-MASQ
-A KUBE-SEP-AAOCRVJBUI2XUHEI -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.244.0.4:80
-A KUBE-SEP-C3WRBSQHCDQ7BT6J -s 172.22.7.89/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-C3WRBSQHCDQ7BT6J -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 172.22.7.89:6443
-A KUBE-SEP-IT2ZTR26TO4XFPTO -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-IT2ZTR26TO4XFPTO -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.244.0.2:53
-A KUBE-SEP-N4G2XR5TDX7PQE7P -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-N4G2XR5TDX7PQE7P -p tcp -m comment --comment "kube-system/kube-dns:metrics" -m tcp -j DNAT --to-destination 10.244.0.2:9153
-A KUBE-SEP-YIL6JZP7A3QYXJU2 -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-YIL6JZP7A3QYXJU2 -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.244.0.2:53
-A KUBE-SEP-ZP3FB6NMPNCO4VBJ -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZP3FB6NMPNCO4VBJ -p tcp -m comment --comment "kube-system/kube-dns:metrics" -m tcp -j DNAT --to-destination 10.244.0.3:9153
-A KUBE-SEP-ZXMNUKOKXUTL2MK2 -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZXMNUKOKXUTL2MK2 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.244.0.3:53
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-2CMXP7HKUVJN7L6M ! -s 10.244.0.0/16 -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVC-2CMXP7HKUVJN7L6M -m comment --comment "default/nginx -> 10.244.0.4:80" -j KUBE-SEP-AAOCRVJBUI2XUHEI
-A KUBE-SVC-ERIFXISQEP7F7OF4 ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp -> 10.244.0.2:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-IT2ZTR26TO4XFPTO
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp -> 10.244.0.3:53" -j KUBE-SEP-ZXMNUKOKXUTL2MK2
-A KUBE-SVC-JD5MR3NA4I4DYORP ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SVC-JD5MR3NA4I4DYORP -m comment --comment "kube-system/kube-dns:metrics -> 10.244.0.2:9153" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-N4G2XR5TDX7PQE7P
-A KUBE-SVC-JD5MR3NA4I4DYORP -m comment --comment "kube-system/kube-dns:metrics -> 10.244.0.3:9153" -j KUBE-SEP-ZP3FB6NMPNCO4VBJ
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 10.244.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https -> 172.22.7.89:6443" -j KUBE-SEP-C3WRBSQHCDQ7BT6J
-A KUBE-SVC-TCOU7JCQXEZGVUNU ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns -> 10.244.0.2:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YIL6JZP7A3QYXJU2
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns -> 10.244.0.3:53" -j KUBE-SEP-6E7XQMQ4RAYOWTTM

Check whether anything is listening on port 31532; nothing shows up

$ netstat -lnpt|grep 31532

Start with PREROUTING, which jumps to KUBE-SERVICES

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER

In KUBE-SERVICES, for access from outside the destination address is not 10.106.224.38, so only the last rule matches

-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

KUBE-NODEPORTS: the second rule shows the configured NodePort 31532

-A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -m nfacct --nfacct-name  localhost_nps_accepted_pkts -j KUBE-EXT-2CMXP7HKUVJN7L6M

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -j KUBE-EXT-2CMXP7HKUVJN7L6M

KUBE-EXT-2CMXP7HKUVJN7L6M: look at the second rule first, which jumps to KUBE-SVC-2CMXP7HKUVJN7L6M

-A KUBE-EXT-2CMXP7HKUVJN7L6M -m comment --comment "masquerade traffic for default/nginx external destinations" -j KUBE-MARK-MASQ

-A KUBE-EXT-2CMXP7HKUVJN7L6M -j KUBE-SVC-2CMXP7HKUVJN7L6M

KUBE-SVC-2CMXP7HKUVJN7L6M: the first rule sends packets whose source is not in 10.244.0.0/16 and whose destination is 10.106.224.38 to KUBE-MARK-MASQ; the second jumps to KUBE-SEP-AAOCRVJBUI2XUHEI

-A KUBE-SVC-2CMXP7HKUVJN7L6M ! -s 10.244.0.0/16 -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ

-A KUBE-SVC-2CMXP7HKUVJN7L6M -m comment --comment "default/nginx -> 10.244.0.4:80" -j KUBE-SEP-AAOCRVJBUI2XUHEI

KUBE-SEP-AAOCRVJBUI2XUHEI: the first rule, for source address 10.244.0.4 (the container's own address), jumps to KUBE-MARK-MASQ to set the mark; the second does a DNAT to 10.244.0.4:80, i.e. the nginx container's address

-A KUBE-SEP-AAOCRVJBUI2XUHEI -s 10.244.0.4/32 -m comment --comment "default/nginx" -j KUBE-MARK-MASQ

-A KUBE-SEP-AAOCRVJBUI2XUHEI -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.244.0.4:80
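The DNAT can also be confirmed in the connection-tracking table while a request is in flight (the conntrack tool may need to be installed first; a rough check):

$ apt install conntrack
$ conntrack -L -p tcp | grep 31532   # the original tuple shows dport=31532 while the reply tuple comes from 10.244.0.4:80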

Outbound

Here the iptables rules related to outbound traffic are pulled together.

First, the first POSTROUTING rule jumps to KUBE-POSTROUTING. Packets carrying the 0x4000 mark (traffic forwarded for the Service) match the last two rules there: -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0 flips the mark bit off (shown later as MARK xor 0x4000), and -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully performs the SNAT, replacing the source address and using a random source port. After that, FLANNEL-POSTRTG is traversed without matching anything and processing returns normally.

-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG

-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully

-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

The --set-xmark part here isn't clear to me; I'm not going to chase down exactly what it does for now and will add more if I find relevant material later. By hitting the service from outside and watching which POSTROUTING rules' packet counters increase, we can tell which rules the traffic actually takes.

The book 深入剖析 Kubernetes explains this:

-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE — this rule sits at the POSTROUTING checkpoint; that is, it performs an SNAT on IP packets about to leave the host, replacing the packet's source address with the host's CNI bridge address, or with the host's own IP if there is no CNI bridge. Of course, this SNAT should only be applied to packets forwarded by a Service (otherwise ordinary IP packets would be affected), and iptables decides this by checking whether the packet carries the "0x4000" mark, which, as you'll recall, was set just before the packet was DNATed.

$ while true;do curl 101.200.150.26:31532;done

Comparing before and after: the byte counts on KUBE-POSTROUTING change, FLANNEL-POSTRTG sees very few packets, and the pkts total of 1054 shows up with matching values in both the KUBE-MARK-MASQ and KUBE-POSTROUTING chains.

$ iptables -t nat -L -nvx

-----------------------------------------------------------------
Chain POSTROUTING (policy ACCEPT 132636 packets, 8059995 bytes)
    pkts      bytes target     prot opt in     out     source               destination
  127270  7723952 KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
       0        0 MASQUERADE  0    --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  120845  7318694 FLANNEL-POSTRTG  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* flanneld masq */

Chain FLANNEL-POSTRTG (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4000/0x4000 /* flanneld masq */
   11999   720208 RETURN     0    --  *      *       10.244.0.0/24        10.244.0.0/16        /* flanneld masq */
       0        0 RETURN     0    --  *      *       10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
       0        0 RETURN     0    --  *      *      !10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
      14      908 MASQUERADE  0    --  *      *       10.244.0.0/16       !224.0.0.0/4          /* flanneld masq */ random-fully
       0        0 MASQUERADE  0    --  *      *      !10.244.0.0/16        10.244.0.0/16        /* flanneld masq */ random-fully

Chain KUBE-MARK-MASQ (14 references)
    pkts      bytes target     prot opt in     out     source               destination
       1       64 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
    pkts      bytes target     prot opt in     out     source               destination
    3513   212636 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
       1       64 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
       1       64 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
     
-----------------------------------------------------------------
Chain POSTROUTING (policy ACCEPT 133177 packets, 8092725 bytes)
    pkts      bytes target     prot opt in     out     source               destination
  128864  7824074 KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
       0        0 MASQUERADE  0    --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  121386  7351424 FLANNEL-POSTRTG  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* flanneld masq */

Chain FLANNEL-POSTRTG (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4000/0x4000 /* flanneld masq */
   12051   723328 RETURN     0    --  *      *       10.244.0.0/24        10.244.0.0/16        /* flanneld masq */
       0        0 RETURN     0    --  *      *       10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
       0        0 RETURN     0    --  *      *      !10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
      14      908 MASQUERADE  0    --  *      *       10.244.0.0/16       !224.0.0.0/4          /* flanneld masq */ random-fully
       0        0 MASQUERADE  0    --  *      *      !10.244.0.0/16        10.244.0.0/16        /* flanneld masq */ random-fully

Chain KUBE-MARK-MASQ (14 references)
    pkts      bytes target     prot opt in     out     source               destination
    1054    67456 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
    pkts      bytes target     prot opt in     out     source               destination
    4054   245366 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    1054    67456 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    1054    67456 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

Kube-Proxy ipvs

$ apt install ipvsadm ipset
# manually load the commonly used modules
modprobe ip_vs
modprobe ip_vs_rr    # round-robin scheduler
modprobe ip_vs_wrr   # weighted round-robin
modprobe ip_vs_sh    # source hashing
modprobe nf_conntrack_ipv4  # connection tracking (IPv4)
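Whether the modules are actually loaded can be checked with lsmod; note that on newer kernels nf_conntrack_ipv4 has been merged into nf_conntrack, so the last modprobe above may fail and nf_conntrack should be loaded instead:

$ lsmod | grep -e ip_vs -e nf_conntrack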

Initialize directly with kubeadm

apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///run/cri-dockerd.sock
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
networking:
  serviceSubnet: 10.244.0.0/16
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs

$ kubeadm init --config=init.yml --ignore-preflight-errors=mem
$ ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  172.22.7.89:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.0:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.1:443 rr
  -> 172.22.7.89:6443             Masq    1      3          0
TCP  10.244.0.1:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0
  -> 10.244.0.3:53                Masq    1      0          0
TCP  10.244.0.10:9153 rr
  -> 10.244.0.2:9153              Masq    1      0          0
  -> 10.244.0.3:9153              Masq    1      0          0
TCP  10.244.142.146:80 rr
  -> 10.244.0.4:80                Masq    1      0          0
UDP  10.244.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0
  -> 10.244.0.3:53                Masq    1      0          0

Now let's see what changes in the iptables rules after switching to ipvs

$ iptables -S -t nat > ipvsrules.txt
-----------------------------------------------------------------
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N FLANNEL-POSTRTG
-N KUBE-KUBELET-CANARY
-N KUBE-LOAD-BALANCER
-N KUBE-MARK-MASQ
-N KUBE-NODE-PORT
-N KUBE-POSTROUTING
-N KUBE-SERVICES
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG
-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A KUBE-SERVICES -s 127.0.0.0/8 -j RETURN
-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP src,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
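In ipvs mode the per-Service chains are gone from iptables; the match-sets referenced above can be inspected with ipset (set names as they appear in the rules):

$ ipset list KUBE-CLUSTER-IP      # cluster IP:port members
$ ipset list KUBE-NODE-PORT-TCP   # NodePort members, e.g. 31532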

Awkee/Netfilter-IPTables-Diagrams.md

comparing-kube-proxy-modes-iptables-or-ipvs

Flannel vxlan

A two-node cluster, with two nginx replicas running

$ kubectl get node -o wide
NAME             STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
orange723        Ready    control-plane   3h24m   v1.34.2   172.22.7.89   <none>        Ubuntu 24.04.3 LTS   6.8.0-87-generic   docker://29.0.1
orange723-node   Ready    <none>          30m     v1.34.2   172.22.7.90   <none>        Ubuntu 24.04.3 LTS   6.8.0-87-generic   docker://29.0.1

-----------------------------------------------------------------
$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP           NODE             NOMINATED NODE   READINESS GATES
web-5846888f49-bd5lx   1/1     Running   0          12m    10.244.1.2   orange723-node   <none>           <none>
web-5846888f49-fx7bc   1/1     Running   0          3h6m   10.244.0.4   orange723        <none>           <none>

Capture on the master host's eth0 interface

$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-vxlan.pcap --print

flannel-1-vxlan.pcap

You can see that even while the connection is being established, before any data is sent, a container's packet is wrapped in a UDP datagram carrying VXLAN; the outermost layer is a normal host IP packet.

Here is the route -n output, for comparison with host-gw mode later

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.15.253   0.0.0.0         UG    100    0        0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
100.100.2.136   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
100.100.2.138   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.22.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0
172.22.15.253   0.0.0.0         255.255.255.255 UH    100    0        0 eth0

The Ethernet header is 14 bytes by default: Ethernet_frame
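The VXLAN parameters flannel configured on the flannel.1 device can be inspected directly:

$ ip -d link show flannel.1   # prints the vxlan id, local VTEP address, UDP port and the reduced MTU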

For flannel's vxlan mode, refer to this article: 深入解析容器跨主机网络

Image source: 深入剖析 Kubernetes

Flannel host-gw

Directly edit the kube-flannel-cfg ConfigMap in the kube-flannel namespace, then restart the DaemonSet
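One way to make the edit (the backend type sits under net-conf.json inside the ConfigMap):

$ kubectl -n kube-flannel edit cm kube-flannel-cfg   # change "Type": "vxlan" to "Type": "host-gw" in net-conf.json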

$ kubectl get cm kube-flannel-cfg -o yaml|grep host-gw
        "Type": "host-gw"
$ kubectl rollout restart ds kube-flannel-ds -n kube-flannel

The log shows the Backend type is host-gw

I1117 09:10:11.315220       1 main.go:523] Found network config - Backend type: host-gw

Looking at the routes again: the route to the other machine's 10.244.1.0 now has 172.22.7.90 as the gateway on eth0, and that IP is the other node's address

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.15.253   0.0.0.0         UG    100    0        0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      172.22.7.90     255.255.255.0   UG    0      0        0 eth0
100.100.2.136   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
100.100.2.138   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.22.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0
172.22.15.253   0.0.0.0         255.255.255.255 UH    100    0        0 eth0

Here, pinging 10.244.1.3, a pod on the other machine, does not work. Capturing shows the destination MAC is ee:ff:ff:ff:ff:ff; layer-2 traffic between cloud hosts doesn't go through. The ping to 10.244.1.3 takes the route 10.244.1.0 172.22.7.90 255.255.255.0 UG 0 0 0 eth0. Searching turned up host-gw-in-aliyun: by adding custom routes in the VPC route table, one per host, pointing each host's pod CIDR at that ECS instance, the pods become directly reachable (a sketch of such a route entry appears at the end of this section).

$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-ping-noresponse.pcap --print

flannel-1-ping-noresponse.pcap

$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.244.0.5               ether   46:33:68:6a:43:bb   C                     cni0
10.244.0.6               ether   46:e3:5c:54:ea:35   C                     cni0
10.244.0.7               ether   02:dd:96:54:61:7f   C                     cni0
172.22.7.90              ether   ee:ff:ff:ff:ff:ff   C                     eth0
172.22.15.253            ether   ee:ff:ff:ff:ff:ff   C                     eth0
$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-hostgw-ping.pcap --print

flannel-1-hostgw-ping.pcap
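For reference, the VPC route entries mentioned above could look roughly like this with the Alibaba Cloud CLI (a hypothetical sketch: the route table and instance IDs are placeholders, and one entry is needed per node, pointing that node's pod CIDR at its ECS instance):

# pod CIDR of orange723-node (172.22.7.90) -> next hop is that node's ECS instance
$ aliyun vpc CreateRouteEntry --RouteTableId vtb-xxxxxxxx --DestinationCidrBlock 10.244.1.0/24 --NextHopType Instance --NextHopId i-xxxxxxxx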