Kubernetes Service Traffic Paths

Kubeadm Deployment

$ kubeadm init --cri-socket=unix:///run/cri-dockerd.sock --ignore-preflight-errors=mem --pod-network-cidr=10.244.0.0/16

$ kubectl taint nodes --all node-role.kubernetes.io/control-plane-

Service NodePort

Inbound Traffic

A single-replica nginx Deployment, exposed through a NodePort Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: docker.1ms.run/nginx:stable
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
          name: web
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 31532
  selector:
    app: nginx
$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP           NODE        NOMINATED NODE   READINESS GATES
web-5846888f49-q4f9c   1/1     Running   0          151m   10.244.0.4   orange723   <none>           <none>

$ kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP        150m
nginx        NodePort    10.106.224.38   <none>        80:31532/TCP   146m

Accessing the Service address 10.106.224.38 and the pod address 10.244.0.4 directly both return normally:

$ curl -I 10.106.224.38
HTTP/1.1 200 OK

$ curl -I 10.244.0.4
HTTP/1.1 200 OK

Test access from outside while capturing on the host's eth0, the host's cni0, and the pod's eth0:

$ tcpdump -i eth0 -s0 -X -nn "tcp port 31532" -w eth0.pcap --print

eth0.pcap

$ tcpdump -s0 -X -nn -i cni0 -w cni0.pcap --print

cni0.pcap

$ tcpdump -i eth0 -s0 -X -nn -w nginx.pcap --print

nginx.pcap

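Start with the eth0 capture: the external client establishes a connection with the host NIC and sends GET / right after the three-way handshake. In the cni0 capture the GET / is sent from 10.244.0.1 to 10.244.0.4, where 10.244.0.1 is the address of the cni0 bridge and 10.244.0.4 is the container's address. 10.244.0.4 then sends the response straight back, and the socket port 37475 matches as well. Finally, the capture taken inside the nginx pod shows it talking only to the cni0 address.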

Judging from the captures, the iptables path is: -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES >> -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully. Because no cross-host traffic is involved, the FLANNEL-POSTRTG chain that follows does not change anything for this flow.

Here are the current iptables rules on this machine:

$ iptables -S -t nat > rules.txt
-----------------------------------------------------------------
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-N FLANNEL-POSTRTG
-N KUBE-EXT-2CMXP7HKUVJN7L6M
-N KUBE-KUBELET-CANARY
-N KUBE-MARK-MASQ
-N KUBE-NODEPORTS
-N KUBE-POSTROUTING
-N KUBE-PROXY-CANARY
-N KUBE-SEP-6E7XQMQ4RAYOWTTM
-N KUBE-SEP-AAOCRVJBUI2XUHEI
-N KUBE-SEP-C3WRBSQHCDQ7BT6J
-N KUBE-SEP-IT2ZTR26TO4XFPTO
-N KUBE-SEP-N4G2XR5TDX7PQE7P
-N KUBE-SEP-YIL6JZP7A3QYXJU2
-N KUBE-SEP-ZP3FB6NMPNCO4VBJ
-N KUBE-SEP-ZXMNUKOKXUTL2MK2
-N KUBE-SERVICES
-N KUBE-SVC-2CMXP7HKUVJN7L6M
-N KUBE-SVC-ERIFXISQEP7F7OF4
-N KUBE-SVC-JD5MR3NA4I4DYORP
-N KUBE-SVC-NPX46M4PTMTKRN6Y
-N KUBE-SVC-TCOU7JCQXEZGVUNU
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG
-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A KUBE-EXT-2CMXP7HKUVJN7L6M -m comment --comment "masquerade traffic for default/nginx external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-2CMXP7HKUVJN7L6M -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -m nfacct --nfacct-name  localhost_nps_accepted_pkts -j KUBE-EXT-2CMXP7HKUVJN7L6M
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -j KUBE-EXT-2CMXP7HKUVJN7L6M
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A KUBE-SEP-6E7XQMQ4RAYOWTTM -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-6E7XQMQ4RAYOWTTM -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.244.0.3:53
-A KUBE-SEP-AAOCRVJBUI2XUHEI -s 10.244.0.4/32 -m comment --comment "default/nginx" -j KUBE-MARK-MASQ
-A KUBE-SEP-AAOCRVJBUI2XUHEI -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.244.0.4:80
-A KUBE-SEP-C3WRBSQHCDQ7BT6J -s 172.22.7.89/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-C3WRBSQHCDQ7BT6J -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 172.22.7.89:6443
-A KUBE-SEP-IT2ZTR26TO4XFPTO -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-IT2ZTR26TO4XFPTO -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.244.0.2:53
-A KUBE-SEP-N4G2XR5TDX7PQE7P -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-N4G2XR5TDX7PQE7P -p tcp -m comment --comment "kube-system/kube-dns:metrics" -m tcp -j DNAT --to-destination 10.244.0.2:9153
-A KUBE-SEP-YIL6JZP7A3QYXJU2 -s 10.244.0.2/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-YIL6JZP7A3QYXJU2 -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.244.0.2:53
-A KUBE-SEP-ZP3FB6NMPNCO4VBJ -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZP3FB6NMPNCO4VBJ -p tcp -m comment --comment "kube-system/kube-dns:metrics" -m tcp -j DNAT --to-destination 10.244.0.3:9153
-A KUBE-SEP-ZXMNUKOKXUTL2MK2 -s 10.244.0.3/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZXMNUKOKXUTL2MK2 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.244.0.3:53
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-2CMXP7HKUVJN7L6M ! -s 10.244.0.0/16 -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVC-2CMXP7HKUVJN7L6M -m comment --comment "default/nginx -> 10.244.0.4:80" -j KUBE-SEP-AAOCRVJBUI2XUHEI
-A KUBE-SVC-ERIFXISQEP7F7OF4 ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp -> 10.244.0.2:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-IT2ZTR26TO4XFPTO
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp -> 10.244.0.3:53" -j KUBE-SEP-ZXMNUKOKXUTL2MK2
-A KUBE-SVC-JD5MR3NA4I4DYORP ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SVC-JD5MR3NA4I4DYORP -m comment --comment "kube-system/kube-dns:metrics -> 10.244.0.2:9153" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-N4G2XR5TDX7PQE7P
-A KUBE-SVC-JD5MR3NA4I4DYORP -m comment --comment "kube-system/kube-dns:metrics -> 10.244.0.3:9153" -j KUBE-SEP-ZP3FB6NMPNCO4VBJ
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 10.244.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https -> 172.22.7.89:6443" -j KUBE-SEP-C3WRBSQHCDQ7BT6J
-A KUBE-SVC-TCOU7JCQXEZGVUNU ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns -> 10.244.0.2:53" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YIL6JZP7A3QYXJU2
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns -> 10.244.0.3:53" -j KUBE-SEP-6E7XQMQ4RAYOWTTM
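The kube-dns chains in this dump pick an endpoint with -m statistic --mode random --probability 0.5 on the first rule and no statistic match on the last. A minimal sketch of why that gives an even split, assuming the general pattern is that rule i out of n gets probability 1/(n - i); the function names are made up for illustration:

```python
from fractions import Fraction

def rule_probabilities(n):
    """Probability written on each of the n per-endpoint rules, in order.

    Rule i (0-based) is only reached if the earlier rules did not match,
    so it has to pick 1 out of the (n - i) endpoints that are left.
    The last rule has probability 1, i.e. no statistic match at all.
    """
    return [Fraction(1, n - i) for i in range(n)]

def effective_shares(n):
    """Overall share of new connections that each endpoint receives."""
    shares = []
    remaining = Fraction(1)          # traffic that has not matched a rule yet
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining -= remaining * p
    return shares

print(rule_probabilities(2))   # the kube-dns case: 1/2, then unconditional
print(effective_shares(3))     # with three endpoints every share is equal
```

With a single endpoint, as for default/nginx here, the list collapses to one unconditional rule, which matches the KUBE-SVC-2CMXP7HKUVJN7L6M chain in the dump.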

Check whether anything is listening on port 31532; the output is empty:

$ netstat -lnpt|grep 31532

Start with PREROUTING, which jumps to KUBE-SERVICES:

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER

In KUBE-SERVICES, traffic from outside is not addressed to 10.106.224.38 (or to any other cluster IP), so only the last rule matches:

-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

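KUBE-NODEPORTS: both rules match the configured NodePort 31532, but the first applies only to destinations in 127.0.0.0/8, so external traffic hits the second: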

-A KUBE-NODEPORTS -d 127.0.0.0/8 -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -m nfacct --nfacct-name  localhost_nps_accepted_pkts -j KUBE-EXT-2CMXP7HKUVJN7L6M

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx" -m tcp --dport 31532 -j KUBE-EXT-2CMXP7HKUVJN7L6M

KUBE-EXT-2CMXP7HKUVJN7L6M: the first rule sends every packet through KUBE-MARK-MASQ; the second jumps to KUBE-SVC-2CMXP7HKUVJN7L6M:

-A KUBE-EXT-2CMXP7HKUVJN7L6M -m comment --comment "masquerade traffic for default/nginx external destinations" -j KUBE-MARK-MASQ

-A KUBE-EXT-2CMXP7HKUVJN7L6M -j KUBE-SVC-2CMXP7HKUVJN7L6M

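KUBE-SVC-2CMXP7HKUVJN7L6M: the first rule sends packets whose source is outside 10.244.0.0/16 and whose destination is 10.106.224.38 to KUBE-MARK-MASQ; NodePort traffic is addressed to the node rather than to the cluster IP, so it skips that rule. The second rule jumps to KUBE-SEP-AAOCRVJBUI2XUHEI: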

-A KUBE-SVC-2CMXP7HKUVJN7L6M ! -s 10.244.0.0/16 -d 10.106.224.38/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ

-A KUBE-SVC-2CMXP7HKUVJN7L6M -m comment --comment "default/nginx -> 10.244.0.4:80" -j KUBE-SEP-AAOCRVJBUI2XUHEI

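KUBE-SEP-AAOCRVJBUI2XUHEI: the first rule sends packets whose source is 10.244.0.4 (the container's own address) to KUBE-MARK-MASQ to be marked; the second performs DNAT to 10.244.0.4:80, the address of the nginx container: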

-A KUBE-SEP-AAOCRVJBUI2XUHEI -s 10.244.0.4/32 -m comment --comment "default/nginx" -j KUBE-MARK-MASQ

-A KUBE-SEP-AAOCRVJBUI2XUHEI -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.244.0.4:80
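The chain walk above can be sketched as a small Python model. It is a simplification under stated assumptions: only the default/nginx rules are modelled, "--dst-type LOCAL" is reduced to a comparison with the node address 172.22.7.89, and the client address 203.0.113.10 is a hypothetical documentation address, not one taken from the captures.

```python
import ipaddress

MASQ_MARK = 0x4000
NODE_IP = "172.22.7.89"          # stands in for "--dst-type LOCAL" (simplification)
CLUSTER_IP = "10.106.224.38"
POD_NET = ipaddress.ip_network("10.244.0.0/16")

def walk_prerouting(pkt):
    """Follow a TCP packet through the nat chains for default/nginx.

    pkt is a dict with src, dst, dport and mark. It is modified in place
    and the list of chains visited is returned.
    """
    path = ["PREROUTING", "KUBE-SERVICES"]
    if pkt["dst"] == CLUSTER_IP and pkt["dport"] == 80:
        pass                                    # cluster IP rule: straight to KUBE-SVC
    elif pkt["dst"] == NODE_IP and pkt["dport"] == 31532:
        # last KUBE-SERVICES rule, then the NodePort rule
        path += ["KUBE-NODEPORTS", "KUBE-EXT-2CMXP7HKUVJN7L6M", "KUBE-MARK-MASQ"]
        pkt["mark"] |= MASQ_MARK                # KUBE-EXT marks unconditionally
    else:
        return path                             # no nginx rule matched
    path.append("KUBE-SVC-2CMXP7HKUVJN7L6M")
    if pkt["dst"] == CLUSTER_IP and ipaddress.ip_address(pkt["src"]) not in POD_NET:
        path.append("KUBE-MARK-MASQ")
        pkt["mark"] |= MASQ_MARK
    path.append("KUBE-SEP-AAOCRVJBUI2XUHEI")
    if pkt["src"] == "10.244.0.4":              # the pod reaching itself via the Service
        path.append("KUBE-MARK-MASQ")
        pkt["mark"] |= MASQ_MARK
    pkt["dst"], pkt["dport"] = "10.244.0.4", 80  # DNAT --to-destination
    return path

# Hypothetical external client hitting the NodePort.
pkt = {"src": "203.0.113.10", "dst": NODE_IP, "dport": 31532, "mark": 0}
print(" -> ".join(walk_prerouting(pkt)))
print(pkt)
```

The packet leaves the model with destination 10.244.0.4:80 and the 0x4000 bit set, which is the state the POSTROUTING rules act on later.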

Outbound Traffic

Here are the iptables rules that concern outbound traffic.

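First, POSTROUTING jumps to KUBE-POSTROUTING. Packets that carry the 0x4000 mark (here, the NodePort traffic marked earlier by KUBE-MARK-MASQ) get past the first RETURN rule and match the last two rules: -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0 adjusts the mark, and -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully performs SNAT, rewriting the source address with a randomized source port. Unmarked packets return at the first rule and continue to FLANNEL-POSTRTG, where traffic that matches none of the MASQUERADE rules leaves the host unchanged.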

-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG

-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully

-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
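To make the FLANNEL-POSTRTG rules listed above easier to read, here is a minimal Python sketch that evaluates them in order for a given source, destination and mark. The function name is made up, and 8.8.8.8 in the usage lines is only a stand-in for "some external address".

```python
import ipaddress

CLUSTER_NET = ipaddress.ip_network("10.244.0.0/16")   # whole pod network
NODE_NET = ipaddress.ip_network("10.244.0.0/24")      # this node's pod subnet
MULTICAST = ipaddress.ip_network("224.0.0.0/4")

def flannel_postrtg(src, dst, mark=0):
    """Return (rule number, target) of the first FLANNEL-POSTRTG rule that matches."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    rules = [
        ((mark & 0x4000) == 0x4000, "RETURN"),             # already marked by kube-proxy
        (s in NODE_NET and d in CLUSTER_NET, "RETURN"),     # local pod -> any pod
        (s in CLUSTER_NET and d in NODE_NET, "RETURN"),     # any pod -> local pod
        (s not in CLUSTER_NET and d in NODE_NET, "RETURN"), # non-pod source -> local pod
        (s in CLUSTER_NET and d not in MULTICAST, "MASQUERADE"),
        (s not in CLUSTER_NET and d in CLUSTER_NET, "MASQUERADE"),
    ]
    for number, (matched, target) in enumerate(rules, start=1):
        if matched:
            return number, target
    return None, "RETURN"     # fell off the end of the chain

print(flannel_postrtg("10.244.0.4", "8.8.8.8"))       # pod to an external address
print(flannel_postrtg("10.244.0.4", "10.244.1.2"))    # pod to a pod on another node
print(flannel_postrtg("172.22.7.89", "10.244.0.4"))   # host to a local pod
```

In this model only pod traffic leaving the pod network, or non-pod traffic heading to a pod on another node, reaches a MASQUERADE rule.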

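I am not clear on what --set-xmark does here and will not dig into it for now; I will add details later if I find relevant material. Instead, access the service from the external network and watch which POSTROUTING rule's packet counter increases, which shows which rule the traffic takes.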
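For reference, iptables-extensions(8) describes --set-xmark value/mask as zeroing the bits given by mask and then XOR-ing value into the packet mark. A minimal sketch of that arithmetic applied to the two MARK rules in this dump (the helper name is made up, and a 32-bit mark is assumed):

```python
def set_xmark(mark, value, mask):
    """MARK --set-xmark value/mask: clear the bits in mask, then XOR in value."""
    return (mark & ~mask & 0xFFFFFFFF) ^ value

mark = 0
mark = set_xmark(mark, 0x4000, 0x4000)   # KUBE-MARK-MASQ, listed as "MARK or 0x4000"
print(hex(mark))
mark = set_xmark(mark, 0x4000, 0x0)      # KUBE-POSTROUTING, listed as "MARK xor 0x4000"
print(hex(mark))
```

Under this reading, KUBE-MARK-MASQ sets the 0x4000 bit, and the KUBE-POSTROUTING rule, which is only reached by packets that already carry the bit, flips it back off just before MASQUERADE.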

深入剖析 Kubernetes has an explanation:

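-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE This rule sits at the POSTROUTING hook; in other words, it performs an SNAT on IP packets that are about to leave this host, replacing the packet's source address with the address of the CNI bridge on this host, or with the host's own IP address (if there is no CNI bridge). Of course, this SNAT only needs to be applied to IP packets forwarded by a Service (otherwise ordinary IP packets would be affected). iptables makes that decision by checking whether the IP packet carries the "0x4000" mark. As you may recall, this mark is put on the packet before the DNAT is performed.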

$ while true;do curl 101.200.150.26:31532;done

Comparing before and after, the byte counters in KUBE-POSTROUTING change, FLANNEL-POSTRTG sees few packets, and the pkts total of 1054 shows up in both the KUBE-MARK-MASQ and KUBE-POSTROUTING chains.

$ iptables -t nat -L -nvx

-----------------------------------------------------------------
Chain POSTROUTING (policy ACCEPT 132636 packets, 8059995 bytes)
    pkts      bytes target     prot opt in     out     source               destination
  127270  7723952 KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
       0        0 MASQUERADE  0    --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  120845  7318694 FLANNEL-POSTRTG  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* flanneld masq */

Chain FLANNEL-POSTRTG (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4000/0x4000 /* flanneld masq */
   11999   720208 RETURN     0    --  *      *       10.244.0.0/24        10.244.0.0/16        /* flanneld masq */
       0        0 RETURN     0    --  *      *       10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
       0        0 RETURN     0    --  *      *      !10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
      14      908 MASQUERADE  0    --  *      *       10.244.0.0/16       !224.0.0.0/4          /* flanneld masq */ random-fully
       0        0 MASQUERADE  0    --  *      *      !10.244.0.0/16        10.244.0.0/16        /* flanneld masq */ random-fully

Chain KUBE-MARK-MASQ (14 references)
    pkts      bytes target     prot opt in     out     source               destination
       1       64 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
    pkts      bytes target     prot opt in     out     source               destination
    3513   212636 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
       1       64 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
       1       64 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
     
-----------------------------------------------------------------
Chain POSTROUTING (policy ACCEPT 133177 packets, 8092725 bytes)
    pkts      bytes target     prot opt in     out     source               destination
  128864  7824074 KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
       0        0 MASQUERADE  0    --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  121386  7351424 FLANNEL-POSTRTG  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* flanneld masq */

Chain FLANNEL-POSTRTG (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4000/0x4000 /* flanneld masq */
   12051   723328 RETURN     0    --  *      *       10.244.0.0/24        10.244.0.0/16        /* flanneld masq */
       0        0 RETURN     0    --  *      *       10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
       0        0 RETURN     0    --  *      *      !10.244.0.0/16        10.244.0.0/24        /* flanneld masq */
      14      908 MASQUERADE  0    --  *      *       10.244.0.0/16       !224.0.0.0/4          /* flanneld masq */ random-fully
       0        0 MASQUERADE  0    --  *      *      !10.244.0.0/16        10.244.0.0/16        /* flanneld masq */ random-fully

Chain KUBE-MARK-MASQ (14 references)
    pkts      bytes target     prot opt in     out     source               destination
    1054    67456 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
    pkts      bytes target     prot opt in     out     source               destination
    4054   245366 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    1054    67456 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    1054    67456 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
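The before/after comparison can be checked with a few lines of Python. The counters below are copied from the two listings above; the dictionary keys are just labels chosen for this sketch.

```python
# pkts counters copied from the two "iptables -t nat -L -nvx" listings
before = {
    "POSTROUTING -> KUBE-POSTROUTING": 127270,
    "POSTROUTING -> FLANNEL-POSTRTG": 120845,
    "KUBE-MARK-MASQ MARK": 1,
    "KUBE-POSTROUTING RETURN": 3513,
    "KUBE-POSTROUTING MASQUERADE": 1,
}
after = {
    "POSTROUTING -> KUBE-POSTROUTING": 128864,
    "POSTROUTING -> FLANNEL-POSTRTG": 121386,
    "KUBE-MARK-MASQ MARK": 1054,
    "KUBE-POSTROUTING RETURN": 4054,
    "KUBE-POSTROUTING MASQUERADE": 1054,
}

delta = {rule: after[rule] - before[rule] for rule in before}
for rule, n in delta.items():
    print(f"{rule:35} +{n}")

# Every packet entering KUBE-POSTROUTING is either returned or masqueraded.
entered = delta["POSTROUTING -> KUBE-POSTROUTING"]
returned = delta["KUBE-POSTROUTING RETURN"]
masqueraded = delta["KUBE-POSTROUTING MASQUERADE"]
print(entered == returned + masqueraded)
```

In these numbers the FLANNEL-POSTRTG jump grows by exactly the RETURN delta, which suggests the masqueraded packets stop at KUBE-POSTROUTING and only the returned ones go on to the flannel chain. Since the nat table is consulted for the first packet of a connection, the MASQUERADE delta roughly counts new connections opened by the curl loop.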

Kube-Proxy ipvs

$ apt install ipvsadm ipset
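# Manually load the commonly used modules
modprobe ip_vs
modprobe ip_vs_rr    # round-robin scheduling
modprobe ip_vs_wrr   # weighted round-robin
modprobe ip_vs_sh    # source hashing
modprobe nf_conntrack  # connection tracking (named nf_conntrack_ipv4 on kernels before 4.19)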

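Initialize directly with kubeadm. The config sets serviceSubnet to 10.244.0.0/16, so the cluster IPs in the ipvsadm output fall in that range rather than in the default 10.96.0.0/12 seen earlier: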

apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///run/cri-dockerd.sock
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
networking:
  serviceSubnet: 10.244.0.0/16
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs

$ kubeadm init --config=init.yml --ignore-preflight-errors=mem
$ ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  172.22.7.89:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.0:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.1:443 rr
  -> 172.22.7.89:6443             Masq    1      3          0
TCP  10.244.0.1:31532 rr
  -> 10.244.0.4:80                Masq    1      0          0
TCP  10.244.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0
  -> 10.244.0.3:53                Masq    1      0          0
TCP  10.244.0.10:9153 rr
  -> 10.244.0.2:9153              Masq    1      0          0
  -> 10.244.0.3:9153              Masq    1      0          0
TCP  10.244.142.146:80 rr
  -> 10.244.0.4:80                Masq    1      0          0
UDP  10.244.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0
  -> 10.244.0.3:53                Masq    1      0          0

Now look at how the iptables rules change after switching to ipvs:

$ iptables -S -t nat > ipvsrules.txt
-----------------------------------------------------------------
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N FLANNEL-POSTRTG
-N KUBE-KUBELET-CANARY
-N KUBE-LOAD-BALANCER
-N KUBE-MARK-MASQ
-N KUBE-NODE-PORT
-N KUBE-POSTROUTING
-N KUBE-SERVICES
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG
-A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
-A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE --random-fully
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A KUBE-SERVICES -s 127.0.0.0/8 -j RETURN
-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP src,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT

Awkee/Netfilter-IPTables-Diagrams.md

comparing-kube-proxy-modes-iptables-or-ipvs

Flannel vxlan

A two-node cluster, with two nginx replicas running:

$ kubectl get node -o wide
NAME             STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
orange723        Ready    control-plane   3h24m   v1.34.2   172.22.7.89   <none>        Ubuntu 24.04.3 LTS   6.8.0-87-generic   docker://29.0.1
orange723-node   Ready    <none>          30m     v1.34.2   172.22.7.90   <none>        Ubuntu 24.04.3 LTS   6.8.0-87-generic   docker://29.0.1

-----------------------------------------------------------------
$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP           NODE             NOMINATED NODE   READINESS GATES
web-5846888f49-bd5lx   1/1     Running   0          12m    10.244.1.2   orange723-node   <none>           <none>
web-5846888f49-fx7bc   1/1     Running   0          3h6m   10.244.0.4   orange723        <none>           <none>

Capture on eth0 on the master host:

$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-vxlan.pcap --print

flannel-1-vxlan.pcap

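The capture shows that even while the connection is being established, before any data is sent, a container's packet is wrapped in a UDP layer that carries VXLAN, and the outermost layer is an ordinary IP packet between the hosts.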

Here is the output of route -n, for comparison with host-gw mode later:

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.15.253   0.0.0.0         UG    100    0        0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
100.100.2.136   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
100.100.2.138   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.22.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0
172.22.15.253   0.0.0.0         255.255.255.255 UH    100    0        0 eth0

The Ethernet header is 14 bytes by default; see Ethernet_frame.
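The header sizes add up as follows. This is a small sketch, assuming an eth0 MTU of 1500 and VNI 1 (neither value is shown in the captures above, so treat both as assumptions), with the 8-byte VXLAN header laid out as in RFC 7348:

```python
import struct

OUTER_IPV4 = 20       # outer IP header without options
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14   # the encapsulated frame keeps its own Ethernet header

def vxlan_inner_mtu(underlay_mtu):
    """Largest inner IP packet that still fits into one underlay packet."""
    return underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET)

def vxlan_header(vni):
    """8-byte VXLAN header: flags (I bit set), 24 reserved bits, 24-bit VNI, 8 reserved bits."""
    return struct.pack("!II", 0x08 << 24, vni << 8)

def parse_vni(header):
    """Extract the VNI from an 8-byte VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header)
    return vni_word >> 8

print(vxlan_inner_mtu(1500))      # assumed eth0 MTU
print(vxlan_header(1).hex())      # assumed VNI 1
print(parse_vni(vxlan_header(1)))
```

The 50 bytes of overhead are why a VXLAN interface such as flannel.1 is usually configured with a smaller MTU than the underlying NIC.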

For flannel's vxlan mode, see the article 深入解析容器跨主机网络.

Image source: 深入剖析 Kubernetes

Flannel host-gw

Edit the kube-flannel-cfg ConfigMap in the kube-flannel namespace directly, then restart the DaemonSet:

$ kubectl get cm kube-flannel-cfg -n kube-flannel -o yaml|grep host-gw
        "Type": "host-gw"
$ kubectl rollout restart ds kube-flannel-ds -n kube-flannel

The log shows that the backend type is host-gw:

I1117 09:10:11.315220       1 main.go:523] Found network config - Backend type: host-gw

Looking at the routes again, the gateway for 10.244.1.0, the other machine's pod subnet, is now 172.22.7.90 via eth0; that IP is the other node's address:

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.15.253   0.0.0.0         UG    100    0        0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      172.22.7.90     255.255.255.0   UG    0      0        0 eth0
100.100.2.136   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
100.100.2.138   172.22.15.253   255.255.255.255 UGH   100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.22.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0
172.22.15.253   0.0.0.0         255.255.255.255 UH    100    0        0 eth0
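To see which entry of this table a packet to a pod on the other node selects, here is a minimal longest-prefix-match sketch in Python. The routes are transcribed from the output above with the netmasks written as prefix lengths; the /32 host routes are left out for brevity, and the lookup function is illustrative rather than how the kernel implements it.

```python
import ipaddress

ROUTES = [  # (destination, gateway, interface)
    ("0.0.0.0/0", "172.22.15.253", "eth0"),
    ("10.244.0.0/24", "0.0.0.0", "cni0"),
    ("10.244.1.0/24", "172.22.7.90", "eth0"),
    ("172.17.0.0/16", "0.0.0.0", "docker0"),
    ("172.22.0.0/20", "0.0.0.0", "eth0"),
]

def lookup(dst):
    """Return (gateway, interface) of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for cidr, gateway, iface in ROUTES:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    return best[1], best[2]

print(lookup("10.244.1.3"))   # pod on the other node
print(lookup("10.244.0.4"))   # local pod, delivered on cni0
```

With host-gw the inner packet is not encapsulated: the IP destination stays the pod address, and only the Ethernet frame is addressed to the gateway 172.22.7.90, which is why layer-2 reachability between the nodes matters.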

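Pinging 10.244.1.3, a pod on the other machine, fails. A capture shows that the destination MAC is ee:ff:ff:ff:ff:ff: on cloud hosts the instances have no real layer-2 connectivity to each other. The ping to 10.244.1.3 uses the route 10.244.1.0 172.22.7.90 255.255.255.0 UG 0 0 0 eth0. Searching turned up host-gw-in-aliyun: add custom route entries to the VPC route table, one per host (two here), each with a host's pod CIDR as the destination and that ECS instance as the next hop. After that the pods can reach each other directly.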

$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-ping-noresponse.pcap --print

flannel-1-ping-noresponse.pcap

$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.244.0.5               ether   46:33:68:6a:43:bb   C                     cni0
10.244.0.6               ether   46:e3:5c:54:ea:35   C                     cni0
10.244.0.7               ether   02:dd:96:54:61:7f   C                     cni0
172.22.7.90              ether   ee:ff:ff:ff:ff:ff   C                     eth0
172.22.15.253            ether   ee:ff:ff:ff:ff:ff   C                     eth0
$ tcpdump -s0 -X -nn -i eth0 "tcp port not 22" -w flannel-1-hostgw-ping.pcap --print

flannel-1-hostgw-ping.pcap