An Analysis of Flannel Networking in Kubernetes

  container, kubernetes

Flannel is an open source CNI network plug-in from CoreOS. As shown in the figure below, Flannel's website provides a schematic diagram of a packet going through encapsulation, transmission, and decapsulation. The diagram shows that the docker0 bridges of the two machines are on different subnets: 10.1.20.1/24 and 10.1.15.1/24 respectively. When the Web App Frontend1 pod (10.1.15.2) connects to the Backend Service2 pod (10.1.20.3) on another host, the packet travels from host 192.168.0.100 to host 192.168.0.200: the inner container packet is encapsulated in a UDP packet of the host, with the host's IP and MAC addresses in the outer layer. This is a classic overlay network: because a container's IP is an internal address that cannot be used to communicate across hosts, the container network must be carried on top of the host network.

Flannel supports a variety of network modes; commonly used ones are vxlan, udp, host-gw, ipip, and cloud-specific backends such as gce and aliyun. The difference between vxlan and udp is that vxlan encapsulates packets in the kernel, while udp encapsulates them in the user-space flanneld process, so udp mode performs slightly worse. host-gw is a host-gateway mode: the gateway for reaching containers on another host is set to that host's NIC address. This is very similar to Calico, except that Calico announces routes via BGP, whereas host-gw distributes them through a central etcd. host-gw is therefore a direct-routing mode that needs no overlay encapsulation and decapsulation, so its performance is relatively high. Its biggest disadvantage is that all hosts must be on the same layer-2 network: the next hop has to be reachable in the neighbor table, otherwise traffic cannot pass.

In real production environments, vxlan is the most commonly used mode. We first look at how it works and then walk through the implementation in the source code.

The installation process is very simple, mainly divided into two steps:

The first step is to install flannel.

Install it with yum install flannel, or deploy it as a Kubernetes DaemonSet, and configure the etcd address for flanneld.

The second step is to configure the cluster network

curl -L http://etcdurl:2379/v2/keys/flannel/network/config -XPUT -d value="{\"Network\":\"172.16.0.0/16\",\"SubnetLen\":24,\"Backend\":{\"Type\":\"vxlan\",\"VNI\":1}}"
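For readability, the escaped value in the command above decodes to the following JSON: a /16 pool carved into /24 subnets, handled by the vxlan backend on VNI 1.

```json
{
  "Network": "172.16.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan",
    "VNI": 1
  }
}
```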

Then start the flanneld program on each node.

I. Working Principle

1. How container addresses are allocated

When a Docker container starts, it is assigned an IP address through docker0. flannel assigns an IP segment to each machine and configures it on docker0; after a container starts, it picks an unoccupied IP within that segment. So how does flannel modify docker0's segment?

First look at flannel's startup unit file /usr/lib/systemd/system/flanneld.service.

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/flanneld
ExecStart=/usr/bin/flanneld-start $FLANNEL_OPTIONS
ExecStartPost=/opt/flannel/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker

The unit file specifies flanneld's environment file and, via ExecStartPost, a post-start script, mk-docker-opts.sh. The role of this script is to generate /run/flannel/docker, whose content looks like this:

DOCKER_OPT_BIP="--bip=10.251.81.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=1450"
DOCKER_NETWORK_OPTIONS=" --bip=10.251.81.1/24 --ip-masq=false --mtu=1450"

This file is referenced by docker's unit file /usr/lib/systemd/system/docker.service.

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/flannel/docker
EnvironmentFile=-/etc/sysconfig/docker

This sets up the docker0 bridge with the flannel-assigned subnet.

In the development environment, there are three machines that are assigned the following network segments:

host-139.245 10.254.44.1/24

host-139.246 10.254.60.1/24

host-139.247 10.254.50.1/24
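The allocation above can be illustrated with a minimal sketch (not flannel's actual allocator): a /16 pool with SubnetLen 24 contains 256 candidate /24 subnets, and each host leases one of them. The `leases` function below is hypothetical, for illustration only.

```go
package main

import "fmt"

// leases enumerates the candidate /24 subnets inside a /16 pool,
// e.g. the 10.254.0.0/16 pool used in the environment above.
// This is an illustrative sketch, not flannel's real allocation code.
func leases(prefix16 string) []string {
	subnets := make([]string, 0, 256)
	for third := 0; third < 256; third++ {
		subnets = append(subnets, fmt.Sprintf("%s.%d.0/24", prefix16, third))
	}
	return subnets
}

func main() {
	pool := leases("10.254")
	fmt.Println(len(pool)) // 256 candidate /24 leases
	fmt.Println(pool[44])  // 10.254.44.0/24, the lease held by host-139.245
}
```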

2. How containers communicate

The previous section described how each container gets an IP; how, then, do containers on different hosts communicate? Let's take the most common mode, vxlan, as an example. There are three key elements: a route, an ARP entry, and an FDB entry. We will analyze the role of each by following a packet as it leaves a container. A packet from the container first passes through docker0. Should it then be sent out directly over the host network, or forwarded through vxlan encapsulation? That is decided by the routing table on each machine:

# ip route show dev flannel.1
10.254.50.0/24 via 10.254.50.0 onlink
10.254.60.0/24 via 10.254.60.0 onlink

Each host has a route to the subnets of the other two machines. These are onlink routes: the onlink flag asserts that the gateway is "on the link" even though there is no link-layer route to it; without it, Linux would refuse to add a route whose gateway lies in a different subnet. With these routes in place, packets destined for remote containers are handed to the flannel.1 device.

flannel.1, a virtual VXLAN device, then processes the packet. But this raises another question: what is the MAC address of that gateway? Since the gateway was set via onlink, flannel issues the MAC address itself; check the ARP table:

# ip neigh show dev flannel.1
10.254.50.0 lladdr ba:10:0e:7b:74:89 PERMANENT
10.254.60.0 lladdr 92:f3:c8:b2:6e:f0 PERMANENT

These entries give the MAC address corresponding to each gateway, with which the inner packet is encapsulated.

This leaves the last question: what is the destination IP of the outgoing (outer) packet? In other words, to which host should the encapsulated packet be sent? One option would be to broadcast every packet, and vxlan's default implementation did indeed broadcast on first contact, but flannel again takes a shortcut and directly programs the forwarding table (FDB):

# bridge fdb show dev flannel.1
92:f3:c8:b2:6e:f0 dst 10.100.139.246 self permanent
ba:10:0e:7b:74:89 dst 10.100.139.247 self permanent

In this way, the destination host IP for each MAC address can be looked up.
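Putting the three tables together, the path from a container subnet to the outer destination IP is a chain of lookups. The maps and the `outerDst` helper below are a hypothetical in-memory model mirroring the ip route, ip neigh and bridge fdb output shown above; they are not flannel code.

```go
package main

import "fmt"

// A hypothetical in-memory model of the three tables flannel programs
// for flannel.1: the route table (container subnet -> gateway), the ARP
// table (gateway IP -> VTEP MAC), and the FDB (VTEP MAC -> host IP).
var (
	routes = map[string]string{
		"10.254.50.0/24": "10.254.50.0",
		"10.254.60.0/24": "10.254.60.0",
	}
	arp = map[string]string{
		"10.254.50.0": "ba:10:0e:7b:74:89",
		"10.254.60.0": "92:f3:c8:b2:6e:f0",
	}
	fdb = map[string]string{
		"92:f3:c8:b2:6e:f0": "10.100.139.246",
		"ba:10:0e:7b:74:89": "10.100.139.247",
	}
)

// outerDst chains the three lookups a packet to a remote subnet goes through.
func outerDst(subnet string) string {
	gw := routes[subnet] // 1. route: which gateway handles this subnet
	mac := arp[gw]       // 2. ARP: the gateway's VTEP MAC
	return fdb[mac]      // 3. FDB: the host that owns that MAC
}

func main() {
	fmt.Println(outerDst("10.254.50.0/24")) // 10.100.139.247
}
```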

There is one more thing to note here: both the ARP and FDB entries are PERMANENT, which indicates that the records are maintained manually (by flanneld) rather than learned. Traditionally, ARP discovers a neighbor by broadcasting; if the peer's reply is received, the entry is marked reachable. Once the reachable timeout expires, the entry becomes stale, then passes through the delay and probe states for re-verification; if probing fails, it is marked failed. This background on ARP matters because old versions of flannel (0.7.x) did not use the approach described above, but a scheme based on temporary ARP entries installed in the reachable state. That meant that if flanneld stayed down longer than the reachable timeout, container networking on that machine was interrupted. Briefly, in those versions: to resolve a neighbor's address, the kernel first sends the number of unicast ARP probes configured in

/proc/sys/net/ipv4/neigh/$NIC/ucast_solicit

and once those are exhausted, it hands the ARP query over to user space, as controlled by

/proc/sys/net/ipv4/neigh/$NIC/app_solicit

Previous versions of flannel used this feature by setting

# cat   /proc/sys/net/ipv4/neigh/flannel.1/app_solicit
3

Thus flanneld receives the L3MISS events the kernel sends to user space and, using the data in etcd, returns the MAC address corresponding to the IP and installs it in the reachable state. From this analysis it can be seen that in those versions, if the flanneld process exited, communication between containers was interrupted, which requires attention. Flannel's startup process is shown in the following figure:

On startup, flannel executes newSubnetManager, which creates the backing data store. Two kinds of backends are currently supported: the default is the etcd store; if flannel is started with the "kube-subnet-mgr" parameter, the Kubernetes API is used to store the data.

The specific code is as follows:

func newSubnetManager() (subnet.Manager, error) {
    if opts.kubeSubnetMgr {
        return kube.NewSubnetManager(opts.kubeApiUrl, opts.kubeConfigFile)
    }

    cfg := &etcdv2.EtcdConfig{
        Endpoints: strings.Split(opts.etcdEndpoints, ","),
        Keyfile:   opts.etcdKeyfile,
        Certfile:  opts.etcdCertfile,
        CAFile:    opts.etcdCAFile,
        Prefix:    opts.etcdPrefix,
        Username:  opts.etcdUsername,
        Password:  opts.etcdPassword,
    }

    // Attempt to renew the lease for the subnet specified in the subnetFile
    prevSubnet := ReadCIDRFromSubnetFile(opts.subnetFile, "FLANNEL_SUBNET")

    return etcdv2.NewLocalManager(cfg, prevSubnet)
}

Through the SubnetManager, combined with the etcd data configured during deployment as described above, flannel obtains the network configuration, mainly the backend type and the subnet pool. For vxlan, the corresponding backend manager is created through NewManager. A simple factory pattern is used here: each backend manager registers its constructor at initialization, via init.

For vxlan:

func init() {
    backend.Register("vxlan", New)
}

and for udp:

func init() {
    backend.Register("udp", New)
}

The other backends are similar: each registers its constructor into a map, so the manager matching the network mode configured in etcd can be created and enabled.
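The registration mechanism can be sketched in a few lines. Backend, Register and the constructors map below are simplified stand-ins for flannel's real types, intended only to show the factory pattern at work.

```go
package main

import "fmt"

// Backend is a simplified stand-in for flannel's backend interface.
type Backend interface{ Name() string }

type vxlan struct{}

func (vxlan) Name() string { return "vxlan" }

// constructors maps a backend type name to its constructor.
var constructors = map[string]func() Backend{}

// Register stores a constructor under its backend name.
func Register(name string, ctor func() Backend) { constructors[name] = ctor }

// Each backend registers itself at init time, as flannel's backends do.
func init() { Register("vxlan", func() Backend { return vxlan{} }) }

func main() {
	// Look up the backend named in the (etcd-provided) config.
	ctor, ok := constructors["vxlan"]
	if !ok {
		panic("unknown backend type in config")
	}
	fmt.Println(ctor().Name()) // vxlan
}
```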

3. Registering the network

RegisterNetwork first creates the flannel.[VNI] device; the default VXLAN ID is 1, hence flannel.1. It then registers a lease with etcd and obtains the corresponding subnet. There is a detail here: old versions of flannel acquired a new subnet every time they started, while new versions traverse the leases already registered in etcd, find the subnet previously allocated to this host, and continue to use it.

Finally, the local subnet file is written through WriteSubnetFile:

# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.254.0.0/16
FLANNEL_SUBNET=10.254.44.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

Docker's network is configured from this file. Careful readers may notice that the MTU here is not the 1500 bytes standard for Ethernet: the outer vxlan encapsulation takes up 50 bytes of each packet.
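The 50 bytes can be accounted for header by header; a small sketch of the arithmetic, assuming IPv4 as the outer protocol:

```go
package main

import "fmt"

// vxlanOverhead returns the per-packet cost of VXLAN encapsulation over
// IPv4: the inner Ethernet frame is wrapped in a VXLAN header, a UDP
// header, and an outer IPv4 header, and the whole outer packet must fit
// in the physical NIC's 1500-byte MTU.
func vxlanOverhead() int {
	const (
		outerIPv4     = 20 // outer IPv4 header
		outerUDP      = 8  // outer UDP header
		vxlanHeader   = 8  // VXLAN flags + VNI
		innerEthernet = 14 // inner Ethernet header carried as payload
	)
	return outerIPv4 + outerUDP + vxlanHeader + innerEthernet
}

func main() {
	fmt.Println(vxlanOverhead())        // 50
	fmt.Println(1500 - vxlanOverhead()) // 1450, the MTU set on docker0
}
```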

Of course, flanneld must keep watching the data in etcd after it starts: when flannel nodes are added, removed, or changed, the other nodes dynamically update their three tables. The main handling is in handleSubnetEvents:

func (nw *network) handleSubnetEvents(batch []subnet.Event) {
    ...
    switch event.Type {
    // A new subnet was added (a new host joined)
    case subnet.EventAdded:
        ...
        // Update the routing table
        if err := netlink.RouteReplace(&directRoute); err != nil {
            log.Errorf("Error adding route to %v via %v: %v", sn, attrs.PublicIP, err)
            continue
        }
        // Add the ARP entry
        log.V(2).Infof("adding subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
        if err := nw.dev.AddARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("AddARP failed: ", err)
            continue
        }
        // Add the FDB entry
        if err := nw.dev.AddFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("AddFDB failed: ", err)
            // Roll back the ARP entry if the FDB update failed
            if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelARP failed: ", err)
            }
            continue
        }
    // A subnet was removed (a host left)
    case subnet.EventRemoved:
        // Delete the route
        if err := netlink.RouteDel(&directRoute); err != nil {
            log.Errorf("Error deleting route to %v via %v: %v", sn, attrs.PublicIP, err)
        } else {
            log.V(2).Infof("removing subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
            // Delete the ARP entry
            if err := nw.dev.DelARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelARP failed: ", err)
            }
            // Delete the FDB entry
            if err := nw.dev.DelFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelFDB failed: ", err)
            }
            if err := netlink.RouteDel(&vxlanRoute); err != nil {
                log.Errorf("failed to delete vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
            }
        }
    default:
        log.Error("internal error: unknown event type: ", int(event.Type))
    }
}

In this way, the addition or removal of any host is sensed by the other flannel nodes, which update their local kernel forwarding tables accordingly.

Author: Chen Xiaoyu

Source:Yixin Institute of Technology