Background of Container Cloud
With the rise of microservice architecture and of open-source microservice frameworks such as Dubbo and Spring Cloud, many business lines within Yixin have gradually migrated from monolithic to microservice architectures. Applications have moved from stateful to stateless; in particular, business state such as sessions and user data is now stored in middleware services.
Although splitting a system into microservices reduces the complexity of each individual service, the number of service instances has grown explosively, which makes operations harder on two fronts: service deployment and upgrades on one hand, and service monitoring and fault recovery on the other.
In 2016, container technology, especially Docker, rapidly gained popularity, and the company began trying to run services in containers. Although containers solved the service publishing problem, operating a large number of containers remained difficult. Yixin is a financial technology company, and when introducing open-source components, stability and reliability are the most important criteria. Kubernetes matured in early 2017, became the de facto standard for container management, and was adopted by many companies at home and abroad. Against this background, Yixin drew on the open-source community and commercial PaaS products and developed a container management platform based on Kubernetes.
The whole architecture is built around Kubernetes and divided into four layers. The bottom layer provides basic resources: network, compute, and storage. All containers run on physical servers, mount commercial NAS storage, and are interconnected through a VXLAN network. The core of the middle layer is resource scheduling, which handles multi-cluster management, release and deployment, intelligent scheduling, automatic scaling, and so on. On its left is system security, covering both platform security and container image security; on its right is a pipeline for automatic code compilation, image building, and deployment. The middleware layer provides common middleware services, Nginx configuration, monitoring and alerting, etc. The top layer is the user access layer. The overall architecture is shown in the following figure:
Nginx self-service management
Most of the company's services are exposed through Nginx reverse proxies. For service isolation and load balancing, there are more than ten Nginx clusters, with different versions and configuration styles, so manual operation was very costly and error-prone. Moreover, container IP addresses are not fixed and cannot be configured directly as Nginx backends. We therefore developed an Nginx management system, mainly to solve templated Nginx configuration, as shown in the following figure:
nginx-mgr provides an HTTP API and is responsible for receiving Nginx configuration requests and writing them to etcd; each nginx-agent watches etcd and refreshes the Nginx configuration in batches. In the actual production environment we deploy Tengine, Alibaba's open-source fork of Nginx, but make no distinction between the two since the configuration is basically the same. A health check is configured for each service to ensure automatic switchover when a backend fails. For virtual machine scenarios that require manual switching, the following figure shows the page for manually switching Nginx:
Since many services run as a mix of virtual machines and containers, when the backend is a container we can dynamically refresh its IP address through the Kubernetes API.
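As a minimal sketch of the agent side (the function name, template, and health-check parameters are hypothetical, not nginx-mgr's actual format), an upstream block can be rendered from the pod IPs returned by the Kubernetes API:

```python
def render_upstream(name, pod_ips, port):
    """Render an nginx upstream block from pod IPs fetched via the K8s API.

    In production the IPs would come from e.g. the Endpoints object of a
    Service; here they are passed in directly for illustration.
    """
    servers = "".join(
        f"    server {ip}:{port} max_fails=3 fail_timeout=10s;\n"
        for ip in pod_ips
    )
    return f"upstream {name} {{\n{servers}}}\n"

print(render_upstream("demo-svc", ["10.244.1.5", "10.244.2.9"], 8080))
```

The agent would write this rendered block into the Nginx (Tengine) config directory and reload, which is what the batch refresh through etcd amounts to.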
Although Kubernetes itself uses a highly available deployment architecture to avoid single points of failure, this is far from enough. On the one hand, a single Kubernetes cluster is deployed in one data center, so a data-center-level failure would interrupt service. On the other hand, a failure of the cluster itself, for example an entire cluster network brought down by a misconfiguration, would affect normal business operation. Yixin therefore deploys Kubernetes in multiple data centers interconnected by dedicated lines. Managing many clusters then becomes the main difficulty. The first problem is resource allocation: when a user chooses to deploy to multiple clusters, the system determines how many containers each cluster receives according to each cluster's resource usage, while ensuring that every cluster has at least one container. When clusters scale automatically, containers are also created and reclaimed according to this ratio. The second is failover: the cluster controller in the figure handles automatic scaling across clusters and container migration when a cluster fails. The controller probes each cluster's nodes periodically; if multiple probes fail, it triggers container migration to keep services running reliably.
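The allocation rule described above can be sketched as follows (a simplified illustration, not the platform's actual code): distribute N replicas across clusters in proportion to their free resources, guaranteeing at least one replica per cluster, with largest-remainder rounding:

```python
def allocate(total, free_resources):
    """Split `total` replicas across clusters proportionally to free
    resources, guaranteeing each cluster at least one replica."""
    n = len(free_resources)
    if total < n:
        raise ValueError("need at least one replica per cluster")
    alloc = [1] * n              # the guaranteed replica per cluster
    remaining = total - n
    total_free = sum(free_resources) or 1
    # Proportional shares, rounded with the largest-remainder method.
    shares = [remaining * f / total_free for f in free_resources]
    base = [int(s) for s in shares]
    for i, b in enumerate(base):
        alloc[i] += b
    leftovers = sorted(range(n), key=lambda i: shares[i] - base[i], reverse=True)
    for i in leftovers[: remaining - sum(base)]:
        alloc[i] += 1
    return alloc

print(allocate(10, [300, 100]))  # → [7, 3]
```

Scaling events reuse the same ratio: recompute the allocation for the new total and create or reclaim the per-cluster difference.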
The third is the interconnection of network and storage. Because networks across data centers need to be interconnected, we adopted a VXLAN network scheme; storage is likewise interconnected through dedicated lines. The container image registry uses Harbor, with synchronization policies configured among the clusters, and each cluster has its own DNS resolution pointing to a different registry.
Because business developers still had doubts about container technology, most applications are deployed as a mix of virtual machines and containers, and it is common for containers to reach virtual machines by domain name and vice versa. To manage domain names uniformly, we do not use the kube-dns (CoreDNS) provided by Kubernetes but instead use BIND for resolution. Through the Default DNS policy supported by Kubernetes, container domain lookups are pointed at the company's DNS servers, and an API for domain name management is provided for dynamically adding records.
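For reference, the Default DNS policy mentioned above is a single field in the pod spec; a generic example (not our actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  # "Default" inherits /etc/resolv.conf from the node, i.e. the corporate
  # BIND servers, instead of the cluster DNS (kube-dns/CoreDNS).
  dnsPolicy: Default
  containers:
  - name: app
    image: demo:latest
```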
Kubernetes CNI has many network schemes, mainly divided into layer-2, layer-3, and overlay schemes. Since our data centers do not allow running the BGP protocol, and hosts across data centers need to be interconnected, we adopted flannel's VXLAN scheme. To achieve cross-data-center interoperability, the flannel instances of the two clusters connect to the same etcd cluster, ensuring consistent network configuration. Older versions of flannel have many problems, including too many routes and ARP cache expiry; it is advisable to switch to per-subnet routes and set ARP entries as permanent, so that a failure of etcd or similar components cannot paralyze the cluster network.
Flannel also needs some configuration tuning. By default, flannel renews its etcd lease every 24 hours; if the renewal fails, the subnet information is deleted from etcd. To avoid subnet changes, the TTL of the etcd data node can be set to 0 (never expire). Docker by default masquerades all packets leaving the host, which prevents flannel from seeing the source container's IP; by adding an ip-masq exception, packets whose destination is in the flannel subnet are excluded from masquerading. Because flannel uses VXLAN, enabling the NIC's VXLAN offloading brings a significant performance improvement. Flannel itself has no network isolation, so to implement Kubernetes network policy we adopted Canal, a plugin that combines Calico's policy enforcement with flannel.
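For reference, flannel's network configuration is a JSON document stored in etcd; a generic example selecting the VXLAN backend (the subnet here is illustrative, not our production value). The ip-masq exception is then typically achieved by running flanneld with `--ip-masq` (which excludes the flannel subnet from masquerading) and the Docker daemon with `--ip-masq=false`:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
```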
To support the DevOps process, we first tried Jenkins to compile code, but Jenkins' support for multi-tenancy was relatively poor. In the second version, we use Kubernetes' Job mechanism: each user compilation starts a build Job, which first downloads the user's code and selects the corresponding build image according to the language. After compilation, the resulting artifact (e.g. a jar or war file) is packaged into a Docker image through a Dockerfile and pushed to the image registry; a webhook from the registry then triggers the rolling upgrade process.
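A simplified sketch of such a build Job (image names, commands, and variables are hypothetical, not our actual pipeline):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: build-user-app-1234
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: build
        # Build image chosen according to the project's language.
        image: builder-maven:3.6
        command: ["sh", "-c"]
        args:
        - |
          git clone "$REPO_URL" /src &&
          cd /src && mvn package &&
          docker build -t "$REGISTRY/$APP:$TAG" . &&
          docker push "$REGISTRY/$APP:$TAG"
```

One Job per compilation gives natural multi-tenant isolation: each build runs in its own pod with its own resource limits, which is exactly what Jenkins struggled to provide.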
The system introduces the logical concept of an application. Kubernetes has the concept of a service but lacks relations between services, whereas a complete application usually consists of several services, such as a front end, back-end APIs, and middleware, that call and constrain one another. By defining the application concept, we can both control the startup order of services and plan the start and stop of a group of services uniformly.
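Controlling the startup order amounts to a topological sort over the declared service relations; a minimal sketch (service names are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# An "application": services mapped to the services they depend on at startup.
app = {
    "frontend": {"api"},
    "api": {"cache", "db"},
    "cache": set(),
    "db": set(),
}

def start_order(services):
    """Return a valid startup order: dependencies before dependents."""
    return list(TopologicalSorter(services).static_order())

print(start_order(app))
```

Stopping the group uses the reverse of this order, so dependents go down before the services they call.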
Container log collection uses the watchdog log system developed in-house. A log collection agent is deployed on each host through a DaemonSet. The agent obtains the containers and log paths to collect through the Docker API, then ships the logs to the log center, which is built on Elasticsearch and provides multidimensional log search and export.
Monitoring of the container's own resources is implemented with cAdvisor + Prometheus. Monitoring of the business inside the container is integrated with the open-source APM system UAV (https://github.com/uavorg/uav …) for application performance monitoring. UAV's link tracing is based on the Java agent technique: if the user enables UAV monitoring when deploying an application, the system injects the UAV agent into the image and modifies the startup parameters while building the image.
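Conceptually, the injection at image build time amounts to adding the agent jar and a `-javaagent` startup flag; a hedged sketch (paths and file names are hypothetical, and UAV's real injection differs in detail):

```dockerfile
FROM openjdk:8-jre
COPY app.jar /app/app.jar
# Agent jar injected by the build system when UAV monitoring is enabled.
COPY uav-agent.jar /uav/uav-agent.jar
ENV JAVA_OPTS="-javaagent:/uav/uav-agent.jar"
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app/app.jar"]
```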
In addition to the above modules, the system integrates Harbor for multi-tenant management and scanning of container images. Log auditing records users' operations in the management interface. webshell gives users access to a web console; to support security auditing, the backend intercepts all commands the user types in webshell and records them in storage. Storage management integrates the company's commercial NAS storage and provides shared, persistent data for containers directly. The application store provides on-demand middleware services for development and testing through Kubernetes operators.
Docker is not a virtual machine
At the initial stage of container adoption, business developers were not very familiar with containers and subconsciously thought of them as virtual machines. In fact, the two differ not only in usage but also in implementation and principle: a virtual machine emulates hardware to provide a complete operating system environment, while a container provides resource isolation and limits on top of a shared kernel. The following figure shows the 7 namespaces supported by Linux in the 4.8 kernel.
Beyond these namespaces, nothing else is isolated. Take the clock, for example: all containers share the operating system's clock, so if the host time is modified, the time in every container changes. Furthermore, the proc filesystem inside a container is not isolated, and what it shows is host information, which troubles many applications: the JVM's default maximum heap size is 1/4 of physical memory, so if a container has a 2 GB memory limit while the host has 200+ GB of memory, the JVM easily triggers an OOM kill. The usual solution is to set JVM parameters at startup according to the memory and CPU limits, or to use lxcfs, etc.
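A simple sketch of the first workaround: derive the heap flags from the container's memory limit rather than letting the JVM size itself from host memory (the 1/2 ratio here is an arbitrary illustrative choice; newer JVMs with container awareness, from 8u191/JDK 10 on, reduce the need for this):

```python
def jvm_opts(limit_bytes, heap_ratio=0.5):
    """Compute explicit JVM heap flags from the container's memory limit.

    Inside a container, an older JVM reads /proc and sizes its default max
    heap from *host* memory, so we pass flags derived from the cgroup limit.
    """
    heap_mb = int(limit_bytes * heap_ratio) // (1024 * 1024)
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"

# A 2 GiB container limit:
print(jvm_opts(2 * 1024 ** 3))  # → -Xms1024m -Xmx1024m
```

In practice the limit would be read from the cgroup (e.g. memory.limit_in_bytes) in the container's entrypoint and appended to the Java command line.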
Cgroup resource limits are currently relatively weak for network and disk I/O. cgroup v1 only supports limiting direct I/O, while actual production workloads mostly use buffered I/O; we are currently testing cgroup v2's I/O limits. The latest CNI already supports network rate limiting, which, combined with tc, achieves this effect well.
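The CNI-level rate limit mentioned above is expressed as pod annotations (values illustrative), which the CNI bandwidth plugin implements with tc under the hood:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
spec:
  containers:
  - name: app
    image: demo:latest
```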
Kubernetes ships with many scheduling algorithms; before a container starts, it passes through these predicate and priority algorithms, which filter all nodes and then score and rank them, considerably increasing container deployment time. Removing some unused scheduling algorithms improves deployment speed. Containers use an anti-affinity strategy to reduce the impact of a physical machine failure on a service.
Although Kubernetes enables RBAC, it is still not recommended to mount the Kubernetes token into business containers; disabling the ServiceAccount token mount improves the system's security.
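Disabling the token mount is a one-line setting, either on the pod spec (shown below) or on the ServiceAccount itself:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  # Do not mount the ServiceAccount token into business containers.
  automountServiceAccountToken: false
  containers:
  - name: app
    image: demo:latest
```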
Docker image storage uses devicemapper in direct-lvm mode, which performs better; a separate VG is allocated at deployment time so that Docker problems do not affect the operating system. Each container's root filesystem is limited to 10 GB through devicemapper storage, preventing a business container from exhausting the host's disk space, and the maximum number of processes per container is limited at runtime to avoid fork bombs.
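A sketch of the corresponding Docker daemon configuration (the thin-pool device name is a placeholder for your own LVM setup); the process cap is applied per container, e.g. with `docker run --pids-limit=...`:

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.basesize=10G"
  ]
}
```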
etcd stores Kubernetes' core data, so etcd high availability and scheduled backups are essential. After a Kubernetes cluster grows beyond 100 nodes, query speed drops, and SSDs can effectively improve it. This system also persists service metadata in a database outside Kubernetes.
Pay attention to certificate expiry. When deploying a Kubernetes cluster, many certificates are self-signed, and openssl defaults to a one-year validity period if none is specified. Be very cautious when renewing certificates, because the entire Kubernetes API is built on certificates and all associated services need to be updated.
Docker containers plus Kubernetes orchestration is one of the mainstream container cloud practices, and Yixin's container cluster management platform also adopts this scheme. This article shared some of Yixin's exploration and practice in container cloud platform technology, covering Nginx self-service management, multi-cluster management, DNS resolution, the network scheme, CI/CD, service orchestration, logging and monitoring, Kubernetes optimization, and some thoughts on Yixin's internal container cloud platform. Of course, we still have many shortcomings; everyone is welcome to visit Yixin for in-depth communication and exchange!