Enterprises continue to introduce new technologies as they migrate to the cloud. According to a survey report by the China Academy of Information and Communications Technology, 43.9% of surveyed enterprises had used container technology to deploy business applications in 2019, and a further 40.8% planned to do so. 28.9% of enterprises already used a microservices architecture for application development, and another 46.8% planned to adopt it. Following SDN, the cloud native technologies represented by containers, microservices, and DevOps greatly improve the agility, elasticity, and portability of enterprise cloud deployments. However, as more infrastructure moves to the cloud, the virtual network inevitably becomes a black box. The inherently dynamic nature of container networks makes monitoring and diagnosis a challenge. A CNCF report points out that container networking and security have become the most important challenges in building container cloud platforms. When enterprises migrate core applications to a container platform, they must capture traffic data across the whole network and build a network knowledge graph on that basis to visualize the state of the entire network.
To provide isolation and cross-node container communication, many enterprises build their container networks on overlay networks. Container network solutions fall into two main categories: tunnel-based and routing-based. Common container network solutions are usually implemented on top of CNI, including Flannel, Calico, Weave Net, Contiv, NSX-T Container Plugin (NCP), and OpenShift-SDN. Among them, Flannel-VXLAN, Calico-IPIP, Weave Net, Contiv-VXLAN, NCP, and OpenShift-SDN are based on overlay tunnels, while Flannel-HostGW, Calico-BGP, and Contiv-BGP are based on routing. In addition, some network solutions rely entirely on the underlay, such as SR-IOV, MACVLAN, and IPVLAN. Tunnel solutions place few requirements on the underlying network, but network troubleshooting becomes more complex as the number of nodes grows.
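The grouping above can be written down as a small lookup table. The following is a minimal sketch only: the `Mode` enum, the `CNI_MODES` table, and both helper functions are illustrative names introduced here, not part of any CNI plugin's API. It assumes the well-known property that tunnel modes such as VXLAN and IPIP wrap Pod packets in an outer header, so traffic captured on the underlay has to be decapsulated before Pod addresses are visible.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    TUNNEL = "overlay tunnel"
    ROUTING = "routing"
    UNDERLAY = "underlay"


# Grouping copied from the paragraph above. Keys are plain labels for the
# solution and mode, not identifiers from any real plugin.
CNI_MODES = {
    "Flannel-VXLAN": Mode.TUNNEL,
    "Calico-IPIP": Mode.TUNNEL,
    "Weave Net": Mode.TUNNEL,
    "Contiv-VXLAN": Mode.TUNNEL,
    "NCP": Mode.TUNNEL,
    "OpenShift-SDN": Mode.TUNNEL,
    "Flannel-HostGW": Mode.ROUTING,
    "Calico-BGP": Mode.ROUTING,
    "Contiv-BGP": Mode.ROUTING,
    "SR-IOV": Mode.UNDERLAY,
    "MACVLAN": Mode.UNDERLAY,
    "IPVLAN": Mode.UNDERLAY,
}


def needs_decapsulation(solution: str) -> bool:
    """True when underlay captures carry an outer tunnel header (assumption:
    only the tunnel category encapsulates Pod traffic)."""
    return CNI_MODES[solution] is Mode.TUNNEL


def solutions_by_mode(mode: Mode) -> List[str]:
    """Return the solution labels in one category, sorted by name."""
    return sorted(name for name, m in CNI_MODES.items() if m is mode)


if __name__ == "__main__":
    print(solutions_by_mode(Mode.ROUTING))
    print(needs_decapsulation("Calico-IPIP"))
```

A monitoring tool could use a table like this to decide whether to strip an outer header before attributing a packet to a Pod.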
After enterprises move to the cloud, their application architecture gradually shifts toward microservices and containers, and business and network become ever more tightly coupled. Container cloud platforms sit closer to the business than any previous infrastructure platform, but they also contain more layers and components and therefore introduce more risk. Within container cloud platforms, the default network model lacks the necessary isolation for east-west access. In a microservice architecture, monitoring the network between services is an important part of business security, and in container networks the traffic between Pods urgently needs tooling to monitor it. Capturing complete network traffic, especially virtual network and container network traffic, is an important prerequisite for opening the virtual network black box and ensuring the continuity and security of business in the cloud. Enterprises need to build a unified cloud monitoring platform that can collect and visualize traffic across the entire network, so that after moving to the cloud, services remain visible, manageable, and controllable, and faults can be resolved quickly.
For an introduction to building a network-wide traffic monitoring and visualization scheme, please refer to the discussion of the Spruce Network hybrid cloud network monitoring and diagnosis scheme. Traffic monitoring and visualization for container networks are covered in the following sections. The following principles should be observed when building container network traffic monitoring and visualization:
a) Functional integration: The virtual network that covers container resource pools must be compatible with heterogeneous resource pools, such as KVM, VMware, public cloud, and bare metal, and must provide multi-tenant services to avoid duplicated construction and complex management. In addition, the scheme must be able to map out and profile container services, laying the foundation for traffic visualization.
b) Cloud native architecture: The monitoring platform itself must have a cloud native architecture so that it fully meets the elasticity requirements of enterprise cloud adoption, and it must provide unified monitoring of hybrid cloud environments, including mainstream private and public clouds, while adapting to different container environments. When an enterprise elastically deploys container services across resource pools, the solution must be able to follow them automatically.
c) Less intrusive deployment: Deploying the solution should affect the existing production environment as little as possible. Each container environment should use a matching technical approach, so that container traffic collection is deployed smoothly without interrupting services and the consumption of computing resources stays safe and controllable.
d) Data standardization: Enterprise IT environments are often complex and heterogeneous, and the tools and data in their monitoring systems are equally diverse. Traffic data from a container network is inevitably consumed by multiple types of terminals or platforms, so monitoring data must adhere to open standards to ensure that the enterprise's existing analytics tools can use it seamlessly.
Compared with traditional virtualization, containers offer better resource utilization, flexibility, and elasticity. Containers also have inherent advantages for microservices, DevOps, and distributed systems, which makes them the choice for next-generation cloud infrastructure in data centers. With its sound architecture, flexible extensibility, and rich application orchestration model, Kubernetes has become the de facto standard for container orchestration and for enterprises building container cloud platforms.
Common failures in container environments generally fall into three categories. Application faults usually appear as a mismatch between the actual and expected execution status of the application. Container faults usually manifest as an inability to properly create, stop, or update a container. Cluster faults usually manifest as cluster inconsistency or loss of connectivity. In deploying and managing container environments, enterprises have mostly relied on Prometheus for system monitoring and alerting, combined with Grafana, Zabbix, and other open source tools, to handle container network monitoring and assurance. However, the available metrics and presentation dimensions are relatively limited, and as container resource pools keep growing, the scalability and deployment limitations of these tools make it difficult to support in-depth analysis. Take the container host mode as an example: each node typically runs 100 to 200 Pods, and it is not easy to capture the network traffic of each Pod and combine it with network-wide traffic data for query and analysis at one-second granularity.
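To make the idea of per-Pod analysis at one-second granularity concrete, the following minimal sketch buckets packet sizes by Pod and by whole second. The `Packet` record, its fields, and both functions are hypothetical names introduced for illustration; they do not describe how any particular product stores or queries traffic data, and timestamps are assumed to be non-negative seconds since the epoch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    timestamp: float  # seconds since the epoch (assumed non-negative)
    pod: str          # Pod the packet is attributed to (hypothetical field)
    size: int         # packet size in bytes


def bytes_per_pod_second(packets: Iterable[Packet]) -> Dict[Tuple[str, int], int]:
    """Sum packet sizes into (pod, whole-second) buckets."""
    buckets: Dict[Tuple[str, int], int] = defaultdict(int)
    for p in packets:
        buckets[(p.pod, int(p.timestamp))] += p.size
    return dict(buckets)


def query(buckets: Dict[Tuple[str, int], int], pod: str,
          start: int, end: int) -> List[Tuple[int, int]]:
    """Return (second, bytes) pairs for one Pod over [start, end).

    Seconds with no traffic are reported as 0 so gaps stay visible.
    """
    return [(s, buckets.get((pod, s), 0)) for s in range(start, end)]


if __name__ == "__main__":
    sample = [
        Packet(100.2, "pod-a", 500),
        Packet(100.9, "pod-a", 300),
        Packet(101.1, "pod-a", 200),
        Packet(100.5, "pod-b", 64),
    ]
    print(query(bytes_per_pod_second(sample), "pod-a", 100, 103))
```

With 100 to 200 Pods per node, the number of buckets grows with Pods multiplied by seconds, which is one reason storage and query design matter at this granularity.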
Spruce Network has focused for many years on research and development of cloud data center network monitoring, management, and control solutions and SDN software products. Its flagship product, DeepFlow®, provides customers with hybrid cloud network traffic collection and distribution solutions and hybrid cloud network performance monitoring and diagnosis solutions, built on efficient hybrid cloud traffic collection and time-series data storage and retrieval technologies. Container network environments pose technical difficulties in horizontal application scaling, network reachability, configuration management, service dependencies, cluster consistency, and other areas. The DeepFlow® platform connects to container platforms such as Kubernetes and actively learns resource information in the container environment, including Cluster, Node, Pod, Service, and Ingress. For container environments, the collector is available in container OnVM and container OnHost specifications, which meet traffic collection and filtering requirements in different container resource pools. In the Service Portrait function, you can create a service, add it to related resource groups, classify IP addresses, functions, services, and links, and describe the network access paths of service applications. The collector filters the traffic in the resource pool based on Service Portrait rules to implement end-to-end network monitoring and diagnosis for service applications.
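As a rough illustration of filtering traffic against service profile rules, the sketch below matches flows against a profile made of address ranges and ports. It is an assumption-laden example, not the DeepFlow® rule format: `Flow`, `ServiceProfile`, their fields, and `filter_flows` are hypothetical names, and a real profile would likely carry more attributes than CIDRs and ports.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    dst_port: int


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical profile: a service described by address ranges and ports."""
    name: str
    cidrs: Tuple[str, ...]
    ports: Tuple[int, ...]

    def matches(self, flow: Flow) -> bool:
        # A flow is in scope when either endpoint falls inside one of the
        # profile's CIDRs and the destination port is one the service uses.
        nets = [ipaddress.ip_network(c) for c in self.cidrs]
        endpoints = (ipaddress.ip_address(flow.src_ip),
                     ipaddress.ip_address(flow.dst_ip))
        in_scope = any(ip in net for ip in endpoints for net in nets)
        return in_scope and flow.dst_port in self.ports


def filter_flows(flows: Iterable[Flow],
                 profiles: Iterable[ServiceProfile]) -> List[Flow]:
    """Keep only flows that match at least one profile."""
    profile_list = list(profiles)
    return [f for f in flows if any(p.matches(f) for p in profile_list)]


if __name__ == "__main__":
    orders = ServiceProfile("orders", ("10.244.1.0/24",), (8080,))
    flows = [
        Flow("10.244.2.7", "10.244.1.5", 8080),  # in scope
        Flow("10.244.2.7", "10.244.3.9", 8080),  # outside the CIDR
        Flow("10.244.1.5", "10.244.2.7", 443),   # port not in the profile
    ]
    print(filter_flows(flows, [orders]))
```

Filtering at the collector, rather than after storage, keeps the volume of retained traffic proportional to the services being watched.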
Key business applications running in containers need to be included in the monitoring view for continuous attention. In the network diagram, container services can be queried and displayed along multiple dimensions, such as region, node, Pod, and IP. Along the entire service path, the network status can be checked segment by segment to quickly narrow the problem scope and locate the cause of a fault. Network flows and data packets can then be retrieved retrospectively for analysis and forensics.
The DeepFlow® platform innovates container network monitoring from the following perspectives:
First, DeepFlow® builds a unified collection abstraction layer for heterogeneous resource pools. It supports environments such as mainstream container vendors' products, Kubernetes, KVM, ESXi, public cloud workloads, and dedicated servers, meeting the goal of building an integrated enterprise monitoring platform.
Second, DeepFlow® is designed with an open architecture that is scalable and compatible. A single platform can address network performance monitoring, infrastructure monitoring, and application performance monitoring requirements in the cloud, which helps standardize data across the enterprise monitoring system.
Third, DeepFlow® uses an independently designed visual resource knowledge graph that dynamically correlates monitoring data attributes across more than a dozen dimensions to provide a panoramic view of the operating state of the container network. These dimensions include container node, namespace, service, Deployment, ReplicaSet, Pod, region, availability zone, VPC, subnet, and IP.
Fourth, DeepFlow® is deployed purely as software in a cloud native manner. Its monitoring capability truly follows the cloud and minimizes intrusion into the production environment. The automatic follow mechanism ensures complete monitoring coverage as container resources are elastically deployed.
As the earliest SDN innovator in China, Spruce Network has taken the lead in the industry in delivering a container network monitoring solution. It has established cooperation with container solution suppliers and business application analysis vendors and has launched joint enterprise cloud solutions to meet the needs of the market and customers.
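The correlation of monitoring data with resource dimensions described in the third point above can be sketched as a tagging step followed by a group-by. This is an illustration under stated assumptions only: the `INVENTORY` table, its sample values, and the `enrich` and `bytes_by` functions are hypothetical and do not reflect the product's data model. The inventory stands in for resource information that would be learned from the container platform.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

Inventory = Dict[str, Dict[str, str]]

# Hypothetical inventory keyed by Pod IP; values are made-up sample data.
INVENTORY: Inventory = {
    "10.244.1.5": {"pod": "orders-7d9f", "namespace": "shop",
                   "service": "orders", "node": "node-1"},
    "10.244.2.7": {"pod": "web-5c6b", "namespace": "shop",
                   "service": "web", "node": "node-2"},
}


def enrich(flow: Dict[str, Any], inventory: Inventory) -> Dict[str, Any]:
    """Copy the flow and attach src_*/dst_* dimension tags looked up by IP."""
    tagged = dict(flow)
    for side in ("src", "dst"):
        attrs = inventory.get(flow[f"{side}_ip"], {})
        for key, value in attrs.items():
            tagged[f"{side}_{key}"] = value
    return tagged


def bytes_by(flows: Iterable[Dict[str, Any]], dimension: str) -> Dict[str, int]:
    """Total bytes grouped by one tag; untagged flows fall under 'unknown'."""
    totals: Dict[str, int] = defaultdict(int)
    for flow in flows:
        totals[str(flow.get(dimension, "unknown"))] += int(flow["bytes"])
    return dict(totals)


if __name__ == "__main__":
    raw = [
        {"src_ip": "10.244.2.7", "dst_ip": "10.244.1.5", "bytes": 1200},
        {"src_ip": "192.0.2.10", "dst_ip": "10.244.1.5", "bytes": 300},
    ]
    tagged = [enrich(f, INVENTORY) for f in raw]
    print(bytes_by(tagged, "dst_service"))
    print(bytes_by(tagged, "src_service"))
```

Because tags are attached once at enrichment time, the same flow data can be regrouped by any dimension without another lookup.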