Container Monitoring Simplified with the Netsil AOC
Containers are becoming a critical component of modern cloud applications. They deliver speed and portability for software development and align closely with the prominent industry trends of DevOps and microservices. While containers are great for software developers, they create new challenges for operations teams running containerized applications in production. In particular, traditional application monitoring techniques such as logging, code-instrumented agents and infrastructure metrics tracking have hit a wall, because containers introduce a fundamentally new abstraction layer and have short life spans.
Netsil is pioneering an approach that uses service interactions as the source of truth for monitoring containerized applications, which greatly simplifies container monitoring. The Netsil Application Operations Center (AOC) gives operations teams complete visibility into containers and frameworks such as Kubernetes and Mesosphere DC/OS, without any code change to containers or applications. Let’s start with a quick overview of the AOC and then take a closer look at some of the most challenging aspects of container monitoring, along with Netsil’s approach to addressing them.
Side note: Netsil recently announced single-container manifests of the AOC. In less than 10 minutes, you can start mapping and monitoring your container clusters with the AOC. Manifests are available for Docker, Kubernetes and Mesosphere DC/OS environments.
Netsil AOC Overview
The Netsil Application Operations Center (AOC) is a converged application monitoring platform for operations teams. SREs and DevOps engineers use the AOC to monitor and deliver on Service Level Objectives (SLOs) for business-critical applications. The AOC delivers the industry’s first auto-discovered, real-time topology map of the entire application. The application topology map is combined with per-second, real-time metrics from the entire stack. Operations teams use the AOC to:
- Create multiple maps for various components of applications and understand dependencies among services. The AOC doesn’t require any code change to application or container images. It listens to service interactions and auto-discovers the application map. For example, starting with the initial auto-discovered topology map, operations teams can zoom in on specific frameworks such as Kubernetes and create a map using Kubernetes tags such as pod names (see fig 1 below).
- Alert on and diagnose production issues using the AOC’s real-time metrics and analytics. For metrics, the AOC uses not only the data from service interactions but also system-specific sources such as cAdvisor for Docker and Kubernetes metrics. For example, starting with service-level health, operations teams can progressively drill down and investigate container, host and infrastructure-level metrics to identify the root cause.
Fig 1: Auto-discovered Kubernetes Application Topology Grouped by Pods
Netsil AOC Addresses Container Monitoring Challenges
Before we dive into container monitoring challenges, it is worth noting that containers and microservices are closely associated with each other. Containers are a crucial piece for packaging, delivery and run-time execution of services. So, many of the challenges of microservices also apply to containers, and vice versa. The following is a condensed list of some of the most pressing challenges with container monitoring. As an additional resource, we strongly recommend this talk by Adrian Cockcroft from Gluecon as well as the ‘Mastering the Chaos’ talk by Josh Evans from Netflix.
Containers are expected to have a much shorter lifespan than VMs. The reduced lifespan could be due to more frequent production roll-outs, which replace previous versions of containers with new ones. Or it could be due to auto-scaling, which dynamically creates and deletes containers as service load changes. Irrespective of the cause, the reduced lifespan and fleeting nature of containers make it hard to track them and investigate them for root cause analysis.
Fig 2: Kubernetes Metadata for Pods in Netsil AOC
Portability is one of the most attractive features of containers. But as containers move, their run-time environments may change. For example, a perfectly happy set of containers could start experiencing issues if a bunch of “noisy neighbor” containers are spun up on their hosts. Or the same happy set of containers could see performance degradation if they are moved to a different server with storage or network issues. Portability makes it very hard to correlate service-level issues with container, host and infrastructure issues.
Fig 3: Netsil AOC Automatically Ties Service, Container and Infrastructure Level Metrics
Containers are closely tied to the architectural paradigm of microservices. Microservices applications often have fault tolerance and redundancy built into the architecture. This means that multiple redundant containers, running identical copies of code, could be powering a service within an application. In such an architecture, the failure of one container does not need to generate a paging alert that wakes up SREs in the middle of the night. Most legacy monitoring tools are not equipped to handle such situations, and they end up flooding operations teams with low-level, low-impact alerts.
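To make the alerting idea concrete, here is a minimal sketch of paging on service health instead of on individual container failures. It is not the AOC's implementation; the `ContainerStatus` record, the `services_to_page` helper and the 50% threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContainerStatus:
    service: str        # logical service the container belongs to
    container_id: str   # hypothetical identifier, for illustration only
    healthy: bool       # result of some health check


def services_to_page(statuses, min_healthy_fraction=0.5):
    """Return the services whose share of healthy containers is below the threshold."""
    total = {}
    healthy = {}
    for s in statuses:
        total[s.service] = total.get(s.service, 0) + 1
        healthy[s.service] = healthy.get(s.service, 0) + int(s.healthy)
    return sorted(
        svc for svc, count in total.items()
        if healthy[svc] / count < min_healthy_fraction
    )


statuses = [
    ContainerStatus("checkout", "c1", True),
    ContainerStatus("checkout", "c2", True),
    ContainerStatus("checkout", "c3", False),  # one redundant replica down
    ContainerStatus("cart", "c4", False),
    ContainerStatus("cart", "c5", False),      # every replica down
]
paged = services_to_page(statuses)
```

With this sample data, the single failed `checkout` replica does not page anyone because two of three replicas are still healthy, while `cart`, which has lost every replica, does.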
In modern applications, multiple components interact with each other to fulfill transactions. It is well understood that the failure of one component in the dependency chain will have a cascading effect on the entire application. Another type of cascading failure happens when a few containers that are part of a load-balanced service start experiencing issues. When a few containers fail to handle their share of the load, the load on the remaining functional containers increases. If the situation is left unchecked, eventually all the containers that are part of the service start performing poorly, resulting in degraded latency, throughput and/or error rates for the service. This can be viewed as an initial lateral propagation of failure within the service component, before the failure starts impacting other services in the dependency chain.
Fig 4: Golden Signals for Kubernetes Service
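The lateral propagation described above comes down to simple arithmetic: a fixed amount of traffic divided among fewer healthy containers. The sketch below is a deliberately simplified model, not a description of any real load balancer. It assumes traffic is spread evenly, that every container has the same capacity, and that an overloaded service loses one container per step; `simulate_cascade` and its parameters are hypothetical names.

```python
def simulate_cascade(total_rps, replicas, capacity_rps, initial_failures):
    """Trace per-container load as overloaded containers drop out one by one.

    Returns a list of (healthy_containers, load_per_container) tuples.
    The trace stops when the load fits within capacity or no containers remain.
    """
    healthy = replicas - initial_failures
    history = []
    while healthy > 0:
        per_container = total_rps / healthy   # even split across healthy containers
        history.append((healthy, per_container))
        if per_container <= capacity_rps:
            break                             # the remaining containers can cope
        healthy -= 1                          # assumed: one more container gives out
    return history


# 900 requests/s over 10 replicas, each assumed to handle 120 requests/s.
stable = simulate_cascade(900, 10, 120, initial_failures=2)
collapse = simulate_cascade(900, 10, 120, initial_failures=3)
```

Under these assumptions, losing two containers leaves the other eight at 112.5 requests per second each, which is within capacity. Losing a third pushes the remaining seven past their limit, and the trace in `collapse` continues until a single container is facing the full 900 requests per second.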
To manage collections of containers, specialized frameworks such as Kubernetes, Mesosphere DC/OS and Docker Swarm are used. These frameworks have their own abstractions, such as pods, services and namespaces, which are valuable for operations teams in understanding and monitoring containerized applications.
Fig 5: Auto-discovered Maps of Mesosphere DC/OS and Kubernetes Applications
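As a rough illustration of how framework abstractions can shape a topology map, the sketch below groups observed service interactions by a chosen piece of metadata, such as pod name or namespace. The record layout (`src`, `dst`, `requests`) and the `build_topology` helper are assumptions made up for this example and do not reflect the AOC's data model.

```python
from collections import Counter


def build_topology(interactions, group_by):
    """Aggregate request counts into edges keyed by the chosen metadata tag."""
    edges = Counter()
    for interaction in interactions:
        src = interaction["src"].get(group_by, "unknown")
        dst = interaction["dst"].get(group_by, "unknown")
        edges[(src, dst)] += interaction.get("requests", 1)
    return dict(edges)


# Hypothetical interaction records tagged with Kubernetes-style metadata.
interactions = [
    {"src": {"pod": "web-1", "namespace": "frontend"},
     "dst": {"pod": "api-1", "namespace": "backend"},
     "requests": 10},
    {"src": {"pod": "web-2", "namespace": "frontend"},
     "dst": {"pod": "api-1", "namespace": "backend"},
     "requests": 5},
]

by_pod = build_topology(interactions, "pod")
by_namespace = build_topology(interactions, "namespace")
```

Grouping by `pod` keeps the two `web` pods as separate edges into `api-1`, while grouping by `namespace` collapses them into a single `frontend` to `backend` edge. The same underlying interactions can therefore be viewed at whichever level of abstraction suits the investigation.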
Containers are a great productivity booster for the software industry. Using the Netsil Application Operations Center (AOC), operations teams can successfully embrace containers in production environments. Starting with the real-time topology map, the AOC provides complete visibility into the health and performance of services, containers and infrastructure.
You can get started with the AOC in less than 10 minutes using our single-container manifests, which work with Mesosphere DC/OS, Kubernetes and Docker. Get started here.