A Comparison of Mapping Approaches for Distributed Cloud Applications (AppDynamics vs OpenTracing vs Netsil)
One of the most fundamental challenges of monitoring modern cloud applications is seeing all the components and their dependencies. This lack of visibility is a critical problem that is worsening as instance lifespans shrink, components span private and public clouds, and dependence on external services increases. As the pace of software development and the complexity of applications continue to increase, the visibility challenge for operations teams can be summarized as “driving at 100 mph with a blindfold on!”
An emerging solution to help with visibility is application maps. In this post, we will describe application maps and their use cases, and cover some popular techniques used to generate them.
What is an Application Map?
An application map is a topology map comprising nodes and edges where:
- The nodes represent groups of processes or compute instances.
- The edges represent the network or IPC interactions between nodes (i.e., between groups of compute instances).
Application maps have several characteristics worth highlighting:
- Grouping of instances is crucial, because a map drawn at the level of individual servers, VMs or containers quickly becomes overwhelming. Grouping compute instances serves a similar purpose to “resolution” on a Google map. The map examples below from Netflix illustrate the notions of groups and resolution. Figure 1a groups multiple instances together, so its “resolution” is low. It is like zooming out on a Google map and seeing a region at the state or country level. Figure 1b is the zoomed-in view, which shows specific services.
- Application-level details should be present in the application map, rather than merely an infrastructure map of hosts, VMs or containers. That is, the user should be able to visualize services such as databases, DNS, service discovery, REST/HTTP endpoints, etc. on the application map.
- Golden signals such as latency, error rates, throughput and saturation metrics should be captured and displayed for nodes and edges. These metrics enable operations teams to quickly understand the health and performance of application components.
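The definition and characteristics above can be made concrete with a minimal sketch. It assumes a hypothetical `Call` record for each observed request and a hypothetical `group_of` lookup that sets the map's "resolution"; neither comes from any of the products discussed in this post. It rolls instance-level calls up into group-level edges and attaches three of the golden signals (throughput as a request count, error rate and average latency).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One observed request between two compute instances (hypothetical record)."""
    src_instance: str
    dst_instance: str
    latency_ms: float
    is_error: bool


def build_map(calls, group_of):
    """Roll instance-level calls up into group-level edges with basic metrics.

    group_of maps an instance name to its group (the "resolution" of the map).
    Returns {(src_group, dst_group): {"requests", "error_rate", "avg_latency_ms"}}.
    """
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_sum": 0.0})
    for call in calls:
        edge = (group_of[call.src_instance], group_of[call.dst_instance])
        totals[edge]["requests"] += 1
        totals[edge]["errors"] += int(call.is_error)
        totals[edge]["latency_sum"] += call.latency_ms
    return {
        edge: {
            "requests": t["requests"],
            "error_rate": t["errors"] / t["requests"],
            "avg_latency_ms": t["latency_sum"] / t["requests"],
        }
        for edge, t in totals.items()
    }


# Hypothetical instances and groups, for illustration only.
GROUP_OF = {"web-1": "web", "web-2": "web", "db-1": "mysql"}
CALLS = [
    Call("web-1", "db-1", 10.0, False),
    Call("web-2", "db-1", 30.0, True),
]
APP_MAP = build_map(CALLS, GROUP_OF)
```

Changing `GROUP_OF` (for example, mapping every instance to a region name) is the equivalent of zooming out: the same calls collapse into fewer, coarser edges.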
Application Mapping Benefits
- Visibility: Naturally, the biggest value of maps is the ability to see all the components and their dependencies.
- Incident Response: Application maps can greatly expedite incident response. The dependency chain on a map shows all the services that participate in fulfilling a transaction. Without maps, responders must first identify every service involved in a transaction and then, often manually, correlate metrics to find the root cause.
- Monitoring and Alerting: Using application maps, operations teams can easily identify the services that are on the critical path, such as those serving end-user requests. Operations teams can then define Service-level Objectives (SLOs) for the critical services and set alerts/paging for them. This can greatly reduce the well-known problem of alert fatigue.
- Capacity Planning and Forecasting: Knowing which services are critical, operations teams can ensure appropriate capacity allocation for them. Application maps also highlight potential single points of failure and congestion hotspots.
- Auto-documentation: Application maps provide automated documentation of all components, their dependencies and capture changes over time. Without maps, the information is scattered in configuration manifests, CI/CD systems, and very often inside operators’ heads!
- Security: Maps are beneficial for identifying security vulnerabilities in an application. For example, a map can be used to identify if two services that are not supposed to talk to each other are doing so.
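The security use case above, spotting two services that talk to each other when they should not, reduces to a set difference once the map's edges are available. The sketch below assumes a hypothetical allow-list of permitted edges; the service names are made up for illustration.

```python
def unexpected_edges(observed_edges, allowed_edges):
    """Return observed (source, destination) pairs that the policy does not allow."""
    return sorted(set(observed_edges) - set(allowed_edges))


# Hypothetical policy and observations, for illustration only.
ALLOWED = {("web", "api"), ("api", "mysql")}
OBSERVED = [("web", "api"), ("api", "mysql"), ("web", "mysql")]
VIOLATIONS = unexpected_edges(OBSERVED, ALLOWED)
```

Here `VIOLATIONS` contains the single edge `("web", "mysql")`, a front end reaching the database directly.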
Application Mapping Techniques
Application mapping approaches can be categorized as static or dynamic. Static maps are essentially modeling techniques, such as CloudCraft, Yipee.io and Spotify’s System-Z. Figure 2 shows an example of a static application map generated using CloudCraft.
In this post we will focus on dynamic application mapping approaches, which fall into two categories: (1) end-to-end tracing and (2) ingress and egress (individual) tracing.
End-to-end Tracing Techniques:
- APMs: Application performance management (APM) techniques require code-embedded agents in all processes to track the code execution path. For some languages, agents can obtain an end-to-end trace by dynamically injecting a trace ID (e.g. custom HTTP headers, Thrift fields, gRPC) to piece together requests and responses across services. AppDynamics, New Relic, and Dynatrace are popular products in this category that leverage code profiling and transaction tracing to generate maps. Figure 4 shows an example of an application map generated by AppDynamics. APM techniques struggle to keep up with newer technologies because they require an N*M support matrix. For example, to support MySQL tracing, an APM product needs to trace MySQL calls in every language it supports. As new programming languages are released, APM vendors need to support all the combinations again. For example, to release a Node.JS APM agent, vendors need to support all HTTP frameworks, MySQL clients, PostgreSQL clients and so on. This provides an example of an N*M support matrix for AppDynamics.
- Tracing SDKs and Proxies: These techniques allow developers to embed tracing SDKs in the application code and use them to track entry points and exit calls. These SDKs don’t look at code execution; instead, they just inject headers into requests so that the requests can be correlated across services. Some techniques apply sampling to help scale in production. SDKs emit spans, which contain the unique trace ID and other metadata/tags. Some popular products in this category include OpenTracing, Datadog APM, AWS X-Ray, Finagle, linkerd and Envoy. Figure 5 is an example of an application map generated from AWS X-Ray.
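The header-injection mechanism described above can be sketched in a few lines. This is an illustration of the idea only, not the API of OpenTracing or any other product named here: the header names `X-Trace-Id` and `X-Parent-Span-Id`, the `handle_request` function and the span fields are all hypothetical.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"         # hypothetical header name
PARENT_HEADER = "X-Parent-Span-Id"  # hypothetical header name


def handle_request(service, incoming_headers, spans):
    """Record a span for this request and return headers for downstream calls."""
    # Reuse the caller's trace ID if present; otherwise this is an entry point.
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    span = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "parent_id": incoming_headers.get(PARENT_HEADER),
        "service": service,
    }
    spans.append(span)
    # The service must attach these headers to every exit call it makes.
    return {TRACE_HEADER: trace_id, PARENT_HEADER: span["span_id"]}


# Hypothetical call chain: web -> api -> billing.
SPANS = []
to_api = handle_request("web", {}, SPANS)
to_billing = handle_request("api", to_api, SPANS)
handle_request("billing", to_billing, SPANS)
```

All three spans share one trace ID, and each span's `parent_id` points at its caller. If any service in the chain drops the headers, the trace breaks at that hop.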
Pros and Cons of End-to-end Tracing Techniques:
- Help in root cause analysis: A few SDKs help with needle-in-a-haystack (root cause) analysis, e.g. by providing rules to record very specific transactions (such as those of a single user). In practice, sampling is enabled in production, and heavy recording rules are avoided unless root cause analysis is being performed.
- Trace exact path of requests: With tracing techniques, we can track requests as they pass through multiple services, and get the timing and other metadata throughout. This information can then be reassembled to provide a complete picture of the application’s behavior at runtime. Tracing the exact path also helps to understand request concurrency and, if needed, to re-architect services to make parallel or asynchronous requests.
- Overheads: Tracing techniques need to store individual traces, which can be challenging in production unless sampling is applied.
- SDKs or agents need to be embedded everywhere in the stack in order to get coverage. This can be tricky when calls are made to legacy services or OSS components. It is also tricky when different languages are mixed across services, e.g. Node.JS, Java, Python and Go.
- Some techniques use tracing proxies (e.g. linkerd) to inject headers, but the application still needs to be aware of the context (i.e. the headers) and pass it on when making further calls to other services for the whole chain to work. For more details, refer to this post.
- Individual traces don’t add much value as no one has time to go through millions of recordings. All tools ultimately aggregate traces to build a more meaningful cloud application map. In the following section, we describe how aggregation of traces results in exactly the same map as generated by individual tracing techniques.
- An end-to-end trace is often misleading, as it does not capture the load on the services (i.e. what other traffic was present) when the trace was recorded. Slow performance on a service is often due to traffic load. Hence, aggregating traces is the only way to see something of value.
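The aggregation step mentioned in the last two points can be sketched as follows. It assumes hypothetical span records with `span_id`, `parent_id` and `service` fields, and assumes span IDs are unique across traces. Each parent/child pair contributes one call to the edge between the two services, and the individual trace IDs play no part in the resulting map.

```python
from collections import Counter


def edges_from_spans(spans):
    """Aggregate spans into {(caller_service, callee_service): call_count}."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent = s.get("parent_id")
        if parent in service_of:  # root spans have no parent and add no edge
            edges[(service_of[parent], s["service"])] += 1
    return edges


# Two hypothetical traces through the same services.
TRACE_SPANS = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "web"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "api"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "mysql"},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "web"},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "service": "api"},
]
TRACE_EDGES = edges_from_spans(TRACE_SPANS)
```

The result is a count per service pair: two calls from `web` to `api` and one from `api` to `mysql`.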
Ingress and Egress (individual) Tracing:
- Logs: Some practitioners have built maps using logs gathered from application stacks or proxies. Some technologies, such as the Apache web server and Nginx proxies, can provide detailed logs for each request. Splunk and Elasticsearch have general-purpose graph interfaces to plot all kinds of relationships. However, this technique is impractical, because it requires every service and OSS component to emit standardized logs for every request. Logs also have a huge storage overhead.
- OS Tracing: Operating systems provide various tracers that allow tracing not just syscalls or packets, but also any kernel or application software. For example, tcpdump is a network tracer in Linux. Other popular tracers are eBPF, DTrace and Sysdig. Figure 6 shows an example of an application map generated by Netsil’s Application Operations Center (AOC) using packet capture and service interaction analysis.
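As a rough illustration of how individual ingress and egress observations turn into a map, the sketch below aggregates connection records of the form (source IP, destination IP, destination port) into service-level edges. It is not Netsil's implementation. The lookup tables are hypothetical, and inferring the protocol from the destination port is a deliberate simplification of real packet analysis, which inspects the payload.

```python
from collections import Counter

# Hypothetical lookup tables; a real system would discover these dynamically.
SERVICE_BY_IP = {"10.0.0.1": "web", "10.0.0.2": "api", "10.0.0.3": "db"}
PROTOCOL_BY_PORT = {53: "dns", 80: "http", 3306: "mysql"}


def edges_from_connections(connections):
    """Count observations per (source service, destination service, protocol)."""
    edges = Counter()
    for src_ip, dst_ip, dst_port in connections:
        # Fall back to the raw IP for endpoints we cannot name (e.g. external).
        src = SERVICE_BY_IP.get(src_ip, src_ip)
        dst = SERVICE_BY_IP.get(dst_ip, dst_ip)
        protocol = PROTOCOL_BY_PORT.get(dst_port, f"port-{dst_port}")
        edges[(src, dst, protocol)] += 1
    return edges


# Hypothetical observations, e.g. summarized from captured packets.
CONNECTIONS = [
    ("10.0.0.1", "10.0.0.2", 80),
    ("10.0.0.1", "10.0.0.2", 80),
    ("10.0.0.2", "10.0.0.3", 3306),
    ("10.0.0.2", "203.0.113.9", 53),
]
CONNECTION_EDGES = edges_from_connections(CONNECTIONS)
```

No trace IDs are involved, and the external DNS server still appears on the map even though it could never be instrumented.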
Pros and Cons of Ingress and Egress (individual) Tracing Techniques:
- Ingress and Egress techniques provide universal coverage as protocols don’t evolve as often as programming languages and frameworks.
- Ingress and Egress techniques yield the same map that tracing techniques produce after aggregating a large number of end-to-end traces, but without having to inject and carry forward trace IDs.
- Ingress and Egress techniques can map anything that talks over the network, even those technologies where trace ID injection is impossible – e.g. DNS calls, MySQL, Postgresql, Cassandra, Memcached, Redis etc.
- Raw data collection is done inside the OS kernel and is lighter weight than code-embedded APM agents, though there are overheads when the collected data is processed locally (often in user space).
- New technologies are relatively easy to support (no need for N*M support matrix), making this approach more pervasive and future proof.
- More accurate and representative of real behavior in production. Packets are often said to be the ultimate source of truth.
- Ingress and Egress techniques need to deal with reconstruction of application context from network data. Some protocols such as MySQL have complex session state machines.
- Ingress and Egress techniques don’t work when encryption is employed within the cloud. But this is not a problem when SSL termination happens at the load balancer or when using IPSec.
- Some techniques (e.g. PacketBeat) can have high storage overheads when the reconstructed data is stored in the form of events rather than rolled-up time series.
- Ingress and Egress techniques can’t tie together related requests or show the exact fan-out behavior of entry points. With modern microservices patterns, though, this is less of a problem, as services expose fewer API endpoints than monolithic applications do.
- It is hard to track specific business transactions end to end without automatic trace ID correlation. Some solutions make this possible by triggering recording based on deeper payload analysis using regexes, or on certain behavior such as a 500 server error.
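The recording trigger mentioned in the last point can be sketched as a simple predicate. The function name, the example pattern, and the choice to treat any status of 500 or above as a trigger are assumptions made for illustration, not the behavior of a particular product.

```python
import re


def should_record(status_code, payload, pattern):
    """Keep the full request if it failed server-side or its payload matches."""
    return status_code >= 500 or re.search(pattern, payload) is not None


# Hypothetical rule: record everything that mentions user 42.
USER_PATTERN = r"user_id=42\b"
RECORD_ERROR = should_record(500, "GET /cart?user_id=7", USER_PATTERN)
RECORD_MATCH = should_record(200, "GET /cart?user_id=42", USER_PATTERN)
RECORD_NEITHER = should_record(200, "GET /cart?user_id=7", USER_PATTERN)
```

Only requests that trip the rule are stored in full, while everything else can still be rolled up into aggregate metrics.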
End-to-end trace maps are hard to gather in real-world apps (across languages, teams and large codebases), and the only valuable information they uniquely provide is the exact fan-out pattern of very specific calls. The maps that provide actionable insights and are useful for DevOps workflows are the ones that aggregate traces (individual and end-to-end) to build a holistic view. The lowest-friction way to collect individual traces is via either logs or OS tracing tools. From a use case perspective, if there is a limited number of services that you are able to instrument using APM or tracing SDKs, then you can use end-to-end tracing. This is seen in practice, with APM tools being deployed on a few user-facing services. Often, though, it is impractical to instrument a lot of services with either APM or tracing SDKs. For example, it is hard to instrument external SaaS services (AWS RDS, DynamoDB) and components such as proxies, DNS, load balancers and databases. For ubiquitous coverage and complete visibility, you can leverage OS-based tracing techniques, which capture individual communications and dependencies. Netsil’s approach falls into the category of individual tracing based on packet capture analysis. You can learn more about Netsil’s map and monitor approach here.