Kubernetes Tutorial: Monitoring HTTP Service Health

HTTP API calls are the backbone of modern cloud applications, especially Kubernetes-based microservices. Yet very little is done to understand the health of HTTP communications. Other than for services connected to the load balancer, it has been rather difficult to measure the key performance indicators (KPIs) of latency, throughput, and error rates for HTTP calls.

The Netsil Application Operations Center (AOC) captures and analyzes service interactions to deliver the complete picture of HTTP service health. The AOC does a deep analysis of application level protocols such as HTTP and gathers all the KPIs along with HTTP attributes (URIs, status codes, etc.). In this tutorial, we will provide a step-by-step guide for using various HTTP datasources, and grouping and filtering the HTTP data based on HTTP attributes.

This blog is meant as a follow-along tutorial; all you need is a Kubernetes cluster and kubectl. You can easily set up the sock-shop app and the Netsil AOC.

Topics Covered

  1. Defining HTTP Latency, Throughput and Error Rates
  2. Comparing Latency of HTTP Success and Errors

Setup

We will be using the sock-shop app running on a Kubernetes cluster as our target application for mapping and monitoring. The AOC is installed as a pod and the collectors are installed as DaemonSet pods on each of the Kubernetes worker nodes (see figure below). You can easily get this setup going in your Kubernetes cluster using our installer.

Netsil AOC Setup to Monitor HTTP Services in Kubernetes Cluster


What HTTP Service to Monitor?

Your application probably has a lot of HTTP services. The Netsil maps help you understand the dependencies among services and pick HTTP calls that you should monitor. From the Maps blog, we have the following picture of HTTP interactions in the sock-shop app.

We will pick the HTTP communication between front-end and catalogue for this tutorial (see figure below).

Using Maps to Understand Service Dependencies and Select Services to Monitor


Getting A List of the HTTP Interactions

There might be multiple HTTP calls going on between the front-end and catalogue pods. We can understand these calls by using the AOC Analytics Sandbox. All we need to do is select the client and server pod names and group by http.uri. Easy!

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.count as the Datasource
  3. Select count as the Aggregation function to apply
  4. Set http.uri as the GroupBy
  5. Now let’s set the Filters so that we restrict the client and server to specific pods
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
  6. Change the chart type to Bar

Using Netsil Analytics to Get List of HTTP Interactions Between Specific Pods


We can see the HTTP URIs associated with the communication between front-end and catalogue. As expected, the calls are for URIs of the form /catalogue/<catalogue_id>.
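Conceptually, the group-by that the sandbox performs is just counting requests per URI. Here is a minimal sketch in Python; the request records and field names are hypothetical stand-ins for what Netsil captures:

```python
from collections import Counter

# Hypothetical captured HTTP requests between front-end and catalogue pods
requests = [
    {"uri": "/catalogue/3395a43e", "client": "sock-shop/front-end"},
    {"uri": "/catalogue/510a0d7e", "client": "sock-shop/front-end"},
    {"uri": "/catalogue/3395a43e", "client": "sock-shop/front-end"},
]

# Group by http.uri and count, mirroring the sandbox query
counts = Counter(r["uri"] for r in requests)
print(counts.most_common())  # [('/catalogue/3395a43e', 2), ('/catalogue/510a0d7e', 1)]
```

The AOC does this aggregation for you across all captured traffic, without any instrumentation code.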

Defining HTTP Avg Latency

We will define the HTTP Avg Latency for calls to the URI /catalogue.*. Additionally, we will restrict the measurement to GET requests from front-end to catalogue.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]

And we have a chart measuring the latency of the front-end to catalogue HTTP interaction! We selected the HTTP latency datasource, applied the client/server filters, and restricted the metrics to the specific URI (/catalogue.*) and the GET method.

All this was made easy because Netsil automatically gathers the HTTP metrics along with all the key attributes, such as URI, request method, etc., from analyzing service interactions.
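The avg aggregation itself is simple arithmetic over the per-request latencies in each time window. A quick sketch, with made-up latency values for illustration:

```python
# Hypothetical per-request latencies (ms) for GET /catalogue.* calls
# observed in one aggregation window
latencies_ms = [12.0, 18.0, 15.0, 21.0]

# The avg aggregation: arithmetic mean of the window's samples
avg_latency = sum(latencies_ms) / len(latencies_ms)
print(avg_latency)  # 16.5
```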

Measuring HTTP Latency Using Netsil Analytics

Measuring HTTP Latency Using Netsil Analytics

Defining HTTP Throughput

This is very similar to defining the latency. All we need to do is change the datasource from http.request_response.latency to http.request_response.throughput. Below we repeat the steps and highlight the resulting chart.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.throughput as the Datasource
  3. Select throughput as the Aggregation function to apply
  4. Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
    • http.uri : /catalogue.* (regex)
    • http.request_method : GET
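Throughput is simply the number of matching requests divided by the length of the aggregation window. A minimal sketch with hypothetical numbers:

```python
# Hypothetical: 120 GET /catalogue.* requests observed in a 10-second window
request_count = 120
window_seconds = 10.0

# Throughput expressed as requests per second
throughput = request_count / window_seconds
print(throughput)  # 12.0
```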

Measuring HTTP Throughput Using Netsil Analytics


Defining HTTP Error Rates

For simplicity, let’s focus on the HTTP 5xx and 4xx errors (e.g., status codes 500, 404, etc.). Then the error rate is defined as:

(Throughput of HTTP 5xx or 4xx requests) / (Total Throughput) * 100
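As a sanity check, the arithmetic behind this definition can be sketched in plain Python; the throughput numbers below are made up for illustration:

```python
# Hypothetical throughput values (requests/sec) for the same interaction
total_throughput = 250.0   # all GET /catalogue.* requests
error_throughput = 5.0     # only the requests answered with 4xx/5xx

# Error rate as a percentage of total traffic
error_rate = (error_throughput / total_throughput) * 100
print(f"{error_rate:.1f}%")  # 2.0%
```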

Continuing from the previous section, we have already defined the overall throughput. Below is a screenshot of that query; note the query statement name A. So A represents the total throughput. We will see how to use this name and combine queries to generate the error rate. We will create another query statement and use filters to restrict the throughput metrics to HTTP 5xx and 4xx status codes.

Using HTTP Status Codes to Obtain Throughput of HTTP Errors


  1. Create another query statement by clicking the + METRIC button. Note this creates new statement named B.
  2. Select http.request_response.throughput as the Datasource
  3. Select throughput as the Aggregation function to apply
  4. Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
    • http.uri : /catalogue.* (regex)
    • http.request_method : GET
    • http.status_code : (4\d\d|5\d\d) (regex) [We filter on status code and select only those requests that are getting 4xx or 5xx errors]

Query statement B has the throughput of the 4xx and 5xx errors. Next we will use the EXPRESSION feature to combine the two and obtain the error rate, i.e., B/A*100.
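The status-code filter above relies on the regex (4\d\d|5\d\d) matching any three-digit code starting with 4 or 5. A quick way to convince yourself it behaves as intended:

```python
import re

# The same regex used in the http.status_code filter
error_pattern = re.compile(r"(4\d\d|5\d\d)")

# Sample status codes; fullmatch ensures the whole code matches the pattern
codes = ["200", "201", "404", "500", "302", "503"]
errors = [c for c in codes if error_pattern.fullmatch(c)]
print(errors)  # ['404', '500', '503']
```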
Measuring HTTP Error Rates Using Netsil Analytics


 

  1. Create an expression statement by clicking the +EXPRESSION button
  2. Select Eval as the operator to combine queries using arithmetic
  3. Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula, in this case ($B/$A)*100. And we have the error rates! We created two query statements and combined them to obtain the error rate.

Comparing Latency of HTTP Errors and Success

If an HTTP service is failing, it had better fail fast. Otherwise, end users not only wait longer but are ultimately frustrated to receive HTTP errors. A good way to measure this is the ratio Avg Latency of HTTP Errors / Avg Latency of HTTP Success.
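The intuition behind the metric, with hypothetical latency values: a ratio well below 1.0 means failing requests are rejected quickly rather than timing out slowly.

```python
# Hypothetical average latencies (ms) from the two query statements
avg_latency_errors = 4.0    # requests answered with 4xx/5xx
avg_latency_success = 20.0  # requests answered with 200

# Ratio < 1.0: errors fail faster than successes complete (the healthy case)
ratio = avg_latency_errors / avg_latency_success
print(ratio)  # 0.2
```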

Let’s learn how to define this metric in Netsil.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]
    • http.status_code : (4\d\d|5\d\d) (regex) [regex matching 4xx and 5xx errors]

Note the query statement name A. This query statement returns the average latency of HTTP requests resulting in 4xx and 5xx errors.

Measuring Latency of HTTP Errors


  1. Create another query statement by clicking the + METRIC button. Note this creates new statement named B.
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]
    • http.status_code : 200 [This is the latency of success]

Note the query statement name B. This query statement returns the average latency of HTTP requests resulting in success. Now we just need to calculate A/B to get the ratio comparing the latency of errors to that of success.

  1. Create an expression statement by clicking the +EXPRESSION button
  2. Select Eval as the operator to combine queries using arithmetic
  3. Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula, in this case ($A/$B).
    Note that the Eval statement name is C. The plot of C reveals that the latency of error requests is a small fraction of that of successful requests. This is how it should be! As mentioned earlier, this is a good metric to track and alert on, as it greatly impacts end-user experience.

Ratio of Latency of HTTP Errors & HTTP Success


Conclusion

Monitoring the health of HTTP API calls is critical to ensure the reliability of modern microservices applications. Latency, error rates and throughput are key health indicators for HTTP calls. There is a need to understand these calls and monitor them along multiple attributes such as client id, server id, status codes, URI patterns, etc.

The Netsil Application Operations Center (AOC) provides deep insight into HTTP API health through real-time analysis of service interactions. By leveraging Netsil, operations teams get complete visibility into the health and performance of their HTTP APIs. You can get valuable insights into your API health right away by using Netsil in your Kubernetes cluster.


Copyright © 2015 - 2017 Netsil Inc. All Rights Reserved.
