Monitoring Cassandra Health and Performance Metrics
What is Cassandra?
Apache Cassandra is an open-source, distributed NoSQL database system whose design and data model are inspired by Amazon’s Dynamo and Google’s Bigtable, respectively. Cassandra gained popularity because of its scalability and high availability with no single point of failure. There is no concept of a master node; all nodes communicate with each other for consensus and data partitioning. Cassandra also allows workloads to run across multiple datacenters with support for low-latency replication, making it a great platform for mission-critical data.
Golden Signals of Cassandra Health and Performance
For engineering teams using Cassandra to deliver workloads at scale, it is critical to monitor cluster health in real time to avoid performance issues. Cassandra is a Java-based system that can be managed and monitored via JMX. The key metrics to monitor, known as the Golden Signals of application health, include:
- Throughput of read and write request queries
- Latency of slowest queries
- Error rates
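As a rough illustration of how these three signals relate to raw query observations, here is a minimal Python sketch. The `QueryRecord` shape, the `golden_signals` helper, and the sample values are all hypothetical and chosen for illustration; they are not part of any Cassandra or Netsil API.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One observed request/response pair (hypothetical shape)."""
    query: str
    latency_ms: float
    is_error: bool


def golden_signals(records, window_seconds):
    """Summarize throughput, slowest latency, and error rate for one time window."""
    if not records:
        return {"throughput_qps": 0.0, "max_latency_ms": 0.0, "error_rate": 0.0}
    errors = sum(1 for r in records if r.is_error)
    return {
        # Queries observed per second over the window.
        "throughput_qps": len(records) / window_seconds,
        # Latency of the single slowest query in the window.
        "max_latency_ms": max(r.latency_ms for r in records),
        # Fraction of queries that returned an error.
        "error_rate": errors / len(records),
    }


# Illustrative sample: four queries observed over a two-second window.
sample = [
    QueryRecord("SELECT * FROM users WHERE id = ?", 2.0, False),
    QueryRecord("INSERT INTO events (id, ts) VALUES (?, ?)", 5.0, False),
    QueryRecord("SELECT * FROM orders WHERE id = ?", 40.0, True),
    QueryRecord("SELECT * FROM users WHERE id = ?", 3.0, False),
]
signals = golden_signals(sample, window_seconds=2)
print(signals)
```

In practice, a percentile such as p99 is often tracked alongside the maximum, since a single outlier can dominate the slowest-query figure.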
Netsil’s Approach for Cassandra Monitoring
- Interaction Analytics: The Netsil Application Operations Center (AOC) uses a network-centric approach to monitor query-level performance for databases like Cassandra. Without instrumenting either the server or the client side, just by looking at the wire protocol through TCP packet capture, the AOC provides information about how each and every query is doing. This approach has low overhead and gives real-time visibility into latency, throughput, error codes, distribution of requests, response sizes, etc. for every query. You do not need to install the Netsil traffic collectors on the database server; you can observe the interactions from the client side.
- Polling: The AOC also uses a polling technique to give a complete picture of database performance. Basic polling allows us to look at the saturation metrics of the database (e.g., thread counts, IOPS issues, connections/sec, etc.).
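To make the wire-protocol idea concrete, the sketch below decodes the 9-byte frame header of the CQL native protocol (versions 3 and 4), which is the kind of information a packet-capture approach can read without touching the client or server. This is an illustrative sketch only, not Netsil's implementation; the `parse_frame_header` helper and the hand-built sample frames are assumptions, while the header layout and opcode values follow the published native protocol specification.

```python
import struct

# Subset of opcodes from the CQL native protocol specification.
OPCODES = {
    0x00: "ERROR",
    0x07: "QUERY",
    0x08: "RESULT",
    0x09: "PREPARE",
    0x0A: "EXECUTE",
    0x0D: "BATCH",
}

HEADER_SIZE = 9  # version(1) + flags(1) + stream(2) + opcode(1) + length(4)


def parse_frame_header(data: bytes) -> dict:
    """Decode a v3/v4 native protocol frame header from the start of `data`."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 9 bytes for a frame header")
    version, flags, stream, opcode, length = struct.unpack(">BBhBI", data[:HEADER_SIZE])
    return {
        # The high bit of the version byte marks a response frame.
        "direction": "response" if version & 0x80 else "request",
        "protocol_version": version & 0x7F,
        "flags": flags,
        # The stream id pairs a response with the request that caused it.
        "stream": stream,
        "opcode": OPCODES.get(opcode, f"UNKNOWN(0x{opcode:02X})"),
        "body_length": length,
    }


# Hand-built sample headers (hypothetical traffic): a QUERY request and an ERROR response.
request = bytes([0x04, 0x00, 0x00, 0x01, 0x07]) + (32).to_bytes(4, "big")
response = bytes([0x84, 0x00, 0x00, 0x01, 0x00]) + (16).to_bytes(4, "big")

req_header = parse_frame_header(request)
resp_header = parse_frame_header(response)
print(req_header)
print(resp_header)
```

Matching request and response frames on the same connection by stream id, and subtracting their capture timestamps, is one way a passive observer could derive per-query latency.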
The protocol datasources related to request/response are available out-of-the-box in the AOC. Please look at the pre-canned dashboards for Cassandra or use the Analytics Sandbox to plot charts without any additional configuration. The infrastructure datasources are available if you configure the Cassandra integration. More information on configuring the infrastructure integration can be found in the documentation.
Monitoring the throughput of queries between a client and server provides a real-time snapshot of how the database is performing. Using the AOC’s drill-down feature, you can easily separate throughput numbers based on dimensions such as query string, query type, server error code, server instance, server port, etc.
For example, as shown in the image below, you can sort the throughput of the top-K queries by client name. All of this is done without any code instrumentation. This really helps DevOps efforts when you want to know which clients are making the most requests and which servers are sending the most responses.
By monitoring the throughput numbers you can track your cluster’s overall health and watch for spikes or dips that might need further investigation.
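The drill-down described above amounts to grouping observed requests by a chosen dimension and ranking the groups. A minimal Python sketch of that idea follows; the event fields (`client`, `query_type`, `server`), the `top_k_throughput` helper, and the sample data are hypothetical and do not reflect the AOC's internal data model.

```python
from collections import Counter


def top_k_throughput(events, dimension, k):
    """Count requests per value of `dimension` and return the k busiest values."""
    counts = Counter(event[dimension] for event in events)
    return counts.most_common(k)


# Illustrative requests observed during one time window.
events = [
    {"client": "web-1", "query_type": "SELECT", "server": "cass-1"},
    {"client": "web-1", "query_type": "SELECT", "server": "cass-2"},
    {"client": "web-1", "query_type": "INSERT", "server": "cass-1"},
    {"client": "batch-1", "query_type": "INSERT", "server": "cass-1"},
    {"client": "batch-1", "query_type": "INSERT", "server": "cass-2"},
    {"client": "web-2", "query_type": "SELECT", "server": "cass-1"},
]

top_clients = top_k_throughput(events, "client", k=2)
top_servers = top_k_throughput(events, "server", k=1)
print(top_clients)
print(top_servers)
```

Dividing each count by the window length in seconds converts it into a requests-per-second rate.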
Monitoring the latency of a read or write query is critical no matter what your use case is. By focusing on the latency numbers you can identify potential problems or shifts in usage patterns and adjust your cluster size accordingly. In the AOC you get latency information about the top-K most requested queries as well as the slowest queries.
DevOps teams can use the real-time latency information to find where traffic bottlenecks might be building up and, for example, to identify which host contributes the most to the latency.
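The two latency views mentioned here, slowest queries and per-host contribution, can be sketched in Python as follows. The sample fields (`query`, `host`, `latency_ms`) and both helper functions are hypothetical and only illustrate the kind of aggregation involved.

```python
from collections import defaultdict
from statistics import mean


def slowest_queries(samples, k):
    """Return the k queries with the highest worst-case latency."""
    worst = defaultdict(float)
    for s in samples:
        worst[s["query"]] = max(worst[s["query"]], s["latency_ms"])
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)[:k]


def mean_latency_by_host(samples):
    """Average the observed latency per server host."""
    by_host = defaultdict(list)
    for s in samples:
        by_host[s["host"]].append(s["latency_ms"])
    return {host: mean(values) for host, values in by_host.items()}


# Illustrative latency samples.
samples = [
    {"query": "SELECT users", "host": "node-1", "latency_ms": 2.0},
    {"query": "SELECT users", "host": "node-2", "latency_ms": 4.0},
    {"query": "SELECT orders", "host": "node-2", "latency_ms": 30.0},
    {"query": "INSERT events", "host": "node-1", "latency_ms": 6.0},
]

slowest = slowest_queries(samples, k=2)
per_host = mean_latency_by_host(samples)
# The host with the highest average latency is a candidate bottleneck.
worst_host = max(per_host, key=per_host.get)
print(slowest, per_host, worst_host)
```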
The AOC tracks the server error codes and error strings, and monitors the error rates. Alerting on a high number of errors is very important for DevOps teams. If your Cassandra cluster is unable to handle incoming requests adequately, that is definitely worth paging your team about.
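A paging rule on error rate can be as simple as a threshold check over a time window. The sketch below is a generic illustration, not the AOC's alerting logic; the `should_page` helper, the 5% threshold, and the minimum request count are assumed values that would need tuning for a real workload.

```python
def should_page(total_requests, error_count, threshold=0.05, min_requests=100):
    """Decide whether the error rate in a window is high enough to page someone.

    `threshold` and `min_requests` are assumed defaults for illustration.
    """
    # Skip windows with too little traffic, where a few errors would skew the rate.
    if total_requests < min_requests:
        return False
    return error_count / total_requests > threshold


print(should_page(total_requests=1000, error_count=80))  # 8% error rate
print(should_page(total_requests=1000, error_count=20))  # 2% error rate
print(should_page(total_requests=10, error_count=10))    # too few requests to judge
```

The minimum request count is a design choice that trades sensitivity for fewer false alarms during quiet periods.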
Netsil’s interaction analytics combined with polling gives complete visibility into the performance of the Cassandra database, without any instrumentation on the client application or the database side. By interpreting the network interactions, Netsil is able to track the performance of queries with no overhead to database servers.
If you are using Cassandra for your cloud application, we encourage you to get started free with the AOC today and gain complete visibility into the health of all your service interactions.