Friday, November 20, 2015

Monitor The Hell out of Cassandra

One of the most important lessons I’ve learned from working with Apache Cassandra is that without enough visibility into your cluster, you won’t be able to tell the real reason behind what’s happening in it.
You’ll probably end up looking something like this when a problem comes around the corner:

At first all was well,
but a couple of weeks after launching Cassandra with the DataStax Community AMI on AWS, with the data growing and more Apache Spark applications running against our new cluster, we encountered some performance issues. Once I started investigating, I realized I didn’t have enough visibility into the cluster. I’m not saying we were completely “blind”; we had the DataStax OpsCenter that was installed with the cluster,
but it wasn’t quite enough.

Our Cassandra server and service topology is described in the next diagram:

We have our C* (short for Cassandra) ring running with “cassandra” as a Linux service on each node, and the “datastax-agent” service communicating with it locally on the same machine.
Metrics are sent to the “opscenterd” process running on one of the C* machines and are all shown in the OpsCenter web UI.

How to Install OpsCenter in case you don’t have it:
Just in case you don’t have “datastax-agent” and “opscenterd” installed, it’s really easy to set them up by following the next few steps (these are specified for Debian-derived OSs).
  1. Add the DataStax Community repository to the /etc/apt/sources.list.d/cassandra.sources.list
    1. “echo "deb stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list”
  2. Add the DataStax repository key to your aptitude trusted keys.
    1. “curl -L | sudo apt-key add -”
  2. Run: “sudo apt-get update”
  4. Run: “sudo apt-get install datastax-agent”
  5. This package should be installed only on the node running OpsCenter:
    1. “sudo apt-get install opscenter”
  6. Checking which versions are installed:
    1. “sudo apt-cache policy datastax-agent”
    2. “sudo apt-cache policy opscenter”
    3. “sudo apt-cache policy cassandra”
  7. Create the file named: /var/lib/datastax-agent/conf/address.yaml
    1. Add the address of the OpsCenter server (In our case it’s the private ip of the EC2 instance).
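The address.yaml from step 7 can be as minimal as the sketch below; the IP is an example value that you’d replace with your own OpsCenter machine’s private address:

```yaml
# /var/lib/datastax-agent/conf/address.yaml
stomp_interface: "10.0.0.10"   # example value: the private IP of the node running opscenterd
```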

  8. Useful extra tools that might come in handy:
    1. “sudo apt-get install cassandra-tools”
OK, if you were lagging behind, you’re now at the same position we were:
you have OpsCenter up and running!

The important question right now is: what do we actually get from that?
Which metrics are essential for me as a Developer / DevOps to see?
I’ll divide them into 4 groups of metrics:
  1. Cassandra - all of the functioning metrics of the Cassandra process, keyspace-specific information, latency, etc.
  2. JVM - JVM thread pools and GC times of the running Cassandra process.
  3. Operating System - CPU load, disk utilization, storage size, etc.
  4. Application-Level Metrics - which application queries are triggered and executed, and which applications are accessing the cluster and adding to its load.
Actually, DataStax did a great job:
out of the box they give us 3 out of the 4!
That’s a very powerful toolset without even breaking a sweat, but it’s really not enough,
because the real deal is unveiled only when you have (1+2+3) and 4, the combination of Cassandra metrics + application-level metrics.
Only then will you be able to see where your hotspots are and what the correlation is between Cassandra latency peaks and the application-side workload (which can be an Apache Spark job or many NodeJS servers accessing the database).
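What “application-level metrics” means in practice can be as simple as timing each query on the application side and shipping the measurement to the same Graphite backend, via its plaintext protocol (“metric.path value unix-timestamp”). The names here (myapp.cassandra-client, graphite.internal) are hypothetical, not from our setup:

```shell
# Build one Graphite plaintext-protocol line for a query timing measured app-side.
APP_PREFIX="myapp.cassandra-client"    # hypothetical per-application prefix
QUERY_MS=42                            # latency you measured around a CQL query
TS=$(date +%s)
LINE="${APP_PREFIX}.select_users.latency_ms ${QUERY_MS} ${TS}"
echo "$LINE"
# The actual send to carbon's plaintext port (2003) would then be:
#   echo "$LINE" | nc -q0 graphite.internal 2003
```

Correlating these application series with the Cassandra series on the same Grafana dashboard is what makes the hotspots visible.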

Now for the next step: how do we actually do that?

The solution has two parts:
  1. I believe in redundancy, so I still wanted to keep OpsCenter as one of the metrics visualization layers; it gives you a lot of power by covering 3 of the 4 groups right out of the box.
    Small disclaimer: keep in mind that keeping OpsCenter around might put more load on your network, and you all know how sensitive we should be about our networks.
  2. Pluggable metrics (since C* 2.0.x) -> reporting to Graphite + Grafana dashboards:
You can see the elaboration in the next diagram.
Since C* 2.0.x we have the option to add a plugin JAR to each C* node and configure reporting to a Graphite sink; we chose to visualize all of the graphs in Grafana, which is the monitoring visualization layer for our application metrics as well.

Many people have asked me how to configure C* to report metrics to Graphite,
so here’s the “how” (originally taken from the DataStax Developer Blog post, but I’ve elaborated on it a bit more):

Plug & Play the metrics to Graphite
  1. On each node, place the metrics-graphite-2.2.0.jar in the “/usr/share/cassandra/lib” directory. (You can be lazy and copy/paste the next command, and it will be downloaded from the Maven central repository.)
    1. “sudo wget -O /usr/share/cassandra/lib/metrics-graphite-2.2.0.jar”

  2. You’ll need to place the metrics.yaml file that configures the sink you’ll be reporting to; here is an example gist that you can take as a template:
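If the gist isn’t accessible, here’s a sketch of what such a metrics.yaml can look like (the host, port, prefix, and patterns are example values you’d replace with your own):

```yaml
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra-cluster-prefix.node1'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.Cache.+'
        - '^org.apache.cassandra.metrics.ClientRequest.+'
        - '^org.apache.cassandra.metrics.Storage.+'
```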

Note: 3 important things you should notice in the configuration file:
  • prefix: 'cassandra-cluster-prefix.#'
    I’ve set the prefix to “cassandra-cluster-prefix.”, and the “#” should be different on each node, so that all of the results can be aggregated via Graphite / Grafana later.
  • patterns: - these are all of the metrics that I would like the node to send to the sink; you can add multiple entries at the desired granularity level.
  • period: remember, if you report lots of metrics really frequently, your monitoring will become “Big Data” :).
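One way to fill in the per-node “#” part of the prefix (a sketch, assuming you’re fine with naming nodes by their short hostname) is to derive it on each machine before templating metrics.yaml:

```shell
# Derive the per-node suffix from the short hostname, so each node
# reports under its own branch and Graphite can aggregate across nodes.
NODE=$(hostname -s)
PREFIX="cassandra-cluster-prefix.${NODE}"
echo "$PREFIX"
```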
  3. Adding the relevant entry to “”:
    1. “sudo vim /etc/cassandra/”
    2. Add this to the end of the file: “JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=/etc/cassandra/metrics.yaml"”
  4. Now you’ll need to restart the nodes that you’ve made the changes to.
    1. “nodetool drain” - you’ll want to flush the commitlog and stop accepting connections first.
    2. “sudo service cassandra restart”

Now, after all of these adjustments, you have your own monitoring solution:
combined DataStax OpsCenter + Graphite / Grafana dashboards, showing both server and application metrics (a solution with C* + JVM + Application metrics - 1 + 2 + 4).

The only thing you’d still be missing in the Grafana dashboard is the OS metrics of the C* servers. Those can be collected via StatsD or collectd agents installed on the nodes; you’ll need to wire them up to report metrics (you’ll probably want the same prefixes as the C* metrics).
Then you’ll have a combined solution of “C* + JVM + OS + Application” metrics.
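For the collectd route, a minimal fragment with the write_graphite plugin could look roughly like this (host, path, and prefix are example values, and the plugin list is only a starting point):

```
# e.g. /etc/collectd/collectd.conf.d/graphite.conf
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"
    Port "2003"
    Protocol "tcp"
    Prefix "cassandra-cluster-prefix."
  </Node>
</Plugin>
```

Keeping the same “cassandra-cluster-prefix.” here lets the OS series land next to the C* series in Grafana.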

And now for the next step:
because of performance issues we’re seeing with the Graphite backbone of our monitoring solution, we’re thinking of changing the datastore, and we’ll probably change it to InfluxDB, which can serve as the backing datastore for the Grafana dashboards.
It will all be pretty much the same as described in the next diagram:

To sum things up,
all we did in this post is gain better visibility into the cluster; I haven’t even started to describe what to do when there are problems.
In the next blog posts I’ll write more about useful tools and methods for finding the problems in your clusters, and I’ll talk more about possible “gotchas” and things that might be the root cause of some nasty performance issues in C*, as well as some JVM tuning options.

This is the first battle in our fight with C* performance tuning - “Get ready for the war”.

Thanks to Call of Duty for the nice picture :)

Hope this helps someone else with their problems.