Monitoring Cassandra with DataDog

Jun 10 2016

There are a few things to set up correctly to get this working.

I've spent a long time experimenting with ways to monitor Cassandra, and Datadog seems to be the best approach.

The DataStax alternative, OpsCenter (http://www.datastax.com/products/datastax-enterprise-visual-admin), is severely limited by the number of keyspaces you create, and it can easily halt every node in the cluster if this number gets too high. I believe they recommend only 20 to 50 keyspaces, though I'm not certain of the exact figure.

 

(There are some screen captures of what you can achieve with Datadog at the end.)

We are going to assume the following:

  • Cassandra is installed on a CentOS machine
  • You are aware of how to start and stop Cassandra
  • You know the folder structure for the log files, config files, data files, etc.
  • You know how to install the Datadog agent on the CentOS machine (a guide for this can be found on your Datadog account page)

 


 

 

By default, Cassandra opens the JMX port (7199) to local connections only, but it can be opened to external connections if required.

Below are the steps required to do so, if you need them.
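Whichever way the port is exposed, it helps to confirm that it is actually reachable before pointing the agent at it. In many Cassandra releases, local-only versus remote JMX is governed by the LOCAL_JMX variable in cassandra-env.sh; treat that as an assumption and check your own version. The sketch below is only a plain TCP reachability test (the function name jmx_port_open is made up for this example), so it says nothing about JMX authentication.

```python
import socket


def jmx_port_open(host: str, port: int = 7199, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the port accepts connections; it does not speak JMX
    and does not check credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Running jmx_port_open("127.0.0.1") on the Cassandra node checks the local default; running it from another machine with the node's address shows whether the port is open externally.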

 

 


 

 

 

Once Cassandra is set up and running, we can go ahead and install Datadog.

By default, Datadog limits the number of metrics published to 350.

We can easily remove this limit by following the steps below.

 


Next we need to add our Cassandra monitoring file, “cassandra.yaml”.

This will give us some key metric data to monitor. We could monitor more, but this gives us a great high-level overview. Note that we exclude the system keyspaces so that they don't skew the data for our own keyspaces.

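# The Cassandra node to poll over JMX
instances:
  - host: 127.0.0.1
    port: 7199

init_config:
  conf:
    # Attributes to collect from the Cassandra db domain
    - include:
        domain: org.apache.cassandra.db
        attribute:
          - LiveSSTableCount
          - ReadCount
          - WriteCount
          - CompletedTasks
          - PendingTasks
      # Skip the system keyspaces
      exclude:
        keyspace:
          - system
          - system_traces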
 


 

Next we need to add our “jmx.yaml”. This allows us to monitor Cassandra's memory management and see how it is handling memory.

This is great data if you think you are overusing your resources.

init_config:

instances:
  - host: 127.0.0.1
    port: 7199
    conf:
      # Old generation memory pool (CMS collector)
      - include:
          domain: java.lang
          bean:
            - 'java.lang:type=MemoryPool,name=CMS Old Gen'
          attribute:
            CollectionUsage:
              metric_type: gauge
              alias: jmx.old_gen.collection_usage

      # Eden space memory pool
      - include:
          domain: java.lang
          bean:
            - 'java.lang:type=MemoryPool,name=Par Eden Space'
          attribute:
            CollectionUsage:
              metric_type: gauge
              alias: jmx.eden.collection_usage

      # Survivor space memory pool
      - include:
          domain: java.lang
          bean:
            - 'java.lang:type=MemoryPool,name=Par Survivor Space'
          attribute:
            CollectionUsage:
              metric_type: gauge
              alias: jmx.survivor.collection_usage

 


 

Give datadog a restart with

sudo /etc/init.d/datadog-agent restart

then check what it is processing with

sudo /etc/init.d/datadog-agent info

and you should see something like this

Checks
  ======
  
    ntp
    ---
      - Collected 0 metrics, 0 events & 1 service check
  
    disk
    ----
      - instance #0 [OK]
      - Collected 40 metrics, 0 events & 1 service check
  
    network
    -------
      - instance #0 [OK]
      - Collected 15 metrics, 0 events & 1 service check
  
    jmx
    ---
      - instance #jmx-127.0.0.1-7199 [OK] collected 25 metrics
      - Collected 25 metrics, 0 events & 0 service checks
  
    cassandra
    ---------
      - instance #cassandra-127.0.0.1-7199 [OK] collected 2137 metrics
      - Collected 2137 metrics, 0 events & 0 service checks
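If you want to keep an eye on these counts automatically, for example to spot a check that has reached the 350 metric default, the output can be parsed with a few lines of Python. This is a minimal sketch that assumes the output layout shown above (check name, a line of dashes, then a "Collected N metrics" line); the function names are made up for this example, and other agent versions may format the output differently.

```python
import re

COLLECTED = re.compile(r"Collected (\d+) metrics")


def metrics_per_check(info_text: str) -> dict:
    """Map each check name to the metric count on its 'Collected' line."""
    counts = {}
    current = None   # name of the check whose section we are in
    previous = ""    # last non-empty line seen
    for raw in info_text.splitlines():
        line = raw.strip()
        if len(line) >= 3 and set(line) == {"-"}:
            # A line made only of dashes underlines the check name above it.
            current = previous
        else:
            match = COLLECTED.search(line)
            if match and current:
                counts[current] = int(match.group(1))
        if line:
            previous = line
    return counts


def at_or_over_limit(counts: dict, limit: int = 350) -> list:
    """Return the checks whose metric count has reached the given limit."""
    return sorted(name for name, n in counts.items() if n >= limit)
```

Saving the output of the info command to a file and passing its contents to metrics_per_check gives a dictionary such as {"jmx": 25, "cassandra": 2137, ...} for the output above. As far as I know, the JMX-based checks in this agent generation read an instance-level max_returned_metrics option to raise the limit, but verify that against the example config shipped with your agent.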

 


 

 

So now we know we are broadcasting data to Datadog.

Let's create some graphs to read it.

In the JSON code, the following template variables are used:
$Cassandra_jmx1  = instance:jmx:127.0.0.1-7199
$Cassandra1  = instance:cassandra:127.0.0.1-7199

 

The graphs below were created in a Datadog dashboard.

I won't explain the data; it should be self-explanatory for those who understand Cassandra and want to monitor these metrics.

 

Heap Data Metrics

 

[Screenshot: Heap Data Metrics graph]

Here is the JSON needed to create this graph, using the metrics we broadcast from the jmx.yaml file:

{
  "viz": "timeseries",
  "requests": [
    {
      "q": "sum:jvm.heap_memory_max{$Cassandra_jmx1} by {instance}",
      "aggregator": "avg",
      "conditional_formats": [],
      "type": "area"
    },
    {
      "q": "sum:jvm.heap_memory{$Cassandra_jmx1} by {instance}",
      "type": "area"
    },
    {
      "q": "sum:jmx.old_gen.usage_used{$Cassandra_jmx1} by {instance}",
      "style": {
        "palette": "warm"
      },
      "type": "area"
    },
    {
      "q": "sum:jmx.eden.usage_used{$Cassandra_jmx1} by {instance}",
      "style": {
        "palette": "cool"
      },
      "type": "area"
    }
  ]
}

 

 

 

Cassandra Read Vs Writes

[Screenshot: Cassandra Read Vs Writes graph]

 

{
  "viz": "timeseries",
  "requests": [
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra1})",
      "aggregator": "avg",
      "conditional_formats": [],
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra2})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra3})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra4})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra6})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.write_count{$Cassandra5})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra1})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra2})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra3})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra4})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra5})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.read_count{$Cassandra6})",
      "type": "line"
    }
  ]
}
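Since the JSON above repeats the same request for every node, it can be generated instead of typed by hand. The sketch below builds an equivalent widget definition for any number of nodes, assuming the template variables follow the $Cassandra1 ... $CassandraN naming used here. The function name read_write_widget is hypothetical, and the optional "aggregator" and "conditional_formats" keys from the first request are left out.

```python
import json


def read_write_widget(node_count: int) -> dict:
    """Build a timeseries widget with one write and one read line per node."""
    requests = []
    for metric in ("write_count", "read_count"):
        for n in range(1, node_count + 1):
            requests.append({
                # Doubled braces emit the literal { } that wrap the variable.
                "q": f"per_second(sum:cassandra.db.{metric}{{$Cassandra{n}}})",
                "type": "line",
            })
    return {"viz": "timeseries", "requests": requests}


# Print the JSON for a six-node cluster, ready to paste into the graph editor.
print(json.dumps(read_write_widget(6), indent=2))
```

Generating the definition this way also avoids the small copy and paste slips that creep in when a variable name is edited twelve times by hand.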

 

 

 

Cassandra Compaction Counts

[Screenshot: Cassandra Compaction Counts graph]

 


{
  "viz": "timeseries",
  "requests": [
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra1})",
      "aggregator": "avg",
      "conditional_formats": [],
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra2})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra3})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra4})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra5})",
      "type": "line"
    },
    {
      "q": "per_second(sum:cassandra.db.completed_tasks{$Cassandra6})",
      "type": "line"
    }
  ]
}

 

 

 

Cassandra SSTable Count

 

[Screenshot: Cassandra SSTable Count graph]

 

This graph charts the LiveSSTableCount attribute collected in cassandra.yaml. The metric name below assumes the agent's usual snake_case conversion of that attribute, so confirm it in your Datadog metrics list.

{
  "viz": "timeseries",
  "requests": [
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra1}",
      "aggregator": "avg",
      "conditional_formats": [],
      "type": "line"
    },
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra2}",
      "type": "line"
    },
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra3}",
      "type": "line"
    },
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra4}",
      "type": "line"
    },
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra5}",
      "type": "line"
    },
    {
      "q": "sum:cassandra.db.live_ss_table_count{$Cassandra6}",
      "type": "line"
    }
  ]
}

 

 

Cassandra Write Activity By Node

[Screenshot: Cassandra Write Activity By Node graph]

 

{
  "viz": "toplist",
  "requests": [
    {
      "q": "top(sum:cassandra.db.write_count{role:cassandra,region:us-east-1} by {host}, 10, 'sum', 'desc')",
      "style": {
        "palette": "dog_classic"
      },
      "conditional_formats": []
    }
  ]
}

 
