updating docs

drewkerrigan 2015-11-19 13:49:24 -05:00
parent fb4e58b635
commit 8adcf2ff07
3 changed files with 452 additions and 163 deletions

269
README.md

@ -2,49 +2,15 @@
This is a generic plugin for Nagios which checks json values from a given HTTP endpoint against argument specified rules and determines the status and performance data for that service.
### Installation
## Links
#### Requirements
* [CLI Usage](#cli-usage)
* [Examples](#examples)
* [Riak Stats](docs/RIAK.md)
* [Docker](docs/DOCKER.md)
* [Nagios Installation](#nagios-installation)
* Nagios
* Python
### Nagios Configuration
Assuming a standard installation of Nagios, the plugin can be executed from the machine that Nagios is running on.
```bash
cp check_http_json.py /usr/local/nagios/libexec/plugins/check_http_json.py
chmod +x /usr/local/nagios/libexec/plugins/check_http_json.py
```
Add the following service definition to your server config (`localhost.cfg`):
```
define service {
use local-service
host_name localhost
service_description <command_description>
check_command <command_name>
}
```
Add the following command definition to your commands config (`commands.cfg`):
```
define command{
command_name <command_name>
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H <host>:<port> -p <path> [-e|-q|-w|-c <rules>] [-m <metrics>]
}
```
More info about options in Usage.
### CLI Usage
## CLI Usage
Executing `./check_http_json.py -h` will yield the following details:
@ -114,154 +80,135 @@ optional arguments:
(key[>alias],UnitOfMeasure,WarnRange,CriticalRange).
```
Access a specific JSON field by following this syntax: `alpha.beta.gamma(3).theta.omega(0)`
Dots are field separators (changeable), parentheses are for entering arrays.
## Examples
If the root of the JSON data is itself an array like the following:
### Key Naming
**Data**:
{ "value": 1000 }
**Key**: `value`
**Data**:
{
"capacity": {
"value": 1000
}
}
**Key**: `capacity.value`
**Data**:
```
[
{ "gauges": { "jvm.buffers.direct.capacity": {"value": 215415}}}
{
"capacity": {
"value": 1000
}
}
]
```
The key should start with `($index)`, as in this example:
**Key**: `(0).capacity.value`
```
./check_http_json.py -H localhost:8081 -p metrics --key_exists "(0)_gauges_jvm.buffers.direct.capacity_value" -f _
```
**Data**:
[
{
"gauges": {
"jvm.buffers.direct.capacity": {
"value": 1000
}
}
}
]
**Key**: `(0)_gauges_jvm.buffers.direct.capacity_value`
**Separator**: `-f _`
**Data**:
{
"ring_members": [
"riak1@127.0.0.1",
"riak2@127.0.0.1",
"riak3@127.0.0.1"
]
}
**Keys**: `ring_members(0)`, `ring_members(1)`, `ring_members(2)`
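The key syntax shown above can be sketched in Python as follows. This is a minimal illustration, not the plugin's actual implementation; `resolve_key` is a hypothetical helper name, and the sketch assumes array indexes are always written as `(N)` after an optional field name.

```python
import re


def resolve_key(data, key, separator="."):
    """Walk decoded JSON following a key such as '(0).capacity.value'."""
    current = data
    for part in key.split(separator):
        # Split e.g. 'ring_members(0)' into a field name and its indexes.
        match = re.fullmatch(r"([^()]*)((?:\(\d+\))*)", part)
        if match is None:
            raise KeyError(part)
        name = match.group(1)
        indexes = re.findall(r"\((\d+)\)", match.group(2))
        if name:
            current = current[name]       # descend into an object field
        for index in indexes:
            current = current[int(index)]  # descend into an array element
    return current


nested = [{"gauges": {"jvm.buffers.direct.capacity": {"value": 1000}}}]
print(resolve_key({"capacity": {"value": 1000}}, "capacity.value"))
print(resolve_key(nested, "(0)_gauges_jvm.buffers.direct.capacity_value", "_"))
```

Passing `"_"` as the separator mirrors the `-f _` option: the dots inside `jvm.buffers.direct.capacity` are then treated as part of the field name.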
### Thresholds and Ranges
**Data**:
{ "metric": 1000 }
#### Relevant Commands
**Warning:** `./check_http_json.py -H <host>:<port> -p <path> -w "metric,RANGE"`
**Critical:** `./check_http_json.py -H <host>:<port> -p <path> -c "metric,RANGE"`
**Metrics with Warning:** `./check_http_json.py -H <host>:<port> -p <path> -w "metric,RANGE"`
**Metrics with Critical:**
./check_http_json.py -H <host>:<port> -p <path> -w "metric,,,RANGE"
./check_http_json.py -H <host>:<port> -p <path> -w "metric,,,,MIN,MAX"
#### Range Definitions
**Format:** [@]START:END
**Generates a Warning or Critical if...**
**Value is less than 0 or greater than 1000:** `1000` or `0:1000`
**Value is between 0 and 1000, inclusive:** `@1000` or `@0:1000`
**Value is less than 1000:** `1000:`
**Value is greater than 1000:** `~:1000`
**Value is greater than or equal to 1000:** `@1000:`
More info about Nagios Range format and Units of Measure can be found at [https://nagios-plugins.org/doc/guidelines.html](https://nagios-plugins.org/doc/guidelines.html).
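As a rough illustration of the range format, the sketch below evaluates a value against a range string following the Nagios plugin guidelines. `violates_range` is a hypothetical name and this is not the plugin's own code; a leading `@` is taken to mean "alert when the value is inside the range".

```python
def violates_range(value, spec):
    """Return True if `value` should raise an alert for the range `spec`."""
    inside = spec.startswith("@")  # '@' inverts the check: alert inside the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        start_text, end_text = "0", spec  # a bare 'END' means '0:END'
    # '~' stands for negative infinity; an omitted end means positive infinity.
    start = float("-inf") if start_text == "~" else float(start_text or 0)
    end = float("inf") if end_text == "" else float(end_text)
    in_range = start <= value <= end
    return in_range if inside else not in_range


print(violates_range(1001, "0:1000"))   # outside 0..1000
print(violates_range(500, "@0:1000"))   # inside 0..1000 with '@'
print(violates_range(999, "1000:"))     # below the start
print(violates_range(1001, "~:1000"))   # above the end
```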
### Docker Info Example Plugin
## Nagios Installation
#### Description
### Requirements
Let's say we want to use `check_http_json.py` to read from Docker's `/info` HTTP API endpoint with the following parameters:
* Python
##### Connection information
### Configuration
* Host = 127.0.0.1:4243
* Path = /info
Assuming a standard installation of Nagios, the plugin can be executed from the machine that Nagios is running on.
##### Rules for "aliveness"
```bash
cp check_http_json.py /usr/local/nagios/libexec/plugins/check_http_json.py
chmod +x /usr/local/nagios/libexec/plugins/check_http_json.py
```
* Verify that the key `Containers` exists in the outputted JSON
* Verify that the key `IPv4Forwarding` has a value of `1`
* Verify that the key `Debug` has a value less than or equal to `2`
* Verify that the key `Images` has a value greater than or equal to `1`
* If any of these criteria are not met, report a WARNING to Nagios
##### Gather Metrics
* Report value of the key `Containers` with a MinValue of 0 and a MaxValue of 1000 as performance data
* Report value of the key `Images` as performance data
* Report value of the key `NEventsListener` as performance data
* Report value of the key `NFd` as performance data
* Report value of the key `NGoroutines` as performance data
* Report value of the key `SwapLimit` as performance data
#### Service Definition
`localhost.cfg`
Add the following service definition to your server config (`localhost.cfg`):
```
define service {
use local-service
host_name localhost
service_description Docker info status checker
check_command check_docker
service_description <command_description>
check_command <command_name>
}
```
#### Command Definition with Arguments
`commands.cfg`
```
Add the following command definition to your commands config (`commands.cfg`):
```
define command{
command_name check_docker
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p info -e Containers -q IPv4Forwarding,1 -w Debug,2:2 -c Images,1:1 -m Containers,0:250,0:500,0,1000 Images NEventsListener NFd NGoroutines SwapLimit
command_name <command_name>
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H <host>:<port> -p <path> [-e|-q|-w|-c <rules>] [-m <metrics>]
}
```
#### Sample Output
```
OK: Status OK.|'Containers'=1;0;1000 'Images'=11;0;0 'NEventsListener'=3;0;0 'NFd'=10;0;0 'NGoroutines'=14;0;0 'SwapLimit'=1;0;0
```
### Docker Container Monitor Example Plugin
`check_http_json.py` is generic enough to read and evaluate rules on any HTTP endpoint that returns JSON. In this example we'll get the status of a specific container using its ID, which can be found by using the list containers endpoint (`curl http://127.0.0.1:4243/containers/json?all=1`).
##### Connection information
* Host = 127.0.0.1:4243
* Path = /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json
##### Rules for "aliveness"
* Verify that the key `ID` exists and is equal to the value `2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615`
* Verify that the key `State.Running` has a value of `True`
#### Service Definition
`localhost.cfg`
```
define service {
use local-service
host_name localhost
service_description Docker container liveness check
check_command check_my_container
}
```
#### Command Definition with Arguments
`commands.cfg`
```
define command{
command_name check_my_container
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json -q ID,2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615 State.Running,True
}
```
#### Sample Output
```
WARNING: Status check failed, reason: Value True for key State.Running did not match.
```
The plugin threw a warning because the Container ID I used on my system has the following State object:
```
u'State': {...
u'Running': False,
...
```
If I change the `-q` parameter to `State.Running,False`, the output becomes:
```
OK: Status OK.
```
### Dropwizard / Fieldnames Containing '.' Example
Simply choose a separator to deal with data such as this:
```
{ "gauges": { "jvm.buffers.direct.capacity": {"value": 215415}}}
```
In this example I've chosen `_` to separate `gauges` from `jvm` and `capacity` from `value`. The CLI invocation then becomes:
```
./check_http_json.py -H localhost:8081 -p metrics --key_exists gauges_jvm.buffers.direct.capacity_value -f _
```
See [CLI Usage](#cli-usage) for more information about the options.
## License

115
docs/DOCKER.md Normal file

@ -0,0 +1,115 @@
### Docker Info Example Plugin
#### Description
Let's say we want to use `check_http_json.py` to read from Docker's `/info` HTTP API endpoint with the following parameters:
##### Connection information
* Host = 127.0.0.1:4243
* Path = /info
##### Rules for "aliveness"
* Verify that the key `Containers` exists in the outputted JSON
* Verify that the key `IPv4Forwarding` has a value of `1`
* Verify that the key `Debug` has a value less than or equal to `2`
* Verify that the key `Images` has a value greater than or equal to `1`
* If any of these criteria are not met, report a WARNING to Nagios
##### Gather Metrics
* Report value of the key `Containers` with a MinValue of 0 and a MaxValue of 1000 as performance data
* Report value of the key `Images` as performance data
* Report value of the key `NEventsListener` as performance data
* Report value of the key `NFd` as performance data
* Report value of the key `NGoroutines` as performance data
* Report value of the key `SwapLimit` as performance data
#### Service Definition
`localhost.cfg`
```
define service {
use local-service
host_name localhost
service_description Docker info status checker
check_command check_docker
}
```
#### Command Definition with Arguments
`commands.cfg`
```
define command{
command_name check_docker
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p info -e Containers -q IPv4Forwarding,1 -w Debug,2:2 -c Images,1:1 -m Containers,0:250,0:500,0,1000 Images NEventsListener NFd NGoroutines SwapLimit
}
```
#### Sample Output
```
OK: Status OK.|'Containers'=1;0;1000 'Images'=11;0;0 'NEventsListener'=3;0;0 'NFd'=10;0;0 'NGoroutines'=14;0;0 'SwapLimit'=1;0;0
```
### Docker Container Monitor Example Plugin
`check_http_json.py` is generic enough to read and evaluate rules on any HTTP endpoint that returns JSON. In this example we'll get the status of a specific container using its ID, which can be found by using the list containers endpoint (`curl http://127.0.0.1:4243/containers/json?all=1`).
##### Connection information
* Host = 127.0.0.1:4243
* Path = /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json
##### Rules for "aliveness"
* Verify that the key `ID` exists and is equal to the value `2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615`
* Verify that the key `State.Running` has a value of `True`
#### Service Definition
`localhost.cfg`
```
define service {
use local-service
host_name localhost
service_description Docker container liveness check
check_command check_my_container
}
```
#### Command Definition with Arguments
`commands.cfg`
```
define command{
command_name check_my_container
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json -q ID,2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615 State.Running,True
}
```
#### Sample Output
```
WARNING: Status check failed, reason: Value True for key State.Running did not match.
```
The plugin threw a warning because the Container ID I used on my system has the following State object:
```
u'State': {...
u'Running': False,
...
```
If I change the `-q` parameter to `State.Running,False`, the output becomes:
```
OK: Status OK.
```
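The `True`/`False` spelling in this example suggests that the expected value is compared against Python's string form of the decoded JSON value. The sketch below illustrates that assumption only; `equality_rule_matches` is a hypothetical helper, not the plugin's code.

```python
import json


def equality_rule_matches(document, key, expected):
    """Compare the value at a dotted key with an expected string."""
    value = document
    for part in key.split("."):
        value = value[part]
    # Assumption: JSON false decodes to Python False, whose string form is 'False'.
    return str(value) == expected


container = json.loads('{"State": {"Running": false}}')
print(equality_rule_matches(container, "State.Running", "True"))
print(equality_rule_matches(container, "State.Running", "False"))
```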

227
docs/RIAK.md Normal file

@ -0,0 +1,227 @@
# Riak Stats Example
## Description
For this example we're going to use `check_http_json.py` as a pure CLI tool to read Riak's `/stats` endpoint.
## Connection information
* Host = 127.0.0.1:8098
* Path = /stats
## JSON Stats Data
* Full Riak HTTP Stats information can be found here: [http://docs.basho.com/riak/latest/dev/references/http/status/](http://docs.basho.com/riak/latest/dev/references/http/status/)
* Information related to specific interesting stats can be found here: [http://docs.basho.com/riak/latest/ops/running/stats-and-monitoring/](http://docs.basho.com/riak/latest/ops/running/stats-and-monitoring/)
## Connectivity Check
* `ring_members`: We can use an existence check to monitor the number of ring members
* `connected_nodes`: Similarly, we can check the number of nodes that are in communication with this node, but this list will be empty in a single-node cluster
#### Sample Command
For a single node dev "cluster", you might have a `ring_members` value like this:
```
"ring_members": [
"riak@127.0.0.1"
],
```
To validate that we have a single node, we can use this check:
```
$ ./check_http_json.py -H localhost -P 8098 -p stats -E "ring_members(0)"
OK: Status OK.
```
If we were expecting at least 2 nodes in the cluster, we would use this check:
```
$ ./check_http_json.py -H localhost -P 8098 -p stats -E "ring_members(1)"
CRITICAL: Status CRITICAL. Key ring_members(1) did not exist.
```
This fails because the cluster has only a single ring member. To get a warning instead of a critical for this check, use the lowercase `-e` flag:
```
$ ./check_http_json.py -H localhost -P 8098 -p stats -e "ring_members(1)"
WARNING: Status WARNING. Key ring_members(1) did not exist.
```
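The logic behind these existence checks can be sketched in a few lines of Python. `check_ring_member` is a hypothetical helper written for illustration; the numeric codes follow the usual Nagios convention (0 = OK, 1 = WARNING, 2 = CRITICAL), and the messages are copied from the sample output above.

```python
def check_ring_member(stats, index, critical=True):
    """Mimic an existence check on ring_members(index)."""
    members = stats.get("ring_members", [])
    if index < len(members):
        return 0, "OK: Status OK."
    # A missing element is CRITICAL (like -E) or WARNING (like -e).
    level, code = ("CRITICAL", 2) if critical else ("WARNING", 1)
    return code, f"{level}: Status {level}. Key ring_members({index}) did not exist."


stats = {"ring_members": ["riak@127.0.0.1"]}
print(check_ring_member(stats, 0))
print(check_ring_member(stats, 1))
print(check_ring_member(stats, 1, critical=False))
```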
## Gather Metrics
The thresholds for acceptable values for these metrics will vary from system to system. The following are the stats we'll be checking:
### Throughput Metrics:
* `node_gets`
* `node_puts`
* `vnode_counter_update`
* `vnode_set_update`
* `vnode_map_update`
* `search_query_throughput_one`
* `search_index_throughtput_one`
* `consistent_gets`
* `consistent_puts`
* `vnode_index_reads`
#### Sample Command
```
./check_http_json.py -H localhost -P 8098 -p stats -m \
"node_gets" \
"node_puts" \
"vnode_counter_update" \
"vnode_set_update" \
"vnode_map_update" \
"search_query_throughput_one" \
"search_index_throughtput_one" \
"consistent_gets" \
"consistent_puts" \
"vnode_index_reads"
```
#### Sample Output
```
OK: Status OK.|'node_gets'=0 'node_puts'=0 'vnode_counter_update'=0 'vnode_set_update'=0 'vnode_map_update'=0 'search_query_throughput_one'=0 'consistent_gets'=0 'consistent_puts'=0 'vnode_index_reads'=0
```
### Latency Metrics:
* `node_get_fsm_time_mean,_median,_95,_99,_100`
* `node_put_fsm_time_mean,_median,_95,_99,_100`
* `object_counter_merge_time_mean,_median,_95,_99,_100`
* `object_set_merge_time_mean,_median,_95,_99,_100`
* `object_map_merge_time_mean,_median,_95,_99,_100`
* `search_query_latency_median,_min,_95,_99,_999`
* `search_index_latency_median,_min,_95,_99,_999`
* `consistent_get_time_mean,_median,_95,_99,_100`
* `consistent_put_time_mean,_median,_95,_99,_100`
#### Sample Command
```
./check_http_json.py -H localhost -P 8098 -p stats -m \
"node_get_fsm_time_mean,,0:100,0:1000" \
"node_get_fsm_time_median,,0:100,0:1000" \
"node_get_fsm_time_95,,0:100,0:1000" \
"node_get_fsm_time_99,,0:100,0:1000" \
"node_get_fsm_time_100,,0:100,0:1000" \
"node_put_fsm_time_mean,,0:100,0:1000" \
"node_put_fsm_time_median,,0:100,0:1000" \
"node_put_fsm_time_95,,0:100,0:1000" \
"node_put_fsm_time_99,,0:100,0:1000" \
"node_put_fsm_time_100,,0:100,0:1000" \
"object_counter_merge_time_mean,,0:100,0:1000" \
"object_counter_merge_time_median,,0:100,0:1000" \
"object_counter_merge_time_95,,0:100,0:1000" \
"object_counter_merge_time_99,,0:100,0:1000" \
"object_counter_merge_time_100,,0:100,0:1000" \
"object_set_merge_time_mean,,0:100,0:1000" \
"object_set_merge_time_median,,0:100,0:1000" \
"object_set_merge_time_95,,0:100,0:1000" \
"object_set_merge_time_99,,0:100,0:1000" \
"object_set_merge_time_100,,0:100,0:1000" \
"object_map_merge_time_mean,,0:100,0:1000" \
"object_map_merge_time_median,,0:100,0:1000" \
"object_map_merge_time_95,,0:100,0:1000" \
"object_map_merge_time_99,,0:100,0:1000" \
"object_map_merge_time_100,,0:100,0:1000" \
"consistent_get_time_mean,,0:100,0:1000" \
"consistent_get_time_median,,0:100,0:1000" \
"consistent_get_time_95,,0:100,0:1000" \
"consistent_get_time_99,,0:100,0:1000" \
"consistent_get_time_100,,0:100,0:1000" \
"consistent_put_time_mean,,0:100,0:1000" \
"consistent_put_time_median,,0:100,0:1000" \
"consistent_put_time_95,,0:100,0:1000" \
"consistent_put_time_99,,0:100,0:1000" \
"consistent_put_time_100,,0:100,0:1000" \
"search_query_latency_median,,0:100,0:1000" \
"search_query_latency_min,,0:100,0:1000" \
"search_query_latency_95,,0:100,0:1000" \
"search_query_latency_99,,0:100,0:1000" \
"search_query_latency_999,,0:100,0:1000" \
"search_index_latency_median,,0:100,0:1000" \
"search_index_latency_min,,0:100,0:1000" \
"search_index_latency_95,,0:100,0:1000" \
"search_index_latency_99,,0:100,0:1000" \
"search_index_latency_999,,0:100,0:1000"
```
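Typing out every metric by hand is error-prone, so the argument list could also be generated. The following sketch builds part of the command above; `expand_family` is a hypothetical helper, and the thresholds are the same placeholder values (`0:100`, `0:1000`) used in the sample command.

```python
def expand_family(prefix, suffixes, warn, crit):
    """Build 'key,UnitOfMeasure,WarnRange,CriticalRange' strings with an empty unit."""
    return [f"{prefix}_{suffix},,{warn},{crit}" for suffix in suffixes]


FSM_SUFFIXES = ["mean", "median", "95", "99", "100"]
SEARCH_SUFFIXES = ["median", "min", "95", "99", "999"]

metrics = []
for prefix in ("node_get_fsm_time", "node_put_fsm_time"):
    metrics += expand_family(prefix, FSM_SUFFIXES, "0:100", "0:1000")
metrics += expand_family("search_query_latency", SEARCH_SUFFIXES, "0:100", "0:1000")

# The resulting list can be passed to subprocess.run() or joined for a shell.
command = ["./check_http_json.py", "-H", "localhost", "-P", "8098",
           "-p", "stats", "-m", *metrics]
print(" ".join(command))
```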
#### Sample Output
```
OK: Status OK.|'node_get_fsm_time_mean'=0;0:100;0:1000 'node_get_fsm_time_median'=0;0:100;0:1000 'node_get_fsm_time_95'=0;0:100;0:1000 'node_get_fsm_time_99'=0;0:100;0:1000 'node_get_fsm_time_100'=0;0:100;0:1000 'node_put_fsm_time_mean'=0;0:100;0:1000 'node_put_fsm_time_median'=0;0:100;0:1000 'node_put_fsm_time_95'=0;0:100;0:1000 'node_put_fsm_time_99'=0;0:100;0:1000 'node_put_fsm_time_100'=0;0:100;0:1000 'object_counter_merge_time_mean'=0;0:100;0:1000 'object_counter_merge_time_median'=0;0:100;0:1000 'object_counter_merge_time_95'=0;0:100;0:1000 'object_counter_merge_time_99'=0;0:100;0:1000 'object_counter_merge_time_100'=0;0:100;0:1000 'object_set_merge_time_mean'=0;0:100;0:1000 'object_set_merge_time_median'=0;0:100;0:1000 'object_set_merge_time_95'=0;0:100;0:1000 'object_set_merge_time_99'=0;0:100;0:1000 'object_set_merge_time_100'=0;0:100;0:1000 'object_map_merge_time_mean'=0;0:100;0:1000 'object_map_merge_time_median'=0;0:100;0:1000 'object_map_merge_time_95'=0;0:100;0:1000 'object_map_merge_time_99'=0;0:100;0:1000 'object_map_merge_time_100'=0;0:100;0:1000 'consistent_get_time_mean'=0;0:100;0:1000 'consistent_get_time_median'=0;0:100;0:1000 'consistent_get_time_95'=0;0:100;0:1000 'consistent_get_time_99'=0;0:100;0:1000 'consistent_get_time_100'=0;0:100;0:1000 'consistent_put_time_mean'=0;0:100;0:1000 'consistent_put_time_median'=0;0:100;0:1000 'consistent_put_time_95'=0;0:100;0:1000 'consistent_put_time_99'=0;0:100;0:1000 'consistent_put_time_100'=0;0:100;0:1000 'search_query_latency_median'=0;0:100;0:1000 'search_query_latency_min'=0;0:100;0:1000 'search_query_latency_95'=0;0:100;0:1000 'search_query_latency_99'=0;0:100;0:1000 'search_query_latency_999'=0;0:100;0:1000 'search_index_latency_median'=0;0:100;0:1000 'search_index_latency_min'=0;0:100;0:1000 'search_index_latency_95'=0;0:100;0:1000 'search_index_latency_99'=0;0:100;0:1000 'search_index_latency_999'=0;0:100;0:1000
```
### Erlang Resource Usage Metrics:
* `sys_process_count`
* `memory_processes`
* `memory_processes_used`
#### Sample Command
```
./check_http_json.py -H localhost -P 8098 -p stats -m \
"sys_process_count,,0:5000,0:10000" \
"memory_processes,,0:50000000,0:100000000" \
"memory_processes_used,,0:50000000,0:100000000"
```
#### Sample Output
```
OK: Status OK.|'sys_process_count'=1637;0:5000;0:10000 'memory_processes'=46481112;0:50000000;0:100000000 'memory_processes_used'=46476880;0:50000000;0:100000000
```
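The part of the output after the `|` is Nagios performance data. If you want to post-process it outside of Nagios, a parser could look roughly like the sketch below. `parse_plugin_output` is a hypothetical name, and the sketch assumes values carry no unit suffix, as in the sample output above.

```python
import shlex


def parse_plugin_output(line):
    """Split plugin output into its status text and a dict of metrics."""
    status_text, _, perfdata = line.partition("|")
    metrics = {}
    # shlex.split removes the single quotes around each label.
    for token in shlex.split(perfdata):
        label, _, rest = token.partition("=")
        fields = rest.split(";")
        metrics[label] = {
            "value": float(fields[0]),  # assumes no unit of measure suffix
            "warn": fields[1] if len(fields) > 1 else "",
            "crit": fields[2] if len(fields) > 2 else "",
        }
    return status_text.strip(), metrics


sample = ("OK: Status OK.|'sys_process_count'=1637;0:5000;0:10000 "
          "'memory_processes'=46481112;0:50000000;0:100000000")
status, parsed = parse_plugin_output(sample)
print(status)
print(parsed["sys_process_count"])
```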
### General Riak Load / Health Metrics:
* `node_get_fsm_siblings_mean,_median,_95,_99,_100`
* `node_get_fsm_objsize_mean,_median,_95,_99,_100`
* `riak_search_vnodeq_mean,_median,_95,_99,_100`
* `search_index_fail_one`
* `pbc_active`
* `pbc_connects`
* `read_repairs`
* `list_fsm_active`
* `node_get_fsm_rejected`
* `node_put_fsm_rejected`
#### Sample Command
```
./check_http_json.py -H localhost -P 8098 -p stats -m \
"node_get_fsm_siblings_mean,,0:100,0:1000" \
"node_get_fsm_siblings_median,,0:100,0:1000" \
"node_get_fsm_siblings_95,,0:100,0:1000" \
"node_get_fsm_siblings_99,,0:100,0:1000" \
"node_get_fsm_siblings_100,,0:100,0:1000" \
"node_get_fsm_objsize_mean,,0:100,0:1000" \
"node_get_fsm_objsize_median,,0:100,0:1000" \
"node_get_fsm_objsize_95,,0:100,0:1000" \
"node_get_fsm_objsize_99,,0:100,0:1000" \
"node_get_fsm_objsize_100,,0:100,0:1000" \
"riak_search_vnodeq_mean,,0:100,0:1000" \
"riak_search_vnodeq_median,,0:100,0:1000" \
"riak_search_vnodeq_95,,0:100,0:1000" \
"riak_search_vnodeq_99,,0:100,0:1000" \
"riak_search_vnodeq_100,,0:100,0:1000" \
"search_index_fail_one,,0:100,0:1000" \
"pbc_active,,0:100,0:1000" \
"pbc_connects,,0:100,0:1000" \
"read_repairs,,0:100,0:1000" \
"list_fsm_active,,0:100,0:1000" \
"node_get_fsm_rejected,,0:100,0:1000" \
"node_put_fsm_rejected,,0:100,0:1000"
```
#### Sample Output
```
OK: Status OK.|'node_get_fsm_siblings_mean'=0;0:100;0:1000 'node_get_fsm_siblings_median'=0;0:100;0:1000 'node_get_fsm_siblings_95'=0;0:100;0:1000 'node_get_fsm_siblings_99'=0;0:100;0:1000 'node_get_fsm_siblings_100'=0;0:100;0:1000 'node_get_fsm_objsize_mean'=0;0:100;0:1000 'node_get_fsm_objsize_median'=0;0:100;0:1000 'node_get_fsm_objsize_95'=0;0:100;0:1000 'node_get_fsm_objsize_99'=0;0:100;0:1000 'node_get_fsm_objsize_100'=0;0:100;0:1000 'search_index_fail_one'=0;0:100;0:1000 'pbc_active'=0;0:100;0:1000 'pbc_connects'=0;0:100;0:1000 'read_repairs'=0;0:100;0:1000 'list_fsm_active'=0;0:100;0:1000 'node_get_fsm_rejected'=0;0:100;0:1000 'node_put_fsm_rejected'=0;0:100;0:1000
```