mirror of
https://github.com/drewkerrigan/nagios-http-json.git
synced 2026-02-12 01:51:01 +01:00
Compare commits
41 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9c0c59d6c1 | | |
| | e2fce71d5a | | |
| | 47547951cf | | |
| | 7858382bbe | | |
| | 1173420803 | | |
| | bcc36a6e95 | | |
| | d98d0396b2 | | |
| | 8437c464e5 | | |
| | df2bbdbf51 | | |
| | 823fc275c9 | | |
| | 18b0898e72 | | |
| | 95318954bf | | |
| | 8e469e3d98 | | |
| | 29f8d892ee | | |
| | cbdb884dc7 | | |
| | 3a108aef5e | | |
| | 81522fa9ab | | |
| | 27eaaf0842 | | |
| | 9dd6323b85 | | |
| | 67136a4a2b | | |
| | d164a1250c | | |
| | 89f42c15a0 | | |
| | 1e707a4b6a | | |
| | 9656265439 | | |
| | e463369671 | | |
| | 357c2240ba | | |
| | 42d1e08037 | | |
| | 4950225393 | | |
| | 9be6a709a2 | | |
| | 06fab10fe2 | | |
| | 7bdc802c2d | | |
| | ed7bc7175b | | |
| | 4180ec2066 | | |
| | a0d0773d1a | | |
| | fbebf05f76 | | |
| | 6f9048fc75 | | |
| | 5bb09cd362 | | |
| | 568fa6e4d0 | | |
| | f63ac180b6 | | |
| | 070047cf55 | | |
| | 8adcf2ff07 | | |
413
README.md
@@ -2,14 +2,198 @@
|
||||
|
||||
This is a generic plugin for Nagios which checks json values from a given HTTP endpoint against argument specified rules and determines the status and performance data for that service.
|
||||
|
||||
### Installation
|
||||
## Links
|
||||
|
||||
#### Requirements
|
||||
* [CLI Usage](#cli-usage)
|
||||
* [Examples](#examples)
|
||||
* [Riak Stats](docs/RIAK.md)
|
||||
* [Docker](docs/DOCKER.md)
|
||||
* [Nagios Installation](#nagios-installation)
|
||||
|
||||
* Nagios
|
||||
* Python
|
||||
## CLI Usage
|
||||
|
||||
### Nagios Configuration
|
||||
Executing `./check_http_json.py -h` will yield the following details:
|
||||
|
||||
```
|
||||
usage: check_http_json.py [-h] [-d] [-s] -H HOST [-k] [-V] [--cacert CACERT]
|
||||
[--cert CERT] [--key KEY] [-P PORT] [-p PATH]
|
||||
[-t TIMEOUT] [-B AUTH] [-D DATA] [-A HEADERS]
|
||||
[-f SEPARATOR]
|
||||
[-w [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]]]
|
||||
[-c [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]]]
|
||||
[-e [KEY_LIST [KEY_LIST ...]]]
|
||||
[-E [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]]]
|
||||
[-q [KEY_VALUE_LIST [KEY_VALUE_LIST ...]]]
|
||||
[-Q [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]]]
|
||||
[-u [KEY_VALUE_LIST_UNKNOWN [KEY_VALUE_LIST_UNKNOWN ...]]]
|
||||
[-y [KEY_VALUE_LIST_NOT [KEY_VALUE_LIST_NOT ...]]]
|
||||
[-Y [KEY_VALUE_LIST_NOT_CRITICAL [KEY_VALUE_LIST_NOT_CRITICAL ...]]]
|
||||
[-m [METRIC_LIST [METRIC_LIST ...]]]
|
||||
|
||||
Check HTTP JSON Nagios Plugin
|
||||
|
||||
Generic Nagios plugin which checks json values from a given endpoint against
|
||||
argument specified rules and determines the status and performance data for
|
||||
that service.
|
||||
|
||||
Version: 1.4.0 (2019-05-09)
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-d, --debug debug mode
|
||||
-s, --ssl use TLS to connect to remote host
|
||||
-H HOST, --host HOST remote host to query
|
||||
-k, --insecure do not check server SSL certificate
|
||||
-V, --version print version of this plugin
|
||||
--cacert CACERT SSL CA certificate
|
||||
--cert CERT SSL client certificate
|
||||
--key KEY SSL client key ( if not bundled into the cert )
|
||||
-P PORT, --port PORT TCP port
|
||||
-p PATH, --path PATH Path
|
||||
-t TIMEOUT, --timeout TIMEOUT
|
||||
Connection timeout (seconds)
|
||||
-B AUTH, --basic-auth AUTH
|
||||
Basic auth string "username:password"
|
||||
-D DATA, --data DATA The http payload to send as a POST
|
||||
-A HEADERS, --headers HEADERS
|
||||
The http headers in JSON format.
|
||||
-f SEPARATOR, --field_separator SEPARATOR
|
||||
JSON Field separator, defaults to "."; Select element
|
||||
in an array with "(" ")"
|
||||
-w [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]], --warning [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]]
|
||||
Warning threshold for these values
|
||||
(key1[>alias],WarnRange key2[>alias],WarnRange).
|
||||
WarnRange is in the format [@]start:end, more
|
||||
information at nagios-plugins.org/doc/guidelines.html.
|
||||
-c [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]], --critical [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]]
|
||||
Critical threshold for these values
|
||||
(key1[>alias],CriticalRange
|
||||
key2[>alias],CriticalRange. CriticalRange is in the
|
||||
format [@]start:end, more information at nagios-
|
||||
plugins.org/doc/guidelines.html.
|
||||
-e [KEY_LIST [KEY_LIST ...]], --key_exists [KEY_LIST [KEY_LIST ...]]
|
||||
Checks existence of these keys to determine status.
|
||||
Return warning if key is not present.
|
||||
-E [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]], --key_exists_critical [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]]
|
||||
Same as -e but return critical if key is not present.
|
||||
-q [KEY_VALUE_LIST [KEY_VALUE_LIST ...]], --key_equals [KEY_VALUE_LIST [KEY_VALUE_LIST ...]]
|
||||
Checks equality of these keys and values
|
||||
(key[>alias],value key2,value2) to determine status.
|
||||
Multiple key values can be delimited with colon
|
||||
(key,value1:value2). Return warning if equality check
|
||||
fails
|
||||
-Q [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]], --key_equals_critical [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]]
|
||||
Same as -q but return critical if equality check
|
||||
fails.
|
||||
-u [KEY_VALUE_LIST_UNKNOWN [KEY_VALUE_LIST_UNKNOWN ...]], --key_equals_unknown [KEY_VALUE_LIST_UNKNOWN [KEY_VALUE_LIST_UNKNOWN ...]]
|
||||
Same as -q but return unknown if equality check fails.
|
||||
-y [KEY_VALUE_LIST_NOT [KEY_VALUE_LIST_NOT ...]], --key_not_equals [KEY_VALUE_LIST_NOT [KEY_VALUE_LIST_NOT ...]]
|
||||
Checks equality of these keys and values
|
||||
(key[>alias],value key2,value2) to determine status.
|
||||
Multiple key values can be delimited with colon
|
||||
(key,value1:value2). Return warning if equality check
|
||||
succeeds
|
||||
-Y [KEY_VALUE_LIST_NOT_CRITICAL [KEY_VALUE_LIST_NOT_CRITICAL ...]], --key_not_equals_critical [KEY_VALUE_LIST_NOT_CRITICAL [KEY_VALUE_LIST_NOT_CRITICAL ...]]
|
||||
Same as -q but return critical if equality check
|
||||
succeeds.
|
||||
-m [METRIC_LIST [METRIC_LIST ...]], --key_metric [METRIC_LIST [METRIC_LIST ...]]
|
||||
Gathers the values of these keys (key[>alias],
|
||||
UnitOfMeasure,WarnRange,CriticalRange,Min,Max) for
|
||||
Nagios performance data. More information about Range
|
||||
format and units of measure for nagios can be found at
|
||||
nagios-plugins.org/doc/guidelines.html Additional
|
||||
formats for this parameter are: (key[>alias]),
|
||||
(key[>alias],UnitOfMeasure),
|
||||
(key[>alias],UnitOfMeasure,WarnRange, CriticalRange).
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Key Naming
|
||||
|
||||
**Data for key** `value`:
|
||||
|
||||
{ "value": 1000 }
|
||||
|
||||
**Data for key** `capacity.value`:
|
||||
|
||||
    {
        "capacity": {
            "value": 1000
        }
    }
|
||||
|
||||
**Data for key** `(0).capacity.value`:
|
||||
|
||||
    [
        {
            "capacity": {
                "value": 1000
            }
        }
    ]
|
||||
|
||||
**Data for separator** `-f _` **and key** `(0)_gauges_jvm.buffers.direct.capacity_value`:
|
||||
|
||||
    [
        {
            "gauges": {
                "jvm.buffers.direct.capacity": {
                    "value": 1000
                }
            }
        }
    ]
|
||||
|
||||
**Data for keys** `ring_members(0)`, `ring_members(1)`, `ring_members(2)`:
|
||||
|
||||
    {
        "ring_members": [
            "riak1@127.0.0.1",
            "riak2@127.0.0.1",
            "riak3@127.0.0.1"
        ]
    }
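
The key naming rules shown in the examples above can be sketched in a few lines of Python. This is not the plugin's own parser: `resolve_key` is a hypothetical helper, and it assumes that a key is split on the field separator and that `(N)` selects element `N` of an array.

```python
import re

# Hypothetical helper for illustration; not taken from check_http_json.py.
_INDEX = re.compile(r"\((\d+)\)")


def resolve_key(data, key, separator="."):
    """Walk `data` along `key`, e.g. "(0).capacity.value" or "ring_members(1)"."""
    current = data
    for part in key.split(separator):
        # "ring_members(1)" gives name "ring_members" and indices [1];
        # "(0)" gives an empty name and indices [0].
        name = _INDEX.sub("", part)
        indices = [int(i) for i in _INDEX.findall(part)]
        if name:
            current = current[name]
        for index in indices:
            current = current[index]
    return current


nested = [{"gauges": {"jvm.buffers.direct.capacity": {"value": 1000}}}]
print(resolve_key(nested, "(0)_gauges_jvm.buffers.direct.capacity_value", separator="_"))
print(resolve_key({"ring_members": ["riak1@127.0.0.1", "riak2@127.0.0.1"]}, "ring_members(1)"))
```

Choosing `_` as the separator is what keeps the dotted field name `jvm.buffers.direct.capacity` in one piece.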
|
||||
|
||||
### Thresholds and Ranges
|
||||
|
||||
**Data**:
|
||||
|
||||
{ "metric": 1000 }
|
||||
|
||||
#### Relevant Commands
|
||||
|
||||
* **Warning:** `./check_http_json.py -H <host>:<port> -p <path> -w "metric,RANGE"`
|
||||
* **Critical:** `./check_http_json.py -H <host>:<port> -p <path> -c "metric,RANGE"`
|
||||
* **Metrics with Warning:** `./check_http_json.py -H <host>:<port> -p <path> -m "metric,,RANGE"`
|
||||
* **Metrics with Critical:**
|
||||
|
||||
    ./check_http_json.py -H <host>:<port> -p <path> -m "metric,,,RANGE"
    ./check_http_json.py -H <host>:<port> -p <path> -m "metric,,,,MIN,MAX"
|
||||
|
||||
#### Range Definitions
|
||||
|
||||
* **Format:** [@]START:END
|
||||
* **Generates a Warning or Critical if...**
|
||||
* **Value is less than 0 or greater than 1000:** `1000` or `0:1000`
|
||||
* **Value is between 0 and 1000, inclusive:** `@1000` or `@0:1000`
|
||||
* **Value is less than 1000:** `1000:`
|
||||
* **Value is greater than 1000:** `~:1000`
|
||||
* **Value is greater than or equal to 1000:** `@1000:`
|
||||
|
||||
More info about Nagios Range format and Units of Measure can be found at [https://nagios-plugins.org/doc/guidelines.html](https://nagios-plugins.org/doc/guidelines.html).
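
A minimal sketch of how the range definitions above could be evaluated, following the Nagios plugin guidelines. `parse_range` and `alerts` are hypothetical names chosen for this illustration, not functions from `check_http_json.py`.

```python
import math


def parse_range(spec):
    """Parse a Nagios range "[@]start:end" into (invert, start, end)."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # A bare number N is shorthand for 0:N.
        start_text, end_text = "0", spec
    if start_text == "~":
        start = -math.inf  # "~" means negative infinity
    else:
        start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else math.inf  # empty end means infinity
    return invert, start, end


def alerts(value, spec):
    """Return True if `value` should raise a warning or critical for `spec`."""
    invert, start, end = parse_range(spec)
    inside = start <= value <= end
    # Without "@" the alert fires outside the range; with "@" it fires inside.
    return inside if invert else not inside


for spec in ("1000", "@0:1000", "1000:", "~:1000", "@1000:"):
    print(spec, alerts(1500, spec))
```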
|
||||
|
||||
#### Using Headers
|
||||
|
||||
* `./check_http_json.py -H <host>:<port> -p <path> -A '{"content-type": "application/json"}' -w "metric,RANGE"`
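
As a rough illustration of what the `-A` option implies, the sketch below decodes a JSON header string and attaches it to a request object. It assumes Python 3 and the standard `urllib.request` module; `build_request` is a hypothetical helper and no request is actually sent.

```python
import json
import urllib.request


def build_request(host, path, headers_json, ssl=False):
    """Build (but do not send) a request carrying headers given as a JSON string."""
    headers = json.loads(headers_json)  # e.g. '{"content-type": "application/json"}'
    scheme = "https" if ssl else "http"
    url = "%s://%s/%s" % (scheme, host, path.lstrip("/"))
    return urllib.request.Request(url, headers=headers)


request = build_request("localhost:8080", "metrics", '{"content-type": "application/json"}')
print(request.full_url)
print(request.header_items())
```

Because the headers travel as a single JSON argument, single quotes around the whole value keep the shell from interpreting the inner double quotes.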
|
||||
|
||||
## Nagios Installation
|
||||
|
||||
### Requirements
|
||||
|
||||
* Python 2.7
|
||||
|
||||
### Configuration
|
||||
|
||||
Assuming a standard installation of Nagios, the plugin can be executed from the machine that Nagios is running on.
|
||||
|
||||
@@ -44,225 +228,6 @@ define command{
|
||||
|
||||
More info about options in Usage.
|
||||
|
||||
### CLI Usage
|
||||
|
||||
Executing `./check_http_json.py -h` will yield the following details:
|
||||
|
||||
```
|
||||
usage: check_http_json.py [-h] [-d] [-s] -H HOST [-P PORT] [-p PATH]
|
||||
[-t TIMEOUT] [-B AUTH] [-D DATA] [-f SEPARATOR]
|
||||
[-w [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]]]
|
||||
[-c [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]]]
|
||||
[-e [KEY_LIST [KEY_LIST ...]]]
|
||||
[-E [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]]]
|
||||
[-q [KEY_VALUE_LIST [KEY_VALUE_LIST ...]]]
|
||||
[-Q [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]]]
|
||||
[-m [METRIC_LIST [METRIC_LIST ...]]]
|
||||
|
||||
Nagios plugin which checks json values from a given endpoint against argument
|
||||
specified rules and determines the status and performance data for that
|
||||
service
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-d, --debug Debug mode.
|
||||
-s, --ssl HTTPS mode.
|
||||
-H HOST, --host HOST Host.
|
||||
-P PORT, --port PORT TCP port
|
||||
-p PATH, --path PATH Path.
|
||||
-t TIMEOUT, --timeout TIMEOUT
|
||||
Connection timeout (seconds)
|
||||
-B AUTH, --basic-auth AUTH
|
||||
Basic auth string "username:password"
|
||||
-D DATA, --data DATA The http payload to send as a POST
|
||||
-f SEPARATOR, --field_separator SEPARATOR
|
||||
Json Field separator, defaults to "." ; Select element
|
||||
in an array with "(" ")"
|
||||
-w [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]], --warning [KEY_THRESHOLD_WARNING [KEY_THRESHOLD_WARNING ...]]
|
||||
Warning threshold for these values
|
||||
(key1[>alias],WarnRange key2[>alias],WarnRange).
|
||||
WarnRange is in the format [@]start:end, more
|
||||
information at nagios-plugins.org/doc/guidelines.html.
|
||||
-c [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]], --critical [KEY_THRESHOLD_CRITICAL [KEY_THRESHOLD_CRITICAL ...]]
|
||||
Critical threshold for these values
|
||||
(key1[>alias],CriticalRange
|
||||
key2[>alias],CriticalRange. CriticalRange is in the
|
||||
format [@]start:end, more information at nagios-
|
||||
plugins.org/doc/guidelines.html.
|
||||
-e [KEY_LIST [KEY_LIST ...]], --key_exists [KEY_LIST [KEY_LIST ...]]
|
||||
Checks existence of these keys to determine status.
|
||||
Return warning if key is not present.
|
||||
-E [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]], --key_exists_critical [KEY_LIST_CRITICAL [KEY_LIST_CRITICAL ...]]
|
||||
Same as -e but return critical if key is not present.
|
||||
-q [KEY_VALUE_LIST [KEY_VALUE_LIST ...]], --key_equals [KEY_VALUE_LIST [KEY_VALUE_LIST ...]]
|
||||
Checks equality of these keys and values
|
||||
(key[>alias],value key2,value2) to determine status.
|
||||
Multiple key values can be delimited with colon
|
||||
(key,value1:value2). Return warning if equality check
|
||||
fails
|
||||
-Q [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]], --key_equals_critical [KEY_VALUE_LIST_CRITICAL [KEY_VALUE_LIST_CRITICAL ...]]
|
||||
Same as -q but return critical if equality check
|
||||
fails.
|
||||
-m [METRIC_LIST [METRIC_LIST ...]], --key_metric [METRIC_LIST [METRIC_LIST ...]]
|
||||
Gathers the values of these keys (key[>alias],UnitOfMe
|
||||
asure,WarnRange,CriticalRange,Min,Max) for Nagios
|
||||
performance data. More information about Range format
|
||||
and units of measure for nagios can be found at
|
||||
nagios-plugins.org/doc/guidelines.html Additional
|
||||
formats for this parameter are: (key[>alias]),
|
||||
(key[>alias],UnitOfMeasure),
|
||||
(key[>alias],UnitOfMeasure,WarnRange,CriticalRange).
|
||||
```
|
||||
|
||||
Access a specific JSON field by following this syntax: `alpha.beta.gamma(3).theta.omega(0)`
|
||||
Dots are field separators (changeable), parentheses are for entering arrays.
|
||||
|
||||
If the root of the JSON data is itself an array like the following:
|
||||
|
||||
```
|
||||
[
|
||||
{ "gauges": { "jvm.buffers.direct.capacity": {"value": 215415}}}
|
||||
]
|
||||
```
|
||||
|
||||
The beginning of the key should start with ($index) as in this example:
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost:8081 -p metrics --key_exists "(0)_gauges_jvm.buffers.direct.capacity_value" -f _
|
||||
```
|
||||
|
||||
More info about Nagios Range format and Units of Measure can be found at [https://nagios-plugins.org/doc/guidelines.html](https://nagios-plugins.org/doc/guidelines.html).
|
||||
|
||||
### Docker Info Example Plugin
|
||||
|
||||
#### Description
|
||||
|
||||
Let's say we want to use `check_http_json.py` to read from Docker's `/info` HTTP API endpoint with the following parameters:
|
||||
|
||||
##### Connection information
|
||||
|
||||
* Host = 127.0.0.1:4243
|
||||
* Path = /info
|
||||
|
||||
##### Rules for "aliveness"
|
||||
|
||||
* Verify that the key `Containers` exists in the outputted JSON
|
||||
* Verify that the key `IPv4Forwarding` has a value of `1`
|
||||
* Verify that the key `Debug` has a value less than or equal to `2`
|
||||
* Verify that the key `Images` has a value greater than or equal to `1`
|
||||
* If any of these criteria are not met, report a WARNING to Nagios
|
||||
|
||||
##### Gather Metrics
|
||||
|
||||
* Report value of the key `Containers` with a MinValue of 0 and a MaxValue of 1000 as performance data
|
||||
* Report value of the key `Images` as performance data
|
||||
* Report value of the key `NEventsListener` as performance data
|
||||
* Report value of the key `NFd` as performance data
|
||||
* Report value of the key `NGoroutines` as performance data
|
||||
* Report value of the key `SwapLimit` as performance data
|
||||
|
||||
#### Service Definition
|
||||
|
||||
`localhost.cfg`
|
||||
|
||||
```
|
||||
define service {
|
||||
use local-service
|
||||
host_name localhost
|
||||
service_description Docker info status checker
|
||||
check_command check_docker
|
||||
}
|
||||
```
|
||||
|
||||
#### Command Definition with Arguments
|
||||
|
||||
`commands.cfg`
|
||||
|
||||
```
|
||||
define command{
|
||||
command_name check_docker
|
||||
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p info -e Containers -q IPv4Forwarding,1 -w Debug,2:2 -c Images,1:1 -m Containers,0:250,0:500,0,1000 Images NEventsListener NFd NGoroutines SwapLimit
|
||||
}
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'Containers'=1;0;1000 'Images'=11;0;0 'NEventsListener'=3;0;0 'NFd'=10;0;0 'NGoroutines'=14;0;0 'SwapLimit'=1;0;0
|
||||
```
|
||||
|
||||
### Docker Container Monitor Example Plugin
|
||||
|
||||
`check_http_json.py` is generic enough to read and evaluate rules on any HTTP endpoint that returns JSON. In this example we'll get the status of a specific container using its ID, which can be found by using the list containers endpoint (`curl http://127.0.0.1:4243/containers/json?all=1`).
|
||||
|
||||
##### Connection information
|
||||
|
||||
* Host = 127.0.0.1:4243
|
||||
* Path = /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json
|
||||
|
||||
##### Rules for "aliveness"
|
||||
|
||||
* Verify that the key `ID` exists and is equal to the value `2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615`
|
||||
* Verify that the key `State.Running` has a value of `True`
|
||||
|
||||
#### Service Definition
|
||||
|
||||
`localhost.cfg`
|
||||
|
||||
```
|
||||
define service {
|
||||
use local-service
|
||||
host_name localhost
|
||||
service_description Docker container liveness check
|
||||
check_command check_my_container
|
||||
}
|
||||
```
|
||||
|
||||
#### Command Definition with Arguments
|
||||
|
||||
`commands.cfg`
|
||||
|
||||
```
|
||||
define command{
|
||||
command_name check_my_container
|
||||
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json -q ID,2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615 State.Running,True
|
||||
}
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
WARNING: Status check failed, reason: Value True for key State.Running did not match.
|
||||
```
|
||||
|
||||
The plugin returned a warning because the container I used on my system has the following State object:
|
||||
|
||||
```
|
||||
u'State': {...
|
||||
u'Running': False,
|
||||
...
|
||||
```
|
||||
|
||||
If I change the command to use the `-q` parameter `State.Running,False`, the output becomes:
|
||||
|
||||
```
|
||||
OK: Status OK.
|
||||
```
|
||||
|
||||
### Dropwizard / Fieldnames Containing '.' Example
|
||||
|
||||
Simply choose a separator to deal with data such as this:
|
||||
|
||||
```
|
||||
{ "gauges": { "jvm.buffers.direct.capacity": {"value": 215415}}}
|
||||
```
|
||||
|
||||
In this example I've chosen `_` to separate `gauges` from `jvm` and `capacity` from `value`. The CLI invocation then becomes:
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost:8081 -p metrics --key_exists gauges_jvm.buffers.direct.capacity_value -f _
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
Copyright 2014-2015 Drew Kerrigan.
|
||||
|
||||
1120
check_http_json.py
File diff suppressed because it is too large
115
docs/DOCKER.md
Normal file
@@ -0,0 +1,115 @@
|
||||
### Docker Info Example Plugin
|
||||
|
||||
#### Description
|
||||
|
||||
Let's say we want to use `check_http_json.py` to read from Docker's `/info` HTTP API endpoint with the following parameters:
|
||||
|
||||
##### Connection information
|
||||
|
||||
* Host = 127.0.0.1:4243
|
||||
* Path = /info
|
||||
|
||||
##### Rules for "aliveness"
|
||||
|
||||
* Verify that the key `Containers` exists in the outputted JSON
|
||||
* Verify that the key `IPv4Forwarding` has a value of `1`
|
||||
* Verify that the key `Debug` has a value less than or equal to `2`
|
||||
* Verify that the key `Images` has a value greater than or equal to `1`
|
||||
* If any of these criteria are not met, report a WARNING to Nagios
|
||||
|
||||
##### Gather Metrics
|
||||
|
||||
* Report value of the key `Containers` with a MinValue of 0 and a MaxValue of 1000 as performance data
|
||||
* Report value of the key `Images` as performance data
|
||||
* Report value of the key `NEventsListener` as performance data
|
||||
* Report value of the key `NFd` as performance data
|
||||
* Report value of the key `NGoroutines` as performance data
|
||||
* Report value of the key `SwapLimit` as performance data
|
||||
|
||||
#### Service Definition
|
||||
|
||||
`localhost.cfg`
|
||||
|
||||
```
|
||||
define service {
|
||||
use local-service
|
||||
host_name localhost
|
||||
service_description Docker info status checker
|
||||
check_command check_docker
|
||||
}
|
||||
```
|
||||
|
||||
#### Command Definition with Arguments
|
||||
|
||||
`commands.cfg`
|
||||
|
||||
```
|
||||
define command{
|
||||
command_name check_docker
|
||||
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p info -e Containers -q IPv4Forwarding,1 -w Debug,2:2 -c Images,1:1 -m Containers,0:250,0:500,0,1000 Images NEventsListener NFd NGoroutines SwapLimit
|
||||
}
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'Containers'=1;0;1000 'Images'=11;0;0 'NEventsListener'=3;0;0 'NFd'=10;0;0 'NGoroutines'=14;0;0 'SwapLimit'=1;0;0
|
||||
```
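
To consume output like the sample above in a script, the status and performance data can be split apart. The sketch below is an illustration only: `parse_plugin_output` is a hypothetical helper, and it assumes the `STATUS: message|'label'=value;field;field` layout seen in the sample.

```python
import re

# One perfdata item: 'label'=value followed by optional ;-separated fields.
_PERF = re.compile(r"'([^']+)'=([^; ]+)((?:;[^; ]*)*)")


def parse_plugin_output(line):
    """Split plugin output into (status, {label: (value, [extra fields])})."""
    message, _, perfdata = line.partition("|")
    status = message.split(":", 1)[0]
    metrics = {}
    for label, value, rest in _PERF.findall(perfdata):
        # rest looks like ";0;1000"; drop the empty piece before the first ";".
        metrics[label] = (value, rest.split(";")[1:] if rest else [])
    return status, metrics


sample = "OK: Status OK.|'Containers'=1;0;1000 'Images'=11;0;0 'NFd'=10;0;0"
status, metrics = parse_plugin_output(sample)
print(status)
print(metrics)
```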
|
||||
|
||||
### Docker Container Monitor Example Plugin
|
||||
|
||||
`check_http_json.py` is generic enough to read and evaluate rules on any HTTP endpoint that returns JSON. In this example we'll get the status of a specific container using its ID, which can be found by using the list containers endpoint (`curl http://127.0.0.1:4243/containers/json?all=1`).
|
||||
|
||||
##### Connection information
|
||||
|
||||
* Host = 127.0.0.1:4243
|
||||
* Path = /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json
|
||||
|
||||
##### Rules for "aliveness"
|
||||
|
||||
* Verify that the key `ID` exists and is equal to the value `2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615`
|
||||
* Verify that the key `State.Running` has a value of `True`
|
||||
|
||||
#### Service Definition
|
||||
|
||||
`localhost.cfg`
|
||||
|
||||
```
|
||||
define service {
|
||||
use local-service
|
||||
host_name localhost
|
||||
service_description Docker container liveness check
|
||||
check_command check_my_container
|
||||
}
|
||||
```
|
||||
|
||||
#### Command Definition with Arguments
|
||||
|
||||
`commands.cfg`
|
||||
|
||||
```
|
||||
define command{
|
||||
command_name check_my_container
|
||||
command_line /usr/bin/python /usr/local/nagios/libexec/plugins/check_http_json.py -H 127.0.0.1:4243 -p /containers/2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615/json -q ID,2356e8ccb3de8308ccb16cf8f5d157bc85ded5c3d8327b0dfb11818222b6f615 State.Running,True
|
||||
}
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
WARNING: Status check failed, reason: Value True for key State.Running did not match.
|
||||
```
|
||||
|
||||
The plugin returned a warning because the container I used on my system has the following State object:
|
||||
|
||||
```
|
||||
u'State': {...
|
||||
u'Running': False,
|
||||
...
|
||||
```
|
||||
|
||||
If I change the command to use the `-q` parameter `State.Running,False`, the output becomes:
|
||||
|
||||
```
|
||||
OK: Status OK.
|
||||
```
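
The behaviour above can be pictured with a small sketch of an equality rule. This is a guess at the general idea rather than the plugin's code: `check_equals` is a hypothetical helper, and it assumes that the JSON value is compared by its Python string form and that `:` separates alternative values, as the help text for `-q` describes.

```python
def check_equals(data, rule, separator="."):
    """Evaluate a rule such as "State.Running,True" or "key,value1:value2"."""
    key, _, expected = rule.partition(",")
    allowed = expected.split(":")  # several accepted values, colon-delimited
    actual = data
    for part in key.split(separator):
        actual = actual[part]
    # Assumption: the comparison uses the string form, so JSON false reads "False".
    return str(actual) in allowed


container = {"ID": "2356e8ccb3de", "State": {"Running": False}}
print(check_equals(container, "State.Running,True"))
print(check_equals(container, "State.Running,False"))
print(check_equals(container, "State.Running,True:False"))
```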
|
||||
227
docs/RIAK.md
Normal file
@@ -0,0 +1,227 @@
|
||||
# Riak Stats Example
|
||||
|
||||
## Description
|
||||
|
||||
For this example we're going to use `check_http_json.py` as a pure CLI tool to read Riak's `/stats` endpoint.
|
||||
|
||||
## Connection information
|
||||
|
||||
* Host = 127.0.0.1:8098
|
||||
* Path = /stats
|
||||
|
||||
## JSON Stats Data
|
||||
|
||||
* Full Riak HTTP Stats information can be found here: [http://docs.basho.com/riak/latest/dev/references/http/status/](http://docs.basho.com/riak/latest/dev/references/http/status/)
|
||||
* Information related to specific interesting stats can be found here: [http://docs.basho.com/riak/latest/ops/running/stats-and-monitoring/](http://docs.basho.com/riak/latest/ops/running/stats-and-monitoring/)
|
||||
|
||||
## Connectivity Check
|
||||
|
||||
* `ring_members`: We can use an existence check to monitor the number of ring members
|
||||
* `connected_nodes`: Similarly, we can check the number of nodes that are in communication with this node, but this list will be empty in a single-node cluster
|
||||
|
||||
#### Sample Command
|
||||
|
||||
For a single node dev "cluster", you might have a `ring_members` value like this:
|
||||
|
||||
```
|
||||
"ring_members": [
|
||||
"riak@127.0.0.1"
|
||||
],
|
||||
```
|
||||
|
||||
To validate that we have a single node, we can use this check:
|
||||
|
||||
```
|
||||
$ ./check_http_json.py -H localhost -P 8098 -p stats -E "ring_members(0)"
|
||||
OK: Status OK.
|
||||
```
|
||||
|
||||
If we were expecting at least 2 nodes in the cluster, we would use this check:
|
||||
|
||||
```
|
||||
$ ./check_http_json.py -H localhost -P 8098 -p stats -E "ring_members(1)"
|
||||
CRITICAL: Status CRITICAL. Key ring_members(1) did not exist.
|
||||
```
|
||||
|
||||
This fails because `ring_members` contains only a single entry. If we prefer a warning instead of a critical for this check, we use the `-e` flag instead of `-E`:
|
||||
|
||||
```
|
||||
$ ./check_http_json.py -H localhost -P 8098 -p stats -e "ring_members(1)"
|
||||
WARNING: Status WARNING. Key ring_members(1) did not exist.
|
||||
```
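
The difference between `-e` and `-E` comes down to which Nagios exit code a missing key produces. The sketch below shows that mapping with the standard exit codes (0 OK, 1 WARNING, 2 CRITICAL). `key_exists` and `check_exists` are hypothetical helpers that only handle top-level keys with an optional `(N)` index.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def key_exists(data, key):
    """Return True if a top-level key such as "ring_members(1)" resolves in `data`."""
    match = re.fullmatch(r"([^()]+)\((\d+)\)", key)
    name, index = (match.group(1), int(match.group(2))) if match else (key, None)
    if not isinstance(data, dict) or name not in data:
        return False
    if index is None:
        return True
    value = data[name]
    return isinstance(value, list) and index < len(value)


def check_exists(data, warn_keys=(), crit_keys=()):
    """Critical keys take precedence over warning keys, as in the -E and -e flags."""
    if any(not key_exists(data, key) for key in crit_keys):
        return CRITICAL
    if any(not key_exists(data, key) for key in warn_keys):
        return WARNING
    return OK


stats = {"ring_members": ["riak@127.0.0.1"]}
print(check_exists(stats, crit_keys=["ring_members(0)"]))
print(check_exists(stats, crit_keys=["ring_members(1)"]))
print(check_exists(stats, warn_keys=["ring_members(1)"]))
```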
|
||||
|
||||
## Gather Metrics
|
||||
|
||||
The acceptable thresholds for these metrics will vary from system to system. The following are the stats we'll be checking:
|
||||
|
||||
### Throughput Metrics:
|
||||
|
||||
* `node_gets`
|
||||
* `node_puts`
|
||||
* `vnode_counter_update`
|
||||
* `vnode_set_update`
|
||||
* `vnode_map_update`
|
||||
* `search_query_throughput_one`
|
||||
* `search_index_throughput_one`
|
||||
* `consistent_gets`
|
||||
* `consistent_puts`
|
||||
* `vnode_index_reads`
|
||||
|
||||
#### Sample Command
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost -P 8098 -p stats -m \
|
||||
"node_gets" \
|
||||
"node_puts" \
|
||||
"vnode_counter_update" \
|
||||
"vnode_set_update" \
|
||||
"vnode_map_update" \
|
||||
"search_query_throughput_one" \
|
||||
"search_index_throughput_one" \
|
||||
"consistent_gets" \
|
||||
"consistent_puts" \
|
||||
"vnode_index_reads"
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'node_gets'=0 'node_puts'=0 'vnode_counter_update'=0 'vnode_set_update'=0 'vnode_map_update'=0 'search_query_throughput_one'=0 'consistent_gets'=0 'consistent_puts'=0 'vnode_index_reads'=0
|
||||
```
|
||||
|
||||
### Latency Metrics:
|
||||
|
||||
* `node_get_fsm_time_mean,_median,_95,_99,_100`
|
||||
* `node_put_fsm_time_mean,_median,_95,_99,_100`
|
||||
* `object_counter_merge_time_mean,_median,_95,_99,_100`
|
||||
* `object_set_merge_time_mean,_median,_95,_99,_100`
|
||||
* `object_map_merge_time_mean,_median,_95,_99,_100`
|
||||
* `search_query_latency_median,_min,_95,_99,_999`
|
||||
* `search_index_latency_median,_min,_95,_99,_999`
|
||||
* `consistent_get_time_mean,_median,_95,_99,_100`
|
||||
* `consistent_put_time_mean,_median,_95,_99,_100`
|
||||
|
||||
#### Sample Command
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost -P 8098 -p stats -m \
|
||||
"node_get_fsm_time_mean,,0:100,0:1000" \
|
||||
"node_get_fsm_time_median,,0:100,0:1000" \
|
||||
"node_get_fsm_time_95,,0:100,0:1000" \
|
||||
"node_get_fsm_time_99,,0:100,0:1000" \
|
||||
"node_get_fsm_time_100,,0:100,0:1000" \
|
||||
"node_put_fsm_time_mean,,0:100,0:1000" \
|
||||
"node_put_fsm_time_median,,0:100,0:1000" \
|
||||
"node_put_fsm_time_95,,0:100,0:1000" \
|
||||
"node_put_fsm_time_99,,0:100,0:1000" \
|
||||
"node_put_fsm_time_100,,0:100,0:1000" \
|
||||
"object_counter_merge_time_mean,,0:100,0:1000" \
|
||||
"object_counter_merge_time_median,,0:100,0:1000" \
|
||||
"object_counter_merge_time_95,,0:100,0:1000" \
|
||||
"object_counter_merge_time_99,,0:100,0:1000" \
|
||||
"object_counter_merge_time_100,,0:100,0:1000" \
|
||||
"object_set_merge_time_mean,,0:100,0:1000" \
|
||||
"object_set_merge_time_median,,0:100,0:1000" \
|
||||
"object_set_merge_time_95,,0:100,0:1000" \
|
||||
"object_set_merge_time_99,,0:100,0:1000" \
|
||||
"object_set_merge_time_100,,0:100,0:1000" \
|
||||
"object_map_merge_time_mean,,0:100,0:1000" \
|
||||
"object_map_merge_time_median,,0:100,0:1000" \
|
||||
"object_map_merge_time_95,,0:100,0:1000" \
|
||||
"object_map_merge_time_99,,0:100,0:1000" \
|
||||
"object_map_merge_time_100,,0:100,0:1000" \
|
||||
"consistent_get_time_mean,,0:100,0:1000" \
|
||||
"consistent_get_time_median,,0:100,0:1000" \
|
||||
"consistent_get_time_95,,0:100,0:1000" \
|
||||
"consistent_get_time_99,,0:100,0:1000" \
|
||||
"consistent_get_time_100,,0:100,0:1000" \
|
||||
"consistent_put_time_mean,,0:100,0:1000" \
|
||||
"consistent_put_time_median,,0:100,0:1000" \
|
||||
"consistent_put_time_95,,0:100,0:1000" \
|
||||
"consistent_put_time_99,,0:100,0:1000" \
|
||||
"consistent_put_time_100,,0:100,0:1000" \
|
||||
"search_query_latency_median,,0:100,0:1000" \
|
||||
"search_query_latency_min,,0:100,0:1000" \
|
||||
"search_query_latency_95,,0:100,0:1000" \
|
||||
"search_query_latency_99,,0:100,0:1000" \
|
||||
"search_query_latency_999,,0:100,0:1000" \
|
||||
"search_index_latency_median,,0:100,0:1000" \
|
||||
"search_index_latency_min,,0:100,0:1000" \
|
||||
"search_index_latency_95,,0:100,0:1000" \
|
||||
"search_index_latency_99,,0:100,0:1000" \
|
||||
"search_index_latency_999,,0:100,0:1000"
|
||||
```
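
A command this long is easier to generate than to type. The sketch below builds the `-m` arguments from a prefix and a list of suffixes. `expand_metrics` is a hypothetical helper, and the `0:100` and `0:1000` thresholds are simply the placeholder values used in the sample command.

```python
def expand_metrics(prefix, suffixes, warn="0:100", crit="0:1000"):
    """Build one -m argument per suffix: "<prefix>_<suffix>,,<warn>,<crit>"."""
    return ["%s_%s,,%s,%s" % (prefix, suffix, warn, crit) for suffix in suffixes]


command = ["./check_http_json.py", "-H", "localhost", "-P", "8098", "-p", "stats", "-m"]
for prefix in ("node_get_fsm_time", "node_put_fsm_time"):
    command += expand_metrics(prefix, ["mean", "median", "95", "99", "100"])
command += expand_metrics("search_query_latency", ["median", "min", "95", "99", "999"])

print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting of the commas and colons.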
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'node_get_fsm_time_mean'=0;0:100;0:1000 'node_get_fsm_time_median'=0;0:100;0:1000 'node_get_fsm_time_95'=0;0:100;0:1000 'node_get_fsm_time_99'=0;0:100;0:1000 'node_get_fsm_time_100'=0;0:100;0:1000 'node_put_fsm_time_mean'=0;0:100;0:1000 'node_put_fsm_time_median'=0;0:100;0:1000 'node_put_fsm_time_95'=0;0:100;0:1000 'node_put_fsm_time_99'=0;0:100;0:1000 'node_put_fsm_time_100'=0;0:100;0:1000 'object_counter_merge_time_mean'=0;0:100;0:1000 'object_counter_merge_time_median'=0;0:100;0:1000 'object_counter_merge_time_95'=0;0:100;0:1000 'object_counter_merge_time_99'=0;0:100;0:1000 'object_counter_merge_time_100'=0;0:100;0:1000 'object_set_merge_time_mean'=0;0:100;0:1000 'object_set_merge_time_median'=0;0:100;0:1000 'object_set_merge_time_95'=0;0:100;0:1000 'object_set_merge_time_99'=0;0:100;0:1000 'object_set_merge_time_100'=0;0:100;0:1000 'object_map_merge_time_mean'=0;0:100;0:1000 'object_map_merge_time_median'=0;0:100;0:1000 'object_map_merge_time_95'=0;0:100;0:1000 'object_map_merge_time_99'=0;0:100;0:1000 'object_map_merge_time_100'=0;0:100;0:1000 'consistent_get_time_mean'=0;0:100;0:1000 'consistent_get_time_median'=0;0:100;0:1000 'consistent_get_time_95'=0;0:100;0:1000 'consistent_get_time_99'=0;0:100;0:1000 'consistent_get_time_100'=0;0:100;0:1000 'consistent_put_time_mean'=0;0:100;0:1000 'consistent_put_time_median'=0;0:100;0:1000 'consistent_put_time_95'=0;0:100;0:1000 'consistent_put_time_99'=0;0:100;0:1000 'consistent_put_time_100'=0;0:100;0:1000 'search_query_latency_median'=0;0:100;0:1000 'search_query_latency_min'=0;0:100;0:1000 'search_query_latency_95'=0;0:100;0:1000 'search_query_latency_99'=0;0:100;0:1000 'search_query_latency_999'=0;0:100;0:1000 'search_index_latency_median'=0;0:100;0:1000 'search_index_latency_min'=0;0:100;0:1000 'search_index_latency_95'=0;0:100;0:1000 'search_index_latency_99'=0;0:100;0:1000 'search_index_latency_999'=0;0:100;0:1000
|
||||
```
|
||||
|
||||
### Erlang Resource Usage Metrics:
|
||||
|
||||
* `sys_process_count`
|
||||
* `memory_processes`
|
||||
* `memory_processes_used`
|
||||
|
||||
#### Sample Command
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost -P 8098 -p stats -m \
|
||||
"sys_process_count,,0:5000,0:10000" \
|
||||
"memory_processes,,0:50000000,0:100000000" \
|
||||
"memory_processes_used,,0:50000000,0:100000000"
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'sys_process_count'=1637;0:5000;0:10000 'memory_processes'=46481112;0:50000000;0:100000000 'memory_processes_used'=46476880;0:50000000;0:100000000
|
||||
```
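
The relationship between a `-m` argument and the performance data in the output can be sketched as follows. `perfdata` is a hypothetical helper written for this illustration. It assumes the `key[>alias],UnitOfMeasure,WarnRange,CriticalRange,Min,Max` layout from the help text and does not reproduce every detail of the plugin's formatting.

```python
def perfdata(spec, value):
    """Format one perfdata item from a -m spec and the value read from the JSON."""
    fields = spec.split(",")
    name, _, alias = fields[0].partition(">")
    label = alias or name  # an alias, if given, replaces the key in the output
    unit = fields[1] if len(fields) > 1 else ""
    item = "'%s'=%s%s" % (label, value, unit)
    extra = fields[2:]  # warn, crit, min, max in that order
    if extra:
        item += ";" + ";".join(extra)
    return item


print(perfdata("sys_process_count,,0:5000,0:10000", 1637))
print(perfdata("memory_processes>mem,B,0:50000000,0:100000000", 46481112))
print(perfdata("node_gets", 0))
```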
|
||||
|
||||
### General Riak Load / Health Metrics:
|
||||
|
||||
* `node_get_fsm_siblings_mean,_median,_95,_99,_100`
|
||||
* `node_get_fsm_objsize_mean,_median,_95,_99,_100`
|
||||
* `riak_search_vnodeq_mean,_median,_95,_99,_100`
|
||||
* `search_index_fail_one`
|
||||
* `pbc_active`
|
||||
* `pbc_connects`
|
||||
* `read_repairs`
|
||||
* `list_fsm_active`
|
||||
* `node_get_fsm_rejected`
|
||||
* `node_put_fsm_rejected`
|
||||
|
||||
#### Sample Command
|
||||
|
||||
```
|
||||
./check_http_json.py -H localhost -P 8098 -p stats -m \
|
||||
"node_get_fsm_siblings_mean,,0:100,0:1000" \
|
||||
"node_get_fsm_siblings_median,,0:100,0:1000" \
|
||||
"node_get_fsm_siblings_95,,0:100,0:1000" \
|
||||
"node_get_fsm_siblings_99,,0:100,0:1000" \
|
||||
"node_get_fsm_siblings_100,,0:100,0:1000" \
|
||||
"node_get_fsm_objsize_mean,,0:100,0:1000" \
|
||||
"node_get_fsm_objsize_median,,0:100,0:1000" \
|
||||
"node_get_fsm_objsize_95,,0:100,0:1000" \
|
||||
"node_get_fsm_objsize_99,,0:100,0:1000" \
|
||||
"node_get_fsm_objsize_100,,0:100,0:1000" \
|
||||
"riak_search_vnodeq_mean,,0:100,0:1000" \
|
||||
"riak_search_vnodeq_median,,0:100,0:1000" \
|
||||
"riak_search_vnodeq_95,,0:100,0:1000" \
|
||||
"riak_search_vnodeq_99,,0:100,0:1000" \
|
||||
"riak_search_vnodeq_100,,0:100,0:1000" \
|
||||
"search_index_fail_one,,0:100,0:1000" \
|
||||
"pbc_active,,0:100,0:1000" \
|
||||
"pbc_connects,,0:100,0:1000" \
|
||||
"read_repairs,,0:100,0:1000" \
|
||||
"list_fsm_active,,0:100,0:1000" \
|
||||
"node_get_fsm_rejected,,0:100,0:1000" \
|
||||
"node_put_fsm_rejected,,0:100,0:1000"
|
||||
```
|
||||
|
||||
#### Sample Output
|
||||
|
||||
```
|
||||
OK: Status OK.|'node_get_fsm_siblings_mean'=0;0:100;0:1000 'node_get_fsm_siblings_median'=0;0:100;0:1000 'node_get_fsm_siblings_95'=0;0:100;0:1000 'node_get_fsm_siblings_99'=0;0:100;0:1000 'node_get_fsm_siblings_100'=0;0:100;0:1000 'node_get_fsm_objsize_mean'=0;0:100;0:1000 'node_get_fsm_objsize_median'=0;0:100;0:1000 'node_get_fsm_objsize_95'=0;0:100;0:1000 'node_get_fsm_objsize_99'=0;0:100;0:1000 'node_get_fsm_objsize_100'=0;0:100;0:1000 'search_index_fail_one'=0;0:100;0:1000 'pbc_active'=0;0:100;0:1000 'pbc_connects'=0;0:100;0:1000 'read_repairs'=0;0:100;0:1000 'list_fsm_active'=0;0:100;0:1000 'node_get_fsm_rejected'=0;0:100;0:1000 'node_put_fsm_rejected'=0;0:100;0:1000
|
||||
```
|
||||