Node Health Report
The Node Health report displays health-related attributes for all selected nodes for a given period. The attributes displayed are: Status, Device, Availability, Interface Availability, %CPU, 95th% CPU, Max %CPU, CPU Exc., %Mem Free, 95th% Mem Used, Max %Mem Used, %Mem Util, %IO/VIR Mem Free, 95th% IO Mem Used, Max %IO Mem Used, %IO/VIR Mem Util.
The report also includes two columns with the detected (abnormal) Conditions and the recommended Actions.
If you pass this report the option exceptions=true, then only nodes with exceptional conditions present are shown; the default is to show all nodes.
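The effect of the exceptions option can be illustrated with a minimal Python sketch. This is not opReports code: the function name select_nodes and the dictionary layout (node name mapped to its list of detected conditions) are assumptions made purely for illustration.

```python
from typing import Dict, List


def select_nodes(conditions_by_node: Dict[str, List[str]],
                 exceptions: bool = False) -> List[str]:
    """Return the node names a report would list (hypothetical helper)."""
    if not exceptions:
        # Default behaviour: every selected node is shown.
        return list(conditions_by_node)
    # exceptions=true: keep only nodes with at least one detected condition.
    return [node for node, conds in conditions_by_node.items() if conds]


nodes = {
    "router1": [],
    "switch2": ["Device has HIGH CPU utilisation"],
}
print(select_nodes(nodes))                   # all nodes
print(select_nodes(nodes, exceptions=True))  # only nodes with conditions
```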
The image below shows the output of a default Node Health Report, that is, one run without the option or with exceptions=false.
A Node Health Report using the same devices where exceptions=true looks similar to the image below.
The formulas used to calculate the reporting conditions can be tuned by the user.
The section opreports_rules (in conf/opCommon.nmis in opReports 3.x, or opReports.nmis in version 2.x) defines the threshold values for the following conditions:
Device Availability = Condition: "Device has LOW or VERY LOW availability"
Action: Investigate causes for low availability
Formula used for Calculation:
- Very Low device availability: less than 99.9%
- Low device availability: less than 99.999%
Interface Availability = Condition: "Device has LOW or VERY LOW interface availability"
Action: Investigate causes for low interface availability
Formula used for Calculation:
- Very Low interface availability: less than 80%
- Low interface availability: less than 95%
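The two availability rules share the same shape: a value is compared against a "very low" and a "low" threshold, with the more severe band taking precedence. A minimal Python sketch of that comparison follows. The threshold numbers are the defaults quoted above; the dictionary keys and the function name are illustrative assumptions and do not reflect the actual layout of opreports_rules.

```python
from typing import Dict, Optional

# Default thresholds as listed in this section (percent).
# The key names are hypothetical, not the real configuration keys.
DEVICE_AVAILABILITY = {"very_low": 99.9, "low": 99.999}
INTERFACE_AVAILABILITY = {"very_low": 80.0, "low": 95.0}


def classify_availability(value: float,
                          thresholds: Dict[str, float]) -> Optional[str]:
    """Return 'VERY LOW', 'LOW', or None when no condition applies."""
    # Test the most severe band first so it wins over the milder one.
    if value < thresholds["very_low"]:
        return "VERY LOW"
    if value < thresholds["low"]:
        return "LOW"
    return None


print(classify_availability(99.95, DEVICE_AVAILABILITY))
print(classify_availability(79.0, INTERFACE_AVAILABILITY))
```

With these defaults, a device at 99.95% availability falls into the LOW band, since it is at least 99.9 but below 99.999.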
CPU Utilisation = Condition: "Device has VERY HIGH, HIGH or MODERATE CPU utilisation"
Action: Investigate causes for CPU utilisation
Formula used for Calculation:
- Very High CPU utilisation: greater than 30%
- High CPU utilisation: greater than 20%
- Moderate CPU utilisation: greater than 12%
If the node has multiple CPUs then the utilisation measure is averaged over all CPUs.
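As a rough illustration of the CPU rule, the sketch below averages the utilisation of all CPUs and then compares the result against the three default thresholds. It is a simplified model rather than the opReports implementation; the function name and the input format (one utilisation figure per CPU) are assumptions.

```python
from statistics import mean
from typing import Optional, Sequence

# Default thresholds from this section, ordered from most to least severe.
CPU_THRESHOLDS = [("VERY HIGH", 30.0), ("HIGH", 20.0), ("MODERATE", 12.0)]


def classify_cpu(per_cpu_util: Sequence[float]) -> Optional[str]:
    """Classify a node by the average utilisation (percent) of its CPUs."""
    # Multiple CPUs are averaged before any threshold is applied.
    # Note: mean() raises StatisticsError for an empty sequence.
    avg = mean(per_cpu_util)
    for label, limit in CPU_THRESHOLDS:
        if avg > limit:  # strictly "greater than", as stated in the text
            return label
    return None


print(classify_cpu([40.0, 30.0]))  # average 35
print(classify_cpu([5.0, 10.0]))   # average 7.5
```

Because the values are averaged, a single busy CPU can be masked by idle ones: two CPUs at 50% and 10% average to 30%, which is not greater than 30 and therefore rates as HIGH rather than VERY HIGH.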
CPU Exceptions
The count of times the CPU utilisation exceeded the "CPU Exception Threshold" of 20%. If the node has multiple CPUs then this is the sum of the exception counts of all CPUs.
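Unlike the utilisation figure, exception counts are summed across CPUs rather than averaged. The following Python sketch shows that difference under the assumption that each CPU provides a list of utilisation samples for the period; the sample layout and the function name are hypothetical.

```python
from typing import Dict, Sequence

CPU_EXCEPTION_THRESHOLD = 20.0  # percent, default quoted above


def count_cpu_exceptions(samples_by_cpu: Dict[str, Sequence[float]]) -> int:
    """Sum, over all CPUs, the samples that exceeded the threshold."""
    return sum(
        # Per-CPU exception count: samples strictly above the threshold.
        sum(1 for sample in samples if sample > CPU_EXCEPTION_THRESHOLD)
        for samples in samples_by_cpu.values()
    )


samples = {"cpu0": [10.0, 25.0, 30.0], "cpu1": [21.0, 5.0]}
print(count_cpu_exceptions(samples))  # cpu0 contributes 2, cpu1 contributes 1
```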
Memory Utilisation = Condition: "Device has VERY LOW or LOW main memory free"
Action: Investigate causes for low free main memory
Formula used for Calculation:
- Very Low free main memory: less than 10%
- Low free main memory: less than 25%
IO or Virtual Memory Utilisation = Condition: "Device has VERY LOW or LOW IO or Virtual memory free"
Action: Investigate causes for low free IO or Virtual memory
Formula used for Calculation:
- Very Low free IO or Virtual memory: less than 10%
- Low free IO or Virtual memory: less than 25%
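Both memory rules use the same default thresholds on the percentage of free memory, so they can be evaluated with one helper. The sketch below is an illustration only: the function names are hypothetical, and the generated condition text merely follows the pattern of the condition strings quoted above, which may differ from what the report actually prints.

```python
from typing import Dict, Optional

# Default free-memory thresholds from this section (percent free).
FREE_MEMORY_THRESHOLDS = {"very_low": 10.0, "low": 25.0}


def classify_free_memory(percent_free: float) -> Optional[str]:
    """Return 'VERY LOW', 'LOW', or None for a free-memory percentage."""
    if percent_free < FREE_MEMORY_THRESHOLDS["very_low"]:
        return "VERY LOW"
    if percent_free < FREE_MEMORY_THRESHOLDS["low"]:
        return "LOW"
    return None


def memory_conditions(main_free: float, io_free: float) -> Dict[str, str]:
    """Map each memory pool to a condition; pools within limits are omitted."""
    pools = {"main": main_free, "IO or Virtual": io_free}
    result = {}
    for pool, free in pools.items():
        level = classify_free_memory(free)
        if level is not None:
            result[pool] = f"Device has {level} {pool} memory free"
    return result


print(memory_conditions(main_free=8.0, io_free=30.0))
```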