Lessons learned from support cases - common things to look for.

Does DNS function properly?

If not, any daemon that performs name resolution will be very slow. Verify that the system has an FQDN and that the FQDN resolves to the system itself. Also check that it can resolve other hosts.

### Check the local system's FQDN
[root@demo: ~]# hostname -f
demo.opmantek.com

### Can the local system resolve its own hostname?
[root@demo: ~]# dig +short demo.opmantek.com
192.168.88.44

### Can the system resolve other hosts?
[root@demo: ~]# dig +short freebsd.org
8.8.178.110
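
The checks above can be wrapped in a small script for repeated use. This is a minimal sketch: the function names is_fqdn and check_dns are illustrative and not part of NMIS, and it assumes dig is installed.

```shell
# Sketch: confirm the host has an FQDN and that the name resolves.
# is_fqdn and check_dns are illustrative names, not part of NMIS.

# Succeeds if the name contains a dot with text on both sides of it.
is_fqdn() {
    case "$1" in
        .*|*.) return 1 ;;
        *.*)   return 0 ;;
        *)     return 1 ;;
    esac
}

# Prints a warning for each failed check; returns non-zero on any failure.
check_dns() {
    name=$(hostname -f) || return 1
    rc=0
    if ! is_fqdn "$name"; then
        echo "WARN: '$name' is not fully qualified"
        rc=1
    fi
    if [ -z "$(dig +short "$name")" ]; then
        echo "WARN: '$name' does not resolve"
        rc=1
    fi
    return $rc
}
```

Calling check_dns on the server prints nothing when both checks pass.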

Does the system have the correct time? Is it synced with a time server?

[nmis@demo var]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+cachens2.onqnet 13.64.159.31     3 u  426 1024  377    4.845   -0.126   0.458
+ec2-13-54-31-22 54.252.165.245   3 u  352 1024  377   18.036    1.540   1.008
-node01.au.verbn 192.12.19.20     2 u  514 1024  377   18.966  -16.530   1.176
*ntp3.syrahost.c 218.100.43.70    2 u  422 1024  377   63.642   -1.172   0.852

[nmis@demo var]$ date -u
2017. 02. 16. (목) 22:33:31 UTC

Compare the system UTC time with actual UTC time. A site such as https://time.is/UTC will show current UTC time.

If the system time is incorrect, it causes a range of problems:

Timestamps on events are wrong.
Graph data is wrong.
Transactions with other systems fail (for example, cookies may already have expired at the time of issue).
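
The ntpq output shown earlier can also be checked from a script. The sketch below relies on two facts about ntpq -p: the peer the daemon is synced to is marked with a leading '*', and the offset in milliseconds is the ninth column. The function names and the 100 ms limit in the usage comment are illustrative assumptions, not NMIS settings.

```shell
# Sketch: extract the offset (ms) of the peer ntpd is currently synced to.
# synced_peer_offset and offset_within are illustrative names.
synced_peer_offset() {
    awk '/^\*/ { print $9; exit }'
}

# Succeeds if |offset| <= limit (both in milliseconds).
# An empty offset (no synced peer) is treated as a failure.
offset_within() {
    awk -v o="$1" -v l="$2" 'BEGIN {
        if (o == "") exit 1
        if (o < 0) o = -o
        exit !(o <= l)
    }'
}

# Usage (needs a running ntpd; 100 ms is an arbitrary example limit):
#   off=$(ntpq -p | synced_peer_offset)
#   offset_within "$off" 100 || echo "clock offset '${off}' ms is out of range"
```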

NMIS Troubleshooting

Node Troubleshooting

Is the node reachable?

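Ping it with a large echo request. A 1472-byte payload plus 28 bytes of IP and ICMP headers fills a standard 1500-byte MTU, so this also exposes fragmentation and MTU problems on the path.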

[root@opmantek conf]# ping -c 5 -s 1472 192.168.88.254
PING 192.168.88.254 (192.168.88.254) 1472(1500) bytes of data.
1480 bytes from 192.168.88.254: icmp_seq=1 ttl=63 time=319 ms
1480 bytes from 192.168.88.254: icmp_seq=2 ttl=63 time=323 ms
1480 bytes from 192.168.88.254: icmp_seq=3 ttl=63 time=321 ms
1480 bytes from 192.168.88.254: icmp_seq=4 ttl=63 time=320 ms
1480 bytes from 192.168.88.254: icmp_seq=5 ttl=63 time=322 ms
--- 192.168.88.254 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4330ms
rtt min/avg/max/mdev = 319.542/321.519/323.551/1.450 ms
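
The payload size used above follows from the MTU: 20 bytes of IPv4 header plus 8 bytes of ICMP header leave 1472 bytes in a 1500-byte packet. A minimal sketch of that arithmetic, with icmp_payload_for_mtu as an illustrative helper name:

```shell
# Sketch: derive the ICMP payload size that exactly fills a given MTU.
# 20 bytes of IPv4 header + 8 bytes of ICMP header = 28 bytes of overhead.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Usage on Linux (iputils ping); -M do forbids fragmentation, so a smaller
# path MTU shows up as an error instead of being hidden by fragmentation:
#   ping -c 5 -M do -s "$(icmp_payload_for_mtu 1500)" 192.168.88.254
```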

What does nmap think about it?

[root@opmantek conf]# nmap 10.10.1.1

Starting Nmap 5.51 ( http://nmap.org ) at 2017-04-04 15:05 KST
Nmap scan report for 10.10.1.1
Host is up (0.011s latency).
Not shown: 998 closed ports
PORT   STATE SERVICE
22/tcp open  ssh
23/tcp open  telnet

Nmap done: 1 IP address (1 host up) scanned in 13.53 seconds
[root@opmantek conf]#

Manual Update & Collect Actions

If a node isn't providing the data we expect, looking at the debug output of a manual update and collect is often helpful. Redirect or tee the output to a file so it can be reviewed later.

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update > nodeUpdate.txt

-or-

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update | tee nodeUpdate.txt

###################

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect > nodeCollect.txt

-or-

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect | tee nodeCollect.txt
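
When several nodes need the same treatment, the commands above can be generated by a small wrapper. This is a sketch only: NMIS_BIN, nmis_debug_cmd and nmis_debug_both are illustrative names, while the path and the node=, debug= and type= arguments are the ones shown above.

```shell
# Sketch: build and run the manual update/collect commands for a node.
NMIS_BIN=/usr/local/nmis8/bin/nmis.pl

# $1 = node name, $2 = update | collect
nmis_debug_cmd() {
    echo "$NMIS_BIN node=$1 debug=9 type=$2"
}

# Run update then collect, keeping one timestamped file per run.
# Assumes the node name contains no spaces.
nmis_debug_both() {
    stamp=$(date +%Y%m%d-%H%M%S)
    for t in update collect; do
        $(nmis_debug_cmd "$1" "$t") 2>&1 | tee "$1-$t-$stamp.txt"
    done
}
```

For example, nmis_debug_both asgard would write one file for the update and one for the collect in the current directory.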

Email alerts

Contacts.nmis must have the correct DutyTime format.
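
A quick format check can catch a malformed DutyTime before it silently blocks notifications. The sketch below assumes the value has the form start hour, end hour, then concatenated three-letter day names, for example 06:20:MonTueWedThuFri. That format is an assumption here and should be verified against the NMIS version in use; valid_dutytime is an illustrative name.

```shell
# ASSUMPTION: DutyTime looks like "06:20:MonTueWedThuFri"
# (start hour, end hour, concatenated three-letter day names).
# Verify against your NMIS version before relying on this check.
valid_dutytime() {
    printf '%s\n' "$1" |
        grep -Eq '^([01][0-9]|2[0-4]):([01][0-9]|2[0-4]):(Mon|Tue|Wed|Thu|Fri|Sat|Sun)+$'
}

# Usage:
#   valid_dutytime "00:24:MonTueWedThuFriSatSun" || echo "DutyTime looks malformed"
```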

External Authentication

conf/Config.nmis must list the auth_method entries in the intended order, and each listed method must itself be fully configured.

If LDAP isn't working, tcpdump can be used to see the response code from the LDAP server.
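
A capture along the following lines is one way to do that. It is a sketch that assumes unencrypted LDAP on TCP port 389; with LDAPS on port 636 the result code is encrypted and not visible in a capture. The helper ldap_result_name is illustrative and covers only a few result codes from RFC 4511.

```shell
# Sketch: name the LDAP result codes most often seen in failed logins.
# Values are from RFC 4511; ldap_result_name is an illustrative helper.
ldap_result_name() {
    case "$1" in
        0)  echo success ;;
        32) echo noSuchObject ;;
        49) echo invalidCredentials ;;
        50) echo insufficientAccessRights ;;
        *)  echo "other ($1)" ;;
    esac
}

# Capture as root, then open ldap.pcap in a protocol analyzer and read the
# resultCode of the bindResponse:
#   tcpdump -i any -nn -s 0 -w ldap.pcap tcp port 389
```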

Long collect times

Are we collecting many interfaces that are not necessary?

Check the node's view.json file for the number of interfaces and their types. Look for attributes the unwanted interfaces have in common, such as interface type or description, then use the models or Config.nmis to disable collection for them.

opCharts Troubleshooting

TopN

Use the following utility to troubleshoot why charts are being populated into TopN:

/usr/local/omk/bin/nmis_topn_export.exe debug=true timing=1 force=1 > topnDebug.txt

opEvents Troubleshooting

General

Grep for the following in opEvents.log:

Event ID
State Object ID
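
A small helper keeps these searches consistent. In this sketch the default log path is an assumption and should be adjusted to the local installation; grep_opevents is an illustrative name.

```shell
# Sketch: print every log line that mentions an event ID or state object ID.
# The default path below is an assumption; override OPEVENTS_LOG if needed.
OPEVENTS_LOG=${OPEVENTS_LOG:-/usr/local/omk/log/opEvents.log}

# $1 = ID to look for, $2 = optional log file
grep_opevents() {
    # -F treats the ID as a literal string, -n prefixes line numbers.
    grep -Fn -- "$1" "${2:-$OPEVENTS_LOG}"
}
```

Passing a second argument searches a rotated or copied log instead of the default file.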

Event not found

Look in the raw log.

If an event is skipped due to old age but its time looks correct, check whether opeventsd was running at the time the event was received.

Need a flow diagram of how and why the many opEvents rule processing files are processed.

State

When troubleshooting state, it's important to realize that event.event and event.stateful are two completely different things. event.stateful is referred to as 'State Type' in the node context view. State is tracked based on event.stateful only; the state status is generally up or down and is found in the value of event.state.

EventParserRules.nmis gives the user full control over which event.stateful and event.state values are presented to opEvents. For example, event.event can have a completely different value than event.stateful.

event.event=Apple; event.stateful=Banana; event.state=up
event.event=Orange; event.stateful=Banana; event.state=down

With this in mind, always confirm event.stateful when troubleshooting state inconsistencies.
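
The Apple and Orange example can be replayed to show the effect. The sketch below keeps the last event.state seen for each event.stateful value and ignores event.event entirely; it only illustrates the idea and is not how opEvents is implemented. last_state_by_stateful is an illustrative name.

```shell
# Sketch: replay events and keep the last event.state per event.stateful.
# Input lines use the "key=value; key=value" layout of the example above.
last_state_by_stateful() {
    awk -F'; *' '
        {
            stateful = ""; state = ""
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "event.stateful") stateful = kv[2]
                if (kv[1] == "event.state")    state = kv[2]
            }
            if (stateful != "") last[stateful] = state
        }
        END { for (s in last) print s "=" last[s] }
    '
}
```

Fed the two example lines in order, the function reports a single entry for Banana whose state is down, even though the two events have different names.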

Poller/Master State Mismatch

If state has been lost between the poller and master servers, check whether a correlation rule has fired and suppressed the more specific event.

omkd Troubleshooting

If mongod is not running, omkd will never start.
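
A quick pre-check avoids chasing omkd errors that are really a stopped database. This sketch assumes pgrep is available; process_running is an illustrative name.

```shell
# Sketch: test whether a process with exactly this name is running.
# pgrep -x matches the whole process name rather than a substring.
process_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Usage before starting omkd:
#   process_running mongod || echo "mongod is not running; start it before omkd"
```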

General Troubleshooting Checklist