Lessons learned from support cases - common things to look for

Does DNS function properly?

If not, any daemon that performs name resolution will be very slow. Verify that the system has an FQDN and that the FQDN resolves to the system itself. Also check that it can resolve other hosts.

### Check the local system's FQDN
[root@demo: ~]# hostname -f
demo.opmantek.com

### Can the local system resolve its own hostname?
[root@demo: ~]# dig +short demo.opmantek.com
192.168.88.44

### Can the system resolve other hosts?
[root@demo: ~]# dig +short freebsd.org
8.8.178.110
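
The three checks above can be wrapped in a small helper so they are easy to repeat on every server. This is a minimal sketch, not part of NMIS: the function names has_ipv4 and check_dns are made up for illustration, and it assumes hostname and dig are installed.

```shell
# Succeed if the given text (for example the output of `dig +short NAME`)
# contains at least one line that looks like an IPv4 address.
has_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Check the local FQDN plus any extra host names given as arguments.
check_dns() {
    fqdn=$(hostname -f)
    for name in "$fqdn" "$@"; do
        if has_ipv4 "$(dig +short "$name" 2>/dev/null)"; then
            echo "OK   $name"
        else
            echo "FAIL $name"
        fi
    done
}
```

Running `check_dns freebsd.org` prints one OK or FAIL line for the local FQDN and one for each extra name.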

DNS is Important

NMIS/OMK applications expect DNS to work. Managing individual /etc/hosts files does not scale. opHA is one module in particular where this is critical. If the customer does not have a local DNS server for internal hosts, consider running BIND on the NMIS master server; the other NMIS/OMK servers can then use it as their name server. This is not difficult to do and will save a lot of troubleshooting time moving forward.

Does the system have the correct time? Is it synced with a time server?

[nmis@demo var]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+cachens2.onqnet 13.64.159.31     3 u  426 1024  377    4.845   -0.126   0.458
+ec2-13-54-31-22 54.252.165.245   3 u  352 1024  377   18.036    1.540   1.008
-node01.au.verbn 192.12.19.20     2 u  514 1024  377   18.966  -16.530   1.176
*ntp3.syrahost.c 218.100.43.70    2 u  422 1024  377   63.642   -1.172   0.852

[nmis@demo var]$ date -u
2017. 02. 16. (목) 22:33:31 UTC

Compare the system UTC time with actual UTC time. A site such as https://time.is/UTC will show current UTC time.

If the system time is not correct it will cause a lot of problems, for example:

- Timestamps on events are not correct
- Graph data is not correct
- Transactions with other systems fail (e.g. cookies could already be expired at the time of issue)
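
The offset of the peer that ntpd has selected (the line starting with '*' in the ntpq -p output) can also be checked from a script. The following is a sketch only; the function names and the 100 ms threshold in the usage line are arbitrary choices for illustration, not NMIS defaults. ntpq reports delay, offset and jitter in milliseconds.

```shell
# Read `ntpq -p` output on stdin and print the offset (ms) of the selected
# peer, i.e. the line starting with '*'. Fails if no peer is selected.
ntp_selected_offset() {
    awk '/^\*/ { print $9; found = 1 } END { exit !found }'
}

# Succeed if the absolute value of offset $1 is within limit $2 (both in ms).
ntp_offset_ok() {
    awk -v off="$1" -v max="$2" \
        'BEGIN { if (off < 0) off = -off; exit !(off <= max) }'
}
```

Example: `off=$(ntpq -p | ntp_selected_offset) && ntp_offset_ok "$off" 100 || echo "time sync problem"`.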

Perl Modules

If NMIS or OMK applications cannot locate a Perl module, the module may be missing or may have the wrong file permissions. Also check the permissions of the directories that contain it.

NMIS Troubleshooting

Node Troubleshooting

Is the node reachable?

Ping it with a big echo request.

[root@opmantek conf]# ping -c 5 -s 1472 192.168.88.254
PING 192.168.88.254 (192.168.88.254) 1472(1500) bytes of data.
1480 bytes from 192.168.88.254: icmp_seq=1 ttl=63 time=319 ms
1480 bytes from 192.168.88.254: icmp_seq=2 ttl=63 time=323 ms
1480 bytes from 192.168.88.254: icmp_seq=3 ttl=63 time=321 ms
1480 bytes from 192.168.88.254: icmp_seq=4 ttl=63 time=320 ms
1480 bytes from 192.168.88.254: icmp_seq=5 ttl=63 time=322 ms
--- 192.168.88.254 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4330ms
rtt min/avg/max/mdev = 319.542/321.519/323.551/1.450 ms

What does nmap think about it?

[root@opmantek conf]# nmap 10.10.1.1

Starting Nmap 5.51 ( http://nmap.org ) at 2017-04-04 15:05 KST
Nmap scan report for 10.10.1.1
Host is up (0.011s latency).
Not shown: 998 closed ports
PORT   STATE SERVICE
22/tcp open  ssh
23/tcp open  telnet

Nmap done: 1 IP address (1 host up) scanned in 13.53 seconds
[root@opmantek conf]#

Node Not Present in GUI

Example Case:

Suddenly the node cannot be found in the GUI. When attempting to re-add the node to NMIS via the GUI we receive a 'node already exists' error.

Issue:

The node's configuration has become corrupt; we need to purge NMIS of all configuration relating to the node.

Actions:

Open /usr/local/nmis8/conf/Nodes.nmis with an editor and delete the section for the problem node.
Remove the following files:
- /usr/local/nmis8/var/<node-name>-node.json
- /usr/local/nmis8/var/<node-name>-view.json
Re-add the problem node via the NMIS GUI
Run the following commands:
- /usr/local/nmis8/bin/nmis.pl type=update node=<node-name> force=true
- /usr/local/nmis8/bin/nmis.pl type=collect node=<node-name> force=true
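
To avoid typing mistakes when purging a node, the file names and commands from the actions above can be generated for a given node name and reviewed before anything is removed or run. This is a hedged sketch: the function names are invented here, it assumes the first state file ends in -node.json, and it only prints text; it does not edit Nodes.nmis or delete anything.

```shell
# Base directory of the NMIS install; override by exporting NMIS_BASE.
NMIS_BASE=${NMIS_BASE:-/usr/local/nmis8}

# Print the per-node state files that the procedure says to remove.
node_state_files() {
    printf '%s/var/%s-node.json\n' "$NMIS_BASE" "$1"
    printf '%s/var/%s-view.json\n' "$NMIS_BASE" "$1"
}

# Print the update and collect commands to run after re-adding the node.
node_refresh_commands() {
    for type in update collect; do
        printf '%s/bin/nmis.pl type=%s node=%s force=true\n' \
            "$NMIS_BASE" "$type" "$1"
    done
}
```

For example, `node_state_files asgard` lists the two files to delete for the node asgard, and `node_refresh_commands asgard` prints the two commands to run once the node has been re-added in the GUI.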

Verify

The problem node should now be functioning properly in the NMIS GUI.

Manual Update & Collect Actions

If a node isn't providing the data we think it should, looking at the debug output of a manual update and collect is often helpful. Redirect or tee the output to a file in order to review it later.

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update > nodeUpdate.txt

-or-

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update | tee nodeUpdate.txt

###################

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect > nodeCollect.txt

-or-

[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect | tee nodeCollect.txt

Email alerts

Contacts.nmis must have the correct DutyTime format.

External Authentication

conf/Config.nmis must list the auth_method entries in the proper order, and each method listed must itself be provisioned.

If LDAP isn't working, tcpdump can be used to see the response code from the LDAP server.

Long collect times

Are we collecting many interfaces that are not necessary?

Check the node's view.json file for the number of interfaces and their types. Look for attributes that the unnecessary interfaces have in common, such as interface type or description, then use models or Config.nmis to disable collection for them.

Syslog

When troubleshooting syslog issues, the following script will gather more rsyslog daemon information than the NMIS support tool.

getSyslogData.sh

snmptrapd

When troubleshooting snmptrapd issues, the following script will gather more snmptrapd daemon information than the NMIS support tool.

getSnmpTrapdInfo.sh

Models

When troubleshooting models it's important to know whether all the OID 'friendly names' referenced within Model files have been defined in /usr/local/nmis8/mibs/nmis_mibs.oid. Some Model files import or call other Model, Graph or Common files. If an OID 'friendly name' has not been defined in nmis_mibs.oid, it may not be obvious which model file is causing the problem. In order to validate friendly names more easily, the script below has been provided. It parses all the OID friendly names out of the model files and looks for them in nmis_mibs.oid. If any are not found, the operator is notified. At some point this script should be converted to Perl; this would make it much faster.

checkOid.sh

opCharts Troubleshooting

TopN

Use the following utility to troubleshoot why charts are being populated into TopN

/usr/local/omk/bin/nmis_topn_export.exe debug=true timing=1 force=1 > topnDebug.txt

RBAC (Role Based Access Control)

General scheme.

Create role.
Create user and assign a role.
Create an object and assign a privilege tag.
Assign the privilege tag to a role.

Based on this the following script was created to pull all the role, user, object and privilege data out of a customer system.

getRbacInfo.sh

opEvents Troubleshooting

General

Grep for the following in opEvents.log:

Event ID
State Object ID

Event not found

Look in the raw log.

If an event is skipped due to old age, but the time looks correct, check whether opeventsd was running at the time the event was received.

Event Processing

When troubleshooting event processing it's useful to understand the order that the various opEvents configuration files are processed in and the general function of each one.

State

When troubleshooting state it's important to realize that event.event and event.stateful are two completely different things. event.stateful is referred to as 'State Type' in the node context view. State is tracked based on event.stateful only. The state status is generally up or down and may be found in the value of event.state.

EventParserRules.nmis provides the ultimate in flexibility, allowing the user to dictate what event.stateful and event.state will be presented to opEvents. For example, event.event can be a completely different value than event.stateful.

event.event=Apple; event.stateful=Banana; event.state=up
event.event=Orange; event.stateful=Banana; event.state=down

With this in mind, always confirm event.stateful when troubleshooting state inconsistencies.
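
To make the Apple/Orange example concrete, the sketch below reads lines in the format shown above and prints the last state seen for each event.stateful value. It is purely illustrative and is not how opEvents is implemented; the function name last_state_by_stateful is made up.

```shell
# Read lines such as
#   event.event=Apple; event.stateful=Banana; event.state=up
# and print "<event.stateful> <last event.state>" for each stateful value.
last_state_by_stateful() {
    awk -F'; *' '
    {
        stateful = ""; state = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "event.stateful") stateful = kv[2]
            if (kv[1] == "event.state")    state = kv[2]
        }
        # event.event is deliberately ignored: only event.stateful keys state.
        if (stateful != "") last[stateful] = state
    }
    END { for (s in last) print s, last[s] }'
}
```

Fed the two example lines, it prints `Banana down`: the Orange event changes the state that the Apple event set, because both carry event.stateful=Banana.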

Poller/Master State Mismatch

If state has been lost between the poller and master servers, check whether a correlation rule has fired and suppressed the more specific event.

If the issue is not related to a correlation rule look for the corresponding event on the poller. In the event context check the 'Actions taken for event' section. Was a script executed that would have sent the event to the master? Was it successful, what was the exit code?

opFlow Troubleshooting

If flows are not rendering in the opFlow GUI take the following actions.

Check Log Files

Review the log files in /usr/local/omk/log.

opFlow.log
common.log
opDaemon.log

Verify Flow Data is Received

Using tcpdump, we can verify that flow data is being received by the server. This example uses the default opFlow UDP port of 9995. Filter on the host that needs to be verified.

[root@poller001 nfdump]# tcpdump -nn -i eth2 host 10.10.1.1 and port 9995
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 65535 bytes

13:24:55.767037 IP 10.10.1.1.62757 > 10.215.1.7.9995: UDP, length 168
13:25:07.827152 IP 10.10.1.1.62757 > 10.215.1.7.9995: UDP, length 168

When we see output such as the example above we know this server is receiving flow data from the network device.
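
When several devices export flows to the same server, it can help to summarise a capture per source address rather than reading it line by line. A minimal sketch, assuming tcpdump output in the format shown above; the function name flow_packets_by_source is invented for this example.

```shell
# Read `tcpdump -nn` output on stdin and print "<source IP> <packet count>".
flow_packets_by_source() {
    awk '$2 == "IP" && $4 == ">" {
        src = $3
        sub(/\.[0-9]+$/, "", src)   # strip the trailing source port
        count[src]++
    }
    END { for (s in count) print s, count[s] }'
}
```

Example: `tcpdump -nn -c 100 -i eth2 port 9995 | flow_packets_by_source` captures 100 packets on the opFlow port and then prints one line per exporting device.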

Check the Flow Data

The next step is to ensure the host in question is providing valid data that nfdump can process. Move to the /var/lib/nfdump directory and look for nfcapd files that end in a datestamp. The datestamp denotes the time the capture file was started. Select a file that is likely to contain samples from the host we wish to verify and execute the following command.

[root@poller001 nfdump]# nfdump -r nfcapd.201707111327 -o raw > ~/raw.txt

Now view the new text file with less or a text editor. It will provide flow records such as the following. The 'ip router' field denotes the source router for this flow sample.

Flow Record: 
  Flags        =              0x00 FLOW, Unsampled
  export sysid =                 1
  size         =                76
  first        =        1499779596 [2017-07-11 22:26:36]
  last         =        1499779596 [2017-07-11 22:26:36]
  msec_first   =               447
  msec_last    =               447
  src addr     =         10.10.1.4
  dst addr     =         10.10.1.1
  src port     =             23232
  dst port     =               179
  fwd status   =                 0
  tcp flags    =              0x02 ....S.
  proto        =                 6 TCP  
  (src)tos     =               192
  (in)packets  =                 1
  (in)bytes    =                44
  input        =                 4
  output       =                 0
  src as       =                 0
  dst as       =                 0
  src mask     =                32 10.10.1.4/32
  dst mask     =                32 10.10.1.1/32
  dst tos      =                 0
  direction    =                 0
  ip next hop  =           0.0.0.0
  ip router    =         10.10.1.1
  engine type  =                 0
  engine ID    =                 0
  received at  =     1499747221750 [2017-07-11 13:27:01.750]

Look for things that are not correct in the flow record. The following issues have been found in past support cases.

input/output: These fields should be the SNMP index number of the input or output interfaces.
first/last: This is a timestamp that the router assigns. It's important that the router time is in sync with opFlow time. opFlow uses this time to calculate statistics. For example, if the router time is an hour ahead of the server time, opFlow will not display the data until the server time catches up with the router time.
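
The first/last check can be done mechanically by comparing each record's 'first' timestamp (seconds, set by the router) with its 'received at' timestamp (milliseconds, set by the collector). The following is a sketch that assumes the raw record layout shown above, which may differ between nfdump versions; the function name flow_time_skew is made up.

```shell
# Read `nfdump -o raw` output on stdin and print, for each record,
#   <ip router> <first minus received-at, in seconds>
# A large positive value means the router clock is ahead of the collector.
flow_time_skew() {
    awk '
    $1 == "first"                  { first = $3 }
    $1 == "ip" && $2 == "router"   { router = $4 }
    $1 == "received" && $2 == "at" {
        printf "%s %d\n", router, first - int($4 / 1000)
    }'
}
```

Example: `nfdump -r nfcapd.201707111327 -o raw | flow_time_skew | sort | uniq -c`. For the sample record above this prints `10.10.1.1 32375`, meaning the router-assigned timestamp is roughly nine hours ahead of the time the server received the flow.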

omkd Troubleshooting

If mongod is not running omkd will never start. Ever.

OMK General

Node synchronization with NMIS

Generally customers trust the node data that NMIS learns dynamically and they use this to automatically update the node data for OMK applications. It's a good idea to install a cron job that automates this synchronization periodically. The following commands work well for opEvents and opConfig respectively.

/usr/local/omk/bin/opevents-cli.exe act=import_from_nmis [overwrite=0/1] [setstate=0/1]

/usr/local/omk/bin/opconfig-cli.exe act=import_from_nmis [node=nodeX|nodes=nodeA,...] [overwrite=0/1]
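
As an illustration of such a cron job, the helper below prints a cron.d-style file that runs both imports once an hour. Everything specific here is an assumption: the schedule, the root user, the choice of overwrite=1 and the suggested file name are examples only and should be adapted to the customer's environment.

```shell
# Print a cron.d-style file that imports NMIS nodes into opEvents and opConfig.
omk_sync_cron() {
    cat <<'EOF'
# Example only: hourly node import from NMIS into OMK applications
5 * * * * root /usr/local/omk/bin/opevents-cli.exe act=import_from_nmis overwrite=1
10 * * * * root /usr/local/omk/bin/opconfig-cli.exe act=import_from_nmis overwrite=1
EOF
}
```

The output could be reviewed and then saved, for example with `omk_sync_cron > /etc/cron.d/omk-node-sync` (a hypothetical file name).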

Configuration Files

If it's suspected that a particular configuration file is causing a problem, one technique to isolate the problem follows.

Backup the suspect configuration file
Copy the default configuration file from omk/install into omk/conf
Restart the associated daemons and test
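
The first two steps can be scripted so that the backup is never forgotten. This is a minimal sketch, assuming the default file in omk/install has the same name as the file in omk/conf; the function name swap_in_default_conf is invented, and restarting the associated daemons is left to the operator.

```shell
# Usage: swap_in_default_conf <file name> [omk base directory]
# Backs up conf/<file> with a timestamp suffix, then copies install/<file>
# over it. Stops without copying if the backup fails.
swap_in_default_conf() {
    file=$1
    base=${2:-/usr/local/omk}
    backup="$base/conf/$file.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$base/conf/$file" "$backup" &&
        cp -p "$base/install/$file" "$base/conf/$file" &&
        echo "backup saved as $backup"
}
```

To revert after testing, copy the .bak file back over the file in omk/conf and restart the daemons again.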

MongoDB

In-memory Sort Operations

If the following repetitive error is observed in /usr/local/omk/log/opEvents.log it may be related to a MongoDB resource issue.

[Sat Nov  3 20:47:51 2018] [error] supervisor[4683] worker process 5424 exited with code 255
[Sat Nov  3 20:47:51 2018] [info] worker process terminated after only 1s, delaying restart for 32s

Look for a corresponding error in mongod.log

2018-11-03T20:47:51.596+0000 E QUERY    [conn507] Plan executor error during find command: FAILURE, stats: { stage: "SORT", nReturned: 0, executionTimeMillisEstimate: 550, works: 459671, advanced: 0, needTime: 459670, needYield: 0, saveState: 3592, restoreState: 3592, isEOF: 0, invalidates: 0, sortPattern: { time: -1 }, memUsage: 33554492, memLimit: 33554432, inputStage: { stage: "SORT_KEY_GENERATOR", nReturned: 0, executionTimeMillisEstimate: 380, works: 459670, advanced: 0, needTime: 2, needYield: 0, saveState: 3592, restoreState: 3592, isEOF: 0, invalidates: 0, inputStage: { stage: "COLLSCAN", filter: { $and: [] }, nReturned: 459668, executionTimeMillisEstimate: 90, works: 459669, advanced: 459668, needTime: 1, needYield: 0, saveState: 3592, restoreState: 3592, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 459668 } } }

This MongoDB error is caused by an in-memory sort exceeding the default memory limit of 32 MB (memUsage 33554492 against memLimit 33554432 in the log entry above), as described here:

https://docs.mongodb.com/manual/reference/limits/#Sort-Operations

The memory limit for 'in-memory sort' operations may be increased as described here:

https://jira.mongodb.org/browse/SERVER-23768
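
To confirm that a given mongod.log entry is this in-memory sort problem, the memUsage and memLimit values can be pulled out and compared. A sketch only, assuming each log entry is on a single line and uses the field names seen in the entry above; the function name sort_mem_overshoot is made up.

```shell
# Read mongod.log lines on stdin and print, for each line that carries both
# fields: "<memUsage> <memLimit> <memUsage minus memLimit>" (all in bytes).
sort_mem_overshoot() {
    awk '
    match($0, /memUsage: [0-9]+/) {
        usage = substr($0, RSTART + 10, RLENGTH - 10)   # skip "memUsage: "
        if (match($0, /memLimit: [0-9]+/)) {
            limit = substr($0, RSTART + 10, RLENGTH - 10)
            print usage, limit, usage - limit
        }
    }'
}
```

Example: `grep 'stage: "SORT"' mongod.log | sort_mem_overshoot`. For the entry above this prints `33554492 33554432 60`, i.e. the sort went 60 bytes over the limit.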

General Troubleshooting Checklist