RESOURCES FOR TROUBLESHOOTING
Additional Pages
Troubleshooting Open-AudIT (Community/Professional/Enterprise)
Lessons learned from support cases - common things to look for
Does DNS function properly?
If not, any daemon that performs name resolution will be very slow. Verify that the system has an FQDN and that the FQDN resolves to the system itself. Also check that the system can resolve other hosts.
    ### Check the local system's FQDN
    [root@demo: ~]# hostname -f
    demo.opmantek.com

    ### Can the local system resolve its own hostname?
    [root@demo: ~]# dig +short demo.opmantek.com
    192.168.88.44

    ### Can the system resolve other hosts?
    [root@demo: ~]# dig +short freebsd.org
    8.8.178.110
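The three checks above can be bundled into a small helper. This is a minimal sketch and not part of NMIS or OMK: the function names `is_fqdn`, `resolves` and `check_dns` are made up for illustration, and it assumes a glibc-based Linux host where `getent` is available. It uses `getent hosts` instead of `dig` because `getent` follows the same resolver path (including /etc/hosts) that daemons use.

```shell
# Succeed if the name looks fully qualified (contains at least one dot).
is_fqdn() {
    case $1 in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Succeed if the system resolver returns at least one address for the name.
resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

# Run the three checks and print one warning line per failed check.
# $1 is an external host name to test, e.g. freebsd.org.
check_dns() {
    fqdn=$(hostname -f 2>/dev/null)
    is_fqdn "$fqdn"  || echo "WARN: '$fqdn' is not fully qualified"
    resolves "$fqdn" || echo "WARN: cannot resolve own name '$fqdn'"
    resolves "$1"    || echo "WARN: cannot resolve other host '$1'"
}
```

For example, `check_dns freebsd.org` prints a warning for each check that fails and prints nothing when all three pass.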
DNS is Important
NMIS/OMK applications expect DNS to work. Managing individual /etc/hosts files does not scale. opHA in particular is one module where this is critical. If the customer does not have a local DNS server for internal hosts, consider running BIND on the NMIS master server; the other NMIS/OMK servers can then use it as their name server. This is not difficult to do and will save a lot of troubleshooting time later.
Does the system have the correct time? Is it synced with a time server?
    [nmis@demo var]$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    +cachens2.onqnet 13.64.159.31     3 u  426 1024  377    4.845   -0.126   0.458
    +ec2-13-54-31-22 54.252.165.245   3 u  352 1024  377   18.036    1.540   1.008
    -node01.au.verbn 192.12.19.20     2 u  514 1024  377   18.966  -16.530   1.176
    *ntp3.syrahost.c 218.100.43.70    2 u  422 1024  377   63.642   -1.172   0.852

    [nmis@demo var]$ date -u
    2017. 02. 16. (?) 22:33:31 UTC
Compare the system UTC time with actual UTC time. A site such as https://time.is/UTC will show current UTC time.
If the system time is not correct, it will cause a number of problems:
- Time stamps not correct on events
- Graph data not correct
- Transactions with other systems fail (e.g. cookies could already be expired at the time of issue.)
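The `ntpq -p` output shown earlier can also be checked mechanically. The following is a minimal sketch, assuming the classic `ntpq -p` column layout in which the ninth column is the offset in milliseconds and the peer currently selected for synchronization is marked with a leading `*`. The function names and the 100 ms limit in the usage note are illustrative choices, not NMIS requirements.

```shell
# Read `ntpq -p` output on stdin and print the offset (ms) of the selected
# peer, i.e. the line that starts with "*". Fails if no peer is selected.
selected_peer_offset() {
    awk '/^\*/ { print $9; found = 1 } END { if (!found) exit 1 }'
}

# Succeed when the absolute value of offset $1 is at most limit $2 (both ms).
offset_within() {
    awk -v off="$1" -v lim="$2" \
        'BEGIN { if (off + 0 < 0) off = -off; exit !(off + 0 <= lim + 0) }'
}
```

A possible use is `off=$(ntpq -p | selected_peer_offset) && offset_within "$off" 100 || echo "time sync problem"`. If no line starts with `*`, the system is not currently synchronized to any peer, which is itself worth investigating.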
Perl Modules
If NMIS or OMK applications cannot locate a Perl module, the module may be missing or it may have the wrong file permissions. Also check the permissions on the directories that contain it.
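One way to tell a missing module from a permissions problem is sketched below. The helper names are hypothetical, and `Net::SNMP` in the usage note is only an example module name. The sketch relies on two standard Perl behaviours: `perl -MName -e 1` fails if the module cannot be loaded, and `@INC` lists the directories Perl searches.

```shell
# Convert a module name such as Net::SNMP to its relative path Net/SNMP.pm.
module_to_path() {
    printf '%s.pm\n' "$1" | sed 's|::|/|g'
}

# Succeed if the current user can load the module.
can_load() {
    perl -M"$1" -e 1
}

# List every copy of the module found under @INC, together with the
# permissions of the file and of the @INC directory that holds it.
locate_module() {
    rel=$(module_to_path "$1")
    perl -e 'print "$_\n" for @INC' | while IFS= read -r dir; do
        if [ -e "$dir/$rel" ]; then ls -ld "$dir" "$dir/$rel"; fi
    done
    return 0
}
```

Running `can_load Net::SNMP` as the user the application runs as, and then `locate_module Net::SNMP` as root, helps separate the two cases: if root finds the file but the application user cannot load it, permissions are the likely cause.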
NMIS Troubleshooting
Node Troubleshooting
Is the node reachable?
Ping it with a big echo request.
    [root@opmantek conf]# ping -c 5 -s 1472 192.168.88.254
    PING 192.168.88.254 (192.168.88.254) 1472(1500) bytes of data.
    1480 bytes from 192.168.88.254: icmp_seq=1 ttl=63 time=319 ms
    1480 bytes from 192.168.88.254: icmp_seq=2 ttl=63 time=323 ms
    1480 bytes from 192.168.88.254: icmp_seq=3 ttl=63 time=321 ms
    1480 bytes from 192.168.88.254: icmp_seq=4 ttl=63 time=320 ms
    1480 bytes from 192.168.88.254: icmp_seq=5 ttl=63 time=322 ms

    --- 192.168.88.254 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4330ms
    rtt min/avg/max/mdev = 319.542/321.519/323.551/1.450 ms
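The payload size of 1472 bytes is not arbitrary: it is the largest ICMP payload that fits in a 1500-byte IPv4 packet, as the "1472(1500)" in the output shows. The small helper below (the name is hypothetical) makes the arithmetic explicit for other MTU values.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}
```

On Linux, the iputils `ping` also accepts `-M do` to forbid fragmentation, so a variation such as `ping -c 5 -M do -s "$(icmp_payload_for_mtu 1500)" 192.168.88.254` fails visibly if some hop on the path has a smaller MTU instead of silently fragmenting the request.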
What does nmap think about it?
    [root@opmantek conf]# nmap 10.10.1.1

    Starting Nmap 5.51 ( http://nmap.org ) at 2017-04-04 15:05 KST
    Nmap scan report for 10.10.1.1
    Host is up (0.011s latency).
    Not shown: 998 closed ports
    PORT   STATE SERVICE
    22/tcp open  ssh
    23/tcp open  telnet

    Nmap done: 1 IP address (1 host up) scanned in 13.53 seconds
Node Not Present in GUI
Example Case:
Suddenly the node cannot be found in the GUI. When attempting to re-add the node to NMIS via the GUI we receive a 'node already exists' error.
Issue:
The stored data for the node has become corrupt. We need to purge NMIS of all configuration relating to that node.
Actions:
- Open /usr/local/nmis8/conf/Nodes.nmis with an editor and delete the section for the problem node.
- Remove the following files:
  - /usr/local/nmis8/var/<node-name>-node.json
  - /usr/local/nmis8/var/<node-name>-view.json
- Re-add the problem node via the NMIS GUI
- Run the following commands:
  - /usr/local/nmis8/bin/nmis.pl type=update node=<node-name> force=true
  - /usr/local/nmis8/bin/nmis.pl type=collect node=<node-name> force=true
Verify
The problem node should now be functioning properly in the NMIS GUI.
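The file removal and the final update and collect steps of this procedure can be scripted. The sketch below is an illustration under stated assumptions, not an NMIS tool: the function names are hypothetical, and it moves the two files into a backup directory rather than deleting them, which is a more cautious variation of the documented step. Editing Nodes.nmis and re-adding the node in the GUI remain manual.

```shell
# Base directory of the NMIS installation (overridable for testing).
NMIS_BASE=${NMIS_BASE:-/usr/local/nmis8}

# Print the two state files that belong to a node.
node_state_files() {
    printf '%s\n' "$NMIS_BASE/var/$1-node.json" "$NMIS_BASE/var/$1-view.json"
}

# Move the node's state files into a backup directory (variation on "remove").
# $1 = node name, $2 = backup directory outside $NMIS_BASE/var.
stash_node_state() {
    node=$1 backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    node_state_files "$node" | while IFS= read -r f; do
        if [ -e "$f" ]; then mv -- "$f" "$backup_dir/"; fi
    done
}

# Run update and then collect, after the node has been re-added via the GUI.
rebuild_node() {
    "$NMIS_BASE/bin/nmis.pl" type=update node="$1" force=true &&
    "$NMIS_BASE/bin/nmis.pl" type=collect node="$1" force=true
}
```

A typical sequence would be `stash_node_state <node-name> /root/nmis-node-backup`, then the manual GUI step, then `rebuild_node <node-name>`. The backup path is only an example.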
Manual Update & Collect Actions
If a node isn't providing the data we think it should, it is sometimes helpful to look at the debug output of a manual update and collect. Redirect or tee the output to a file in order to review it later.
    [root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update > nodeUpdate.txt
    -or-
    [root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update | tee nodeUpdate.txt

    ###################

    [root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect > nodeCollect.txt
    -or-
    [root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect | tee nodeCollect.txt
Email alerts
Contacts.nmis must have the correct DutyTime format.
External Authentication
conf/Config.nmis must have the proper auth_method order, and each method listed there must actually be provisioned.
If LDAP isn't working, tcpdump can be used to see the response code from the LDAP server.
Long collect times
Are we collecting many interfaces that are not necessary?
Check the node's view.json file for the number of interfaces and their types. Look for common attributes, such as interface type and description, that identify the interfaces that do not need to be collected. Use models or Config.nmis to disable collection for them.
Syslog
When troubleshooting syslog issues, the following script will gather more rsyslog daemon information than the NMIS support tool.
snmptrapd
When troubleshooting snmptrapd issues, the following script will gather more snmptrapd daemon information than the NMIS support tool.
Models
When troubleshooting models, it's important to know whether every OID 'friendly name' referenced within the Model files has been defined in /usr/local/nmis8/mibs/nmis_mibs.oid. Some Model files import or call other Model, Graph or Common files, so if an OID 'friendly name' has not been defined in nmis_mibs.oid, it may not be obvious which model file is causing the problem. In order to validate friendly names more easily, the script below has been provided. It parses all the OID friendly names out of the model files and looks for them in nmis_mibs.oid. If any are not found, the operator is notified. At some point this script should be converted to Perl, which would make it much faster.
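The script referred to above is not reproduced in this text. As a stand-in, the following is an independent minimal sketch of the same idea, written under two assumptions that should be checked against the actual files: that model files reference friendly names in the form `'oid' => 'name'`, and that nmis_mibs.oid defines them one per line in the form `"name" "1.2.3.4"`. The function names are hypothetical. Numeric OIDs used directly in a model are ignored.

```shell
# List friendly names referenced under a models directory.
# Assumed form in the model files: 'oid' => 'ifDescr'
model_oid_names() {
    grep -rhoE "'oid'[[:space:]]*=>[[:space:]]*'[A-Za-z][A-Za-z0-9_.-]*'" "$1" |
        sed -E "s/.*=>[[:space:]]*'([^']*)'/\1/" | sort -u
}

# List friendly names defined in the OID file.
# Assumed form in the OID file: "ifDescr" "1.3.6.1.2.1.2.2.1.2"
defined_oid_names() {
    sed -nE 's/^[[:space:]]*"([^"]+)"[[:space:]]+"[0-9.]+".*/\1/p' "$1" | sort -u
}

# Print friendly names that the models use but the OID file does not define.
# $1 = models directory, $2 = path to nmis_mibs.oid
missing_oid_names() (
    defs=$(mktemp) || exit 1
    defined_oid_names "$2" > "$defs"
    model_oid_names "$1" | grep -Fxv -f "$defs"
    rm -f "$defs"
)
```

Assuming the models live in /usr/local/nmis8/models, a run would look like `missing_oid_names /usr/local/nmis8/models /usr/local/nmis8/mibs/nmis_mibs.oid`. To find which model file references a reported name, `grep -rl` for that name in the models directory.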
opCharts Troubleshooting
TopN
Use the following utility to troubleshoot how charts are being populated into TopN:
/usr/local/omk/bin/nmis_topn_export.exe debug=true timing=1 force=1 > topnDebug.txt
RBAC (Role Based Access Control)
General scheme.
- Create role.
- Create user and assign a role.
- Create an object and assign a privilege tag.
- Assign the privilege tag to a role.
Based on this scheme, the following script was created to pull all the role, user, object and privilege data out of a customer system.
OMK General
Node synchronization with NMIS
Generally, customers trust the node data that NMIS learns dynamically, and they use it to automatically update the node data for OMK applications. It's a good idea to install a cron job that runs this synchronization periodically. The following commands work well for opEvents and opConfig respectively.
    /usr/local/omk/bin/opevents-cli.exe act=import_from_nmis [overwrite=0/1] [setstate=0/1]
    /usr/local/omk/bin/opconfig-cli.exe act=import_from_nmis [node=nodeX|nodes=nodeA,...] [overwrite=0/1]
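For the cron job, the two commands can be wrapped in one small script. This is a sketch only: the function name is hypothetical, and `overwrite=1` is an assumed choice that suits sites where NMIS is treated as the source of truth. Use `overwrite=0` or omit the option if locally edited node data should be kept.

```shell
# Directory holding the OMK command line tools (overridable for testing).
OMK_BIN=${OMK_BIN:-/usr/local/omk/bin}

# Import node data from NMIS into opEvents and opConfig.
# Both imports are attempted; the return code is non-zero if either fails.
sync_from_nmis() {
    rc=0
    "$OMK_BIN/opevents-cli.exe" act=import_from_nmis overwrite=1 || rc=1
    "$OMK_BIN/opconfig-cli.exe" act=import_from_nmis overwrite=1 || rc=1
    return $rc
}
```

Saved as a script that ends with a call to `sync_from_nmis`, this can be scheduled from root's crontab at whatever interval suits the site, for example hourly.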
Configuration Files
If a particular configuration file is suspected of causing a problem, the following technique can help to isolate it.
- Back up the suspect configuration file
- Copy the default configuration file from omk/install into omk/conf
- Restart the associated daemons and test
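The first two steps above can be sketched as a pair of shell helpers. The function names are hypothetical, the base path /usr/local/omk is inferred from the tool paths used elsewhere on this page, and `opCommon.nmis` in the usage note is only an example file name. Restarting the daemons is left out because the services involved depend on the file being tested.

```shell
# Base directory of the OMK installation (overridable for testing).
OMK_BASE=${OMK_BASE:-/usr/local/omk}

# Back up conf/<file> into a backup directory, then overwrite it with the
# default from install/<file>. The second copy is a plain cp so that the
# existing file in conf keeps its owner and mode.
# $1 = configuration file name, $2 = backup directory
swap_in_default() {
    file=$1 backup_dir=$2
    mkdir -p "$backup_dir" &&
    cp -p "$OMK_BASE/conf/$file" "$backup_dir/$file" &&
    cp "$OMK_BASE/install/$file" "$OMK_BASE/conf/$file"
}

# Put the backed-up file back once the test is finished.
restore_backup() {
    cp -p "$2/$1" "$OMK_BASE/conf/$1"
}
```

For example, `swap_in_default opCommon.nmis /root/omk-conf-backup` followed by a daemon restart shows whether the problem disappears with the default file. `restore_backup opCommon.nmis /root/omk-conf-backup` then returns the system to its previous configuration.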