Lessons learned from support cases - common things to look for.
ADDITIONAL RESOURCES
- Troubleshooting opFlow
- Troubleshooting Open AudIT (Community/Professional/Enterprise)
- The Opmantek Support Tool
- Simplify Large Scale NMIS/OMK Server Deployments with Domain Wide Standardization Script
Lessons Learned from Support Cases
Does DNS function properly?
...
Code Block |
---|
### Check the local system's FQDN
[root@demo: ~]# hostname -f
demo.opmantek.com
### Can the local system resolve its own hostname?
[root@demo: ~]# dig +short demo.opmantek.com
192.168.88.44
### Can the system resolve other hosts?
[root@demo: ~]# dig +short freebsd.org
8.8.178.110 |
Why DNS is Important
NMIS/OMK applications expect DNS to work. Managing individual /etc/hosts files does not scale. opHA is one module in particular where this is critical. If the customer does not have a local DNS server for internal hosts, consider running BIND on the NMIS Primary server so that the other NMIS/OMK servers can use it as a name server. This is not difficult to do and will save a lot of troubleshooting time moving forward.
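The manual dig checks can be repeated for a list of names with a small helper. This is a minimal sketch, not part of NMIS/OMK: the function names are made up for illustration, and it assumes getent is available (glibc-based Linux). Unlike dig, getent goes through the system resolver, so it also reflects /etc/hosts entries.

```shell
#!/bin/sh
# Sketch: report whether names resolve through the system resolver.
# Assumption: getent(1) is available (glibc-based Linux systems).

# resolves NAME -> exit status 0 if NAME resolves, non-zero otherwise
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# check_names NAME... -> print one OK/FAIL line per name
check_names() {
    for name in "$@"; do
        if resolves "$name"; then
            echo "OK   $name"
        else
            echo "FAIL $name"
        fi
    done
}

# Example (poller1.example.com is a hypothetical host name):
#   check_names "$(hostname -f)" poller1.example.com
```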
Does the system have the correct time? Is it synced with a time server?
Code Block |
---|
[nmis@demo var]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+cachens2.onqnet 13.64.159.31     3 u  426 1024  377    4.845   -0.126   0.458
+ec2-13-54-31-22 54.252.165.245   3 u  352 1024  377   18.036    1.540   1.008
-node01.au.verbn 192.12.19.20     2 u  514 1024  377   18.966  -16.530   1.176
*ntp3.syrahost.c 218.100.43.70    2 u  422 1024  377   63.642   -1.172   0.852
[nmis@demo var]$ date -u
2017. 02. 16. (?) 22:33:31 UTC |
Compare the system UTC time with actual UTC time. A site such as https://time.is/UTC will show current UTC time.
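The comparison can also be done numerically once a reference time is known. The sketch below only does the arithmetic; the function names and the 5 second tolerance are illustrative assumptions, and the reference epoch has to come from a source you trust.

```shell
#!/bin/sh
# Sketch: compare two UTC timestamps given as epoch seconds.

# abs_skew LOCAL_EPOCH REF_EPOCH -> absolute difference in seconds
abs_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# skew_ok LOCAL_EPOCH REF_EPOCH MAX_SECONDS -> status 0 if within tolerance
skew_ok() {
    [ "$(abs_skew "$1" "$2")" -le "$3" ]
}

# Example, where $ref holds epoch seconds read from a trusted clock:
#   skew_ok "$(date -u +%s)" "$ref" 5 || echo "clock differs by more than 5s"
```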
...
- Time stamps not correct on events
- Graph data not correct
- Transactions with other systems fail (e.g. cookies could already be expired at the time of issue.)
NMIS Troubleshooting
Node Troubleshooting
Is the node reachable?
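Ping it with a large echo request. A 1472-byte payload plus 28 bytes of IP and ICMP headers makes a 1500-byte packet, which fills a standard Ethernet MTU and helps expose MTU or fragmentation problems.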
Code Block |
---|
[root@opmantek conf]# ping -c 5 -s 1472 192.168.88.254
PING 192.168.88.254 (192.168.88.254) 1472(1500) bytes of data.
1480 bytes from 192.168.88.254: icmp_seq=1 ttl=63 time=319 ms
1480 bytes from 192.168.88.254: icmp_seq=2 ttl=63 time=323 ms
1480 bytes from 192.168.88.254: icmp_seq=3 ttl=63 time=321 ms
1480 bytes from 192.168.88.254: icmp_seq=4 ttl=63 time=320 ms
1480 bytes from 192.168.88.254: icmp_seq=5 ttl=63 time=322 ms
--- 192.168.88.254 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4330ms
rtt min/avg/max/mdev = 319.542/321.519/323.551/1.450 ms |
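The 1472 in the ping above is the path MTU minus the IPv4 and ICMP headers. A small sketch of that arithmetic follows; the function name is made up for illustration, and it assumes an IPv4 header without options.

```shell
#!/bin/sh
# Sketch: largest ICMP echo payload that fits in one IPv4 packet.

# icmp_payload MTU -> payload size in bytes
icmp_payload() {
    # 20-byte IPv4 header (no options) + 8-byte ICMP header = 28 bytes
    echo $(( $1 - 28 ))
}

# Example with Linux iputils ping, where -M do forbids fragmentation:
#   ping -c 5 -M do -s "$(icmp_payload 1500)" 192.168.88.254
```

With fragmentation forbidden, a reply at this size but loss at a slightly larger size suggests the path MTU is 1500.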
What does nmap think about it?
Code Block |
---|
[root@opmantek conf]# nmap 10.10.1.1
Starting Nmap 5.51 ( http://nmap.org ) at 2017-04-04 15:05 KST
Nmap scan report for 10.10.1.1
Host is up (0.011s latency).
Not shown: 998 closed ports
PORT STATE SERVICE
22/tcp open ssh
23/tcp open telnet
Nmap done: 1 IP address (1 host up) scanned in 13.53 seconds
[root@opmantek conf]#
|
Manual Update & Collect Actions
If a node isn't providing the data we think it should, looking at manual update & collect debugs is sometimes helpful. Redirect or tee the output to a file in order to review it later.
Code Block |
---|
[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update > nodeUpdate.txt
-or-
[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=update | tee nodeUpdate.txt
###################
[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect > nodeCollect.txt
-or-
[root@opmantek ~]# /usr/local/nmis8/bin/nmis.pl node=asgard debug=9 type=collect | tee nodeCollect.txt |
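When several nodes need debugging, a wrapper that names the output file after the node, type, and time keeps the captures apart. This is a sketch only: run_debug and debug_file are hypothetical helper names, and the nmis.pl arguments are the ones shown above.

```shell
#!/bin/sh
# Sketch: run a manual update or collect at debug=9 and keep the output.
NMIS_BIN=${NMIS_BIN:-/usr/local/nmis8/bin/nmis.pl}

# debug_file NODE TYPE -> output file name, e.g. asgard-update-20170404-150500.txt
debug_file() {
    echo "$1-$2-$(date +%Y%m%d-%H%M%S).txt"
}

# run_debug NODE TYPE -> run nmis.pl and print the name of the output file
run_debug() {
    case "$2" in
        update|collect) ;;
        *) echo "type must be update or collect" >&2; return 2 ;;
    esac
    out=$(debug_file "$1" "$2")
    "$NMIS_BIN" node="$1" debug=9 type="$2" > "$out"
    echo "$out"
}

# Example:
#   run_debug asgard update
#   run_debug asgard collect
```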
Email alerts
Contacts.nmis must have the correct DutyTime format.
External Authentication
conf/Config.nmis must have the proper auth_method order as well as that method being provisioned.
If LDAP isn't working, tcpdump can be used to see the response code from the LDAP server.
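A capture along the following lines is one way to do that. It is a sketch under assumptions: the server name is hypothetical, ldap_filter is a made-up helper, and it assumes cleartext LDAP on TCP port 389. With LDAPS (port 636) the traffic is encrypted and the result code is not visible in the capture.

```shell
#!/bin/sh
# Sketch: build a capture filter for cleartext LDAP traffic.

# ldap_filter SERVER [PORT] -> pcap filter expression (PORT defaults to 389)
ldap_filter() {
    echo "host $1 and tcp port ${2:-389}"
}

# Example (run as root, stop with Ctrl-C, ldap.example.com is hypothetical):
#   tcpdump -i any -nn -s 0 -w ldap.pcap $(ldap_filter ldap.example.com)
# Open ldap.pcap in a protocol analyzer and look at the resultCode of the
# bindResponse; for example, 49 means invalidCredentials.
```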
Long collect times
Are we collecting many interfaces that are not necessary?
Check the node's view.json file for the number of interfaces and their types. Look for common attributes, such as interface type and description, that identify interfaces which do not need to be collected. Use models or Config.nmis to disable collection for them.
opCharts Troubleshooting
TopN
Use the following utility to troubleshoot how charts are being populated into TopN.
Code Block |
---|
/usr/local/omk/bin/nmis_topn_export.exe debug=true timing=1 force=1 > topnDebug.txt |
opEvents Troubleshooting
General
Grep for the following in opEvents.log:
- Event ID
- State Object ID
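A fixed-string grep with line numbers is usually enough for this. The sketch below assumes the log lives at /usr/local/omk/log/opEvents.log, which may differ on your installation; find_in_log is a made-up helper name and the search term is a placeholder.

```shell
#!/bin/sh
# Sketch: search a log for an event ID or state object ID.

# find_in_log LOGFILE TERM -> matching lines prefixed with their line number
find_in_log() {
    # -F treats TERM as a literal string, -- protects terms starting with "-"
    grep -n -F -- "$2" "$1"
}

# Example (path is an assumption, EVENT_ID is a placeholder):
#   find_in_log /usr/local/omk/log/opEvents.log "$EVENT_ID"
```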
Event not found
Look in the raw log.
If an event is skipped due to old age but the time looks correct, check whether opeventsd was running at the time the event was received.
Need a flow diagram of how and why the many opEvents rule processing files are processed.
State
When troubleshooting state it's important to realize that event.event and event.stateful are two completely different things. event.stateful is referred to as 'State Type' in the node context view. State is tracked based on event.stateful only. The state status is generally up or down and may be found in the value of event.state.
EventParserRules.nmis provides the ultimate in flexibility, allowing the user to dictate what event.stateful and event.state will be presented to opEvents. For example, event.event can be a completely different value than event.stateful.
- event.event=Apple; event.stateful=Banana; event.state=up
- event.event=Orange; event.stateful=Banana; event.state=down
With this in mind, always confirm event.stateful when troubleshooting state inconsistencies.
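The Apple/Orange example can be replayed with a few lines of awk to show why the two events end up sharing one state. This only illustrates the bookkeeping idea of keying state on event.stateful; it is not how opEvents is implemented, and the pipe-separated input format is invented for the example.

```shell
#!/bin/sh
# Sketch: track the latest state per event.stateful value.

# last_state: read "event|stateful|state" lines on stdin and print the most
# recent state for each event.stateful value, sorted by name.
last_state() {
    awk -F'|' '{ state[$2] = $3 } END { for (k in state) print k "=" state[k] }' | sort
}

# Two different event names, one shared event.stateful value:
printf '%s\n' 'Apple|Banana|up' 'Orange|Banana|down' | last_state
# prints: Banana=down
```

The event named Orange changes the state that the event named Apple set, because both carry event.stateful=Banana.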
Poller/Master State Mismatch
If state has been lost between the poller and master servers, check whether a correlation rule has fired and suppressed the more specific event.
omkd Troubleshooting
...
- User logs in, then is kicked back to the login screen; the browser cookie is already expired because the difference between the server time and the workstation time is greater than the cookie lifespan.
Perl Modules
If NMIS or OMK applications cannot locate a Perl module, it may be missing or it may have the wrong file permissions. Also check the permissions of the directories that contain it.
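Both cases can be checked from the command line. The helpers below are a sketch with made-up names; run them as the user the application runs as, since a module that loads for root may still be unreadable for that user. The module name in the example is only a placeholder.

```shell
#!/bin/sh
# Sketch: check whether perl can load a module and where it was loaded from.

# perl_module_ok MODULE -> status 0 if perl can load MODULE
perl_module_ok() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# perl_module_path MODULE -> path of the file the module was loaded from
perl_module_path() {
    # %INC is keyed by relative file name, e.g. File/Basename.pm
    file=$(echo "$1" | sed 's|::|/|g').pm
    perl -M"$1" -e 'print $INC{$ARGV[0]}, "\n"' "$file"
}

# Example (Some::Module is a placeholder):
#   perl_module_ok Some::Module || echo "cannot load Some::Module"
#   ls -l "$(perl_module_path Some::Module)"
#   ls -ld "$(dirname "$(perl_module_path Some::Module)")"
```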
OMK General
Node synchronization with NMIS
Generally customers trust the node data that NMIS learns dynamically and they use this to automatically update the node data for OMK applications. It's a good idea to install a cron job that automates this synchronization periodically. The following commands work well for opEvents and opConfig respectively.
Code Block |
---|
/usr/local/omk/bin/opevents-cli.exe act=import_from_nmis [overwrite=0/1] [setstate=0/1]
/usr/local/omk/bin/opconfig-cli.exe act=import_from_nmis [node=nodeX|nodes=nodeA,...] [overwrite=0/1] |
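A cron.d file for this could be generated as sketched below. The schedule, the file name, and the choice to run as root are assumptions for illustration; add the optional overwrite or setstate arguments shown above as needed.

```shell
#!/bin/sh
# Sketch: print cron.d entries that run both imports once a day.
OMK_BIN=${OMK_BIN:-/usr/local/omk/bin}

# sync_cron_lines -> cron.d formatted lines (minute hour dom month dow user command)
sync_cron_lines() {
    cat <<EOF
# Sync node data from NMIS into opEvents and opConfig (example schedule)
15 2 * * * root $OMK_BIN/opevents-cli.exe act=import_from_nmis
25 2 * * * root $OMK_BIN/opconfig-cli.exe act=import_from_nmis
EOF
}

# Example (hypothetical file name):
#   sync_cron_lines > /etc/cron.d/omk-node-sync
```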
Configuration Files
If it's suspected that a particular configuration file is causing a problem, one technique to isolate the problem follows.
- Back up the suspect configuration file
- Copy the default configuration file from omk/install into omk/conf
- Restart the associated daemons and test
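The first two steps above can be sketched as a small helper. The function name is made up, /usr/local/omk is assumed as the installation directory, and the file name in the example is only illustrative. The restart is left as a manual step because the daemon to restart depends on which configuration file was swapped.

```shell
#!/bin/sh
# Sketch: back up a suspect configuration file and swap in the shipped default.

# swap_in_default NAME [OMK_DIR] -> print the path of the backup that was made
swap_in_default() {
    omk=${2:-/usr/local/omk}
    [ -f "$omk/install/$1" ] || { echo "no default for $1" >&2; return 1; }
    backup="$omk/conf/$1.$(date +%Y%m%d-%H%M%S).bak"
    # -p preserves mode, ownership (when permitted), and timestamps
    cp -p "$omk/conf/$1" "$backup" || return 1
    cp -p "$omk/install/$1" "$omk/conf/$1" || return 1
    echo "$backup"
}

# Example (file name is illustrative):
#   backup=$(swap_in_default opCommon.nmis)
#   ...restart the associated daemons and test...
#   cp -p "$backup" /usr/local/omk/conf/opCommon.nmis   # restore afterwards
```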