Overview of the Major Components of NMIS9

**The starting directory for the examples below is "/usr/local/nmis9".**

The NMIS9 daemon bin/nmisd

...

The nmis daemon is controllable using the typical service interface, with the service name being "nmis9d"; e.g. sudo service nmis9d restart.
The daemon should be running by the end of the initial NMIS9 installation.

The primary CLI tool bin/nmis-cli

The nmis-cli tool is your primary tool for interacting with NMIS on the command line, e.g. for querying the status of the nmis daemons, for scheduling new operations and for scheduling outages.
Besides these administrative duties, the cli tool is currently the only entity that can create saved reports (which are scheduled using a minimal NMIS 9 cron job).

...

When the last updates and collects occurred can be found using:

...

the GUI, in the Menu "System > Configuration Check > Node Admin Summary".

Running Collect and Update Jobs Manually

You may need to schedule a collect or update to run immediately, typically while you are doing modelling work.

Run an update job on a node called "sol" with debug and log it to a file:

Code Block
/usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.node=sol job.verbosity=9 job.force=true job.output=/tmp/sol

The result will be something like:

Code Block
Job 6142a01930437a20d2084c91 created for node sol (05575270-a4ed-4c79-b992-18218c70ce42) and type update.

If you get the following error, no node matched the name you supplied; check the node name and try again:

Code Block
No nodes found matching your selectors!

The debug log is written to a file whose name starts with /tmp/sol, e.g.:

Code Block
keith@kaos:~$ ls -lrt /tmp/sol*
-rw-r--r-- 1 root root 315334 Sep 16 11:38 /tmp/sol-1631756322.04417.log
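
When several debug runs share the same job.output prefix, it can be handy to pick out the newest log. The following is a minimal sketch, assuming the log files follow the name_prefix-<timestamp>.log pattern shown above; the function name latest_job_log is hypothetical and not part of NMIS.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): print the most recently modified
# log file written for a given job.output prefix.
latest_job_log() {
    # $1: the prefix passed as job.output, e.g. /tmp/sol
    # ls -t lists newest first; errors are silenced if nothing matches.
    ls -t "$1"-*.log 2>/dev/null | head -n 1
}

# Example use (assumes at least one matching log exists):
#   less "$(latest_job_log /tmp/sol)"
```

The helper relies on file modification times rather than on the timestamp embedded in the file name, so it keeps working if logs are copied with their times preserved.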

To run an update on all nodes once you have finished with your new model:

Code Block
/usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.force=true

The result will be a list of nodes with jobs scheduled:

Code Block
keith@kaos:~$ /usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.force=true
Job 6142a1be9eb635425dd1c211 created for node excalibur (f8653511-9cb5-45a0-a1aa-bef81f4e34b8) and type update.
Job 6142a1be9eb635425dd1c213 created for node sif (46b8e7d2-e2d6-4ea4-8599-349fba105556) and type update.
Job 6142a1be9eb635425dd1c215 created for node sol (05575270-a4ed-4c79-b992-18218c70ce42) and type update.
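
If you want to follow up on the jobs that were just created (for example to look them up in the logs), the job Ids can be pulled out of this output. This is a rough sketch that assumes the "Job ... created for node ..." line format shown above; the function name job_ids is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): read nmis-cli act=schedule output
# on stdin and print "<job id> <node name>" for every created job.
job_ids() {
    # Only lines of the form "Job <hex id> created for node <name> ..." match;
    # shell prompts and other output are ignored.
    sed -n 's/^Job \([0-9a-f]\{1,\}\) created for node \([^ ]*\) .*/\1 \2/p'
}

# Example use:
#   /usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.force=true | job_ids
```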

To view the scheduler queue:

Code Block
keith@kaos:~$ /usr/local/nmis9/bin/nmis-cli act=list-schedules verbose=t
Active Jobs:
Id                        When                      Status                                          What                Parameters
6142a2236301fbc46bb58ee1  Thu Sep 16 11:47:15 2021  In Progress since Thu Sep 16 11:47:16 2021 (Worker 16471) collect             {'uuid'='afaea97b-d72d-4ffe-bd09-80df44a8295b','wantsnmp'=1,'wantwmi'=1}
6142a229b45a12c0c4863b87  Thu Sep 16 11:47:21 2021  In Progress since Thu Sep 16 11:47:26 2021 (Worker 16563) update              {'force'=1,'uuid'='9cfed9b9-5395-43a9-a52e-f339e1c69c21'}
6142a229b45a12c0c4863b8b  Thu Sep 16 11:47:21 2021  In Progress since Thu Sep 16 11:47:26 2021 (Worker 16663) update              {'force'=1,'uuid'='42bed16d-8029-401e-bf54-fbe6c074c072'}
6142a229b45a12c0c4863b8f  Thu Sep 16 11:47:21 2021  In Progress since Thu Sep 16 11:47:28 2021 (Worker 16303) update              {'force'=1,'uuid'='3b1a2c57-97e7-449e-bfad-c30c2d0d645a'}

Queued Jobs:
Id                        When                      Priority    What                Parameters
6142a229b45a12c0c4863b90  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='7c197b17-2d50-434c-a9d2-b8f685afe75a'}
6142a229b45a12c0c4863b91  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='f8653511-9cb5-45a0-a1aa-bef81f4e34b8'}
6142a229b45a12c0c4863b92  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='46b8e7d2-e2d6-4ea4-8599-349fba105556'}
6142a229b45a12c0c4863b93  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='d51dab62-2d6e-4dba-be31-eff1f496cfcb'}
6142a229b45a12c0c4863b94  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='801c9c70-0c06-47e3-a830-76bcabf07e8a'}
6142a229b45a12c0c4863b95  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='fab72303-93dd-4eb0-a917-02c6c3f20efd'}
6142a229b45a12c0c4863b96  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='05575270-a4ed-4c79-b992-18218c70ce42'}
6142a229b45a12c0c4863b97  Thu Sep 16 11:47:21 2021  1           update              {'force'=1,'uuid'='4550361e-26a8-43d6-b48d-339b986b9534'}
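
To keep an eye on how far the queue has drained after a bulk update, the listing can be summarised with a small filter. The sketch below assumes the output layout shown above (an "Active Jobs:" section followed by a "Queued Jobs:" section, with each job row starting with a 24-character hexadecimal Id); count_queued_jobs is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): read nmis-cli act=list-schedules
# output on stdin and print the number of rows in the "Queued Jobs:" section.
count_queued_jobs() {
    awk '
        /^Queued Jobs:/ { in_queue = 1; next }
        /^Active Jobs:/ { in_queue = 0; next }
        # Job rows start with a 24-character lowercase hexadecimal Id;
        # the column header row ("Id ...") does not match.
        in_queue && length($1) == 24 && $1 ~ /^[0-9a-f]+$/ { n++ }
        END { print n + 0 }
    '
}

# Example use:
#   /usr/local/nmis9/bin/nmis-cli act=list-schedules | count_queued_jobs
```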

Fault-recovery

If a job remains stuck as an active job for too long, the nmis daemon will abort it and reschedule a suitable new job. Such stuck jobs can appear in the queue if you terminate the nmis daemon with act=abort or service nmis9d stop, because these actions immediately kill the relevant processes and don't take active operations into account.

Whether and when NMIS should attempt to recover from stuck jobs is configurable in Config.nmis under overtime_schedule, with these defaults:

Code Block
"overtime_schedule" => {
		# empty, 0 or negative to not abort stuck overtime jobs
		"abort_collect_after" => 900, # seconds
		"abort_update_after" => 7200,
		"abort_services_after" => 900,
		"abort_configbackup_after" => 900, # seconds
		'abort_purge_after' => 600,
		'abort_dbcleanup_after' => 600,
		'abort_selftest_after' => 120,
		'abort_permission_test_after' => 240,
		'abort_escalations_after' => 300,
		'abort_metrics_after' => 300,
		'abort_thresholds_after' => 300,
	},

NMIS also warns about unexpected queue states, e.g. if there are too many overdue queued jobs or if there are excessively many queued jobs altogether.
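
The rule stated in the comment of the overtime_schedule defaults (an empty, zero or negative value means stuck jobs of that type are not aborted) can be illustrated with a small shell predicate. This only sketches the semantics and is not how NMIS implements it; the function name is hypothetical, and the strict "greater than" comparison is an assumption.

```shell
#!/bin/sh
# Illustration only (not NMIS code): succeed (exit status 0) if a job that has
# been active for $1 seconds has exceeded the abort limit $2.
# An empty, zero or negative limit disables aborting.
is_overtime() {
    elapsed="$1"
    limit="${2:-0}"    # treat an empty or missing limit as 0 (never abort)
    [ "$limit" -gt 0 ] && [ "$elapsed" -gt "$limit" ]
}

# Example: a collect job active for 1000 seconds against abort_collect_after = 900
if is_overtime 1000 900; then
    echo "collect job would be considered stuck"
fi
```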

Parameters to prevent the queue getting too big

When the server has limited resources and cannot process the jobs in time, jobs can pile up in the queue. One symptom of this can be observed in the logs:

Code Block
Performance warning: N overdue queued jobs!


There are two configuration parameters that can help and can be set in Config.nmis:

  • There is no default abort_plugins_after option in the configuration. It can be added to Config.nmis under overtime_schedule:
    'overtime_schedule' => {
        'abort_plugins_after' => 7200, # Seconds
       ...
    }
    
  • The scheduler keeps adding these jobs to the queue. The workers can discard these jobs instead if the configuration option postpone_clashing_schedule is set to 0:
    'postpone_clashing_schedule' => 0,
    

After these two changes, the nmis9d daemon needs to be restarted.
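
To check whether the queue is still backing up after such a change, you could look for the most recent overdue-jobs warning in the log. The following is a sketch under assumptions: the log file path is passed in by you (this page does not name the file), N in the message is taken to be a plain number, and the function name last_overdue_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): print the job count from the most
# recent "Performance warning: N overdue queued jobs!" line in a log file.
# Prints nothing if the file contains no such warning.
last_overdue_count() {
    # $1: path to the log file to inspect (an assumption; supply your own)
    sed -n 's/.*Performance warning: \([0-9][0-9]*\) overdue queued jobs!.*/\1/p' "$1" |
        tail -n 1
}

# Example use (the path is a placeholder):
#   last_overdue_count /path/to/nmis/logfile
```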

Interacting with the daemon using nmis-cli

Just like all other NMIS9 command line tools, nmis-cli shows an overview of its arguments and capabilities when you run it with -h or --help (or without any arguments whatsoever):

Code Block
./bin/nmis-cli 
Usage: nmis-cli [option=value...] <act=command>

 act=fixperms
 act=config-backup
 act=noderefresh
 act=daemon-status (or act=status)

 act=schedule [at=time] <job.type=activity> [job.priority=0..1] [job.X=....]
  act=schedule-help for more detailed help
 act=list-schedules [verbose=t/f] [only=active|queued] [job.X=...]
 act=delete-schedule id=<schedule_id|ALL> [job.X=...]
 act=abort id=<schedule_id>

 act=purge [simulate=t/f] [info=t/f]
 act=dbcleanup [simulate=t/f] [info=t/f]

 act=run-reports period=<day|week|month> type=<all|times|health|top10|outage|response|avail|port>

 act=list-outages [filter=X...]
 act=create-outage [outage.A=B... outage.X.Y=Z...]
 act=update-outage id=<outid> [outage.A=B... outage.X.Y=Z...]
 act={delete-outage|show-outage} id=<outid>
 act=check-outages [node=X|uuid=Y] [time=T]
  act=outage-help for more detailed help

Process Status

To find out what processes are running and doing what, use act=status or act=process-status; it'll provide you with an overview like the following example:

Code Block
./bin/nmis-cli act=status
PID             Daemon Role                                     
24084           nmisd scheduler                                 
24103           nmisd fping                                     
24109           nmisd worker services nodeOne
24111           nmisd worker <idle>                             
24113           nmisd worker collect nodeSeven                   
24115           nmisd worker <idle>                             

Normally you should have one "nmisd scheduler" process, one "nmisd fping" worker and a few workers. The default configuration (see config item nmisd_max_workers) is to start up and maintain 10 workers. In the example above two of these are idle and two are currently processing particular jobs. Please take note of each process's role and PID; both are relevant for logging (e.g. for finding particulars in the log file as well as for adjusting the logging verbosity).
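
A quick way to see whether any workers are free is to count the idle ones in the status overview. This is a minimal sketch that assumes idle workers are reported as "nmisd worker <idle>", as in the example above; count_idle_workers is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): read nmis-cli act=status output on
# stdin and print how many workers are reported as idle.
count_idle_workers() {
    awk '/nmisd worker <idle>/ { n++ } END { print n + 0 }'
}

# Example use:
#   ./bin/nmis-cli act=status | count_idle_workers
```

If the count stays at zero for long periods while the queue keeps growing, that may be a hint that the server cannot keep up with the scheduled jobs.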

...

Code Block
./bin/nmis-cli act=list-schedules verbose=1
Active Jobs:
Id                        When                      Status                                          What                Parameters
5d3a48fc0a6b3126df1a1a55  Fri Jul 26 10:27:40 2019  In Progress since Fri Jul 26 10:27:40 2019 (Worker 2511) collect                 {'force'=1,'uuid'='286d04c7-149c-4b47-9697-75cf927f3ade','wantsnmp'=1,'wantwmi'=1}
...

The important aspects of this verbose display are the 'uuid', which uniquely identifies the node in question for this particular collect operation, and the job 'Id' which is visible in the logs and can be used to abort a job if problems arise.

How to delete Queued Jobs or abort Active Jobs

You can remove queued jobs individually or wholesale using the act=delete-schedule option of nmis-cli: either pass in the job's Id (e.g. id=5d3a48fc0a6b3126df1a1a55), or use the argument id=ALL with optional further job property filters (e.g. job.type=services job.uuid=<somenodeuuid>) to delete just the matching jobs.

A similar operation is possible for aborting active jobs, but please be aware of possible negative consequences: if you abort an active job with act=abort, the worker process handling that job is forcibly terminated immediately, which may result in data corruption.
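
Because deletions cannot be undone, it can help to print the exact command before running it. The sketch below only builds and prints an act=delete-schedule command line from the arguments described above; it does not execute anything. The NMIS_CLI variable and the function name delete_schedule_cmd are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of NMIS): print the nmis-cli command
# that would delete queued jobs, without running it.
NMIS_CLI="${NMIS_CLI:-/usr/local/nmis9/bin/nmis-cli}"

delete_schedule_cmd() {
    # $1: a job Id, or the literal ALL
    # remaining arguments: optional job.X=... filters (only useful with ALL)
    id="$1"
    shift
    printf '%s act=delete-schedule id=%s' "$NMIS_CLI" "$id"
    for filter in "$@"; do
        printf ' %s' "$filter"
    done
    printf '\n'
}

# Delete a single queued job by Id:
delete_schedule_cmd 5d3a48fc0a6b3126df1a1a55
# Delete all queued services jobs:
delete_schedule_cmd ALL job.type=services
```

Once the printed command looks right, it can be copied and run as is.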

Manual Scheduling of Jobs

The nmis cli can be used to create new job schedules manually, and the expected arguments for queue management are shown when you run nmis-cli with act=schedule-help (or act=schedule without any further parameters):

Code Block
./bin/nmis-cli act=schedule-help
...
Supported Arguments for Schedule Creation:

at: optional time argument for the job to commence, default is now.

job.type: job type, required, one of: collect update services
  thresholds escalations metrics configbackup purge dbcleanup
  selftest permission_test or plugins

job.priority: optional number between 0 (lowest) and 1 (highest) job priority.
 default is 1 for manually scheduled jobs

For collect/update/services:
job.node: node name
job.uuid: node uuid
job.group: group name
  All three are optional and can be repeated. If none are given,
  all active nodes are chosen.

For collect:
job.wantsnmp, job.wantwmi: optional, default is 1.

For plugins:
job.phase: required, one of update or collect
job.uuid: required, one or more node uuids to operate on

job.force: optional, if set to 1 certain job types ignore scheduling policies
 and bypass any cached data.
job.verbosity: optional, verbosity level for just this one job.
 must be one of 1..9, debug, info, warn, error or fatal.
job.output: optional,  if given as /path/name_prefix or name_prefix
 then all log output for this job is saved in a separate file.
 path is relative to log directory, and actual file is
 name_prefix-<timestamp>.log.
job.tag: somerandomvalue
 Optional, used for post-operation plugin grouping.


For example, if you wanted to schedule a forced update operation for one particular node to be performed five minutes from now, you'd use the following invocation:

Code Block
./bin/nmis-cli act=schedule job.type=update at="now + 5 minutes" job.node=testnode job.force=1 
Job 5d3a5e2d3feeed1f19c46e55 created for node testnode (6204cd3d-3cc1-4a3a-b91e-e269eb5042a4) and type update.

# or with job.priority, job.verbosity and job.output
bin/nmis-cli act=schedule job.type=update job.priority=1 job.node=testnode job.verbosity=9 job.output=/tmp/localhost.log job.force=1
Job 5e7d67dec6c2b14bd3679101 created for node testnode (3d994eb5-dcba-46de-bb90-914b5dde822f) and type update.

If successful, nmis-cli will report the queue Id and the expanded parameters of your new job.

Administrative and Other CLI Operations

  • If you edit or transfer NMIS files across machines then some file permissions may change for the worse, and the NMIS9 selftest may alert you about invalid file permissions.
    The fastest way to resolve this is to use the nmis cli with the act=fixperms argument.
  • The config-backup  argument instructs nmis-cli to produce a backup of your configuration files right now;
    normally configuration backups are performed automatically and daily.

Performance data

  • NMIS 9.2 introduced new actions to collect performance data:
  • collect-top-data collects and processes the top data and writes it to a CSV file. It is run every 5 minutes by a cron job. The files are purged every 8 days by default:
Code Block
./bin/nmis-cli act=collect-top-data
  • collect-performance-data runs the set of commands specified in the file conf/performance.nmis. It is run every hour by a cron job, and its data is purged every 8 days by default:
Code Block
./bin/nmis-cli act=collect-performance-data


Logging and Verbosity

Standard Log Files

...