Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Overview of the Major Components of NMIS9

...

Code Block
Operation                     Frequency
Escalations                   1m30s
Metrics Computation           2m
Configuration Backup          1d
Old File Purging
Old File Purging              1h
Database Cleanup              1d
Selftest                      15m
File Permission Test      1h Database Cleanup  2h

The configuration items controlling these activities' scheduling frequencies are grouped in the  schedule  section of Config.nmis, with these defaults:

Code Block
	'schedule' => {
		# empty, 0 or negative to disable automatic  1d
Selftest                      15m
File Permission Test          2h

The configuration items controlling these activities' scheduling frequencies are grouped in the  schedule  section of Config.nmis, with these defaults:

Code Block
	'schedule' => {
		# empty, 0 or negative to disable automatic scheduling
		'schedule_configbackup' => 86400,
		'schedule_purge' => 3600,
		'schedule_dbcleanup' => 86400,
		'schedule_selftest' => 15*60,
		'schedule_permission_test' => 2*3600,
		'schedule_escalations' => 90,
		'schedule_metrics' => 120,
		'schedule_thresholds' => 120, # ignored if global_threshold is false or threshold_poll_node is true
	},

If you want to manually schedule one of these with nmis-cli, use the suffix after schedule_  as the job type, e.g. permission_test  for the  extended selftest.

Node Activity Scheduling

The node-centric actions (e.g. collect, update) are scheduled based on the node's last activity timestamps and its polling policy, which works the same as in NMIS8. Service checks are scheduled based on the service's period definition, again mostly unchanged from NMIS8.

Fault-recovery

If a job remains stuck as active job for too long then the nmis daemon will abort it and reschedule a suitable new job. Such stuck jobs can appear in the queue if you terminate the nmis daemon with act=abort  or service nmis9d stop, because these actions immediately kill the relevant processes and don't take active operations into account.

When and whether NMIS should attempt to recover from stuck jobs is configurable, in Config.nmis  under overtime_schedule, with these defaults:

Code Block
"overtime_schedule" => {
		# empty, 0 or negative to not abort stuck overtime jobs
		"abort_collect_after" => 900, # seconds
		"abort_update_after" => 7200,
		"abort_services_after" => 900,
		"abort_configbackup_after" => 900, # seconds
		'abort_purge_after' => 600,
		'abort_dbcleanup_after' => 600,
		'abort_selftest_after' => 120,
		'abort_permission_test_after' => 240,
		'abort_escalations_after' => 300,
		'abort_metrics_after' => 300,
		'abort_thresholds_after' => 300,
	},

...

scheduling
		'schedule_configbackup' => 86400,
		'schedule_purge' => 3600,
		'schedule_dbcleanup' => 86400,
		'schedule_selftest' => 15*60,
		'schedule_permission_test' => 2*3600,
		'schedule_escalations' => 90,
		'schedule_metrics' => 120,
		'schedule_thresholds' => 120, # ignored if global_threshold is false or threshold_poll_node is true
	},

If you want to manually schedule one of these with nmis-cli, use the suffix after schedule_  as the job type, e.g. permission_test  for the  extended selftest.

Node Activity Scheduling

The node-centric actions (e.g. collect, update) are scheduled based on the node's last activity timestamps and its polling policy, which works the same as in NMIS8. Service checks are scheduled based on the service's period definition, again mostly unchanged from NMIS8.

When the Updates and Collects last occurred can be found using:


Fault-recovery

If a job remains stuck as active job for too long then the nmis daemon will abort it and reschedule a suitable new job. Such stuck jobs can appear in the queue if you terminate the nmis daemon with act=abort  or service nmis9d stop, because these actions immediately kill the relevant processes and don't take active operations into account.

When and whether NMIS should attempt to recover from stuck jobs is configurable, in Config.nmis  under overtime_schedule, with these defaults:

Code Block
"overtime_schedule" => {
		# empty, 0 or negative to not abort stuck overtime jobs
		"abort_collect_after" => 900, # seconds
		"abort_update_after" => 7200,
		"abort_services_after" => 900,
		"abort_configbackup_after" => 900, # seconds
		'abort_purge_after' => 600,
		'abort_dbcleanup_after' => 600,
		'abort_selftest_after' => 120,
		'abort_permission_test_after' => 240,
		'abort_escalations_after' => 300,
		'abort_metrics_after' => 300,
		'abort_thresholds_after' => 300,
	},

NMIS also warns about unexpected queue states, e.g. if there are too many overdue queued jobs or if there are excessively many queued jobs altogether.

Parameters to prevent the queue getting too big

When the server has limited resources and cannot process the jobs in time, there is a risk of the jobs getting stacked in the queue. There are two configuration parameters that can help and can be set in Config.nmis:

  • There was no default abort_plugins_after option in the configuration. This value can be added in Config.nmis:
    'overtime_schedule' => {
        'abort_plugins_after' => 7200, # Seconds
       ...
    }
    
  • The schedule keeps adding these jobs into the queue. The workers can discard these jobs changing the configuration options postpone_clashing_schedule to 0.
    'postpone_clashing_schedule' => 0,
    

After theses two changes, nmis9d daemon needs to be restarted.

Interacting with the daemon using nmis-cli

...