...
To see when the last update and collect ran for each node, use the GUI menu "System > Configuration Check > Node Admin Summary".
Running Collect and Update Jobs Manually
You may need to schedule a collect or update to run immediately, typically while you are working on a model.
Run an update job on a node called "sol" with debug and log it to a file:
Code Block |
---|
/usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.node=sol job.verbosity=9 job.force=true job.output=/tmp/sol |
The result will be something like:
Code Block |
---|
Job 6142a01930437a20d2084c91 created for node sol (05575270-a4ed-4c79-b992-18218c70ce42) and type update. |
If you get the following error, no node matched the name you gave; check the node name and try again:
Code Block |
---|
No nodes found matching your selectors! |
The debug log is written to a file whose name starts with /tmp/sol, for example:
Code Block |
---|
keith@kaos:~$ ls -lrt /tmp/sol*
-rw-r--r-- 1 root root 315334 Sep 16 11:38 /tmp/sol-1631756322.04417.log |
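Because the log file name carries a timestamp suffix, it can be handy to look up the newest file for a given prefix. The following is a minimal sketch, not part of NMIS; the helper name `latest_job_log` is hypothetical, and it assumes the `name_prefix-<timestamp>.log` naming shown above and that the newest file by modification time is the one you want.

```python
import glob
import os


def latest_job_log(prefix):
    """Return the most recently modified '<prefix>-*.log' file, or None.

    Hypothetical helper: assumes the job.output naming scheme
    name_prefix-<timestamp>.log described in this document.
    """
    # glob.escape protects any wildcard characters inside the prefix itself.
    matches = glob.glob(glob.escape(prefix) + "-*.log")
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)


# Prints the newest matching log path, or None if no such file exists.
print(latest_job_log("/tmp/sol"))
```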
To run an update on all nodes once you have finished with your new model:
Code Block |
---|
/usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.force=true |
The result will be a list of the nodes for which jobs were scheduled:
Code Block |
---|
keith@kaos:~$ /usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.force=true
Job 6142a1be9eb635425dd1c211 created for node excalibur (f8653511-9cb5-45a0-a1aa-bef81f4e34b8) and type update.
Job 6142a1be9eb635425dd1c213 created for node sif (46b8e7d2-e2d6-4ea4-8599-349fba105556) and type update.
Job 6142a1be9eb635425dd1c215 created for node sol (05575270-a4ed-4c79-b992-18218c70ce42) and type update. |
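If you script these invocations, the job ids can be pulled out of the output. The sketch below is an illustration only, assuming the "Job ... created for node ..." line format shown above and node names without whitespace; `parse_created_jobs` is a hypothetical helper, not an NMIS function.

```python
import re

# Matches the job-creation lines printed by act=schedule, as shown above.
# Assumption: node names contain no whitespace.
JOB_LINE = re.compile(
    r"Job (?P<job_id>[0-9a-f]+) created for node (?P<node>\S+) "
    r"\((?P<uuid>[0-9a-f-]+)\) and type (?P<type>\w+)\."
)


def parse_created_jobs(output):
    """Return one dict (job_id, node, uuid, type) per job-creation line."""
    return [m.groupdict() for m in JOB_LINE.finditer(output)]


sample = """\
Job 6142a1be9eb635425dd1c211 created for node excalibur (f8653511-9cb5-45a0-a1aa-bef81f4e34b8) and type update.
Job 6142a1be9eb635425dd1c213 created for node sif (46b8e7d2-e2d6-4ea4-8599-349fba105556) and type update.
"""

jobs = parse_created_jobs(sample)
for job in jobs:
    print(job["node"], job["job_id"])
```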
To view the scheduler queue:
Code Block |
---|
keith@kaos:~$ /usr/local/nmis9/bin/nmis-cli act=list-schedules verbose=t
Active Jobs:
Id When Status What Parameters
6142a2236301fbc46bb58ee1 Thu Sep 16 11:47:15 2021 In Progress since Thu Sep 16 11:47:16 2021 (Worker 16471) collect {'uuid'='afaea97b-d72d-4ffe-bd09-80df44a8295b','wantsnmp'=1,'wantwmi'=1}
6142a229b45a12c0c4863b87 Thu Sep 16 11:47:21 2021 In Progress since Thu Sep 16 11:47:26 2021 (Worker 16563) update {'force'=1,'uuid'='9cfed9b9-5395-43a9-a52e-f339e1c69c21'}
6142a229b45a12c0c4863b8b Thu Sep 16 11:47:21 2021 In Progress since Thu Sep 16 11:47:26 2021 (Worker 16663) update {'force'=1,'uuid'='42bed16d-8029-401e-bf54-fbe6c074c072'}
6142a229b45a12c0c4863b8f Thu Sep 16 11:47:21 2021 In Progress since Thu Sep 16 11:47:28 2021 (Worker 16303) update {'force'=1,'uuid'='3b1a2c57-97e7-449e-bfad-c30c2d0d645a'}
Queued Jobs:
Id When Priority What Parameters
6142a229b45a12c0c4863b90 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='7c197b17-2d50-434c-a9d2-b8f685afe75a'}
6142a229b45a12c0c4863b91 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='f8653511-9cb5-45a0-a1aa-bef81f4e34b8'}
6142a229b45a12c0c4863b92 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='46b8e7d2-e2d6-4ea4-8599-349fba105556'}
6142a229b45a12c0c4863b93 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='d51dab62-2d6e-4dba-be31-eff1f496cfcb'}
6142a229b45a12c0c4863b94 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='801c9c70-0c06-47e3-a830-76bcabf07e8a'}
6142a229b45a12c0c4863b95 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='fab72303-93dd-4eb0-a917-02c6c3f20efd'}
6142a229b45a12c0c4863b96 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='05575270-a4ed-4c79-b992-18218c70ce42'}
6142a229b45a12c0c4863b97 Thu Sep 16 11:47:21 2021 1 update {'force'=1,'uuid'='4550361e-26a8-43d6-b48d-339b986b9534'} |
Fault-recovery
If a job remains stuck as an active job for too long, the nmis daemon will abort it and reschedule a suitable new job. Such stuck jobs can appear in the queue if you terminate the nmis daemon with act=abort or service nmis9d stop, because these actions immediately kill the relevant processes and don't take active operations into account.
When and whether NMIS should attempt to recover from stuck jobs is configurable in Config.nmis under overtime_schedule, with these defaults:
Code Block |
---|
"overtime_schedule" => {
# empty, 0 or negative to not abort stuck overtime jobs
"abort_collect_after" => 900, # seconds
"abort_update_after" => 7200,
"abort_services_after" => 900,
"abort_configbackup_after" => 900, # seconds
'abort_purge_after' => 600,
'abort_dbcleanup_after' => 600,
'abort_selftest_after' => 120,
'abort_permission_test_after' => 240,
'abort_escalations_after' => 300,
'abort_metrics_after' => 300,
'abort_thresholds_after' => 300,
}, |
NMIS also warns about unexpected queue states, e.g. if there are too many overdue queued jobs or if the total number of queued jobs is excessive.
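The way these thresholds are meant to be read can be sketched as follows. This is not the daemon's actual code, only an illustration under stated assumptions: the values are copied from the defaults above (a subset), a job counts as overtime once its running time strictly exceeds the limit, and a job type with no configured limit is never aborted.

```python
from datetime import datetime, timedelta

# Subset of the overtime_schedule defaults shown above, in seconds.
OVERTIME_SCHEDULE = {
    "abort_collect_after": 900,
    "abort_update_after": 7200,
    "abort_services_after": 900,
    "abort_selftest_after": 120,
}


def is_overtime(job_type, started_at, now, schedule=OVERTIME_SCHEDULE):
    """Return True if a job of this type has run longer than its limit."""
    limit = schedule.get(f"abort_{job_type}_after")
    # Empty, 0 or negative means "do not abort stuck jobs of this type".
    # Assumption: a missing entry is treated the same way.
    if not limit or limit <= 0:
        return False
    return (now - started_at).total_seconds() > limit


started = datetime(2021, 9, 16, 11, 47, 16)
print(is_overtime("collect", started, started + timedelta(minutes=20)))
print(is_overtime("update", started, started + timedelta(minutes=20)))
```

The first call reports an overtime collect (20 minutes against a 900 second limit); the second does not, because updates are allowed 7200 seconds.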
...
- There was no default abort_plugins_after option in the configuration. This value can be added in Config.nmis:
'overtime_schedule' => {
    'abort_plugins_after' => 7200, # seconds
    ...
}
- The scheduler keeps adding these jobs to the queue. The workers can discard these jobs if you set the configuration option postpone_clashing_schedule to 0:
'postpone_clashing_schedule' => 0,
After these two changes, the nmis9d daemon needs to be restarted.
...
Code Block |
---|
./bin/nmis-cli act=schedule-help
...
Supported Arguments for Schedule Creation:
at: optional time argument for the job to commence, default is now.
job.type: job type, required, one of: collect update services
thresholds escalations metrics configbackup purge dbcleanup
selftest permission_test or plugins
job.priority: optional number between 0 (lowest) and 1 (highest) job priority.
default is 1 for manually scheduled jobs
For collect/update/services:
job.node: node name
job.uuid: node uuid
job.group: group name
All three are optional and can be repeated. If none are given,
all active nodes are chosen.
For collect:
job.wantsnmp, job.wantwmi: optional, default is 1.
For plugins:
job.phase: required, one of update or collect
job.uuid: required, one or more node uuids to operate on
job.force: optional, if set to 1 certain job types ignore scheduling policies
and bypass any cached data.
job.verbosity: optional, verbosity level for just this one job.
must be one of 1..9, debug, info, warn, error or fatal.
job.output: optional, if given as /path/name_prefix or name_prefix
then all log output for this job is saved in a separate file.
path is relative to log directory, and actual file is
name_prefix-<timestamp>.log.
job.tag: somerandomvalue
Optional, used for post-operation plugin grouping. |
For example, if you wanted to schedule a forced update operation for one particular node to be performed five minutes from now, you'd use the following invocation:
Code Block |
---|
./bin/nmis-cli act=schedule job.type=update at="now + 5 minutes" job.node=testnode job.force=1
Job 5d3a5e2d3feeed1f19c46e55 created for node testnode (6204cd3d-3cc1-4a3a-b91e-e269eb5042a4) and type update.
# or with job.priority, job.verbosity and job.output
bin/nmis-cli act=schedule job.type=update job.priority=1 job.node=testnode job.verbosity=9 job.output=/tmp/localhost.log job.force=1
Job 5e7d67dec6c2b14bd3679101 created for node testnode (3d994eb5-dcba-46de-bb90-914b5dde822f) and type update. |
If successful, nmis-cli will report the queue Id and the expanded parameters of your new job.
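When scheduling jobs from a script, the argument list can be assembled programmatically. The sketch below is one possible approach, not an official wrapper: `build_schedule_command` is a hypothetical helper, it only uses arguments listed in the schedule-help output above, and the default install path /usr/local/nmis9/bin/nmis-cli is taken from the earlier examples.

```python
import shlex


def build_schedule_command(job_type, nodes=(), at=None, force=False,
                           verbosity=None, output=None,
                           cli="/usr/local/nmis9/bin/nmis-cli"):
    """Build an argv list for 'nmis-cli act=schedule' (hypothetical helper)."""
    args = [cli, "act=schedule", f"job.type={job_type}"]
    if at is not None:
        args.append(f"at={at}")
    # job.node may be repeated, once per node name.
    args.extend(f"job.node={name}" for name in nodes)
    if force:
        args.append("job.force=1")
    if verbosity is not None:
        args.append(f"job.verbosity={verbosity}")
    if output is not None:
        args.append(f"job.output={output}")
    return args


cmd = build_schedule_command("update", nodes=["testnode"],
                             at="now + 5 minutes", force=True)
# Show the command as it could be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with values such as "now + 5 minutes".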
Administrative and Other CLI Operations
...
- If you edit or transfer NMIS files across machines then some file permissions may change for the worse, and the NMIS9 selftest may alert you about invalid file permissions. The fastest way to resolve this is to use nmis-cli with the act=fixperms argument.
- The config-backup argument instructs nmis-cli to produce a backup of your configuration files right now; normally configuration backups are performed automatically and daily.
Performance data
- NMIS 9.2 introduced new actions to collect performance data:
- collect-top-data collects and processes the top data and writes it to a CSV file. It is run every 5 minutes by a cron job. The files are purged every 8 days by default:
Code Block |
---|
./nmis-cli act=collect-top-data |
- collect-performance-data runs the set of commands specified in the file conf/performance.nmis. It is run every hour by a cron job, and its output is purged every 8 days by default:
Code Block |
---|
./nmis-cli act=collect-performance-data |
Logging and Verbosity
Standard Log Files
...