Many factors determine the system health of a server. Hardware capability - CPU, memory and disk - is an important one, but so is the server load: the number of devices (nodes to be polled, updated, audited and synchronised), the number of products (NMIS, OAE, opCharts, opHA - each running different processes) and the number of concurrent users.
We all want the best performance from a server, and to make the most of the physical resources the configuration has to be fine-tuned. This guide provides recommended parameters; they may not suit every case, as server performance depends on many factors.
Related Articles
- Scaling NMIS Polling
- Scaling NMIS polling - how NMIS handles long running processes
- Recommended Configuration for Server Performance
NMIS 9
Before Starting
The first thing to do is to gather information about our system:
- System information: The NMIS and OMK support tools will give us all the information needed (see the sketch after this list).
- Monitor services: NMIS can monitor the involved processes - apache2, nmis9d, omkd and mongod - and provide useful information about CPU and memory usage, among other metrics.
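A hedged sketch of how the support data is usually gathered; the install paths and the support.pl entry points are assumptions based on a default install and may differ on your system:
```
# NMIS 9 support tool (path assumed: /usr/local/nmis9)
sudo /usr/local/nmis9/admin/support.pl action=collect

# OMK support tool (path assumed: /usr/local/omk)
sudo /usr/local/omk/bin/support.pl action=collect
```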
Number of processes
NMIS 9 runs a daemon (nmis9d) that periodically collects information from the nodes.
The number of workers is set with the parameter:
nmisd_max_workers
The default is 10.
Some approximate configurations:
Number of nodes | Number of threads |
---|---|
120 | 3-4 |
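A minimal sketch of where nmisd_max_workers is usually set; the path and surrounding structure are assumptions based on a default NMIS 9 install (.nmis configuration files are Perl hashes), so adjust to your environment:
```
# /usr/local/nmis9/conf/Config.nmis (path assumed; the exact section may differ)
'nmisd_max_workers' => 10,   # number of worker processes the nmis9d daemon may run
```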
OMK has the equivalent parameter:
omkd_workers
Setting omkd_max_requests as well will help the workers restart gracefully before they grow too large:
omkd_max_requests
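A hedged sketch of how these might look in the OMK configuration; the file path and the 'omkd' section name are assumptions based on a default install (newer OMK releases may use opCommon.json instead):
```
# /usr/local/omk/conf/opCommon.nmis (path and section assumed)
'omkd' => {
    'omkd_workers' => 10,         # number of omkd worker processes
    'omkd_max_requests' => 500,   # recycle a worker after this many requests to limit memory growth
},
```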
MongoDB memory usage
MongoDB, in its default configuration, will use the larger of either 256 MB or ½ of (RAM - 1 GB) for its cache size; on a 16 GB server, for example, that is (16 - 1) / 2 = 7.5 GB.
The MongoDB cache size can be changed by adding the cacheSizeGB setting to the /etc/mongod.conf configuration file, as shown below.
```
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```
Here is some useful information about how MongoDB reserves memory for its internal cache and for WiredTiger, the underlying storage engine, along with some adjustments that can be made: https://dba.stackexchange.com/questions/148395/mongodb-using-too-much-memory
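A hedged sketch of applying and verifying the change, assuming a systemd-based install and the legacy mongo shell (newer installs use mongosh):
```
# restart mongod so the new cache size takes effect
sudo systemctl restart mongod

# report the configured WiredTiger cache size in bytes
mongo --quiet --eval 'print(db.serverStatus().wiredTiger.cache["maximum bytes configured"])'
```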
Server examples
Two servers are compared in this section.
- The master has only one local node, but more than 400 poller nodes. The opHA process is what requires the most CPU and memory.
- The poller has more than 500 nodes. The nmis process requires the most CPU and memory, as it polls information for all of those nodes.
Stressed system POLLER-NINE
System information:
Name | Value |
---|---|
nmisd_max_workers | 10 |
omkd_workers | 4 |
omkd_max_requests | 500 |
Nodes | 406 |
Active Nodes | 507 |
OS | Ubuntu 18.04.3 LTS |
role | poller |
This is how the server memory graphs look in a stressed system. We will focus on memory, as that is where the bottleneck is:
The NMIS process stays stable, using no more than 120 MB, but at one point the process was stopped - probably killed by the system due to high memory usage. One way to confirm this is to check the kernel log for OOM killer activity, as sketched below:
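A hedged sketch of checking for OOM kills; the exact log locations vary by distribution:
```
# look for OOM killer activity in the kernel log
dmesg -T | grep -i -E 'out of memory|oom-killer'

# or, on systemd-based systems
journalctl -k | grep -i -E 'out of memory|oom-killer'
```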
The OMK process shows more fluctuation and higher memory usage, with peaks of up to 800 MB. The memory trend is upwards:
And mongod keeps using a lot of memory - 3 GB, as configured - but it is stable:
Check the processes once nmis9d has been restarted:
top
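A hedged alternative that narrows the view to the daemons discussed above (note that top -p accepts at most 20 PIDs, so this assumes a modest process count):
```
# show only the relevant daemons, sorted by resident memory
top -o %MEM -p "$(pgrep -d, -f 'nmis9d|omkd|mongod|apache2')"

# or a one-shot view with ps
ps -eo pid,rss,pcpu,cmd --sort=-rss | grep -E 'nmis9d|omkd|mongod|apache2'
```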
Healthy system MASTER-NINE
System information:
Name | Value |
---|---|
nmisd_max_workers | 5 |
omkd_workers | 10 |
omkd_max_requests | undef |
Nodes | 2 |
Poller Nodes | 536 |
OS | Ubuntu 18.04.3 LTS |
role | master |
This is how the server memory graphs look in a healthy system:
Daemon graphs:
omk:
mongo:
NMIS 8
The main NMIS 8 process is called from several cron jobs to run different operations: collect, update, summary, master, ...
For a collect or an update, the main process forks children to perform the requested operation.
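A hedged sketch of what the cron entries can look like, assuming a default /usr/local/nmis8 install; the schedules are illustrative and the collect line uses the options discussed below:
```
# /etc/cron.d/nmis - example entries; paths and schedules vary per install
# hourly collect (as on the example server discussed below)
0 * * * *   root  /usr/local/nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true
# daily update
30 20 * * * root  /usr/local/nmis8/bin/nmis.pl type=update mthread=true
```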
Configurations that affect performance
There are some important configuration options that affect performance:
- abort_after: From NMIS 8.6.8G there is a new command line option, abort_after, that prevents the main process from running for too long and colliding with the next cron job. By default this parameter is 60 seconds, as the cron job is set to run every 60 minutes.
This option always needs to be combined with mthread=true.
nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;
- max_threads: The other important configuration option is max_threads, which prevents the number of children of the main process from growing too large. Considerations:
- If the collect has a lot of nodes to process, the number of children won't reach the limit instantly. While the main process is forking, children complete their jobs and exit, and the main process waits for them to change state, so the number increases slowly.
- NMIS can have more than one instance of the main process running, and the total number of children can be higher than max_threads, as the limit applies per instance (see the check sketched below).
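A hedged way to check how many nmis.pl instances and children are actually running at a given moment:
```
# count running nmis.pl processes (parents and forked children)
pgrep -fc 'nmis.pl'

# list them with their parent PIDs to see which instance spawned which child
ps -eo pid,ppid,etime,cmd | grep '[n]mis.pl'
```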
Gaps in Graphs
A symptom of an overloaded server can be gaps in the graphs.
If the server takes a long time to collect and cannot complete every operation, a useful tool is nmis8/admin/polling_summary. There we can see how many nodes have a late collect, plus a summary of nodes being collected and not collected (see the example below).
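A hedged example of running it; the install path and .pl extension are assumed, and the summary line follows the format quoted in the table below:
```
/usr/local/nmis8/admin/polling_summary.pl
# the summary line reports polling health, e.g.:
# totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265
```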
This is an example of how these parameters can impact the performance of a server with 64 CPUs and more than 3700 nodes:
When | abort_after | demote_faulty_nodes | CPU | Nodes Not Collected | Other |
---|---|---|---|---|---|
Initial Configuration | Default (60 sec) | false | <50% (approx.) | ~1100 | totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265 |
Test 1 | 120 | true | <50% (approx.) | ~500 | N/A |
Test 2 | 240 | true | <60% (approx.) | ~240 | totalPoll=1229 ontime=998 no_snmp=14 demoted=0 1x_late=217 3x_late=0 12x_late=0 144x_late=0 |
Test 3 | 0 (disabled) | true | Around 100% | 0 | Took 7 minutes. Processed >3000 nodes. Cron disabled. |
Test 4 | 0 (disabled) | true | 100% (approx.) | N/A | Commented out the while loop (wait for children) in nmis.pl |
Test 5 | 0 (disabled) | false | 100% (approx.) | N/A | N/A |
Note that problems in the modelling that throw errors in the logs can also make the system slow.
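A hedged way to spot this; the log path assumes a default NMIS 8 install:
```
# look for recent modelling/collect errors in the NMIS log
grep -iE 'error|fatal' /usr/local/nmis8/logs/nmis.log | tail -n 20
```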
SUPPORT-6976