...
Mongo cluster heartbeat check on Main Primary
```
shankarn@opha-dev4:/usr/local/omk/conf$ mongosh --username opUserRW --password op42flow42 admin --port 27017
rs1 [direct: primary] admin> rs.status()
{
  ...
  members: [
    {
      _id: 0,
      name: 'opha-dev4.opmantek.net:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 17503,
      optime: { ts: Timestamp({ t: 1763526818, i: 9 }), t: Long('7') },
      optimeDate: ISODate('2025-11-19T04:33:38.000Z'),
      lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
      lastDurableWallTime: ISODate('2025-11-19T04:33:38.190Z'),
    },
    {
      _id: 1,
      name: 'opha-dev7.opmantek.net:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 17496,
      optime: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
      optimeDurable: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
      optimeDate: ISODate('2025-11-19T04:33:34.000Z'),
      optimeDurableDate: ISODate('2025-11-19T04:33:34.000Z'),
      lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
      lastDurableWallTime: ISODate('2025-11-19T04:33:38.225Z'),
      lastHeartbeat: ISODate('2025-11-19T04:33:36.300Z'),
      lastHeartbeatRecv: ISODate('2025-11-19T04:33:37.493Z'),
    },
    {
      _id: 2,
      name: 'opha-dev6.opmantek.net:27018',
      health: 1,
      state: 7,
      stateStr: 'ARBITER',
      uptime: 17496,
      lastHeartbeat: ISODate('2025-11-19T04:33:36.301Z'),
      lastHeartbeatRecv: ISODate('2025-11-19T04:33:36.290Z'),
    }
  ],
```
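If you only need a quick pass/fail view rather than the full rs.status() document, a one-liner like the following can help (a sketch using mongosh's standard --quiet and --eval flags, with the same credentials and port as above):

```
# Print each replica set member's name, role and health (1 = healthy)
mongosh --username opUserRW --password op42flow42 admin --port 27017 --quiet \
  --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr, m.health))'
```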
Scenario 1: Using opHA 4 "Pull" on the Primary to synchronize nmisng collections.
If the system is being upgraded from opHA 4.1.2 to opHA 5.1.1, it is a good idea to do a "Pull" on the Primary opHA portal before upgrading to opHA 5.1.1.
If the Poller has been running for a while, it is better to move it to opHA 4 and then do a "Pull" to sync the data. After the sync has completed, it is easy to move it back to opHA-MB.
Move the desired Poller to opHA 4 to sync up to the latest data (opha5 => opha4).
Peer: pause the message bus on the Peer:
```
/usr/local/omk/bin/ophad cmd producer pause
```
Primary: on the opHA-MB peer portal, do a "Pull" to sync data from the Peer that has been paused.
...
Move the desired Poller back to opHA 5 (opha4 => opha5).
This command sets opHA to start using the message bus again:
```
/usr/local/omk/bin/ophad cmd producer start
```
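Putting the Peer-side steps together (a recap of the commands above; the "Pull" itself is performed from the Primary's opHA portal in between):

```
# On the Peer: pause streaming before the Pull
/usr/local/omk/bin/ophad cmd producer pause

# ...perform the "Pull" from the Primary opHA portal, then resume streaming:
/usr/local/omk/bin/ophad cmd producer start
```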
...
Scenario 2: opHA-MB failover and failback commands.
State: you can get the state of the Peers on the Main Primary using the CLI:
```
sudo /usr/local/omk/bin/ophad cmd consumer state
```
Failover: if a Poller goes down, the Mirror takes over automatically. However, once the Poller comes back online, the switchover from Mirror back to Poller is not automatic.
Failback: there is a CLI command to accomplish this, which needs to be run on the Main Primary (and Primary):
```
sudo /usr/local/omk/bin/ophad cmd consumer failback <Poller Cluster ID>
```
There is also a way to force a failover, which again needs to be run on the Main Primary (and Primary):
```
sudo /usr/local/omk/bin/ophad cmd consumer failover <Poller Cluster ID>
```
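As an illustration, a forced failover followed by a later failback would look like the following (the cluster ID shown is illustrative; substitute your Poller's actual Cluster ID):

```
# Force the Mirror to take over for this Poller (illustrative cluster ID)
sudo /usr/local/omk/bin/ophad cmd consumer failover 783d7b91-6c64-4db9-a28f-6364a54b8505

# Once the Poller is healthy again, switch back to it
sudo /usr/local/omk/bin/ophad cmd consumer failback 783d7b91-6c64-4db9-a28f-6364a54b8505
```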
Scenario 3: (Replication mode) the Main Primary goes down while in replication mode.
Switching Main and Secondary Primary Servers
In the unforeseen event that the Main Primary server goes down, the Secondary Primary will take over, become the primary server, and keep the system running. Once the Main Primary server has been recovered, restart all of its services by running the following command.
Run as the root user:
```
systemctl restart nmis9d opchartsd opeventsd omkd ophad
```
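To confirm the services came back up, a quick check (systemctl is-active prints one state per unit):

```
# Each line should read "active"
systemctl is-active nmis9d opchartsd opeventsd omkd ophad
```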
To switch from the Secondary Primary back to the Main Primary, so that the Main Primary is the master again, follow these steps:
Connect to MongoDB on the current master server (in this case the Secondary Primary):
```
mongosh --username opUserRW --password op42flow42 admin
```
Update the member priorities:
```
cfg = rs.conf()
cfg.members[0].priority = 0.6
cfg.members[1].priority = 0.5
rs.reconfig(cfg)
```
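Once rs.reconfig(cfg) has been applied and the election settles, you can confirm the result from the same mongosh session (a sketch using standard replica set helpers):

```
// Check that the new priorities took effect
rs.conf().members.forEach(m => print(m.host, m.priority))

// Show which member is currently PRIMARY
rs.status().members.filter(m => m.stateStr === 'PRIMARY').forEach(m => print(m.name))
```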
Enable logging:
ophad logging: in /usr/local/omk/conf/opCommon.json, under "opha", add the line:
```
"ophad_logfile" : "/usr/local/omk/log/ophad.log",
```
For example:
```
"opha" : {
    "opha_role" : "Main Primary",
    "ophad_logfile" : "/usr/local/omk/log/ophad.log",
    "ophad_streaming_apps" : [
        "nmis",
        "opevents"
    ],
```
nats-server logging: add the following line to /etc/nats-server.conf:
```
log_file: "/var/log/nats-server.log"
```
For example:
```
shankarn@opha-dev4:~$ cat /etc/nats-server.conf
server_name: "opha-dev4.opmantek.net"
http_port: 8222
listen: 4222

jetstream: enabled

#tls {
#    cert_file: "<path>"
#    key_file: "<path>"
#    #ca_file: "<path>"
#    verify: true
#}

log_file: "/var/log/nats-server.log"
```
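For the new log_file setting to take effect, nats-server needs a restart. Assuming it runs under systemd with the unit name nats-server (verify the unit name on your system), something like:

```
# Restart nats-server and watch the new log
sudo systemctl restart nats-server
sudo tail -f /var/log/nats-server.log
```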
Debugging guide:
Scenario 1: ophad doesn't come up
Check sudo journalctl -f -u ophad:
```
shankarn@opha-dev2:/usr/local/omk/log$ sudo journalctl -f -u ophad
-- Journal begins at Fri 2024-09-06 16:23:19 AEST. --
Aug 01 10:15:59 opha-dev2 ophad[46242]: ophad v0.0.0: agent
Aug 01 10:16:01 opha-dev2 ophad[46242]: cannot init logger: cannot create logfile open /usr/local/omk/log/ophad.log: permission denied
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Main process exited, code=exited, status=1/FAILURE
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Failed with result 'exit-code'.
```
Edit /etc/systemd/system/ophad.service to remove the lines below:
```
Type=simple
User=root
Group=root
```
For reference, the unit file after the edit:
```
cat /etc/systemd/system/ophad.service.bkup
[Unit]
Description=opHA daemon
After=network-online.target
Wants=network-online.target

[Service]
#on failure try to restart every RestartSec, upto StartLimitBurst times within StartLimitInterval
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=10
WorkingDirectory=/usr/local/omk
ExecStart=/usr/local/omk/bin/ophad agent --streaming-type=nats

[Install]
```
Reload systemd and restart ophad:
```
sudo systemctl daemon-reload
sudo systemctl restart ophad
```
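A quick way to confirm ophad stayed up and the logger initialized this time (a sketch; paths as configured above):

```
# Service state and the most recent log lines
sudo systemctl status ophad --no-pager
sudo tail -n 20 /usr/local/omk/log/ophad.log
```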
Scenario 4: Using the ophad command line to verify the configuration and connection status
Run the command sudo /usr/local/omk/bin/ophad verify on all the Peers/Primary.
The last line, "ophad.verify: ready for liftoff 🚀", indicates the configuration is good.
```
shankarn@opha-dev5:~$ sudo /usr/local/omk/bin/ophad verify
[sudo] password for shankarn:
ophad v0.0.52: agent
Appending to file "/usr/local/omk/log/ophad.log"
Settings -----------------------------------------
* ClusterId: 783d7b91-6c64-4db9-a28f-6364a54b8505
* OMKDatabase:
  * ConnectionTimeout: 5h33m20s
  * RetryTimeout: 3m0s
  * PingTimeout: 33m20s
  * QueryTimeout: 1h23m20s
  * Port: 27017
  * Server: localhost
  * MongoCluster: []
  * ReplicaSet: (blank)
  * Name: omk_shared
  * Username: opUserRW
  * Password: ******
  * WriteConcern: 1
  * Uri: (blank)
  * BatchSize: 0
  * BatchTimeout: 0
* NMISDatabase:
  * ConnectionTimeout: 2m0s
  * RetryTimeout: 3m0s
  * PingTimeout: 20s
  * QueryTimeout: 1h23m20s
  * Port: 27017
  * Server: localhost
  * MongoCluster: []
  * ReplicaSet: (blank)
  * Name: nmisng
  * Username: opUserRW
  * Password: ******
  * WriteConcern: 1
  * Uri: (blank)
  * BatchSize: 50
  * BatchTimeout: 500
* OpEventsDatabase:
  * ConnectionTimeout: 2m0s
  * RetryTimeout: 3m0s
  * PingTimeout: 20s
  * QueryTimeout: 5m0s
  * Port: 27017
  * Server: localhost
  * MongoCluster: []
  * ReplicaSet: (blank)
  * Name: opevents
  * Username: opUserRW
  * Password: ******
  * WriteConcern: 1
  * Uri: (blank)
  * BatchSize: 50
  * BatchTimeout: 500
* OMK:
  * LogLevel: info
  * BindAddr: *
  * Directories:
    * Base: /usr/local/omk
    * Conf: /usr/local/omk/conf
    * Logs: /usr/local/omk/log
    * Var: /usr/local/omk/var
* OPHA:
  * DBName: opha
  * StreamingApps: [nmis opevents]
  * Logfile: /usr/local/omk/log/ophad.log
  * MongoWatchFilters: []
  * StreamType: nats
  * AgentPort: 6000
  * NonActiveTimeout: 8m0s
  * ResumeTokenCollection: resume_token
  * OpHACliPath: /usr/local/omk/bin/opha-cli.pl
  * Compression: true
  * Role: Poller
  * Consumer: false
  * Producer: false
  * ConsumerPollerSet: (blank)
  * DebugEnabled: false
* Redis:
  * RedisServer: localhost
  * RedisPort: 6379
  * RedisPassword: ******
  * RetryTimeout: 3m0s
  * RedisStreamLenCheckPeriod: 5
  * RedisProducerMaxStreamLength: 10000
  * MaxRetries: 180
  * RedisTLSEnabled: false
  * RedisTLSSkipVerify: false
  * RedisProducerDegradeTimeout: 10
  * RedisProducerFullDegradeTimeout: 10
* Kafka:
  * Seeds: localhost:63616,localhost:63627,localhost:63629
  * RetryTimeout: 3m0s
  * MaxRetries: 180
* Nats:
  * NatsServer: opha-dev4.opmantek.net
  * NatsCluster: []
  * NatsPort: 4222
  * NatsNumReplicas: 1
  * NatsUsername: omkadmin
  * NatsPassword: ******
  * RetryTimeout: 3m0s
  * NatsStreamLenCheckPeriod: 5
  * NatsProducerMaxMsgPerSubject: 1000000
  * NatsMaxAge: 604800
  * MaxRetries: 180
  * NatsTLSEnabled: false
  * NatsTLSCert: <path>
  * NatsTLSKey: <path>
  * NatsTLSSkipVerify: false
  * NatsProducerDegradeTimeout: 10
  * NatsProducerFullDegradeTimeout: 10
* Authentication:
  * AuthTokenKeys: ******
--------------------------------------------------
2025-10-22T08:01:46.329+1100 [INFO]  ophad.verify: verify nmis9 mongodb connection with database: name=nmisng
2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: MongoDB NMIS connect: maybe="found nodes collection in nmis9 ✅"
2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: verify omk mongodb connection with database: name=opha
2025-10-22T08:01:46.551+1100 [INFO]  ophad.verify: MongoDB OMK connect: maybe="found opstatus collection in omk database ✅"
2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: Nats connect: result=
  | can connect to nats-server: opha-dev4.opmantek.net version: 2.11.9 ✅
  | we can connect to Nats-server ✅
2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀
```