
  1. Mongo cluster heartbeat check on Main Primary

    Code Block
    shankarn@opha-dev4:/usr/local/omk/conf$ mongosh --username opUserRW --password op42flow42 admin --port 27017
    rs1 [direct: primary] admin> rs.status()
    {
       ...
      members: [
        {
          _id: 0,
          name: 'opha-dev4.opmantek.net:27017',
          health: 1,
          state: 1,
          stateStr: 'PRIMARY',
          uptime: 17503,
          optime: { ts: Timestamp({ t: 1763526818, i: 9 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:38.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.190Z'),
        },
        {
          _id: 1,
          name: 'opha-dev7.opmantek.net:27017',
          health: 1,
          state: 2,
          stateStr: 'SECONDARY',
          uptime: 17496,
          optime: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDurable: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:34.000Z'),
          optimeDurableDate: ISODate('2025-11-19T04:33:34.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastHeartbeat: ISODate('2025-11-19T04:33:36.300Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:37.493Z'),
        },
        {
          _id: 2,
          name: 'opha-dev6.opmantek.net:27018',
          health: 1,
          state: 7,
          stateStr: 'ARBITER',
          uptime: 17496,
          lastHeartbeat: ISODate('2025-11-19T04:33:36.301Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:36.290Z'),
        }
      ],
    
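The member fields shown above (health, state, optimeDate) are what a heartbeat check reads. As an illustration, the replica-set status can be summarized programmatically; this is a sketch that assumes the members array has already been fetched (for example via mongosh --eval or a MongoDB driver, not shown) and uses the sample values from the output above:

```python
from datetime import datetime, timezone

# Subset of MongoDB's rs.status() member state codes used above
STATE_NAMES = {1: "PRIMARY", 2: "SECONDARY", 7: "ARBITER"}

def summarize_members(members):
    """Flag unhealthy members and report replication lag vs the primary.

    `members` mimics the rs.status().members array shown above, with
    `health`, `state`, `name`, and (for data-bearing nodes) `optimeDate`.
    """
    primary = next((m for m in members if m.get("state") == 1), None)
    report = []
    for m in members:
        entry = {"name": m["name"],
                 "state": STATE_NAMES.get(m.get("state"), "OTHER"),
                 "healthy": m.get("health") == 1}
        # Arbiters carry no data, so lag only applies to secondaries.
        if primary and m.get("state") == 2:
            entry["lag_secs"] = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        report.append(entry)
    return report

# Sample values taken from the rs.status() output above
members = [
    {"name": "opha-dev4.opmantek.net:27017", "health": 1, "state": 1,
     "optimeDate": datetime(2025, 11, 19, 4, 33, 38, tzinfo=timezone.utc)},
    {"name": "opha-dev7.opmantek.net:27017", "health": 1, "state": 2,
     "optimeDate": datetime(2025, 11, 19, 4, 33, 34, tzinfo=timezone.utc)},
    {"name": "opha-dev6.opmantek.net:27018", "health": 1, "state": 7},
]
print(summarize_members(members))
```

With the sample data above, the secondary shows 4 seconds of lag behind the primary, which matches the optimeDate difference in the rs.status() output.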

Scenario 1: Using opHA4 ‘Pull’ on Primary to synchronize nmisng collections.

If the system is being upgraded from opHA 4.1.2 to opHA 5.1.1, it is a good idea to do a “Pull” on the Primary opHA portal before upgrading to opHA 5.1.1.

If the Poller has been running for a while, it is better to move it to opHA4 and then do a “Pull” to sync the data. After the sync has completed, it is easy to move it back to opHA-MB.
Move the desired Poller to opha4 to sync up to the latest data (opha5 => opha4).

Peer: Pause the message bus on the Peer

Code Block
/usr/local/omk/bin/ophad cmd producer pause

Primary: on the opHA-MB peer portal, use “Pull” to sync data from the Peer that has been paused.


Move the desired Poller back to opHA-MB (opha4 => opha5).

This command sets opHA to start using the message bus again.

Code Block
/usr/local/omk/bin/ophad cmd producer start


Scenario 2: opHA-MB failover and failback commands.

State: The state of the Peers can be obtained on the Main Primary using the CLI:

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer state

Failover: If a Poller goes down, the Mirror takes over automatically. However, once the Poller comes back online, the switch back from Mirror to Poller is not automatic.

Failback: There is a CLI command to accomplish this, which needs to be run on the Main Primary (and Primary):

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer failback <Poller Cluster ID>

There is also a way to force a failover, which again needs to be run on the Main Primary (and Primary):

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer failover <Poller Cluster ID>

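The asymmetry described above (failover is automatic, failback is manual) can be modeled with a toy sketch. This is illustrative only, not ophad's actual implementation; the class and names are hypothetical:

```python
class ConsumerState:
    """Toy model of the opHA-MB behaviour described above: when a Poller
    goes down, its Mirror takes over automatically, but returning to the
    Poller requires an explicit failback command."""

    def __init__(self, poller, mirror):
        self.poller, self.mirror = poller, mirror
        self.active = poller          # normally we consume from the Poller
        self.poller_up = True

    def poller_down(self):
        self.poller_up = False
        self.active = self.mirror     # failover happens automatically

    def poller_recovered(self):
        self.poller_up = True         # NOTE: the active source does NOT change

    def failback(self):
        # the manual step: `ophad cmd consumer failback <Poller Cluster ID>`
        if self.poller_up:
            self.active = self.poller

state = ConsumerState("poller-1", "mirror-1")
state.poller_down()
state.poller_recovered()
print(state.active)   # still "mirror-1" until failback is run
state.failback()
print(state.active)   # "poller-1"
```
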
Scenario 3: (Replication mode) If the main-primary goes down in replication mode.

Switching Main and Secondary Primary Servers

In the unforeseen event that the main-primary server goes down, the second-primary will take over and become the primary server, ensuring that the system still runs. Once the main-primary server is recovered, restart all of its services by running the following command.

Run as the root user:

Code Block
systemctl restart nmis9d opchartsd opeventsd omkd ophad

To switch from the Secondary Primary back to the Main Primary, so that the main-primary is the master again, follow these steps:

  1. Connect to MongoDB on the current master server (in this case, the second-primary):

    Code Block
    mongosh --username opUserRW --password op42flow42 admin
  2. Update member priorities:

    Code Block
    cfg = rs.conf()
    cfg.members[0].priority = 0.6
    cfg.members[1].priority = 0.5
    rs.reconfig(cfg)

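The reconfiguration works because MongoDB prefers the electable, healthy member with the highest priority; with the values above, member 0 (the main-primary, priority 0.6) outranks member 1 (priority 0.5) and reclaims primary at the next election. A minimal illustration of that selection rule (simplified; real MongoDB elections also consider optime and votes):

```python
def preferred_primary(members):
    """Return the healthy, electable member with the highest priority.

    A simplification of MongoDB's election rules: priority 0 members
    can never become primary, and higher priority wins.
    """
    candidates = [m for m in members if m["healthy"] and m["priority"] > 0]
    return max(candidates, key=lambda m: m["priority"])["name"]

# Priorities as set by the rs.reconfig() step above
rs_members = [
    {"name": "main-primary", "priority": 0.6, "healthy": True},
    {"name": "second-primary", "priority": 0.5, "healthy": True},
]
print(preferred_primary(rs_members))   # main-primary

# While the main-primary is down, the second-primary wins instead:
rs_members[0]["healthy"] = False
print(preferred_primary(rs_members))   # second-primary
```
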
Enable logging:

  1. ophad logging: in /usr/local/omk/conf/opCommon.json, under “opha”, add the line
    "ophad_logfile" : "/usr/local/omk/log/ophad.log",

    Code Block
     "opha" : {
          "opha_role" : "Main Primary",
          "ophad_logfile" : "/usr/local/omk/log/ophad.log",
          "ophad_streaming_apps" : [
             "nmis",
             "opevents"
          ],
  2. nats-server logging: add the following line to /etc/nats-server.conf
    log_file: "/var/log/nats-server.log"

    Code Block
    shankarn@opha-dev4:~$ cat /etc/nats-server.conf
    server_name: "opha-dev4.opmantek.net"
    http_port: 8222
    listen: 4222
    jetstream: enabled
    
    #tls {
    #    cert_file: "<path>"
    #    key_file:  "<path>"
    #    #ca_file:   "<path>"
    #    verify: true
    #}
    
    log_file: "/var/log/nats-server.log"

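Both logging changes can be sanity-checked programmatically. The sketch below runs the checks against inline copies of the fragments above; in practice one would read /usr/local/omk/conf/opCommon.json and /etc/nats-server.conf instead:

```python
import json

# Inline copy of the relevant part of the "opha" section of opCommon.json
opcommon_opha = json.loads("""
{
  "opha_role": "Main Primary",
  "ophad_logfile": "/usr/local/omk/log/ophad.log",
  "ophad_streaming_apps": ["nmis", "opevents"]
}
""")

# Inline copy of /etc/nats-server.conf (NATS config is not JSON, so we
# check it as plain text)
nats_conf = '''
server_name: "opha-dev4.opmantek.net"
http_port: 8222
listen: 4222
jetstream: enabled
log_file: "/var/log/nats-server.log"
'''

# ophad logging enabled?
assert opcommon_opha["ophad_logfile"] == "/usr/local/omk/log/ophad.log"
# nats-server logging enabled?
assert any(line.strip().startswith("log_file:") for line in nats_conf.splitlines())
print("logging configured")
```
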
Debugging guide:

Scenario 1: ophad doesn’t come up

Check sudo journalctl -f -u ophad:

Code Block
shankarn@opha-dev2:/usr/local/omk/log$ sudo journalctl -f -u ophad
-- Journal begins at Fri 2024-09-06 16:23:19 AEST. --
Aug 01 10:15:59 opha-dev2 ophad[46242]: ophad v0.0.0: agent
Aug 01 10:16:01 opha-dev2 ophad[46242]: cannot init logger: cannot create logfile open /usr/local/omk/log/ophad.log: permission denied
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Main process exited, code=exited, status=1/FAILURE
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Failed with result 'exit-code'.  

Edit /etc/systemd/system/ophad.service to remove the lines below:

Code Block
Type=simple
User=root
Group=root
A working service file for reference:

Code Block
cat /etc/systemd/system/ophad.service.bkup
[Unit]
Description=opHA daemon
After=network-online.target
Wants=network-online.target

[Service]
#on failure try to restart every RestartSec, upto StartLimitBurst times within StartLimitInterval
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=10

WorkingDirectory=/usr/local/omk
ExecStart=/usr/local/omk/bin/ophad agent --streaming-type=nats

[Install]

Reload systemd and restart ophad:

Code Block
sudo systemctl daemon-reload                                                    
sudo systemctl restart ophad

Scenario 4: Using the ophad command line to verify the configuration and connection status

Run the command sudo /usr/local/omk/bin/ophad verify on all the Peers/Primaries.

The last line, “ophad.verify: ready for liftoff 🚀”, indicates the configuration is good.


    Code Block
    shankarn@opha-dev5:~$ sudo /usr/local/omk/bin/ophad verify
    [sudo] password for shankarn:
    ophad v0.0.52: agent
    Appending to file "/usr/local/omk/log/ophad.log"
    Settings -----------------------------------------
      * ClusterId: 783d7b91-6c64-4db9-a28f-6364a54b8505
      * OMKDatabase:
        * ConnectionTimeout: 5h33m20s
        * RetryTimeout: 3m0s
        * PingTimeout: 33m20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: omk_shared
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 0
        * BatchTimeout: 0
      * NMISDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: nmisng
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OpEventsDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 5m0s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: opevents
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OMK:
        * LogLevel: info
        * BindAddr: *
      * Directories:
        * Base: /usr/local/omk
        * Conf: /usr/local/omk/conf
        * Logs: /usr/local/omk/log
        * Var: /usr/local/omk/var
      * OPHA:
        * DBName: opha
        * StreamingApps: [nmis opevents]
        * Logfile: /usr/local/omk/log/ophad.log
        * MongoWatchFilters: []
        * StreamType: nats
        * AgentPort: 6000
        * NonActiveTimeout: 8m0s
        * ResumeTokenCollection: resume_token
        * OpHACliPath: /usr/local/omk/bin/opha-cli.pl
        * Compression: true
        * Role: Poller
      * Consumer: false
      * Producer: false
      * ConsumerPollerSet: (blank)
      * DebugEnabled: false
      * Redis:
        * RedisServer: localhost
        * RedisPort: 6379
        * RedisPassword: ******
        * RetryTimeout: 3m0s
        * RedisStreamLenCheckPeriod: 5
        * RedisProducerMaxStreamLength: 10000
        * MaxRetries: 180
        * RedisTLSEnabled: false
        * RedisTLSSkipVerify: false
        * RedisProducerDegradeTimeout: 10
        * RedisProducerFullDegradeTimeout: 10
      * Kafka:
        * Seeds: localhost:63616,localhost:63627,localhost:63629
        * RetryTimeout: 3m0s
        * MaxRetries: 180
      * Nats:
        * NatsServer: opha-dev4.opmantek.net
        * NatsCluster: []
        * NatsPort: 4222
        * NatsNumReplicas: 1
        * NatsUsername: omkadmin
        * NatsPassword: ******
        * RetryTimeout: 3m0s
        * NatsStreamLenCheckPeriod: 5
        * NatsProducerMaxMsgPerSubject: 1000000
        * NatsMaxAge: 604800
        * MaxRetries: 180
        * NatsTLSEnabled: false
        * NatsTLSCert: <path>
        * NatsTLSKey: <path>
        * NatsTLSSkipVerify: false
        * NatsProducerDegradeTimeout: 10
        * NatsProducerFullDegradeTimeout: 10
      * Authentication:
        * AuthTokenKeys: ******
    --------------------------------------------------
    2025-10-22T08:01:46.329+1100 [INFO]  ophad.verify: verify nmis9 mongodb connection with database: name=nmisng
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: MongoDB NMIS connect: maybe="found nodes collection in nmis9 ✅"
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: verify omk mongodb connection with database: name=opha
    2025-10-22T08:01:46.551+1100 [INFO]  ophad.verify: MongoDB OMK connect: maybe="found opstatus collection in omk database ✅"
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: Nats connect:
      result=
      | can connect to nats-server: opha-dev4.opmantek.net version: 2.11.9 ✅
      | we can connect to Nats-server ✅
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀

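To script this verification across all the Peers/Primaries, one can check the command output for the success marker shown above (a sketch; running the command remotely, e.g. over ssh, is not shown):

```python
def verify_succeeded(output: str) -> bool:
    """Return True if `ophad verify` reported success.

    The success marker is the final log line shown in the output above;
    checking for it lets the verify step be scripted for each peer.
    """
    return any("ready for liftoff" in line for line in output.splitlines())

# Last line of the sample output above
sample = "2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀"
print(verify_succeeded(sample))   # True
```
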
     
