
Config PreChecks

  1. Nats cluster config
    Check /usr/local/omk/conf/opCommon.json on all the VMs (Main Primary, Secondary Primary, Pollers and Mirrors) and verify that nats_cluster lists the DNS addresses of all 3 servers: Main Primary, Secondary Primary and the Arbiter Poller.

    omkadmin@lab-ophamb-mp01:/usr/local/omk/conf$ grep -a4 nats_cluster /usr/local/omk/conf/opCommon.json
          "db_use_v26_features" : 1,
          "redis_port" : 6379,
          "redis_server" : "localhost",
          "db_port" : "27017",
          "nats_cluster" : [
             "Main Primary",
             "Sec Primary",
             "New Arbiter Poller"
          ],

     

  2. Nats number of replicas setting
    Check /usr/local/omk/conf/opCommon.json on all the VMs (Main Primary, Secondary Primary, Pollers and Mirrors) and verify nats_num_replicas is set to 3 (for a replicated setup). A small loop to run both checks on every VM is sketched after the example below.

    omkadmin@lab-ophamb-mp01:/usr/local/omk/conf$ grep nats_num_replicas /usr/local/omk/conf/opCommon.json
          "nats_num_replicas" : 3,

  3. Nats stream info check (only for a replicated setup with 3 Nats servers)

    nats stream info --user omkadmin --password op42opha42

  • The cluster group needs to have all 3 servers in the replica set: the DNS addresses of Main Primary, Secondary Primary and the Arbiter Poller

  • Replicas must be set to 3

nats stream info --user omkadmin --password op42opha42

             Description: Streaming Message ophamb
                Subjects: 12da408c-adcc-441b-8918-1dafe07b3f88.*
                Replicas: 3
                 Storage: File


Cluster Information:

                    Name: C1
           Cluster Group: S-R3F-dLBC9xEB
                  Leader: http://10.1.50.21 (11d22h59m24s)
                 Replica: http://10.1.50.22/, current, seen 127ms ago
                 Replica: http://10.1.60.22, current, seen 127ms ago



Short script using jq to output the cluster information

for stream_name in $(nats stream list --user omkadmin --password op42opha42 --json | jq -r '.[]');
  do echo "Stream name: " $stream_name; 
    num=`nats stream info --user omkadmin --password op42opha42 "$stream_name" --json | jq -r '.config.num_replicas'`; 
    echo "Num replicas: " $num; 
    leader=`nats stream info --user omkadmin --password op42opha42 "$stream_name" --json | jq -r '.cluster.leader'`; 
    echo "Leader: " $leader; 
    replicas=`nats stream info --user omkadmin --password op42opha42 "$stream_name" --json | jq -r '[.cluster.replicas[] |  .name ]'`; 
    echo "Replicas: " $replicas; 
    echo ""
  done


Stream name:  12da408c-adcc-441b-8918-1dafe07b3f88
Num replicas:  3
Leader:  http://10.1.50.21
Replicas:  [ "http://10.1.50.22/", "http://10.1.60.22" ]

Stream name:  5e06e63b-2cbf-4aac-a2e9-81ee3133119e
Num replicas:  3
Leader:  http://10.1.60.22
Replicas:  [ "http://10.1.50.21", "http://10.1.50.22/" ]
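
The same approach can be turned into a pass/fail check. The sketch below reuses the commands from the script above and only prints a warning when a stream's replica count is not 3:

    EXPECTED=3
    for stream_name in $(nats stream list --user omkadmin --password op42opha42 --json | jq -r '.[]'); do
        num=$(nats stream info --user omkadmin --password op42opha42 "$stream_name" --json | jq -r '.config.num_replicas')
        if [ "$num" -ne "$EXPECTED" ]; then
            echo "WARNING: stream $stream_name has $num replicas (expected $EXPECTED)"
        fi
    done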


  1. Mongo cluster heartbeat and 'uptime' check on the Main Primary (a non-interactive version is sketched after the output below)

    shankarn@opha-dev4:/usr/local/omk/conf$ mongosh --username opUserRW --password op42flow42 admin --port 27017
    rs1 [direct: primary] admin> rs.status()
    {
       ...
      members: [
        {
          _id: 0,
          name: 'opha-dev4.opmantek.net:27017',
          health: 1,
          state: 1,
          stateStr: 'PRIMARY',
          uptime: 17503,
          optime: { ts: Timestamp({ t: 1763526818, i: 9 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:38.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.190Z'),
        },
        {
          _id: 1,
          name: 'opha-dev7.opmantek.net:27017',
          health: 1,
          state: 2,
          stateStr: 'SECONDARY',
          uptime: 17496,
          optime: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDurable: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:34.000Z'),
          optimeDurableDate: ISODate('2025-11-19T04:33:34.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastHeartbeat: ISODate('2025-11-19T04:33:36.300Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:37.493Z'),
        },
        {
          _id: 2,
          name: 'opha-dev6.opmantek.net:27018',
          health: 1,
          state: 7,
          stateStr: 'ARBITER',
          uptime: 17496,
          lastHeartbeat: ISODate('2025-11-19T04:33:36.301Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:36.290Z'),
        }
      ],
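
    The same member health, state and uptime fields can also be pulled non-interactively, which is convenient for scripted checks. This is a sketch only, using the same credentials and port as above:

    mongosh --username opUserRW --password op42flow42 admin --port 27017 --quiet \
      --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr, "health=" + m.health, "uptime=" + m.uptime))'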
    

  2. Run the command sudo /usr/local/omk/bin/ophad verify on all the Peers/Primaries.

    The last line "ophad.verify: ready for liftoff 🚀" indicates the configuration is good. A loop to run this check across all peers is sketched after the example output below.

     

    shankarn@opha-dev5:~$ sudo /usr/local/omk/bin/ophad verify
    [sudo] password for shankarn:
    ophad v0.0.52: agent
    Appending to file "/usr/local/omk/log/ophad.log"
    Settings -----------------------------------------
      * ClusterId: 783d7b91-6c64-4db9-a28f-6364a54b8505
      * OMKDatabase:
        * ConnectionTimeout: 5h33m20s
        * RetryTimeout: 3m0s
        * PingTimeout: 33m20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: omk_shared
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 0
        * BatchTimeout: 0
      * NMISDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: nmisng
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OpEventsDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 5m0s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: opevents
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OMK:
        * LogLevel: info
        * BindAddr: *
      * Directories:
        * Base: /usr/local/omk
        * Conf: /usr/local/omk/conf
        * Logs: /usr/local/omk/log
        * Var: /usr/local/omk/var
      * OPHA:
        * DBName: opha
        * StreamingApps: [nmis opevents]
        * Logfile: /usr/local/omk/log/ophad.log
        * MongoWatchFilters: []
        * StreamType: nats
        * AgentPort: 6000
        * NonActiveTimeout: 8m0s
        * ResumeTokenCollection: resume_token
        * OpHACliPath: /usr/local/omk/bin/opha-cli.pl
        * Compression: true
        * Role: Poller
      * Consumer: false
      * Producer: false
      * ConsumerPollerSet: (blank)
      * DebugEnabled: false
      * Redis:
        * RedisServer: localhost
        * RedisPort: 6379
        * RedisPassword: ******
        * RetryTimeout: 3m0s
        * RedisStreamLenCheckPeriod: 5
        * RedisProducerMaxStreamLength: 10000
        * MaxRetries: 180
        * RedisTLSEnabled: false
        * RedisTLSSkipVerify: false
        * RedisProducerDegradeTimeout: 10
        * RedisProducerFullDegradeTimeout: 10
      * Kafka:
        * Seeds: localhost:63616,localhost:63627,localhost:63629
        * RetryTimeout: 3m0s
        * MaxRetries: 180
      * Nats:
        * NatsServer: opha-dev4.opmantek.net
        * NatsCluster: []
        * NatsPort: 4222
        * NatsNumReplicas: 1
        * NatsUsername: omkadmin
        * NatsPassword: ******
        * RetryTimeout: 3m0s
        * NatsStreamLenCheckPeriod: 5
        * NatsProducerMaxMsgPerSubject: 1000000
        * NatsMaxAge: 604800
        * MaxRetries: 180
        * NatsTLSEnabled: false
        * NatsTLSCert: <path>
        * NatsTLSKey: <path>
        * NatsTLSSkipVerify: false
        * NatsProducerDegradeTimeout: 10
        * NatsProducerFullDegradeTimeout: 10
      * Authentication:
        * AuthTokenKeys: ******
    --------------------------------------------------
    2025-10-22T08:01:46.329+1100 [INFO]  ophad.verify: verify nmis9 mongodb connection with database: name=nmisng
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: MongoDB NMIS connect: maybe="found nodes collection in nmis9 ✅"
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: verify omk mongodb connection with database: name=opha
    2025-10-22T08:01:46.551+1100 [INFO]  ophad.verify: MongoDB OMK connect: maybe="found opstatus collection in omk database ✅"
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: Nats connect:
      result=
      | can connect to nats-server: opha-dev4.opmantek.net version: 2.11.9 ✅
      | we can connect to Nats-server ✅
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀
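
    To run the same verification across every server in one go, a small ssh loop can be used. This is a sketch only: the hostnames are placeholders and it assumes ssh access plus passwordless sudo on each host.

    # Placeholder hostnames - replace with your Primaries, Pollers and Mirrors.
    for host in opha-dev4 opha-dev5 opha-dev6 opha-dev7; do
        echo "=== $host ==="
        if ssh "$host" "sudo /usr/local/omk/bin/ophad verify 2>&1" | grep -q "ready for liftoff"; then
            echo "$host: verify OK"
        else
            echo "$host: verify FAILED - run ophad verify on that host to see the details"
        fi
    done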

     

Scenario 1: Using the opHA4 'Pull' on the Primary to synchronize nmisng collections.

If the system is being upgraded from opHA 4.1.2 to opHA 5.1.1, it is a good idea to do a "Pull" on the Primary opHA portal before upgrading to opHA 5.1.1.

If the Poller has been running for a while, it is better to move it to opHA4 and then do a "Pull" to sync the data. After the sync has completed, it is easy to move it back to opHA-MB.

Move the desired Poller to opHA4 to sync up to the latest data (opha5 => opha4).

Peer (Poller): Pause the message bus on the Peer to move the Poller state to opHA4:

/usr/local/omk/bin/ophad cmd producer pause


Main Primary: on the opHA-MB peer portal, use "Pull" to sync data from the Peer that has been paused.


Move the desired Poller back to opHA5 (opha4 => opha5)

Peer (Poller): This command sets opHA to start using the message bus again:

/usr/local/omk/bin/ophad cmd producer start
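
The two Poller-side commands can be wrapped in a small helper so the producer is never left paused by accident. This is a sketch only; it uses just the pause and start commands shown above and waits for the operator to confirm that the Pull on the Main Primary portal has completed.

    #!/bin/bash
    # Run on the Peer (Poller): pause the message bus, wait for the manual
    # "Pull" on the Main Primary opHA portal, then resume the message bus.
    set -e

    sudo /usr/local/omk/bin/ophad cmd producer pause
    echo "Producer paused. Perform the Pull on the Main Primary opHA portal now."
    read -r -p "Press Enter once the Pull has completed to resume the message bus... "

    sudo /usr/local/omk/bin/ophad cmd producer start
    echo "Producer started - the Poller is back on the message bus (opHA 5)."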


Scenario 2: opHA-MB failover and failback commands.

State: It is possible to get the state of the Peers on the Main Primary using the CLI:

sudo /usr/local/omk/bin/ophad cmd consumer state

Failover: If a Poller were to go down, the Mirror would take over automatically. However, once the Poller comes back online, the switchover from Mirror back to Poller is not automatic.

Failback: There is a CLI command to accomplish this, which needs to be run on the Main Primary (and Primary):

sudo /usr/local/omk/bin/ophad cmd consumer failback <Poller Cluster ID>

There is also a way to force a failover, which again needs to be run on the Main Primary (and Primary):

sudo /usr/local/omk/bin/ophad cmd consumer failover <Poller Cluster ID>
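
As a convenience, the state and failback commands above can be combined into a short script run on the Main Primary (and Primary). This is a sketch only; the cluster ID below is a placeholder, and the state output is just displayed rather than parsed.

    # Placeholder cluster IDs of the Pollers that have come back online.
    POLLER_CLUSTER_IDS="783d7b91-6c64-4db9-a28f-6364a54b8505"

    # Show the current consumer state, then fail back each Poller.
    sudo /usr/local/omk/bin/ophad cmd consumer state

    for id in $POLLER_CLUSTER_IDS; do
        echo "Failing back Poller cluster $id"
        sudo /usr/local/omk/bin/ophad cmd consumer failback "$id"
    done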

Scenario 3: (Replication mode) If the Main Primary were to go down.

Switching Main and Secondary Primary Servers

In the unforeseen event that the main-primary server goes down, the second-primary will take over and become the primary server, ensuring that the system still runs. Once the main-primary server has been recovered, restart all the services on it by running the following command.

Run as root user

systemctl restart nmis9d opchartsd opeventsd omkd ophad

To switch from the Secondary Primary back to the Main Primary, so that the main-primary is the master again, follow these steps:

  1. Connect to MongoDB on the master server, in this case the second-primary:

    mongosh --username opUserRW --password op42flow42 admin
  2. Update the member priorities so that member 0 (the main-primary) has the higher priority. A one-liner to confirm which member is primary afterwards is sketched below the reconfig:

    cfg = rs.conf()
    cfg.members[0].priority = 0.6
    cfg.members[1].priority = 0.5
    rs.reconfig(cfg)
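
After rs.reconfig() the higher-priority member should become primary again once it has caught up. A quick, non-interactive way to confirm which member is currently primary (same credentials as above, sketch only):

    mongosh --username opUserRW --password op42flow42 admin --quiet \
      --eval 'rs.status().members.filter(m => m.stateStr === "PRIMARY").forEach(m => print("PRIMARY:", m.name))'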

Enable logging:

  1. ophad logging: in /usr/local/omk/conf/opCommon.json, under "opha", add the line below (a jq sketch for making this edit from the command line follows the excerpt)
    "ophad_logfile" : "/usr/local/omk/log/ophad.log",

     "opha" : {
          "opha_role" : "Main Primary",
          "ophad_logfile" : "/usr/local/omk/log/ophad.log",
          "ophad_streaming_apps" : [
             "nmis",
             "opevents"
          ],
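
    If you prefer to make this change from the command line, the jq sketch below does the same edit. It assumes "opha" is a top-level key in opCommon.json (as in the excerpt above); keep the backup and review the result before restarting services.

    cd /usr/local/omk/conf
    cp opCommon.json opCommon.json.bkup
    jq '.opha.ophad_logfile = "/usr/local/omk/log/ophad.log"' opCommon.json.bkup > opCommon.json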
  2. nats-server logging: add the following line to /etc/nats-server.conf
    log_file: "/var/log/nats-server.log"

    shankarn@opha-dev4:~$ cat /etc/nats-server.conf
    server_name: "opha-dev4.opmantek.net"
    http_port: 8222
    listen: 4222
    jetstream: enabled
    
    #tls {
    #    cert_file: "<path>"
    #    key_file:  "<path>"
    #    #ca_file:   "<path>"
    #    verify: true
    #}
    
    log_file: "/var/log/nats-server.log"
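
Restart the daemons so the new logging settings take effect, then tail the logs. The systemd unit name nats-server is an assumption here; adjust it if your NATS installation uses a different unit name.

    sudo systemctl restart ophad nats-server
    sudo tail -f /usr/local/omk/log/ophad.log /var/log/nats-server.log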

Debugging guide:

Scenario 1: ophad doesn't come up

Check sudo journalctl -f -u ophad

shankarn@opha-dev2:/usr/local/omk/log$ sudo journalctl -f -u ophad
-- Journal begins at Fri 2024-09-06 16:23:19 AEST. --
Aug 01 10:15:59 opha-dev2 ophad[46242]: ophad v0.0.0: agent
Aug 01 10:16:01 opha-dev2 ophad[46242]: cannot init logger: cannot create logfile open /usr/local/omk/log/ophad.log: permission denied
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Main process exited, code=exited, status=1/FAILURE
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Failed with result 'exit-code'.  

Edit /etc/systemd/system/ophad.service to remove the lines below (a sed sketch for this is shown after them):

Type=simple
User=root
Group=root
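
The same edit can be done non-interactively. A sketch that backs up the unit file first and then deletes exactly those three lines:

    sudo cp /etc/systemd/system/ophad.service /etc/systemd/system/ophad.service.orig
    sudo sed -i '/^Type=simple$/d; /^User=root$/d; /^Group=root$/d' /etc/systemd/system/ophad.service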
cat /etc/systemd/system/ophad.service.bkup
[Unit]
Description=opHA daemon
After=network-online.target
Wants=network-online.target

[Service]
#on failure try to restart every RestartSec, upto StartLimitBurst times within StartLimitInterval
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=10

WorkingDirectory=/usr/local/omk
ExecStart=/usr/local/omk/bin/ophad agent --streaming-type=nats

[Install]

Reload systemd and restart ophad:

sudo systemctl daemon-reload                                                    
sudo systemctl restart ophad
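
Then confirm that the daemon stayed up and that the log file is now being written:

    sudo systemctl status ophad --no-pager
    sudo journalctl -u ophad -n 20 --no-pager
    ls -l /usr/local/omk/log/ophad.log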