
  1. Mongo cluster heartbeat check on Main Primary

    Code Block
    shankarn@opha-dev4:/usr/local/omk/conf$ mongosh --username opUserRW --password op42flow42 admin --port 27017
    rs1 [direct: primary] admin> rs.status()
    {
       ...
      members: [
        {
          _id: 0,
          name: 'opha-dev4.opmantek.net:27017',
          health: 1,
          state: 1,
          stateStr: 'PRIMARY',
          uptime: 17503,
          optime: { ts: Timestamp({ t: 1763526818, i: 9 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:38.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.190Z'),
        },
        {
          _id: 1,
          name: 'opha-dev7.opmantek.net:27017',
          health: 1,
          state: 2,
          stateStr: 'SECONDARY',
          uptime: 17496,
          optime: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDurable: { ts: Timestamp({ t: 1763526814, i: 1 }), t: Long('7') },
          optimeDate: ISODate('2025-11-19T04:33:34.000Z'),
          optimeDurableDate: ISODate('2025-11-19T04:33:34.000Z'),
          lastAppliedWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastDurableWallTime: ISODate('2025-11-19T04:33:38.225Z'),
          lastHeartbeat: ISODate('2025-11-19T04:33:36.300Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:37.493Z'),
        },
        {
          _id: 2,
          name: 'opha-dev6.opmantek.net:27018',
          health: 1,
          state: 7,
          stateStr: 'ARBITER',
          uptime: 17496,
          lastHeartbeat: ISODate('2025-11-19T04:33:36.301Z'),
          lastHeartbeatRecv: ISODate('2025-11-19T04:33:36.290Z'),
        }
      ],
    
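The member fields shown above (health, state, optimeDate) are what a heartbeat check reads. As an illustration, the replica-set status can be summarized programmatically; this is a sketch that assumes the members array has already been fetched (for example via mongosh --eval or a MongoDB driver, not shown) and uses the sample values from the output above:

```python
from datetime import datetime, timezone

# Subset of MongoDB's rs.status() member state codes used above
STATE_NAMES = {1: "PRIMARY", 2: "SECONDARY", 7: "ARBITER"}

def summarize_members(members):
    """Flag unhealthy members and report replication lag vs the primary.

    `members` mimics the rs.status().members array shown above, with
    `health`, `state`, `name`, and (for data-bearing nodes) `optimeDate`.
    """
    primary = next((m for m in members if m.get("state") == 1), None)
    report = []
    for m in members:
        entry = {"name": m["name"],
                 "state": STATE_NAMES.get(m.get("state"), "OTHER"),
                 "healthy": m.get("health") == 1}
        # Arbiters carry no data, so lag only applies to secondaries.
        if primary and m.get("state") == 2:
            entry["lag_secs"] = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        report.append(entry)
    return report

# Sample values taken from the rs.status() output above
members = [
    {"name": "opha-dev4.opmantek.net:27017", "health": 1, "state": 1,
     "optimeDate": datetime(2025, 11, 19, 4, 33, 38, tzinfo=timezone.utc)},
    {"name": "opha-dev7.opmantek.net:27017", "health": 1, "state": 2,
     "optimeDate": datetime(2025, 11, 19, 4, 33, 34, tzinfo=timezone.utc)},
    {"name": "opha-dev6.opmantek.net:27018", "health": 1, "state": 7},
]
print(summarize_members(members))
```

With the sample data above, the secondary shows 4 seconds of lag behind the primary, which matches the optimeDate difference in the rs.status() output.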

Scenario 1: Using opHA4 ‘Pull’ on Primary to synchronize nmisng collections.

If the system is being upgraded from opHA 4.1.2 to opHA 5.1.1, it is a good idea to do a “Pull” on the Primary opHA portal before upgrading to opHA 5.1.1.

If the Poller has been running for a while, it is better to move it to opHA4 and then do a “Pull” to sync the data. After the sync has completed, it is easy to move it back to opHA-MB.
Move the desired Poller to opha4 to sync up to the latest data (opha5 => opha4).

Peer: Pause the message bus on the Peer

Code Block
/usr/local/omk/bin/ophad cmd producer pause

Primary: on the opHA-MB peer portal, use “Pull” to sync data from the Peer that has been paused.


Move the desired Poller back to opHA-MB (opha4 => opha5).

This command sets opHA to start using the message bus again.

Code Block
/usr/local/omk/bin/ophad cmd producer start


Scenario 2: opHA-MB failover and failback commands.

State: The state of the Peers can be obtained on the Main Primary using the CLI:

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer state

Failover: If a Poller goes down, the Mirror takes over automatically. However, once the Poller comes back online, the switch back from Mirror to Poller is not automatic.

Failback: There is a CLI command to accomplish this, which needs to be run on the Main Primary (and Primary):

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer failback <Poller Cluster ID>

There is also a way to force a failover, which again needs to be run on the Main Primary (and Primary):

Code Block
sudo /usr/local/omk/bin/ophad cmd consumer failover <Poller Cluster ID>

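The asymmetry described above (failover is automatic, failback is manual) can be modeled with a toy sketch. This is illustrative only, not ophad's actual implementation; the class and names are hypothetical:

```python
class ConsumerState:
    """Toy model of the opHA-MB behaviour described above: when a Poller
    goes down, its Mirror takes over automatically, but returning to the
    Poller requires an explicit failback command."""

    def __init__(self, poller, mirror):
        self.poller, self.mirror = poller, mirror
        self.active = poller          # normally we consume from the Poller
        self.poller_up = True

    def poller_down(self):
        self.poller_up = False
        self.active = self.mirror     # failover happens automatically

    def poller_recovered(self):
        self.poller_up = True         # NOTE: the active source does NOT change

    def failback(self):
        # the manual step: `ophad cmd consumer failback <Poller Cluster ID>`
        if self.poller_up:
            self.active = self.poller

state = ConsumerState("poller-1", "mirror-1")
state.poller_down()
state.poller_recovered()
print(state.active)   # still "mirror-1" until failback is run
state.failback()
print(state.active)   # "poller-1"
```
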
Scenario 3: (Replication mode) If the main-primary goes down in replication mode.

Switching Main and Secondary Primary Servers

In the unforeseen event that the main-primary server goes down, the second-primary will take over and become the primary server, ensuring that the system still runs. Once the main-primary server is recovered, restart all of its services by running the following command.

Run as the root user:

Code Block
systemctl restart nmis9d opchartsd opeventsd omkd ophad

To switch from the Secondary Primary back to the Main Primary, so that the main-primary is the master again, follow these steps:

  1. Connect to MongoDB on the current master server (in this case, the second-primary):

    Code Block
    mongosh --username opUserRW --password op42flow42 admin
  2. Update member priorities:

    Code Block
    cfg = rs.conf()
    cfg.members[0].priority = 0.6
    cfg.members[1].priority = 0.5
    rs.reconfig(cfg)

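The reconfiguration works because MongoDB prefers the electable, healthy member with the highest priority; with the values above, member 0 (the main-primary, priority 0.6) outranks member 1 (priority 0.5) and reclaims primary at the next election. A minimal illustration of that selection rule (simplified; real MongoDB elections also consider optime and votes):

```python
def preferred_primary(members):
    """Return the healthy, electable member with the highest priority.

    A simplification of MongoDB's election rules: priority 0 members
    can never become primary, and higher priority wins.
    """
    candidates = [m for m in members if m["healthy"] and m["priority"] > 0]
    return max(candidates, key=lambda m: m["priority"])["name"]

# Priorities as set by the rs.reconfig() step above
rs_members = [
    {"name": "main-primary", "priority": 0.6, "healthy": True},
    {"name": "second-primary", "priority": 0.5, "healthy": True},
]
print(preferred_primary(rs_members))   # main-primary

# While the main-primary is down, the second-primary wins instead:
rs_members[0]["healthy"] = False
print(preferred_primary(rs_members))   # second-primary
```
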
Enable logging:

  1. ophad logging: in /usr/local/omk/conf/opCommon.json, under “opha”, add the line
    "ophad_logfile" : "/usr/local/omk/log/ophad.log",

    Code Block
     "opha" : {
          "opha_role" : "Main Primary",
          "ophad_logfile" : "/usr/local/omk/log/ophad.log",
          "ophad_streaming_apps" : [
             "nmis",
             "opevents"
          ],
  2. nats-server logging: add the following line to /etc/nats-server.conf
    log_file: "/var/log/nats-server.log"

    Code Block
    shankarn@opha-dev4:~$ cat /etc/nats-server.conf
    server_name: "opha-dev4.opmantek.net"
    http_port: 8222
    listen: 4222
    jetstream: enabled
    
    #tls {
    #    cert_file: "<path>"
    #    key_file:  "<path>"
    #    #ca_file:   "<path>"
    #    verify: true
    #}
    
    log_file: "/var/log/nats-server.log"

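Both logging changes can be sanity-checked programmatically. The sketch below runs the checks against inline copies of the fragments above; in practice one would read /usr/local/omk/conf/opCommon.json and /etc/nats-server.conf instead:

```python
import json

# Inline copy of the relevant part of the "opha" section of opCommon.json
opcommon_opha = json.loads("""
{
  "opha_role": "Main Primary",
  "ophad_logfile": "/usr/local/omk/log/ophad.log",
  "ophad_streaming_apps": ["nmis", "opevents"]
}
""")

# Inline copy of /etc/nats-server.conf (NATS config is not JSON, so we
# check it as plain text)
nats_conf = '''
server_name: "opha-dev4.opmantek.net"
http_port: 8222
listen: 4222
jetstream: enabled
log_file: "/var/log/nats-server.log"
'''

# ophad logging enabled?
assert opcommon_opha["ophad_logfile"] == "/usr/local/omk/log/ophad.log"
# nats-server logging enabled?
assert any(line.strip().startswith("log_file:") for line in nats_conf.splitlines())
print("logging configured")
```
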
Debugging guide:

Scenario 1: ophad doesn’t come up

Check sudo journalctl -f -u ophad:

Code Block
shankarn@opha-dev2:/usr/local/omk/log$ sudo journalctl -f -u ophad
-- Journal begins at Fri 2024-09-06 16:23:19 AEST. --
Aug 01 10:15:59 opha-dev2 ophad[46242]: ophad v0.0.0: agent
Aug 01 10:16:01 opha-dev2 ophad[46242]: cannot init logger: cannot create logfile open /usr/local/omk/log/ophad.log: permission denied
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Main process exited, code=exited, status=1/FAILURE
Aug 01 10:16:01 opha-dev2 systemd[1]: ophad.service: Failed with result 'exit-code'.  

Edit /etc/systemd/system/ophad.service to remove the lines below:

Code Block
Type=simple
User=root
Group=root
A working service file for reference:

Code Block
cat /etc/systemd/system/ophad.service.bkup
[Unit]
Description=opHA daemon
After=network-online.target
Wants=network-online.target

[Service]
#on failure try to restart every RestartSec, upto StartLimitBurst times within StartLimitInterval
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=10

WorkingDirectory=/usr/local/omk
ExecStart=/usr/local/omk/bin/ophad agent --streaming-type=nats

[Install]

Reload systemd and restart ophad:

Code Block
sudo systemctl daemon-reload                                                    
sudo systemctl restart ophad

Scenario 4: Using the ophad command line to verify the configuration and connection status

Run the command sudo /usr/local/omk/bin/ophad verify on all the Peers/Primaries.

The last line, “ophad.verify: ready for liftoff 🚀”, indicates the configuration is good.


    Code Block
    shankarn@opha-dev5:~$ sudo /usr/local/omk/bin/ophad verify
    [sudo] password for shankarn:
    ophad v0.0.52: agent
    Appending to file "/usr/local/omk/log/ophad.log"
    Settings -----------------------------------------
      * ClusterId: 783d7b91-6c64-4db9-a28f-6364a54b8505
      * OMKDatabase:
        * ConnectionTimeout: 5h33m20s
        * RetryTimeout: 3m0s
        * PingTimeout: 33m20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: omk_shared
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 0
        * BatchTimeout: 0
      * NMISDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 1h23m20s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: nmisng
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OpEventsDatabase:
        * ConnectionTimeout: 2m0s
        * RetryTimeout: 3m0s
        * PingTimeout: 20s
        * QueryTimeout: 5m0s
        * Port: 27017
        * Server: localhost
        * MongoCluster: []
        * ReplicaSet: (blank)
        * Name: opevents
        * Username: opUserRW
        * Password: ******
        * WriteConcern: 1
        * Uri: (blank)
        * BatchSize: 50
        * BatchTimeout: 500
      * OMK:
        * LogLevel: info
        * BindAddr: *
      * Directories:
        * Base: /usr/local/omk
        * Conf: /usr/local/omk/conf
        * Logs: /usr/local/omk/log
        * Var: /usr/local/omk/var
      * OPHA:
        * DBName: opha
        * StreamingApps: [nmis opevents]
        * Logfile: /usr/local/omk/log/ophad.log
        * MongoWatchFilters: []
        * StreamType: nats
        * AgentPort: 6000
        * NonActiveTimeout: 8m0s
        * ResumeTokenCollection: resume_token
        * OpHACliPath: /usr/local/omk/bin/opha-cli.pl
        * Compression: true
        * Role: Poller
      * Consumer: false
      * Producer: false
      * ConsumerPollerSet: (blank)
      * DebugEnabled: false
      * Redis:
        * RedisServer: localhost
        * RedisPort: 6379
        * RedisPassword: ******
        * RetryTimeout: 3m0s
        * RedisStreamLenCheckPeriod: 5
        * RedisProducerMaxStreamLength: 10000
        * MaxRetries: 180
        * RedisTLSEnabled: false
        * RedisTLSSkipVerify: false
        * RedisProducerDegradeTimeout: 10
        * RedisProducerFullDegradeTimeout: 10
      * Kafka:
        * Seeds: localhost:63616,localhost:63627,localhost:63629
        * RetryTimeout: 3m0s
        * MaxRetries: 180
      * Nats:
        * NatsServer: opha-dev4.opmantek.net
        * NatsCluster: []
        * NatsPort: 4222
        * NatsNumReplicas: 1
        * NatsUsername: omkadmin
        * NatsPassword: ******
        * RetryTimeout: 3m0s
        * NatsStreamLenCheckPeriod: 5
        * NatsProducerMaxMsgPerSubject: 1000000
        * NatsMaxAge: 604800
        * MaxRetries: 180
        * NatsTLSEnabled: false
        * NatsTLSCert: <path>
        * NatsTLSKey: <path>
        * NatsTLSSkipVerify: false
        * NatsProducerDegradeTimeout: 10
        * NatsProducerFullDegradeTimeout: 10
      * Authentication:
        * AuthTokenKeys: ******
    --------------------------------------------------
    2025-10-22T08:01:46.329+1100 [INFO]  ophad.verify: verify nmis9 mongodb connection with database: name=nmisng
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: MongoDB NMIS connect: maybe="found nodes collection in nmis9 ✅"
    2025-10-22T08:01:46.451+1100 [INFO]  ophad.verify: verify omk mongodb connection with database: name=opha
    2025-10-22T08:01:46.551+1100 [INFO]  ophad.verify: MongoDB OMK connect: maybe="found opstatus collection in omk database ✅"
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: Nats connect:
      result=
      | can connect to nats-server: opha-dev4.opmantek.net version: 2.11.9 ✅
      | we can connect to Nats-server ✅
    2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀

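To script this verification across all the Peers/Primaries, one can check the command output for the success marker shown above (a sketch; running the command remotely, e.g. over ssh, is not shown):

```python
def verify_succeeded(output: str) -> bool:
    """Return True if `ophad verify` reported success.

    The success marker is the final log line shown in the output above;
    checking for it lets the verify step be scripted for each peer.
    """
    return any("ready for liftoff" in line for line in output.splitlines())

# Last line of the sample output above
sample = "2025-10-22T08:01:46.575+1100 [INFO]  ophad.verify: ready for liftoff 🚀"
print(verify_succeeded(sample))   # True
```
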
     
