Hi, everybody.
Here is our plan for Friday.
1. Stop all incoming traffic to all cluster nodes (we'll hopefully have a 2-hour maintenance window arranged, as previously requested on the call)
2. Create a backup of whichever of the current fully functional nodes holds the most complete data (we could estimate this from the output of `# dsreplication status`, or with a command like this executed on each node: `# /opt/opendj/bin/ldapsearch -T -h 127.0.0.1 -p 1636 -Z -X -s sub -D 'cn=directory manager' -b 'o=gluu' -j /tmp/.dpw '(objectclass=gluuPerson)' 1.1 | grep -v '^$' | wc -l`)
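To compare node completeness mechanically, the per-node counts from the `ldapsearch` pipeline above can be wrapped in a small helper (a sketch; `count_entries` is a hypothetical name — with the `1.1` attribute-list the search output is essentially one `dn:` line per matching entry, so counting those lines counts entries):

```shell
# Hypothetical helper: count entries in LDIF arriving on stdin.
# With "1.1" in the search, output is one "dn:" line per entry.
count_entries() {
  grep -c '^dn:'
}

# Example usage on each node (same search as above):
# /opt/opendj/bin/ldapsearch -T -h 127.0.0.1 -p 1636 -Z -X -s sub \
#   -D 'cn=directory manager' -b 'o=gluu' -j /tmp/.dpw \
#   '(objectclass=gluuPerson)' 1.1 | count_entries
```

The node reporting the largest count would be the backup source.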
3. Stop replication on all still functioning nodes with `# /opt/opendj/bin/dsreplication disable --disableAll -I 'admin' -w 'REPLICATION_ADMIN_PASS' --trustAll --no-prompt`
4. Decommission nodes 3 and 4, where OpenDJ isn't able to start (either remove the Gluu Server package from them manually and restart them, or, even better, spin up two fresh VMs of the same size)
5. Re-enable replication between the two still running nodes (nodes 1 and 2); hopefully, we'll be able to achieve that with Cluster Manager's UI, but just in case I'm dropping the manual commands in here too:
- `# /opt/opendj/bin/dsreplication enable -I 'admin' -w 'REPLICATION_ADMIN_PASS' -b 'o=gluu' -h hostname.or.ip.of.this.node -p 4444 -D 'cn=directory manager' --bindPassword1 'LDAP_PASS_OF_INSTANCE_ON_THIS_NODE' -r 8989 -O hostname.or.ip.of.the.other.node --port2 4444 --bindDN2 'cn=directory manager' --bindPassword2 'LDAP_PASS_OF_INSTANCE_ON_THE_OTHER_NODE' -R 8989 --secureReplication1 --secureReplication2 -X -n`
- `# /opt/opendj/bin/dsreplication initialize --baseDN "o=gluu" --adminUID admin -w 'REPLICATION_ADMIN_PASS' --hostSource 172.31.90.19 --portSource 4444 --hostDestination 172.31.27.183 --portDestination 4444 -X -n`
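After the enable/initialize pair, it's worth confirming both servers show up as connected before moving on (same admin credentials as above; can be run from either node):

```shell
# Check replication state after re-enabling; both nodes should be listed,
# with matching entry counts and no missing changes.
/opt/opendj/bin/dsreplication status -I 'admin' -w 'REPLICATION_ADMIN_PASS' \
  -h 127.0.0.1 -p 4444 -X -n
```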
6. Plan A: We could stop here and let it run for a couple of days, to make sure it's stable with just two nodes, before adding more.
7. Plan B: Add the other two nodes, one by one, using CM's web UI (Gluu Server will be installed there at the same time); if done later, this won't require an additional downtime window
8. Put the nodes behind the LB as well and smoke-test the most critical flows
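For the smoke test, a quick first check through the LB could be the standard OpenID Connect discovery endpoint that Gluu serves (the LB hostname below is a placeholder):

```shell
# -f makes curl exit non-zero on HTTP errors, so a failure here
# means the most basic oxAuth flow is broken.
curl -fsS https://lb.example.com/.well-known/openid-configuration >/dev/null \
  && echo 'discovery OK'
```

Logging in through oxTrust and running a full authorization-code flow against oxAuth would still need to be checked by hand.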
To create the backup, we will export all data under the "o=gluu" branch ("o=site" as well, if Cache Refresh is used there) with `export-ldif` or `ldapsearch`.
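A sketch of the `export-ldif` variant (run inside the Gluu chroot while OpenDJ is stopped; the backend ID is assumed to be `userRoot` here — verify with `/opt/opendj/bin/list-backends` first):

```shell
# Offline export of the o=gluu branch; repeat with
# --includeBranch 'o=site' if Cache Refresh is in use.
/opt/opendj/bin/export-ldif --backendID userRoot \
  --includeBranch 'o=gluu' --ldifFile /root/gluu-backup.ldif
```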