By: Florian Pühs user 01 Oct 2018 at 4:06 p.m. CDT

Hi,

We are trying to set up a cluster on AWS using Cluster Manager 3.1.3-11. All instances (manager and identity providers) use ami-086a09d5b9fa35dc7 (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20180912), updated to the current patch level after creation.

The installation of Cluster Manager and the subsequent deployment of the primary and replica servers through Cluster Manager are successful. The next step, Replication (no NGINX, since we use an external LB), fails. All steps related to the primary server seem to be OK, but the "Securing replication on server replica" step on the replica server shows two errors:

1) The OpenDJ binary version '3.0.1.c5ad2e4846d8aeb501ffdfe5ae2dfd35136dfa68' does not match the installed version '3.0.0.92065c7762cf1f59042fbc5d71993c13bb9023fa'. Please run upgrade before continuing
2) Connection to LDAPserver as directory manager at port 1636 has failed: invalid server address

This results in "Ending server setup process." and no replication being enabled on the replica.

The Security Groups are configured based on the diagram and port list here: https://gluu.org/docs/cm/installation/ Ports 4444 and 8989 are open between the identity providers (which also seems to be OK, with no errors during setup), and 1636 is open inbound on both identity providers from each other and from the Cluster Manager, but it still throws the error above. The IPs used for the servers are the internal IPs, as suggested by the documentation. The behaviour is consistent across three separate install/deployment attempts.

The two questions would be:

1) What is the cause of the OpenDJ version mismatch, and is there a way to fix it manually?
2) Are we missing something in the server setup that causes the "invalid server address" error on port 1636? Which server is the setup trying to reach, and from which server? Maybe there is some name resolution issue that's not evident? From the "Errors were found. Fix them in the server and refresh this page to try again." message it should be fixable, but it's not evident where.

Thank you!

By Chris Blanton user 01 Oct 2018 at 4:12 p.m. CDT

Can you tell me what version of Cluster Manager you're using? 3.1.3-11 is the latest.

By Chris Blanton user 01 Oct 2018 at 4:29 p.m. CDT

> using Cluster Manager 3.1.3-11.

Disregard, I missed this part.

By Chris Blanton user 01 Oct 2018 at 4:39 p.m. CDT

> 2) Connection to LDAPserver as directory manager at port 1636 has failed: invalid server address, thus resulting in "Ending server setup process." and no replication enabled on the replica.

Are the servers able to resolve each other's hostnames, i.e. their FQDNs? If not, you need to add them inside the chroot's `/etc/hosts` file so they can access each other by name.
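For example, something along these lines in the chroot's hosts file on each node (the path assumes the default 3.1.3.1 chroot location, and the IPs/FQDNs below are just placeholders for your actual values):

```
# /opt/gluu-server-3.1.3.1/etc/hosts -- placeholder addresses and names
10.0.0.11   idp1.example.org
10.0.0.12   idp2.example.org
```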

By Florian Pühs user 02 Oct 2018 at 12:36 a.m. CDT

The two identity providers both have the entries in the chroot hosts file:

_ip of idp1_ idp1.FQDN
_ip of idp2_ lb.FQDN idp2.FQDN

So each has its own entry plus the IP of the other identity provider, and that same IP is also used for the load balancer FQDN given during setup. They can also see each other. All the steps pass during the replication setup except the errors in the "Securing replication on server idp2.FQDN" section.

What's also interesting is that the status table at the end, in the "Checking replication status" section, shows "Replication enabled" as true for all four rows, but "Security (4)" is false for the o=gluu and o=site rows on idp2. Then it says right after that, at the bottom of the page: "Errors were found. Fix them in the server and refresh this page to try again."

By Chris Blanton user 02 Oct 2018 at 10:27 a.m. CDT

> IP of the other identity provider and the same IP also for the load balancer FQDN given during the setup.

This won't work. The load balancer has to be separate from the Gluu Servers.

> 1) The OpenDJ binary version '3.0.1.c5ad2e4846d8aeb501ffdfe5ae2dfd35136dfa68' does not match the installed version '3.0.0.92065c7762cf1f59042fbc5d71993c13bb9023fa'. Please run upgrade before continuing

For some reason the OpenDJ versions are different. We upgraded our fork of OpenDJ to 3.0.1 to allow for BCrypt password checking, but it seems the second server didn't get the upgrade from Cluster Manager. This is why the security step failed. We're looking into it.

Also, please note that use of Cluster Manager requires a [Gluu Support](https://github.com/GluuFederation/cluster-mgr/blob/master/LICENSE) contract when used in production.

By Florian Pühs user 02 Oct 2018 at 12:35 p.m. CDT

> This won't work. The load balancer has to be separate from the Gluu Servers.

This is how Cluster Manager did it; we didn't change or enter anything manually. Following the documentation we chose the external LB option, which only asked for the FQDN, and that was entered. It also asked for the proxy FQDN and IP, which were entered, but it didn't seem to do anything with them at this stage, so we're not sure if it will; the documentation is not clear on whether any manual setup is required from us. After entering the data and filling out the details for the primary and replica, the installation went through, and the hosts file in the chroot on both idp1 and idp2 was filled by Cluster Manager as given above.

The failure to connect is also bizarre: we can log in to the chroot of idp2 and see the server (idp1) from there. This is also why we asked which server is looking for which server here, because both can see and reach each other. Double-checked the ports as well and they're open for both from both, plus from the Cluster Manager. They can also be reached from within the chroot (we used telnet for that; a rough sketch of those checks is below).

For the OpenDJ issue: we've performed three installations from scratch in a span of a couple of hours on Monday.

> please note that use of Cluster Manager requires a Gluu Support contract when used in production.

Yes, this also matches our policy of having a support contract for production systems, so no issues there. We'd like to get through a successful setup first though; there were other issues during POC and review that need to be clarified from an operational point of view, but we'll open a separate ticket for those. We'd just like to get through the issues here first; after three tries it doesn't seem useful to do the same from-scratch install again just to get the same result.
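For reference, the connectivity checks we ran from inside the chroot were roughly the following (the hostname is a placeholder for our real FQDN); the same was done in the other direction from idp1:

```
# run from inside the chroot on idp2 -- idp1.example.org stands in for the real FQDN
telnet idp1.example.org 1636   # LDAPS
telnet idp1.example.org 4444   # OpenDJ admin port
telnet idp1.example.org 8989   # replication port
```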

By Chris Blanton user 02 Oct 2018 at 1:29 p.m. CDT

> For the OpenDJ issue: we've performed three installations from scratch in a span of a couple of hours on Monday.

We're currently looking into it. Can you send me the output of `cat /opt/gluu-server-3.1.3.1/opt/opendj/config/buildinfo` on each node in your cluster?

By Florian Pühs user 02 Oct 2018 at 1:39 p.m. CDT

idp1: 3.0.0.92065c7762cf1f59042fbc5d71993c13bb9023fa

idp2: 3.0.1.c5ad2e4846d8aeb501ffdfe5ae2dfd35136dfa68

By Chris Blanton user 02 Oct 2018 at 1:53 p.m. CDT

Did you manually install the first server (idp1), or did Cluster Manager install it? I'm trying to work out why idp1 wasn't given the updated OpenDJ files.

By Florian Pühs user 02 Oct 2018 at 2:07 p.m. CDT

Everything was done through Cluster Manager as per the documentation here: https://gluu.org/docs/cm/installation/

We checked "This is an external load balancer", so we only filled in the hostname there, plus the "Cache Proxy Hostname" and "Cache Proxy IP Address", and also checked the "Add IP Addresses and hostnames to /etc/hosts file on each server" option. The primary and replica were both installed with no errors. Everything went fine up until the "Once completed, repeat the process for the other servers in the cluster." sentence in the docs. We skipped the NGINX part and went to Replication, which ended with the results in the original post.

As far as we understood, the product is supported on Ubuntu 16.04 and also in AWS deployments, just based on the documentation. Maybe there is a specific AMI you know is working? Or is it supposed to work with the one we've used? We're not opposed to giving this another go from scratch, but it would be nice to have confirmation that the AMI version and the Cluster Manager version we've used work for a 2-node deployment. We'll have to set it up from scratch again anyway to have confidence in the solution for a production deployment.

By Chris Blanton user 02 Oct 2018 at 2:21 p.m. CDT

> As far as we understood, the product is supported on Ubuntu 16.04 and also in AWS deployments, just based on the documentation. Maybe there is a specific AMI you know is working? Or is it supposed to work with the one we've used?

Yes, I don't think this is an AWS or AMI issue at this juncture. I'm prying so I can attempt to recreate the issue and issue a patch. You can see here that the build versions are different for some reason:

> idp1: 3.0.0.92065c7762cf1f59042fbc5d71993c13bb9023fa
> idp2: 3.0.1.c5ad2e4846d8aeb501ffdfe5ae2dfd35136dfa68

Cluster Manager should have configured the first server to be the same version as idp2, but didn't for some reason. I will report back with my findings.

By Chris Blanton user 02 Oct 2018 at 2:29 p.m. CDT

Are you using Gluu Server 3.1.3.1 or 3.1.3?

By Florian Pühs user 02 Oct 2018 at 3:04 p.m. CDT

It's gluu-server-3.1.3.1 on both nodes; that's what Cluster Manager pulled.

By Chris Blanton user 02 Oct 2018 at 3:05 p.m. CDT

Thank you. Testing now.

By Chris Blanton user 02 Oct 2018 at 4:45 p.m. CDT

Pushing a patch. Stand by and I'll update you.

By Chris Blanton user 02 Oct 2018 at 4:55 p.m. CDT

Cluster Manager 3.1.3-12 released. To upgrade from scratch, do the following steps as the root user that installed Cluster Manager:

```
rm -rf ~/.clustermgr
clustermgr-cli stop
pip uninstall clustermgr
pip install --no-cache-dir clustermgr
clustermgr-cli start
```

This is the [commit](https://github.com/GluuFederation/cluster-mgr/commit/a842a492690c1a8b95d358ffd030718d2cbe16fb) that fixed the version mismatch issue.
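Once that's done, you can verify the new version is in place with something like the following (the exact version string in the output may differ slightly):

```
pip show clustermgr | grep -i version   # should reflect the 3.1.3-12 release
```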

By Chris Blanton user 02 Oct 2018 at 4:57 p.m. CDT

I would also remove the Gluu Server installation from the nodes with `apt purge gluu-server-3.1.3.1`.

By Florian Pühs user 02 Oct 2018 at 5:48 p.m. CDT

OK, the patch has solved the OpenDJ issue; only the other one remains. In the *Securing replication on server idp2.FQDN* section the two ox-ldap.properties file entries are OK (green), but the errors mentioned originally remain:

- Connection to LDAPserver as directory manager at port 1636 has failed: invalid server address
- Ending server setup process.

The rest of the sections are fine, including *Checking replication status*, where both Replication and Security are true for all four rows. The two errors above of course still result in a red Retry button at the bottom of the page.

I'm not sure what it's looking for there. If I log in to the chroot on both nodes, the entries are in the hosts file and the port is open between the nodes.

By Florian Pühs user 02 Oct 2018 at 6:06 p.m. CDT

Meant to ask: can we install a specific Cluster Manager version? Or, if this helps in any way, Cluster Manager 3.1.3-09, which installed Gluu 3.1.3, was working fine with no errors.

By Chris Blanton user 03 Oct 2018 at 1:45 p.m. CDT

> Meant to ask: can we install a specific Cluster Manager version? Or, if this helps in any way, Cluster Manager 3.1.3-09, which installed Gluu 3.1.3, was working fine with no errors.

`pip install -I clustermgr==3.1.3-09`

Please be aware, this older version doesn't have Gluu Server 3.1.3.1, which is pre-patched for a security vulnerability. Please follow [these patching instructions](https://gluu.org/docs/ce/upgrade/patches/#code-white-patch) to rectify the security vulnerability if you choose to continue down this path.

> I'm not sure what it's looking for there. If I log in to the chroot on both nodes, the entries are in the hosts file and the port is open between the nodes.

By Florian Pühs user 03 Oct 2018 at 1:52 p.m. CDT

We were just looking for ways to keep consistency; the 3.1.3-09 comment was only related to the fact that both Cluster Manager and the node deployment, with all other features, were working with that combination. If there is no further info on what's causing the invalid server address error on port 1636, then we'll just go back and try a new deployment using 3.1.3-10 first, and if that's also not OK, then 09 and the patching.

By Chris Blanton user 03 Oct 2018 at 2 p.m. CDT

> If there is no further info on what's causing the invalid server address error on port 1636, then we'll just go back and try a new deployment using 3.1.3-10 first, and if that's also not OK, then 09 and the patching.

I'll keep looking into this particular issue, as I haven't experienced it in testing so far.

> I'm not sure what it's looking for there. If I log in to the chroot on both nodes, the entries are in the hosts file and the port is open between the nodes.

I meant to reply to this before I hit post, but the mechanism in Cluster Manager checks `stderr` for errors and assumes that error is breaking the installation, since the complicated nature of systems is hard to predict. That being said, we have a multitude of catches for non-breaking errors, which yours may be part of, but we've never seen it before.

I'm assuming you have replication enabled already. Will you run the following inside one of the Gluu Server terminals:

```
/opt/opendj/bin/dsreplication status -n -X -p 1444 -I admin -w secret
```

and share the output?

By Florian Pühs user 03 Oct 2018 at 2:52 p.m. CDT

It's the same table as in the last section, *"Checking replication status"*, on the replication setup page:

```
Suffix DN : Server         : Entries : Replication enabled : DS ID : RS ID : RS Port (1) : M.C. (2) : A.O.M.C. (3) : Security (4)
----------:----------------:---------:---------------------:-------:-------:-------------:----------:--------------:-------------
o=gluu    : idp1.FQDN:4444 : 164     : true                : 28278 : 10843 : 8989        : 0        :              : true
o=gluu    : idp2.FQDN:4444 : 164     : true                : 3656  : 13748 : 8989        : 0        :              : true
o=site    : idp1.FQDN:4444 : 2       : true                : 1326  : 10843 : 8989        : 0        :              : true
o=site    : idp2.FQDN:4444 : 2       : true                : 8173  : 13748 : 8989        : 0        :              : true
```

Everything seems fine and enabled, so I'm not sure if that *"Ending server setup process."* message actually means anything. Maybe the issue is specific to the AMI we used, and that's why it's not common?

By Chris Blanton user 03 Oct 2018 at 4:50 p.m. CDT

Okay, so replication is enabled and working as expected. From what I can tell, the `Connection to LDAPserver as directory manager at port 1636 has failed: invalid server address` error means Cluster Manager was not able to connect to the Gluu Server at the end to check replication status. This generally happens if the OpenDJ server hasn't restarted yet.
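If you hit it again, a quick way to check from inside the Gluu Server chroot whether OpenDJ is back up and listening before retrying would be something like this (a rough sketch; adjust to your setup):

```
# inside the Gluu Server chroot on the affected node
ps -ef | grep [o]pendj      # is the OpenDJ java process running?
ss -tln | grep 1636         # is anything listening on the LDAPS port?
```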

By Florian Pühs user 08 Oct 2018 at 5:55 p.m. CDT

Sorry, I was unavailable for a while. What's weird is that it fails so consistently if the reason is just that the OpenDJ server hasn't restarted yet. It happened through several redeployment attempts on new instances, and it also wasn't an issue before. We'll try a new setup in the coming days and see where that goes.

By Florian Pühs user 09 Oct 2018 at 5:58 p.m. CDT

Update: the latest 3.1.4-01 Cluster Manager and Gluu 3.1.4 went through the installation fine. The last "invalid server address" error might have occurred because a server address could not be resolved. As there was no information on which server is looking for which, we simply added the entries for all servers to the hosts files of both the host and the chroot. That solved the issue and Replication could be set up. So we still can't say which entry was missing, but it's not failing at that step anymore.
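For anyone running into the same thing, what we added on each node is roughly the following, in both the host's hosts file and the chroot's (the chroot path assumes the default 3.1.4 location; the IPs and FQDNs are placeholders for the real values):

```
# /etc/hosts on the host and /opt/gluu-server-3.1.4/etc/hosts in the chroot -- placeholder values
10.0.0.11   idp1.example.org
10.0.0.12   idp2.example.org
```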

By Florian Pühs user 10 Oct 2018 at 5:12 a.m. CDT

We can close this one; the latest releases solved the installation and replication issues. Thank you!