By: Robert Dinga named 07 May 2018 at 9:34 p.m. CDT

9 Responses
Robert Dinga gravatar
We're running a 3 server Gluu cluster. Each server has 4 CPU and 8GB RAM. Each server has OpenDJ (installed with Gluu) and the servers were clustered following the Gluu documentation. Prior to today, we would see EntryPersistenceExceptions due to 300000ms (5 min) timeouts in the oxAuth logs two to three times daily. These were usually Gluu searching fo the appliance entry or an attempt to modify the admin account entry, but also seen trying to add a metric entry. Starting today, we're now seeing EntryPersistenceExceptions every 10 minutes on one server where Gluu is attempting to add a metric entry. Looking at the OpenDJ access log around the time 5 minutes before the exceptions were logged, I do not see the relevant searches, modifies, or adds logged, as if the requests didn't even make it to OpenDJ. The OpenDJ instances are located on the same servers running Gluu, so networking shouldn't be a problem across localhost. Example appliance search exception: ``` 2018-05-04 19:12:18,056 ERROR [ForkJoinPool.commonPool-worker-2] [org.xdi.oxauth.service.AppInitializer] (AppInitializer.java:554) - Failed to load appliance entry from Ldap org.gluu.site.ldap.persistence.exception.EntryPersistenceException: Failed to find entry: inum=@!AF68.4445.CF47.B698!0002!D44B.63D0,ou=appliances,o=gluu at org.gluu.site.ldap.persistence.LdapEntryManager.find(LdapEntryManager.java:303) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.AbstractEntryManager.find(AbstractEntryManager.java:444) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.AbstractEntryManager.find(AbstractEntryManager.java:381) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.LdapEntryManager$Proxy$_$$_WeldClientProxy.find(Unknown Source) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.xdi.oxauth.service.AppInitializer.loadAppliance(AppInitializer.java:552) [classes/:?] at org.xdi.oxauth.service.AppInitializer.getActualDefaultAuthenticationMethod(AppInitializer.java:527) [classes/:?] at org.xdi.oxauth.service.AppInitializer.setDefaultAuthenticationMethod(AppInitializer.java:514) [classes/:?] at org.xdi.oxauth.service.AppInitializer.reloadConfiguration(AppInitializer.java:264) [classes/:?] at org.xdi.oxauth.service.AppInitializer.reloadConfigurationTimerEvent(AppInitializer.java:244) [classes/:?] at org.xdi.oxauth.service.AppInitializer$Proxy$_$$_WeldSubclass.reloadConfigurationTimerEvent$$super(Unknown Source) [classes/:?] at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112] at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51) [weld-core-impl-3.0.0.Final.jar:3.0.0.Final] at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:78) [weld-core-impl-3.0.0.Final.jar:3.0.0.Final] at org.xdi.service.cdi.async.AsynchronousInterceptor$1.get(AsynchronousInterceptor.java:36) [oxcore-service-3.1.2.Final.jar:?] at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) [?:1.8.0_112] at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582) [?:1.8.0_112] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_112] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_112] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_112] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_112] Caused by: org.gluu.site.ldap.exception.ConnectionException: Failed to lookup entry --> A client-side timeout was encountered while waiting 300000ms for a response to search request with message ID 105, base DN 'inum=@!AF68.4445.CF47.B698!0002!D44B.63D0,ou=appliances,o=gluu', scope BASE, and filter '(objectClass=*)' from server localhost:1636. at org.gluu.site.ldap.OperationsFacade.lookup(OperationsFacade.java:513) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.LdapEntryManager.find(LdapEntryManager.java:297) ~[oxcore-ldap-3.1.2.Final.jar:?] ... 21 more Caused by: com.unboundid.ldap.sdk.LDAPSearchException: A client-side timeout was encountered while waiting 300000ms for a response to search request with message ID 105, base DN 'inum=@!AF68.4445.CF47.B698!0002!D44B.63D0,ou=appliances,o=gluu', scope BASE, and filter '(objectClass=*)' from server localhost:1636. at com.unboundid.ldap.sdk.SearchRequest.process(SearchRequest.java:1193) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.LDAPConnection.search(LDAPConnection.java:3635) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.LDAPConnection.getEntry(LDAPConnection.java:1712) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.AbstractConnectionPool.getEntry(AbstractConnectionPool.java:611) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at org.gluu.site.ldap.OperationsFacade.lookup(OperationsFacade.java:510) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.LdapEntryManager.find(LdapEntryManager.java:297) ~[oxcore-ldap-3.1.2.Final.jar:?] ... 21 more ``` oxAuth log 5 minutes prior: ``` 2018-05-04 19:07:18,000 DEBUG [oxAuthScheduler_Worker-1] [org.xdi.service.timer.TimerJob] (TimerJob.java:34) - Fire timer event [org.xdi.service.cdi.event.ConfigurationEvent] with qualifiers [@org. xdi.service.cdi.event.Scheduled()] 2018-05-04 19:07:18,001 DEBUG [oxAuthScheduler_Worker-1] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:62) - Bound request ended 2018-05-04 19:07:18,029 DEBUG [oxAuthScheduler_Worker-3] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:52) - Bound request started 2018-05-04 19:07:18,029 DEBUG [oxAuthScheduler_Worker-3] [org.xdi.service.timer.TimerJob] (TimerJob.java:34) - Fire timer event [org.xdi.oxauth.service.cdi.event.CleanerEvent] with qualifiers [@org .xdi.service.cdi.event.Scheduled()] 2018-05-04 19:07:18,031 DEBUG [oxAuthScheduler_Worker-3] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:62) - Bound request ended 2018-05-04 19:07:18,031 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:138) - Start AuthorizationGrant clean up 2018-05-04 19:07:18,037 TRACE [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.GrantService] (GrantService.java:184) - Removed token from LDAP, code: <redacted> 2018-05-04 19:07:18,040 TRACE [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.GrantService] (GrantService.java:175) - Removed grant, id: <redacted> 2018-05-04 19:07:18,041 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:140) - End AuthorizationGrant clean up 2018-05-04 19:07:18,041 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:144) - Start Client clean up 2018-05-04 19:07:18,043 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:176) - End Client clean up 2018-05-04 19:07:18,045 DEBUG [oxAuthScheduler_Worker-4] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:52) - Bound request started 2018-05-04 19:07:18,045 DEBUG [oxAuthScheduler_Worker-4] [org.xdi.service.timer.TimerJob] (TimerJob.java:34) - Fire timer event [org.xdi.oxauth.service.cdi.event.AuthConfigurationEvent] with qualifiers [@org.xdi.service.cdi.event.Scheduled()] 2018-05-04 19:07:18,046 DEBUG [oxAuthScheduler_Worker-4] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:62) - Bound request ended 2018-05-04 19:07:18,046 TRACE [ForkJoinPool.commonPool-worker-2] [org.xdi.oxauth.service.AppInitializer] (AppInitializer.java:256) - Attempting to use ldapEntryManager: org.gluu.site.ldap.OperationsFacade@7c2b2b56 2018-05-04 19:07:18,047 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:180) - Start U2F request clean up 2018-05-04 19:07:18,048 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:207) - End U2F request clean up 2018-05-04 19:07:18,048 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:211) - Start U2F request clean up 2018-05-04 19:07:18,065 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:239) - End U2F request clean up 2018-05-04 19:07:18,065 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:243) - Start metric entries clean up 2018-05-04 19:07:18,456 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:253) - End metric entries clean up ``` Example metric add exception: ``` 2018-05-08 02:32:19,473 ERROR [metrics-ldap-reporter-1-thread-1] [com.codahale.metrics.ScheduledReporter$1] (ScheduledReporter.java:119) - RuntimeException thrown from LdapEntryReporter#report. Exception was suppressed. org.gluu.site.ldap.persistence.exception.EntryPersistenceException: Failed to persist entry: uniqueIdentifier=1525460236860,ou=201805,ou=oxauth,ou=@!AF68.4445.CF47.B698!0002!D44B.63D0,ou=metric,o=@!AF68.4445.CF47.B698!0001!5E04.A6B1,o=gluu at org.gluu.site.ldap.persistence.LdapEntryManager.persist(LdapEntryManager.java:124) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.AbstractEntryManager.persist(AbstractEntryManager.java:90) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.LdapEntryManager$Proxy$_$$_WeldClientProxy.persist(Unknown Source) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.xdi.service.metric.MetricService.add(MetricService.java:139) ~[oxcore-service-3.1.2.Final.jar:?] at org.xdi.oxauth.service.MetricService$Proxy$_$$_WeldSubclass.add(Unknown Source) ~[classes/:?] at org.xdi.oxauth.service.MetricService$Proxy$_$$_WeldClientProxy.add(Unknown Source) ~[classes/:?] at org.xdi.service.metric.LdapEntryReporter.reportImpl(LdapEntryReporter.java:192) ~[oxcore-service-3.1.2.Final.jar:?] at org.xdi.service.metric.LdapEntryReporter.report(LdapEntryReporter.java:161) ~[oxcore-service-3.1.2.Final.jar:?] at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.2.jar:3.1.2] at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) [metrics-core-3.1.2.jar:3.1.2] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_112] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_112] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_112] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_112] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112] at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112] Caused by: com.unboundid.ldap.sdk.LDAPException: A client-side timeout was encountered while waiting 300000ms for a response to add request with message ID 25 for entry 'uniqueIdentifier=1525460236860,ou=201805,ou=oxauth,ou=@!AF68.4445.CF47.B698!0002!D44B.63D0,ou=metric,o=@!AF68.4445.CF47.B698!0001!5E04.A6B1,o=gluu' from server localhost:1636. at com.unboundid.ldap.sdk.AddRequest.handleResponse(AddRequest.java:1268) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.AddRequest.process(AddRequest.java:1035) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.LDAPConnection.add(LDAPConnection.java:1950) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.AbstractConnectionPool.add(AbstractConnectionPool.java:752) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at com.unboundid.ldap.sdk.AbstractConnectionPool.add(AbstractConnectionPool.java:683) ~[unboundid-ldapsdk-3.2.0.jar:3.2.0] at org.gluu.site.ldap.OperationsFacade.addEntry(OperationsFacade.java:533) ~[oxcore-ldap-3.1.2.Final.jar:?] at org.gluu.site.ldap.persistence.LdapEntryManager.persist(LdapEntryManager.java:119) ~[oxcore-ldap-3.1.2.Final.jar:?] ... 16 more ``` ... and 5 minutes prior: ``` 2018-05-08 02:27:18,028 DEBUG [oxAuthScheduler_Worker-1] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:52) - Bound request started 2018-05-08 02:27:18,028 DEBUG [oxAuthScheduler_Worker-1] [org.xdi.service.timer.TimerJob] (TimerJob.java:34) - Fire timer event [org.xdi.oxauth.service.cdi.event.CleanerEvent] with qualifiers [@org .xdi.service.cdi.event.Scheduled()] 2018-05-08 02:27:18,029 DEBUG [oxAuthScheduler_Worker-1] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:62) - Bound request ended 2018-05-08 02:27:18,029 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:138) - Start AuthorizationGrant clean up 2018-05-08 02:27:18,045 DEBUG [oxAuthScheduler_Worker-2] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:52) - Bound request started 2018-05-08 02:27:18,046 DEBUG [oxAuthScheduler_Worker-2] [org.xdi.service.timer.TimerJob] (TimerJob.java:34) - Fire timer event [org.xdi.oxauth.service.cdi.event.AuthConfigurationEvent] with qualif iers [@org.xdi.service.cdi.event.Scheduled()] 2018-05-08 02:27:18,046 DEBUG [oxAuthScheduler_Worker-2] [org.xdi.service.timer.RequestJobListener] (RequestJobListener.java:62) - Bound request ended 2018-05-08 02:27:18,046 TRACE [ForkJoinPool.commonPool-worker-2] [org.xdi.oxauth.service.AppInitializer] (AppInitializer.java:256) - Attempting to use ldapEntryManager: org.gluu.site.ldap.Operation sFacade@7c2b2b56 2018-05-08 02:27:18,596 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:140) - End AuthorizationGrant clean up 2018-05-08 02:27:18,596 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:144) - Start Client clean up 2018-05-08 02:27:18,796 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:176) - End Client clean up 2018-05-08 02:27:19,397 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:180) - Start U2F request clean up 2018-05-08 02:27:19,493 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:207) - End U2F request clean up 2018-05-08 02:27:19,493 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:211) - Start U2F request clean up 2018-05-08 02:27:19,495 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:239) - End U2F request clean up 2018-05-08 02:27:19,496 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:243) - Start metric entries clean up 2018-05-08 02:27:20,098 DEBUG [ForkJoinPool.commonPool-worker-1] [org.xdi.oxauth.service.CleanerTimer] (CleanerTimer.java:253) - End metric entries clean up ```

By Mohib Zico Account Admin 09 May 2018 at 4:47 a.m. CDT

Mohib Zico gravatar
As you have sufficient memory ( 8GB ) each; you can try to tweak allocated memory for identity,oxauth,idp and opendj. - identity/oxauth/idp allocated memory can be done inside container:/etc/defaults/ - openDJ one can be done from java.properties Make sure to stop/start container after those modification. Let's see how that goes.

By Robert Dinga named 14 May 2018 at 2:35 p.m. CDT

Robert Dinga gravatar
Hi Mohib, I increased the JVM memory on OpenDJ to 3gb (Xms and Xmx) and on oxAuth to 1788mb (added 500mb to Xmx). We don't have idp installed and Identity is barely used. We are still seeing the 5 minute timeout exceptions as above on all 3 of our servers. They seem to be less frequent, but are still happening. Is this still not enough memory? The userRoot db is only ~291MB and this environment isn't getting much use, yet. I'm worried if I keep going higher, I won't be leaving much for other applications - redis, apache, monitoring, etc - or the system. Is there a recommended setting for these JVMs in a clustered environment? With these memory issues showing up under light usage, how much memory should each server have to accommodate heavy use?

By Mohib Zico Account Admin 15 May 2018 at 12:13 a.m. CDT

Mohib Zico gravatar
Hello Robert, - Are you using cluster-mgr or manual cluster config? - For sanity checking... let's build 2 servers cluster instead of 3.

By Robert Dinga named 18 May 2018 at 9:01 a.m. CDT

Robert Dinga gravatar
We're manually configuring the cluster per Gluu documentation [here](https://gluu.org/docs/ce/3.1.2/installation-guide/cluster/). I can create a separate environment with 2 servers, but that will take some time to build out by hand. Is there a reason Gluu wouldn't support 3 servers in a cluster? OpenDJ and Redis certainly would support far greater than 3 instances each.

By Mohib Zico Account Admin 18 May 2018 at 9:14 a.m. CDT

Mohib Zico gravatar
It should support but personally I haven't worked with any customer who are using more than 2 instances. None of them have faced any issue like that. That's why requested you to cross check.

By Robert Dinga named 25 May 2018 at 10:30 a.m. CDT

Robert Dinga gravatar
Hi Mohib, We built a 2-server cluster (2CPU 4GB RAM each, default JVM memory settings) earlier this week and have been monitoring it, but have not seen the 5 minute timeout errors so far. Granted, the only activity has been me logging into oxtrust to turn on trace logging in oxauth, but we were seeing the issue on silent clusters before. Does this indicate something in Gluu only works in a two-server cluster? If so, could splitting apart some of the components offer some solutions? Maybe split OpenDJ into its own 2-server replica set with numerous Gluu "web servers" pointing to it? I'm kind of grasping at straws here, but we're not comfortable only running two servers in our solution. I would much rather have the option of spreading servers across physical locations for the reliability/durability. I would also rather scale horizontally than vertically as load increases. I'm hoping there's a way to achieve this with Gluu.

By Chris Blanton user 25 May 2018 at 4:25 p.m. CDT

Chris Blanton gravatar
Robert, Let me try to replicate this issues and figure out where issues may lie. For clarification, is this issue with oxAuth throwing errors and timeouts service breaking or something you've noticed in the logs? This will help me figure out how to mitigate issues going forward.

By Robert Dinga named 30 May 2018 at 7:36 a.m. CDT

Robert Dinga gravatar
Hi Chris, The issue isn't service breaking (nothing falls down), but it can be noticeable by the user depending on which LDAP call gets "lost". I have seen Apache return a 502 30s after the LDAP call was attempted, or 4m30s before the 300000ms timeout error appears in the log, which is also seen in the Nginx logs and in the UI. This has happened a couple times on the call to update oxLastLogonTime during user login. Most of the errors come from the Gluu metrics inserts though, like the above logs. I have not seen any issues with OpenDJ locking up like OpenLDAP did when we were seeing similar issues back in December ([in this ticket](https://support.gluu.org/single-sign-on/4903/ldap-entry-persistence-exception/)). It continues with replication and normal access the whole time.

By Mohib Zico Account Admin 01 Jun 2018 at 2:06 a.m. CDT

Mohib Zico gravatar
Robert, We are going to test more than two nodes manual clustering when we can manage some bandwidth. I'll request you to use our [Cluster Manager](https://gluu.org/docs/cm/3.1.3/); it shouldn't create any problem like that.