By: Levente Kintzel user 08 Mar 2021 at 3:46 a.m. CST

8 Responses
Levente Kintzel gravatar
Hello, we are running Gluu on an Ubuntu 20.04 LTS with 12 GB of RAM. First we run 4.2.2 also the dockerized version and the non dockerized version (allocated 10 GB). We observed that both after aprox 24 hrs runs out of memory. Most of the time the `oxtrust` or `ldap`. I exported the heap dump of `oxtrust` after such a situation, and analyzed it with `Eclipse Memory Analyzer`, that confirms that there is a huge `HashMap` related to `log4j` that occupied 1.4 GB / 1.5 GB of heap space (I have the report if you are interested). Because we went through most of the threads on the support portal related to Out of memory, we saw that in 4.2.3 there are some memory leaks fixed. So we decided to give it a try to 4.2.3. So we deployed 4.2.3. non docker version with 10 GB RAM allocated and also we disabled metrics and logging (according to instructions in the performance tunning section of the doc). Things were fine after more than 24 hrs. Because we are interested mostly on dockerized version, we deployed it in two versions: * with metrics and logging disabled, it is still running after 3 days * with metrics and logging enabled. For couple of times ldap failed to even start with the default setup, running directly in OutOfMemory: Java heap space. I extended the size of the LDAP container to 4GB, and I managed to start, but it runs again out of memory after 2 hrs, the server being simply idle, nobody was working on it. I reset everything back to default and run again the setup, system starts and worked ok for aprox 24 hrs even with some loads generated with `jmeter`. But after that it runs out of memory again. The logs of the ldap were: ``` INFO - pygluu.containerlib.wait - 2021-03-04 21:00:58,940 - Config is ready INFO - pygluu.containerlib.wait - 2021-03-04 21:00:58,985 - Secret is ready INFO - entrypoint - 2021-03-04 21:01:01,217 - Syncing OpenDJ certs. INFO - entrypoint - 2021-03-04 21:01:01,301 - Checking certificate's Subject Alt Name (SAN) INFO - entrypoint - 2021-03-04 21:01:01,371 - Loading Serf key from secrets ==> Starting Serf agent... ==> Starting Serf agent RPC... ==> Serf agent running! Node name: 'ldap' Bind addr: '0.0.0.0:7946' Advertise addr: '172.28.0.4:7946' RPC addr: '127.0.0.1:7373' Encrypted: true Snapshot: false Profile: lan Message Compression Enabled: true ==> Log data will now stream in as it occurs: INFO - ldap_peer - 2021-03-04 21:01:02,719 - Loading initial Serf peers from /etc/gluu/conf/serf-peers-static.json WARNING - ldap_peer - 2021-03-04 21:01:02,720 - Unable to load initial Serf peers from /etc/gluu/conf/serf-peers-static.json; reason=[Errno 2] No such file or directory: '/etc/gluu/conf/serf-peers-static.json' INFO - ldap_replicator - 2021-03-04 21:01:04,529 - Getting current server info INFO - ldap_replicator - 2021-03-04 21:01:04,592 - Checking replicated backends (attempt 1) [04/Mar/2021:21:01:09 +0000] category=com.forgerock.opendj.ldap.config.config severity=NOTICE msgID=571 msg=Loaded extension from file '/opt/opendj/lib/extensions/snmp-mib2605.jar' (build 4.0.0.gluu, revision a647132f766ca29542a072c835a4a31a3436d12f) [04/Mar/2021:21:01:09 +0000] category=CORE severity=NOTICE msgID=134 msg=OpenDJ Server 4.0.0-gluu (build 20200923091646, revision number a647132f766ca29542a072c835a4a31a3436d12f) starting up [04/Mar/2021:21:01:13 +0000] category=JVM severity=NOTICE msgID=21 msg=Installation Directory: /opt/opendj [04/Mar/2021:21:01:13 +0000] category=JVM severity=NOTICE msgID=23 msg=Instance Directory: /opt/opendj [04/Mar/2021:21:01:13 +0000] category=JVM severity=NOTICE msgID=17 msg=JVM Information: 11.0.8+10 by AdoptOpenJDK, 64-bit architecture, 1610612736 bytes heap size [04/Mar/2021:21:01:13 +0000] category=JVM severity=NOTICE msgID=18 msg=JVM Host: ldap, running Linux 5.4.0-66-generic amd64, 12564185088 bytes physical memory size, number of processors available 2 [04/Mar/2021:21:01:13 +0000] category=JVM severity=NOTICE msgID=19 msg=JVM Arguments: "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-Dorg.opends.server.scriptName=start-ds" INFO - ldap_replicator - 2021-03-04 21:01:14,670 - Checking replicated backends (attempt 2) [04/Mar/2021:21:01:15 +0000] category=BACKEND severity=NOTICE msgID=513 msg=The database backend site containing 2 entries has started [04/Mar/2021:21:01:15 +0000] category=BACKEND severity=NOTICE msgID=513 msg=The database backend metric containing 7 entries has started [04/Mar/2021:21:01:17 +0000] category=BACKEND severity=NOTICE msgID=513 msg=The database backend userRoot containing 195 entries has started [04/Mar/2021:21:01:17 +0000] category=EXTENSIONS severity=NOTICE msgID=221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: ldap INFO - ldap_replicator - 2021-03-04 21:01:24,705 - Checking replicated backends (attempt 3) INFO - ldap_replicator - 2021-03-04 21:01:34,722 - Checking replicated backends (attempt 4) [04/Mar/2021:21:01:37 +0000] category=CORE severity=NOTICE msgID=135 msg=The Directory Server has started successfully [04/Mar/2021:21:01:37 +0000] category=CORE severity=NOTICE msgID=139 msg=The Directory Server has sent an alert notification generated by class org.opends.server.core.DirectoryServer (alert type org.opends.server.DirectoryServerStarted, alert ID org.opends.messages.core-135): The Directory Server has started successfully [04/Mar/2021:21:01:38 +0000] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on Administration Connector 0.0.0.0 port 4444 [04/Mar/2021:21:01:38 +0000] category=PROTOCOL severity=NOTICE msgID=276 msg=Started listening for new connections on LDAPS Connection Handler 0.0.0.0 port 1636 INFO - ldap_replicator - 2021-03-04 21:01:44,743 - Checking replicated backends (attempt 5) INFO - ldap_replicator - 2021-03-04 21:01:54,779 - Checking replicated backends (attempt 6) INFO - ldap_replicator - 2021-03-04 21:02:04,807 - Checking replicated backends (attempt 7) INFO - ldap_replicator - 2021-03-04 21:02:14,855 - Checking replicated backends (attempt 8) INFO - ldap_replicator - 2021-03-04 21:02:24,887 - Checking replicated backends (attempt 9) INFO - ldap_replicator - 2021-03-04 21:02:34,906 - Checking replicated backends (attempt 10) INFO - ldap_replicator - 2021-03-04 21:02:44,936 - Checking replicated backends (attempt 11) INFO - ldap_replicator - 2021-03-04 21:02:54,960 - Checking replicated backends (attempt 12) INFO - ldap_replicator - 2021-03-04 21:03:04,983 - Checking replicated backends (attempt 13) INFO - ldap_replicator - 2021-03-04 21:03:15,008 - Checking replicated backends (attempt 14) INFO - ldap_replicator - 2021-03-04 21:03:25,030 - Checking replicated backends (attempt 15) INFO - ldap_replicator - 2021-03-04 21:03:35,051 - Checking replicated backends (attempt 16) INFO - ldap_replicator - 2021-03-04 21:03:45,074 - Checking replicated backends (attempt 17) INFO - ldap_replicator - 2021-03-04 21:03:55,099 - Checking replicated backends (attempt 18) INFO - ldap_replicator - 2021-03-04 21:04:05,121 - Checking replicated backends (attempt 19) INFO - ldap_replicator - 2021-03-04 21:04:15,148 - Checking replicated backends (attempt 20) INFO - ldap_replicator - 2021-03-04 21:04:25,165 - Checking replicated backends (attempt 21) INFO - ldap_replicator - 2021-03-04 21:04:35,187 - Checking replicated backends (attempt 22) INFO - ldap_replicator - 2021-03-04 21:04:45,205 - Checking replicated backends (attempt 23) INFO - ldap_replicator - 2021-03-04 21:04:55,227 - Checking replicated backends (attempt 24) INFO - ldap_replicator - 2021-03-04 21:05:05,248 - Checking replicated backends (attempt 25) INFO - ldap_replicator - 2021-03-04 21:05:15,269 - Checking replicated backends (attempt 26) INFO - ldap_replicator - 2021-03-04 21:05:25,292 - Checking replicated backends (attempt 27) INFO - ldap_replicator - 2021-03-04 21:05:35,308 - Checking replicated backends (attempt 28) INFO - ldap_replicator - 2021-03-04 21:05:45,331 - Checking replicated backends (attempt 29) INFO - ldap_replicator - 2021-03-04 21:05:55,359 - Checking replicated backends (attempt 30) Mar 05, 2021 10:27:17 PM com.sleepycat.je.utilint.DaemonThread run SEVERE: Error caught in <DaemonThread name="Cleaner-4"/> java.lang.OutOfMemoryError: Java heap space ``` Just some additional details: we have enabled `Shibboleth` to support `SAML`, and `Jackrabbit` enabled in our deployments. Any help would be really appreciated. Many thanks, Levente

By Mohib Zico Account Admin 09 Mar 2021 at 10:18 p.m. CST

Mohib Zico gravatar
Hello Levente, >> with metrics and logging enabled. For couple of times ldap failed to even start with the default setup, running directly in OutOfMemory: Java heap space. I extended the size of the LDAP container to 4GB, and I managed to start, but it runs again out of memory after 2 hrs, This is for dockerized version of Gluu 4.2.3 ?

By Levente Kintzel user 10 Mar 2021 at 2:12 a.m. CST

Levente Kintzel gravatar
Yes, it is the dockerized version. Thanks, Levente

By Mohib Zico Account Admin 16 Mar 2021 at 1:41 a.m. CDT

Mohib Zico gravatar
Thanks! Will try to reproduce the situation locally. Question: >> with metrics and logging enabled. Can you please help me to understand this a bit?

By Levente Kintzel user 16 Mar 2021 at 2:55 a.m. CDT

Levente Kintzel gravatar
Yes, I applied what was specified in the `Perfomance Tuning` section of the `Operation Guide`: "Make sure logging is turned OFF. Logging blocks threads and has a significant impact on performance. First test with low load, then test for high load with logging completely off. To turn off logging, in oxTrust navigate to Configuration -> JSON Configuration -> oxAuth Configuration and set loggingLevel: to OFF. Check the log files to confirm logging is off. Turn off metrics. Gathering metrics also impacts performance. To turn metrics off, in oxTrust navigate to: Configuration -> JSON Configuration -> oxAuth Configuration, and set metricReporterEnabled: to false." However I have some more details about this ticket, let me summarize it in the next comment.

By Levente Kintzel user 16 Mar 2021 at 3 a.m. CDT

Levente Kintzel gravatar
As next step I created a custom entrypoint.sh for the ldap container, and changed it in the docker compose file. In this file I deleted any JVM setup and added just `Xms` and `Xmx` paramters, both set to 4096m. The container memory has been set to 5000m in the docker file. Both metrics and logging was turned on. JMeter test has been run on it, for more than 8 hrs (aprox 7000 logins), after that it has been stopped, and the server left without pressure. After 24hrs plus, it runs out of memory. Heap was dumped with `jmap`, the size was minimal, so it was not the heap. After investigated memory consumption with `docker stat ldap` it shows 99% memory consumption, but it was not the heap. Further investigation with `jstat` turned out that it was the Metaspace that occupied 90%+ of the memory. (the column with `M` -> 96.84%) ``` S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT 0.00 0.00 0.86 3.65 96.84 94.59 299 17.654 102 369.959 7 0.598 388.210 ``` So as next step the `Metaspace` size was set to `1024m` and the `Xms`, `Xms` to `3072m`. Results are under investigations. The server is running ok for 4 days, memory consumptiom seems to be stabilized. As conclusion maybe setting also a limit for the Metaspace not just the 75% of container memory as heap seems to solve the issue, at least at first sight. I will continue to investigate our deployment, maybe using the original `entrypoint.sh` and giving the Xmx Xms and Metadata size from env variable as it is suggested in the doc.

By Levente Kintzel user 25 Mar 2021 at 2:40 p.m. CDT

Levente Kintzel gravatar
Hi, I continued investigation with different memory setups: - custom entrypoint.sh for container where just Xmx, Xms and MaxMetaspaceSize are provided are the rest of the flags were removed (like -server) - setting up JVM parameters through GLUU_JAVA_OPTIONS param Sometimes it goes better sometimes not, most of the time in couple of days I endup with an OutOfMemory. All the time the ldap container, and all the time Metaspace was 90%+. I exported the content of metadata using these commands: https://stackoverflow.com/questions/37919215/how-to-get-java8-metaspace-dump-not-heap-dump I attached to this post the export of the metadata commands. Now I upgraded to 4.2.3_02. Let's see how it goes. Levente

By Levente Kintzel user 06 Apr 2021 at 11:20 a.m. CDT

Levente Kintzel gravatar
Because the docker logs did not show all the log entries for opendj, after investigating some more log files within the container I came at the conclusion that my logs are very similar to what is described in this ticket: https://support.gluu.org/outages/9205/unexplained-outofmemoryerror-periodically-invalidates-ldap-environment/ The difference is that in my case I'm using dockerized version and already the new 4.2.3_02 docker image. Thanks, Levente

By Mohib Zico Account Admin 10 Apr 2021 at 11:23 p.m. CDT

Mohib Zico gravatar
Hi Levente, Apologies for late response. We ran some tests with couple of our customers who are using CN ( Cloud Native ), but couldn't reproduce the issue. As CN is actually need a good amount of time and resources to troubleshoot any issue permanently, do you think you can book a call [here](https://gluu.org/booking)? It will be easier for us to allocate resources if we know your business case and urgency and timeline. Thanks!