Delphix Java Process May Stop Responding on AIX (KBA1783)
KBA# 1783
Issue
On AIX, customers may find orphaned Delphix Java processes that no longer appear to be functioning but continue to consume memory. These processes may be hung (deadlocked) and never exit, so a server can be deprived of memory to the point of a loss of service. Before that point, system performance may also degrade dramatically as standard paging and swapping algorithms are used to free physical memory.
Troubleshooting
Use the "ps" command to check for processes running from the Delphix toolkit directory that have been unresponsive for an extended period of time (days). In the following example, there are a number of processes owned by the Delphix operating system user "delphix":
$ ps -ef | grep delphix
delphix 4980954 50331840 0 0:00 <defunct>
delphix 12910728 1 0 07:58:19 - 0:02 /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/java/jdk/bin/java -ea -XX:-UseVMInterruptibleIO -Ddelphix.host.os=unix -Ddelphix.toolkit.base.dir=/var/opt/delphix/Toolkit -Ddelphix.max.worker=16 -Djava.io.tmpdir=/var/opt/delphix/Toolkit/Delphix_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/tmp -jar /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/client/dsp/client.jar
delphix 13303810 1 0 Aug 19 - 0:02 /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/java/jdk/bin/java -ea -XX:-UseVMInterruptibleIO -Ddelphix.host.os=unix -Ddelphix.toolkit.base.dir=/var/opt/delphix/Toolkit -Ddelphix.max.worker=16 -Djava.io.tmpdir=/var/opt/delphix/Toolkit/Delphix_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/tmp -jar /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/client/dsp/client.jar
delphix 16253078 35192968 0 0:00 <defunct>
delphix 16973938 59048178 0 0:00 <defunct>
delphix 20971624 1 0 Aug 19 - 0:02 /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/java/jdk/bin/java -ea -XX:-UseVMInterruptibleIO -Ddelphix.host.os=unix -Ddelphix.toolkit.base.dir=/var/opt/delphix/Toolkit -Ddelphix.max.worker=16 -Djava.io.tmpdir=/var/opt/delphix/Toolkit/Delphix_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/tmp -jar /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/client/dsp/client.jar
delphix 22282392 41680912 0 0:00 <defunct>
delphix 22937658 1 0 Aug 19 - 0:02 /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/java/jdk/bin/java -ea -XX:-UseVMInterruptibleIO -Ddelphix.host.os=unix -Ddelphix.toolkit.base.dir=/var/opt/delphix/Toolkit -Ddelphix.max.worker=16 -Djava.io.tmpdir=/var/opt/delphix/Toolkit/Delphix_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/tmp -jar /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/client/dsp/client.jar
delphix 35192968 1 0 Aug 17 - 0:02 /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/java/jdk/bin/java -ea -XX:-UseVMInterruptibleIO -Ddelphix.host.os=unix -Ddelphix.toolkit.base.dir=/var/opt/delphix/Toolkit -Ddelphix.max.worker=16 -Djava.io.tmpdir=/var/opt/delphix/Toolkit/Delphix_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/tmp -jar /var/opt/delphix/Toolkit/Delphix_COMMON_423e004e_8079_502d_14bb_82d1bfdd3532_delphix_host/client/dsp/client.jar
It is also likely that each of these processes will have a child process with an executable name of “<defunct>” (zombie). If these conditions are observed, it is likely that the Java process is hung in memory allocation and cannot make forward progress. This is not a Delphix or Java issue as the hang occurs in the AIX memory management facility.
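When the "ps" listing is long, the two symptoms described here (client.jar Java processes whose parent is PID 1, and <defunct> entries) can be picked out with a small filter. The following is a minimal sketch, not a Delphix-supplied tool: the function names list_orphaned_java and list_defunct are hypothetical, and the sketch assumes that PID and PPID are the second and third columns of "ps -ef" output, as in the example listing.

```shell
# Sketch: classify `ps -ef` lines read from standard input.
# Assumption: column 2 is the PID and column 3 is the PPID.

# Print the PID of each Delphix client.jar Java process whose parent is PID 1.
list_orphaned_java() {
    awk '$3 == 1 && /client\/dsp\/client\.jar/ { print $2 }'
}

# Print "PID PPID" for each defunct (zombie) entry.
list_defunct() {
    awk '/<defunct>/ { print $2, $3 }'
}
```

A possible invocation is "ps -ef | grep delphix | list_orphaned_java". A PPID reported by list_defunct that matches a PID reported by list_orphaned_java points to a Java process that has a zombie child. The filter only narrows the list; whether a process is actually hung still has to be judged from how long it has been running without doing any work.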
Resolution
To mitigate the hang, add an environment variable for the Delphix operating system users. The change affects only those users and has no impact on the rest of the system. You may have multiple Delphix operating system users, as seen in the following graphic:
Adding the following environment variable and options for all of the environment users will prevent the accumulation of Java processes and the depletion of memory:
$ cat .ssh/environment
MALLOCOPTIONS=multiheap:32,pool
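Because the same line has to be present for every Delphix operating system user, it can help to script the change so that it is safe to run more than once. The following is a minimal sketch under stated assumptions: ensure_mallocoptions is a hypothetical helper name, and the path of each user's environment file is passed in by the caller.

```shell
# Sketch: make sure an ssh environment file holds exactly one MALLOCOPTIONS line.
# ensure_mallocoptions is a hypothetical helper, not a Delphix or AIX command.
ensure_mallocoptions() {
    env_file=$1
    setting='MALLOCOPTIONS=multiheap:32,pool'

    # Create the directory and file if they do not exist yet.
    mkdir -p "$(dirname "$env_file")"
    touch "$env_file"

    # Drop any existing MALLOCOPTIONS line, keep everything else,
    # then append the desired value.
    grep -v '^MALLOCOPTIONS=' "$env_file" > "$env_file.tmp" || true
    printf '%s\n' "$setting" >> "$env_file.tmp"
    mv "$env_file.tmp" "$env_file"

    # Keep the file readable and writable by its owner only.
    chmod 600 "$env_file"
}
```

Run as the Delphix operating system user, a call such as ensure_mallocoptions "$HOME/.ssh/environment" leaves other variables in the file untouched and replaces any older MALLOCOPTIONS value.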
Update the sshd_config file to permit setting user environment variables:
$ grep PermitUserEnvironment /etc/ssh/sshd_config
#PermitUserEnvironment no
PermitUserEnvironment yes
If PermitUserEnvironment is set to "no", which is the default, sshd ignores the ".ssh/environment" file and the variable is not set. After changing sshd_config, sshd must be restarted or reloaded for the new setting to take effect.
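A plain "grep" shows both commented and uncommented lines, so it can be worth checking which value is actually in effect. The following is a minimal sketch, assuming the configuration has no Match blocks or Include directives and uses whitespace between keyword and value; permit_user_environment is a hypothetical function name. It relies on the sshd_config rule that the first uncommented occurrence of a keyword is the one used, and falls back to the default of "no".

```shell
# Sketch: report the effective PermitUserEnvironment value in an sshd_config file.
# Assumptions: no Match blocks or Include directives; "keyword value" syntax.
permit_user_environment() {
    awk '
        # Keywords are case-insensitive; commented lines start with "#" and never match.
        tolower($1) == "permituserenvironment" { print tolower($2); found = 1; exit }
        # No uncommented occurrence: sshd falls back to its default.
        END { if (!found) print "no" }
    ' "$1"
}
```

For example, permit_user_environment /etc/ssh/sshd_config prints the configured value. This only reads the file on disk; it does not show whether the running sshd has already picked up the change.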
To test that the environment variable is defined correctly for non-interactive logins, run the following check:
$ ssh ora11202@aix101-14.delphix.com "env | grep MALLOCOPTIONS"
ora11202@aix101-14.delphix.com's password:
MALLOCOPTIONS=multiheap:32,pool
According to IBM’s documentation, these two options have the following effect:
multiheap
- Configures the number of parallel heaps to be used by memory allocators. You can set the multiheap option by exporting MALLOCOPTIONS=multiheap:n. The value n can vary from 1 through 32. The default value is 32 if n is not specified. This option is advisable for multithreaded applications, as it can significantly improve performance.
pool
- Maintains the bucket for each thread and provides a lock-free allocation and deallocation for blocks less than 513 bytes. This option improves the performance of multithreaded applications as it avoids the time that is spent on locking of memory size less than 513 bytes. The pool option makes small memory block allocations fast and efficient.
Related Articles
Several IBM articles were referenced to determine the appropriate MALLOCOPTIONS necessary to avoid this problem:
- https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.genprogc/sys_mem_alloc.htm
- https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.genprogc/malloc_multiheap.htm
- https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/aix_throughput_problems_when_malloc_is_called_often?lang=en_us
- https://publib.boulder.ibm.com/httpserv/cookbook/Operating_Systems-AIX.html