IBM Support

Resolve "Too Many Open files error" and "native OutOfMemory due to failed to create thread" issues in WebSphere Application Server running on Linux

Technical Blog Post



Body


We receive quite a few problem records (PMRs) and service requests (SRs) for native OutOfMemory (OOM) issues in WebSphere Application Server. One of the most common native OOM issues occurs specifically on Linux and is caused by an insufficient ulimit -u (NPROC) value.

We also receive a good number of PMRs for the "Too many open files" error in WebSphere Application Server running on Linux.

With simple troubleshooting and ulimit tuning, you can often resolve these issues without opening a PMR with IBM Support.

1) What is ulimit in Linux?
The ulimit command lets you view and control the resource limits that apply to a user's shell and the processes started from it, such as process data size, process virtual memory, process file size, and the number of processes.

2) What happens when these limits are not set up properly?
Various issues can occur, such as native OutOfMemory errors, "Too many open files" errors, and dump files that are not generated completely.

3) How can you check current ulimit settings?
There are various ways to check the current settings:

a) From the command prompt, issue
$ ulimit -a

The output is similar to the following:
core file size           (blocks, -c) 0
data seg size            (kbytes, -d) unlimited
scheduling priority              (-e) 0
file size                (blocks, -f) unlimited
pending signals                  (-i) 32767
max locked memory        (kbytes, -l) 32
max memory size          (kbytes, -m) unlimited
open files                       (-n) 1024
pipe size             (512 bytes, -p) 8
POSIX message queues      (bytes, -q) 819200
real-time priority               (-r) 0
stack size               (kbytes, -s) 10240
cpu time                (seconds, -t) unlimited
max user processes               (-u) 50
virtual memory           (kbytes, -v) unlimited
file locks                       (-x) unlimited


This displays the limits that apply to the current login session. By default, the soft limits are displayed. Each limit has a soft value and a hard value.
The hard limit is the maximum value to which the soft limit can be raised. Only the root user can increase a hard limit, though other users can decrease it. Any user can change a soft limit, but it cannot exceed the hard limit.

To display only the soft limits or only the hard limits, issue
ulimit -Sa

for the current soft limit values, or
ulimit -Ha
for the current hard limit values.
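
To look at only the two limits this post focuses on, a minimal sketch like the following may help. It assumes a bash shell (the -u option is not available in every shell), and show_limits is an illustrative helper name, not a standard command.

```shell
#!/usr/bin/env bash
# Print the soft and hard values of the two limits discussed in this post.
# -n = open files (NOFILE), -u = max user processes (NPROC).
# show_limits is an illustrative helper name, not a standard command.
show_limits() {
    for flag in n u; do
        printf 'ulimit -%s  soft=%s  hard=%s\n' \
            "$flag" "$(ulimit -S"$flag")" "$(ulimit -H"$flag")"
    done
}

show_limits
```

Run it as the user that starts the application server, because the values reported are those of the shell that runs the commands.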

b) If you know the process ID (PID) of the WebSphere Application Server process to be investigated, you can also inspect the following file.
Location: /proc/<PID>
File: limits

The contents of this file are similar to the output of the "ulimit -a" command.
The file lists each limit with its soft and hard values as they apply to the specified PID, so it reflects what the running process actually has rather than what a new login shell would get.
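
As a rough sketch, the two rows of interest can be pulled out of that file as shown below. The helper name show_proc_limits is illustrative, and the PID defaults to the current shell only so that the example runs as written; pass the application server's PID instead.

```shell
# Show the open files and max processes rows of /proc/<PID>/limits.
# show_proc_limits is an illustrative helper name.
# The PID defaults to the current shell ($$) only so the sketch runs as-is.
show_proc_limits() {
    pid="${1:-$$}"
    # Columns in the file: limit name, soft limit, hard limit, units
    grep -E '^(Limit|Max open files|Max processes)' "/proc/$pid/limits"
}

show_proc_limits "$@"
```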

c) If you know the process ID of the server whose ulimit settings you want to check, you can take a Javacore by issuing
kill -3 <PID>

You can open this Javacore in any text editor (such as Notepad++ or UltraEdit)
and search for "User Limits" to go to the ulimit section.
Example of ulimit settings as seen in a Javacore:
User Limits (in bytes except for NOFILE and NPROC)
--------------------------------------------------------------
type             soft limit      hard limit
RLIMIT_AS       11788779520       unlimited
RLIMIT_CORE            1024       unlimited
RLIMIT_CPU        unlimited       unlimited
RLIMIT_DATA       unlimited       unlimited
RLIMIT_FSIZE      unlimited       unlimited
RLIMIT_LOCKS      unlimited       unlimited
RLIMIT_MEMLOCK    unlimited       unlimited
RLIMIT_NOFILE         18192           18192
RLIMIT_NPROC          79563           79563
RLIMIT_RSS       8874856448       unlimited
RLIMIT_STACK       33554432       unlimited
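
If you prefer the command line to a text editor, a sketch along these lines extracts the two relevant rows. It assumes only that the Javacore contains RLIMIT_ rows like the example above; javacore_limits is an illustrative helper name and the file name is whatever your JVM produced.

```shell
# Print the NOFILE and NPROC rows from a Javacore file.
# javacore_limits is an illustrative helper name.
# Usage: javacore_limits <path to javacore file>
javacore_limits() {
    grep -E 'RLIMIT_(NOFILE|NPROC)' "$1"
}

# Only run when a file name was passed to the script.
if [ -n "${1:-}" ]; then
    javacore_limits "$1"
fi
```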


To find the global settings, inspect the following file on Linux:
/etc/security/limits.conf

Any changes to this global configuration file should be made by your system administrator.
For more details on each ulimit setting, and on the ulimit command on other operating systems, see this technote: Guidelines for setting ulimits (WebSphere Application Server)

4) What kind of native OOM is expected due to insufficient ulimit settings?
An OutOfMemoryError dump event with a "Failed to create a thread" message occurs.
Example: A message like the following appears in the Javacore.

"systhrow" (00040000) Detail "java/lang/OutOfMemoryError"
"Failed to create a thread: retVal -1073741830, errno 12" received
errno 12 (ENOMEM) indicates a genuine native out-of-memory condition while starting a thread.

Sometimes "Failed to create a thread" also appears in server logs such as SystemOut.log and SystemErr.log, and in FFDC logs. This error indicates that a native OutOfMemory occurred while a new thread was being created.

5) Why does this error happen?
The current ulimit -u (NPROC) value is too low.
Usually, the nproc limit counts only processes. Linux systems running WebSphere Application Server are a particular case: the nproc limit on Linux counts the threads within all processes that exist for a given user. On most older versions of Linux, the default value is around 2048. On an out-of-the-box Red Hat Enterprise Linux (RHEL) 6 system, the default nproc value is 1024.
On larger systems, this low default does not allow enough threads across all of the user's processes.
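
To see how close a user is to the limit, you can count that user's threads and compare the total with the soft NPROC value. The following is a minimal sketch that reads /proc directly; count_user_threads is an illustrative helper name, and the final line assumes a bash shell for ulimit -Su.

```shell
#!/usr/bin/env bash
# Count all threads owned by a user by summing the "Threads:" field of
# /proc/<pid>/status for every process whose real UID matches.
# count_user_threads is an illustrative helper name.
count_user_threads() {
    uid="${1:-$(id -u)}"
    # Processes can exit while the files are being read, so errors are discarded.
    awk -v uid="$uid" '
        /^Uid:/     { mine = ($2 == uid) }
        /^Threads:/ { if (mine) n += $2 }
        END         { print n + 0 }
    ' /proc/[0-9]*/status 2>/dev/null
}

printf 'threads owned by uid %s: %s\n' "$(id -u)" "$(count_user_threads)"
printf 'soft NPROC limit: %s\n' "$(ulimit -Su)"
```

If the thread count is near the soft limit while the application servers are under load, the limit is a likely cause of the thread creation failures.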

6) How do you fix this issue?
WebSphere Application Server Support recommends setting ulimit -u (nproc) to 131072 when running on Linux, to safely account for all the threads that could be created within the user's processes.

It can be increased temporarily for the current session by issuing

ulimit -u 131072

In bash, when neither -S nor -H is given, this sets both the soft and hard limits.

To set the limits individually, raise the hard limit first, because the soft limit cannot exceed it:

ulimit -Hu 131072 for hard limit.

ulimit -Su 131072 for soft limit.

To set it globally, the Linux system administrator has to edit

/etc/security/limits.conf

This technote explains the issue in more detail: Insufficient ulimit -u (NPROC) Value Contributes to Native OutOfMemory
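
As an illustration of what the entries in that file look like, the sketch below prints the two nproc lines for a given account. The account name wasadmin is hypothetical, as is the helper name nproc_limits_entries; substitute the user that actually runs the JVMs. On some distributions, files under /etc/security/limits.d/ can override these values, so it is worth checking that directory as well.

```shell
# Print limits.conf lines for the NPROC value recommended above.
# "wasadmin" is a hypothetical account name; substitute the user that
# actually runs the WebSphere Application Server JVMs.
# nproc_limits_entries is an illustrative helper name.
nproc_limits_entries() {
    user="${1:-wasadmin}"
    # Format of each line: <domain> <type> <item> <value>
    printf '%s  soft  nproc  131072\n' "$user"
    printf '%s  hard  nproc  131072\n' "$user"
}

nproc_limits_entries
```

The function only prints the lines for review; it does not modify any file.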

7) What about the "Too many open files" error?
This error indicates that all available file handles for the process have been used (sockets count as file handles too).
Example: Errors similar to the following appear in the server logs.

java.io.IOException: Too many open files
prefs W Could not lock User prefs. UNIX error code 24.

8) Why does this error happen?
It can happen if the current open files limit (ulimit -n) is too low, or if some part of the application is leaking file handles.

9) How do you fix this?
IBM Support recommends setting the open files limit (ulimit -n) to 65536 for both the soft and hard limits for WebSphere Application Server running on Linux. Set the hard limit first, because the soft limit cannot exceed it.

ulimit -Hn 65536
ulimit -Sn 65536

10) What if there is a file descriptor leak in the application?
On Linux, you can find out whether particular open files are growing over time by periodically collecting lsof output against the problematic JVM process ID.

lsof -p [PID] -r [interval in seconds, 1800 for 30 minutes] > lsof.out

The output lists all of the open files for the specified PID. By comparing samples, you can determine which files are open and which are growing in number over time.
Alternatively, you can list the file descriptors as symbolic links in the following directory, replacing PID with
the process ID. This is especially useful if you don't have access to the lsof command:

ls -al /proc/PID/fd
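
Building on that directory listing, a sketch like the following counts the open descriptors and groups them by target, which can make a leak easier to spot when run repeatedly. The helper names count_open_fds and fd_summary are illustrative, the PID defaults to the current shell only so that the example runs as written, and reading another process's fd directory requires running as the same user or as root.

```shell
# Count the file descriptors a process currently has open.
# count_open_fds and fd_summary are illustrative helper names.
count_open_fds() {
    pid="${1:-$$}"
    ls "/proc/$pid/fd" | wc -l
}

# Group open descriptors by target so that a growing entry stands out.
# Sockets and pipes look like "socket:[12345]"; the inode number is
# stripped so that they are counted together.
fd_summary() {
    pid="${1:-$$}"
    for fd in "/proc/$pid/fd"/*; do
        readlink "$fd" 2>/dev/null
    done | sed 's/:\[[0-9]*\]$//' | sort | uniq -c | sort -rn
}

printf 'open descriptors: %s\n' "$(count_open_fds)"
fd_summary | head -n 10
```

Comparing the count with the process's open files limit shows how much headroom remains.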

Related technote: Too Many Open Files error message

11) Is there anything else to be tuned?
One more Linux setting can be tuned: pid_max. Reaching this limit is rare and occurs only in large environments. If you are not running a large environment, you can skip this step.

The pid_max setting is the kernel's limit on the number of unique process identifiers that the system supports. On Linux, each thread consumes an identifier as well.
The default value is 32,768, which is sufficient for most customers.
In large environments with a huge number of processes, this limit can be reached. A native OutOfMemory then occurs with a similar "Failed to create a thread" message in the Javacore, this time with errno 11.
Example:

Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError"
"Failed to create a thread: retVal -106040066, errno 11" received

To find the current pid_max value on Linux, issue
cat /proc/sys/kernel/pid_max

To increase it, issue
sysctl -w kernel.pid_max=<Value>

Sometimes the default of 32,768 is reached because of a thread leak, which causes the native OOM. In this case, you have to fix the thread leak to resolve the native OOM.
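
To judge whether pid_max is a realistic concern on a given system, you can compare the number of threads currently running with the configured value. This is a minimal sketch that sums the Threads: field across /proc; the variable names are illustrative.

```shell
# Compare the number of threads on the system with kernel.pid_max.
pid_max=$(cat /proc/sys/kernel/pid_max)

# Every thread on Linux consumes one identifier, so sum the Threads: fields.
# Processes can exit while the files are being read, so errors are discarded.
threads=$(awk '/^Threads:/ { n += $2 } END { print n + 0 }' /proc/[0-9]*/status 2>/dev/null)

printf 'threads in use: %s of pid_max %s (%s%%)\n' \
    "$threads" "$pid_max" "$((threads * 100 / pid_max))"
```

A percentage that keeps climbing while the workload is steady points to a thread leak rather than to a pid_max value that is too small.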

Related technotes:
Troubleshooting native memory issues

Potential native memory use in WebSphere Application Server thread pools

Summary:
Make sure the following ulimit settings are in place on Linux to avoid "Too many open files" errors and native OutOfMemory errors caused by a failure to create a thread.
User Limits (in bytes except for NOFILE and NPROC)

type            soft limit      hard limit
RLIMIT_NOFILE        65536           65536
RLIMIT_NPROC        131072          131072
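
One way to confirm that a session meets these values is a small check like the one below. It is a sketch that assumes a bash shell; meets_minimum and check_limit are illustrative helper names, and the minimums are the values recommended in this post.

```shell
#!/usr/bin/env bash
# Return success when a ulimit value satisfies a recommended minimum.
# "unlimited" always satisfies it.
# meets_minimum and check_limit are illustrative helper names.
meets_minimum() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ] 2>/dev/null
}

# Report the soft (S) and hard (H) value of one limit against a minimum.
check_limit() {
    flag="$1"; minimum="$2"
    for kind in S H; do
        value=$(ulimit -"$kind$flag")
        if meets_minimum "$value" "$minimum"; then status=OK; else status=LOW; fi
        printf 'ulimit -%s%s = %s (recommended >= %s): %s\n' \
            "$kind" "$flag" "$value" "$minimum" "$status"
    done
}

check_limit n 65536    # RLIMIT_NOFILE
check_limit u 131072   # RLIMIT_NPROC
```

Run it as the user that starts the JVMs, from the same kind of session that starts them, so that the values checked are the ones the servers inherit.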

12) Is there anything else to check?
IBM Support recommends the following values for all ulimit settings for WebSphere Application Server running on Linux, including the settings discussed so far.

User Limits (in bytes except for NOFILE and NPROC)
type            soft limit      hard limit
RLIMIT_AS        unlimited       unlimited
RLIMIT_CORE      unlimited       unlimited
RLIMIT_CPU       unlimited       unlimited
RLIMIT_DATA      unlimited       unlimited
RLIMIT_FSIZE     unlimited       unlimited
RLIMIT_LOCKS     unlimited       unlimited
RLIMIT_MEMLOCK       65536           65536
RLIMIT_NOFILE        65536           65536
RLIMIT_NPROC        131072          131072

13) What is next?
Make sure the settings discussed above apply to all WebSphere Application Server JVMs, such as the deployment manager (DMgr), node agents, and application servers. If the settings were changed globally, log off and log back in as the same user so that the new limits are picked up, then restart the JVMs. If the settings were changed only in the current session (shell), restart the JVMs from that same shell.

 

 


UID

ibm11081203