Debugging thread exhaustion on HP-UX

Thread exhaustion on an HP-UX machine shows up as one or more of the following errors in

/var/adm/syslog/syslog.log
vmunix: kthread: table is full
vmunix: WARNING: hponc_thread_create(): error creating thread for autofskd (12)
sshd[8474]: fatal: fork of unprivileged child failed
sshd[1895]: error: fork: Resource temporarily unavailable

The keywords are thread, fork and failed. You should immediately look at nkthread with kcusage

sudo kcusage nkthread
Tunable                 Usage / Setting
=============================================
nkthread                 5254 / 4096
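If you want to catch this before syslog fills up with fork failures, the same output can be checked automatically. The following is only a minimal sketch: check_kcusage and the default threshold of 80 percent are my own illustrative choices, not part of HP-UX, and it assumes the "name usage / setting" layout shown above.

```shell
# Sketch: flag tunables whose usage is at or above a percentage of the setting.
# Reads kcusage-style lines ("name  usage / setting") on stdin.
# check_kcusage and the 80 percent default are illustrative assumptions.
check_kcusage() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        # Data lines look like: nkthread  5254 / 4096
        # Header and separator lines fail the numeric checks and are skipped.
        NF == 4 && $3 == "/" && $2 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 > 0 {
            pct = 100 * $2 / $4
            if (pct >= limit)
                printf "%s %d/%d %.1f%%\n", $1, $2, $4, pct
        }'
}
```

It would be used as: sudo kcusage nkthread | check_kcusage 80. It prints nothing as long as usage stays below the threshold.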

You then get a description of the tunable with kctune

sudo kctune -v nkthread
Tunable             nkthread
Description         Maximum number of threads on the system
Module              pm_proc
Current Value      4096
Value at Next Boot  4096
Value at Last Boot  4096
Default Value       8416
Constraints         nkthread >= 200
 nkthread <= 4194304
 nkthread >= max_thread_proc
 nkthread >= nproc + 100
 nkthread >= (5 * vx_era_nthreads)
Can Change          Immediately or at Next Boot

You then resolve the problem with, for example

sudo kctune nkthread+=8192

After that, as a good sysadmin, you would start to look at the usage back in time. Mind that the percentage you see is relative to the tunable value you just set a moment ago, not to the value that was in effect when the measurement was taken!

sudo kcusage -m nkthread
Tunable:        nkthread
Setting:        28051
Time                           Usage      %
=============================================
Thu 09/09/10                    6285   22.4
Fri 09/10/10                    6403   22.8
Sat 09/11/10                    6368   22.7
Sun 09/12/10                    6150   21.9
Mon 09/13/10                    6336   22.6
Tue 09/14/10                    6436   22.9
Wed 09/15/10                    6382   22.8
Thu 09/16/10                    6416   22.9
Fri 09/17/10                    6277   22.4
Sat 09/18/10                    6157   21.9
Sun 09/19/10                    6203   22.1
Mon 09/20/10                    6319   22.5
Tue 09/21/10                    6420   22.9
Wed 09/22/10                    6306   22.5
Thu 09/23/10                    6474   23.1
Fri 09/24/10                    6567   23.4
Sat 09/25/10                    6452   23.0
Sun 09/26/10                    5910   21.1
Mon 09/27/10                    8260   29.4
Tue 09/28/10                    8240   29.4
Wed 09/29/10                    6617   23.6
Thu 09/30/10                    6461   23.0
Fri 10/01/10                    5799   20.7
Sat 10/02/10                    5558   19.8
Sun 10/03/10                    5892   21.0
Mon 10/04/10                    6983   24.9
Tue 10/05/10                    6542   23.3
Wed 10/06/10                    6479   23.1
Thu 10/07/10                   12289   43.8
Fri 10/08/10                   11108   39.6
Sat 10/09/10                    5292   18.9
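Since the percentage column is computed against the current setting, it can help to recompute it against the value that was in effect back then. A minimal sketch, assuming the kcusage -m layout shown above; rebase_usage is a hypothetical helper name, and the 12288 in the usage line is only an example value, not one recorded from this system.

```shell
# Sketch: recompute the "%" column of `kcusage -m` output against an older
# setting passed as the first argument. rebase_usage is a hypothetical name.
rebase_usage() {
    old_setting="$1"
    # Refuse anything that is not a positive integer.
    [ "$old_setting" -gt 0 ] 2>/dev/null || {
        echo "usage: rebase_usage OLD_SETTING < kcusage-output" >&2
        return 1
    }
    awk -v old="$old_setting" '
        # History lines look like: Thu 10/07/10  12289  43.8
        NF == 4 && $2 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ && $3 ~ /^[0-9]+$/ {
            printf "%s %s %d %.1f\n", $1, $2, $3, 100 * $3 / old
        }'
}
```

Used as: sudo kcusage -m nkthread | rebase_usage 12288. Days that come out near or above 100 are the ones where the table was effectively full under the old limit.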

Now you might be able to see what happened when, and correlate it with your Change Management procedures to figure out what went wrong. I was not that lucky. This was on a database hotel consisting of 80 databases and database-related applications, with nearly no change control. So what to do? I needed to correlate the process information visible with ps with the threads. But how do you do that in HP-UX?

First guess: the always valuable tool glance. And lo and behold, the capital Z will show you the thread (also called Light Weight Process in HP-UX, or LWP for short) information, but only on a screen-by-screen basis. That is useless if you have thousands of processes. After a bit of ping-pong with a fellow sysadmin we ended up with the pstack (print stack) tool. It works like this:

$ ps -ef | grep -i java | head -1
 dma65t7  6728  6708  0 17:48:17 ?         3:23 
/opt/dma65t7/product/6.5/classes/com/documentum/jboss4.2.0/jdk/bin/IA64N/java 
-Dprogram.name=run.sh -server -Xms256m -Xmx512m -

sudo pstack 6728 | grep -i lwpid | sed -e 's,-*,,g' | head -10
lwpid : 7653569
lwpid : 7653570
lwpid : 7653572
lwpid : 7653573
lwpid : 7653574
lwpid : 7653575
lwpid : 7653576
lwpid : 7653577
lwpid : 7653578
lwpid : 7653579

So basically I ended up with the following one-liner

ps -ef > out ;  ps -ef | awk 'NR > 1 { print $1 " " $2 }' | grep -v root |
 while read user pid ; do sudo pstack $pid | egrep -i "($pid|lwpid)"|
 sed -e 's,-*,,g' ; done >> out 2>&1

This gives a quick and dirty indication of which process eats up all the resources, so you can go and ask the application owner whether that is normal. It wasn’t!
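To go from the raw dump to a ranking, the lwpid lines can be counted per process. The following is a rough sketch of that idea and not what I ran at the time: count_lwps, lwp_report and the PSTACK variable are hypothetical names, and it assumes that pstack prints one line containing "lwpid" per thread, as in the output above.

```shell
# count_lwps: count the "lwpid" marker lines in pstack output read from stdin.
count_lwps() {
    grep -ci 'lwpid'
}

# lwp_report: for each "user pid" pair on stdin, print "count user pid",
# busiest process first. PSTACK (an assumption of this sketch) selects the
# command used to dump a process and defaults to pstack.
lwp_report() {
    while read -r user pid; do
        # </dev/null keeps the dump command from eating the loop's input.
        n=$(${PSTACK:-pstack} "$pid" 2>/dev/null </dev/null | count_lwps)
        echo "$n $user $pid"
    done | sort -rn
}
```

With PSTACK set to "sudo pstack", it could be fed the same way as the one-liner: ps -ef | awk 'NR > 1 { print $1 " " $2 }' | grep -v root | lwp_report | head. The first few lines are then the processes holding the most threads.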
