Debugging thread exhaustion on HP-UX

Thread exhaustion on an HP-UX machine manifests itself as one or more of the following errors in /var/adm/syslog/syslog.log:

vmunix: kthread: table is full
vmunix: WARNING: hponc_thread_create(): error creating thread for autofskd (12)
sshd[8474]: fatal: fork of unprivileged child failed
sshd[1895]: error: fork: Resource temporarily unavailable
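
A quick way to check a machine for these signatures is a grep over the syslog file. The following is only a minimal sketch: the function name find_thread_exhaustion is made up for illustration, and the pattern list is taken from the four messages above, so it is almost certainly not exhaustive.

```shell
#!/bin/sh
# Sketch: scan a syslog file for the thread exhaustion signatures shown above.
# The function name and the pattern list are illustrative assumptions.
find_thread_exhaustion() {
    # Default to the HP-UX syslog location; a different file can be passed as $1.
    logfile=${1:-/var/adm/syslog/syslog.log}
    grep -E 'kthread: table is full|error creating thread|fork of unprivileged child failed|fork: Resource temporarily unavailable' "$logfile"
}

# Typical use on the host:
#   find_thread_exhaustion | tail -20
```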

The keywords are thread, fork and failed. You should immediately look at nkthread with kcusage:

sudo kcusage nkthread
Tunable                 Usage / Setting
=============================================
nkthread                 5254 / 4096
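
If you want the same information as a single number, for example in a monitoring script, the kcusage output can be run through awk. This is a sketch only: kcusage_percent is a hypothetical helper, and it assumes the "name usage / setting" line layout shown above.

```shell
#!/bin/sh
# Sketch: read `kcusage <tunable>` output on stdin and print the usage as a
# percentage of the setting. The function name is a hypothetical helper and
# the "name usage / setting" layout is assumed from the output above.
kcusage_percent() {
    awk -v name="$1" '$1 == name && $3 == "/" { printf "%.1f\n", 100 * $2 / $4 }'
}

# On the HP-UX host itself this would be used as:
#   sudo kcusage nkthread | kcusage_percent nkthread
```

Anything close to or above 100 means new threads and forks are likely to fail.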

You then get a description of the tunable with kctune:

sudo kctune -v nkthread
Tunable             nkthread
Description         Maximum number of threads on the system
Module              pm_proc
Current Value       4096
Value at Next Boot  4096
Value at Last Boot  4096
Default Value       8416
Constraints         nkthread >= 200
                    nkthread <= 4194304
                    nkthread >= max_thread_proc
                    nkthread >= nproc + 100
                    nkthread >= (5 * vx_era_nthreads)
Can Change          Immediately or at Next Boot

You then resolve the immediate problem by raising the limit, for example:

sudo kctune nkthread+=8192

After that, as a good sysadmin, you would start looking at the usage back in time. Mind you, the percentage you see is relative to the new tunable value you set a moment ago, not to the value that was in effect when the measurement was taken!

sudo kcusage -m nkthread
Tunable:        nkthread
Setting:        28051
Time                            Usage      %
=============================================
Thu 09/09/10                    6285   22.4
Fri 09/10/10                    6403   22.8
Sat 09/11/10                    6368   22.7
Sun 09/12/10                    6150   21.9
Mon 09/13/10                    6336   22.6
Tue 09/14/10                    6436   22.9
Wed 09/15/10                    6382   22.8
Thu 09/16/10                    6416   22.9
Fri 09/17/10                    6277   22.4
Sat 09/18/10                    6157   21.9
Sun 09/19/10                    6203   22.1
Mon 09/20/10                    6319   22.5
Tue 09/21/10                    6420   22.9
Wed 09/22/10                    6306   22.5
Thu 09/23/10                    6474   23.1
Fri 09/24/10                    6567   23.4
Sat 09/25/10                    6452   23.0
Sun 09/26/10                    5910   21.1
Mon 09/27/10                    8260   29.4
Tue 09/28/10                    8240   29.4
Wed 09/29/10                    6617   23.6
Thu 09/30/10                    6461   23.0
Fri 10/01/10                    5799   20.7
Sat 10/02/10                    5558   19.8
Sun 10/03/10                    5892   21.0
Mon 10/04/10                    6983   24.9
Tue 10/05/10                    6542   23.3
Wed 10/06/10                    6479   23.1
Thu 10/07/10                   12289   43.8
Fri 10/08/10                   11108   39.6
Sat 10/09/10                    5292   18.9
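
Because the percentage column is computed against the current setting, it can help to recompute it against the limit that applied at the time. The sketch below does that with awk. The helper name rescale_usage is hypothetical, and the old limit of 4096 in the usage comment is an assumption taken from the kctune output earlier; substitute whatever value was really in effect.

```shell
#!/bin/sh
# Sketch: recompute the percentage column of `kcusage -m` output (read from
# stdin) against an earlier limit given as $1. Helper name is hypothetical.
rescale_usage() {
    awk -v limit="$1" '
        # Data rows look like: "Thu 09/09/10   6285   22.4"; the header,
        # "Setting:" and separator lines do not match and are skipped.
        $2 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            printf "%s %s %d %.1f\n", $1, $2, $3, 100 * $3 / limit
        }'
}

# On the HP-UX host itself, assuming the old limit was 4096:
#   sudo kcusage -m nkthread | rescale_usage 4096
```

Seen against the old limit, the days that stand out in the table stand out even more clearly.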

Now you might be able to see what happened when, and correlate it with your Change Management procedures to figure out what went wrong. I was not that lucky. This was on a database hotel consisting of 80 databases and database-related applications, with nearly no change control. So what to do? I needed to correlate the process information visible with ps with threads. But how do you do that on HP-UX?

First guess: the always valuable tool glance. And lo and behold, the capital Z will show you the thread information (threads are also called Light Weight Processes in HP-UX, or LWPs for short). But only on a screen-by-screen basis, which is useless if you have thousands of processes. After a bit of ping-pong with a fellow sysadmin we ended up with the pstack (print stack) tool. It works like this:

$ ps -ef | grep -i java | head -1
 dma65t7  6728  6708  0 17:48:17 ?         3:23 
/opt/dma65t7/product/6.5/classes/com/documentum/jboss4.2.0/jdk/bin/IA64N/java 
-Dprogram.name=run.sh -server -Xms256m -Xmx512m -

sudo pstack 6728 | grep -i lwpid | sed -e 's,-*,,g' | head -10
lwpid : 7653569
lwpid : 7653570
lwpid : 7653572
lwpid : 7653573
lwpid : 7653574
lwpid : 7653575
lwpid : 7653576
lwpid : 7653577
lwpid : 7653578
lwpid : 7653579

So basically I ended up with the following one-liner

ps -ef > out ; ps -ef | awk '{ print $1 " " $2 }' | grep -v root |
 while read user pid ; do sudo pstack $pid | egrep -i "($pid|lwpid)"|
 sed -e 's,-*,,g' ; done >> out 2>&1

This gives a quick and dirty indication of which process eats up all the resources, so you can go and ask the application owner whether that is normal. It wasn’t!
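
If you would rather have one line per process than a long dump, the same idea can be turned into a small report of thread counts, busiest process first. This is a sketch under assumptions, not what I ran at the time: count_lwps and lwp_report are hypothetical names, PSTACK is a made-up override variable, and the count simply relies on pstack printing one "lwpid" line per thread as in the output above.

```shell
#!/bin/sh
# Count the "lwpid" lines in pstack output read from stdin
# (assumes one such line per thread, as in the sample output).
count_lwps() {
    grep -ci 'lwpid'
}

# Print "threads user pid" for every non-root process, busiest first.
# PSTACK is a hypothetical override; it defaults to "sudo pstack".
lwp_report() {
    ps -ef | awk 'NR > 1 && $1 != "root" { print $1, $2 }' |
    while read -r user pid; do
        n=$(${PSTACK:-sudo pstack} "$pid" 2>/dev/null | count_lwps)
        echo "$n $user $pid"
    done | sort -rn
}

# Typical use on the host:
#   lwp_report | head -20
```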


Weblog – Thomas S. Iversen
