Archive for October, 2010

Debugging thread exhaustion on HP-UX

Saturday, October 9th, 2010

Thread exhaustion on an HP-UX machine manifests itself as one or more of the following errors in

/var/adm/syslog/syslog.log
vmunix: kthread: table is full
vmunix: WARNING: hponc_thread_create(): error creating thread for autofskd (12)
sshd[8474]: fatal: fork of unprivileged child failed
sshd[1895]: error: fork: Resource temporarily unavailable

The keywords are thread, fork and failed. You should immediately look at nkthread with kcusage:

sudo kcusage nkthread
Tunable                 Usage / Setting
=============================================
nkthread                 5254 / 4096
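To avoid eyeballing the numbers, the usage/setting pair can be turned into a percentage. The following is a minimal sketch, assuming the output has the three-column layout shown above; the function name over_threshold and the default threshold of 80 percent are my own choices, not part of any HP-UX tool.

```shell
# Read kcusage-style output on stdin and print every tunable whose usage
# is at or above a threshold percentage of its setting.
# $1 is the threshold in percent (hypothetical default: 80).
over_threshold() {
    awk -v limit="${1:-80}" '
        # Data rows look like: name usage / setting
        # The numeric checks skip the header and separator lines.
        NF == 4 && $3 == "/" && $2 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 > 0 {
            pct = 100 * $2 / $4
            if (pct >= limit)
                printf "%s %d/%d %.1f%%\n", $1, $2, $4, pct
        }'
}
```

On the affected host this could be used as sudo kcusage nkthread | over_threshold 90, which for the output above would report nkthread at roughly 128 percent of its setting.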

You can then get a description of the tunable with

kctune
sudo kctune -v nkthread
Tunable             nkthread
Description         Maximum number of threads on the system
Module              pm_proc
Current Value      4096
Value at Next Boot  4096
Value at Last Boot  4096
Default Value       8416
Constraints         nkthread >= 200
 nkthread <= 4194304
 nkthread >= max_thread_proc
 nkthread >= nproc + 100
 nkthread >= (5 * vx_era_nthreads)
Can Change          Immediately or at Next Boot
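Before raising the tunable it can be worth checking a candidate value against the constraints listed above. This is only a sketch: the function name nkthread_ok is hypothetical, the values of max_thread_proc and nproc have to be supplied by the caller (for example from kctune), and the vx_era_nthreads rule is deliberately left out.

```shell
# Succeed (exit status 0) if a candidate nkthread value satisfies the
# constraints printed by "kctune -v nkthread".
# Arguments: candidate value, current max_thread_proc, current nproc.
# The vx_era_nthreads constraint is not checked in this sketch.
nkthread_ok() {
    candidate=$1
    max_thread_proc=$2
    nproc=$3
    [ "$candidate" -ge 200 ] &&
    [ "$candidate" -le 4194304 ] &&
    [ "$candidate" -ge "$max_thread_proc" ] &&
    [ "$candidate" -ge $((nproc + 100)) ]
}
```

With made-up values of 1100 for max_thread_proc and 4200 for nproc, nkthread_ok 12288 1100 4200 succeeds, while nkthread_ok 4096 1100 4200 fails because 4096 is below nproc + 100.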

You then resolve the problem with, for example,

sudo kctune nkthread+=8192

After that you would, as a good sysadmin, start to look at the usage back in time. Mind you, the percentage you see is relative to the new tunable value you just set, not to the value that was in effect when the measurement was taken!

sudo kcusage -m nkthread
Tunable:        nkthread
Setting:        28051
Time                           Usage      %
=============================================
Thu 09/09/10                    6285   22.4
Fri 09/10/10                    6403   22.8
Sat 09/11/10                    6368   22.7
Sun 09/12/10                    6150   21.9
Mon 09/13/10                    6336   22.6
Tue 09/14/10                    6436   22.9
Wed 09/15/10                    6382   22.8
Thu 09/16/10                    6416   22.9
Fri 09/17/10                    6277   22.4
Sat 09/18/10                    6157   21.9
Sun 09/19/10                    6203   22.1
Mon 09/20/10                    6319   22.5
Tue 09/21/10                    6420   22.9
Wed 09/22/10                    6306   22.5
Thu 09/23/10                    6474   23.1
Fri 09/24/10                    6567   23.4
Sat 09/25/10                    6452   23.0
Sun 09/26/10                    5910   21.1
Mon 09/27/10                    8260   29.4
Tue 09/28/10                    8240   29.4
Wed 09/29/10                    6617   23.6
Thu 09/30/10                    6461   23.0
Fri 10/01/10                    5799   20.7
Sat 10/02/10                    5558   19.8
Sun 10/03/10                    5892   21.0
Mon 10/04/10                    6983   24.9
Tue 10/05/10                    6542   23.3
Wed 10/06/10                    6479   23.1
Thu 10/07/10                   12289   43.8
Fri 10/08/10                   11108   39.6
Sat 10/09/10                    5292   18.9
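Because the percentage column is computed against the current setting, it helps to recompute it against whatever setting you believe was in force at the time. A minimal sketch, assuming the row layout shown above; rebase_usage is a name I made up, and the setting you pass in is your own assumption about the past.

```shell
# Read "kcusage -m" style history on stdin and print each day's usage as
# a percentage of the setting given as $1 instead of the current setting.
rebase_usage() {
    case "$1" in
        ''|*[!0-9]*|0) echo "usage: rebase_usage SETTING" >&2; return 1 ;;
    esac
    awk -v setting="$1" '
        # History rows look like: Thu 09/09/10  6285  22.4
        NF == 4 && $2 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
            printf "%s %s %d %.1f\n", $1, $2, $3, 100 * $3 / setting
        }'
}
```

For example, sudo kcusage -m nkthread | rebase_usage 4096 shows the peak of 12289 threads on 10/07 as about 300 percent of a 4096 setting, rather than the 43.8 percent reported against 28051.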

Now you might be able to see what happened when, and correlate it with your Change Management procedures to figure out what went wrong. I was not that lucky. This was on a database hotel consisting of 80 databases and database-related applications, with nearly no change control. So what to do? I needed to correlate the process information visible with ps with the threads each process owns. But how do you do that on HP-UX?

My first guess was the always valuable tool glance. And lo and behold, capital Z shows you the thread information (threads are also called Light Weight Processes on HP-UX, or LWPs for short). But only on a screen-by-screen basis, which is useless if you have thousands of processes. After a bit of ping-pong with a fellow sysadmin we ended up with the pstack (print stack) tool. It works like this:

$ ps -ef | grep -i java | head -1
 dma65t7  6728  6708  0 17:48:17 ?         3:23 
/opt/dma65t7/product/6.5/classes/com/documentum/jboss4.2.0/jdk/bin/IA64N/java 
-Dprogram.name=run.sh -server -Xms256m -Xmx512m -

sudo pstack 6728 | grep -i lwpid | sed -e 's,-*,,g' | head -10
lwpid : 7653569
lwpid : 7653570
lwpid : 7653572
lwpid : 7653573
lwpid : 7653574
lwpid : 7653575
lwpid : 7653576
lwpid : 7653577
lwpid : 7653578
lwpid : 7653579

So basically I ended up with the following one-liner

ps -ef > out ;  ps -ef | awk 'NR > 1 && $1 != "root" { print $1 " " $2 }' |
 while read user pid ; do sudo pstack $pid | egrep -i "($pid|lwpid)"|
 sed -e 's,-*,,g' ; done >> out 2>&1

This gives a quick and dirty indication of which process eats up all the resources, so you can go and ask that application owner whether that is normal. It wasn’t!
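If you only need the thread count per process rather than the individual LWP ids, the same idea can be boiled down to a ranking. This is a sketch under the assumption that pstack prints one line containing "lwpid" per thread, as in the output above; count_lwps, rank_by_threads and the PSTACK override variable are names I introduced for illustration.

```shell
# Count the lines mentioning "lwpid" in pstack-style output on stdin.
count_lwps() {
    grep -ci 'lwpid'
}

# Print "<threads> <user> <pid>" for every non-root process, busiest first.
# PSTACK is a hypothetical override hook; it defaults to "sudo pstack".
rank_by_threads() {
    ps -ef | awk 'NR > 1 && $1 != "root" { print $1, $2 }' |
    while read user pid; do
        # stdin is redirected so the command cannot swallow the pid list
        n=$(${PSTACK:-sudo pstack} "$pid" 2>/dev/null < /dev/null | count_lwps)
        echo "$n $user $pid"
    done | sort -rn
}
```

Running rank_by_threads | head -20 then lists the twenty processes with the most threads; processes that exit before pstack reaches them simply show up with a count of 0.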

Configuring and using the BMC on an IBM eServer 326.

Sunday, October 3rd, 2010

The IBM eServer 326 comes with a fairly minimalistic Baseboard Management Controller (BMC). That is understandable when you look at the form factor (1U) of the eServer 326 as well as the price tag it had when it was introduced. IBM tried to give the eServer 326 some RAS features without impacting the price. Kudos for that!

It does, however, also mean that the BMC is hard to use and configure. A couple of points:

  • The BMC piggybacks/shares the first NIC with the system (LAN 1)
  • You have to start the server once before the BMC gets activated. It is not truly a separate entity of its own
  • When you enter the system BIOS you see a BMC menu option. Clicking that however will not allow you to configure the LAN settings of the BMC, hence this blog post.
  • At times the BMC is unreachable due to the fact that the NIC is a shared NIC.

When you boot the disk/iso, it will normally go through an upgrade cycle before you end up in the lancfg tool. If you only want to configure the BMC, you can break the upgrade during startup (ctrl-c) and start lancfg yourself.

content of the lancfg disk

Inside lancfg you can configure the IP settings of the BMC …

lancfg – ip settings

… as well as the SNMP settings

snmp settings of the BMC

You can also set up privileges to be used when accessing the BMC. The default BMC username and password for the IBM eServer 326 are USERID/PASSW0RD (with a zero).

After you have installed the tool, you can use it under Linux as follows:

edison% ./usr/bin/smbridge -ip 192.168.1.223 sysinfo
DeviceID=       0
DeviceRevision= 1
FirmwareVersion=        1.25
IpmiVersion=    1.5
ManufacturerID= 2
ProductID=      34888
Status= OK
SDRVersion=     1.5
Guid=   171c049f-4ea8-b387-119d-321e09d85fe8

or

edison% ./usr/bin/smbridge -ip 192.168.1.223 power
Status= on

edison% ./usr/bin/smbridge -ip 192.168.1.223 power reset
Error(power,0xa3):Insufficient privilege level.
edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power reset

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power off

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power on

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power cycle
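Typing the address and credentials on every call gets tedious, so a small wrapper helps. The sketch below only reuses the -ip, -u and -p options shown above; the function name bmc and the BMC_IP, BMC_USER, BMC_PASS and SMBRIDGE variables are my own inventions, with the factory defaults filled in as fallbacks.

```shell
# Call smbridge with the BMC address and credentials filled in.
# BMC_IP must be set by the caller; BMC_USER and BMC_PASS fall back to
# the factory defaults. SMBRIDGE is a hypothetical override for the path.
bmc() {
    if [ -z "$BMC_IP" ]; then
        echo "bmc: set BMC_IP first" >&2
        return 1
    fi
    "${SMBRIDGE:-./usr/bin/smbridge}" -ip "$BMC_IP" \
        -u "${BMC_USER:-USERID}" -p "${BMC_PASS:-PASSW0RD}" "$@"
}
```

With BMC_IP=192.168.1.223 exported, bmc power reset is equivalent to the full reset command above. Keep in mind that a password passed with -p is visible to other users in the process list.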

And that is about it. These are not enterprise RAS features, but they are enough to reset a hung machine and the like. Given that the eServer 326 was marketed as a cheap calculation node in a calculation farm, the BMC features fit the bill perfectly, even though some remote console features would have been nice as well.