Archive for the ‘unix’ Category

Debugging thread exhaustion on HP-UX

Saturday, October 9th, 2010

Thread exhaustion on an HP-UX machine manifests itself by one or more of the following errors in

/var/adm/syslog/syslog.log
vmunix: kthread: table is full
vmunix: WARNING: hponc_thread_create(): error creating thread for autofskd (12)
sshd[8474]: fatal: fork of unprivileged child failed
sshd[1895]: error: fork: Resource temporarily unavailable

The keywords are thread, fork and failed. You should immediately look at nkthread with kcusage

sudo kcusage nkthread
Tunable                 Usage / Setting
=============================================
nkthread                 5254 / 4096

You then get a description of the tunable with kctune
sudo kctune -v nkthread
Tunable             nkthread
Description         Maximum number of threads on the system
Module              pm_proc
Current Value      4096
Value at Next Boot  4096
Value at Last Boot  4096
Default Value       8416
Constraints         nkthread >= 200
 nkthread <= 4194304
 nkthread >= max_thread_proc
 nkthread >= nproc + 100
 nkthread >= (5 * vx_era_nthreads)
Can Change          Immediately or at Next Boot

You then resolve the problem with, for example,

sudo kctune nkthread+=8192

After that, as a good sysadmin, you would start to look at how the usage has developed over time. Mind that the percentages shown are relative to the tunable value you just set a moment ago, not to the value that was in effect when each measurement was taken!

sudo kcusage -m nkthread
Tunable:        nkthread
Setting:        28051
Time                           Usage      %
=============================================
Thu 09/09/10                    6285   22.4
Fri 09/10/10                    6403   22.8
Sat 09/11/10                    6368   22.7
Sun 09/12/10                    6150   21.9
Mon 09/13/10                    6336   22.6
Tue 09/14/10                    6436   22.9
Wed 09/15/10                    6382   22.8
Thu 09/16/10                    6416   22.9
Fri 09/17/10                    6277   22.4
Sat 09/18/10                    6157   21.9
Sun 09/19/10                    6203   22.1
Mon 09/20/10                    6319   22.5
Tue 09/21/10                    6420   22.9
Wed 09/22/10                    6306   22.5
Thu 09/23/10                    6474   23.1
Fri 09/24/10                    6567   23.4
Sat 09/25/10                    6452   23.0
Sun 09/26/10                    5910   21.1
Mon 09/27/10                    8260   29.4
Tue 09/28/10                    8240   29.4
Wed 09/29/10                    6617   23.6
Thu 09/30/10                    6461   23.0
Fri 10/01/10                    5799   20.7
Sat 10/02/10                    5558   19.8
Sun 10/03/10                    5892   21.0
Mon 10/04/10                    6983   24.9
Tue 10/05/10                    6542   23.3
Wed 10/06/10                    6479   23.1
Thu 10/07/10                   12289   43.8
Fri 10/08/10                   11108   39.6
Sat 10/09/10                    5292   18.9
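Because of that caveat, it can help to recompute the percentages against the limit that was in effect back then. The following is only a sketch: the function name usage_vs_limit is made up for illustration, and it assumes the four-column data lines shown above (day, date, usage, percent).

```shell
# Recompute kcusage percentages against a limit of your choosing.
# Reads "Day MM/DD/YY Usage Pct" lines on stdin; $1 is the old limit.
# Header and separator lines are skipped because they do not have
# four fields with a purely numeric third field.
usage_vs_limit() {
    awk -v limit="$1" 'NF == 4 && $3 ~ /^[0-9]+$/ {
        printf "%s %s %d %.1f\n", $1, $2, $3, 100 * $3 / limit
    }'
}

# Usage (assumes the old limit was 4096; substitute the real one):
#   sudo kcusage -m nkthread | usage_vs_limit 4096
```

Any value above 100 in the last column then points at a day where the old limit was exceeded.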

Now you might be able to see what happened when, and correlate it with your Change Management procedures to figure out what went wrong. I was not that lucky. This was on a database hotel consisting of 80 databases and database-related applications, with nearly no change control. So what to do? I needed to correlate the process information visible with ps with the threads. But how do you do that on HP-UX?

First guess: the always valuable tool glance. And lo and behold, the capital Z will show you the thread information (threads are also called Light Weight Processes in HP-UX, or LWP for short). But only on a screen-by-screen basis, which is useless if you have thousands of processes. After a bit of ping-pong with a fellow sysadmin we ended up with the pstack (print stack) tool. It works like this:

$ ps -ef | grep -i java | head -1
 dma65t7  6728  6708  0 17:48:17 ?         3:23 
/opt/dma65t7/product/6.5/classes/com/documentum/jboss4.2.0/jdk/bin/IA64N/java 
-Dprogram.name=run.sh -server -Xms256m -Xmx512m -

sudo pstack 6728 | grep -i lwpid | sed -e 's,-*,,g' | head -10
lwpid : 7653569
lwpid : 7653570
lwpid : 7653572
lwpid : 7653573
lwpid : 7653574
lwpid : 7653575
lwpid : 7653576
lwpid : 7653577
lwpid : 7653578
lwpid : 7653579

So basically I ended up with the following one-liner

ps -ef > out ;  ps -ef | awk 'NR > 1 && $1 != "root" { print $1 " " $2 }' |
 while read user pid ; do sudo pstack $pid | egrep -i "($pid|lwpid)"|
 sed -e 's,-*,,g' ; done >> out 2>&1

Which gives a quick and dirty indication of which process eats up all the resources and you can go ask that application owner if that is normal. It wasn’t!
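If you only care about the ranking, a variation on the same idea is to count the lwpid lines per process and sort by that count. This is a rough sketch rather than what I ran at the time; count_lwps and rank_by_threads are names invented here, and it assumes pstack prints one "lwpid" line per thread as in the output above.

```shell
# Count the threads (LWPs) in pstack output read from stdin:
# one line containing "lwpid" is assumed per thread.
count_lwps() {
    grep -ci 'lwpid'
}

# Print "<threads> <user> <pid>" for every non-root process,
# busiest process first. Needs sudo and pstack, like the one-liner.
rank_by_threads() {
    ps -ef | awk 'NR > 1 && $1 != "root" { print $1, $2 }' |
    while read -r user pid; do
        n=$(sudo pstack "$pid" 2>/dev/null | count_lwps)
        echo "$n $user $pid"
    done | sort -rn
}

# Usage:
#   rank_by_threads | head -20
```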

Configuring and using the BMC on an IBM eServer 326.

Sunday, October 3rd, 2010

The IBM eServer 326 comes with a fairly minimalistic Baseboard Management Controller (BMC). That is understandable when you look at the form factor (1U) of the eServer 326 as well as the price tag it had when it was introduced. IBM tried to give the eServer 326 some RAS features without impacting the price. Kudos for that!

It also means, however, that it is hard to use and configure. A couple of points:

  • The BMC piggybacks/shares the first NIC with the system (LAN 1)
  • You have to start the server once before the BMC gets activated. It is not truly a separate entity of its own
  • When you enter the system BIOS you see a BMC menu option. Clicking that however will not allow you to configure the LAN settings of the BMC, hence this blog post.
  • At times the BMC is unreachable due to the fact that the NIC is a shared NIC.

When you boot the disk/iso, it will normally go through an upgrade cycle before you end up in the lancfg tool. If you only want to configure the BMC, you can break the upgrade during startup (ctrl-c) and start lancfg yourself.

content of the lancfg disk

Inside lancfg you can configure the IP settings of the BMC …

lancfg – ip settings

… as well as the SNMP settings

snmp settings of the BMC

You can also set up privileges to be used when accessing the BMC. The default BMC username and password for the IBM eServer 326 is USERID/PASSW0RD (with a zero).

After you have installed the smbridge tool, you can use it under Linux as follows:

edison% ./usr/bin/smbridge -ip 192.168.1.223 sysinfo
DeviceID=       0
DeviceRevision= 1
FirmwareVersion=        1.25
IpmiVersion=    1.5
ManufacturerID= 2
ProductID=      34888
Status= OK
SDRVersion=     1.5
Guid=   171c049f-4ea8-b387-119d-321e09d85fe8

or

edison% ./usr/bin/smbridge -ip 192.168.1.223 power
Status= on

edison% ./usr/bin/smbridge -ip 192.168.1.223 power reset
Error(power,0xa3):Insufficient privilege level.
edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power reset

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power off

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power on

edison% ./usr/bin/smbridge -ip 192.168.1.223 -u USERID -p PASSW0RD power cycle
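
Since the output is plain "Key= value" lines, it is easy to pick out a single field in a script. A small sketch, assuming the output format shown above; bmc_field is a helper name made up for this example and not part of smbridge.

```shell
# Print the value of one field from "Key= value" output on stdin.
# $1 is the key to look for, e.g. FirmwareVersion or Status.
bmc_field() {
    awk -F'=' -v key="$1" '$1 == key {
        sub(/^[ \t]+/, "", $2)   # strip the padding after the "="
        print $2
    }'
}

# Usage:
#   ./usr/bin/smbridge -ip 192.168.1.223 sysinfo | bmc_field FirmwareVersion
#   ./usr/bin/smbridge -ip 192.168.1.223 power   | bmc_field Status
```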

And that is about it. These are not enterprise RAS features, but they are enough to reset a hung machine and the like. Given that the eServer 326 was marketed as a cheap calculation node in a calculation farm, the BMC features fit the bill perfectly, even though some remote console features would have been nice as well.

About errors setting up SASL/StartTLS in Postfix under Ubuntu

Sunday, August 29th, 2010

In the process of setting up SASL and StartTLS under Postfix, I got this in the log:

“warning: SASL authentication failure: cannot connect to saslauthd server: No such file or directory”

It took me some minutes to figure out what was wrong. Obviously it has to do with postfix not being able to connect to the saslauthd server. The question is just, why? I went with a hunch of permissions … and wasted some time, so if you get the error, here is what solved it for me. In

/etc/default/saslauthd

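Postfix on Ubuntu runs chrooted under /var/spool/postfix, so it cannot see a socket created in /var/run/saslauthd. You have to replace the default line and let saslauthd create its socket under the Postfix directory structure. That is

> OPTIONS="-c -m /var/spool/postfix/var/run/saslauthd"
< OPTIONS="-c -m /var/run/saslauthd"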

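And then restart both daemons, so that saslauthd picks up the new OPTIONS line

/etc/init.d/saslauthd restart
/etc/init.d/postfix restart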

Not rocket science, but I chased a wild goose looking at permissions for 15 minutes before figuring out what was wrong.
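
A quick way to skip the wild goose chase is to check whether the socket is where Postfix expects it. This is a minimal sketch, assuming saslauthd names its socket "mux" inside the directory given with -m; sasl_socket_ok is a name made up for this example.

```shell
# Return success if a saslauthd socket exists in the given directory.
# "mux" is assumed to be the socket name inside the -m directory.
sasl_socket_ok() {
    [ -S "$1/mux" ]
}

if sasl_socket_ok /var/spool/postfix/var/run/saslauthd; then
    echo "saslauthd socket found under the Postfix spool directory"
else
    echo "no saslauthd socket under /var/spool/postfix/var/run/saslauthd"
fi
```

If the second message shows up, saslauthd is most likely still creating its socket in the default location.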

Friday, August 13th, 2010

If you get this, trying to execute basic PBS commands

$ qstat -a
qstat: End of File

it is a permission problem. Basically you have to do some queue/server administration using qmgr as a user who already has admin access:

$ qmgr
set server acl_users += <username>
set queue <queuename> acl_users += <username>
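
The same two directives can be fed to qmgr non-interactively with its -c option, which is handy when several users need access. A sketch only; acl_commands is a helper invented for this example, and the user and queue names in the usage line are placeholders.

```shell
# Print the two qmgr directives needed for one user and one queue.
# $1 is the username, $2 the queue name.
acl_commands() {
    printf 'set server acl_users += %s\n' "$1"
    printf 'set queue %s acl_users += %s\n' "$2" "$1"
}

# Usage (run as a user who already has admin access):
#   acl_commands alice batch | while read -r cmd; do qmgr -c "$cmd"; done
```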

Ubuntu Headless installation (serialport)

Sunday, August 1st, 2010

I had to install Ubuntu 10.04 Server Edition over the serial port. This is doable, but requires a keyboard to be attached and keys to be pressed blindly in the right sequence. Here is a little cookbook on what to press:

  1. Run a terminal client on the serial console device. Configure the serial port to either the failsafe 9600 n-8-1, Hardware Flow Control=NONE, or to the much speedier 115200 n-8-1, Hardware Flow Control=NONE. The latter should be used only if the serial ports at both ends support this kind of speed.
  2. Boot the server with the Ubuntu 10.04 server install media in the CD/DVD drive, or from a USB key
  3. When it has booted into the installation menu (takes a couple of seconds), then do this
    • Press ‘Enter’ (for language selection)
    • Press ‘F6’ (Other Options)
    • Press ‘ESC’ (to close the Other Options Menu)
    • Press ‘Backspace’ 3 times (to delete “-- ”)
    • Type ‘console=ttyS0,115200n8 -- ’
    • Press ‘Enter’
  4. Installation will continue, outputting the dialogs on the serial device.

You should of course replace ttyS0 with another serial device, if applicable. I am at a loss as to why I have to do this in order to install Ubuntu Server 10.04 over the serial port in 2010!
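
The console argument is just the device name followed by the speed, the parity (n for none) and the number of data bits (8). A tiny sketch of how the string is put together for another device or speed; console_arg is a name made up for this example.

```shell
# Build the kernel console argument for a serial device and speed,
# with no parity and 8 data bits (the "n8" suffix).
# $1 is the device (e.g. ttyS1), $2 the speed (e.g. 9600).
console_arg() {
    echo "console=$1,${2}n8"
}

console_arg ttyS0 115200
```

The speed given here has to match what the terminal client on the other end of the cable is configured for.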

How to use a broken IBM Thinkpad T43 for something useful using puppy linux.

Wednesday, July 21st, 2010

I recently got my hands on an IBM Thinkpad T43. Unfortunately it was broken. More specifically, the connector between the mainboard and the harddrive had problems.

IBM T43 connector (broken)

I googled a bit and discovered, that this was a known problem. The laptop worked fine, if I put pressure on the right spot on the case of the Thinkpad. Otherwise it would not detect and/or spin up the harddrive. I tried to fix it by re-soldering the connector and using some two-component glue to fixate the connector. I did not succeed :-/

So then what? Throw out the laptop? Or? … I decided to make a project out of it.

A laptop without storage is useless. Due to the broken connector, I could not use a harddrive. I did not want to use a cdrom as it is a slow medium. That left me with a USB flash drive as the only option. It would be a clumsy solution just to plug a USB flash drive into the laptop and be done with it, so I chose to solder a USB flash directly onto the mainboard.

First I stripped a standard 1GB USB Flash from its case and detached the PCB from the USB connector using a soldering iron

Stripped USB flash

After that I soldered 5 wires onto the PCB of the laptop and used one of the holes in the PCB used for assembly as a pass through hole. I initially used 4 wires as the USB connector only had 4 pins, but that was not enough. More on that later.

Wires soldered onto mainboard

Having soldered the wires onto the mainboard, I now needed to solder the other ends onto the PCB of the USB flash. That went fairly smoothly

Wires soldered onto flash

Before powering on anything, I used a multimeter to check for bad solder joints and shorts. I found neither.

Checking for shorts using digital multimeter

Luckily I had a Linux based rescue distribution installed on the USB flash drive, so I just booted that to see if the operation on the T43 was a success. As can be seen below it worked just fine. Well, sort of fine, but more on that in a moment.

Testbooting the flashdrive

Almost done now. I just needed to assemble the laptop again, leaving the USB Flash inside.

Ready to wrap up

Closing the lid on the laptop, securing all the screws I had myself a working IBM Thinkpad T43. Or so I thought. When I tested the laptop thoroughly I discovered that the kernel ring buffer was filled with

hub 2-0:1.0: over-current change on port 1
hub 2-0:1.0: over-current change on port 2

That cryptic message just states that the USB device is drawing more power than the USB specification allows. Or, more precisely, that the port on the USB hub inside the laptop is delivering more power than it is supposed to. At first that puzzled me. Then I read up on the USB connector and realized my mistake. The 4 wires of the USB connector consist of VCC, GND, Data+ and Data-. Since both VCC and GND were among those 4 pins, I had only soldered 4 wires. After seeing the problems above, I investigated the matter and found a reference to OverCurrent (OC) protection on the header itself. I thus soldered the 5th wire and the problem went away.
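
To see whether a fix like this really made the messages go away, it helps to count them per port before and after. A small sketch, assuming the message format shown above with the port number as the last word; overcurrent_by_port is a name made up for this example.

```shell
# Count "over-current change" kernel messages per USB port.
# Reads kernel log lines on stdin; the port number is the last field.
overcurrent_by_port() {
    awk '/over-current change on port/ { count[$NF]++ }
         END { for (p in count) print "port " p ": " count[p] }' | sort
}

# Usage:
#   dmesg | overcurrent_by_port
```

No output at all means the kernel ring buffer holds no such messages.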

I now had a working IBM Thinkpad T43 with 1GB of flash storage. What should I use it for? I decided to use it for puppy linux. Primarily for two reasons.

  1. It appeared to be tailor made for small harddrives
  2. I had never tried it before

I downloaded the 106MB large iso file and burned it onto a CDrom. Now I faced the problem of installing puppy linux onto my flash without using a cdrom drive (as I found the laptop without one). Puppy linux made it quite easy. Using another computer I booted the cdrom. Installed puppy onto a spare flash drive. Booted that flash drive in my IBM Thinkpad T43 and pressed “install” once more, installing it onto the “internal” flash drive.

Booted into puppy linux

All done. Actually it takes quite some time to boot the machine, but that is primarily due to the BIOS insisting on searching for a harddrive. Unfortunately the IBM BIOS lacks the option to stop it from doing that. After the system is loaded, however, it is lightning fast. Way faster than my IBM Thinkpad T400. This is due to the fact that puppy linux loads everything into a ramdisk, so starting a program does not require any moving parts to be ready. Programs start instantaneously. The whole experience just proves (once more) that the computers of today are severely I/O limited, but hopefully SSDs will change that real soon now(tm)

Certified CSA – HP-UX 11i v3

Friday, June 18th, 2010

Then I got around to getting certified in HP-UX. I passed with a score of 80% in 75 minutes at Atea using a standard Prometric test. I had hoped for a little bit more, but I was under a lot of pressure work-wise up until the test, so I did not get around to rehearsing as much as I wanted to.

I can recommend ‘HP Certified Systems Administrator – 11i V3, 3rd Edition’ by Asghar Ghori as an aid to getting your CSA.

Next up is HP-UX CSE – High Availability.

Unstartable volume in vxvm

Wednesday, February 3rd, 2010

I recently had a case where I got

ERROR V-5-1-1198 Volume misc2prd_redo2vol has no CLEAN or non-volatile ACTIVE plexes

The plex associated with the volume was in RECOVER state. This can happen “if the plex content is out of-date with respect to the volume. This can happen if a disk containing one or more of the plex’s subdisks has been replaced or reattached”. In my case it was caused by a failing disk that was brought online again later. To recover I did:

# Force the plex into offline state
sudo vxmend -g misc2prd_dg2 -o force off misc2prd_redo2vol-01
# Put the plex into stale state
sudo vxmend -g misc2prd_dg2 on misc2prd_redo2vol-01
# Put the plex into clean state
sudo vxmend -g misc2prd_dg2 fix clean misc2prd_redo2vol-01

# Start the volume
sudo vxvol -g misc2prd_dg2 startall
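
To find out which plexes are affected in the first place, the vxprint listing can be filtered on the state. This is a sketch under assumptions: it expects plex records to start with "pl" followed by the plex name, with the state somewhere after the third column, and plexes_in_state is a name made up for this example.

```shell
# List the names of plexes that are in a given state.
# Reads vxprint output on stdin; $1 is the state, e.g. RECOVER.
# Assumed layout: "pl <plex> <volume> <kstate> <state> ..."
plexes_in_state() {
    awk -v state="$1" '$1 == "pl" {
        for (i = 4; i <= NF; i++)
            if ($i == state) { print $2; break }
    }'
}

# Usage:
#   sudo vxprint -g misc2prd_dg2 -ht | plexes_in_state RECOVER
```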

Virtual interfaces under HPUX 10.20

Friday, January 22nd, 2010

To define one, do (almost) as you would on HP-UX 11.00+

sudo vi /etc/rc.config.d/netconf
RARPD=0
INTERFACE_NAME[0]=lan2
IP_ADDRESS[0]=10.17.137.227
LANCONFIG_ARGS[0]=ether
SUBNET_MASK[0]=255.255.255.0
DHCP_ENABLE[0]=0

INTERFACE_NAME[1]=lan2
IP_ADDRESS[1]=10.17.137.226
LANCONFIG_ARGS[1]=ether
SUBNET_MASK[1]=255.255.255.0
DHCP_ENABLE[1]=0
sudo /sbin/init.d/net start

To remove it again

sudo ifalias lan2 del 10.17.137.226

Virtual interfaces under linux.

Friday, January 15th, 2010

As with other operating systems, it is possible to bring multiple service IP addresses online on one physical NIC under Linux. This is just a brief howto on doing it.

ubuntu:

sudo vi /etc/network/interfaces
# The primary network interface
auto eth0
iface eth0 inet static
 address 192.168.1.2
 netmask 255.255.255.0
 network 192.168.1.0
 broadcast 192.168.1.255
 gateway 192.168.1.1
# The first virtual interface
auto eth0:1
iface eth0:1 inet static
 address 192.168.1.100
 netmask 255.255.255.0
 broadcast 192.168.1.255
 gateway 192.168.1.1

After you have edited that file, issue

sudo /etc/init.d/networking restart

RHEL (Redhat)/Centos:

cd /etc/sysconfig/network-scripts
sudo cp ifcfg-eth0 ifcfg-eth0:0
sudo vi ifcfg-eth0:0

cat ifcfg-eth0:0

> DEVICE=eth0:0
> BOOTPROTO=none
> IPADDR=192.168.1.100
> NETMASK=255.255.255.0
> GATEWAY=192.168.1.1
> ONBOOT=yes

After that, issue

sudo service network restart
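
On either distribution an alias can also be added on the fly with the ip tool, which is useful for testing before touching the config files. The sketch below only prints the command instead of running it; mask_to_prefix and alias_command are names made up for this example, and the conversion does not check that the netmask bits are contiguous.

```shell
# Convert a dotted netmask such as 255.255.255.0 to a prefix length (24).
mask_to_prefix() {
    bits=0
    for octet in $(echo "$1" | tr '.' ' '); do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

# Print (not run) the command that adds an alias address.
# $1 = device, $2 = address, $3 = netmask, $4 = alias number.
alias_command() {
    prefix=$(mask_to_prefix "$3") || return 1
    echo "ip addr add $2/$prefix dev $1 label $1:$4"
}

# Usage:
#   sudo $(alias_command eth0 192.168.1.100 255.255.255.0 1)
```

An address added this way does not survive a reboot, so the files above are still needed for a permanent setup.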