Archive for the ‘work’ Category

FTTH via TDC HomeDuo Fiber (your own router on TDC HomeDuo Fiber)

Monday, July 13th, 2015

Finally I got FTTH. My manager at work was cool enough to let me have FTTH by upgrading my work@home package from ADSL to fiber. This picture shows the difference between ADSL and fiber very well: latency goes from 16 ms to 2 ms, and the packet loss is gone (the drops you see in the 2 ms line are me fooling around, but more about that in a moment).

fiber

Technically, TDC HomeDuo Fiber consists of a Raycore RC-OE1A followed by a Sagem HomeBox (rebranded for TDC). The latter is geared towards Mr. and Ms. Ignorance and is very limited in what you can actually do. To make matters even worse, my company has a special profile that locks the box down even further, making it impossible to live with if you are just a bit technical. Wanting to go back to my beloved RT-N16 running OpenWrt, I ran some experiments to make it work (hence the packet drops). It is actually not that hard:

  • TDC FTTH uses VLAN 101 for WAN
  • To get your own router online fast, I suggest using the MAC address from the Sagem crap.

So to get that to work, select Network in OpenWrt, then select Switch. Enable VLAN tagging and assign VLAN 101 to port 0 (WAN) and the CPU port.

Screenshot from 2015-07-13 14:11:59

After that is done, you select interface, select WAN (eth0.101) and press edit

interface

Under the advanced tab you then override the MAC address if needed, and it of course has to start at boot

Screenshot from 2015-07-13 14:27:36

And then under physical, you bind the WAN to eth0.101

Screenshot from 2015-07-13 14:28:20
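
The same three steps (tagged VLAN 101, MAC override, WAN bound to eth0.101) can probably also be done from the command line. The following is a minimal sketch that only prints UCI commands so they can be reviewed first. The switch device name "switch0", CPU port 8, the interface section name "wan" and the MAC address are assumptions of mine, not something from the setup above; the option names are the ones OpenWrt used around that time, so check them against your own /etc/config/network.

```shell
#!/bin/sh
# Sketch: print UCI commands for a tagged WAN VLAN, for review before applying.
# Assumptions (not from the post): switch device "switch0", CPU port 8,
# and a WAN interface section named "wan". Check /etc/config/network first.

wan_vlan_commands() {
    vlan="$1"                    # VLAN id used by the ISP, 101 in the post
    mac="$2"                     # MAC address to clone; "" keeps the router's own
    switch_dev="${3:-switch0}"   # assumed switch device name
    cpu_port="${4:-8}"           # assumed CPU port number

    echo "add network switch_vlan"
    echo "set network.@switch_vlan[-1].device='$switch_dev'"
    echo "set network.@switch_vlan[-1].vlan='$vlan'"
    # Port 0 is the WAN port in the post; the "t" suffix marks a port as tagged.
    echo "set network.@switch_vlan[-1].ports='0t ${cpu_port}t'"
    echo "set network.wan.ifname='eth0.$vlan'"
    if [ -n "$mac" ]; then
        echo "set network.wan.macaddr='$mac'"
    fi
    echo "commit network"
}

# Example with a placeholder MAC address:
wan_vlan_commands 101 "00:11:22:33:44:55"
```

If the printed commands look right for your hardware, they can be piped into `uci batch` on the router, followed by `/etc/init.d/network restart`.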

And the result of going from the Sagem box to OpenWrt is also measurable. The difference in latency is not large, but it is there. The usability of OpenWrt over the Sagem box is however worth every hassle endured.

openwrt_over_sagem

The bandwidth provided by my work is “only” 50Mbit/s. The Raycore connects at 100Mbit/s, so the limit is artificial and enforced on the TDC equipment upstream.

HP-UX filecache_max tunable and system unresponsiveness

Thursday, January 12th, 2012

During work today I tweaked the online tunable called filecache_max on an HP-UX 11.31 box. From 1 to 5%. Went fine. I tested what I needed to test and then decided to lower the value again. So I ran kctune filecache_max=2% and then nothing.

Everything(!) stopped working. Well, the system could be pinged, but the other cluster node started to complain about its sister disappearing. I literally got cold hands. This was supposed to be an online operation. I waited for 4 very, very long minutes (being a computer professional has taught me to wait for stuff to finish).

After 4 minutes the system was back, running as if nothing had happened. Everything literally just froze up while the system was trying to reduce the filecache. I learnt it the hard way. Hope this post will prevent you from getting into the same kind of trouble.

Debugging “Unsupported authentication scheme 'ntlm'” errors while using Perl WWW::Mechanize

Thursday, July 21st, 2011

Recently I got a really boring task of performing the same task over and over again … in a GUI. Obviously I am waaay too lazy for that (and bright enough to figure out how big a waste of time that is).

In an instant I fired up vi and started to hack together a little Perl script using WWW::Mechanize to do all the work for me. I quickly ran into a problem: I got “Unsupported authentication scheme 'ntlm'” errors when trying to fetch the pages.

I played around for quite some time before I realized that I was missing the Authen::NTLM module. I had LWP::Authen::Ntlm installed, and that tricked me for quite a while. I was down to a simple script that just had

#!/usr/bin/perl -w
use LWP::Authen::Ntlm;

before I could see that I was missing Authen::NTLM. Once I had that downloaded and installed, things started to work as expected.
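
A quicker way to spot this kind of thing is to ask Perl directly whether it can load each module. This is a small sketch of that check, not what I did at the time; the function names are made up for the example.

```shell
#!/bin/sh
# Sketch: report whether Perl can load each module named on the command line.
have_perl_module() {
    # "perl -MName -e 1" loads the module and exits; non-zero means it failed.
    perl -M"$1" -e 1 >/dev/null 2>&1
}

check_modules() {
    for module in "$@"; do
        if have_perl_module "$module"; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
        fi
    done
}

# The two modules from this post:
check_modules LWP::Authen::Ntlm Authen::NTLM
```

A line starting with MISSING for Authen::NTLM next to an ok for LWP::Authen::Ntlm would be exactly the situation described above.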

Debugging dns problems

Thursday, February 17th, 2011

Recently I faced a DNS problem in a complex setup. I had a very locked down jumphost with one public network and two internal networks and a very strict firewall controlling what packets went in and out.

On the inside I had a linux machine running BIND, also with a firewall and a locked down setup.

On yet another host on the inside, running HP-UX, DNS resolving worked just fine.

On the jumphost it didn’t work at all.

Took me hours to figure out what was going on. I went over the firewall again and again, on both the jumphost and the DNS server. I went over the BIND configuration again and again. The network setup. To no avail. All I got was

Got recursion not available from 192.168.1.79, trying next server

In the end it turned out to be due to the fact that on the Linux jump server, I had two nameserver lines

domain zensonic.dk
search zensonic.dk
nameserver 192.168.1.79
nameserver 192.168.1.80

I hadn’t bothered to set up the DNS at 192.168.1.80 and thus my Linux client would not function. As soon as I removed 192.168.1.80 from /etc/resolv.conf everything was as it should be. I hope that you, reading this, save some hours' worth of debugging. If you do, drop me a line/mail/beer :-)
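
In hindsight, querying every configured nameserver directly would have pointed at the culprit much faster. A minimal sketch of that idea follows; the function names are made up, example.org is just a placeholder name, and the second function assumes dig is installed on the client.

```shell
#!/bin/sh
# Sketch: list the nameservers a resolver config names, so each can be tested.
list_nameservers() {
    # Print the address from every "nameserver" line in the given file.
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

check_nameservers() {
    # Query each server directly; one that errors out or times out is a suspect.
    for ns in $(list_nameservers "$1"); do
        echo "== $ns"
        dig @"$ns" example.org +short +time=2 +tries=1
    done
}
```

Run it as `check_nameservers /etc/resolv.conf` and compare what each server answers.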

Utilizing Seagate 7200.12 drives in an MSA20

Wednesday, February 2nd, 2011

About a year ago, I upgraded an MSA20 with non-HP drives, 1TB drives (7200.11 series) made by Seagate to be precise. Now, one year later, the first drives are starting to fail. Looking for replacements, we had a hard time finding the 7200.11 series drives, so we bought some 1TB 7200.12 drives.

And they work just fine……

Same model number, just different firmware as seen from the MSA20 (CC46 vs CC38):

physicaldrive 1:7
Box: 1
Bay: 7
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 1000.2 GB
Firmware Revision: CC46
Serial Number: 9VPB04V3
Model: Seagate ST31000528AS
SATA NCQ Capable: False

physicaldrive 1:8
Box: 1
Bay: 8
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 1000.2 GB
Firmware Revision: CC38
Serial Number: 9VP4D1F1
Model: Seagate ST31000528AS
SATA NCQ Capable: False
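
With a full shelf of drives it gets tedious to compare firmware by eye. Here is a rough sketch that boils the listing down to one line per drive. It reads the text on stdin and relies only on the field labels visible in the output above; the function name is made up, and I am assuming the Model line comes after the Firmware Revision line, as it does here.

```shell
#!/bin/sh
# Sketch: condense "show config detail" output to one line per physical drive.
# Reads the listing on stdin; field labels are taken from the output in this post.
summarize_drives() {
    awk '
        $1 == "physicaldrive" { drive = $2 }     # e.g. 1:7
        /Firmware Revision:/  { fw = $NF }       # e.g. CC46
        /Model:/ {
            sub(/^.*Model:[ \t]*/, "")           # keep only the model string
            print drive, fw, $0
        }
    '
}
```

Something like `hpacucli ctrl ch="mirror" show config detail | summarize_drives` should then print lines such as `1:7 CC46 Seagate ST31000528AS`, assuming hpacucli accepts the command on its command line on your system.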

Debugging thread exhaustion on HP-UX

Saturday, October 9th, 2010

Thread exhaustion on an HP-UX machine manifests itself by one or more of the following errors in /var/adm/syslog/syslog.log:

vmunix: kthread: table is full
vmunix: WARNING: hponc_thread_create(): error creating thread for autofskd (12)
sshd[8474]: fatal: fork of unprivileged child failed
sshd[1895]: error: fork: Resource temporarily unavailable

The keywords are thread, fork and failed. You should immediately look at nkthread with kcusage:

sudo kcusage nkthread
Tunable                 Usage / Setting
=============================================
nkthread                 5254 / 4096

You then get a description with kctune:

sudo kctune -v nkthread
Tunable             nkthread
Description         Maximum number of threads on the system
Module              pm_proc
Current Value      4096
Value at Next Boot  4096
Value at Last Boot  4096
Default Value       8416
Constraints         nkthread >= 200
                    nkthread <= 4194304
                    nkthread >= max_thread_proc
                    nkthread >= nproc + 100
                    nkthread >= (5 * vx_era_nthreads)
Can Change          Immediately or at Next Boot

You then resolve the problem with e.g.

sudo kctune nkthread+=8192

After that you would, as a good sysadmin, start to look at the usage back in time. Mind you that the percentage you see is relative to the tunable value in effect now (the one you just set a moment ago), not to what it was at the time of the measurement!

sudo kcusage -m nkthread
Tunable:        nkthread
Setting:        28051
Time                           Usage      %
=============================================
Thu 09/09/10                    6285   22.4
Fri 09/10/10                    6403   22.8
Sat 09/11/10                    6368   22.7
Sun 09/12/10                    6150   21.9
Mon 09/13/10                    6336   22.6
Tue 09/14/10                    6436   22.9
Wed 09/15/10                    6382   22.8
Thu 09/16/10                    6416   22.9
Fri 09/17/10                    6277   22.4
Sat 09/18/10                    6157   21.9
Sun 09/19/10                    6203   22.1
Mon 09/20/10                    6319   22.5
Tue 09/21/10                    6420   22.9
Wed 09/22/10                    6306   22.5
Thu 09/23/10                    6474   23.1
Fri 09/24/10                    6567   23.4
Sat 09/25/10                    6452   23.0
Sun 09/26/10                    5910   21.1
Mon 09/27/10                    8260   29.4
Tue 09/28/10                    8240   29.4
Wed 09/29/10                    6617   23.6
Thu 09/30/10                    6461   23.0
Fri 10/01/10                    5799   20.7
Sat 10/02/10                    5558   19.8
Sun 10/03/10                    5892   21.0
Mon 10/04/10                    6983   24.9
Tue 10/05/10                    6542   23.3
Wed 10/06/10                    6479   23.1
Thu 10/07/10                   12289   43.8
Fri 10/08/10                   11108   39.6
Sat 10/09/10                    5292   18.9
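
If you know what the limit was on a given day, you can recompute the percentage column yourself. This is a small sketch, with a made-up function name, that reads saved kcusage -m output on stdin and rescales the usage against whatever limit you pass in.

```shell
#!/bin/sh
# Sketch: recompute the % column of "kcusage -m" rows against a chosen limit.
# Usage: rescale_usage LIMIT < saved_kcusage_output
rescale_usage() {
    awk -v limit="$1" '
        # Data rows look like "Thu 09/09/10   6285   22.4"; skip everything else.
        $2 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
            printf "%s %s %d %.1f\n", $1, $2, $3, 100 * $3 / limit
        }
    '
}
```

Called with the current setting, 28051, it reproduces the numbers above; called with the old limit it shows how close to the ceiling the box really was on each day.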

Now you might be able to see what happened when and correlate it with your Change Management procedures to figure out what went wrong. I was not that lucky. This was on a database hotel consisting of 80 databases and database related applications and nearly no change control. So what to do? I needed to correlate the process information visible with ps with the threads. But how do you do that in HP-UX?

First guess: the always valuable tool glance. And lo and behold, the capital Z will show you the thread information (threads are also called Light Weight Processes in HP-UX, or LWP for short). But only on a screen by screen basis, which is useless if you have thousands of processes. After a bit of ping pong with a fellow sysadmin we ended up with the pstack (print stack) tool. It works like

$ ps -ef | grep -i java | head -1
 dma65t7  6728  6708  0 17:48:17 ?         3:23 
/opt/dma65t7/product/6.5/classes/com/documentum/jboss4.2.0/jdk/bin/IA64N/java 
-Dprogram.name=run.sh -server -Xms256m -Xmx512m -

sudo pstack 6728 | grep -i lwpid | sed -e 's,-*,,g' | head -10
lwpid : 7653569
lwpid : 7653570
lwpid : 7653572
lwpid : 7653573
lwpid : 7653574
lwpid : 7653575
lwpid : 7653576
lwpid : 7653577
lwpid : 7653578
lwpid : 7653579

So basically I ended up with the following one-liner

ps -ef > out ;  ps -ef | awk '{ print $1 " " $2 }' | grep -v root |
 while read user pid ; do sudo pstack $pid | egrep -i "($pid|lwpid)"|
 sed -e 's,-*,,g' ; done >> out 2>&1

Which gives a quick and dirty indication of which process eats up all the resources and you can go ask that application owner if that is normal. It wasn’t!
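
A variation that would have saved me some scrolling is to count the lwpid lines per process and sort on that. This is only a sketch with made-up function names; it assumes, as the output above suggests, that pstack prints one lwpid line per thread, and it needs pstack and sudo just like the one-liner.

```shell
#!/bin/sh
# Sketch: count the threads (LWPs) of each non-root process and rank them.
count_lwps() {
    # Count the "lwpid" lines arriving on stdin, one per thread.
    grep -ci 'lwpid'
}

rank_by_threads() {
    # Prints "threads user pid", busiest process first.
    ps -ef | awk 'NR > 1 && $1 != "root" { print $1, $2 }' |
    while read -r user pid; do
        echo "$(sudo pstack "$pid" 2>/dev/null | count_lwps) $user $pid"
    done | sort -rn
}
```

Running `rank_by_threads | head` then lists the processes holding the most threads at the top.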

Non-HP harddrives in an HP MSA20

Thursday, March 11th, 2010

A customer asked me if it was possible to use non-HP drives in an HP MSA20, as they cost a lot less than HP's own drives. I honestly said that it would require a POF. The customer accepted the initial expense of a single 1TB SATA drive. I fired up hpacucli to figure out what was what on this box:

=> ctrl ch="mirror" show config detail
....
....
physicaldrive 1:1
Box: 1
Bay: 1
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 1000.2 GB
Firmware Revision: HPG1
Serial Number: 9QJ2B4GD
Model: HP GB1000EAFJL
SATA NCQ Capable: False

As HP does not make hard drives, but uses OEM drives with custom firmware, I had to figure out what type of drives were in there. The easiest solution would be to shut the box down and pull out a drive to inspect. Having dealt with HP quite a lot, I know that they also relabel the drives, so I would probably not be able to see what type of drives were in there, leaving me guessing if I chose to go that route.

Instead I opted for figuring out what type of drive it was likely to be based on the firmware. I googled a bit and found that the MSA20 could support up to 1TB disks. A bit more googling yielded this advisory from HP about upgrading firmware on Seagate drives to HPG6. Based on the age of the MSA20 in question and the age of the 1TB HP disks we already had in it, I decided it was most likely Barracuda 7200.11 drives that HP utilized for this, and thus we ordered one of those.

The drive arrived. We put it in. Rescanned, and lo and behold:

physicaldrive 1:6
Box: 1
Bay: 6
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 1000.2 GB
Firmware Revision: CC38
Serial Number: 9VP4D0ZA
Model: Seagate ST31000528AS
SATA NCQ Capable: False

A non-HP drive working. We have now placed an order for 19 x 1TB Seagate drives.

Your mileage may vary if you try this. It is also worth mentioning that it would be an option to test non-Seagate disks and/or bigger disks. Beware of the heat and power requirements though! HP themselves only sell the MSA20 with up to 1TB disks.

Finally, for the record, it should be stated that this was on an MSA20 with this firmware level:

MSA20 in mirror
 Bus Interface: SCSI
 Serial Number: PAAAC0PMQTR7V0
 Chassis Serial Number: E01RMLJ17M
 Chassis Name: mirror
 RAID 6 (ADG) Status: Enabled
 Controller Status: OK
 Chassis Slot: 2
 Hardware Revision: Rev A
 Firmware Version: 2.08
 Rebuild Priority: Medium
 Expand Priority: Medium
 Surface Scan Delay: 3 secs
 Cache Board Present: True
 Cache Status: OK
 Accelerator Ratio: 50% Read / 50% Write
 Drive Write Cache: Disabled
 Read Cache Size: 56 MB
 Write Cache Size: 56 MB
 Total Cache Size: 112 MB
 Chassis Slot 2 Battery Info
 Battery Pack Count: 2
 Battery Status: OK
 Host Bus Adapter Slot: Slot Unknown
 Host Bus Adapter Port: 1
 SATA NCQ Supported: False

New job – Senior Operations Specialist in NNIT

Monday, March 1st, 2010

Well, then it happened. I quit Telia. I will surely miss my colleagues, who are among the smartest and most dedicated people in Denmark, but it was time to move on to new worlds.

I look forward to working for NNIT. A title of Senior is new to me. Let us see if I can lift the burdens put onto my shoulders.

Virtual interfaces under linux.

Friday, January 15th, 2010

As with other operating systems, it is possible to bring multiple service IP addresses online under one physical NIC under Linux. This is just a brief howto on doing it.

ubuntu:

sudo vi /etc/network/interfaces
# The primary network interface
auto eth0
iface eth0 inet static
 address 192.168.1.2
 netmask 255.255.255.0
 network 192.168.1.0
 broadcast 192.168.1.255
 gateway 192.168.1.1
# The first virtual interface
auto eth0:1
iface eth0:1 inet static
 address 192.168.1.100
 netmask 255.255.255.0
 broadcast 192.168.1.255
 gateway 192.168.1.1

After you have edited that file, issue

sudo /etc/init.d/networking restart

RHEL (Redhat)/Centos:

cd /etc/sysconfig/network-scripts
sudo cp ifcfg-eth0 ifcfg-eth0:0
sudo vi ifcfg-eth0:0

cat ifcfg-eth0:0
DEVICE=eth0:0
BOOTPROTO=none
IPADDR=192.168.1.100
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
ONBOOT=yes

After that, issue

sudo service network restart
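
If you need more than a couple of aliases, copying and editing by hand gets old. Here is a minimal sketch that prints an alias file from four arguments; the function name is made up, and it leaves out GATEWAY on the assumption that the default route stays on the parent interface.

```shell
#!/bin/sh
# Sketch: print an ifcfg alias file for RHEL/CentOS network-scripts.
# Usage: make_ifcfg_alias PARENT INDEX IPADDR NETMASK
make_ifcfg_alias() {
    cat <<EOF
DEVICE=$1:$2
BOOTPROTO=none
IPADDR=$3
NETMASK=$4
ONBOOT=yes
EOF
}

# Values from the example above; redirect the output into
# /etc/sysconfig/network-scripts/ifcfg-eth0:0 to use it.
make_ifcfg_alias eth0 0 192.168.1.100 255.255.255.0
```

After writing the file, the same network restart as above brings the alias up.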

Problems continuing calculations using Schrödinger checkpoint files

Sunday, November 22nd, 2009

I recently helped identify a problem with continuing calculations from a checkpoint file, performed by Desmond from the Schrödinger software portfolio. The error we got was

fail to extract simulation parameters from checkpoint file <filename>

It worked for a lot of people, but not for others. I searched high and low for differences:

  • Different unix environment settings.
  • Different MPI settings (intel mpi vs. gcc mpi vs. software mpi)
  • Different permission settings on home dirs
  • Differences in groupmembership between those who could continue calculations and those who couldn’t.
  • Performing the calculations on different nodes in the calculation clusters.

As it turned out, none of these points was responsible for the problem. Instead it was the use of non-English characters in the comment field in the /etc/passwd file for the users who could not continue the calculations from a checkpoint file. The fix for this is simple:

usermod -c "new name, only english chars" <userid>

Everything was fine after that.
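
To find the accounts that need this treatment, you can look for non-ASCII bytes in the comment field. This is a minimal sketch with a made-up function name; it reads a passwd-format file, so on a cluster where users come from NIS or LDAP you would have to feed it the output of getent passwd saved to a file instead.

```shell
#!/bin/sh
# Sketch: list accounts whose comment (GECOS) field has non-ASCII bytes.
non_ascii_gecos() {
    # Field 5 of a passwd-format file is the comment. In the C locale awk
    # compares bytes, so [^ -~] matches anything outside printable ASCII.
    LC_ALL=C awk -F: '$5 ~ /[^ -~]/ { print $1 }' "${1:-/etc/passwd}"
}

non_ascii_gecos /etc/passwd
```

Every user id it prints is a candidate for the usermod -c fix above.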