When using Veritas Cluster Server (VCS), the following error messages from your system logs can indicate a problem with the cluster heartbeat interconnects:

Dec 12 15:26:20 serverb llt: [ID 194859 kern.notice] LLT:10019: delayed hb 935350 ticks from 1 link 0 (qfe:0)
Dec 12 15:26:20 serverb llt: [ID 761530 kern.notice] LLT:10023: lost 18706 hb seq 3448194 from 1 link 0 (qfe:0)
Dec 12 15:26:20 serverb llt: [ID 194859 kern.notice] LLT:10019: delayed hb 935350 ticks from 1 link 1 (eri:0)
Dec 12 15:26:20 serverb llt: [ID 761530 kern.notice] LLT:10023: lost 18706 hb seq 3448194 from 1 link 1 (eri:0)

These types of messages can appear when you run two LLT links over the same physical network. This is poor design, as it may introduce a single point of failure. However, there are situations where you have two physical connections into your cluster servers and the links run over the same VLAN. If you are sure your interconnects are working properly and you are seeing this error for the reason described above, you should be able to solve it by changing your /etc/llttab file on all cluster members.

By default, on Solaris, your /etc/llttab file will look something like this:

set-node servera
set-cluster 1
link eri0 /dev/eri:0 - ether - -
link qfe0 /dev/qfe:0 - ether - -
link-lowpri ce0 /dev/ce:0 - ether - -

The second-to-last field for each of the links is the SAP field, or Ethernet type used for the LLT link. This defaults (when specified using -) to 0xCAFE. Two LLT links on the same physical broadcast domain for a cluster cannot share the same SAP ID. If they do, you may get the above error messages. Assuming this is your problem (e.g., if you run your eri0 and qfe0 links over the same broadcast domain), you can work around it by changing your /etc/llttab file to the following:

set-node servera
set-cluster 1
link eri0 /dev/eri:0 - ether 0xCAFE -
link qfe0 /dev/qfe:0 - ether 0xCAFF -
link-lowpri ce0 /dev/ce:0 - ether - -

This tells LLT to use different SAP types for the two links. The change needs to be made on all cluster members, and each node (or at least LLT on each node) then needs to be restarted.
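
As a quick sanity check, the SAP rule above can be sketched in Python. This is only a rough sketch: it assumes the seven-field link layout shown in the examples (with the SAP as the second-to-last field) and the 0xCAFE default mentioned above, and the function names are made up for illustration. It also has no idea which links actually share a broadcast domain, so it simply lists every SAP used by more than one link and leaves the judgement to you.

```python
DEFAULT_SAP = 0xCAFE  # default when the SAP field is "-", per the text above


def link_saps(llttab_text):
    """Return {link tag: SAP} for every link / link-lowpri line."""
    saps = {}
    for line in llttab_text.splitlines():
        fields = line.split()
        # Assumed layout: directive tag device node-range link-type SAP MTU
        if len(fields) < 7 or fields[0] not in ("link", "link-lowpri"):
            continue
        tag, sap_field = fields[1], fields[5]
        saps[tag] = DEFAULT_SAP if sap_field == "-" else int(sap_field, 16)
    return saps


def shared_saps(saps):
    """Return {SAP: [tags]} for every SAP used by more than one link."""
    by_sap = {}
    for tag, sap in saps.items():
        by_sap.setdefault(sap, []).append(tag)
    return {sap: tags for sap, tags in by_sap.items() if len(tags) > 1}


if __name__ == "__main__":
    with open("/etc/llttab") as f:
        for sap, tags in shared_saps(link_saps(f.read())).items():
            print("SAP 0x%X is used by: %s" % (sap, ", ".join(tags)))
```

With the modified file above, this would still report eri0 and ce0 as sharing 0xCAFE. That is only a problem if those two links sit on the same broadcast domain.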

Sun Blade 100 and Registered/buffered memory

By default, the Sun Blade 100 comes with 133MHz Sync ECC CL3 unbuffered DIMMs. By setting jumper JP6 (which is next to the memory slots), you can use registered/buffered memory in the Sun Blade 100.

The SunSolve handbook for the Sun Blade 100 shows where the jumper JP6 is located.

You can mix registered and unregistered memory on the system board and it still appears to work OK.

 

Sun Blade 100, power management and ECC errors

The Sun Blade 100 workstations have a problem with their power management circuitry. If power management is enabled within Solaris (or Linux, I guess), you can get uncorrectable ECC memory errors or other random hangs.

To work around this, you can edit /etc/power.conf and change the autopm line to:
autopm disable

Alternatively, you can just uninstall the power management packages.
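
If you have a few of these machines to fix, the /etc/power.conf edit can be scripted. Below is a minimal Python sketch, assuming the setting lives on a single line whose first word is autopm. The function name and the sample input are hypothetical, not taken from a real power.conf.

```python
def disable_autopm(power_conf_text):
    """Return power.conf text with the autopm line set to 'autopm disable'."""
    out = []
    replaced = False
    for line in power_conf_text.splitlines():
        fields = line.split()
        # Assumption: the setting is a line whose first word is "autopm".
        if fields and fields[0] == "autopm":
            out.append("autopm disable")
            replaced = True
        else:
            out.append(line)  # comments and other settings pass through
    if not replaced:
        out.append("autopm disable")  # no autopm line found, so add one
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = "# example power.conf\nautopm default\n"
    print(disable_autopm(sample), end="")
```

The function only transforms text. Reading the real file, keeping a backup copy, and writing the result back are left out on purpose.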

SunSolve has document #47042 on this issue. Also searching for “Sun blade 100 Alert” reveals a few more tidbits about this machine.

 

Well today I’ve been upgrading a couple of my servers from VMware ESXi 3.5 and ESXi 4.1 to ESXi 5.0. For the most part this went smoothly and without any drama.

The HP DL360 G5 upgrade from ESXi 4.1 to 5.0 went smoothly and the upgrade process maintained all the settings and configuration properly. The hardware health monitors were working before and after the upgrade without the need for any additional fiddling. I used the VMware ESXi 5.0U1 ISO from HP.com for this server.

The HP ML110 G5 needed to be reinstalled as it was running ESXi 3.5 and there is no direct upgrade path to 5.0. After recreating the vSwitches and associated VM port groups I was up and running. I used the HP.com image once more and to my surprise the hardware health monitoring now shows the RAID status of the SmartArray E200 controller. In the past, when using the HP providers on ML110 G5 hardware, purple screens were common. Now, the server seems stable and displays the storage health status. A win for the day!

Note that this server needed a further tweak, as the SCSI passthrough of the SCSI-attached LTO3 drive stopped working after the installation of ESXi 5.0. A bit of Googling revealed that the following would solve this problem:

esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --vendor="HP" --model="Ultrium 3-SCSI"

So the VM could now see the attached tape drive. However, VMware appear to have changed their passthrough or SCSI subsystem since ESXi 3.5, and as a result I’ve had to reduce my tape block size. In the past I was able to read and write 512kB blocks (tar -b 1024), but I’ve had to drop this to 128kB blocks (tar -b 256). If I get some time, I will attempt to work out the exact limit and update this post.
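
For reference, the block sizes above follow from tar counting its blocking factor (-b) in 512-byte units. Here is a small Python sketch of the arithmetic. The helper names are made up for illustration, and "kB" is taken to mean 1024 bytes, which is what the numbers above imply.

```python
TAR_BLOCK_BYTES = 512  # tar's -b blocking factor counts 512-byte blocks


def record_bytes(blocking_factor):
    """Size in bytes of one tape record for a given tar -b value."""
    return blocking_factor * TAR_BLOCK_BYTES


def blocking_factor_for(record_kb):
    """The tar -b value that gives records of record_kb kB (1 kB = 1024 bytes)."""
    return record_kb * 1024 // TAR_BLOCK_BYTES


if __name__ == "__main__":
    for b in (1024, 256):
        print("tar -b %d -> %d kB records" % (b, record_bytes(b) // 1024))
```

Going the other way, blocking_factor_for(64) gives the -b value to try if 128kB records also turned out to be too large.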

For the Dell PE840 upgrade, I used the Dell ESXi 5.0 customised ISO. Again, the upgrade from 4.1 preserved the configuration of the server. To my dismay the RAID status of the PERC 5/i was now missing. Turns out the Dell ISO is lacking the providers for storage health. Long story short, after some searching I got the health status back. I initially tried the Dell OpenManage VIB (OM-SrvAdmin-Dell-Web-6.5.0-542907.VIB-ESX50i_A02.zip), which didn’t appear to change much. The useful info was on the RebelIT website, which referred to using the VIB from LSI.com. This made sense as the Dell PERC 5/i is basically an LSI MegaRAID SAS 8480E. I downloaded the VIB (VMW-ESX-5.0.0-LSIProvider-500.04.V0.24-261033-456178.zip) from LSI.com. Note that the 8480E is not listed as supported by this release, but it works – PHEW! I guess the PERC 5/i is getting long in the tooth now, but given it works like a champ there is no need to upgrade. Note that I had to extract the .zip file and then install the VIB from the server’s console as:

esxcli software vib install -v /vmfs/volumes/datastore1/vmware-esx-provider-LSIProvider.vib

So now all three servers have been upgraded to ESXi 5.0 and have full hardware health status available, which is being monitored via Nagios. Now the fun begins: upgrading the hardware version and VMware Tools for all the VMs...

My Tissot PRC200’s “big” second hand has not been aligned to zero for a couple of months and has been annoying me. I figured it needed a trip to the shop to get fixed. On a whim I did a search for this and almost immediately found a fix for this problem! Another “win” for the Internet. The info is from a forum post by “leewmeister” and the useful part is quoted here in case the forum closes.

You can zero each of the hands on the chronograph dials individually. Here’s how:
1) Make sure the chronograph is stopped.
2) Reset the chronograph with the pusher at 4 o’clock. If any of the hands aren’t at their “zero” position they’ll need to be adjusted.
3) Pull out the winding crown to the first position (date setting position).
4) Push the plunger at the 2 o’clock position. This will advance one of the chrono hands a step at a time. Stop it when the hand is at zero.
5) Push the plunger at the 4 o’clock position. This will adjust another of the chrono dials.
6) Pull the crown out to the second position (time setting position) and use the 2 or 4 o’clock plunger to adjust the final chrono dial.
One of the 4 possible crown/plunger combinations doesn’t adjust anything. I don’t have a chrono with me at the moment so I’m not sure which crown/plunger combination is the non-functional one. Anyhow, I hope this helps.