Over the last week I’ve been required to fix four different bugs relating to Xenserver. Not all were major bugs, not all were even Xenserver’s fault.
DVD drive missing
The first bug, and actually one that first showed itself several months ago, is that the option to attach the server’s DVD drive to a VM was not present. This originally happened because the DVD drive in the HP C3000 Blade chassis died and was replaced. Even after the replacement, however, the drive wouldn’t show up in Xencenter. There are forum notes around on recreating the VBD and so on, but in this case that wasn’t even required. After reattaching the DVD drive to the individual blades via the Bladecenter ILO and confirming that the correct CD device appeared in the dmesg output, I ran the command xe-toolstack-restart. This command, as you might guess, restarts the Xenserver toolstack. The DVD drive now shows up in Xencenter. I’d actually logged a bug report with Citrix for this a while back, so credit is due to the Citrix engineer who called me back on this issue and suggested trying xe-toolstack-restart before doing anything else.
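For anyone wanting to script the "did the CD device actually appear in dmesg?" check before restarting the toolstack, here is a rough Python sketch. The helper name optical_drive_lines and the patterns it looks for (CD-ROM, DVD, srN device names) are my own assumptions about what such kernel messages tend to look like, not anything Xenserver provides, so adjust them to whatever your hardware prints.

```python
import re

# Assumed patterns for optical drive messages; tune these for your hardware.
OPTICAL_PATTERN = re.compile(r"cd-?rom|dvd|\bsr\d+\b", re.IGNORECASE)


def optical_drive_lines(dmesg_text):
    """Return the dmesg lines that look like they mention an optical drive."""
    return [
        line
        for line in dmesg_text.splitlines()
        if OPTICAL_PATTERN.search(line)
    ]


# Made-up sample text standing in for real dmesg output.
sample_dmesg = (
    "eth0: link up\n"
    "sr0: scsi3-mmc drive: 24x/24x cd/rw xa/form2 cdda tray\n"
    "Uniform CD-ROM driver Revision: 3.20\n"
)

for line in optical_drive_lines(sample_dmesg):
    print(line)
```

If the list comes back non-empty, the kernel can see the drive and running xe-toolstack-restart is the next thing to try.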
Xencenter not connecting
The same day as fixing the above bug, I had another customer call me saying they couldn’t connect via XenCenter to their Xenserver Enterprise host. I’d had a similar issue several months ago when someone changed the networking configuration on the host, and the fix then was, as above, to run the xe-toolstack-restart command. All fixed! Well, in this case the symptoms were fixed; we still don’t know what caused the underlying problem.
VMs not starting, ISO SR failing after upgrade
This one came through on the same day as well. One of our customers had run an upgrade from 4.0.1 to 4.1.0 on their own internal evaluation system of Xenserver Enterprise, which actually had a couple of production hosts on it. They’d run the upgrade and the ISO storage repository failed to reconnect, and a couple of VMs that had previously had ISO images mounted out of the SR failed to boot. Sadly, xe-toolstack-restart didn’t solve anything for me here.
There is a lot of functionality exposed via the CLI, though, so I was able to force detach the ISO images from the VMs in question. They were in a suspended state, so I also had to manually force reset them. Once I had these fixed, I looked at what had caused the ISO SR to die.
One of the things that a lot of people misunderstand about Xenserver is that it is effectively an appliance. It runs CentOS as the dom0 (privileged domain), but that doesn’t mean you should consider it to be a useful CentOS server. The upgrade process for a Xenserver system is to duplicate the primary partition into a backup partition (copy /dev/sda1 into /dev/sda2, for example). Once this is done, it basically performs a full install of the new version of Xenserver into /dev/sda1 and migrates the settings it knows about: all the Xenserver state, your networking configuration (in theory anyway), and so on. Things it misses include any custom software you might have installed (iSCSI initiators for tape access, monitoring tools, any custom scripts). These all get “deleted”. They’re still actually in the backup partition, just not in the active one.
The upshot of this is that when you connect your ISO SR to a CIFS share and use a hostname to refer to the server rather than an IP address, don’t “make it work” by adding an entry to /etc/hosts. If you want to use hostnames, make sure they work via DNS, and make sure your DNS is set up right on your Xenserver host.
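Before an upgrade it’s worth checking whether the name your ISO SR points at only works because of a hand-added hosts entry. A minimal Python sketch of that check follows. The function names and the server name "fileserver" are made up for illustration, and the sample text stands in for the contents of /etc/hosts.

```python
def names_in_hosts_file(hosts_text):
    """Return the set of hostnames and aliases defined in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # The first field is the IP address; the rest are names for it.
        names.update(name.lower() for name in fields[1:])
    return names


def depends_on_hosts_file(hostname, hosts_text):
    """True if the name is defined in the hosts text, so it may vanish on upgrade."""
    return hostname.lower() in names_in_hosts_file(hosts_text)


# Made-up sample; on a real host you would read /etc/hosts instead.
sample_hosts = (
    "127.0.0.1 localhost\n"
    "# added by hand to make the ISO SR work\n"
    "10.0.0.5 fileserver fs.example.com  # CIFS share\n"
)

if depends_on_hosts_file("fileserver", sample_hosts):
    print("fileserver is defined in the hosts file; move it to DNS before upgrading")
```

Any name this flags is one the upgrade will silently forget, so get it into DNS first.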
I think there’s a lot Xenserver could have done to prevent this bug from happening, so hopefully they’ll add some smarts to auto-detach VDIs from ISO SRs if the SR doesn’t connect properly. I’m not sure there’s a nice way to auto-migrate all the users’ settings (e.g. do an in-place upgrade rather than an overwrite upgrade); there’s too much scope for stuff to change.
Upgrade loses network settings on Xenserver
And now my final bugs, and the most annoying. We have a customer with a Xen Enterprise 3.2 host, with a Win2k3 terminal server and a Win2k3 SBS server on it, running their core business infrastructure. We’d scheduled an outage for the upgrade from 3.2 to 4.0.1 to 4.1.0, and it all looked good, except…
Xenserver network settings failed to migrate. Not sure why this happened; it definitely doesn’t seem to happen every time. In Xenserver 4.1.0, however, the xe pif-reconfigure-ip command reconfigures the IP stack on the host, and I followed it with an xe-toolstack-restart. My favourite command!
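As a rough illustration of what that looks like, here is a small Python sketch that only builds and prints the commands rather than running them. The parameter names (uuid, mode, IP, netmask, gateway, DNS) are as I understand them from the xe CLI; check xe help pif-reconfigure-ip on your own host before trusting them. The function name, the PIF UUID placeholder, and the addresses are made up.

```python
def pif_reconfigure_ip_args(pif_uuid, mode="dhcp", ip=None, netmask=None,
                            gateway=None, dns=None):
    """Build the argument list for an xe pif-reconfigure-ip call (not executed)."""
    if mode not in ("dhcp", "static", "none"):
        raise ValueError("mode must be dhcp, static or none")
    args = ["xe", "pif-reconfigure-ip", "uuid=" + pif_uuid, "mode=" + mode]
    if mode == "static":
        # A static configuration needs at least an address and a netmask.
        if not ip or not netmask:
            raise ValueError("static mode needs ip and netmask")
        args += ["IP=" + ip, "netmask=" + netmask]
        if gateway:
            args.append("gateway=" + gateway)
        if dns:
            args.append("DNS=" + dns)
    return args


# Placeholder UUID and addresses; substitute the values for your own host.
commands = [
    pif_reconfigure_ip_args(
        "<pif-uuid>", mode="static",
        ip="192.168.0.10", netmask="255.255.255.0", gateway="192.168.0.1",
    ),
    ["xe-toolstack-restart"],
]

for command in commands:
    # Print only; paste into the host console once you are happy with them.
    print(" ".join(command))
```

Printing rather than executing is deliberate: getting the IP settings wrong on the management interface is a good way to lock yourself out of the host.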
Xentools won’t install in 4.1.0 system upgraded from 3.2.0
This one took up basically my entire day yesterday. After the upgrade from 3.2.0 through 4.0.1 and into 4.1.0, the VMs booted, but were running the old version of Xentools. The technician doing the upgrade attempted to install the new Xentools, however on both servers it got as far as uninstalling the 3.2.0 Xentools, and then failed completely to install the 4.1.0 version. We spent a lot of time going back and forth uninstalling and attempting to reinstall the drivers, before eventually completely uninstalling them and leaving the systems running without xentools for the afternoon. I then spent most of my evening on the phone to Citrix support in Australia, both looking at the site in question over a very laggy Gotoassist connection. We finally went through another complete uninstall of xentools, including removing all the hidden device drivers (see here for details), and then installed an internal release of Xentools for 4.0.1, which at least resolved the issue.
The bug appears to be within the Xentools, but it could also be within Windows itself, or that’s what I understood from the Citrix engineer I was talking to. We are apparently the second documented occurrence of this bug, and Citrix is working on a final resolution. The Citrix engineer in question had managed to replicate the bug on one of his test systems, which is reassuring to me: they can prove they fix it, at least for some permutation of the problem.
It feels like I’m painting a bad picture of Xenserver here, and maybe I am. You can take what you like from what I’ve written, I guess :). I’m not sure that any company could push through as many major changes as quickly as Xensource/Citrix have and not end up with some showstopper bugs, but I think some of the smaller ones should have been avoidable. Others, like the xentools bug I mentioned last, only seem to affect older systems being upgraded, and even then it doesn’t always happen to them. I don’t really think you can test for that sort of edge case very easily, especially if you don’t know it happens. I’ll post an update when Citrix resolve this last bug, so if anyone is reading this and is put off upgrading their XE 3.2 system, check back for an update!