Looking up .local DNS names under OSX

August 27, 2008 – 12:35 pm

My workplace uses a .local DNS suffix for all internal DNS, which of course causes problems when you’re running a system which uses any form of mdns - such as OSX or Ubuntu (or probably any modern Linux distro, I know SuSE had this problem about 6 years ago). The .local lookups fail, because mdns takes over. (Thanks John and Phil for reminding me of this). This shows up as resolution via host or dig working fine, as they make calls direct to your nameservers, but commands like ping failing, as it uses the NSS to do the lookup.

A quick bit of googling, and I found this gem on Apple’s website, and also this one on www.multicastdns.org. Apple’s suggested fix didn’t seem to work, but I suspect a reboot is required. I’ve applied the second one, and rebooted, and one of them is definitely working.

As an aside, this started with me wishing that it was possible to do per-domain resolver configuration. I initially gave up and set up dnsmasq which forward on requests to specific domains to specific servers, but then hit the mdns issue. This method looks very much like a per-domain resolver configuration however - it’s saying to use my local DNS server for .local lookups. I haven’t tested it, but it looks like it should support setting an arbitrary resolver for an arbitrary domain.

Why I love dmidecode

August 7, 2008 – 1:03 pm

I was asked to provide more ram for a server today, specified only by name. I have login details, but it’s in a datacentre in Auckland, and I’m in Hamilton, so I can’t wander over to check details.

Enter dmidecode:


System Information
Manufacturer: Dell Computer Corporation
Product Name: PowerEdge 860

That’s basically all I need right there. Having a namebrand machine helps, of course - getting the same sort of information from a generic motherboard isn’t as easy or useful. However, while checking which ram banks are populated I can also (typically) get the type of ram as well:

Handle 0×1100, DMI type 17, 27 bytes
Memory Device
Array Handle: 0×1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 1024 MB
Form Factor: DIMM
Set: 1
Locator: DIMM1_A
Bank Locator: Not Specified
Type: DDR2
Type Detail: Synchronous
Speed: 533 MHz (1.9 ns)
Manufacturer: 7F7F7F0B00000000
Serial Number: 7A947291
Asset Tag: 0D0718
Part Number: NT1GT72U8PB0BY-37B

In this case, the other “Memory Device” entries had “No module installed” in the Size: section, so I know that this machine has one (1) 1GB DDR2-533 DIMM installed.

Of course, that output doesn’t seem to tell me that the Dell PowerEdge 860 wants ECC ram (although I know that anyway). And the output from dmidecode on a newer machine:

Handle 0×1100, DMI type 17, 23 bytes.
Memory Device
Array Handle: 0×1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor:
Set: 1
Locator: DIMM 1A
Bank Locator: Not Specified
Type:

Type Detail: Synchronous
Speed: 667 MHz (1.5 ns)

That’s from a brand new HP DL360 with FB-DIMMs, so I guess my version of dmidecode on this machine isn’t new enough to handle that.

In general though, it’s more than good enough :)

Further bugs in switches

August 7, 2008 – 12:48 am

Further to my previous post on the topic of uncovering bugs in switches with network monitoring systems, I have two more bits of news:

The first is that the vendor has given me a new firmware for one of the switch models which fixes the bug. Apparently it was a known but undocumented bug. Still waiting on a firmware for the other switches.

The second is that I have found another bug in the switches - this time with HTTPS. It’s not triggering during autodiscovery this time, and the bug takes a bit longer to manifest (2-4 weeks, it seems), so it’s slightly harder to track down. I’ve got to set up a test rig to hammer some of the switches with HTTPS connection attempts and see what shakes loose.

OpenNMS and buggy switches

July 19, 2008 – 12:40 pm

One of my evening projects has been setting up OpenNMS to monitor a network primarily comprised of VENDORNAME switches. OpenNMS is being put in to replace a bundle of Nagios, Cacti, Smokeping, and Groundwork Fruity for Nagios configuration management. The existing system worked well enough, but the lack of autodiscovery of services/nodes along with the poor integration between cacti and nagios was getting a bit annoying.

After setting up and trialling OpenNMS for a bit, we deployed it on this network. And then the switches started failing. They’d still switch packets, and I believe still responded to SNMP, but you couldn’t connect to them via any of the management interfaces.

So, we started looking at the differences between OpenNMS and Nagios/Cacti/Smokeping. Both do SNMP and ICMP queries, and some TCP port availability checks. The combined stack actually does more SNMP traffic because both Cacti and Nagios ended up querying the same OIDs. I’ve often noticed that Cacti sends individual requests for OIDs however, rather than grouping them, whereas OpenNMS defaults to requesting 10 OIDs per PDU. I changed this in the configuration (and later on changed it for real, as it was being set in a different config file as well), and let OpenNMS run against some test switches… and they locked up.

Perry suggested that it could be a memory leak due to the service polling, and set up a test where he polled the SSH server once a minute forever. This test got cancelled after 4 days or so, but the machines hadn’t died at that point, so we decided it wasn’t anything fundamental about the service checks.

I set up a range of services that were being monitored on 10 switches, and let them go for a bit. Due to power outages and equipment moves this step ended up taking longer than it needed to, but the end result was that no matter which services were being monitored, all the switches all locked up at around the same point.

And then I noticed that the switches had a growing number of stale “telnet-d” connections. These switches have capacity for up to 4 concurrent administrative logins - once all 4 slots are full, you can no longer log in. So, the theory is these stale connections were blocking real connections, and never timing out, thus causing the lockout of the management stack. They don’t time out, and you can’t kill them from the switch console short of rebooting the switch. Most of the switches weren’t being actively monitered for telnet, but OpenNMS does do service discovery periodically (I think once a day, and perhaps under other situations too), and this would probe each service. So I firewalled telnet out, and had the switches restarted, thinking this would solve it.

The switches still locked up.

The switches still had stale telnet connections appearing in them.

I turned off the telnet service on each switch, thinking that perhaps there was something else on the network that was talking to them, and restarted them.

Within 5 minutes of rebooting each switch, there was a stale telnet connection listed. Awesome.

So, we’re down to a service that is being misreported as a telnet service. I go through all of them, and discover that none of the other services - FTP, HTTP, HTTPS - even show up as an active session. Which leaves telnet - firewalled out - and SSH.

The OpenNMS plugin which handles discovery of SSH servers is a bit smarter than a basic “is a service listening on port 22″ sort of discovery - it waits for the SSH banner from the server, then sends it’s own SSH banner back, and verifies that it gets a response back. This means it’s getting part way through the SSH establishment, and then canning the connection.

As a quick test, I telnetted to port 22 on a switch and checked the login listing. With the banner is being displayed, nothing even shows up. When I pasted a valid looking SSH banner back, I got a bunch of binary data echoed into my telnet session, and ssh session to the switch locks up. On reconnecting and checking the login listing, sure enough - a stale telnet session was there.

Further tests reveal that if you ssh to one of these switches, but don’t type your password in, the session gets reported as a telnet session. Furthermore, if you kill your ssh process or shell window while the ssh session is waiting for your password, the session never disappears.

So, we have a very live DOS exploit against VENDORNAME switches here, assuming anyone is unwise enough to allow SSH access from random networks and VLANs to their switches that is. I have to point out that while it’s a particular “feature” of OpenNMS that triggered this problem for us, this isn’t a bug in OpenNMS at all, given that it’s trivial to trigger the same problems with the switches directly.

In regards to the actual problem at hand, OpenNMS is quite configurable, so at least I can change the way it does SSH service discovery to revert to a simple “is the port up” check. I’ve left this running for nearly two weeks now, and the switches on my test bed are all still behaving properly.

I held back from posting this until I could get a response from the vendor. They’ve acknowledged the bug, and a fix will be out in the next firmware release apparently. I might update once they have released a new firmware; I’ve edited out the vendor name from this post because I don’t believe it’s responsible to publish denial-of-service vulnerabilities without giving the vendor a chance to fix them.

I also noticed this post on the OpenNMS blog. The author there had similar problems with monitoring a firewall device, and while the scenario seems different, VENDORNAME makes firewalls as well as switches; I wonder if it’s the same vendor in his case.

Weird autoblogs

July 4, 2008 – 12:50 pm

I just got a pingback on my earlier post, which was from a blog that indexes posts and articles with a particular word - acceptance - in it. Kind of an odd premise for a blog.

UPDATE As per the comment, the author/owner of More Lyrics updated his blog to remove the quote. My original comment was tongue in cheek, but it’s only fair to remove it I think :)

Citrix on Xen

July 4, 2008 – 10:54 am

It seems that the original subject of my post yesterday caught the eyes of much of the virtualisation community, including Simon Crosby, formerly from Xensource, and now working for Citrix.

He’s written a typically well thought out response, which covers off a lot of points:

  • HP have a multi-hypervisor management tool already which signs off on Xenserver, VMWare and Hyper-V support
  • Xenserver Platinum, which is comprised of Xenserver Enterprise and Citrix Provisioning Server, can already provision VMs to not only physical hardware and Xenserver, but to other hypervisors as well
  • He covered off again the ecosystem building around the Xenserver product range, specifically in HA areas - products like Marathon Everrun and Stratus Avance.

He also wrote up a good bit on the position of Xen with regards to KVM. I haven’t really looked into KVM much, due to not ready access to test hardware with VT capable chips (the test hardware I do have is tied up with testing Xenserver), but I’ve always been wary of various claims that it’s a better VM stack than Xen is. (That might just be because I’ve not spent the time looking into it, and it might be because of the general not-invented-here feeling the “linux kernel” community seems to have about Xen. Again, not something I’ve spent a lot of time on). A lot of the stuff Simon writes is high level and enthusiastic of course, but it paints a clear picture - Xen already has massive uptake in mindset, and not just with traditional linux vendors either (Sun xVM and Oracle VM having Xen based stacks as well). I guess the jury might still be out on which technology actually is technically superior, but as history demonstrates, it’s not always the technically superior technology that lasts.

Simon also claims that Xen will be in the BIOS hypervisor offering from Phoenix, which is something I haven’t heard before. It certainly makes some amount of sense for Phoenix to not rewrite an entire hypervisor stack and then stick it some place that’s inherently difficult to upgrade - your BIOS, but I’m not sure how it works out regarding Xen’s requirement for a privileged Xen-aware guest to provide hardware drivers.

Simon also makes another point that I must have heard before from him, because it’s stuck with me and I agree entirely with the premise:

The founding thesis of XenSource, and the continued strategy at Citrix, is to promote fast, free, compatible and ubiquitous hypervisor based virtualization. If the hypervisor is free, why worry about who delivers it? Let the customer pick the implementation method that they want - the real money is in the up-sell with products that make virtualization valuable for customers.

Whether you like it that companies are in this to make money or not, this approach seems a good one. Piggybacking their moneymaking on an opensource product, an action which drives development, acceptance and that horrible word “mindshare”, doesn’t have to be a bad thing.

Citrix Xenserver: Xen or Hyper-V? Does it matter?

July 2, 2008 – 3:40 pm

Seems there’s a bit of debate at the moment about the future of Xen within Citrix’s product range, all sparked by this article by Brian Madden, which he clarified later on.

Brian’s followup clarifies his point:

When I say that Citrix will drop Xen, I mean that Citrix will drop the open source Xen hypervisor. I do not believe that Citrix will drop their XenServer product.

When you consider that Citrix Xenserver is a hypervisor based virtualisation stack (Xen on CentOS), and a virtualisation management tool (XenCenter), then sure, it’s possible for Citrix to change XenCenter so that it manages Windows Hyper-V instead. Xenserver, the product and brand, becomes a Windows 2008 Hyper-V install, and XenCenter manages that instead. It’s possible. Scott’s comments about porting Xen to windows missed the mark - Citrix only need to port the management stack and change the virtualisation layer to windows. RedHat are in the process of doing something similar with their recent move away from Xen to KVM. It’s not as radical a shift as from Xen to Hyper-V, but it’s as radical as you need to be - it’s a completely different virtualisation stack.

I’m still not sure I agree with Brian though. Citrix just dropped $550M on purchasing Xensource, and then promptly rebranded their flagship product to match. Granted, Citrix have a great track record for rebranding every couple of years, but it seems like a colossal waste of money given that Hyper-V, while not released at the time, was defintely public knowledge.

Citrix also have no need to drop the Xen out from under Xenserver. Citrix Workflow Studio already handles some automation tasks for both Xenserver and Hyper-V, and it’s no stretch to see this working on VMWare systems as well. Moreover, XenCenter itself could be modified to manage both Xen-based Xenserver systems as well as Windows Hyper-V systems. The reverse will definitely happen from Microsoft’s point of view - integration with XenServer in Microsoft’s Systems Center Operations Manager has been talked about for months now.

One prediction that is worth making is that cross-VM management stacks will flourish and improve. The example of Hyper-V and Xenserver was mentioned earlier, but they will grow to cover other assorted Xen based stacks from Virtual Iron, Novell, Sun etc, KVM stacks like RedHat, and of course VMWare. Citrix Workflow Studio makes a start in some ways, and products like VMLogix’s Lab Manager. Enomalism is already much of the way there, and goes a step beyond into cloud computing. The hypervisor (or at least, some kind of virtualisation) will be ubiquitous, and the winners will be the management stacks.

OSS Network Imaging / Install services

June 22, 2008 – 11:30 am

I’m very interested in the topic of network deployments of operating systems, specifically the various Microsoft OSs, as I can already install linux via PXEboot. There’s two main groups of software in this field - unattended or scripted installs, and imaged installs.

A while ago I found a tool called Unattended, which is a network based unattended installation tool for Windows. If it works, it looks very promising. It’s basically a DOS boot disk which mounts a network share and executes the windows installer. Simplicity. The basic install seems to require you to enter a number of responses to questions (such as administrator password, timezone and Microsoft product key), but the documentation explains how to customise the script to meet your business needs, including examples. Once the OS install is done, Unattended can be configured to install third party packages, as long as the packages (eg, MSI bundles) also support some level of unattended installation procedure.

Today I discovered Free Online Ghost, or FOG. FOG is network based computer imaging tool, designed to both read images from, and write images to hosts on your network. I’ve used tools like partimage in the past for exactly this purpose - creating a golden image of a lab machine and then reimaging the entire lab every couple of months to keep everything clean. FOG seems to be more polished than partimage does, as it claims to support things like creating AD accounts for the machine and so on.

The Unattended documentation includes a concise explanation of why the approach adopted by FOG, partimage, and commercial tools like Acronis and Ghost is bad, however I think this is really a case of using the right tool for the job. I can see a system like FOG being used with great success in a lab environment, or for periodic backup of individual host OSes to near-line storage, providing bare-metal restore functionality without requiring major investment in tape backup expansion. And Unattended makes a lot more sense for initial deployments, especially for my workplace, as we use such a wide range of hardware that an imaged install would be fairly problematic.

There are other commercial systems for doing these deployments of course - IBM Director, HP ICE, Citrix Provisioning Server are just a few of them, but these systems invariably make more sense for in-house deployment control.

Using monit for system and process monitoring

June 22, 2008 – 10:50 am

One of the servers I maintain is the jabber server at jabber.meta.net.nz. This is a free public service, anyone can use it, and it does get quite a wide range of use - for a long time we seemed to be very popular for south american users, possibly because of the web based clients and the range of transports to other protocols we support. We typically see between 50 and 100 concurrent users, depending on time of day and week, but the active account base is normally in the low thousands.

The transports themselves cause me a lot of problems. In the past they’ve been downright buggy, crashing all the time, but with the current codebase for all four protocols in use (AIM, ICQ, MSN and Yahoo) all being in python, we don’t seem to have as many outright crashes. We do have slow memory leaks however, which prompted me to move the services to a new server a while back. Part of me was hoping that the memory leaks were caused by the gentoo system I was using initially, but this doesn’t seem to be the case.

So, I needed to either fix these memory leaks, or to work around them. Enter monit. I’ve heard about monit quite a bit, but never really looked into it other than thinking it might be interesting. I really wish I’d looked further ages ago. It’s easy to set up, is designed specifically to monitor and restart services, and it solved my memory leak problem in about 5 minutes.

Here’s a snippet from the config file:

check process aim-transport with pidfile /var/jabberd/pid/aim-transport.pid
start program = “/etc/init.d/aim-transport start”
stop program = “/etc/init.d/aim-transport stop”
if cpu > 60% for 2 cycles then alert
if cpu > 80% for 5 cycles then restart
if totalmem > 300.0 MB for 5 cycles then restart
group transport

This is pretty self explanatory really. If CPU usage of this process gets too high, alert, then restart if it stays high for 5 cycles. And if the ram usage is over 300 MB for 5 cycles (a cycle is 2 minutes by default), restart the process. Problem solved. Or rather, the symptoms are solved, but that’s good enough for me at this stage

NoteThis is old, but somehow didn’t get posted

HP AiO iSCSI and Citrix Xenserver

June 10, 2008 – 7:00 pm

A couple of our clients have HP AiO1200 iSCSI systems. These are nice enough units, especially for entry level iSCSI SANs. They’re in a slightly modified HP DL320s chassis, and run Windows 2003 Storage Server, as well as some custom built HP management tools.

I’ve never had an easy run when dealing with their iSCSI target and the open-scsi stack used by Citrix Xenserver. The first problem I had was that the management tools don’t support multiple initiators connecting to the same target LUN. They don’t actually stop you, but it seems that you need to hold your tongue just right to allow multiple initiators connecting to the one target. If you don’t hold it just right, the admin tools will let you do it, but it won’t actually work.[1]

The second problem I had is that Xenserver just refuses to connect, saying “Your Target is probably misconfigured”. There’s not really a lot of configuration you can do with an iSCSI target, so I’m perplexed here. Digging deeper:

# iscsiadm -m discovery -t st -p 1.2.3.4
1.2.3.4:3260,1 iqn.1991-05.com.microsoft:storage-iqn.xen-osdata-target

It seems that iscsiadm can see everything fine!. I tried adding the target via the cli:

# xe sr-create host-uuid=f3b260ab-f8b9-4b52-980d-7b7e93ab8dcf content-type=user name-label=AIO_OSDATA shared=true type=lvmoiscsi device-config-targetIQN=iqn.1991-05.com.microsoft:storage-iqn.xen-osdata-target device-config-target=1.2.3.4

Error code: SR_BACKEND_FAILURE_107
Error parameters: , The SCSIid parameter is missing or incorrect, \
< ?xml version="1.0" ?>
iscsi-target
LUN
vendor
HP
/vendor
LUNid
0
/LUNid
size
42949672960
/size
SCSIid
360003ff646c289389ea2e31c1d419930
/SCSIid
/LUN
/iscsi-target

Unlike the error messages you get in the GUI, that one is quite helpful[2] It tells us we’re missing a SCSIid parameter, and then lists a SCSIid parameter to try:

xe sr-create host-uuid=f3b260ab-f8b9-4b52-980d-7b7e93ab8dcf content-type=user name-label=AIO_OSDATA shared=true type=lvmoiscsi device-config-targetIQN=iqn.1991-05.com.microsoft:storage-iqn.xen-osdata-target device-config-target=1.2.3.4 device-config-LUNid=1 device-config-SCSIid=360003ff646c289389ea2e31c1d419930

And our iSCSI target was happily added. I’m not entirely sure the LUNid parameter is required, this post suggests it isn’t. I found a couple of other posts on the forums which suggest that using the CLI for these tasks should be your first attempt.

[1] Where “work” means “actually let you connect more than one initiator at a time”
[2] Although, being full of less than and greater than signs, doesn’t want to display nicely in wordpress. So it’s a bit sanitised