Categories
linux MetaNET

OpenNMS and buggy switches

One of my evening projects has been setting up OpenNMS to monitor a network made up primarily of VENDORNAME switches. OpenNMS is being put in to replace a bundle of Nagios, Cacti, Smokeping, and Groundwork Fruity for Nagios configuration management. The existing system worked well enough, but the lack of autodiscovery of nodes and services, along with the poor integration between Cacti and Nagios, was getting a bit annoying.

After setting up and trialling OpenNMS for a bit, we deployed it on this network. And then the switches started failing. They’d still switch packets, and I believe they still responded to SNMP, but you couldn’t connect to them via any of the management interfaces.

So, we started looking at the differences between OpenNMS and Nagios/Cacti/Smokeping. Both do SNMP and ICMP queries, and some TCP port availability checks. The combined stack actually generates more SNMP traffic, because both Cacti and Nagios ended up querying the same OIDs. I have noticed, however, that Cacti sends an individual request for each OID rather than grouping them, whereas OpenNMS defaults to requesting 10 OIDs per PDU. I changed this in the configuration (and later changed it for real, as the same setting was also being applied in a different config file), and let OpenNMS run against some test switches… and they locked up.
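As an aside, the difference between the two polling styles is easy to picture with a quick sketch. This is illustrative only – pysnmp with a placeholder host, community string and OIDs, not anything from our actual setup; in OpenNMS itself this is a setting in the SNMP configuration (which, as noted above, turned out to live in more than one file).

```python
# Illustrative only: pysnmp, with placeholder host/community/OIDs.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

HOST, COMMUNITY = "switch.example.net", "public"
OIDS = [
    "1.3.6.1.2.1.1.3.0",        # sysUpTime.0
    "1.3.6.1.2.1.2.2.1.10.1",   # ifInOctets.1
    "1.3.6.1.2.1.2.2.1.16.1",   # ifOutOctets.1
]

def snmp_get(*oids):
    """Send a single GET request (one PDU) containing all supplied OIDs."""
    err_ind, err_status, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(COMMUNITY),
        UdpTransportTarget((HOST, 161)), ContextData(),
        *[ObjectType(ObjectIdentity(oid)) for oid in oids]))
    if err_ind or err_status:
        raise RuntimeError(err_ind or err_status.prettyPrint())
    return var_binds

# Cacti-style polling: one request (one PDU) per OID.
for oid in OIDS:
    snmp_get(oid)

# OpenNMS-style default: several OIDs grouped into a single PDU.
snmp_get(*OIDS)
```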

Perry suggested that it could be a memory leak due to the service polling, and set up a test where he polled the SSH server once a minute forever. This test got cancelled after 4 days or so, but the machines hadn’t died at that point, so we decided it wasn’t anything fundamental about the service checks.

I set up a range of services that were being monitored on 10 switches, and let them go for a bit. Due to power outages and equipment moves this step ended up taking longer than it needed to, but the end result was that no matter which services were being monitored, all the switches locked up at around the same point.

And then I noticed that the switches had a growing number of stale “telnet-d” connections. These switches have capacity for up to 4 concurrent administrative logins – once all 4 slots are full, you can no longer log in. So the theory was that these stale connections were filling up the slots and never timing out, which explained the lockout of the management stack. You can’t kill them from the switch console short of rebooting the switch. Most of the switches weren’t being actively monitored for telnet, but OpenNMS does do service discovery periodically (I think once a day, and perhaps in other situations too), and this would probe each service. So I firewalled telnet out, and had the switches restarted, thinking this would solve it.

The switches still locked up.

The switches still had stale telnet connections appearing in them.

I turned off the telnet service on each switch, thinking that perhaps there was something else on the network that was talking to them, and restarted them.

Within 5 minutes of rebooting each switch, there was a stale telnet connection listed. Awesome.

So, we were down to a service that was being misreported as a telnet session. I went through all of them, and discovered that none of the other monitored services – FTP, HTTP, HTTPS – even showed up as an active session. Which left telnet – firewalled out – and SSH.

The OpenNMS plugin which handles discovery of SSH servers is a bit smarter than a basic “is a service listening on port 22” sort of discovery – it waits for the SSH banner from the server, then sends its own SSH banner back, and verifies that it gets a response. This means it gets partway through establishing the SSH connection, and then drops it.
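To make that concrete, here’s a minimal sketch of that style of check – plain Python sockets rather than the actual OpenNMS plugin (which is Java), and the client banner string here is made up:

```python
import socket

def ssh_banner_check(host, port=22, timeout=5.0):
    """Banner-exchange check: read the server's SSH identification string,
    send a valid-looking one of our own, and confirm the server answers.
    The connection is then dropped without ever completing key exchange,
    which is the half-open state that upset the switches."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        banner = sock.recv(256)                    # e.g. b"SSH-2.0-..."
        if not banner.startswith(b"SSH-"):
            return False
        sock.sendall(b"SSH-2.0-ServiceProbe\r\n")  # made-up client banner
        return len(sock.recv(256)) > 0             # server starts key exchange
```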

As a quick test, I telnetted to port 22 on a switch and checked the login listing. With just the server’s banner displayed, nothing showed up. When I pasted a valid-looking SSH banner back, I got a bunch of binary data echoed into my telnet session, and my session to the switch locked up. On reconnecting and checking the login listing, sure enough – a stale telnet session was there.

Further tests revealed that if you ssh to one of these switches but don’t type your password in, the session gets reported as a telnet session. Furthermore, if you kill your ssh process or shell window while the session is waiting for your password, the session never disappears.

So we have a very live DoS exploit against VENDORNAME switches here – assuming, that is, that anyone is unwise enough to allow SSH access to their switches from random networks and VLANs. I have to point out that while it was a particular “feature” of OpenNMS that triggered this problem for us, this isn’t a bug in OpenNMS at all, given that it’s trivial to trigger the same problem against the switches directly.

As for the actual problem at hand, OpenNMS is quite configurable, so at least I can change the way it does SSH service discovery and revert to a simple “is the port up” check. I’ve left this running for nearly two weeks now, and the switches on my test bed are all still behaving properly.
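For comparison, the port-up style of check completes the TCP handshake and closes without sending anything that looks like the start of an SSH session. Again, just an illustrative sketch, not OpenNMS’s own code:

```python
import socket

def port_is_open(host, port=22, timeout=5.0):
    """Simple "is the port up" check: if the TCP connection succeeds,
    report the service as present; no SSH banner is ever exchanged."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```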

I held back from posting this until I could get a response from the vendor. They’ve acknowledged the bug, and apparently a fix will be out in the next firmware release. I might update once they have released new firmware; I’ve edited out the vendor name from this post because I don’t believe it’s responsible to publish denial-of-service vulnerabilities without giving the vendor a chance to fix them.

I also noticed this post on the OpenNMS blog. The author there had similar problems with monitoring a firewall device, and while the scenario seems different, VENDORNAME makes firewalls as well as switches; I wonder if it’s the same vendor in his case.

Categories
linux

Weird autoblogs

I just got a pingback on my earlier post, which was from a blog that indexes posts and articles with a particular word – acceptance – in them. Kind of an odd premise for a blog.

UPDATE: As per the comment, the author/owner of More Lyrics updated his blog to remove the quote. My original comment was tongue-in-cheek, but it’s only fair to remove it, I think :)

Categories
linux

Citrix on Xen

It seems that the original subject of my post yesterday caught the eye of much of the virtualisation community, including Simon Crosby, formerly of XenSource and now working for Citrix.

He’s written a typically well-thought-out response, which covers off a lot of points:

  • HP already have a multi-hypervisor management tool which signs off on Xenserver, VMWare and Hyper-V support
  • Xenserver Platinum, which comprises Xenserver Enterprise and Citrix Provisioning Server, can already provision VMs not only to physical hardware and Xenserver, but to other hypervisors as well
  • He again covered the ecosystem building up around the Xenserver product range, specifically in HA areas – products like Marathon Everrun and Stratus Avance.

He also wrote up a good bit on the position of Xen with regards to KVM. I haven’t really looked into KVM much, due to not having ready access to test hardware with VT-capable chips (the test hardware I do have is tied up with testing Xenserver), but I’ve always been wary of claims that it’s a better VM stack than Xen. (That might just be because I haven’t spent the time looking into it, and it might be because of the general not-invented-here feeling the “linux kernel” community seems to have about Xen. Again, not something I’ve spent a lot of time on.) A lot of what Simon writes is high level and enthusiastic of course, but it paints a clear picture – Xen already has massive uptake and mindshare, and not just with traditional linux vendors either (Sun xVM and Oracle VM are Xen-based stacks as well). I guess the jury might still be out on which technology is technically superior, but as history demonstrates, it’s not always the technically superior technology that lasts.

Simon also claims that Xen will be in the BIOS hypervisor offering from Phoenix, which is something I hadn’t heard before. It certainly makes some amount of sense for Phoenix not to rewrite an entire hypervisor stack and then stick it somewhere that’s inherently difficult to upgrade (your BIOS), but I’m not sure how that works out given Xen’s requirement for a privileged Xen-aware guest to provide hardware drivers.

Simon also makes another point that I must have heard before from him, because it’s stuck with me and I agree entirely with the premise:

The founding thesis of XenSource, and the continued strategy at Citrix, is to promote fast, free, compatible and ubiquitous hypervisor based virtualization. If the hypervisor is free, why worry about who delivers it? Let the customer pick the implementation method that they want – the real money is in the up-sell with products that make virtualization valuable for customers.

Whether or not you like it that companies are in this to make money, this approach seems a good one. Piggybacking their moneymaking on an open-source product – an approach which drives development, acceptance and that horrible word “mindshare” – doesn’t have to be a bad thing.

Categories
linux

Citrix Xenserver: Xen or Hyper-V? Does it matter?

Seems there’s a bit of debate at the moment about the future of Xen within Citrix’s product range, all sparked by this article by Brian Madden, which he clarified later on.

Brian’s followup clarifies his point:

When I say that Citrix will drop Xen, I mean that Citrix will drop the open source Xen hypervisor. I do not believe that Citrix will drop their XenServer product.

When you consider that Citrix Xenserver is a hypervisor-based virtualisation stack (Xen on CentOS) plus a virtualisation management tool (XenCenter), then sure, it’s possible for Citrix to change XenCenter so that it manages Windows Hyper-V instead. Xenserver, the product and brand, becomes a Windows 2008 Hyper-V install, and XenCenter manages that instead. It’s possible. Scott’s comments about porting Xen to Windows missed the mark – Citrix would only need to port the management stack and change the virtualisation layer to Windows. RedHat are in the process of doing something similar with their recent move away from Xen to KVM. It’s not as radical a shift as Xen to Hyper-V would be, but it’s as radical as you need to be – it’s a completely different virtualisation stack.

I’m still not sure I agree with Brian though. Citrix just dropped $550M on purchasing XenSource, and then promptly rebranded their flagship product to match. Granted, Citrix have a great track record for rebranding every couple of years, but it seems like a colossal waste of money given that Hyper-V, while not released at the time, was definitely public knowledge.

Citrix also have no need to drop the Xen out from under Xenserver. Citrix Workflow Studio already handles some automation tasks for both Xenserver and Hyper-V, and it’s no stretch to see this working on VMWare systems as well. Moreover, XenCenter itself could be modified to manage both Xen-based Xenserver systems and Windows Hyper-V systems. The reverse will definitely happen from Microsoft’s point of view – integration with XenServer in Microsoft’s System Center Operations Manager has been talked about for months now.

One prediction that is worth making is that cross-hypervisor management stacks will flourish and improve. The example of Hyper-V and Xenserver was mentioned earlier, but they will grow to cover the other assorted Xen-based stacks from Virtual Iron, Novell, Sun, etc., KVM stacks like RedHat’s, and of course VMWare. Citrix Workflow Studio makes a start in some ways, as do products like VMLogix’s Lab Manager. Enomalism is already much of the way there, and goes a step beyond into cloud computing. The hypervisor (or at least, some kind of virtualisation) will be ubiquitous, and the winners will be the management stacks.