Categories
linux MetaNET

OpenNMS and buggy switches

One of my evening projects has been setting up OpenNMS to monitor a network made up primarily of VENDORNAME switches. OpenNMS is being put in to replace a bundle of Nagios, Cacti, Smokeping, and Groundwork Fruity for Nagios configuration management. The existing system worked well enough, but the lack of autodiscovery of services and nodes, along with the poor integration between Cacti and Nagios, was getting a bit annoying.

After setting up and trialling OpenNMS for a bit, we deployed it on this network. And then the switches started failing. They’d still switch packets, and I believe still responded to SNMP, but you couldn’t connect to them via any of the management interfaces.

So, we started looking at the differences between OpenNMS and Nagios/Cacti/Smokeping. Both do SNMP and ICMP queries, and some TCP port availability checks. The combined stack actually generates more SNMP traffic, because both Cacti and Nagios ended up querying the same OIDs. However, I’ve often noticed that Cacti sends an individual request for each OID rather than grouping them, whereas OpenNMS defaults to requesting 10 OIDs per PDU. I changed this in the configuration (and later changed it again, as it was also being set in a second config file), and let OpenNMS run against some test switches… and they locked up.
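For reference, the setting in question is the max-OIDs-per-PDU limit, which in OpenNMS normally lives in snmp-config.xml (and can be overridden in the data collection config). The snippet below is a rough sketch with attribute names from memory, so verify them against your version:

[code]
<!-- snmp-config.xml: request one OID per PDU
     (attribute names from memory; verify against your OpenNMS version) -->
<snmp-config retry="3" timeout="3000" read-community="public" max-vars-per-pdu="1"/>
[/code]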

Perry suggested that it could be a memory leak due to the service polling, and set up a test where he polled the SSH server once a minute forever. This test got cancelled after 4 days or so, but the machines hadn’t died at that point, so we decided it wasn’t anything fundamental about the service checks.

I set up a range of services to be monitored across 10 switches, and let them go for a bit. Due to power outages and equipment moves this step ended up taking longer than it needed to, but the end result was that no matter which services were being monitored, all the switches locked up at around the same point.

And then I noticed that the switches had a growing number of stale “telnet-d” connections. These switches have capacity for up to 4 concurrent administrative logins – once all 4 slots are full, you can no longer log in. So, the theory is that these stale connections were blocking real connections and never timing out, thus locking out the management stack. They can’t be killed from the switch console short of rebooting the switch. Most of the switches weren’t being actively monitored for telnet, but OpenNMS does do service discovery periodically (I think once a day, and perhaps in other situations too), and this probes each service. So I firewalled telnet out, and had the switches restarted, thinking this would solve it.
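The firewalling itself was nothing fancy; something along these lines on the monitoring host does the job (the management subnet here is a placeholder):

[code]
# on the OpenNMS host: refuse outbound telnet to the switch management range
# (10.1.1.0/24 is a placeholder for the real management subnet)
iptables -A OUTPUT -p tcp --dport 23 -d 10.1.1.0/24 -j REJECT
[/code]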

The switches still locked up.

The switches still had stale telnet connections appearing in them.

I turned off the telnet service on each switch, thinking that perhaps there was something else on the network that was talking to them, and restarted them.

Within 5 minutes of rebooting each switch, there was a stale telnet connection listed. Awesome.

So, we’re down to a service that is being misreported as a telnet service. I go through all of them, and discover that none of the other services – FTP, HTTP, HTTPS – even show up as an active session. Which leaves telnet – firewalled out – and SSH.

The OpenNMS plugin which handles discovery of SSH servers is a bit smarter than a basic “is a service listening on port 22” sort of discovery – it waits for the SSH banner from the server, then sends its own SSH banner back, and verifies that it gets a response. This means it gets part way through the SSH session establishment, and then cans the connection.

As a quick test, I telnetted to port 22 on a switch and checked the login listing. While the switch’s banner is sitting there waiting for a response, nothing shows up. When I pasted a valid-looking SSH banner back, I got a bunch of binary data echoed into my telnet session, and the session to the switch locked up. On reconnecting and checking the login listing, sure enough – a stale telnet session was there.
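In transcript form the test looks roughly like this; the first SSH line is the switch’s banner (the string here is made up, since I’m not naming the vendor), and the second is the valid-looking client banner I pasted back. At that point the binary key-exchange data comes back, the session hangs, and a stale “telnet-d” login appears:

[code]
$ telnet switch-mgmt-ip 22
Trying switch-mgmt-ip...
Connected to switch-mgmt-ip.
SSH-2.0-VendorSSH_1.0
SSH-2.0-OpenSSH_4.3
[/code]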

Further tests reveal that if you ssh to one of these switches, but don’t type your password in, the session gets reported as a telnet session. Furthermore, if you kill your ssh process or shell window while the ssh session is waiting for your password, the session never disappears.
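Concretely, that second test is just the following (the username and host are placeholders):

[code]
# terminal 1: start an ssh session and leave it sitting at the password prompt
ssh admin@switch-mgmt-ip

# terminal 2: kill the client before the password is entered
pkill -9 -f 'ssh admin@switch-mgmt-ip'

# the switch's login listing now shows a stale "telnet-d" session that never times out
[/code]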

So, we have a very live DoS exploit against VENDORNAME switches here, assuming anyone is unwise enough to allow SSH access to their switches from random networks and VLANs, that is. I have to point out that while it’s a particular “feature” of OpenNMS that triggered this problem for us, this isn’t a bug in OpenNMS at all, given that it’s trivial to trigger the same problems with the switches directly.

As for the actual problem at hand, OpenNMS is quite configurable, so at least I can change the way it does SSH service discovery to revert to a simple “is the port up” check. I’ve left this running for nearly two weeks now, and the switches on my test bed are all still behaving properly.
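If you want to do the same, the place to look is capsd-configuration.xml. The snippet below is a sketch from memory (the plugin class name may differ between OpenNMS versions), swapping the SSH protocol plugin for the generic TCP port check:

[code]
<!-- capsd-configuration.xml: treat "SSH" discovery as a plain TCP port check
     (class name from memory; check the capsd plugins your OpenNMS version ships) -->
<protocol-plugin protocol="SSH" class-name="org.opennms.netmgt.capsd.plugins.TcpPlugin" scan="on">
    <property key="port" value="22"/>
    <property key="timeout" value="3000"/>
</protocol-plugin>
[/code]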

I held back from posting this until I could get a response from the vendor. They’ve acknowledged the bug, and a fix will be out in the next firmware release apparently. I might update once they have released a new firmware; I’ve edited out the vendor name from this post because I don’t believe it’s responsible to publish denial-of-service vulnerabilities without giving the vendor a chance to fix them.

I also noticed this post on the OpenNMS blog. The author there had similar problems with monitoring a firewall device, and while the scenario seems different, VENDORNAME makes firewalls as well as switches; I wonder if it’s the same vendor in his case.

Categories
jabber MetaNET

Moving from ejabberd’s internal database to postgres

During the initial import of data to the {{post id=”jabbermetanetnz-has-moved” text=”new jabber.meta.net.nz server”}}, I noticed that the roster information failed to migrate when using an ODBC backend – it would just fail entirely. I didn’t think too much of it until I tried to restart ejabberd, and noticed that it spent around 5 minutes initializing its Mnesia database. There are around 220,000 roster entries in the database, and that’s after I culled around 9000 accounts that hadn’t been used in more than 180 days.

So, I decided to have another look at moving the content to an SQL backend. I noticed that one of the many commands available from the ejabberdctl command-line tool is an export2odbc command. It didn’t seem to work at first, but it turns out I had the parameters slightly wrong – this post set me straight. So, I started importing all the tables, including the roster table, into my database.
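For anyone else trying this, the invocation is roughly of the shape below; the argument names and order are from memory, so check them against ejabberdctl’s own help output:

[code]
# roughly: export2odbc <virtual host> <output directory>
# (argument names and order from memory; check ejabberdctl's help on your version)
ejabberdctl export2odbc jabber.meta.net.nz /tmp/ejabberd-export
[/code]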

I then noticed that it appears to be doing one, sometimes two, deletes for every insert. I have an empty database, so there’s no real need for this, and there are 220,000 entries. During the 30 minutes or so I spent playing around with other ways to do this import, it had only gotten to around 115,000 entries! The export command had output the content on the smallest number of lines possible, so after a bit of sed and grep magic I had a file which contained only the relevant inserts. It’s still running slowly, but a lot quicker than before.
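The sed and grep needed is nothing clever; something along these lines splits the statements onto their own lines and keeps only the inserts (the file names are placeholders):

[code]
# split the wall-of-SQL into one statement per line, then keep only the INSERTs
# (file names here are placeholders)
sed 's/;/;\n/g' roster.sql | grep -i '^[[:space:]]*insert' > roster-inserts.sql
[/code]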

I know postgres’s dump functions tend to prefer COPY over INSERT because it is so much quicker, but that’s about where my familiarity with the COPY command ends, and I certainly didn’t want to spend the time mangling the SQL I have into a couple of COPY commands.
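For completeness, the COPY route would look something like the following; the database name and the assumption that the dump’s columns match ejabberd’s pg.sql schema are mine, so treat it as a sketch:

[code]
# feed a tab-separated dump straight into the rosterusers table via psql's \copy
# (database name is a placeholder; assumes the dump's columns match ejabberd's pg.sql schema)
psql ejabberd -c "\copy rosterusers FROM 'rosterusers.tsv'"
[/code]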

Categories
jabber MetaNET

Jabber.meta.net.nz has moved

I bit the bullet and moved jabber.meta.net.nz onto its new server last night. DNS records have changed; however, the TTL was set to 60 seconds a long time ago, so this shouldn’t be a problem. If you’re having problems connecting, please verify that you’re actually connecting to jabber.meta.net.nz and that whatever IP you are resolving it to matches the authoritative one – 219.88.251.36.

The new server is running ejabberd, and while this is much better software than the jabberd2 installation it replaces, it still has some quirks. The most noticeable are the godawful time it takes to start up when using internal authentication, and a bug when trying to use an ODBC backend which results in roster information being lost entirely. I’ll keep working on these two points, with hopefully only minor outages over the next week or so.

Other new features include an XMPP.net federated SSL certificate, rather than me buying an SSL cert specifically for jabber, and we’ve pulled jabber.meta.net.nz into the web 2.0 age with the addition of a jwchat web-chat client, which makes use of ejabberd’s http binding features.

Categories
linux MetaNET

Restarting linux software raid arrays

I have a server with two sets of RAID1 arrays, across two different pairs of disks. The first set of arrays, on the first pair of disks, always starts fine. The second set never starts on boot. I don’t reboot this server very often, so it’s not often a problem, but I figure it’s worth looking into.

For starters, if you have an existing software RAID array that is merely not running, you can assemble it with mdadm like so:

[code]
mdadm --assemble /dev/md8 /dev/sdc1 /dev/sdd1
[/code]

mdadm also has a config file which is referred to during the system boot process, in order to work out where your arrays are. In theory this is all supposed to be self-configuring, but I’ve seen enough cases where some arrays don’t start that using this config file looks like a recommended step.

You can generate a useful representation of your existing arrays with

[code]
mdadm --detail --scan
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=…
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=…
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=…
[/code]

You can use similar syntax, specifying the actual devices used in each array instead of the UUID, if you like. In theory, with this information appended to /etc/mdadm.conf, the next time I reboot this server it should bring all the RAID devices up for me.
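In other words, something like:

[code]
# append the scanned array definitions so they are assembled at boot
mdadm --detail --scan >> /etc/mdadm.conf
[/code]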

Categories
jabber MetaNET WLUG

Upgrading jabber.meta.net.nz

I’ve finally gotten sick of the jabber.meta.net.nz services having a negative impact on the rest of the server due to the transports chewing up lots of RAM, and so have put together another server and will start migrating things over. The new server is a dual 1GHz Coppermine P3, with 2 GB of RAM and 36 GB of SCSI RAID-1 disks.

This seems like a great opportunity to tidy things up again. The server stack has been a bit messy since we upgraded to jabberd2, partly because of the server codebase, and at least partly because of Gentoo. So, the new plan is to move jabber.meta.net.nz to ejabberd sometime soon. Along with the move will come a new web-based chat system (jwchat), and newer versions of the transports if they are available.

Ejabberd seems to be a pretty robust system. Jabber.org moved their system to it several years ago, and jabber.org.au made the switch more recently. I’ve seen some pretty impressive benchmarks regarding message latency, CPU load and RAM usage under high protocol load, and it beats jabberd 1.x and 2.x hands down. Not that jabber.meta.net.nz really notices, or compares with jabber.org or jabber.org.au in terms of volume. We currently have 112 clients connected, and someone noticed we had 203 connected a couple of weeks ago, but I have no idea what the high water mark for client connections is, much less any message throughput statistics.

I aim to have the following done in the next few weeks:

  • Install and test base system – ejabberd, transports
  • Install and test jwchat
  • Script the migration of registration information and rosters from the jabberd2 pgsql database to something ejabberd can read.
  • Update the website with better information
  • Set up transports to prevent OOM issues (ulimit/regular restarts)
  • Test migration of registry/rosters
  • Pick a day and cut over.

The trickiest bit will be the migration of roster info from pgsql to ejabberd’s format, but I’ve already done at least three quarters of that, as I considered doing this migration back in March.

Categories
MetaNET Tool of the Week WLUG

Restricting ssh password auth by group or shell

Matt Brown asked if I could think of any way to allow a certain group of users to scp into a host and use a password, while requiring a valid key pair for most other users. Perry suggested a solution to this a while ago, so I sat down and had a quick look at it, and got it working.

I configured sshd such that:

[code]

PasswordAuthentication no
ChallengeResponseAuthentication yes
UsePAM yes
[/code]

This bypasses direct /etc/passwd auth, but allows standard PAM-based auth via the ChallengeResponseAuthentication mechanism. This will allow everyone to log in with a password if possible, so we need to configure PAM. For this, I used the pam_listfile module, checking that the user has a particular shell (/usr/bin/scponly):

[code]

echo "/usr/bin/scponly" > /etc/scpshells

[/code]

I then edited /etc/pam.d/sshd:

[code]

auth required pam_env.so
auth required pam_listfile.so item=shell sense=allow file=/etc/scpshells onerr=fail
auth sufficient pam_unix.so likeauth nullok
auth required pam_deny.so
auth required pam_nologin.so

session required pam_limits.so
session required pam_unix.so

account required pam_unix.so

password required pam_cracklib.so difok=2 minlen=8 dcredit=2 ocredit=2 retry=3
password sufficient pam_unix.so nullok md5 shadow use_authtok
password required pam_deny.so

[/code]

I probably don’t need all of that in the sshd PAM snippet, but I just dumped the contents of the included files into it to make editing easier.

To test this I added /bin/bash to /etc/scpshells, and verified that I could ssh in using a password. I then removed it, and verified that I could no longer ssh in with a password. Combine this with a suitable shell (/usr/bin/scponly), and I can create users that can scp in with a password – or with a key if they care – but cannot get a local shell; all other users cannot authenticate via PAM, and so must provide a valid key.
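For completeness, creating one of these password-only scp users then looks something like this (the username is just an example):

[code]
# example user whose shell is scponly (and therefore listed in /etc/scpshells),
# so password auth is allowed but no interactive shell is available
useradd -m -s /usr/bin/scponly scpuser
passwd scpuser
[/code]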