One of the servers I maintain is the Jabber server at jabber.meta.net.nz. This is a free public service that anyone can use, and it does get quite a wide range of use. For a long time we seemed to be very popular with South American users, possibly because of the web-based clients and the range of transports to other protocols we support. We typically see between 50 and 100 concurrent users, depending on the time of day and week, but the active account base is normally in the low thousands.
The transports themselves cause me a lot of problems. In the past they’ve been downright buggy, crashing all the time, but now that the current codebases for all four protocols in use (AIM, ICQ, MSN and Yahoo) are written in Python, we don’t seem to have as many outright crashes. We do have slow memory leaks, however, which prompted me to move the services to a new server a while back. Part of me was hoping that the memory leaks were caused by the Gentoo system I was using initially, but this doesn’t seem to be the case.
So, I needed to either fix these memory leaks or work around them. Enter monit. I’d heard about monit quite a bit, but had never really looked into it beyond thinking it might be interesting. I really wish I’d looked further ages ago. It’s easy to set up, is designed specifically to monitor and restart services, and it solved my memory leak problem in about 5 minutes.
Here’s a snippet from the config file:
check process aim-transport with pidfile /var/jabberd/pid/aim-transport.pid
  start program = "/etc/init.d/aim-transport start"
  stop program = "/etc/init.d/aim-transport stop"
  if cpu > 60% for 2 cycles then alert
  if cpu > 80% for 5 cycles then restart
  if totalmem > 300.0 MB for 5 cycles then restart
This is pretty self-explanatory really. If CPU usage of this process stays above 60% for 2 cycles, monit sends an alert, and if it stays above 80% for 5 cycles, monit restarts the process. And if the RAM usage (totalmem counts the process plus its children) is over 300 MB for 5 cycles (a cycle is 2 minutes by default), monit restarts the process. Problem solved. Or rather, the symptoms are solved, but that’s good enough for me at this stage.
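For what it’s worth, the cycle length and the alert destination live in monit’s global settings rather than in the per-process check. Here’s a rough sketch of how that might look alongside a second transport. The mail server, the email address, and the msn-transport pidfile and init script paths are placeholders I’ve assumed for illustration, not values from the real config:

```
# Global settings (illustrative values, not from the real config)
set daemon 120                  # poll every 120 seconds, so one cycle = 2 minutes
set mailserver localhost        # assumed: a local MTA delivers the alerts
set alert admin@example.com     # hypothetical alert recipient

# Hypothetical second transport, following the same pattern as aim-transport
check process msn-transport with pidfile /var/jabberd/pid/msn-transport.pid
  start program = "/etc/init.d/msn-transport start"
  stop program = "/etc/init.d/msn-transport stop"
  if totalmem > 300.0 MB for 5 cycles then restart
```

With a 120 second cycle, "for 5 cycles" means the process has to stay over the limit for roughly 10 minutes before monit restarts it, so a short spike won’t trigger anything.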
Note: This is old, but somehow didn’t get posted.