Tuesday, December 1, 2009
Thesh17’s been running on a Linode VPS for a month or so now. It works great until it gets railed and Apache dies. Then a few hours later I notice the site’s down and log in to restart it. Originally I wrote a shell script that would just check whether httpd was running. If it wasn’t, it’d start it. It’d also check the output of ‘free -m’, and if memory usage (not counting buffers) was greater than 90%, it’d restart Apache.
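A watchdog along those lines might look roughly like the sketch below. It is not the original script: the function names, the use of pgrep to detect httpd, and the handling of both the older ‘free -m’ layout (with a "-/+ buffers/cache" line) and the newer one (with an "available" column) are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; names and detection method are assumptions.

# Read `free -m` output on stdin and print the percentage of RAM in use,
# not counting buffers/cache, as an integer.
mem_used_pct() {
    awk '
        $1 == "Mem:" { total = $2; used = $3; avail = $7 }
        # Older procps prints a separate line with buffers/cache excluded.
        $1 == "-/+"  { used = $3; legacy = 1 }
        END {
            if (total == 0) exit 1
            # Newer procps has no "-/+" line; column 7 is "available".
            if (!legacy && avail != "") used = total - avail
            printf "%d\n", used * 100 / total
        }'
}

# Start httpd if no process with that name exists; restart it if memory
# use is above 90%. Needs root to control the service.
check_httpd() {
    if ! pgrep -x httpd >/dev/null; then
        /sbin/service httpd start
        return
    fi
    pct=$(free -m | mem_used_pct) || return 1
    if [ "$pct" -gt 90 ]; then
        /sbin/service httpd restart
    fi
}
```

Calling check_httpd from a cron job every few minutes would give the behaviour described above. Note that it only looks at the process table and memory figures, nothing else.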
You’d think that’d be good, right? Wrong. I don’t know why, but occasionally things would get confused. The script would see httpd running, but Apache wasn’t serving any pages on port 80. ‘service httpd status’ would show it was stopped, but the script didn’t check HTTP connectivity or the actual pid file. I honestly didn’t troubleshoot this much, and just manually killed the processes and restarted httpd.
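The two checks the script skipped, the pid file and real HTTP connectivity, could be sketched like this. The function names are hypothetical, the pid file path and address are taken from the Monit config in this post, and curl is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical health checks; function names are assumptions.

# Succeed only if the pid file exists, holds a number, and that process
# is alive. `kill -0` sends no signal; it just tests for the process.
# Run as root, otherwise it fails for processes owned by other users.
pid_alive() {
    [ -r "$1" ] || return 1
    read -r pid < "$1" || return 1
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    kill -0 "$pid" 2>/dev/null
}

# Succeed only if the URL answers with a non-error HTTP status
# within five seconds.
http_ok() {
    curl -fsS -o /dev/null --max-time 5 "$1"
}

# Both conditions must hold for httpd to count as healthy.
httpd_healthy() {
    pid_alive /var/run/httpd.pid && http_ok http://127.0.0.1:80/
}
```

A watchdog could then restart Apache whenever httpd_healthy fails, instead of trusting the process list alone.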
I decided to search for some scripts similar to mine (and better), but instead ended up landing on a utility called Monit. It’s basically a service monitor you can run (locally or remotely), and depending on different conditions, it will restart the service and make sure it’s running. The condition can be as simple as the pid in the pid file not existing, or something more complex, like the process having used 90% of the CPU for the last 30 minutes.
As an example, here’s what my apache monit config looks like right now:
check process httpd with pidfile /var/run/httpd.pid
    group apache
    start program = "/sbin/service httpd start"
    stop program = "/sbin/service httpd stop;pkill -9 httpd"
    if failed host 127.0.0.1 port 80 protocol http then restart
    if cpu is greater than 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if children > 150 then restart
    if loadavg(5min) greater than 10 for 8 cycles then alert
    if memory > 90% for 2 cycles then restart
    if 5 restarts within 5 cycles then timeout
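Monit is pretty simple to install. It’s available in most distros’ repositories. The command below is for Gentoo (emerge is Portage’s package tool); on CentOS it would be a yum install instead, assuming a repository that carries monit is enabled:
root@server:# emerge -av monit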
I’m going to give Monit a try for a few weeks and see how it works out. I’ve been using Nagios lately as well, but it seems kind of cumbersome to me and is mainly used for monitoring/reporting. Monit will actually restart a service if it detects a problem, which *usually* is a good thing.