Monday, June 28, 2010

Monitoring and the Art of Sleeping Through the Night

One of my goals at Lolapps is to be as bored as possible. Not by ignoring site issues, but by creating automation that does my job for me. You've seen my earlier post on the load balancers we use so that we aren't rushing to fix our webservers, but what about getting the system to repair the webservers for us?

Here's where Nagios comes in.

Nagios is a system that lets you monitor and alert on site-related issues. It's highly flexible: you can write the checks for the health of your site in any language. For the purposes of this article, though, we're going to talk about a powerful feature that probably doesn't get as much use as it should: support for event handlers.

So, what exactly is an event handler? An event handler is a command that gets run whenever the state of a service changes. That means a switch between any of the states OK, WARNING, CRITICAL, and UNKNOWN, as well as a change in substate. By substate, I mean the SOFT and HARD state types, plus each increment of the check attempt during one of the problem states.
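To make those states concrete, here is a minimal sketch of the decision an event handler has to make. The function name `handler_action` and the threshold of 3 attempts are assumptions chosen for illustration, not anything Nagios defines; the three arguments stand in for the values Nagios passes through the $SERVICESTATE$, $SERVICESTATETYPE$, and $SERVICEATTEMPT$ macros.

```shell
#!/bin/sh

# Hypothetical helper: prints the action to take for a given
# service state, state type, and check attempt.
handler_action() {
    state="$1"; state_type="$2"; attempt="$3"

    case "$state" in
        WARNING|CRITICAL)
            if [ "$state_type" = "HARD" ]; then
                # The problem has been confirmed by repeated checks
                echo "restart"
            elif [ "$attempt" -ge 3 ]; then
                # Still SOFT, but it has failed enough times to act early
                echo "restart"
            else
                # Early SOFT attempts: give the service a chance to recover
                echo "wait"
            fi
            ;;
        *)
            # OK, UNKNOWN, or anything unexpected: do nothing
            echo "none"
            ;;
    esac
}

handler_action WARNING SOFT 1    # prints "wait"
handler_action CRITICAL SOFT 3   # prints "restart"
handler_action OK HARD 1         # prints "none"
```

Separating the decision from the action like this is only one way to structure a handler, but it makes the rule easy to check by hand before it is allowed to touch a real service.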

While this does add complexity to the options, it also gives you the ability to fine-tune when your response commands get run. Let's look at an example of an event handler script:


#!/bin/sh

# define nagios command as:
# restart_service.sh $HOSTNAME$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$

case "$2" in
    WARNING)
        # Service is in the WARNING state.
        # We only want to take action if it's the 20th attempt.
        case "$4" in
            20)
                # Stop the service on the remote host, wait, then start it again
                ssh "$1" "/etc/init.d/$5 stop ; sleep 2 ; /etc/init.d/$5 start"
                ;;
        esac
        ;;
esac

As you can see, this script takes 5 arguments. They are:
HOSTNAME - the hostname where the service is running
SERVICESTATE - the current state, one of OK, WARNING, CRITICAL, or UNKNOWN
SERVICESTATETYPE - the state type, either SOFT or HARD
SERVICEATTEMPT - which check attempt we are on
ARG1 - the name of the Linux init.d service that we want to restart
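Before wiring the script into Nagios, it can help to exercise the same logic by hand. What follows is a sketch under assumptions: the `DRY_RUN` variable, the `restart_service` function name, and the host `web01.example.com` are made up for illustration. With `DRY_RUN=1` the restart command is printed instead of being sent over ssh.

```shell
#!/bin/sh

# Hypothetical dry-run wrapper around the handler logic above.
# Arguments mirror the Nagios macros:
#   $1 host, $2 service state, $3 state type, $4 attempt, $5 init.d service
restart_service() {
    host="$1"; state="$2"; attempt="$4"; service="$5"

    # Same rule as the handler: act only on the 20th WARNING attempt
    if [ "$state" = "WARNING" ] && [ "$attempt" = "20" ]; then
        cmd="/etc/init.d/$service stop ; sleep 2 ; /etc/init.d/$service start"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: show what would happen without touching the host
            echo "would run on $host: $cmd"
        else
            ssh "$host" "$cmd"
        fi
    fi
}

DRY_RUN=1
restart_service web01.example.com WARNING SOFT 20 httpd   # prints the command
restart_service web01.example.com WARNING SOFT 19 httpd   # prints nothing
```

Running it with a few different argument combinations is a cheap way to confirm that the handler only fires when you expect it to.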

Go ahead and set up the event handler command as suggested in the comments:

define command {
command_name restart_service
command_line $USER2$/restart_service.sh $HOSTNAME$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$
}

Then, all that's left is to hook it into your service definition:

define service {
... [your service definition goes here]
event_handler restart_service!httpd
}

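Done! You now have Nagios automatically restarting the httpd service once it has stayed in the WARNING state for 20 check attempts in a row. Keep in mind that the attempt counter can only reach 20 if max_check_attempts for the service is set to at least 20. Admittedly, this example is far from complete and requires many more pieces, but I'll leave that as an exercise for the reader.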

For more information on Nagios, visit their website (http://www.nagios.org).
