
Sunday, November 07, 2010

Installing nrpe on OS X

This is a guide to installing the nrpe plugin service on OS X. For help getting this to work, I am indebted to the following:

If you don't know what nrpe is, you probably don't want it. What nrpe does is allow your nagios server to contact a computer over the network and query its state, using scripts to check on things such as cpu load, drive space, and pretty much anything else you can script together. I'm walking through the steps of this process to provide a reasonable accounting of what I did, but I've also scripted pretty much all of it, and you can download the whole schmear. There's a link to the script at the bottom of this article.

Create a user for the nrpe service to use

The nrpe service, like everything on the Mac, has to run under a user account, and it really shouldn't run under your account, since we don't want it to be able to reach out and hurt your files. You can create the account either via the GUI using System Preferences, or via the command line. I use the latter, but we'll cover the former as well, since I'm not entirely sure of my kung fu creating accounts. I also deviate somewhat from standard unix practice--normally, you'd run nrpe under an account named nagios, but in this case I'm going to create a general account used for some housekeeping tasks in addition to running the nrpe service. Think of it as a maid or butler, with limited power, to whom you can assign household tasks.

Creating the account in the GUI

In the GUI, create the account, and give it a password. Do not make it an administrator. For the purposes of this doc, it will be called susi.

Creating the account via command line

I found a good script online for creating accounts, and modified that, but the meat of it is:

username="susi";
homedir="/var/susi";
# Run as root (e.g. with sudo); dscl needs admin rights to write records.

# Pick unused IDs: one more than the highest UniqueID / PrimaryGroupID
# currently in the local directory.
new_uid=$(( $(dscl . -list /Users UniqueID | awk '{print $2}' | sort -n | tail -1) + 1 ));
new_gid=$(( $(dscl . -list /Groups PrimaryGroupID | awk '{print $2}' | sort -n | tail -1) + 1 ));

# Create the user account
dscl . -create /Users/${username};
dscl . -create /Users/${username} UserShell /usr/bin/false;
dscl . -create /Users/${username} UniqueID ${new_uid};
dscl . -create /Users/${username} RealName "${username}";
dscl . -create /Users/${username} PrimaryGroupID "${new_gid}";
dscl . -create /Users/${username} Password "*";
dscl . -create /Users/${username} NFSHomeDirectory ${homedir};

# Create the group; RecordName gets two separate values, _susi and susi
dscl . -create /Groups/${username};
dscl . -create /Groups/${username} RecordName "_${username}" "${username}";
dscl . -create /Groups/${username} PrimaryGroupID "${new_gid}";
dscl . -create /Groups/${username} RealName "${username}";
dscl . -create /Groups/${username} Password "*";

# Create the home dir
mkdir -p ${homedir};
chown ${username}:${username} ${homedir};

# Test results
echo "Testing results";

echo "User entry for ${username}:";
dscl . -read /Users/${username};

echo;
echo "Group entry for ${username}";
dscl . -read /Groups/${username};
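The script above needs unused values for ${new_uid} and ${new_gid}. Taking one more than the highest ID already in use is a simple approach; below is a minimal sketch of that logic as a standalone function. next_free_id is a hypothetical helper, not part of the original script. It reads the two-column output of dscl's -list command on stdin, so you can try it on sample data before pointing it at the real directory.

```shell
#!/bin/bash
# next_free_id: hypothetical helper, not part of the original script.
# Reads "name id" lines on stdin (the format printed by
# `dscl . -list /Users UniqueID` or `dscl . -list /Groups PrimaryGroupID`)
# and prints one more than the highest numeric id, never less than the
# floor given as $1 (default 500).
next_free_id() {
  local floor="${1:-500}"
  awk -v floor="$floor" '
    BEGIN { max = floor - 1 }
    # Only plain non-negative numbers count, so nobody (-2) is skipped.
    $2 ~ /^[0-9]+$/ && $2 + 0 > max { max = $2 + 0 }
    END { print max + 1 }
  '
}
```

On the Mac you would then set new_uid=$(dscl . -list /Users UniqueID | next_free_id 500), and likewise new_gid from dscl . -list /Groups PrimaryGroupID.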

Once we have the account in place, we can install the nrpe bits. The two main parts are the nagios plugins and the nrpe service.

Install nagios plugins

First, make yourself a working space:
mkdir nrpe_install
cd nrpe_install

Now, get the plugins, either from the SourceForge site at http://sourceforge.net/projects/nagiosplug/files/ or with this direct download from a mirror on the command line:

curl -o nagios-plugins-1.4.15.tar.gz \
http://voxel.dl.sourceforge.net/project/nagiosplug/nagiosplug/1.4.15/nagios-plugins-1.4.15.tar.gz

Unpack the tarball, go into the directory, build and install the code, then come back up:

tar -xvf nagios-plugins-1.4.15.tar.gz 
cd nagios-plugins-1.4.15
./configure
make
sudo make install
cd .. 
That was easy.
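If you need to repeat this on several machines, or for a newer release, it can help to wrap the download and build in a function keyed on the version number. A minimal sketch, assuming the mirror still uses the same layout as the URL above for the version you want; the helper names are hypothetical:

```shell
#!/bin/bash
# Hypothetical helpers wrapping the nagios-plugins steps above.
# The URL pattern copies the voxel mirror link used in this guide; other
# mirrors or later releases may use a different layout.
plugins_tarball() {
  echo "nagios-plugins-$1.tar.gz"
}

plugins_url() {
  echo "http://voxel.dl.sourceforge.net/project/nagiosplug/nagiosplug/$1/$(plugins_tarball "$1")"
}

# Download, unpack, build and install one version, stopping at the first
# failing step. curl -f fails on HTTP errors and -L follows redirects.
# The build runs in a subshell, so the caller stays in the working directory.
build_plugins() {
  local version="$1"
  local tarball
  tarball=$(plugins_tarball "$version")
  curl -fL -o "$tarball" "$(plugins_url "$version")" &&
    tar -xzf "$tarball" &&
    (cd "nagios-plugins-$version" && ./configure && make && sudo make install)
}
```

Run from the nrpe_install directory, build_plugins 1.4.15 repeats the steps above.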

Install nrpe

You can find nrpe at http://sourceforge.net/projects/nagios/files/ or try a direct mirror link:

curl -o nrpe-2.12.tar.gz \
http://superb-sea2.dl.sourceforge.net/project/nagios/nrpe-2.x/nrpe-2.12/nrpe-2.12.tar.gz

Unpack and jump into the folder:

tar -xvzf nrpe-2.12.tar.gz
cd nrpe-2.12

Now we have to make a minor change to the configure file. Using vi or your favorite text editor, open configure, find the line that tests for libssl.so, comment it out, and add a test for libssl.dylib instead:

vi ./configure 
It should look like this when you are done:
   ssllibdir="$dir"
   #if test -f "$dir/libssl.so"; then
   if test -f "$dir/libssl.dylib"; then
      found_ssl=yes
      break
   fi
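Instead of editing configure by hand, you can make the same change with sed, which also sidesteps any editor quietly substituting smart quotes. A minimal sketch (patch_ssl_check is a hypothetical helper); unlike the hand edit, it replaces the .so test rather than leaving a commented-out copy, and keeps the untouched file as configure.bak:

```shell
#!/bin/bash
# Switch nrpe's configure script from looking for libssl.so to
# libssl.dylib, the shared-library suffix used on OS X.
# -i.bak (suffix attached, no space) is accepted by both the BSD sed that
# ships with OS X and GNU sed; the original is saved as <file>.bak.
patch_ssl_check() {
  local file="${1:-./configure}"
  sed -i.bak 's|test -f "\$dir/libssl\.so"|test -f "$dir/libssl.dylib"|' "$file"
}
```

Run it from inside nrpe-2.12 as patch_ssl_check ./configure.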

Now, if you created the user account via System Preferences, run configure with susi as the nagios and nrpe user and staff as the group:

./configure \
--with-nagios-user=susi --with-nagios-group=staff \
--with-nrpe-group=staff --with-nrpe-user=susi
If, on the other hand, you created the user account via the command line, use susi as the group as well:
./configure \
--with-nagios-user=susi --with-nagios-group=susi \
--with-nrpe-group=susi --with-nrpe-user=susi
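In both cases the group is simply the account's primary group (staff for an account made in System Preferences, susi for one made with the script above), so you can also let id -gn look it up. A minimal sketch; nrpe_configure_args is a hypothetical helper:

```shell
#!/bin/bash
# Print the four user/group options for nrpe's configure script.
# $1 is the account nrpe will run as; $2 is its group, which defaults to
# the account's primary group as reported by `id -gn`.
nrpe_configure_args() {
  local user="$1"
  local group="${2:-$(id -gn "$user")}"
  printf -- '--with-nagios-user=%s --with-nagios-group=%s --with-nrpe-user=%s --with-nrpe-group=%s\n' \
    "$user" "$group" "$user" "$group"
}
```

Usage: ./configure $(nrpe_configure_args susi). The substitution is left unquoted on purpose so the shell splits it into separate options.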
If configure fails complaining that it can't find the ssl libraries, double check your edit to the configure file--I got held up for a while because my browser and editor had "helped" me by using smart quotes instead of regular double quotes. Now run make and install the parts we want:
make all
sudo make install-plugin
sudo make install-daemon
sudo make install-daemon-config

Ok, if everything went well, great. Next we edit nrpe.cfg.

nrpe.cfg

The nrpe config file is at /usr/local/nagios/etc/nrpe.cfg; let's take a quick look:

less /usr/local/nagios/etc/nrpe.cfg
In the hardcoded commands section, note the structure:
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10

In this case, if the nagios server contacts this machine asking the nrpe service to run the command check_users, the nrpe service (running as susi) will run:

/usr/local/nagios/libexec/check_users -w 5 -c 10
and pass the results and error codes back to the nagios server. Try running the command now as yourself and see if it works ok. Next, let's try starting up an instance of the daemon to make sure it works:
sudo /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

On the server end, nagios checks the nrpe service on each monitored machine with a plugin of its own, check_nrpe. That plugin was installed here too (by make install-plugin), so we can check whether the nrpe service is working by calling check_nrpe locally:

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_users
Assuming this works, you should get something back like:
USERS OK - 3 users currently logged in |users=3;5;10;0
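Besides the text, check_nrpe passes along the remote plugin's exit status, and that status is what nagios actually acts on: by the plugin conventions, 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. A minimal sketch of a wrapper that shows the status next to the output while you test (run_check and nagios_state are hypothetical helpers):

```shell
#!/bin/bash
# Map a Nagios plugin exit status to its state name.
nagios_state() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;   # 3, or anything unexpected
  esac
}

# Run a plugin command, print "STATE (exit N): output", and return the
# plugin's own exit status.
run_check() {
  local out rc=0
  out=$("$@") || rc=$?
  echo "$(nagios_state "$rc") (exit $rc): $out"
  return "$rc"
}
```

Usage: run_check /usr/local/nagios/libexec/check_nrpe -H localhost -c check_users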

Now, what we've just done is verify that we have a working version of nrpe that can be started on this machine, and can be queried using the check_nrpe plugin from nagios. What we need to do now is get this working so that our nagios server can query this machine using that check_nrpe over the network. Next, kill the daemon, and we'll make some changes in the basic configuration. Run:

ps -A | grep nrpe
You should get back something like this:
28582 ??         0:00.03 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
28608 ttys000    0:00.00 grep nrpe

The process id for the nrpe command in this case is 28582, so we use that to kill it.

sudo kill -9 28582
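The lookup and kill can be scripted too. A minimal sketch (the function names are hypothetical) that picks the PID out of ps output by matching the command path, so the grep line never matches, and that sends the default TERM signal, which gives nrpe a chance to exit cleanly; kill -9 remains the fallback if it ignores TERM:

```shell
#!/bin/bash
# Print the PIDs of running nrpe daemons, given
# `ps -A -o pid= -o command=` output on stdin. Only lines whose command
# (second field) is an nrpe binary match, so "grep nrpe" does not.
nrpe_pids_from_ps() {
  awk '$2 ~ /(^|\/)nrpe$/ { print $1 }'
}

# Stop every running nrpe daemon with SIGTERM.
stop_nrpe() {
  local pids
  pids=$(ps -A -o pid= -o command= | nrpe_pids_from_ps)
  if [ -z "$pids" ]; then
    echo "nrpe is not running"
    return 0
  fi
  # $pids is left unquoted on purpose so each PID becomes its own argument.
  sudo kill $pids
}
```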

Make a backup of the config file:

sudo cp /usr/local/nagios/etc/nrpe.cfg \
/usr/local/nagios/etc/nrpe.cfg.orig

Open the config file:

sudo vi /usr/local/nagios/etc/nrpe.cfg

Find the allowed_hosts line, and add the ip number of your nagios server after 127.0.0.1, separated by a comma:

allowed_hosts=127.0.0.1,xxx.xxx.xxx.xxx
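The same edit can be made non-interactively with sed. A minimal sketch (set_allowed_hosts is a hypothetical helper) that rewrites the allowed_hosts line to localhost plus one nagios server and keeps a .bak copy; it assumes the file has a single uncommented allowed_hosts line, as described above. Like the sudo vi step, it needs root, for example from a shell started with sudo -s.

```shell
#!/bin/bash
# Rewrite the allowed_hosts line so that only localhost and the given
# nagios server may connect. Saves the original as <cfg>.bak.
# $1 = nagios server address, $2 = config file (defaults to the path used here).
set_allowed_hosts() {
  local server="$1"
  local cfg="${2:-/usr/local/nagios/etc/nrpe.cfg}"
  sed -i.bak "s|^allowed_hosts=.*|allowed_hosts=127.0.0.1,${server}|" "$cfg"
}
```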

We also want to be able to add our own scripts for queries, in a way that keeps administration easy. So find the include config directory section, and add a line:

include_dir=/usr/local/nagios/etc/nrpe.d

This directory doesn't exist yet, but it is where we will put our plugin configuration. Any file ending in .cfg placed in this directory will be loaded when nrpe starts. Save the file and quit, then run:

sudo mkdir /usr/local/nagios/etc/nrpe.d
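If you are scripting the setup, this step is easy to make safe to rerun: append the include_dir line only if it is not already there, and create the directory with mkdir -p. A minimal sketch (add_include_dir is a hypothetical helper); it appends the line at the end of the file rather than in the include section, and needs root for the real paths:

```shell
#!/bin/bash
# Make sure nrpe.cfg loads every .cfg file in an nrpe.d directory.
# Safe to run more than once: the include_dir line is only appended if an
# identical line is not already present, and mkdir -p ignores an existing dir.
# $1 = config file, $2 = directory to include.
add_include_dir() {
  local cfg="${1:-/usr/local/nagios/etc/nrpe.cfg}"
  local dir="${2:-/usr/local/nagios/etc/nrpe.d}"
  mkdir -p "$dir"
  if ! grep -qxF "include_dir=$dir" "$cfg"; then
    echo "include_dir=$dir" >> "$cfg"
  fi
}
```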

Next create a file, say:

sudo vi /usr/local/nagios/etc/nrpe.d/cs-nrpe-mac.cfg

And add your own commands:

#
# These are local CS check commands for macintosh systems
#
command[check_light_load]=/usr/local/nagios/libexec/check_load -w 6.00,6.00,6.00 -c 10.00,10.00,10.00
command[check_heavy_load]=/usr/local/nagios/libexec/check_load -w 12.00,12.00,12.00 -c 24.00,24.00,24.00
# Check for free space in /Users, exclude /afs
command[check_User_disk]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -x /afs -p /Users
command[check_uptime]=/usr/bin/uptime
command[check_kernel]=/usr/bin/uname -spr
command[check_zombies]=/usr/local/nagios/libexec/check_procs -s Z -w 3 -c 6
command[check_run]=/usr/local/nagios/libexec/check_procs -s R -w 3 -c 6
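Each entry just maps a name to a command line, so a script of your own can be hooked in the same way, as long as it follows the Nagios plugin conventions: print one line of status (optionally followed by | and performance data, like the check_users output earlier), and exit 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN. Below is a minimal sketch of such a script; check_dir_count is hypothetical and only meant to show the shape.

```shell
#!/bin/bash
# check_dir_count DIR WARN CRIT
# Hypothetical example plugin (not part of nagios-plugins): counts the
# entries directly inside DIR and compares the count to two thresholds.
# Follows the plugin conventions: one line of output with performance data
# after "|", exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
check_dir_count() {
  local dir="$1" warn="$2" crit="$3" count perf
  if ! [[ "$warn" =~ ^[0-9]+$ && "$crit" =~ ^[0-9]+$ ]]; then
    echo "DIR UNKNOWN - usage: check_dir_count DIR WARN CRIT"
    return 3
  fi
  if [ ! -d "$dir" ]; then
    echo "DIR UNKNOWN - $dir is not a directory"
    return 3
  fi
  # tr strips the padding that BSD wc adds on OS X.
  count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
  perf="|entries=${count};${warn};${crit};0"
  if [ "$count" -ge "$crit" ]; then
    echo "DIR CRITICAL - $count entries in $dir $perf"
    return 2
  elif [ "$count" -ge "$warn" ]; then
    echo "DIR WARNING - $count entries in $dir $perf"
    return 1
  fi
  echo "DIR OK - $count entries in $dir $perf"
  return 0
}

# Run as a plugin only when given arguments, e.g. from an nrpe command line.
if [ "$#" -gt 0 ]; then
  check_dir_count "$@"
  exit $?
fi
```

Saved as /usr/local/nagios/libexec/check_dir_count and made executable, it could be wired up with a line such as command[check_spool]=/usr/local/nagios/libexec/check_dir_count /var/spool/example 100 500 (the command name and directory are hypothetical).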

If you write a script in bash or whatever, and you follow the guidelines for how nagios plugins work, you can make custom queries. See for example the check_temper plugin. Restart the daemon:

sudo /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

Now, log in to your nagios server and call the check_nrpe command against the workstation you've just set up, asking for check_users, using the host name or ip number of the machine running nrpe:

/usr/lib/nagios/plugins/check_nrpe -H myhost.mydomain -c check_users

You should get a response similar to what you get when you run the check_nrpe plugin locally.

Troubleshooting

  • If you get an error "Connection refused by host" immediately, double check to make sure that you have the correct ip for the nagios server in the nrpe.cfg, and that the nrpe service is running normally.
  • If the connection appears to hang, check your firewall settings to make sure the firewall will accept connections from the nagios server.
  • If you get an error that check_users is not defined, make sure you are starting nrpe with the correct path to the nrpe.cfg file.
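For that last case, it helps to see which command names your config files actually define. A minimal sketch using sed (list_nrpe_commands is a hypothetical helper) that skips commented-out definitions:

```shell
#!/bin/bash
# Print the name inside command[...] for every active (uncommented)
# command definition in the given nrpe config files, or on stdin.
list_nrpe_commands() {
  sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$@"
}
```

Usage: list_nrpe_commands /usr/local/nagios/etc/nrpe.cfg /usr/local/nagios/etc/nrpe.d/*.cfg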

Starting nrpe via launchd

For more on this topic, see Starting-nrpe-via-launchd, but basically, you create a file in /Library/LaunchDaemons/ that contains the instructions for running the nrpe service. By convention, the name of this file begins with the reverse of the domain "owning" it, so in my case the file is named edu.unc.cs.nrpe.plist. Here's a copy of my version, which is a bit different from the one on the web site above.

<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>KeepAlive</key>
  <dict>
    <key>NetworkState</key>
    <true/>
  </dict>
  <key>UserName</key>
  <string>susi</string>
  <key>GroupName</key>
  <string>staff</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/nagios/bin/nrpe</string>
    <string>-c</string>
    <string>/usr/local/nagios/etc/nrpe.cfg</string>
    <string>-i</string>
  </array>
  <key>Sockets</key>
  <dict>
    <key>Listeners</key>
    <dict>
      <key>SockServiceName</key>
      <string>5666</string>
      <key>SockType</key>
      <string>stream</string>
      <key>SockFamily</key>
      <string>IPv4</string>
    </dict>
  </dict>
  <key>inetdCompatibility</key>
  <dict>
    <key>Wait</key>
    <false/>
  </dict>
  <key>Label</key>
  <string>edu.unc.cs.nrpe</string>
</dict>
</plist>
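The values most likely to need changing are the Label (which by convention matches the file name), UserName and GroupName; for an account created from the command line, the group would be susi rather than staff. A minimal sketch of a function that prints the plist above with those three filled in and everything else unchanged (nrpe_plist is a hypothetical helper):

```shell
#!/bin/bash
# Print the launchd plist shown above with the label, user and group
# filled in. Everything else is copied unchanged from the example.
# $1 = label (e.g. edu.unc.cs.nrpe), $2 = user, $3 = group.
nrpe_plist() {
  local label="$1" user="$2" group="$3"
  cat <<EOF
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>KeepAlive</key>
  <dict>
    <key>NetworkState</key>
    <true/>
  </dict>
  <key>UserName</key>
  <string>${user}</string>
  <key>GroupName</key>
  <string>${group}</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/nagios/bin/nrpe</string>
    <string>-c</string>
    <string>/usr/local/nagios/etc/nrpe.cfg</string>
    <string>-i</string>
  </array>
  <key>Sockets</key>
  <dict>
    <key>Listeners</key>
    <dict>
      <key>SockServiceName</key>
      <string>5666</string>
      <key>SockType</key>
      <string>stream</string>
      <key>SockFamily</key>
      <string>IPv4</string>
    </dict>
  </dict>
  <key>inetdCompatibility</key>
  <dict>
    <key>Wait</key>
    <false/>
  </dict>
  <key>Label</key>
  <string>${label}</string>
</dict>
</plist>
EOF
}
```

For example, nrpe_plist edu.unc.cs.nrpe susi susi | sudo tee /Library/LaunchDaemons/edu.unc.cs.nrpe.plist > /dev/null writes the file, and plutil -lint on the result checks its syntax before you load it.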

OK, to start the service, go to the folder where the launch daemons are stored and use launchctl to load the plist and start the daemon:

cd /Library/LaunchDaemons/
sudo launchctl load edu.unc.cs.nrpe.plist 
sudo launchctl start edu.unc.cs.nrpe

And here's how you stop it:

sudo launchctl stop edu.unc.cs.nrpe 
sudo launchctl unload edu.unc.cs.nrpe.plist 

Try starting the daemon, and then run the check again:

/usr/lib/nagios/plugins/check_nrpe -H myhost.mydomain -c check_users

If you don't get a response, check the system log:

sudo tail -f /var/log/system.log
If you see lines similar to these scrolling by:
Oct 27 11:28:32 myhost.mydomain com.apple.launchd[1] (com.apple.launchd.peruser.504): Throttling respawn: Will start in 10 seconds
Oct 27 11:28:42 myhost.mydomain com.apple.launchd[1] (com.apple.launchd.peruser.504[12283]): getpwuid("504") failed
Oct 27 11:28:42 myhost.mydomain com.apple.launchd[1] (com.apple.launchd.peruser.504[12283]): Exited with exit code: 1

you probably have a problem with the account. Double check that the plist and nrpe.cfg specify the correct user and group. Also, if you start over at some point and delete or create an account, be aware that account data is cached by the system. I got the errors above after deleting an account via the command line: that account had the UID 504, but the cache had not cleared, so I rebooted and the errors stopped.
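Two quick checks help here. A minimal sketch with two hypothetical helpers: account_ok confirms that the account nrpe runs as still resolves (the lookup getpwuid/getpwnam is failing on), and account_errors filters a log stream down to lines like the ones above.

```shell
#!/bin/bash
# account_ok USER: report whether USER resolves to a uid on this machine.
account_ok() {
  local uid
  if uid=$(id -u "$1" 2>/dev/null); then
    echo "$1 exists (uid $uid)"
  else
    echo "$1 not found"
    return 1
  fi
}

# Read log lines on stdin and keep only those pointing at account lookup
# failures or launchd respawn throttling.
account_errors() {
  grep -E 'getpw(uid|nam)\(.*\) failed|Throttling respawn'
}
```

Usage: account_ok susi, and sudo tail -f /var/log/system.log | account_errors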

An installer script

I've put together some scripts to do all of this automagically, but use them at your own risk. There's a 00readme, but the short version is: download it, unpack it, and run it, giving it the userid you would like nrpe to run under and the ip number of your nagios server. You can download it here.

myworkstation:~ hays$ tar -xzf nrpe_installer.tgz 
myworkstation:~ hays$ cd nrpe_installer
myworkstation:nrpe_installer hays$ sudo ./install.sh

Usage: install.sh [OPTIONS]
   -c = Configure nrpe packages on this system
   -m = Make nrpe packages on this system
   -i = Install made version
   -a = Configure, Make, Install all
        This should work in all cases
        for a fresh installation, and possibly all generic cases
   -u = Userid to use for installation, who the installed software
        will run as (required)
   -s = Ip number of the nagios server (required)

   The most common usage would be:
   ./install.sh -a -u [userid] -s [nagios server ip number]

myworkstation:nrpe_installer hays$ sudo ./install.sh -u susi -s x.x.x.x -a
Posted by bil at 10:33 AM
Categories: My Software, Other Software, Work
Comment by Matt Urbanski - Thursday 02nd February 2012 07:19:40 PM

Thanks Bil! You saved me from what was likely going to be a night of figuring out how to get nrpe working on a Mac OSX box. We only have one and only need one, so it would have been knowledge I would have gained grudgingly!
I really appreciate it.

matt.

iflowfor8hours.info
Comment by Mark Clayton - Monday 20th February 2012 05:47:59 PM

Maybe my post will help someone else out. I upgraded to Lion from SL the other day. After the upgrade my MBP's fan was running full speed and the battery drain was excessive. Turning off the network brought the symptoms back to normal. Eventually, I figured out that launchd was hogging the cpu by looking at 'top -o cpu'. Looking in system.log, I found getpwnam was failing on nrpe startup. Apparently the Lion upgrade removed the nrpe user and group. Once I recreated the user and group, cpu usage, fan speed and battery drain returned to normal. This error occurred on 2 different macs.

Mark
Comment by bil - Monday 20th February 2012 06:13:21 PM

Thanks, that's a good tip to know!
Comment by Mahesh - Tuesday 19th June 2012 04:09:33 PM

Thank you so much Bil. Was struggling with installing Nagios on Mac and you saved me now.. .. : )
Comment by John McAdams - Tuesday 03rd July 2012 12:04:36 PM

Thank you for this. I'm still new to Nagios and have a dozen OS X servers to monitor. The macosx-nrpe-agent in NagiosXI doesn't work on Mountain Lion (yet) and this was a great way to get my test machines into Nagios.
Comment by Raouf - Thursday 19th July 2012 07:10:56 PM

Thank you so much for this. It was a great help for me .. .
Comment by JohnO - Sunday 23rd December 2012 01:05:27 AM

Thanks for posting these excellent instructions! They worked very well in OS X 10.8.2 Mountain Lion with two changes:

1) After installing Xcode, you need to install the command line tools from within Xcode so that the makefiles can find the C compiler. This can be found via XCode --> Preferences --> Downloads: Install Command Line Tools.

2) With the current nrpe (nrpe-2.14) you no longer need to modify the ./configure script to deal with the libssl dynamic libraries. That check has been added to the configure script already.

One suggestion: You might wish to highlight the lines in the plist file that are likely to need to be edited. For example, I saved my plist with a name that made sense for me. I also created a different username than susi. Straightforward, I know, but it might save a future person a little head scratching, since the errors you receive do not clearly point you to what you did wrong when you try to load or start via launchctl.

Thanks again!

JohnO