Wednesday, May 30, 2012

BMC: Enable SNMP Traps for Hardware Failures

You can enable SNMP traps to be sent from a BMC/iDRAC/Whatever using a few IPMI commands.  This has worked on every Dell server I have tested, but it didn't work on the one HP model I tried.  I imagine it would work on servers from most other vendors.  I know there are vendor-specific tools to do this, but I prefer using industry-standard protocols to administer systems and really don't like having to install a lot of extra vendor tools and daemons.

All you need to install is freeipmi.  The following kernel modules may need to be loaded: ipmi_si, ipmi_devintf, ipmi_msghandler.
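
A quick way to see which of those modules still need loading is to filter the output of lsmod.  This is only a sketch: missing_ipmi_modules is a helper name made up for this example, and it assumes the drivers are built as loadable modules.  Drivers compiled into the kernel never show up in lsmod, so they would be reported as missing even though they work.

```shell
# Hypothetical helper: reads lsmod-style output on stdin and prints
# each of the three IPMI modules that is not currently loaded.
missing_ipmi_modules() {
    # The first column of lsmod output is the module name.
    loaded=$(awk '{ print $1 }')
    for m in ipmi_msghandler ipmi_devintf ipmi_si; do
        printf '%s\n' "$loaded" | grep -qx "$m" || echo "$m"
    done
}

# Usage (as root), loading whatever is missing:
#   lsmod | missing_ipmi_modules | xargs -r -n1 modprobe
```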

If you aren't familiar with how to use freeipmi's bmc-config or pef-config, please check out my post entitled freeipmi *-config tools primer.

Enable trap destination (put the IP address of your trap receiver after the = in the first command):
pef-config -c -e Lan_Alert_Destination_1:Alert_IP_Address=
pef-config -c -e Community_String:Community_String=public
pef-config -c -e Alert_Policy_1:Policy_Enabled=Yes

Enable alerting:
bmc-config -c -e Lan_Channel:Non_Volatile_Enable_Pef_Alerting=Yes
bmc-config -c -e Lan_Channel:Volatile_Enable_Pef_Alerting=Yes
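
If you have more than a couple of servers to configure, the five commands above could be wrapped in a small function.  This is a sketch, not a tested tool: enable_bmc_traps, run_or_print, and the DRY_RUN variable are names invented here, and 192.0.2.10 in the usage line is a documentation-only placeholder address, not a real trap receiver.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run_or_print() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the five commands above.
# $1 = IP address of the trap receiver
# $2 = SNMP community string (defaults to "public", as above)
enable_bmc_traps() {
    dest=$1
    community=${2:-public}
    if [ -z "$dest" ]; then
        echo "usage: enable_bmc_traps <trap-receiver-ip> [community]" >&2
        return 1
    fi
    # Stop at the first command that fails.
    run_or_print pef-config -c -e "Lan_Alert_Destination_1:Alert_IP_Address=$dest" &&
    run_or_print pef-config -c -e "Community_String:Community_String=$community" &&
    run_or_print pef-config -c -e "Alert_Policy_1:Policy_Enabled=Yes" &&
    run_or_print bmc-config -c -e "Lan_Channel:Non_Volatile_Enable_Pef_Alerting=Yes" &&
    run_or_print bmc-config -c -e "Lan_Channel:Volatile_Enable_Pef_Alerting=Yes"
}

# Usage: preview first, then run for real (as root).
#   DRY_RUN=1; enable_bmc_traps 192.0.2.10
#   DRY_RUN=0; enable_bmc_traps 192.0.2.10
```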


Good luck.  There are a few ways of testing this that might work.  Some are vendor-specific and some are standard but I haven't found a one-size-fits-all method of testing SNMP traps.  If it's a rackmount server with redundant power supplies (and isn't in production) you could pull the power to one PSU.  Blades are more difficult.

On Dell, you may be able to do this from the iDRAC's ssh interface or via the racadm tool from the OS:
racadm testtrap -i 1

I don't know if there's a freeipmi version of the following test.  This basically tells the BMC to act as if a certain condition just occurred.  That includes sending out SNMP traps, inserting an entry into the SEL, and taking power-related actions (if configured to do so).
ipmitool event  # list available events
ipmitool event <1|2|3>  # trigger it. 1 might be dangerous if your server is set to power off for over-temp conditions

Here's a crazy test to try.  It makes the IPMI watchdog timer actually fire.  Be careful not to accidentally reboot your server if you normally use the watchdog timer.  I've seen an instance or two where configuring the watchdog timer doesn't seem to work remotely, but it should work locally.  What we'll do is set the watchdog timer to a short countdown (about 3 seconds), have it log to the SEL, and take no action other than logging the expiration of the timer.  This should be safe but I wouldn't recommend it for a production system.
ipmitool mc watchdog get # show settings before we mess them up
ipmi-raw 0 6 0x24 0x04 0x00 0x00 0x3E 0x20 0x00 # We all love magic numbers. There's probably a better way to do this but it works. This does all the configuration.
ipmitool mc watchdog get # check that the "Initial Countdown" is "3 sec" and "Watchdog Timer Actions" is "No action (0x00)"
ipmitool mc watchdog reset # activate the timer

After the timer expires, an event should be added to the SEL (check with ipmi-sel and clear with ipmi-sel --clear).  If the configuration is correct you should also receive an SNMP trap.  If you plan on using the watchdog timer in the future, you can configure it as usual.
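
For anyone curious about the magic numbers in the ipmi-raw line: read against the Set Watchdog Timer command in the IPMI 2.0 specification, they appear to break down as LUN 0, NetFn 6 (App), command 0x24, followed by six data bytes.  Those are timer use 0x04 (SMS/OS, with the "don't log" bit clear), timeout action 0x00 (none), pre-timeout interval 0x00, 0x3E to clear the timer-use expiration flags, and a 16-bit countdown in 100 ms ticks with the low byte first.  So 0x20 0x00 is 32 ticks, or 3.2 seconds.  The sketch below builds the same argument list for a different timeout; watchdog_raw_args is a made-up helper name, and it only prints the arguments rather than touching the BMC.

```shell
# Hypothetical helper: print ipmi-raw arguments for a log-only watchdog
# that expires after $1 seconds (whole seconds, 1 to 6553).
watchdog_raw_args() {
    secs=$1
    ticks=$(( secs * 10 ))             # countdown is in 100 ms units
    lsb=$(( ticks & 0xFF ))            # low byte goes first
    msb=$(( (ticks >> 8) & 0xFF ))
    # lun=0 netfn=6 cmd=0x24, then: timer use, action, pre-timeout,
    # flags to clear, countdown LSB, countdown MSB
    printf '0 6 0x24 0x04 0x00 0x00 0x3E 0x%02X 0x%02X\n' "$lsb" "$msb"
}

# Usage: ipmi-raw $(watchdog_raw_args 3)
```

The same caution applies as above: the action byte stays 0x00 here, so expiry should only log, but check with ipmitool mc watchdog get before resetting the timer.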

Almost all of the testing above will result in an entry in a SEL.  You can clear it with ipmi-sel --clear or ipmitool sel clear.

Even if you try everything above, you still may not get a test message.  I have successfully configured servers that wouldn't send test traps no matter how hard I tried, but they do send traps for real hardware failures.  If you pull a redundant PSU on a rackmount and don't get alerted, things are configured wrong.


There are other things you may need to enable depending on the defaults of the system you are using.  This was only really tested on Dell blades and rackmounts.  It won't work on the one model of HP system I tried because ipmi-pef-config doesn't run on it (HP doesn't implement the necessary features, AFAIK).

If you encounter problems, play around with it for a while.  If you used vendor-specific tools to configure a server and it worked, try running diff on the output of "bmc-config -o" and "ipmi-pef-config -o" from the working server and a non-configured server.  That may point you in the right direction.
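
The comparison might look something like this.  It is a sketch under a couple of assumptions: that the checkout files mark comments with a leading #, and that you have copied both files to one machine.  config_diff and the file names in the usage lines are made up for this example.

```shell
# Hypothetical helper: diff two checkout files while ignoring comment
# lines (assumed to start with '#') and blank lines.
# Returns diff's exit status: 0 if the settings match, 1 if they differ.
config_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" > "$a" || true
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$2" > "$b" || true
    status=0
    diff -u "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Usage, with made-up file names:
#   bmc-config -o > working-bmc.conf        # on the working server
#   bmc-config -o > unconfigured-bmc.conf   # on the non-configured one
#   config_diff unconfigured-bmc.conf working-bmc.conf
```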

You may want to check out my other posts for other ideas: Enable IPMI Over LAN from the OS and freeipmi *-config tools primer.

Let me know if you get a system working and needed to change some other settings.
