Thursday, November 11, 2010

Diagnosing Throttled or "Slow" Systems (Processors to be Precise)

Background

Also see part 2 for more information. Our original issue has been resolved.

First I will give the lengthy debugging process as background before giving the commands that others can use to diagnose and fix problems. I include this partly so that other systems administrators can see some of the tools available to them and partly for my own documentation. This documents over a month of work, so hopefully others will benefit.

We began to notice a few blades that were running slower than others for some reason but had no idea why. These are Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures.  Users would complain about "slow nodes" that some of their batch jobs ran on. Some users could tell us on a per-node basis which ones were slow. Unfortunately, some users with ~1000 processor MPI jobs could not pinpoint it but only knew their jobs ran much slower than usual.

Debugging Steps

All we knew was that there were some blades out there that ran slower than the rest, but we had no clue why or how to identify them. The Linpack benchmark is standard in the HPC industry for benchmarking and burn-in, so we naturally used it to identify slow nodes. The problem was that benchmarking was a reactive measure and didn't identify the root cause. You can't very well run Linpack on several hundred nodes every few days just to see if there is a problem.

dset
Dell was able to tell us how to find the problems (with some caveats) by using their OpenManage suite. First they had us run the dset utility to send back reports to their support team. At this point Dell was able to point out that the processors were in a "throttled" state. This was somewhat good news since we now knew that there was at least a particular component that could be blamed. Our natural next question was how to find this across our clusters. Sending off diagnostic reports to the vendor for >900 nodes every few days didn't seem like an ideal workflow. cpufreq-info was suggested, but cpufreq wouldn't work in our installation.

omreport (OpenManage)
Fortunately the support tech was able to find a way to use OpenManage to return the same information as dset in a much quicker, more concise, and most importantly scriptable way. omreport (part of OMSA) was able to return the necessary information. The exact command is omreport chassis processors. This was a much-needed tool to find throttled CPUs. However, we began to notice some inconsistencies. Some nodes reported a processor as "Present" and others as "Enabled". Normal nodes with Westmere reported "Present" for the CPU state and Nehalem nodes reported "Enabled". Some of the throttled Westmere nodes also reported "Enabled" for some reason. So this left us with a partial answer, though one that was still better than nothing. We never did figure out the cause of the interesting reporting, but we didn't spend much time on it either.

OMSA is a good suite of tools, but it has always seemed a little heavy for our needs. It runs a few daemons that do something or other and then talks to the hardware over IPMI, something that can also be done with simple CLI tools like ipmitool. A few years back we used to see problems with OMSA processes sometimes eating up a lot of CPU time (a big no-no on an HPC cluster) so we removed it. To be fair, I'm guessing Dell has improved the code a lot since then so it's possible it doesn't happen anymore. Side note: Now that I have dealt with IPMI a lot, I also think that the problems back then may have been due to the kernel interface with the BMC going berserk (shown as kipmi0 eating up a CPU) and thus causing OMSA to likewise freak out.

I asked Dell if we could get ahold of the source code for whatever gave omreport its throttling information. A C function or two, a program to run, a kernel interface to query, a decent description of what happens, anything to help us get that information on our own. Without waiting for a response I followed what I thought was the most logical path to take: IPMI.

IPMI
ipmitool has been a favorite tool of mine for quite a while now. In the past I mostly used it to query and control server power, read the SEL (System Event Log), check firmware levels, and occasionally glance at the ambient temperature sensors and fan speeds, etc. Serial over LAN is also a very nice feature. I played around with ipmitool sdr for a while since I knew it could get me temperature readings and more. However, the command was painfully slow to run since it queried tons of existent and non-existent sensors. After playing around with sub-options for ipmitool sdr I came across ipmitool sdr type Processor and ipmitool sdr entity 3. These turned out to be exactly what I was looking for. It could even be done over the network since each iDRAC/BMC has an IP address.

Using ipmitool sdr entity 3 | grep Throttled became my preferred method of checking for throttling (add "-v" to ipmitool for extra information). It worked great at finding throttled nodes, but not finding the root cause of the throttling. The important thing was that we now knew how to find throttled nodes. It turns out that there were several blades that we did not know about that had been throttled. We quickly marked these offline in our scheduling system so that no more batch jobs would be scheduled on them. I also wrote up a quick health check script to tie into the Torque pbs_mom on each node that would take the node offline if throttling was detected.
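A minimal sketch of that kind of health check hook (the Torque config lines and script path here are assumptions; Torque's pbs_mom runs the script named by $node_check_script and takes the node offline when its output starts with "ERROR"):

```shell
#!/bin/bash
# Hypothetical sketch of the pbs_mom health check hook described above.
# In mom_priv/config one would point at a script like this, e.g.:
#   $node_check_script /usr/local/sbin/detect_cpu_throttling
#   $node_check_interval 10
# pbs_mom marks the node down when the output begins with "ERROR".
# The ipmitool parsing is isolated in a function so it can be tested.

check_throttle() {
 # reads "ipmitool sdr entity 3" style output on stdin
 local out
 out=$(cat)
 if echo "$out" | grep -q Throttled
 then
  echo "ERROR: throttling detected: $(echo "$out" | grep Throttled | head -1)"
  return 1
 fi
 echo "CPU(s): OK"
 return 0
}

# run the real check only where ipmitool is actually available
if command -v ipmitool >/dev/null
then
 ipmitool sdr entity 3 2>/dev/null | check_throttle
fi
```
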

Fixing the constant throttling
Now the task was to find a reliable way to fix the blades. To start off, we have >900 Dell PowerEdge M610 blades in M1000e chassis. It turns out that simply doing a reseat of the blades (physical or "virtual") fixed the issue every time (or so we thought at the time). There is a way in the chassis CMC (Chassis Management Controller) web interface to do a virtual reseat in a Troubleshooting section somewhere. The easy way is to run the following command over ssh: serveraction -m server-<bladenumber> -f reseat. That cuts power to the blade and causes the CMC to treat it as if the blade were just inserted. This was of course inconvenient since the blade has to be empty of batch jobs before doing a reseat, but it's much better than needing to replace hardware.

Later on we discovered that doing a racreset on a chassis can also clear most throttling issues, though we didn't understand why at the time. For those who don't know, a racreset resets the chassis CMC(s) but keeps the blades powered up. While the racreset is in progress, the fans spin up and blades get throttled while the intelligence is missing. It is a safe operation for powered on blades.

If you didn't catch it, let me point out that a racreset causes throttling to occur temporarily. It also occurs during a CMC firmware upgrade. It turns out that throttling can occur under normal operating conditions. That discovery necessitated the addition of intelligence to our health checking script that checked for throttling. One might describe it as a bug if a script marks every blade in the cluster offline during a firmware upgrade... So I added logging to the script. I added quick-and-dirty logging functionality to the scripts by storing an sqlite database on each node. Every time the health check script ran, it logged the current state of the processors to the database. It worked well, except IPMI can be a little slow and sometimes kipmi0 likes to freak out (hopefully fixed in newer kernels and hopefully not a BMC firmware issue). I figured we could ignore throttling if it lasted for only a few minutes since a racreset or CMC firmware upgrade might cause throttling for at most a few minutes.
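The logging idea can be sketched very simply. I used an sqlite database on each node; the flat-file version below keeps the example dependency-free, and the STATEFILE path is an assumption:

```shell
#!/bin/bash
# Minimal sketch of per-node throttle-state logging: one "epoch flag"
# line per health check run.  (The post's real version used sqlite;
# the path and format here are illustrative assumptions.)

STATEFILE=${STATEFILE:-/var/tmp/throttle_history.log}

log_state() {
 # $1 = 1 if any processor was seen throttled on this run, 0 otherwise
 echo "$(date +%s) $1" >> "$STATEFILE"
}
```
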

Intermittent throttling
This functionality worked well, that is, until I discovered intermittent throttling across several nodes. It turns out that the throttling was more frequent than we thought. At this point we had resolved throttling on the nodes that were consistently throttled and have not seen that constant throttling since. Dell said that a firmware upgrade very likely fixed it. However, we didn't know what to do about the intermittent throttling. That necessitated even more intelligence in the script, including calculating a rate over the last few hours, but only if enough data points had been collected. This got to be fun, especially when trying to pick the threshold at which to alert us. We still didn't know what was causing it.
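The rate-over-a-window logic might look like this as a sketch (the window, minimum sample count, and threshold are made-up numbers, and the log format is assumed to be one "epoch flag" line per check):

```shell
#!/bin/bash
# Hypothetical sketch of the "rate over the last few hours" check: count
# throttled samples inside a time window, but stay quiet unless enough
# data points have been collected first.

throttle_rate_alarm() {
 # usage: throttle_rate_alarm <logfile> <window_secs> <min_samples> <threshold>
 # exit status: 1 = alert, 0 = OK (or not enough data yet)
 awk -v now="$(date +%s)" -v win="$2" -v min="$3" -v thr="$4" '
  $1 >= now - win { n++; hits += $2 }
  END {
   if (n < min) exit 0        # not enough data points yet
   exit (hits >= thr ? 1 : 0)
  }' "$1"
}
```
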

Thermal Throttling
Around this same time, we had another throttling issue that was probably unrelated: thermal throttling. This is kind of a tangent, but it actually helped us find a better way to monitor all kinds of throttling. To begin with, let's just say that we have sufficient cooling capacity and then some for the next few years in our server room. We are still unsure of the cause, but we noticed it by looking at the logs on the blades (grep -i throttled /var/log/messages). Thermal throttling didn't show up in the ipmitool output, of course (that would have been too easy). So now I added another quick-and-dirty health check script on the nodes to see if "throttled" showed up in the logs. I figured that if the Linux kernel could figure out throttling was occurring, so could I. I set off on a search of the Linux source and was quickly rewarded by the functions that output messages like "kernel: CPU3: Temperature above threshold, cpu clock throttled". The messages were a result of the thermal interrupt handler intel_thermal_interrupt() in arch/x86/kernel/cpu/mcheck/therm_throt.c.

MSRs
It turns out that the Intel processor fires an interrupt to indicate a transition to or from a thermally-throttled state for the processor. The actual state is returned with rdmsrl(). It reads MSR (Model Specific Register) values from the processor. Since I was only vaguely familiar with MSRs, I read up on what they were and then figured out how to query them. The line of code I found with rdmsrl made use of the MSR_IA32_THERM_STATUS define, which I then found in arch/x86/include/asm/msr-index.h. The location of that register is 0x19c and THERM_STATUS_PROCHOT showed that bit 0 was the value that indicated thermal throttling. Now I just needed a way to query it. I could have written a short C program to grab the data, but my health check script was in bash, so I decided to use the already available rdmsr program (msr-tools package) to grab it: rdmsr -p$cpunum 0x19c. Now I had a quick and easy solution to check for thermal throttling on Intel processors. I then updated my health check script to check the processors for thermal throttling using the MSR values instead of reading the logs. Side note: the CMC in the chassis may have gotten confused somehow, possibly due to a power spike when a trucker knocked down several power poles near campus. A simple racreset immediately cleared it. Virtual reseats of the blades did too. Several months back we found a processor with a thumb screw loose, so sometimes it really is a physical problem.
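The bit test itself can be factored out so it works on any value read back with rdmsr (rdmsr -c prints a C-style 0x... value that shell arithmetic accepts directly); the helper name here is my own:

```shell
#!/bin/bash
# Tiny helper around the bit-0 test on IA32_THERM_STATUS (0x19c).
# The value would come from "rdmsr -c -p<cpu> 0x19c".

therm_throttled() {
 # true (exit 0) when bit 0 of the given MSR value is set
 (( $1 & 1 ))
}

# example (requires the msr module and msr-tools):
# therm_throttled "$(rdmsr -c -p0 0x19c)" && echo "CPU0 is thermally throttled"
```
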

I began to play around with rdmsr to query other values I found in msr-index.h. I figured if the BMC knows that a processor is (non-thermally) throttled, the processor itself must too. The problem is that I can't find any documentation about MSR values beyond the contents of msr-index.h in the Linux kernel. I decided that instead of logging just bit 0 of MSR 0x19c, I would log the entire default output of rdmsr -p$cpunum 0x19c and look at each bit on nodes that I knew were having lots of intermittent non-thermal throttling (the kind visible with ipmitool sdr entity 3). Over time I collected enough data to find a correlation between bit 2 (i.e. 1<<2) and the throttled state according to ipmitool. I am very confident now that the MSR at 0x19c bit 2 indicates generic throttling of some kind. I have tried to find documentation on the Intel MSRs but apparently it is only given out under NDA to some developers. What $(($(rdmsr -c -p0 0x19c)&4)) actually means according to Intel is a mystery to me, but according to the empirical data I gathered it means the CPU is throttled. If anyone happens to know for sure, please post a comment below. My guess is that it indicates throttling requested by a piece of hardware other than the processor, such as by the iDRAC/BMC or CMC.

At this point I was now able to consolidate my scripts into one that checks the MSRs at 0x19c bits 0 and 2 with a fallback to ipmitool if the processor is non-Intel. I'm not sure how non-Intel systems report throttling since we have almost zero non-Intel chips. I should also add that this seems to work fine on Harpertown, Nehalem, and Westmere chips. Looking at arch/x86/include/asm/msr-index.h it appears that bit 10 (THERM_STATUS_POWER_LIMIT) is what I want but I have never seen that bit set.

Does it matter that a blade is occasionally throttled?
The short answer to this important question is: sometimes. A short period of throttling is not completely unexpected if the chassis firmware is upgraded. Maybe even a transient throttle state here and there, but not much more. A threshold to consider: if the state is checked every 90 seconds, send an alert if the number of throttles detected reaches double digits in 24 hours. That should give a generous allowance for firmware upgrades and maybe a transient or two. Anything more than that should be looked at with a benchmarking program like Linpack.
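The arithmetic behind that suggestion, as a quick sketch:

```shell
# Rough numbers behind the suggested threshold: a 90-second check interval
# gives 960 samples per day, so a double-digit alert threshold corresponds
# to roughly 1% of samples showing a throttled state.
checks_per_day=$(( 24 * 3600 / 90 ))
echo "$checks_per_day"
```
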

We are still trying to figure out the optimum setting to alert at, but right now anything around 100 throttles or higher in 24 hours (checked at 90 second intervals) will come in with a Linpack score of 82-87% of the Linpack score for an identical node. Not good. Note that the comparison is between actual benchmarks of known-good and throttled nodes, not theoretical numbers. We are still working on finding a better threshold and a more permanent fix.

Next Steps
The next step I would like to take is to move away from polling and to generate actual numbers for how long a CPU is throttled over a period of time by tracking the interrupts as they come in. I'm looking into doing this through the kernel somehow either through an interrupt handler or asking the kernel to forward events to userspace (possible?). I haven't even begun to look at this yet so if anyone has any experience in this area, please feel free to point me in the right direction.
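One possible interim step short of a kernel-side handler: kernels of this era expose per-CPU thermal event interrupt counts as a "TRM" row in /proc/interrupts, so sampling that row over time would count thermal interrupts without polling the MSR. As far as I know it says nothing about the non-thermal (bit 2) throttling. A small parser for that row:

```shell
#!/bin/bash
# Hypothetical sketch: sum the per-CPU "TRM" (thermal event interrupt)
# counters from /proc/interrupts.  Run periodically, the deltas between
# samples would show how often thermal interrupts are firing.

sum_trm() {
 # reads /proc/interrupts-style text on stdin, prints the summed TRM count
 awk '/TRM/ { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
      END { print s + 0 }'
}

# e.g.: sum_trm < /proc/interrupts
```
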

Update: A few days after I wrote the draft version of this post I found "/sys/devices/system/cpu/cpu<cpu#>/thermal_throttle/count". This is only for thermal throttling. The information from bit 2 is not populated anywhere as far as I can tell. This was not available on our production 2.6.18 kernels but should be available in the newer kernel that RHEL 6 has.

Update 2 (11/18/2010): I discovered some more interesting MSR features after reading part of Intel's Software Developer's Manual Volumes 3A and 3B.  I'll write this up within a few days, but the short explanation is that my guess as to the meaning of bit 2 is correct (throttling is initiated by "another agent on the platform").  Start reading around page 625 of 3A.  The interesting part is that the odd-numbered bits are a flag that gets set if that kind of throttling occurred since the odd-numbered bits were last reset.  That will be a much more effective way of monitoring the throttled state instead of polling the current state.  See my new post, part 2.

Update 3:  There is now a part 2 article.

Update 4: Check out the related article Reading Intel Uncore Performance Counters from User Space

Update 5:  The root cause of our intermittent throttling problems?

Update 6: Resolved


Summary and code

Also check out the next article (Part 2) in this series.

To summarize, we encountered three different kinds of problems and noticed them all around the same time. They were:
  1. Thermal throttling
  2. Throttling (not necessarily thermal) that showed up 100% of the time
  3. Throttling (not necessarily thermal) that was intermittent

Symptoms
  • Batch jobs run slower on some nodes than others, even with identical hardware
  • Benchmarks (e.g. Linpack xhpl) run slower on some nodes
  • A node is "slower"
Solutions
  • Check thumb screws on CPU heatsink
  • Check the thermal paste distribution between the processor and heat sink
  • Upgrade the BIOS, chassis, and BMC firmware (Dell-specific: >=3.03 on M1000e CMC, >=3.02 on M610 iDRAC, and >=2.1.15 on M610 BIOS). A lot was fixed in the firmwares around that time
  • Reset the chassis management if it is a safe operation (racreset on a Dell M1000e is safe. HP or others?)
  • Reseat blades (virtual if possible, physical if necessary). This almost always fixes issues.
  • Unplug power and plug it back in on rackmounts
  • Replace CPUs and system board if all else fails. This has fixed everything else so far.

The most likely solutions are a reseat of the blade (if it's a blade) or a replacement of the CPUs and system board.

Commands to check for throttling

For Dell hardware:
omreport chassis processors

Anything with IPMI available:
ipmitool sdr entity 3 | grep Throttled
ipmitool -H remotehost.example.com -U root -I lan sdr entity 3 | grep Throttled
ipmitool -H remotehost.example.com -U root -I lanplus sdr entity 3 | grep Throttled

Preferred for Intel CPUs:
for a in /dev/cpu/[0-9]*; do rdmsr -c -p$(basename "$a") 0x19c; done
# Then bitwise and the values with 1 and 4 to detect thermal and generic throttling. The script below is a more thorough example

Check logs for thermally throttled messages:
grep -i throttled /var/log/messages

Check count of thermally throttled events (not sure when it was introduced but it doesn't work on 2.6.18):
cat /sys/devices/system/cpu/cpu<cpu#>/thermal_throttle/count

Miscellaneous Commands
Perform a virtual reseat of a blade on a Dell PowerEdge M1000e chassis:
serveraction -m server-<bladenumber> -f reseat

Reset the CMC on an M1000e chassis (safe):
racreset

Read the core temperatures on Nehalem or newer (includes Westmere):
for a in /dev/cpu/[0-9]*
do
  cpu=$(basename $a)
  printf "%2d: " $cpu
  echo $(($(rdmsr -f 23:16 -p$cpu -d 0x1a2) - $(rdmsr -f 22:16 -p$cpu -u 0x19c)))
done

Script to check the throttling state

I happen to like bash.  This should be extremely easy in any other programming language.  Just read the output of rdmsr and do the math.  This script also loads the necessary modules if they aren't already loaded.

=== detect_cpu_throttling ===

#!/bin/bash

# Public domain code by Ryan Cox. http://tech.ryancox.net

ipmimods="ipmi_si ipmi_devintf ipmi_msghandler"
retval=0

if [ "$(grep -m 1 vendor_id /proc/cpuinfo | awk '{print $3;}')" != "GenuineIntel" ]
then
 echo "Not an Intel CPU. $(grep -m 1 vendor_id /proc/cpuinfo | awk '{print $3;}') is unsupported.  Will try IPMI for basic support."
 needsleep=0
 for m in $ipmimods
 do
  if ! grep -q "^$m " /proc/modules
  then
   modprobe $m
   needsleep=1
  fi
 done
 if (($needsleep))
 then
  sleep 7 #some arbitrary magic number that allows time to load
 fi
 ipmi=$(ipmitool sdr entity 3 2>/dev/null)
 if echo "$ipmi" | grep -q Throttled
 then
  # note: don't set retval inside a ( ) subshell or the change is lost
  echo -e "ERROR: Throttling detected with ipmitool:\n$ipmi"
  retval=1
 else
  echo "CPU(s): OK"
 fi
 exit $retval
fi

#could also check /proc/modules but msr might not be a module
if [ ! -e /dev/cpu/0/msr ]
then
 modprobe msr
fi

# this is available when msr.ko is loaded (or builtin)
for cpu in /dev/cpu/[0-9]*
do
 cpunum=$(basename "$cpu")
 # if reading this msr doesn't work, return 0 so it never shows as throttled
 msr=$(rdmsr -c -p$cpunum 0x19c 2>/dev/null || echo -n 0)

 # bit 0 = thermal throttling, bit 2 = generic (non-thermal) throttling
 thermthrottled=$(($msr&1))
 genericthrottled=$((($msr>>2)&1))
 if (($thermthrottled|$genericthrottled))
 then
  retval=1
  if (($thermthrottled))
  then
   echo "ERROR: CPU$cpunum is thermally throttled: $msr"
  fi
  if (($genericthrottled))
  then
   echo "ERROR: CPU$cpunum is throttled but not necessarily thermally: $msr"
  fi
 else
  echo "CPU$cpunum: OK"
 fi
done

exit $retval

=== end of detect_cpu_throttling ===

Also check out the next article (Part 2) in this series.

4 comments:

  1. Thank you for this diagnosis, it helped us to pinpoint a similar problem on Supermicro hardware.

  2. We got similar issue here and just follow your steps to catch the root cause, brilliant diagnosis and very helpful!

  3. Very nice, thank you. Just installed your script on all our servers in our diagnosis directory.

  4. Thank you so much for this post. I had the exact same issue after a firmware upgrade, and I couldn't figure it out for the life of me. This was it: a simple reseat.


Please leave any comments, questions, or suggestions below. If you find a better approach than what I have documented in my posts, please list that as well. I also enjoy hearing when my posts are beneficial to others.