Saturday, December 4, 2010

Flaws with Intel Westmere X5650?

Update (Dec. 29, 2010): CPUs were tested in an HP blade
Update 2 (Jan. 12, 2011): Quantifying Our Westmere Turbo Mode Problems
MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which... see the "Contradictions" section in this post) did contribute to a mis-diagnosis. The issue has been fixed by a simple iDRAC firmware update. More info.

For months now we have been dealing with throttling and slow performance on our Dell PowerEdge M610 blades with Intel Xeon X5650 Westmere processors.  For background on this issue, please see the previous articles I wrote about it: Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.  Reading Intel Uncore Performance Counters from User Space will also be useful reading.

First of all, I'm not posting this in an attempt to make a particular vendor look bad.  That is simply not my intention.  At the time of this posting our issues are unresolved and I am merely posting this so that people with similar issues can see if they are affected.  I am also soliciting feedback from anyone else with similar problems.

I'm fairly confident now that this is a problem with the processors themselves (and is possibly aggravated by a management controller somewhere that assumes the processor works as designed/documented?).  It's possible I'm misunderstanding some of the values from the processor that I'm reading, but there is a very strong correlation with actual results from the Linpack benchmarks we do.  Hopefully the information below is clear enough to demonstrate why I came to the conclusions I did.  This is one of the most difficult issues I have had to tackle in a long time so hopefully the explanations make sense.  Please post a comment below if you have any thoughts, insights, criticism, etc.

Most of my understanding is from Intel's Software Developer's Manuals, Volumes 3A and 3B at http://www.intel.com/products/processor/manuals/ (about 1800 pages of fun reading).  3B was particularly useful when reading MSRs from the processors themselves.

I have used MSR IA32_THERM_STATUS extensively to check for throttling.  Bit 0 (current state) and bit 1 (sticky log since the last reset) show thermal throttling triggered by the core's temperature sensors.  The Linux kernel also reads this MSR.  On our blades, these two bits are only ever set when there is an actual thermal event, such as a thumb screw not being screwed in properly.

IA32_THERM_STATUS bits 2 (state) and 3 (since last counter reset) indicate "whether PROCHOT# or FORCEPR# is being asserted by another agent on the platform".  These bits are set frequently on throttled nodes; checking them is how we detect throttling and know when to run benchmarks.
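As a rough illustration of the bit checks described above, here is a minimal Python sketch.  It assumes a Linux host with the msr kernel module loaded and root access; the helper names read_msr and decode_therm_status are my own for this example and are not from any tool mentioned in this post.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C  # MSR address (the same 0x19c referred to later in this post)


def read_msr(cpu: int, address: int) -> int:
    """Read a 64-bit MSR via the Linux msr driver (assumes root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def decode_therm_status(value: int) -> dict:
    """Split out the four low bits of IA32_THERM_STATUS discussed in the text."""
    return {
        "thermal_now": bool(value & (1 << 0)),  # core sensor throttling, current state
        "thermal_log": bool(value & (1 << 1)),  # core sensor throttling, sticky log
        "prochot_now": bool(value & (1 << 2)),  # PROCHOT#/FORCEPR# asserted, current state
        "prochot_log": bool(value & (1 << 3)),  # PROCHOT#/FORCEPR# asserted, sticky log
    }
```

On a live system this would be called as decode_therm_status(read_msr(0, IA32_THERM_STATUS)) for core 0.  A throttled node of the kind described here would show the two prochot flags set while the two thermal flags stay clear.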

I have done a lot more reading of the manuals since then and discovered the ability to read performance counters from the processors.  It's fairly convoluted, but I have been reading several counters (listed in Vol. 3B, Appendix A.3, Table A-5, page 488) and comparing their values across blades.  Also see: Reading Intel Uncore Performance Counters from User Space.

Confusing Results from Performance Counters
Here are my findings with the performance counters I have checked so far.  I was using the values for Non-Architectural Performance Events In the Processor Uncore for Next Generation Intel Processor (Intel microarchitecture codename Westmere).  On throttled nodes, this is what I see:
UNC_DRAM_THERMAL_THROTTLED: nothing detected
UNC_THERMAL_THROTTLING_TEMP.CORE_0: nothing detected unless there is a real thermal event
UNC_THERMAL_THROTTLED_TEMP.CORE_0: increments rapidly
UNC_PROCHOT_ASSERTION: increments rapidly
UNC_THERMAL_THROTTLING_PROCHOT.CORE_0: increments rapidly

I also checked the values for the other cores (1-3).  You can't check all six cores on Westmere using performance counters, at least not in a way I can figure out (anyone know differently?).  The values for all the cores incremented in similar ways to core 0.

Contradictions
After looking at the meaning of those values on page 488, I really think this is a CPU problem and here is why:
UNC_THERMAL_THROTTLING_TEMP.CORE_0 (which does not increment) measures "Cycles that the PCU records that core is above the thermal throttling threshold temperature."
UNC_THERMAL_THROTTLED_TEMP.CORE_0 (which increments rapidly on throttled nodes) measures "Cycles that the PCU records that core is in the power throttled state due to core’s temperature being above the thermal throttling threshold."

Based on those definitions, the first one (THROTTLING) should increment any time that core 0 is above the thermal throttling threshold.  It never increments.  THROTTLED should increment when core 0 is above the thermal throttling threshold (same condition as THROTTLING) but with the extra condition that core 0 is power throttled due to it meeting the criteria for THROTTLING.  Thus if it is THROTTLED, it should be THROTTLING.  I can even imagine unlikely situations where it is THROTTLING (core temp above threshold) but not THROTTLED because it may be above threshold but isn't throttling as a result.  This is not what I am observing however.  It shows that it is power throttled because the core is over temperature, but at the same time the core is not over temperature.  This is either a bug in the processor or the documentation.
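The implication above (THROTTLED implies THROTTLING) can be written as a small check on how much each counter advanced between two reads.  This is only a sketch of the reasoning; the function name and the returned labels are hypothetical, and the deltas would have to come from whatever method you use to read the uncore counters.

```python
def check_throttle_counters(throttling_temp_delta: int, throttled_temp_delta: int) -> str:
    """Classify one sampling interval from the increments of the two core-0 counters.

    throttling_temp_delta: increase in UNC_THERMAL_THROTTLING_TEMP.CORE_0
    throttled_temp_delta:  increase in UNC_THERMAL_THROTTLED_TEMP.CORE_0
    """
    if throttled_temp_delta > 0 and throttling_temp_delta == 0:
        # Throttled "due to temperature" while never recorded as over threshold.
        return "contradiction"
    if throttling_temp_delta > 0 and throttled_temp_delta == 0:
        # Over threshold but not throttled: unlikely, yet allowed by the definitions.
        return "over threshold, not throttled"
    if throttling_temp_delta > 0:
        return "over threshold and throttled"
    return "no thermal activity"
```

Our throttled nodes land in the first branch: THROTTLED advances rapidly while THROTTLING stays at zero.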

I have also measured the core temperatures to be consistently ~30-40 degrees below the 96 degree C threshold when under load.  The cores themselves report that they are never over threshold.  We even lowered the temperature in the room an extra five degrees and saw no measurable effect on benchmarks.  This is in the same room as our fully-functional Nehalem cluster.  The ambient temperature sensors on the blades are consistently cool in the entire room.  Throttling issues are distributed throughout the Westmere racks and show no patterns as far as distribution (common chassis, circuits, PDUs, busways, blade position in chassis, assignment of batch jobs, location in the rack, location in the room, specific to a server room, etc).
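For anyone wanting to repeat the core temperature measurement, here is a minimal sketch of one way to derive it from IA32_THERM_STATUS.  It relies on the digital readout field (bits 22:16, degrees below the throttling threshold) and the "reading valid" flag (bit 31).  The 96 degree C default is simply the threshold quoted above and is an assumption for other parts; the function name is hypothetical.

```python
def core_temp_c(therm_status: int, threshold_c: int = 96):
    """Estimate core temperature in degrees C from a raw IA32_THERM_STATUS value.

    threshold_c is assumed to be the 96 C threshold quoted in this post.
    Returns None when the processor flags the reading as invalid.
    """
    reading_valid = (therm_status >> 31) & 1
    if not reading_valid:
        return None
    # Digital readout: how many degrees the core is below the threshold.
    degrees_below = (therm_status >> 16) & 0x7F
    return threshold_c - degrees_below
```

A readout of 35, for example, corresponds to a core temperature of 61 degrees C with the assumed threshold.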

The other interesting performance counter to note is UNC_PROCHOT_ASSERTION.  It means: "Number of system assertions of PROCHOT indicating the entire processor has exceeded the thermal limit."  I don't know how it is even possible that the processor is over temperature when the cores themselves are 30-40 degrees C under the thermal threshold and indicate no problems themselves.  Unfortunately I don't know how to read the actual value of the processor temperature sensors, just the core sensors.

Debugging steps
We have tried a number of steps to rule out other possibilities.  These include, but are definitely not limited to:
  • Swapping CPUs within a blade
  • Replacing DIMMs (we did find a few odd scenarios where some CPU/DIMM combinations resulted in throttling going away for a while)
  • Swapping CPUs between blades
  • Swapping CPUs between the original M610 blades with Nehalem and the current generation M610 blades (the system boards aren't 100% the same)
  • Replacing CPUs (this seems to work if the CPUs are new)
  • Replacing system boards
  • Moving blades to different enclosures
  • Moving blades to a lightly-loaded staging enclosure
  • Using six power supplies in an enclosure instead of the usual four
  • Toggling the C-states setting in the BIOS
  • Toggling the C1E setting in the BIOS
  • Toggling the Turbo Mode setting in the BIOS (see "Stopgap Solution: Disable Turbo Mode" below)
  • Using the latest BIOS, iDRAC, CPLD, and CMC firmware
  • Installing a newer version of the OS and kernel 
  • Testing good and bad CPUs in an HP blade
  • Combinations of the items above

(Thanks to Landon Orr for doing much of the hard work on those.)

If you are diagnosing similar issues, be sure to heavily load the system for at least an hour or two while measuring.  We have seen some blades give good initial performance numbers but then tank over time.  That indicates to me that there really is a thermal issue, real or imagined.  I think it's the latter of the two.
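Because the slowdown can take a while to appear, it helps to sample the throttling bit for the whole run rather than check it once.  The sketch below shows one way to do that; sample_prochot is a hypothetical helper, and read_status stands for any function you supply that returns the raw IA32_THERM_STATUS value (for example, a read of MSR 0x19c through /dev/cpu/0/msr).

```python
import time


def sample_prochot(read_status, duration_s, interval_s, sleep=time.sleep):
    """Poll bit 2 (PROCHOT#/FORCEPR# asserted) at a fixed interval.

    read_status: caller-supplied function returning the raw IA32_THERM_STATUS value.
    Returns a list of booleans, one per sample, in time order.
    """
    samples = []
    for _ in range(int(duration_s // interval_s)):
        samples.append(bool(read_status() & (1 << 2)))
        sleep(interval_s)
    return samples


def throttled_fraction(samples):
    """Fraction of samples in which throttling was being asserted."""
    return sum(samples) / len(samples) if samples else 0.0
```

For a two-hour Linpack run sampled every ten seconds, that would be sample_prochot(reader, 7200, 10).  Comparing the fraction for the first and second hour should show whether a blade starts out fine and then tanks.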

Stopgap Solution: Disable Turbo Mode
One solution that kind of works is to disable Turbo Mode (and maybe C1E and C-states??).  It uniformly results in Linpack scores of ~114 gflops on our M610s (dual-socket 6-core X5650 with 24 GB 1066 MHz quad-ranked DDR3 RDIMMs).  Good nodes typically have scores of 119-120 with Turbo Mode enabled.  The "slow" ones score anywhere from ~98 upward.

Disabling Turbo Mode consistently achieves ~114 gflops on both "good" and "bad" CPUs.  The theoretical max is ~127 gflops with Turbo Mode disabled and ~140 with it enabled.  The measured max with it enabled is ~120.  Disabling it also results in no reported throttling except during the initial boot of the system.  My understanding is that the initial throttling (readable at MSR 0x19c) is due to some as-designed blade power footprint analysis that occurs at boot.
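For reference, the theoretical numbers can be reproduced with simple arithmetic.  This sketch assumes 4 double-precision floating point operations per cycle per core, the X5650's 2.66 GHz base clock, and 2.93 GHz as the clock with all cores in turbo.  The turbo figure is my assumption, chosen because it is consistent with the ~140 quoted above; the function name is hypothetical.

```python
def peak_gflops(sockets: int, cores_per_socket: int, ghz: float, flops_per_cycle: int = 4) -> float:
    """Theoretical double-precision peak: sockets x cores x clock (GHz) x flops per cycle."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


BASE_GHZ = 2.66   # X5650 base clock
TURBO_GHZ = 2.93  # assumed all-core turbo clock

peak_base = peak_gflops(2, 6, BASE_GHZ)    # about 127.7 gflops
peak_turbo = peak_gflops(2, 6, TURBO_GHZ)  # about 140.6 gflops

# Linpack efficiency for the measured scores quoted in this post.
eff_turbo_off = 114 / peak_base   # roughly 89%
eff_turbo_on = 120 / peak_turbo   # roughly 85%
eff_slow_node = 98 / peak_turbo   # roughly 70%
```

Seen this way, a "slow" blade at ~98 gflops runs at roughly 70% of the turbo peak, well below what the same hardware reaches with Turbo Mode disabled.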

Side note: We haven't put tons of effort into squeezing every last drop of performance out of this particular Linpack configuration but it is pretty good.  There is probably a little more we can do to bump up the numbers slightly but the interesting thing to note is the difference between results on identical hardware and the difference in results with different processor features enabled/disabled.  We're not just seeing marginal differences.

Conclusion:  Disabling Turbo Mode brings up the performance of slow blades.  It also decreases the performance of good blades.  Is Turbo Mode broken?  It relies, of course, on thermal conditions, so if something is broken on the thermal-monitoring side of things it is quite possible that Turbo Mode would break.  I can verify using performance counters (UNC_TURBO_MODE.CORE_0, etc.) that it does get activated when it is enabled.  In my mind, all signs still point to faulty thermal monitoring somewhere on the processor.  Or, instead of positive increments for Turbo Mode, a bit is flipped somewhere which results in negative increments when Turbo Mode is activated.

UNC_PROCHOT_ASSERTION/PROCHOT# Notes
Recently we had an M1000e go haywire.  The CMC was unable to talk to the components in the enclosure and was completely unreachable.  An interesting side effect was that the blades were throttled down so severely they got ~8 gflops in Linpack ("8" is not a typo).

I checked MSR 0x19c and saw that bits 2 and 3 were active, as expected.  Out of the performance counters listed above, only UNC_PROCHOT_ASSERTION incremented.  I read some more from Intel's Xeon Processor 5600 Series Datasheet Volume 1 and found that PROCHOT# is bi-directional, meaning "that it can either signal when the processor (any core) has reached its maximum operating temperature or be driven from an external source to activate the TCC [Thermal Control Circuit]" (see section 7.2.4, page 133).  I already assumed this to some degree but the other documentation I read wasn't quite as clear.

So in this case with mostly dead chassis management, something clearly external to the processor (the chassis or iDRAC) signaled PROCHOT# but only UNC_PROCHOT_ASSERTION incremented.  In situations where the CMC was operational but throttling occurred, the cores were reported to be throttled due to the core temperature being over-threshold (though, as reported before, the cores also reported that their temperature sensors were not over-threshold).  It's interesting to note the difference in results.


MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which... see the "Contradictions" section in this post) did contribute to a mis-diagnosis.  I can't state what was causing it, but it was not a flaw with Westmere.  Work through your normal support channels if you think you have this issue.

 


Summary
  1. The room is consistently cool
  2. The ambient temperature sensors on the blades report they are cool
  3. We tried lowering the room temperature an extra five degrees with no measurable effect
  4. The cores report that they are cool (temperature reading is 30-40 deg C under threshold even under heavy load)
  5. The cores report _no_ throttling due to the core being over its temperature threshold
  6. The cores report that they _are_ throttled because "PROCHOT# or FORCEPR# is being asserted by another agent on the platform"
  7. The processor uncores report that DRAM is not thermally throttled
  8. The processor uncores report that the cores are not over temperature threshold according to the cores' temperature sensors
  9. The processor uncores report that the cores are not over temperature threshold according to the cores' temperature sensors BUT that the cores are power throttled because they are over temperature threshold according to the cores' temperature sensors. Contradicts #4,5,8
  10. The processor uncores report a large "number of system assertions of PROCHOT indicating the entire processor has exceeded the thermal limit." Contradicts #4?
  11. According to the above information, the processors are hot
  12. According to the above information, the cores are not hot
  13. According to the above information, the cores are throttled because the cores are hot. (#9 contradicts #4,5,8)
I'm probably missing a detail or two but I'm pretty sure the conclusion is: The processor thinks it is hot even though it isn't.

Are the processor package temperature sensors reporting bad values?  Is the package temperature threshold way too low?  Are the cores actually hot for some reason but the core temperature sensors report values that are way too low?  Bad microcode?  Are negative instead of positive increments accidentally used when turbo mode is activated?

I would love to hear from anyone else experiencing these issues or even people who think I'm off-base in my conclusions.

1 comment:

  1. Thank you for all your work on this issue. We had the exact same issue stumping us. The virtual reseat of the blades got them back from a crippled state to perfect health.

