Tuesday, January 11, 2011

Quantifying Our Westmere Turbo Mode Problems

MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which) did contribute to a mis-diagnosis.  I can't state what was causing it, but it was not a flaw with Westmere or turbo mode. The issue was fixed by a simple iDRAC firmware update. More info.

This is a continuation of our search to find the cause of slowness with our Dell M610 blades using dual Intel X5650 Westmere processors.  Please see the other articles I have written, especially Flaws with Intel Westmere X5650?  The other relevant articles are Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.

During a maintenance window we were able to benchmark all of our systems.  We benchmarked our nodes multiple times with Intel's Turbo mode both enabled and disabled on the processors.  The results showed that 18% of our blades had worse performance with Turbo mode enabled than with it disabled.  That's 93 out of 515 blades included in the benchmark. These tests were done on Dell PowerEdge M610 blades with dual Intel X5650 Westmere processors (2.66 GHz) with 1066 MHz DDR3 DIMMs.  The graph below shows the result of the benchmark run with turbo mode enabled.

xhpl Linpack benchmark results in GFLOPS (rounded to the nearest integer) with turbo mode enabled.  Y-axis is number of blades. Each system is a Dell PowerEdge M610 with dual 2.66 GHz Intel Xeon X5650 processors with 24 GB 1066 MHz DDR3 RAM



The benchmark results with turbo mode disabled ranged from 113.2 to 114.3 GFLOPS and had no outliers.  We could possibly tune our Linpack config a little more but it is pretty good at this point.  The interesting part is comparing the two sets of numbers.

Enabling turbo mode introduced lots of variability, including awful performance on many blades.  According to Intel's documentation, turbo mode is meant to overclock cores when thermal conditions allow for it.  For some reason, turbo mode harms performance at least 18% of the time on our systems and introduces little gain for many of the others.
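The comparison behind the 18% figure is simple enough to sketch. The snippet below is a minimal illustration, not the script we actually ran: the blade names and scores are made up, and the helper names slower_with_turbo and percent are hypothetical. Only the 93 and 515 on the last line come from our real results.

```python
# Hypothetical per-blade Linpack scores in GFLOPS. These are NOT our measured data.
turbo_off = {"blade01": 113.9, "blade02": 113.4, "blade03": 114.1, "blade04": 113.6}
turbo_on = {"blade01": 119.2, "blade02": 97.5, "blade03": 116.0, "blade04": 113.1}


def slower_with_turbo(off, on):
    """Return sorted names of blades whose turbo-on score is below their turbo-off score."""
    return sorted(name for name in off if name in on and on[name] < off[name])


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


regressed = slower_with_turbo(turbo_off, turbo_on)
print(regressed)                                  # ['blade02', 'blade04']
print(f"{percent(len(regressed), len(turbo_off)):.0f}% of the sample blades regressed")

# The real counts from this post: 93 of 515 blades were slower with turbo mode enabled.
print(f"{percent(93, 515):.1f}%")                 # 18.1%
```

Comparing each blade against its own turbo-off score, rather than against a fleet-wide average, keeps the count honest if the turbo-off numbers differ slightly between blades.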


It is possible this is a Dell issue but, for reasons I have stated in previous articles, I think it is an Intel problem that may be exacerbated by Dell firmware.  An HP blade also showed some negative, though different, effects.

It would be reasonable to expect a little variation with turbo mode enabled, but the results should never be worse than having turbo mode disabled.  By the very definition of turbo mode in all the Intel manuals I have read, turbo mode is supposed to increase the core frequencies.  The documentation I have read (and I have read a lot) never mentions the possibility of performance degradation with turbo mode enabled.

The variation between 113 and 120 that we get is also interesting.  Why aren't those numbers more consistent?  We have more than adequate cooling.  The server room has stable temperatures throughout the room.  We see no correlation of score with position in the room or even with the ambient temperature sensor on a blade.  The performance numbers follow the CPU.  Swapping the blade or CPUs to a different location results in the same benchmark numbers.
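For anyone who wants to run the same kind of check on their own blades, one way to test for a relationship between score and ambient temperature is a Pearson correlation coefficient. This is only a sketch of that idea and not necessarily how we did it: the pearson helper is written out by hand, and the temperature and score lists are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: ambient sensor temperature (C) and turbo-on score (GFLOPS).
ambient_c = [21.0, 22.5, 21.5, 23.0, 22.0, 21.0]
gflops = [119.2, 97.5, 116.0, 118.4, 104.3, 113.1]

r = pearson(ambient_c, gflops)
print(f"r = {r:.2f}")  # values near 0 suggest no linear relationship
```

A value near +1 or -1 would point at temperature as a factor. A value near zero only rules out a linear relationship, so it is worth looking at a scatter plot as well.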


The blades that scored below 113 GFLOPS fall short of that mark by a combined 552 GFLOPS.  If our systems are all supposed to be able to reach 119 GFLOPS, then this issue is causing us to lose 2.1 TFLOPS.
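The lost-performance figures are sums of per-blade shortfalls against a target score. Here is a minimal sketch of that arithmetic, assuming the loss is measured as target minus score for every blade under the target. The total_shortfall helper and the sample scores are hypothetical, so the printed totals are not our real numbers.

```python
def total_shortfall(scores, target):
    """Sum of (target - score) over every score that falls below the target."""
    return sum(target - score for score in scores if score < target)


# Hypothetical turbo-on scores in GFLOPS. These are NOT our measured data.
sample = [119.0, 97.5, 116.0, 105.0]

# Against the turbo-off floor of 113, only blades that turbo mode made slower count.
print(total_shortfall(sample, 113.0))  # 23.5

# Against a 119 target, every blade under 119 contributes, so the total is larger.
print(total_shortfall(sample, 119.0))  # 38.5
```

The second call counts blades that are faster than the turbo-off baseline but still short of 119, which is why a loss measured against 119 grows by more than 6 GFLOPS per affected blade.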

Summary
  • 18% of the blades we tested achieved worse performance with turbo mode than without
  • That's 93 of 515 blades with seemingly broken turbo mode
  • We are losing at least 552 GFLOPS across those 93 blades
  • If we should actually be able to get 119 GFLOPS per blade, this issue is causing us to lose 2.1 TFLOPS
  • Turbo mode results vary widely from processor to processor; the score follows the processor and doesn't seem to be affected much by our server room temperature
  • UPDATE: This issue has been resolved.
