Wednesday, December 29, 2010

Useful commands for Dell servers

Here are some commands that may come in handy for Dell systems.  I don't use OpenManage for anything because we had trouble with it a few years ago.  After figuring out how to do the same things with lower-level commands, we never bothered with it again.  The issues were due to installation/configuration problems and the occasional instance of a daemon chewing up CPU.  Looking back I would now guess it was due to kipmi0 going out of control (can sometimes be fixed by a reset of the iDRAC/BMC or a virtual reseat).

Testing Throttled Intel Westmere X5650 CPUs in an HP blade

This is a continuation of our search for the cause of slowness with our Dell M610 blades using dual Intel X5650 Westmere processors. (It has since been resolved.)  Please see the other articles I have written, especially Flaws with Intel Westmere X5650?  The other relevant articles are Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.
Fortunately we were able to borrow an HP blade to test with (ProLiant BL460c G6).  We swapped in our Westmere CPUs and our RAM.  We had the blade for a very limited time, so this testing was about as unscientific as you can get.  We did find some interesting results but I definitely do not consider them conclusive.

Saturday, December 4, 2010

Flaws with Intel Westmere X5650?

Update (Dec. 29, 2010): CPUs were tested in an HP blade
Update 2 (Jan. 12, 2011): Quantifying Our Westmere Turbo Mode Problems
MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which... see the "Contradictions" section in this post) did contribute to a mis-diagnosis. The issue has been fixed by a simple iDRAC firmware update. More info.

For months now we have been dealing with throttling and slow performance of our Dell PowerEdge M610 blades with Intel Xeon X5650 Westmere processors.  For background on this issue, please see previous articles I wrote about it: Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.  Reading Intel Uncore Performance Counters from User Space will also be useful reading.

First of all, I'm not posting this in an attempt to make a particular vendor look bad.  That is simply not my intention.  At the time of this posting our issues are unresolved and I am merely posting this so that people with similar issues can see if they are affected.  I am also soliciting feedback from anyone else with similar problems.

I'm fairly confident now that this is a problem with the processors themselves (and is possibly aggravated by a management controller somewhere that assumes the processor works as designed/documented?).  It's possible I'm misunderstanding some of the values from the processor that I'm reading, but there is a very strong correlation with actual results from the Linpack benchmarks we do.  Hopefully the information below is clear enough to demonstrate why I came to the conclusions I did.  This is one of the most difficult issues I have had to tackle in a long time so hopefully the explanations make sense.  Please post a comment below if you have any thoughts, insights, criticism, etc.

Wednesday, November 24, 2010

Reading Intel Uncore Performance Counters from User Space

This article came about as a result of trying to resolve a systemic CPU throttling issue on a lot of blades.  Please see Part 1 and Part 2 of that series of articles for more information.  The issue has since been resolved.

I finally figured out how to query performance counters from user space.  It can be done with rdmsr and was quite complicated to figure out.  I really wish I had gotten paper copies of Intel's Software Developer's Manuals Volumes 3A and 3B.  Cross-referencing dozens of pages in PDFs is not my idea of fun but it worked.  If you ever want to do something like this, I recommend dead-tree manuals.
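
As a rough illustration of what rdmsr does, here is a minimal Python sketch that reads one 64-bit MSR from user space.  It assumes a Linux system with the msr kernel module loaded, so that each logical CPU has a /dev/cpu/N/msr device, and it normally needs root.  The function name read_msr and its device_template parameter are my own inventions for this example and are not part of any tool mentioned in the article.

```python
import os
import struct


def read_msr(register, cpu=0, device_template="/dev/cpu/{cpu}/msr"):
    """Return the 64-bit value of one MSR on one logical CPU.

    Assumption: the Linux msr driver is loaded. With that driver, the
    file offset passed to pread() selects the MSR address.
    """
    path = device_template.format(cpu=cpu)
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)  # every MSR is 8 bytes wide
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at offset {register:#x}")
    return struct.unpack("<Q", raw)[0]  # x86 is little-endian


if __name__ == "__main__":
    IA32_THERM_STATUS = 0x19C
    try:
        print(hex(read_msr(IA32_THERM_STATUS, cpu=0)))
    except OSError as exc:
        # Typical causes: msr module not loaded, or not running as root.
        print(f"could not read MSR: {exc}")
```

The device path is a parameter only so that the function can be pointed at an ordinary file when testing without the hardware.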

Objective
My objective was to read values such as UNC_DRAM_THERMAL_THROTTLED, UNC_THERMAL_THROTTLING_TEMP.CORE_[0-3], UNC_THERMAL_THROTTLED_TEMP.CORE_[0-3], UNC_PROCHOT_ASSERTION, UNC_THERMAL_THROTTLING_PROCHOT.CORE_[0-3].  My main goal was to see if the DIMMs were throttled since I assumed the IA32_THERM_STATUS MSR (0x19c) showed CPU core throttling well enough (see Part 2).  The problem was how to do this.

Uncore
Intel processors have an "uncore" that can be monitored.  The uncore is the processor package and its components, not including the cores, but it does include components shared between cores.  There is probably a better definition available but that is the gist of it.  The MSRs I referenced prior to this article are per core but the new values I want to read are per "uncore", meaning two per blade on our two socket blades.  There are per core counters as well but those aren't what I was interested in.
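
Since the uncore values are per package rather than per core, one read per socket should be enough.  The sketch below picks one logical CPU for each physical package using the physical_package_id files that Linux exposes in sysfs.  The helper names are hypothetical, and the idea that any core in a package can be used to query that package's uncore is my assumption, not something verified in this article.

```python
import glob
import os


def read_package_ids(sysfs_root="/sys/devices/system/cpu"):
    """Map logical CPU number -> physical package (socket) id from sysfs."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "topology",
                           "physical_package_id")
    result = {}
    for path in glob.glob(pattern):
        # path looks like .../cpu12/topology/physical_package_id
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as fh:
            result[int(cpu_dir[3:])] = int(fh.read().strip())
    return result


def one_cpu_per_package(package_by_cpu):
    """Return {package id: lowest-numbered logical CPU in that package}."""
    chosen = {}
    for cpu in sorted(package_by_cpu):
        chosen.setdefault(package_by_cpu[cpu], cpu)
    return chosen


if __name__ == "__main__":
    # On a two-socket blade this should print two entries.
    print(one_cpu_per_package(read_package_ids()))
```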


Tuesday, November 23, 2010

Diagnosing Throttled Processors - Part 2

This is Part 2 of an article about resolving CPU throttling on our Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures.  Please see Part 1 for more information.  I may also have found an answer.  UPDATE: The issue has since been resolved.

Read The Fine Manual
Where I last left off, I was trying to find a suitable way to eliminate the need for polling.  I put scripts in place that check the IA32_THERM_STATUS MSR (0x19c) but could only guess at the meaning of one of the values and only knew how to query the instantaneous value.  I then went on a hunt to find the official meaning of bit 2 as well as a better solution than simple polling of the instantaneous values.  I finally found what I was looking for in Intel's Software Developer's Manuals.  I read through parts of volumes 3A and 3B (a combined 1800 pages of light reading material).
The Meaning of MSR IA32_THERM_STATUS Bits
The manuals describe IA32_THERM_STATUS in detail in Volume 3A, Section 14.5.5.2 Reading the Digital Sensor (p. 625).  The documentation shows that my guess about the meaning of bit 2 is correct in that it "Indicates whether PROCHOT# or FORCEPR# is being asserted by another agent on the platform."  Now I can tell that the processors we have were not being throttled by the on-die temperature sensors (indicated by bit 0).  To reiterate the differences: bit 0 means that the temperature sensor on the die (of the core, not the processor package) is above threshold and bit 2 means that "another agent" is asserting PROCHOT# or FORCEPR#.  FORCEPR means "Force Power Reduction".  Each core has its own temperature sensor and that is what is shown at 0x19c.
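
To make the bit meanings concrete, here is a small Python sketch that decodes a raw IA32_THERM_STATUS value.  Bits 0 and 2 are the ones described above.  Bits 1 and 3 are the sticky "log" counterparts as I read the Intel manual, and they are not discussed in the paragraph above.  The function name decode_therm_status and the dictionary keys are made up for this example.

```python
def decode_therm_status(value):
    """Decode the low four bits of a raw IA32_THERM_STATUS (0x19c) value."""
    return {
        # Bit 0: the core's on-die temperature sensor is above threshold.
        "thermal_status": bool(value & (1 << 0)),
        # Bit 1: sticky log of bit 0 (assumption from the Intel manual).
        "thermal_status_log": bool(value & (1 << 1)),
        # Bit 2: PROCHOT# or FORCEPR# is being asserted by another agent.
        "prochot_or_forcepr_event": bool(value & (1 << 2)),
        # Bit 3: sticky log of bit 2 (assumption from the Intel manual).
        "prochot_or_forcepr_log": bool(value & (1 << 3)),
    }


if __name__ == "__main__":
    # Illustrative value only, not a reading from our blades:
    # bit 2 set, bit 0 clear, i.e. throttled by another agent and not
    # by the core's own temperature sensor.
    for name, is_set in decode_therm_status(0x4).items():
        print(f"{name}: {is_set}")
```

The value passed in would come from reading MSR 0x19c on each core, since each core reports its own status.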

Thursday, November 11, 2010

Diagnosing Throttled or "Slow" Systems (Processors to be Precise)

Background

Also see part 2 for more information. Our original issue has been resolved.

First I will give the lengthy debugging process as background before giving the commands that others can use to diagnose and fix problems. I include this so that other systems administrators can see some of the tools available to them and partially for my own documentation. This documents over a month of work, so hopefully others will benefit.

We began to notice a few blades that were running slower than others for some reason but had no idea why. These are Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures.  Users would complain about "slow nodes" that some of their batch jobs ran on. Some users could tell us on a per-node basis which ones were slow. Unfortunately, some users with ~1000 processor MPI jobs could not pinpoint it but only knew their jobs ran much slower than usual.

Debugging Steps

All we knew was that there were some blades out there that ran slower than the rest, but we had no clue why or how to identify them. The Linpack benchmark is standard in the HPC industry for benchmarking and burn-in so we naturally used it to identify slow nodes. The problem was that benchmarking was a reactive measure and didn't identify the root cause. You can't very well run Linpack on several hundred nodes every few days just to see if there is a problem.