Wednesday, November 24, 2010

Reading Intel Uncore Performance Counters from User Space

This article came about as a result of trying to resolve a systemic CPU throttling issue on a lot of blades.  Please see Part 1 and Part 2 of that series of articles for more information.  The issue since has been resolved.

I finally figured out how to query performance counters from user space.  It can be done with rdmsr and was quite complicated to figure out.  I really wish I had gotten paper copies of Intel's Software Developer's Manuals Volumes 3A and 3B.  Cross-referencing dozens of pages in PDFs is not my idea of fun but it worked.  If you ever want to do something like this, I recommend dead-tree manuals.

My objective was to read values such as UNC_DRAM_THERMAL_THROTTLED, UNC_THERMAL_THROTTLING_TEMP.CORE_[0-3], UNC_THERMAL_THROTTLED_TEMP.CORE_[0-3], UNC_PROCHOT_ASSERTION, UNC_THERMAL_THROTTLING_PROCHOT.CORE_[0-3].  My main goal was to see if the DIMMs were throttled since I assumed the IA32_THERM_STATUS MSR (0x19c) showed CPU core throttling well enough (see Part 2).  The problem was how to do this.

Intel processors have an "uncore" that can be monitored.  The uncore is the processor package and its components, not including the cores, but it does include components shared between cores.  There is probably a better definition available but that is the gist of it.  The MSRs I referenced prior to this article are per core but the new values I want to read are per "uncore", meaning two per blade on our two socket blades.  There are per core counters as well but those aren't what I was interested in.

Tuesday, November 23, 2010

Diagnosing Throttled Processors - Part 2

This is Part 2 of an article about resolving CPU throttling on our Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures.  Please see Part 1 for more information.  I may also have found an answerUPDATE: The issue has since been resolved.

Read The Fine Manual
Where I last left off, I tried to find a suitable way to eliminate the need for polling.  I put scripts in place that check the IA32_THERM_STATUS MSR (0x19c) but was could only guess at the meaning of one of the values and only knew how to query the instantaneous value.  I then went on a hunt to find the official meaning of bit 2 as well as a better solution than simple polling of the instantaneous values.  I finally found what I was looking for in Intel's Software Developer's Manuals.  I read through parts of volumes 3A and 3B (a combined 1800 pages of light reading material).
The Meaning of MSR IA32_THERM_STATUS Bits
The manuals describe IA32_THERM_STATUS in detail in Volume 3A, Section Reading the Digital Sensor (p. 625).  The documentation shows that my guess about the meaning of bit 2 is correct in that it "Indicates whether PROCHOT# or FORCEPR# is being asserted by another agent on the platform."  Now I can tell that the processors we have were not being throttled by the on-die temperature sensors (indicated by bit 0).  To reiterate the differences: bit 0 means that the temperature sensor on the die (of the core, not the processor package) is above threshold and bit 2 means that "another agent" is asserting PROCHOT# or FORCEPR#.  FORCEPR means "Force Power Reduction".  Each core has its own temperature sensor and that is what is shown at 0x19c.

Thursday, November 11, 2010

Diagnosing Throttled or "Slow" Systems (Processors to be Precise)


Also see part 2 for more information. Our original issue has been resolved.

First I will give the lengthy debugging process as background before giving the commands that others can use to diagnose and fix problems. I include this so that other systems administrators can see some of the tools available to them and partially for my own documentation. This documents over a month of work, so hopefully others will benefit.

We began to notice a few blades that were running slower than others for some reason but had no idea why. These are Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures.  Users would complain about "slow nodes" that some of their batch jobs ran on. Some users could tell us on a per-node basis which ones were slow. Unfortunately, some users with ~1000 processor MPI jobs could not pinpoint it but only knew their jobs ran much slower than usual.

Debugging Steps

All we knew was there were some blades out there that ran slower than the rest but had no clue why or how to identify them. The Linpack benchmark is standard in the HPC industry for benchmarking and burn-in so we naturally used it to identify slow nodes. The problem was that benchmarking was a reactive measure and didn't identify the root cause. You can't very well run Linpack on several hundred nodes every few days just see if there is a problem.