This is Part 2 of an article about resolving CPU throttling on our Dell PowerEdge M610 blades with Intel Westmere processors (Xeon X5650 2.66 GHz) in M1000e enclosures. Please see Part 1 for more information. I may also have found an answer. UPDATE: The issue has since been resolved.
Read The Fine Manual
Where I last left off, I was trying to find a suitable way to eliminate the need for polling. I put scripts in place that check the IA32_THERM_STATUS MSR (0x19c), but I could only guess at the meaning of one of the values and only knew how to query the instantaneous value. I then went on a hunt for the official meaning of bit 2, as well as a better solution than simple polling of the instantaneous values. I finally found what I was looking for in Intel's Software Developer's Manuals. I read through parts of volumes 3A and 3B (a combined 1800 pages of light reading material).
The Meaning of MSR IA32_THERM_STATUS Bits
The manuals describe IA32_THERM_STATUS in detail in Volume 3A, in the section titled "Reading the Digital Sensor" (p. 625). The documentation shows that my guess about the meaning of bit 2 was correct: it "Indicates whether PROCHOT# or FORCEPR# is being asserted by another agent on the platform." Now I can tell that our processors were not being throttled by the on-die temperature sensors (indicated by bit 0). To reiterate the difference: bit 0 means that the temperature sensor on the die (of the core, not the processor package) is above threshold, and bit 2 means that "another agent" is asserting PROCHOT# or FORCEPR#. FORCEPR means "Force Power Reduction". Each core has its own temperature sensor, and that is what is shown at 0x19c.
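To make the bit meanings concrete, here is a minimal Python sketch that decodes the low four bits of a value read from 0x19c. The function name, the dictionary keys, and the sample value 0x0C are made up for illustration. Bits 1 and 3 are the sticky "log" bits that pair with bits 0 and 2.

```python
# Bit positions in IA32_THERM_STATUS (MSR 0x19c).
THERMAL_STATUS = 1 << 0        # on-die core sensor is above threshold right now
THERMAL_STATUS_LOG = 1 << 1    # sticky log bit paired with bit 0
PROCHOT_FORCEPR = 1 << 2       # another agent is asserting PROCHOT#/FORCEPR#
PROCHOT_FORCEPR_LOG = 1 << 3   # sticky log bit paired with bit 2


def decode_therm_status(value):
    """Return named booleans for the low four bits of a 0x19c reading.

    The key names are hypothetical labels chosen for this sketch.
    """
    return {
        "core_sensor_throttling": bool(value & THERMAL_STATUS),
        "core_sensor_history": bool(value & THERMAL_STATUS_LOG),
        "external_throttling": bool(value & PROCHOT_FORCEPR),
        "external_history": bool(value & PROCHOT_FORCEPR_LOG),
    }


if __name__ == "__main__":
    # 0x0C is an invented sample value: bits 2 and 3 set, bits 0 and 1 clear.
    for name, is_set in decode_therm_status(0x0C).items():
        print(f"{name}: {is_set}")
```

A reading with bit 2 set and bit 0 clear is the pattern described above: something outside the core is requesting the throttling, not the core's own sensor.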
Something external to the processor core is requesting that it be throttled for whatever reason. I have several theories but have not been able to get anything confirmed yet. Our hardware technician has worked hard at swapping CPUs, DIMMs, and system boards to diagnose the issue and eliminate some of the theories. So far the most we can say from our thorough investigation is that a certain combination of Intel Westmere processor and DIMMs triggers the throttling. It doesn't seem to be the system board or the BMC, but neither has been ruled out yet.
The likely causes of the issues we have seen are bad Westmere processors or bad DIMMs, a combination of the two, or bad BMC code (less likely) that goes haywire when interacting with the other parts. One other theory I have is that there is a bad batch of temperature sensors for one of the components that triggers the BMC to throttle the processors. The problem with this theory is that these values aren't reported anywhere that I can find so I have no way to test it. I should also note that the CPU core temperature sensors are not above threshold.
A Silent Killer
One thing I always reiterate when speaking to support techs is that other people may be affected by this throttling and have no idea of its existence. It is extremely hard to find the throttling unless you benchmark your servers and compare them to others with the exact same hardware. The other option is to use code similar to what I published in the original article. I have new and improved code now but don't have it in a form that is ready to publish.
HPC centers are much more likely to notice throttling since running Linpack benchmarks is a normal part of life. One question I have is whether or not these problems can suddenly appear and then get progressively worse over time. If that is the case, regular benchmarking or polling of MSRs is required in order to detect it. A single burn-in period probably won't catch it. We don't have enough data over enough time to demonstrate or even guess if this happens. My guess is that many non-HPC centers won't have an easy way to detect throttling issues and will just end up with under-performing hardware. Even HPC centers might not notice this if they see no issues during the burn-in period and then assume everything is okay.
In order for most of the throttling to occur, the processors must be under heavy load. Be sure to load up the system with CPU (and RAM?) intensive applications before assuming throttling is non-existent. Well-tuned Linpack runs are probably the best.
The fact that throttling occurs more frequently when the processors are under load leads me to think that some component somewhere has a defective temperature sensor. The increased load on CPU and RAM would cause an increase in temperature. The only thing that makes me think this is wrong is that this is a systemic problem. How could it even be possible to get so many bad parts? It was all ordered and delivered at the same time so I suppose it's possible.
Enhanced Throttling Detection Abilities (with a history!)
From my reading of the IA32_THERM_STATUS documentation, I learned that most of the odd-numbered bits are "history" bits. When I say history, I mean a single-bit history: the bit is set to "1" if the corresponding condition (such as throttling) has occurred since the last time the history bit was reset to "0". Using wrmsr you can set all the history bits for 0x19c to zero and then read the register again a short time later with rdmsr to see if any are "1" again.
I have a script now that clears the history bits using wrmsr and then reads them again a little while later using rdmsr. I am still fine-tuning the thresholds and I will post it here when it is ready in a few days.
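The clear-then-reread idea can be sketched in a few lines of Python. This is not the script mentioned above, only a rough illustration under some assumptions: it goes through the /dev/cpu/<cpu#>/msr device files provided by the Linux msr module (root required) instead of calling the wrmsr and rdmsr tools, it treats bits 1, 3, 5, 7 and 9 as the history bits, and all function names are invented for the example.

```python
import os
import struct
import time

IA32_THERM_STATUS = 0x19C
# Assumed set of history ("log") bits: 1, 3, 5, 7 and 9.
LOG_BITS = 0x2AA


def read_msr(cpu, msr):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def write_msr(cpu, msr, value):
    """Write a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), msr)
    finally:
        os.close(fd)


def throttled_since_clear(cpu, wait_seconds,
                          read=read_msr, write=write_msr, sleep=time.sleep):
    """Clear the history bits, wait, and return whichever history bits came back.

    A non-zero result means at least one logged condition occurred during
    the wait. The read/write/sleep parameters exist so the logic can be
    exercised without real hardware.
    """
    current = read(cpu, IA32_THERM_STATUS)
    write(cpu, IA32_THERM_STATUS, current & ~LOG_BITS)
    sleep(wait_seconds)
    return read(cpu, IA32_THERM_STATUS) & LOG_BITS
```

On a real system, something like `throttled_since_clear(0, 10)` would report which history bits were set again on CPU 0 during a ten-second window. The wait length is the threshold that needs tuning, and the check only means something while the machine is under load.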
When we upgrade to a newer kernel, this information might be available under /sys/devices/system/cpu/cpu<cpu#>/thermal_throttle/. I read through arch/x86/kernel/cpu/mcheck/therm_throt.c in a newer kernel release and it appears to ignore bit 2. I'll email the maintainer of the code after Thanksgiving to see if I am reading that right. It will at least correctly keep a count of the thermal throttling and another throttling condition.
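If a kernel does expose that directory, collecting the counters could look roughly like the Python sketch below. I am deliberately not assuming any particular file names inside thermal_throttle/, since they may differ between kernel versions. The sketch reads every file it finds there and keeps the ones that contain a plain integer. The function name is invented for the example.

```python
import glob
import os


def read_throttle_counters(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {file_name: integer_value}} for thermal_throttle files.

    Files that cannot be read or do not hold a plain integer are skipped.
    """
    counters = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "thermal_throttle", "*")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]      # e.g. "cpu0"
        name = os.path.basename(path)     # whatever file the kernel exposes
        try:
            with open(path) as handle:
                value = int(handle.read().strip())
        except (OSError, ValueError):
            continue
        counters.setdefault(cpu, {})[name] = value
    return counters


if __name__ == "__main__":
    for cpu, values in read_throttle_counters().items():
        for name, value in values.items():
            print(f"{cpu} {name} = {value}")
```

Since these are running counts, a single reading says little. Comparing two readings taken some time apart, while the machine is under load, would show whether throttling events are still occurring.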
I wrote a Linux kernel module this week to try reading Intel performance counters with rdpmc. The module part wasn't that bad and I quickly managed to get it to export values under /sys/devices/system/cpu/cpu<cpu#>/.
The problem I am encountering now is how to read the performance counters at all. I have tried looking at and copying all the kernel code I can find on how to read performance counters. I run set_in_cr4(X86_CR4_PCE) before reading the values. I just want to query 0x67 (UNC_DRAM_THERMAL_THROTTLED) and the PMCs in the range 0x80 - 0x84 (other throttling values) to see if some "uncore" or DRAM temperature sensor is returning bad data. The values are documented in Intel's SDM Vol 3B, Table A-5, p. 488.
If you know how to query performance counters (specifically 0x67 and 0x80-0x84), please let me know. I used to get segfaults and general protection faults (GPFs) until I wrote code similar to rdmsr_safe(). Even with that, I can never get a useful value, and the code just crashes when I make the call the non-"safe" way, as is done elsewhere in the kernel.
I don't care if it's a kernel module or not. I just want something that works and I am open to other suggestions.
This was Part 2 about CPU throttling. Check out Part 1 for more information. I also posted a newer article in which I zero in on the Intel Westmere processor as a possible cause. The issue has since been resolved.