Wednesday, November 24, 2010

Reading Intel Uncore Performance Counters from User Space

This article came about as a result of trying to resolve a systemic CPU throttling issue on a lot of blades.  Please see Part 1 and Part 2 of that series of articles for more information.  The issue has since been resolved.

I finally figured out how to query performance counters from user space.  It can be done with rdmsr and was quite complicated to figure out.  I really wish I had gotten paper copies of Intel's Software Developer's Manuals Volumes 3A and 3B.  Cross-referencing dozens of pages in PDFs is not my idea of fun but it worked.  If you ever want to do something like this, I recommend dead-tree manuals.

My objective was to read values such as UNC_DRAM_THERMAL_THROTTLED, UNC_THERMAL_THROTTLING_TEMP.CORE_[0-3], UNC_THERMAL_THROTTLED_TEMP.CORE_[0-3], UNC_PROCHOT_ASSERTION, and UNC_THERMAL_THROTTLING_PROCHOT.CORE_[0-3].  My main goal was to see if the DIMMs were throttled, since I assumed the IA32_THERM_STATUS MSR (0x19c) showed CPU core throttling well enough (see Part 2).  The problem was how to do this.

Intel processors have an "uncore" that can be monitored.  The uncore is the processor package and its components, not including the cores, but it does include components shared between cores.  There is probably a better definition available but that is the gist of it.  The MSRs I referenced prior to this article are per core but the new values I want to read are per "uncore", meaning two per blade on our two socket blades.  There are per core counters as well but those aren't what I was interested in.
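
Since the uncore counters are per socket, it helps to know one logical CPU on each socket before touching any MSRs.  Here is a minimal sketch of one way to find them, assuming the Linux sysfs topology files under /sys/devices/system/cpu are present.  The function name and its optional directory argument are my own for illustration and are not part of any tool.

```shell
# Print one logical CPU number per physical package (socket).
# The optional argument is the sysfs CPU directory; it exists only so the
# function can be pointed at a test tree (an assumption of this sketch).
one_cpu_per_socket() {
    base=${1:-/sys/devices/system/cpu}
    seen=" "
    for dir in "$base"/cpu[0-9]*; do
        id_file=$dir/topology/physical_package_id
        [ -r "$id_file" ] || continue
        pkg=$(cat "$id_file")
        case $seen in
            *" $pkg "*) ;;                 # this socket is already covered
            *) seen="$seen$pkg "
               echo "${dir##*/cpu}" ;;     # e.g. .../cpu12 -> 12
        esac
    done
}
```

On a two socket blade this should print two CPU numbers, which can then be passed to rdmsr and wrmsr with -p.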

I read on many forums and mailing lists that performance counters _must_ be read from ring 0, so I started writing a kernel module to call rdpmc.  Fortunately I figured out that the performance counters can be read like standard MSRs.  Reading MSRs with rdmsr (which reads from /dev/cpu/$cpu/msr in Linux) allows all of this to happen without the need to write kernel code.  It also turns out my understanding of performance counters was skewed anyway.  Lots of manual reading later I finally had a clue.

One thing that threw me off a lot was figuring out which locations to read from.  Read Volume 3B, Chapter 30 very thoroughly if you are just starting off.  Be sure to distinguish between architectural, non-architectural, core, uncore, etc.  For what I wanted to accomplish, the uncore values were needed.

Steps to Read the Counters
Here is what needs to happen for each socket when reading uncore values:
  • Determine how many general-purpose performance counters are available on the architecture (it's possible but I haven't bothered with it yet programmatically since I know the current number is eight).
  • Pick an unused performance counter, currently numbered 0-7 (hereafter referred to as $PMCNUM).
  • Configure the counter by writing a value to MSR_UNCORE_PERFEVTSEL$PMCNUM (e.g. MSR_UNCORE_PERFEVTSEL2).  The value to write is discussed below under section MSR_UNCORE_PERFEVTSEL Details.
  • Enable counter $PMCNUM by writing a "1" to MSR_UNCORE_PERF_GLOBAL_CTRL (0x391) in the relevant location: 1<<$PMCNUM. Be sure to OR with the current value so as not to disable current performance counters.
  • Read from the performance counter at 0x3b0 + $PMCNUM.  Addresses increment by 1 starting at 0x3b0 (MSR_UNCORE_PMC0).
  • Remember to do each of the above for each socket (including configuring, enabling, and reading).
  • Enjoy

For more information about addresses, see Table B-6 in Volume 3B, Appendix B.4.1 (p. 750).
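
The "OR with the current value" part of the enable step is easy to get wrong, so here is a small sketch of just that arithmetic.  The function name is hypothetical and the function only computes the new value; it does not touch any MSR.

```shell
# Return CURRENT with the enable bit for counter PMCNUM set.
# Usage: global_ctrl_with_counter CURRENT PMCNUM
global_ctrl_with_counter() {
    printf '0x%x\n' $(( $1 | (1 << $2) ))
}

# Intended use on real hardware (needs root and the msr module):
#   cur=$(rdmsr -c -p0 0x391)
#   wrmsr -p0 0x391 "$(global_ctrl_with_counter "$cur" 2)"
```

Counters that are already enabled stay enabled because their bits are carried over from the current value.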

MSR_UNCORE_PERFEVTSEL Details
The address for each of these performance counter configuration registers starts at 0x3c0 (MSR_UNCORE_PERFEVTSEL0) and ends at 0x3c7 (MSR_UNCORE_PERFEVTSEL7) in Nehalem and newer.  Each of these configuration registers has a corresponding counter from 0x3b0 (MSR_UNCORE_PMC0) to 0x3b7 (MSR_UNCORE_PMC7).  Each also corresponds to a bit in MSR_UNCORE_PERF_GLOBAL_CTRL (0x391) that enables the counter (1<<$PMCNUM).  Each corresponds to a few other registers as well but I am ignoring those for now.
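
The address layout can be written down as two tiny helpers.  This is only a sketch of the arithmetic using the base addresses given above; the function names are made up for this example.

```shell
# Map a counter number (0-7) to its uncore MSR addresses.
valid_pmcnum() { [ "$1" -ge 0 ] && [ "$1" -le 7 ]; }

# Address of MSR_UNCORE_PERFEVTSEL$1 (base 0x3c0)
uncore_perfevtsel_addr() {
    valid_pmcnum "$1" || return 1
    printf '0x%x\n' $(( 0x3c0 + $1 ))
}

# Address of MSR_UNCORE_PMC$1 (base 0x3b0)
uncore_pmc_addr() {
    valid_pmcnum "$1" || return 1
    printf '0x%x\n' $(( 0x3b0 + $1 ))
}
```

For counter 2 these give the configuration register address and the counter address that would be passed to wrmsr and rdmsr respectively.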

I have found that the Intel manual describes the layout of these registers best in Volume 3B, Section "Uncore Performance Event Configuration Facility".  I have chosen to enable bits 17 (OCC_CTR_RST: reset the counter when writing this value) and 22 (enable: very useful).  Bit 18 (Edge Detect) is also available, but it is not part of the value used here; setting it as well would give 0x460000.  If I ever write kernel code to do this, I will have it set bit 20 to send an interrupt when the counter overflows.  The result of bits 17 and 22 OR'd together is 0x420000.
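
The flag value can be double-checked with shell arithmetic.  The variable names below are my own shorthand for the bits mentioned above, not names taken from the Intel manual (apart from OCC_CTR_RST).

```shell
# Bit positions in MSR_UNCORE_PERFEVTSELx, as described in the text above.
OCC_CTR_RST=$(( 1 << 17 ))      # reset the counter when the value is written
EDGE_DETECT=$(( 1 << 18 ))
PMI_ON_OVERFLOW=$(( 1 << 20 ))  # interrupt on overflow (not used here)
ENABLE=$(( 1 << 22 ))

# Bits 17 and 22 only: the flag value used in this article.
flags_basic()     { printf '0x%x\n' $(( OCC_CTR_RST | ENABLE )); }
# The same with bit 18 (Edge Detect) added, for comparison.
flags_with_edge() { printf '0x%x\n' $(( OCC_CTR_RST | EDGE_DETECT | ENABLE )); }
```

Calling flags_basic gives 0x420000, while flags_with_edge gives 0x460000, so 0x420000 contains bits 17 and 22 but not bit 18.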

The other necessary values are the event number and the "umask" value.  These are available in Volume 3B, Appendix A (A.3, Table A-5 for Westmere).  The "Event Num" occupies bits 0-7 and the "Umask value" occupies bits 8-15.

The end result of the value is: (0x420000 | (UmaskValue<<8) | EventNum).  For example, UNC_THERMAL_THROTTLING_PROCHOT.CORE_0 would result in 0x420183.  This value can then be written (assuming $PMCNUM is 2) using: wrmsr -p$cpunum 0x3c2 0x420183.  If you read this register after writing, be aware that bit 17 (OCC_CTR_RST) always reads "0" since the reset occurred at write-time.  Thus you will see 0x400183 when reading the value.
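
As a sanity check, the formula and the read-back behavior can be sketched in a few lines of shell.  The function names are hypothetical, and the code is pure arithmetic that never touches an MSR.

```shell
# Build the value to write to MSR_UNCORE_PERFEVTSELx.
# Usage: perfevtsel_value EVENTNUM UMASK
perfevtsel_value() {
    printf '0x%x\n' $(( 0x420000 | ($2 << 8) | $1 ))
}

# What a later read of the register should show: bit 17 reads back as 0.
perfevtsel_readback() {
    printf '0x%x\n' $(( $1 & ~(1 << 17) ))
}

# Split a value back into its event number (bits 0-7) and umask (bits 8-15).
perfevtsel_event() { printf '0x%x\n' $(( $1 & 0xff )); }
perfevtsel_umask() { printf '0x%x\n' $(( ($1 >> 8) & 0xff )); }
```

For event 0x83 with umask 0x01, perfevtsel_value returns 0x420183 and perfevtsel_readback of that value returns 0x400183, matching the numbers above.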

Make sure msr.ko is loaded or compiled into the kernel.  Some values in the code below need to be changed if you're using something other than the "uncore" values.  Look at the heading of the table you are using in Appendix A to get a clue as to what the values should be.  The values I have used are under Non-Architectural Performance Events In the Processor Uncore for Next Generation Intel Processor (Intel microarchitecture codename Westmere).  Yours might be different depending on what you want to read.



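# MSR addresses and flag bits as described in this article
MSR_UNCORE_PERF_GLOBAL_CTRL=0x391
MSR_UNCORE_PMC0=0x3b0
MSR_UNCORE_PERFEVTSEL0=0x3c0
FLAGS=0x420000 # bit 17 (OCC_CTR_RST) | bit 22 (enable)

PMCNUM=0 # an available performance counter (0-7). one of these days I will automate the checking of availability
EVENTNUM=0x82 # UNC_PROCHOT_ASSERTION event number (see Intel SDM Vol. 3B, Appendix A)
UMASK=0x01 # umask for the event; this value is an assumption, confirm it against Table A-5

# quick-and-dirty. 0 and 1 are on different sockets in our configuration
for cpu in 0 1; do
    # configure the counter at MSR_UNCORE_PERFEVTSEL$PMCNUM
    wrmsr -p$cpu $((MSR_UNCORE_PERFEVTSEL0+PMCNUM)) $((FLAGS|(UMASK<<8)|EVENTNUM))

    # enable the counter at MSR_UNCORE_PMC$PMCNUM
    wrmsr -p$cpu $MSR_UNCORE_PERF_GLOBAL_CTRL $(($(rdmsr -c -p$cpu $MSR_UNCORE_PERF_GLOBAL_CTRL)|(1<<PMCNUM)))
done

# watch the counter increment or add your own code to do something useful
watch -n 5 "for cpu in 0 1; do rdmsr -p\$cpu -c $((MSR_UNCORE_PMC0+PMCNUM)); done"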

Note: If you see any typos or mistakes in this article, please let me know.  There's a whole lot of variables and hex digits in this article so it is quite possible I made a mistake or two.


  1. Thanks for the post. I am just learning about using rdmsr and wrmsr to adjust the CPU clock rate for under-clocking (for power management). It is great to see real examples.

  2. Charles,
    Thanks for the comment. I've been meaning to add an additional example or two at some point, though it probably wouldn't be too much more useful than what's already posted. Good luck with the under-clocking

  3. just found this. awesome. I've been struggling with similar issues for ~1/2 yr. I found the following to be very nice:
    but I remain interested in the low level details you go into in this post. Thank you!!

