Tuesday, February 15, 2011

Extrapolating Benchmark Scores Using MSRs on Intel CPUs

It turns out that you can get a very accurate estimate of Linpack benchmark scores by simply reading two MSRs (model-specific registers) and comparing against a baseline score.  I have only tried it on Westmere and Nehalem so far, but the estimate was always within 1.1% of the measured score.  That's not good enough to submit to TOP500, but it can be useful when diagnosing problems on your systems.

I had hoped that this could be run in the background (properly niced) so that it wouldn't impact other users or processes on the system you are checking.  The problem is that the ratio only seems to predict the score when it is measured under the same benchmark that produced the baseline, which in this case was Linpack.  I had hoped that spiking all CPUs with something else for a few minutes would return the same numbers.  Alas, it doesn't.  My guess is that the memory also needs to be hammered at the same time, but that is something that should probably not be done while legitimate processing is happening.

Preliminary benchmarks do need to be run on a few systems to establish a baseline score.  The best approach to establishing that baseline is to disable Turbo mode and run Linpack on several systems with identical hardware.  Identical means no variation in CPU model, core count, DIMM type, DIMM speed, DIMM size, BIOS settings, etc.  The number of DIMM ranks could conceivably have an effect as well.

The GFLOPS number that is obtained should be very similar between all tested systems.  Take the mean or median of the scores (both should be about the same) and use that as the baseline score for that particular hardware configuration.  The next step is to query information from the Intel CPU cores in the system.
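Computing the baseline from a handful of Linpack results can be sketched roughly like this.  The function name baseline_stats and the one-score-per-line input format are my own assumptions for illustration, not part of any Linpack tooling.

```shell
# baseline_stats: read GFLOPS scores from stdin (one per line) and
# print their mean and median. Hypothetical helper for illustration.
baseline_stats() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) { print "no scores given" > "/dev/stderr"; exit 1 }
            # Input is sorted, so the median is the middle value
            # (or the average of the two middle values).
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "mean=%.2f median=%.2f\n", sum / NR, median
        }'
}
```

For example, `printf '%s\n' 110.0 112.0 120.0 | baseline_stats` (made-up scores) prints `mean=114.00 median=112.00`.  A gap that large between the two would suggest the systems are not really identical.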

Intel cores contain two counters that we are interested in reading.  They provide a ratio that describes the performance of a logical processor.  The first value is IA32_MPERF (0xe7) which "increments in proportion to a fixed frequency, which is configured when the processor is booted."  Think of this number as representing 2.8 GHz on a 2.8 GHz processor.  The second MSR is IA32_APERF (0xe8) which "increments in proportion to actual performance, while accounting for hardware coordination of P-state and TM1/TM2; or software initiated throttling."

The performance ratio is found by calculating IA32_APERF / IA32_MPERF.  Intel's documentation states "Only the IA32_APERF/IA32_MPERF ratio is architecturally defined; software should not attach meaning to the content of the individual of IA32_APERF or IA32_MPERF MSRs."

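It is a good idea to set both MSRs to "0" and then wait a while before reading them again (the script at the end of this post waits 60 seconds), so that the ratio covers only the interval you are measuring.  The CPUs (and most likely memory) should also be fully loaded during that interval to achieve accurate results.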
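If you would rather not write to the MSRs, one alternative (not what the script at the end of this post does) is to read both counters twice and divide the differences.  Here is a minimal sketch, assuming rdmsr from msr-tools, root access, and the msr kernel module; perf_ratio and sample_cpu are names I made up, and readings that span a counter wrap are simply rejected.

```shell
# perf_ratio A0 M0 A1 M1: ratio of the APERF delta to the MPERF delta.
# Subtraction uses shell integer arithmetic; awk does the division.
perf_ratio() {
    local da=$(( $3 - $1 )) dm=$(( $4 - $2 ))
    if [ "$dm" -le 0 ] || [ "$da" -lt 0 ]; then
        echo "counters wrapped or did not advance" >&2
        return 1
    fi
    awk -v a="$da" -v m="$dm" 'BEGIN { printf "%.5f\n", a / m }'
}

# sample_cpu CPU [SECONDS]: read both counters, wait, read them again.
# 0xe8 is IA32_APERF and 0xe7 is IA32_MPERF.
sample_cpu() {
    local cpu=$1 secs=${2:-60} a0 m0 a1 m1
    a0=$(rdmsr -p"$cpu" -d 0xe8); m0=$(rdmsr -p"$cpu" -d 0xe7)
    sleep "$secs"
    a1=$(rdmsr -p"$cpu" -d 0xe8); m1=$(rdmsr -p"$cpu" -d 0xe7)
    perf_ratio "$a0" "$m0" "$a1" "$m1"
}
```

Calling `sample_cpu 0 60` would print the ratio for logical CPU 0 over a 60 second window.  The two counters are not read atomically, so very short windows will be noisier.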

By multiplying the baseline score by the performance ratio, you can estimate the score that system would achieve if you actually ran the benchmark on it.  In thousands of tests on several hundred servers, the least accurate estimate was only 1.1% off from the actual score.

For more information on these MSRs, see volumes 3A and 3B from Intel's Software Developer's Manuals.  Another post I wrote may also be instructive: Reading Intel Uncore Performance Counters from User Space.

Actual results in one configuration

Configuration:  Dell PowerEdge M610 with dual 2.66 GHz Intel Xeon X5650 hex-core processors and 6 quad-ranked 4GB 1066MHz DDR3 DIMMs (24GB).
Baseline: 113.2 GFLOPS
Maximum observed: 120.8 GFLOPS (1.07346 performance ratio)
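
As a worked example of the multiplication, here is a small sketch using the numbers above.  It assumes that 120.8 GFLOPS is a measured Linpack score and that 1.07346 is the ratio read during that same run; the function names are hypothetical.

```shell
# estimate_gflops BASELINE RATIO: extrapolated score = baseline * ratio.
estimate_gflops() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.1f\n", b * r }'
}

# percent_error ESTIMATE ACTUAL: absolute difference as a percentage
# of the actual score.
percent_error() {
    awk -v e="$1" -v a="$2" 'BEGIN {
        d = e - a
        if (d < 0) d = -d
        printf "%.2f\n", 100 * d / a
    }'
}

estimate_gflops 113.2 1.07346   # prints 121.5
percent_error 121.5 120.8       # prints 0.58
```

Under those assumptions the estimate is about 0.6% above the measured score.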

Conclusion

Using the performance counters can give you a very accurate idea of how your system is performing compared to a baseline.  Even without a baseline, you can still know that turbo mode is helping you out if the performance ratio is greater than 1.0.  It may be worth looking at your typical application to see what ratio it normally gets when under 100% load and compare it to other systems.

This number doesn't seem all that practical to use, unfortunately.  I'm sure there's a use case somewhere, but I can't think of a huge one.  The one I have been considering is periodically scheduling a very brief test (maybe two minutes, while the node is empty of batch jobs) on each compute node that starts up Linpack and looks at the ratio.  That would let us know if the system is performing well or has degraded performance.  The ratio will also show if the CPU is scaled back to conserve power during periods of low load or to lower the temperature.
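
The decision step of such a periodic check might look something like the sketch below.  The classify_ratio name and the default 0.98 cutoff are placeholders I picked for illustration; a real threshold would have to come from what your own healthy nodes report under Linpack.

```shell
# classify_ratio RATIO [MIN_OK]: label a measured APERF/MPERF ratio.
# MIN_OK defaults to 0.98, which is an arbitrary placeholder.
# Exits non-zero when the ratio is below the cutoff, so a scheduler
# hook could act on the return code.
classify_ratio() {
    awk -v r="$1" -v min="${2:-0.98}" 'BEGIN {
        if (r + 0 < min + 0) { print "DEGRADED"; exit 1 }
        if (r + 0 > 1.0) print "OK (turbo active)"
        else print "OK"
    }'
}
```

For example, `classify_ratio 1.07346` prints `OK (turbo active)`, while `classify_ratio 0.91` prints `DEGRADED` and returns a non-zero status.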

The performance ratio will tell you if turbo mode is in use or not.  That will be indicated by a ratio above 1.0.

One must also keep in mind that not all parts were manufactured equally.  Some variation should be expected between processors.


Code to calculate the performance ratio (that might look a little ugly...)

This example is in bash (as are most of my quick-and-dirty scripts).  If you even partially understood what this post is about, you have the technical skills to rewrite this cleanly in your language of choice.  This code uses wrmsr and rdmsr from msr-tools.  It calculates the performance ratio over a 60 second period.

#!/bin/bash
# Requires root, the msr kernel module (modprobe msr), and msr-tools.

# Zero both counters on every logical CPU.
for a in /dev/cpu/[0-9]*
do
    cpu=$(basename "$a")
    wrmsr -p"$cpu" 0xe8 0   # IA32_APERF
    wrmsr -p"$cpu" 0xe7 0   # IA32_MPERF
done

# Let the counters accumulate over the measurement interval.
sleep 60

# Compute APERF/MPERF for each logical CPU and print the lowest ratio.
for a in /dev/cpu/[0-9]*
do
    cpu=$(basename "$a")
    echo "scale=5; $(rdmsr -p"$cpu" -d 0xe8) / $(rdmsr -p"$cpu" -d 0xe7)" | bc
done | sort -n | head -n 1
