Wednesday, December 29, 2010

Useful commands for Dell servers

Here are some commands that may come in handy on Dell systems.  I don't use OpenManage for anything because we had trouble with it a few years ago: installation/configuration problems and the occasional daemon chewing up CPU.  After figuring out how to do the same things with lower-level commands, we never bothered with it again.  Looking back, I would now guess the CPU usage was due to kipmi0 going out of control (which can sometimes be fixed by resetting the iDRAC/BMC or doing a virtual reseat).
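If you suspect the same kipmi0 problem, a quick check is to look at how much CPU the kipmi kernel threads are using.  The following is only a minimal sketch in Python, not something from the original troubleshooting: the function names and the 20% threshold are made up for illustration, and it assumes a Linux `ps` that accepts `-o pcpu= -o comm=`.

```python
import subprocess


def busy_kipmi_threads(ps_output, threshold=20.0):
    """Return (name, cpu_percent) pairs for kipmi* threads at or above threshold.

    ps_output is text with one "<pcpu> <comm>" pair per line.
    The default threshold is an arbitrary illustrative value.
    """
    busy = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        cpu_text, name = parts[0], parts[1].strip()
        try:
            cpu = float(cpu_text)
        except ValueError:
            continue  # not a number, e.g. an unexpected header row
        if name.startswith("kipmi") and cpu >= threshold:
            busy.append((name, cpu))
    return busy


def current_ps_output():
    """Run ps with empty column headers so only data rows are printed."""
    return subprocess.run(
        ["ps", "-e", "-o", "pcpu=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
```

Calling `busy_kipmi_threads(current_ps_output())` returns an empty list when nothing looks wrong.  Note that `ps` reports CPU averaged over the lifetime of the process, so a thread that only recently went out of control may be understated.  If kipmi0 does show up, one common way to reset the BMC from the host is `ipmitool mc reset cold`, though that is an assumption about method rather than a record of what was actually run here.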

Testing Throttled Intel Westmere X5650 CPUs in an HP blade

This is a continuation of our search for the cause of slowness in our Dell M610 blades with dual Intel X5650 Westmere processors.  (The issue has since been resolved.)  Please see the other articles I have written, especially Flaws with Intel Westmere X5650?  The other relevant articles are Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.
Fortunately we were able to borrow an HP blade to test with (ProLiant BL460c G6).  We swapped in our Westmere CPUs and our RAM.  We had the blade for a very limited time, so this testing was about as unscientific as you can get.  We did find some interesting results but I definitely do not consider them conclusive.

Saturday, December 4, 2010

Flaws with Intel Westmere X5650?

Update (Dec. 29, 2010): CPUs were tested in an HP blade
Update 2 (Jan. 12, 2011): Quantifying Our Westmere Turbo Mode Problems
MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which... see the "Contradictions" section in this post) did contribute to a mis-diagnosis. The issue has been fixed by a simple iDRAC firmware update. More info.

For months now we have been dealing with throttling and slow performance of our Dell PowerEdge M610 blades with Intel Xeon X5650 Westmere processors.  For background on this issue, please see previous articles I wrote about it: Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.  Reading Intel Uncore Performance Counters from User Space will also be useful reading.

First of all, I'm not posting this in an attempt to make a particular vendor look bad.  That is simply not my intention.  At the time of this posting our issues are unresolved and I am merely posting this so that people with similar issues can see if they are affected.  I am also soliciting feedback from anyone else with similar problems.

I'm fairly confident now that this is a problem with the processors themselves (and is possibly aggravated by a management controller somewhere that assumes the processor works as designed/documented?).  It's possible I'm misunderstanding some of the values I'm reading from the processor, but there is a very strong correlation with actual results from the Linpack benchmarks we run.  Hopefully the information below is clear enough to demonstrate why I came to the conclusions I did.  This is one of the most difficult issues I have had to tackle in a long time, so I hope the explanations make sense.  Please post a comment below if you have any thoughts, insights, criticism, etc.