Friday, March 4, 2011

Performance Problems Resolved

The performance problems that we were having have been resolved.  So that you don't have to read the whole post to get to the conclusion:
  • The problem was solved with an iDRAC firmware update provided by Dell (contact Dell to get the right version)
  • The odds that you, the reader, are affected are extremely low (unless you have a PowerEdge M610 with dual Intel Xeon 5650 Westmere processors, six 4GB quad-ranked 1066 MHz DDR3 RDIMMs, no mezzanine cards, and only one 10K SAS hard disk. Even then, there may be another mitigating factor that ensures you aren't affected.)
  • If you think you are affected, Dell support should be able to quickly tell you whether that is the case
  • We consider the issue to be resolved

The initial blaming of Westmere

We had initially thought that Intel processors were to blame due to a Westmere processor or Westmere documentation flaw (see "Contradictions" in Flaws with Intel Westmere X5650?).  The processor kept claiming that it was throttled due to being over temperature.  This was definitely not true, which seemed to indicate a processor flaw.  As far as I could determine, the possible explanations were that (a) the processor was flawed in that it throttled itself because it thought it was over temperature when it was not, (b) the processor was throttled for other reasons but reported the cause as temperature, or (c) the documentation was simply incorrect for the values I read.  We have ruled out (a).  If (c) is false, then (b) must be true.  If (c) is true, (b) may or may not also be true.  In simpler terms, there is a minor bug in the processor's performance counters and/or in its documentation.  That bug only seems to affect statistics and not function (which I wish I had known about months ago...).

So there was a flaw with Westmere (or the documentation for Westmere).  It just turned out to be a minor bug that made us think the other problem was also a Westmere problem.  As of now, I haven't gotten the bug confirmed, but I have seen nothing that even hints that my conclusion is wrong.
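
For readers who want to check whether their own processors claim to be thermally throttled, here is a minimal sketch.  It assumes a Linux host whose kernel exposes the thermal_throttle counters under /sys/devices/system/cpu, which is not necessarily how the values discussed above were read.  The function names read_throttle_counts and throttle_delta are made up for this illustration.

```python
import glob
import os
import re


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_number: {"core": n, "package": n}} for CPUs that expose counters.

    Assumption: the kernel provides cpuN/thermal_throttle/core_throttle_count
    and package_throttle_count. CPUs without these files are skipped.
    """
    counts = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        if not match:
            continue
        entry = {}
        for kind in ("core", "package"):
            path = os.path.join(
                cpu_dir, "thermal_throttle", f"{kind}_throttle_count"
            )
            try:
                with open(path) as fh:
                    entry[kind] = int(fh.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this CPU; skip it.
                continue
        if entry:
            counts[int(match.group(1))] = entry
    return counts


def throttle_delta(before, after):
    """Return only the counters that increased between two snapshots."""
    delta = {}
    for cpu, kinds in after.items():
        changed = {}
        for kind, value in kinds.items():
            previous = before.get(cpu, {}).get(kind, 0)
            if value > previous:
                changed[kind] = value - previous
        if changed:
            delta[cpu] = changed
    return delta
```

Take one snapshot with read_throttle_counts(), run a workload, take a second snapshot, and pass both to throttle_delta().  Given the counter or documentation bug described above, a rising count is a hint rather than proof, so it is worth comparing against an actual benchmark.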

Fixed with a firmware upgrade

Without going into too much detail, there was a problem in how our very specific hardware configuration interacted with the iDRAC.  The bug was in the iDRAC firmware we were running at the time.  Dell has since fixed it and provided updated firmware.  We hit a very small corner case which they now know to check for.

The problem is now completely fixed as far as we can tell.  If you think you are experiencing problems due to this same issue, contact Dell support.  They should be able to quickly determine if you are affected and get you the right iDRAC build.  It is extremely unlikely that you are affected by this.

4 comments:

  1. Can you please provide the firmware versions for the iDRAC? We had the issue 8 weeks ago and we updated then to 3.10 for the CMC and 3.02 for the iDRAC.

    Can you please update me at walid"AT"melinux"DOT"com

  2. I have a similar issue with M1000e's and both M600s and M610s. I found that many of our nodes get throttled by the iDRAC. Are you positive that you aren't still getting throttled and don't know it?

    I found I can only see the throttling by doing dumplogs or benchmarking.

    Replies
    1. I actually have seen a few more instances of it recently but I'm pretty sure it's not from the same issue. The original issue was due to a particular bug in the iDRAC firmware that is resolved in custom builds (or "ICS builds" as Dell calls them). It sounds like that particular issue will require talking to Dell to get custom iDRAC builds. One thing to note is that I have never experienced throttling with an M600.

      The current problems we have had with throttling are due to a different bug in either the iDRAC or CMC. I have found that either a virtual reseat of the blade, a racreset on the CMC (_should_ be safe for other blades), or a combination of the two will resolve the throttling.

      If either or both of those steps solve the throttling, it is almost certainly a different problem than the one that the custom firmware fixed for us. I'd be interested in knowing if those two steps fix the problem for you or if this is something else.

  3. so much suffering and such a simple solution


Please leave any comments, questions, or suggestions below. If you find a better approach than what I have documented in my posts, please list that as well. I also enjoy hearing when my posts are beneficial to others.