We experienced several problems when we upgraded to Red Hat Enterprise Linux 6.2 from CentOS 5.4. A user of ours started reporting slowness on some of his larger HPC jobs. We looked at many things before noticing that one or more nodes would start swapping for no apparent reason. His job would only use about 60-70% of the memory on each node, but some nodes would inexplicably swap (diagnosed with vmstat). I talked to people at other universities and HPC sites and verified that a similar problem was occurring on their RHEL 6.2 installations.
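As a rough illustration of the kind of check vmstat makes possible, the sketch below compares two snapshots of /proc/vmstat and reports how many pages were swapped in and out per second in between. It is only a sketch, not the procedure used in the post: the function names, the sampling interval, and the zero threshold are assumptions chosen for the example. The pswpin and pswpout counters are the kernel's cumulative swap-in and swap-out page counts.

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) cumulative page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters["pswpin"], counters["pswpout"]


def swap_rates(before, after, interval_s):
    """Return (pages swapped in per second, pages swapped out per second)."""
    in0, out0 = parse_swap_counters(before)
    in1, out1 = parse_swap_counters(after)
    return (in1 - in0) / interval_s, (out1 - out0) / interval_s


def is_swapping(before, after, interval_s, threshold=0.0):
    """True if either swap rate exceeds the (assumed) threshold."""
    rate_in, rate_out = swap_rates(before, after, interval_s)
    return rate_in > threshold or rate_out > threshold


def sample_node(interval_s=5.0):
    """Take two snapshots of /proc/vmstat on the local node (Linux only)."""
    with open("/proc/vmstat") as f:
        before = f.read()
    time.sleep(interval_s)
    with open("/proc/vmstat") as f:
        after = f.read()
    return swap_rates(before, after, interval_s)
```

Calling sample_node() on each node of a job while it runs would show which nodes are swapping even though overall memory use looks modest.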
tech.ryancox.net
A technical blog by Ryan Cox
Tuesday, May 14, 2013
Wednesday, May 30, 2012
BMC: Enable SNMP Traps for Hardware Failures
You can enable SNMP traps to be sent from a BMC/iDRAC/Whatever using a few IPMI commands. This works on every Dell server I have ever tested but didn't work on HP systems I have tried. I imagine it would work on servers from most other vendors. I know there are vendor-specific tools to do this, but I prefer using industry standard protocols to administer systems and really don't like having to install a lot of extra vendor tools and daemons.
Thursday, May 10, 2012
Enable IPMI Over LAN from the OS using FreeIPMI
Maybe you just heard about how wonderful it is to control your hardware remotely. Maybe you forgot to configure IPMI Over LAN on a production system's BMC and you don't want to reboot. Fear not! Enabling IPMI Over LAN can (usually) be done from the OS using freeipmi.
Wednesday, May 9, 2012
freeipmi *-config tools primer
Basic usage of bmc-config, pef-config, and other freeipmi *-config tools
The documentation below should work for all freeipmi *-config commands. It only discusses how to connect to a BMC (including iDRAC, iLO, etc) then read from and modify its configuration. I picked bmc-config to demonstrate program usage, though the same parameters will work for all the related commands.
In some places I reference the "-f" parameter, which allows you to specify a filename. In newer versions it appears to still work but has been deprecated in favor of "-n".
Wednesday, November 2, 2011
BMC: Change Temperature Thresholds
This post shows how to update a server's Baseboard Management Controller (or iDRAC or maybe an iLO or something else) to power the server off at a different temperature threshold than the manufacturer default. This is done using ipmitool and freeipmi commands. We use it to lower the set points for some of our servers in a less-capable room that we have. The servers will then do a hard shutdown if the thermal threshold we set is reached.
Friday, March 4, 2011
Performance Problems Resolved
The performance problems that we were having have been resolved. Without making you read this whole thing to get to the conclusion:
- The problem was solved with an iDRAC firmware update provided by Dell (contact Dell to get the right version)
- The odds that you, the reader, are affected are extremely low (unless you have a PowerEdge M610 with dual Intel Xeon 5650 Westmere processors, six 4GB quad-ranked 1066 MHz DDR3 RDIMMs, no mezzanine cards, and only one 10K SAS hard disk. Even then, there may be another mitigating factor that ensures you aren't affected.)
- If you think you are affected, Dell support should be able to quickly tell you if that is the case or not
- We consider the issue to be resolved
Tuesday, February 15, 2011
Extrapolating Benchmark Scores Using MSRs on Intel CPUs
It turns out that you can get a very accurate estimate of Linpack benchmark scores by simply reading MSRs and comparing against a baseline score. I have only tried it on Westmere and Nehalem so far but achieved a minimum of 98.9% accuracy using this method. It's not good enough to submit to TOP500, but it can be useful when diagnosing problems on your systems.
I had hoped that this could be run in the background (properly niced) in a way that shouldn't impact any other users or processes on the system you are benchmarking. However, the problem with this method is that it only seems to be valid for whatever benchmark you are using. In this case I was using Linpack. I had hoped to try spiking all CPUs with something else for a few minutes and have it return the same numbers. Alas, it doesn't work as expected. My guess is that the memory also needs to be hammered at the same time, but that is something that should probably not be done while legitimate processing is happening.
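The excerpt above does not say which MSRs were read or how the comparison was made, so the following is only a guess at the general shape of such an estimate. It assumes the IA32_APERF (0xE8) and IA32_MPERF (0xE7) counters, read through the Linux msr driver at /dev/cpu/N/msr (root and the msr kernel module are required), and it assumes that the benchmark score scales linearly with the APERF/MPERF ratio measured while the benchmark is running. All function names are hypothetical.

```python
import os
import struct

IA32_MPERF = 0xE7  # counts at a fixed reference frequency while the core is active
IA32_APERF = 0xE8  # counts at the actual delivered frequency while the core is active


def read_msr(cpu, address):
    """Read one 64-bit MSR from the given logical CPU via the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def aperf_mperf_ratio(start, end):
    """Ratio of APERF to MPERF deltas; start and end are (aperf, mperf) tuples."""
    d_aperf = (end[0] - start[0]) % 2**64  # tolerate counter wraparound
    d_mperf = (end[1] - start[1]) % 2**64
    return d_aperf / d_mperf


def estimate_score(baseline_score, baseline_ratio, observed_ratio):
    """Scale a known-good score by the observed ratio (assumed linear relationship)."""
    return baseline_score * observed_ratio / baseline_ratio
```

A node whose ratio under load is 90% of the baseline node's ratio would, under this assumption, be estimated at 90% of the baseline score.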
Tuesday, January 11, 2011
Quantifying Our Westmere Turbo Mode Problems
MYSTERY SOLVED! (Jan 21, 2011): The throttling turned out not to be a processor flaw, though some minor bugs in the CPU or its documentation (not sure which) did contribute to a mis-diagnosis. I can't state what was causing it, but it was not a flaw with Westmere or turbo mode. The issue was fixed by a simple iDRAC firmware update. More info.
This is a continuation of our search to find the cause of slowness with our Dell M610 blades using dual Intel X5650 Westmere processors. Please see the other articles I have written, especially Flaws with Intel Westmere X5650? The other relevant articles are Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.
During a maintenance window we were able to benchmark all of our systems. We benchmarked our nodes multiple times with Intel's Turbo mode both enabled and disabled on the processors. The results showed that 18% of our blades had worse performance with Turbo mode enabled than with it disabled. That's 93 out of 515 blades included in the benchmark. These tests were done on Dell PowerEdge M610 blades with dual Intel X5650 Westmere processors (2.66 GHz) with 1066 MHz DDR3 DIMMs. The graph below shows the result of the benchmark run with turbo mode enabled.
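The comparison described above boils down to flagging every blade whose score with Turbo mode enabled is lower than its score with Turbo mode disabled. A minimal sketch of that bookkeeping follows; the node names and scores are made up for illustration, and only the 93-of-515 figure comes from the post.

```python
def slower_with_turbo(scores):
    """Return the nodes whose turbo-on score is below their turbo-off score.

    scores maps a node name to a (turbo_on_score, turbo_off_score) tuple.
    """
    return sorted(node for node, (on, off) in scores.items() if on < off)


def percent(part, total):
    """Express part as a percentage of total."""
    return 100.0 * part / total


# Hypothetical benchmark results, not real data from the cluster.
example_scores = {
    "node-a": (118.0, 121.5),
    "node-b": (125.0, 121.0),
    "node-c": (119.9, 120.0),
}

affected = slower_with_turbo(example_scores)
print(affected, f"{percent(len(affected), len(example_scores)):.0f}%")

# The figure quoted in the post: 93 of 515 blades.
print(f"{percent(93, 515):.1f}%")
```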
Wednesday, December 29, 2010
Useful commands for Dell servers
Here are some commands that may come in handy for Dell systems. I don't use OpenManage for anything because we had trouble with it a few years ago. After figuring out how to do the same things with lower-level commands, we never bothered with it again. The issues were due to installation/configuration problems and the occasional instance of a daemon chewing up CPU. Looking back I would now guess it was due to kipmi0 going out of control (can sometimes be fixed by a reset of the iDRAC/BMC or a virtual reseat).
Testing Throttled Intel Westmere X5650 CPUs in an HP blade
This is a continuation of our search to find the cause of slowness with our Dell M610 blades using dual Intel X5650 Westmere processors. (It has since been resolved). Please see the other articles I have written, especially Flaws with Intel Westmere X5650? The other relevant articles are Diagnosing Throttled or "Slow" Systems (Processors to be Precise) and Diagnosing Throttled Processors - Part 2.
Fortunately we were able to borrow an HP blade to test with (ProLiant BL460c G6). We swapped in our Westmere CPUs and our RAM. We had the blade for a very limited time, so this testing was about as unscientific as you can get. We did find some interesting results but I definitely do not consider them conclusive.
