We ran into several problems when we moved from CentOS 5.4 to Red Hat Enterprise Linux 6.2. One of our users started reporting slowness on some of his larger HPC jobs. After chasing many possible causes, we noticed that one or more nodes would start swapping for no apparent reason. His job used only about 60-70% of the memory on each node, yet some nodes would inexplicably swap (diagnosed with vmstat). I talked to people at other universities and HPC sites and confirmed that a similar problem was occurring on their RHEL 6.2 installations.
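For anyone chasing similar symptoms across many nodes, the swap activity we watched in vmstat can also be sampled from a script. The sketch below is not what we ran; it is a minimal illustration that assumes the cumulative `pswpin` and `pswpout` counters in `/proc/vmstat` (which, as far as I know, are the source of vmstat's si/so columns). The function names and the default five-second interval are my own choices.

```python
import time


def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two samples."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


def is_swapping(delta, threshold=0):
    """True if more than `threshold` pages moved in either direction."""
    return delta["pswpin"] > threshold or delta["pswpout"] > threshold


def sample_swap_activity(interval=5.0, path="/proc/vmstat"):
    """Read the counters twice, `interval` seconds apart, and return the delta."""
    with open(path) as f:
        before = parse_swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_swap_counters(f.read())
    return swap_delta(before, after)
```

Calling `is_swapping(sample_swap_activity())` on each compute node while a job is running gives a quick yes/no answer that is easy to collect from a whole cluster.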
We also looked at whether the job was overloading a single NUMA memory node on the affected hosts (these are dual-socket servers with Intel Sandy Bridge processors). That didn't appear to be the case. Many other ideas also came up short.
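One way to check for a lopsided NUMA node is to compare per-node memory usage. The following is a rough sketch rather than the exact check we performed. It assumes the per-node files under `/sys/devices/system/node/node*/meminfo`, whose lines look like `Node 0 MemFree: 1234 kB`. The helper names are hypothetical.

```python
import glob
import re

# Matches lines such as "Node 0 MemTotal:       16309764 kB".
_MEMINFO_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s+kB")


def parse_node_meminfo(text):
    """Return {field: kB} for one node's meminfo text."""
    return {m.group(2): int(m.group(3)) for m in _MEMINFO_LINE.finditer(text)}


def node_used_fraction(fields):
    """Fraction of the node's memory that is not free.

    Note: MemFree excludes page cache, so cached file data counts as "used".
    """
    return 1.0 - fields["MemFree"] / fields["MemTotal"]


def read_all_nodes(pattern="/sys/devices/system/node/node*/meminfo"):
    """Map each meminfo path to that node's used fraction."""
    usage = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            usage[path] = node_used_fraction(parse_node_meminfo(f.read()))
    return usage
```

If one node sits near 1.0 while the other has plenty of free memory, the job may be exhausting a single socket's local memory. On our hosts the nodes looked balanced.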
Finally, a counterpart elsewhere recommended that we try RHEL 6.4's kernel. We had opened a support case with Red Hat that wasn't making progress, so we decided to try it and see what happened. We upgraded to RHEL 6.4's kernel while keeping the base 6.2 image. All of our problems were solved.
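When the kernel is newer than the base image, it is easy to lose track of which nodes have actually been rebooted into it. Here is a small sketch for checking the running kernel release. It assumes RHEL-style release strings such as `2.6.32-358.el6.x86_64`, and the `RHEL_6_4_GA` threshold reflects my understanding that the original RHEL 6.4 kernel was build 358, so verify that against your own packages.

```python
import platform
import re


def kernel_build(release):
    """Parse '2.6.32-358.el6.x86_64' into (2, 6, 32, 358); None if unrecognized."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        return None
    return tuple(int(group) for group in m.groups())


def at_least(release, minimum):
    """True if the release string parses and is >= the minimum tuple."""
    parsed = kernel_build(release)
    return parsed is not None and parsed >= minimum


# Assumption: RHEL 6.4 shipped with kernel 2.6.32-358; adjust if yours differs.
RHEL_6_4_GA = (2, 6, 32, 358)


def running_kernel_is_new_enough(minimum=RHEL_6_4_GA):
    """Check the kernel this host is currently running."""
    return at_least(platform.release(), minimum)
```

Running `running_kernel_is_new_enough()` on every node after the upgrade catches machines that are still on the older kernel.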
Unfortunately, we don't know what the problem was or whether RHEL 6.3's kernel would have fixed it (probably not, according to this particular counterpart of ours). We also made two changes at the same time: new Sandy Bridge hardware and RHEL 6.x. The problem may have been a combination of the two.
I've heard some rumors about what the real cause of the problems was, but I'll decline to pass them along to the Internet at large until I can link to a public posting of some kind.
Consider this a public service announcement: I recommend trying RHEL 6.4's kernel rather than one from an earlier 6.x release.