Friday, August 29, 2014

Fair Tree Slurm Fairshare Algorithm

That's right.  Levi Morrison and I created a second Slurm fairshare algorithm, Fair Tree.  Our first algorithm, LEVEL_BASED, was accepted into Slurm and became available in 14.11.0pre3 about one month ago.

When given the same inputs, both algorithms produce effectively equivalent outputs.  The objective of both algorithms is the same:  If accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.
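To make that objective concrete, here is a toy sketch (my own illustration in Python, not the Slurm source) of a depth-first, best-sibling-first ranking that has this property.  The account tree, shares, and usage numbers are invented.

```python
# Toy illustration of the sibling-ordering property (not Slurm code):
# rank siblings at each level by shares/usage and recurse best-first,
# so every user under a better sibling outranks every user under a worse one.
class Node:
    def __init__(self, name, shares, usage, children=None):
        self.name, self.shares, self.usage = name, shares, usage
        self.children = children or []

def rank_users(node, out):
    """Depth-first, best sibling first: all descendants of a higher-ranked
    sibling are appended before any descendant of a lower-ranked sibling."""
    if not node.children:                 # a leaf is a user
        out.append(node.name)
        return
    for child in sorted(node.children,
                        key=lambda c: c.shares / max(c.usage, 1e-9),
                        reverse=True):
        rank_users(child, out)

root = Node("root", 1.0, 1.0, [
    Node("A", 0.5, 0.10, [Node("a1", 0.5, 0.08), Node("a2", 0.5, 0.02)]),  # light usage
    Node("B", 0.5, 0.90, [Node("b1", 0.5, 0.40), Node("b2", 0.5, 0.50)]),  # heavy usage
])
order = []
rank_users(root, order)
print(order)   # ['a2', 'a1', 'b1', 'b2'] -- A's users all outrank B's users
```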

So why bother writing a new algorithm three months after the first one if the first algorithm successfully solved the same problems?

Friday, June 20, 2014

LEVEL_BASED Slurm Prioritization Algorithm

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  To see why it is necessary, please see my other post about the problems with algorithms that existed at the time of its creation.

DEPRECATED:  This has been deprecated by our new algorithm, Fair Tree.  Yes, we really did replace this algorithm within a few months even though it worked great for us.  See the post about Fair Tree for details.


Problems with Existing Slurm Prioritization Methods

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  In order to understand why LEVEL_BASED is necessary, I have chosen to write this post about our issues with the existing options and a separate post about LEVEL_BASED.  If you just want to see information about LEVEL_BASED, see the post LEVEL_BASED Slurm Prioritization Algorithm.

We want users from an account that has higher usage to have lower priority than users from an account with lower usage.  There is no existing algorithm that consistently does this.
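As a toy illustration of that requirement (this is not Slurm code; the accounts and numbers are invented), sorting primarily by the account-level fairshare factor and using the user-level factor only to break ties within an account gives exactly this behavior:

```python
# Invented example: the account's fairshare factor dominates; a user's own
# factor only matters relative to other users in the same account.
users = [
    # (account, account_fairshare, user, user_fairshare)
    ("physics",   0.80, "alice", 0.10),
    ("physics",   0.80, "bob",   0.90),
    ("chemistry", 0.30, "carol", 0.99),   # great personal factor, heavily used account
]
for account, acct_fs, user, user_fs in sorted(users,
                                              key=lambda u: (u[1], u[3]),
                                              reverse=True):
    print(f"{user:5s} ({account:9s}) account_fs={acct_fs:.2f} user_fs={user_fs:.2f}")
# bob and alice (physics) both outrank carol (chemistry), even though carol's
# personal fairshare factor is the highest of the three.
```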

Thursday, May 22, 2014

Job Script Generator for Slurm and PBS published on Github

We published version 2.0 of our batch job script generator on Github.  It is a JavaScript library (LGPLv3) that allows users to learn Slurm and PBS syntax by testing various inputs in an easy-to-understand manner.  Links: git repo, demo, other github projects of ours.

Thursday, April 17, 2014

Scheduler Limit: Remaining Cputime Per User/Account

Update May 9, 2014:  Added a link to our GrpCPURunMins Visualizer

I have discussed this with several interested people recently so it's time for me to write it up.  When running an HPC batch job scheduling system such as Slurm, Moab, Maui, or LSF, there are many ways to configure user limits.  Some of the easiest limits to understand are on the number of jobs a user can run or the maximum cores or nodes that they can use.  We have used a different limit for several years now that is worth sharing.

No one likes to see a cluster that is 60% utilized while users sit in the queue, unable to run because of a core count limit they are hitting.  A site with no user limits has the opposite problem: only the lucky user himself enjoys being able to fill up a cluster with $MAX_WALLTIME day jobs during a brief lull in usage.  Other users are understandably displeased when they submit jobs five minutes later and must now wait $MAX_WALLTIME days for all of that user's jobs to finish.  This is typically solved by limiting the core or node count per user/account, but we use a limit that vastly improves the situation.
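For a rough feel of how such a limit behaves, here is a small sketch (my own illustration, not scheduler code) of the accounting behind a remaining-cputime cap like Slurm's GrpCPURunMins: sum cores × minutes-still-remaining over a user's running jobs, and only start another job if the total stays under the cap.  The job sizes and the cap below are invented.

```python
# Invented numbers; illustrates the bookkeeping, not the scheduler itself.
def remaining_cpu_minutes(running_jobs):
    """running_jobs: list of (cores, minutes_left) for currently running jobs."""
    return sum(cores * minutes_left for cores, minutes_left in running_jobs)

def can_start(new_cores, new_walltime_min, running_jobs, cap_cpu_min):
    projected = remaining_cpu_minutes(running_jobs) + new_cores * new_walltime_min
    return projected <= cap_cpu_min

running = [(64, 300), (64, 1200)]                 # 96,000 cpu-minutes outstanding
print(can_start(256, 4320, running, 200_000))     # wide 3-day job: False (denied)
print(can_start(256,  240, running, 200_000))     # wide 4-hour job: True (allowed)
# Long jobs burn the budget quickly, so a user can still soak up idle cycles
# with short jobs without locking the cluster up for days.
```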

Thursday, October 17, 2013

User Fencing Tools (UFT) on github

I just published a set of scripts, programs, config file examples, etc. that I wrote for use at BYU but that should be useful to other HPC sites.  I couldn't think of a better name for it, so I called it the User Fencing Tools (UFT).  It is available in our github repo at https://github.com/BYUHPC/uft.

The tools control users on HPC login nodes and compute nodes in various ways, using cgroups, namespaces, and cputime limits to ensure that users don't negatively affect each other's work.  We limit memory, CPU, disk, and cputime for users.
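As a rough sketch of the kind of mechanism involved (illustrative Python, not one of the UFT scripts; the cgroup path and the 4 GiB limit are made up, and it assumes root on a host with the cgroup v1 memory controller mounted), a per-user memory cap on a login node could look like this:

```python
# Illustration only: create a per-user cgroup v1 memory group, cap it,
# and move a process into it so the kernel enforces the limit.
import os

def confine(pid, user, mem_bytes=4 * 1024**3):
    cgroup_dir = f"/sys/fs/cgroup/memory/users/{user}"   # hypothetical hierarchy
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_bytes))                          # cap the group's memory
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))                                # move the process in

confine(pid=os.getpid(), user="alice")   # cap this session at 4 GiB
```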

UFT also has examples of how to control ssh-launched processes on compute nodes.  Torque can account for those processes but can't control them (just like normal).  SLURM will have accounting and resource enforcement for them in 13.12 (Dec. 2013).

Wednesday, August 21, 2013

IPMI over LAN vulnerability and some BMC "features"

I don't want to pull away credit or page views from Dan Farmer's great work, but this needs more exposure...
For those of you who manage servers with IPMI over LAN enabled, there is a very severe vulnerability that may allow anyone full root access to your iLO/iDRAC/IMM/ILOM/whatever (aka BMC).  This is independent of the OS, though once the BMC is rooted the attacker can take over the OS just as they could with physical access.  They can control power, boot settings, serial over LAN, BIOS settings (via serial), and KVM, and can even read/write arbitrary system memory.

For those of you who do not have IPMI over LAN enabled, there may be some stuff that affects you too...

Wednesday, July 24, 2013

Server Room and Three Phase Power for Systems Administrators

There doesn't seem to be much educational material about server room power that is comprehensible to systems administrators.  I don't think there is a "typical" sysadmin type out there, but I'm guessing that most have had little to no formal training about server room power.  Three phase power can seem like black magic, and lots of incorrect assumptions get made, so I decided to write this post.  Hopefully it will be useful to some sysadmins out there.
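One example of the kind of relationship the post works through (the 120 V / 30 A numbers below are just an illustration): in a three phase system, line-to-line voltage is √3 times line-to-neutral voltage, not double it, and total power picks up the same factor:

\[
V_{LL} = \sqrt{3}\,V_{LN} \approx 1.732 \times 120\,\mathrm{V} \approx 208\,\mathrm{V},
\qquad
P = \sqrt{3}\,V_{LL}\,I \times \mathrm{PF} \approx 1.732 \times 208\,\mathrm{V} \times 30\,\mathrm{A} \times 1.0 \approx 10.8\,\mathrm{kW}
\]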

Tuesday, July 16, 2013

Per-user /tmp and /dev/shm directories

Updated Oct 7, 2013: Tons of updates
Updated March 19, 2014: The recommended configuration has been in production for months now and works great

I recently discovered a great feature in Linux that allows for per-process namespaces (aka polyinstantiation).  Different processes on the same machine can have different views of a filesystem, such as where /tmp and /dev/shm are.  You can easily make it so that each user on a shared system has a different /tmp that, to each of them, really looks like (and is) /tmp.  This isn't done by setting an environment variable; mount points are redefined on a per-process basis such that each user's processes use their own directory as /tmp.
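The production setup the post describes is configuration-based, but purely as a minimal sketch of the underlying kernel feature (my own illustration; it needs root on Linux, and the per-user directory name is made up), a process can unshare its mount namespace and bind-mount a private directory over /tmp:

```python
# Illustration of per-process mount namespaces: the child gets its own /tmp
# while the parent's /tmp is untouched. Not the configuration from the post.
import ctypes, ctypes.util, os, subprocess

CLONE_NEWNS = 0x00020000   # new mount namespace
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def run_with_private_tmp(user, argv):
    pid = os.fork()
    if pid == 0:                                   # child
        if libc.unshare(CLONE_NEWNS) != 0:
            raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWNS) failed")
        # Keep mount changes from propagating back to the host namespace.
        subprocess.check_call(["mount", "--make-rprivate", "/"])
        private = f"/tmp/.inst-{user}"             # hypothetical per-user dir
        os.makedirs(private, mode=0o700, exist_ok=True)
        subprocess.check_call(["mount", "--bind", private, "/tmp"])
        # From here on, this process and its children see only their own /tmp.
        os.execvp(argv[0], argv)
    os.waitpid(pid, 0)

run_with_private_tmp("alice", ["ls", "-la", "/tmp"])   # shows the empty private /tmp
```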

Wednesday, July 10, 2013

Installing a Xeon Phi (MIC) Card in a Dell PowerEdge R720

We got an early release of Dell's Phi installation kit with installation instructions that weren't all that great (to say the least). Dell told me that they are working on better instructions.  In case you're confused, here you go.

A few things to note:
  • We have dual 95W CPUs.  These instructions might be different (correct?) for higher wattage CPUs (larger heat sinks, different plastic baffles?)
  • The extra heat sinks are for the CPUs, not the Phi.  Our 95W CPUs did not need them.
  • The 2.5" and 3.5" mounting brackets are not necessary in our configuration
  • We used a different bracket that was provided
Here are some pictures of what it should look like: