Update May 9, 2014: Added a link to our GrpCPURunMins Visualizer
I have discussed this with several interested people recently, so it's
time to write it up. When running an HPC batch job scheduling
system such as Slurm, Moab, Maui, or LSF, there are many ways to
configure user limits. The easiest limits to understand cap
the number of jobs a user can run or the maximum cores or nodes they
can use. We have used a different limit for several years now that is worth sharing.
No one likes to see a cluster that is 60% utilized while users sit in
the queue, unable to run because of a core-count limit they are hitting.
Likewise, at a site with no user limits, only the lucky user who manages to fill the
cluster with $MAX_WALLTIME-day jobs during a brief lull in usage is happy. Everyone else is displeased when the jobs they submit five minutes later
must wait up to $MAX_WALLTIME days for all of that user's jobs to finish.
This is typically solved by limiting the core or node count per user or account, but we use a limit that vastly improves the situation.
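The limit in question is Slurm's GrpCPURunMins (named in the update link at the top): it caps the sum, over a group's running jobs, of allocated CPUs times remaining walltime in minutes, so a user can still grab many cores as long as their outstanding CPU-time commitment stays bounded. Here is a minimal sketch of that accounting; the job tuples and the limit value are illustrative, not taken from any real configuration:

```python
# Sketch of how a GrpCPURunMins-style limit is evaluated: sum
# allocated CPUs x remaining walltime (minutes) over a group's running
# jobs, and refuse to start a new job that would push the total over
# the limit. Job data and the limit value below are made up.

def cpu_run_mins(jobs):
    """Total of cpus * remaining minutes across running jobs."""
    return sum(cpus * remaining_mins for cpus, remaining_mins in jobs)

def can_start(jobs, new_cpus, new_walltime_mins, limit):
    """Would starting the new job keep the group under the limit?"""
    return cpu_run_mins(jobs) + new_cpus * new_walltime_mins <= limit

running = [(16, 1440), (64, 300)]   # (cpus, minutes left) per job
limit = 1_000_000                   # hypothetical GrpCPURunMins value

print(cpu_run_mins(running))                 # 42240
print(can_start(running, 128, 4320, limit))  # True: 42240 + 552960 <= 1000000
print(can_start(running, 512, 4320, limit))  # False: would exceed the limit
```

Note how a short job (small remaining minutes) costs far less against the limit than a long one with the same core count, which is exactly why this limit behaves better than a flat core cap.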