Thursday, April 17, 2014

Scheduler Limit: Remaining Cputime Per User/Account

Update May 9, 2014:  Added a link to our GrpCPURunMins Visualizer

I have discussed this with several interested people recently, so it's time for me to write it up.  When running an HPC batch job scheduling system such as Slurm, Moab, Maui, or LSF, there are many ways to configure user limits.  Some of the easiest limits to understand are those on the number of jobs a user can run or the maximum number of cores or nodes a user can use.  We have used a different limit for several years now that is worth sharing.

No one likes to see a cluster that is 60% utilized while users sit in the queue, unable to run because they have hit a core count limit.  Conversely, at a site with no user limits, only the lucky user likes being able to fill up the cluster with $MAX_WALLTIME-day jobs during a brief lull in usage.  Other users are understandably displeased when they submit jobs five minutes later and then have to wait up to $MAX_WALLTIME days for all of that user's jobs to finish.  This is typically solved by limiting the core or node count per user/account, but we use a different limit that vastly improves the situation.