Thursday, April 17, 2014

Scheduler Limit: Remaining Cputime Per User/Account

Update May 9, 2014:  Added a link to our GrpCPURunMins Visualizer

I have discussed this with several interested people recently so it's time for me to write it up.  When running an HPC batch job scheduling system such as Slurm, Moab, Maui, or LSF, there are many ways to configure user limits.  Some of the easiest limits to understand are on the number of jobs a user can run or the maximum cores or nodes that they can use.  We have used a different limit for several years now that is worth sharing.

No one likes to see a cluster that is 60% utilized while users sit in the queue, unable to run due to a core count limit they are hitting.  Likewise, at a site with no user limits, only the lucky user himself likes being able to fill up a cluster with $MAX_WALLTIME-day jobs during a brief lull in usage.  Other users are understandably displeased when they submit jobs five minutes later and must wait up to $MAX_WALLTIME days for that user's jobs to finish.  This is typically solved by limiting the core or node count per user/account, but we use a limit that vastly improves the situation.

The limit that I refer to is a limit on the remaining cputime per user or account.  It is called GrpCPURunMins in Slurm and MAXPS in Moab/Maui.  Since we use Slurm, I will use its terminology going forward.  Most people do not refer to this limit as a limit on the remaining cputime but I think it's the best description.  The other way of describing it is as a limit on the cputime that can be allocated to running jobs at any given time, minus the elapsed cputime.  Both are correct, but the "remaining cputime" is a simple description that reminds you that the number will decrease over time.

What is GrpCPURunMins (or whatever your scheduler calls it)?

It is a limit on the remaining cputime per account or user.  You can think of remaining cputime as sum(job_core_count * job_remaining_time) for all of a user's or account's jobs.

If a user has 10 jobs that each use 2 cores and each have 24 hours remaining, the remaining cputime is 10 * 2 cores * 24 hours = 480 cpuhours = 28800 cpuminutes.  As the jobs continue to run, the remaining cputime will decrease because the 24 hours in the equation will decrease.
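The arithmetic can be sketched in a few lines of Python (remaining_cpuminutes is a hypothetical helper for illustration, not part of Slurm):

```python
def remaining_cpuminutes(jobs):
    """Remaining cputime, in cpu-minutes, for a list of running jobs.

    jobs: list of (core_count, remaining_minutes) tuples.
    """
    return sum(cores * minutes for cores, minutes in jobs)

# 10 jobs, each using 2 cores with 24 hours (1440 minutes) remaining:
jobs = [(2, 24 * 60)] * 10
print(remaining_cpuminutes(jobs))  # 28800 cpu-minutes (480 cpu-hours)
```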

GrpCPURunMins (or equivalent) is a limit on this number.  In Slurm, you can set it at the account or user level.  If set on an account, it limits the combined jobs of all children of that account.

Once a user/account reaches this limit, no more jobs are allowed to start for that association.  As time goes on, the remaining cputime will decrease and eventually allow more jobs to start without exceeding the GrpCPURunMins limit.
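The admission check amounts to this: a new job may start only if its core count times its full walltime, added to the association's current remaining cputime, stays within the limit.  A toy sketch of that idea in Python (an illustration, not Slurm's actual code):

```python
def job_can_start(new_job, running_jobs, grp_cpu_run_mins):
    """Would starting new_job keep the association under its limit?

    Jobs are (core_count, remaining_minutes) tuples.
    """
    current = sum(cores * mins for cores, mins in running_jobs)
    cores, walltime = new_job
    return current + cores * walltime <= grp_cpu_run_mins

# Ten fresh 2-core, 24-hour jobs exactly fill a 28800 cpu-minute limit,
# so an 11th identical job must wait:
running = [(2, 1440)] * 10
print(job_can_start((2, 1440), running, 28800))            # False

# Once the running jobs burn down to 1296 minutes (21.6 h) remaining,
# the 11th job fits:
print(job_can_start((2, 1440), [(2, 1296)] * 10, 28800))   # True
```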

Why is this limit beneficial?

The short answer:
  • Better system utilization (for most definitions of "better")
  • Staggers job start and end times to a large degree
  • Rewards shorter walltimes
  • Increases job turnover which decreases the average queue time
  • Allows users with short walltimes to burst to fill much of the system (and relinquish it quickly) during any weekend/holiday lulls
  • No (or reduced) need for a limit on the core or node count
The next sections will provide some of the reasoning.

Weekend/holiday lulls in utilization?

This is sometimes a problem for us because not everyone queues up enough work for weekends or holidays.  It varies by semester but this can be a problem.  Let's pick a user, userbob, whose workload (computational kickboxing) is embarrassingly parallel and has decades of work left.

Scenario 1:  No limits / Wild West

Users submit work earlier in the week but don't queue up enough work to keep the system full for the weekend.  userbob has a script to maintain 1000 jobs in the queue at all times.  As other users' jobs finish, only userbob has enough work queued up.  Even though his priority is extremely low, his jobs start because there are free resources.

On Monday morning, everyone else returns to work and submits all of their jobs.  userbob has now been allocated many of the resources for up to $MAX_WALLTIME.  Everyone else sends you hate mail as they wait for userbob's jobs to finish.

Scenario 2:  Limit on core count per user/account

Users submit work earlier in the week but don't queue up enough work to keep the system full for the weekend.  userbob has a script to maintain 1000 jobs in the queue at all times.  Your scheduler has a limit on the core count such that no user may use more than 10% of the cores.  Usage by everyone else accounts for 50% and userbob is able to use 10% for a grand total of 60% overall usage.  40% of your resources are idle and userbob gets less work done.  At least your other users are happy on Monday morning and don't send you hate mail (at least about this).

Scenario 3:  GrpCPURunMins limit in place

Users submit work earlier in the week but don't queue up enough work to keep the system full for the weekend.  userbob has a script to maintain 1000 jobs in the queue at all times.  Your scheduler uses the GrpCPURunMins limit and does not have a core or node count limit.  As utilization drops, userbob is able to start more jobs, and the limit forces those starts to be staggered over time.

By lunchtime on Sunday, userbob is using 45% of the system and other users are using 40%.  When everyone else starts submitting jobs on Monday morning, many of userbob's jobs are close to completion and end in a staggered fashion.  Other users are able to start a few jobs almost immediately and the rest of theirs start in a staggered fashion as some of userbob's other jobs finish.  Users are happy that some of their jobs start relatively quickly even if many of them do not do so immediately.

The key in the third scenario is that the start time of jobs can be staggered.  The lower the jobs' walltimes, the better the effect.

Staggering job start times

As demonstrated in Scenario 3 above, the great thing about GrpCPURunMins is that it staggers the start and end times of jobs.  The limit allows a user to start a certain amount of work immediately, then start more in a staggered fashion over time.
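A toy minute-by-minute simulation (my own sketch, not Slurm's scheduling loop) shows the effect: a burst of jobs starts immediately up to the limit, and the rest trickle in as the running jobs burn down their remaining cputime.

```python
def simulate_starts(n_jobs, cores, walltime, limit, horizon):
    """Return the start times (in minutes) of n_jobs identical jobs
    scheduled under a GrpCPURunMins-style limit."""
    running = []      # remaining minutes for each running job
    queued = n_jobs
    starts = []
    for t in range(horizon):
        # Burn one minute off each running job; drop finished ones.
        running = [m - 1 for m in running if m > 1]
        # Start queued jobs while the limit allows.
        while queued and (sum(running) + walltime) * cores <= limit:
            running.append(walltime)
            starts.append(t)
            queued -= 1
    return starts

# 20 jobs of 2 cores x 24 hours under a 28800 cpu-minute limit:
# ten start at once, then one every couple of hours.
starts = simulate_starts(20, 2, 1440, 28800, 2 * 1440)
print(starts[:12])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 144, 275]
```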

I will include an example of what happened within a research group (aka Slurm account) when they hit their GrpCPURunMins limit.  The graph below has time on the X-axis and allocated CPU cores on the Y-axis.  The light green user and red user are in the same account.  We set the GrpCPURunMins limit at the account level, thus they are competing with each other for resources.

The light green user had been using a lot of cores for a long time.  The red user hadn't run any jobs for a few weeks, so his priority was much higher than the light green user's.  When the red user submitted jobs, he was able to start almost immediately since the light green user's jobs had been started in a staggered fashion.

The red user is very happy because his jobs started immediately.  If he checks how many of his jobs are running, he sees the count steadily increase over time.  That makes most people happy.
The slope of the lines would be steeper and taller if the walltimes were shorter.

Here is an actual demonstration of an account that reduced its maximum walltime from seven days down to three days.  Seven-day jobs already running were allowed to finish, but all new jobs were limited to three days.  Note that usage under the seven-day walltime had not yet plateaued and would have kept increasing for a while longer.

(Snapshot taken during finals week, thus the low utilization)

Encourages shorter walltimes

In case you haven't picked up on it yet, it is worth repeating: this limit encourages users to shorten their walltimes (if you explain it to them).  We like that because it increases the turnover rate and decreases the average wait time in the queue.

Users can run more jobs if they shorten their walltimes.  Everyone benefits when walltimes shrink because queue times drop.  Admins benefit because users are happier and feel they are being treated fairly.

Just remind everyone that remaining_cpu_time = sum(job_core_count * job_remaining_time).  They naturally want to maximize job_core_count and the way to do that is by reducing job_remaining_time.  That is accomplished by reducing their walltimes.
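The trade-off is easy to show numerically.  Holding the limit fixed, the number of jobs a user can have running at once scales inversely with walltime (a quick sketch; the 28800 cpu-minute limit is just an example value):

```python
def max_simultaneous_jobs(limit_cpumins, cores, walltime_mins):
    """How many identical jobs can start at once under the limit."""
    return limit_cpumins // (cores * walltime_mins)

LIMIT = 28800  # example GrpCPURunMins value, in cpu-minutes
for hours in (24, 12, 6, 3):
    n = max_simultaneous_jobs(LIMIT, 2, hours * 60)
    print(f"{hours:2d}-hour walltime: {n} two-core jobs at once")
```

Halving the walltime doubles the number of jobs that can run simultaneously, which is exactly the incentive you want.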

I put together a video on our YouTube channel that includes an explanation of this limit for users.  It might be helpful in educating users about the benefit of this limit and why they should reduce their walltimes.  Enjoy: How to game the scheduler.  (Since I made the video, it isn't the most exciting thing in the world, but it should be educational.)

Choosing a value for the limit

First, try out our GrpCPURunMins Visualizer.  It shows how many jobs of a given size and walltime can run under a limit, assuming infinite resources and no contention.

After that, you may want to create a spreadsheet to simulate the different workloads you support.  Basically, figure out how many $MAX_WALLTIME-length jobs to allow from one user/account.  Then figure out how many you're okay with if many of the jobs are 0.5 * $MAX_WALLTIME, a mix of 0.2 and 0.7 of $MAX_WALLTIME, etc.  Then throw in a different job type (e.g. big MPI versus lots of serial).
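As a starting point for such a spreadsheet, here is a short Python sketch.  The candidate limit, the 7-day maximum walltime, and the workload mixes are all assumptions to adapt to your site:

```python
MAX_WALLTIME_MINS = 7 * 24 * 60   # assumed 7-day maximum walltime
LIMIT = 1_000_000                 # candidate GrpCPURunMins value to evaluate

# (description, cores per job, walltime as a fraction of the maximum)
workloads = [
    ("serial, full walltime",     1, 1.0),
    ("serial, half walltime",     1, 0.5),
    ("16-core MPI, full",        16, 1.0),
    ("16-core MPI, 0.2 of max",  16, 0.2),
]

for name, cores, frac in workloads:
    walltime = frac * MAX_WALLTIME_MINS
    n = int(LIMIT / (cores * walltime))
    print(f"{name:26s} -> up to {n} jobs at once")
```

If any row looks too generous or too stingy for the users you support, adjust the candidate limit and rerun.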


While it won't solve world hunger or cure cancer, appropriate usage of the GrpCPURunMins limit may enable the research that does.  ;-)

Okay, it works well for us but may not work well for all use cases.  It's not a perfect solution but close enough for us.
