Friday, June 20, 2014

LEVEL_BASED Slurm Prioritization Algorithm

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  To see why it is necessary, please see my other post about the problems with algorithms that existed at the time of its creation.

DEPRECATED:  This has been deprecated by our new algorithm, Fair Tree.  Yes, we really did replace this algorithm within a few months even though it worked great for us.  See the post about Fair Tree for details.

Introduction

Objective

The LEVEL_BASED algorithm prioritizes users such that users in an under-served account will always have a higher fair share factor than users in an over-served account.

Status

This has been accepted upstream.  It became available starting in 14.11.0pre3.  If you want it in 14.03.x, get it from our github repo.

Documentation

Until it is merged, we will maintain a copy of the documentation here.  (It is now merged in the master branch but there is no place to easily link to).

Concept

After seeing several problems with the other algorithms in production, Levi and I whiteboarded the TICKET_BASED algorithm to see where the problems were.  We figured out what was wrong, saw that it couldn't be fixed, then started working on a solution.

Our solution is simple in concept:  take a 64-bit unsigned integer and divide it up into "buckets" and fence off the usage at each level of associations using bitwise math.

For example, with two levels that we care about (accounts and users), we dedicate the 32 higher bits to accounts and the 32 lower bits to users.  In hex:
Account | User
AAAAAAAA BBBBBBBB

As you can see, a single bit set on the account level will always outweigh the effect of priority at the user level (where a user is competing only with siblings within the account).  This also works with more levels of accounts, e.g. college, department, faculty, user.

We wrote up some detailed documentation of the algorithm so I won't give a full description here.  However, I'll give a quick overview of some things and dive deeper in others.

Basic algorithm

The code starts at the root and recurses through the association tree, calculating fair share as it goes.  It calculates fair share (F = 2(-Usage/Shares)) using only that association and its siblings, shifts it the proper amount, then performs a bitwise OR with the parent account.  The result is set on that association as priority_fs_raw.  Associations deeper than PriorityLevels, the configuration parameter for max depth, are set to their parent's value.  Basically, you end up with something like this example that we included in the documentation.

The shares calculation is also modified.  LEVEL_BASED only normalizes shares at the association's level, meaning it is Snorm = S /  Ssiblings_and_me.

Example of priority_fs_raw calculation

Consider a four-tiered structure with colleges, departments, faculty, and users (faculty member himself/herself plus students). PriorityLevels=4 should be set. priority_fs_raw is divided into four 16-bit buckets:

bucket_width_in_bits = 16
unused_bucket_bits = 0
bucket_max = 65535

This results in the following example calculation:

root
||
\/
Life Sciences    level_fs = 0xAAAA000000000000, priority_fs_raw = 0xAAAA000000000000
||
\/
Biology       level_fs = 0x0000123400000000, priority_fs_raw = 0xAAAA123400000000
||
\/
Dr. Bob       level_fs = 0x0000000077770000, priority_fs_raw = 0xAAAA123477770000
||
\/
Grad Student    level_fs = 0x000000000000CCCC, priority_fs_raw = 0xAAAA12347777CCCC

The final value for Grad Student is 0xAAAA12347777CCCC. This represents:

College | Dept   | Faculty  | User
AAAA      1234      7777      CCCC

The priority_fs_raw values are visible as Fairshare Raw in the output of sshare -l.  This modification makes it easy to see fairshare calculations at each level.

Sort then rank

Once the recursion is complete, all users are then sorted according to their raw priority number.  Their rank in the list is turned into their fair share factor.  This got us around precision problems that exist at the extremes, a common occurrence.  One nice side effect of this is that you can have a very small value for PriorityWeightFairshare, possibly as low as 1 or 2 times the number of user associations.

Of Shares and Asymptotes

We noticed some interesting problems while working on LEVEL_BASED, namely that the traditional fair share calculation has some big problems.  The algorithm is F = 2(-U/S), where F is the fair share factor, U is the Usage and S is the sharesU and S are both numbers between 0.0 and 1.0.

As S gets small, U/S becomes very large which then becomes very tiny when calculating F = 2(-U/S).  When that number is small enough, you may be unable to differentiate between accounts, for example, with 0.30 usage and accounts with 0.15 usage.

We saw real issues with less than 150 accounts.  With 150 accounts that are treated equally, the shares value for each account is about 0.0067.  Let's see how F looks when S is set to 0.0067:
Note that it is actually much worse than that for non-LEVEL_BASED algorithms since each of those 150 accounts can have multiple users.  Let's say that a professor teaches a class that involves HPC plus he has some research assistants, so maybe 40 users.  Now you end up with (1/150)*(1/40) = 0.000166667 for each user.  Ouch.

Those are some pretty tiny numbers for the fair share factor, F.  It gets worse when you realize that F still needs to be multiplied by PriorityWeightFairshare to determine the final fair share value.  The usable range of U values is quite limited; the final F value can easily overlap with others, even for big differences in U.

FairShareDampeningFactor: Almost there

The FairShareDampeningFactor was introduced to solve this problem, which it mostly does for most situations.  It was a great idea but doesn't completely solve the problem, plus it requires a manual tuning parameter.  Instead of F = 2(-U/S), it is F = 2(-U/S/damp_factor).  Basically, you multiply S by some number so it isn't so small.

FairShareDampeningFactor usually works fine, except you have another setting to fiddle with.  It's also a problem if there are big variations in shares values, something that faculty sometimes impose on their students.  We have some users in an account with 1/16 of the default and some are 16x the default.  This does cause problems when dampening is used, especially when each account coordinator can create their own very different situations.

Linear Interpolation

If the goal is to reduce asymptotic equations, why not do it directly by making sure that the denominator is never tiny?  Through linear interpolation, we chose to map S from its initial range of 0.0 to 1.0 to a new range of 0.1 to 1.0.  That means that in the worst case, S is 0.1.

Now we end up with a much better looking graph.  Here is the range of F values that can be achieved given a worst case S value of 0.1 (blue) compared to 0.0067 (red):

We chose to retain the S value until immediately before the F calculation, then normalize it from 0.1 to 1.0 when calculating F.  As the graph showed, even with a shares value of 0.1 you can easily distinguish between different U values which then results in sane F values.

Below are three-dimensional plots of what F looks like for given U and S values.  Note that when S approaches 0.0, the available F values are also almost 0.0 even with low U values.  Compare when S is normalized to the range 0.1 to 1.0.

 F=2(-U/S), U from 0.0 to 1.0, S from 0.0 to 1.0
 F=2(-U/S), U from 0.0 to 1.0, S from 0.1 to 1.0

Try it out

LEVEL_BASED works perfectly for us in our testing and in production.  If this approach sounds like it will work for you, give it a shot and let us know how it goes.  There are some features that we could potentially add, even some that sound quite complicated but wouldn't be in reality (e.g. have the algorithm skip the first few levels).  Contact me or Levi directly or on the slurm-dev list with any questions.

The code has been accepted upstream and became available as of 14.11.0pre3.  If you want it in 14.03, check out the level_based branch in our github repo.  It was initially branched from the SchedMD Slurm repo at the commit for 14.03.4-2.  It has since been updated by merging newer releases from slurm-14.03.

P.S. There are some other things that we released that you may find useful:
Slurm/PBS Script Generator (Javascript)