Friday, June 20, 2014

Problems with Existing Slurm Prioritization Methods

UPDATE: Level-Based was replaced by Fair Tree, an even better algorithm that we created.

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  In order to understand why LEVEL_BASED is necessary, I have chosen to write this post about our issues with the existing options and a separate post about LEVEL_BASED.  If you just want to see information about LEVEL_BASED, see the post LEVEL_BASED Slurm Prioritization Algorithm.

We want users from an account that has higher usage to have lower priority than users from an account with lower usage.  There is no existing algorithm that consistently does this.

We had high hopes for the ticket-based plugin (aka TICKET_BASED); it did work much better than the traditional plugin in most scenarios but it has some flaws.  DEPTH_OBLIVIOUS is designed for a different use case and the traditional multifactor algorithm doesn't come close.

Before beginning, I should point out that anytime I say "doesn't work" you can usually assume that I mean "doesn't work as we want it to".

Account Structure at BYU

We arrange our accounts in a relatively simple way.  Each faculty member has a Slurm account.  Each account consists of the faculty member and his/her students.  All faculty members are treated equally.  All users within an account are treated equally by default, though faculty can use their account coordinator status to modify shares and limits within the account.  It looks something like this:
                root
              /      \
         acctA        acctB            ACCOUNTS
        /  |  \        |   \
   profA  tim  phd4    u5   profB      USERS

With such a simple setup, we thought it would be simple to prioritize users the way we wanted.

Shares Calculation

All current algorithms suffer from a problem with the normalized shares calculation.

This is the formula that is used for all algorithms except LEVEL_BASED:

S =    (Suser / Ssiblings) *
(Saccount / Ssibling-accounts) *
(Sparent / Sparent-siblings) * ...

The problem is that a user's shares number is multiplied by the shares numbers at every level of the account hierarchy.  Let's pretend you have the following setup:

Account | User | Shares
a1      | --   | 100
a1      | u11  | 100
a1      | u12  | 100
a2      | --   | 100
a2      | u21  | 100

Sa1 = (100 / 200) = 0.50
Sa2 = (100 / 200) = 0.50
Su11 = (100 / 200) * (100 / 200) = 0.5 * 0.5 = 0.25
Su12 = (100 / 200) * (100 / 200) = 0.5 * 0.5 = 0.25
Su21 = (100 / 100) * (100 / 200) = 1.0 * 0.5 = 0.50

We can see that u21 is favored since he is the only user in a2.  Now let's say that a1 has an account coordinator that wants to prioritize u12 over u11 because u12 has a paper due tomorrow.  By doing so, u11 is actually penalized compared to other accounts while u12 is only helped somewhat.

Account | User | Shares
a1      | --   | 100
a1      | u11  | 100
a1      | u12  | 200
a2      | --   | 100
a2      | u21  | 100

Sa1 = (100 / 200) = 0.50
Sa2 = (100 / 200) = 0.50
Su11 = (100 / 300) * (100 / 200) = 0.33 * 0.5 = 0.17
Su12 = (200 / 300) * (100 / 200) = 0.67 * 0.5 = 0.33
Su21 = (100 / 100) * (100 / 200) = 1.00 * 0.5 = 0.50
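The multiplicative calculation above is easy to play with directly.  Here is a minimal Python sketch of the formula (not Slurm's actual source code) that reproduces both scenarios; the function name is just for illustration:

```python
# Sketch of the multiplicative normalized-shares formula:
# each association's share of its siblings, multiplied down the hierarchy.

def norm_shares(raw, sibling_raws, parent_norm=1.0):
    """Fraction of raw shares among siblings, scaled by the parent's factor."""
    return (raw / sum(sibling_raws)) * parent_norm

# Account-level factors: a1 and a2 each hold 100 of 200 shares.
a1 = norm_shares(100, [100, 100])              # 0.50
a2 = norm_shares(100, [100, 100])              # 0.50

# First scenario: u11 and u12 each have 100 shares.
u11 = norm_shares(100, [100, 100], a1)         # 0.25

# Second scenario: the coordinator raises u12 to 200 shares.
u11_after = norm_shares(100, [100, 200], a1)   # ~0.167 -- u11 is penalized
u12_after = norm_shares(200, [100, 200], a1)   # ~0.333 -- u12 only helped somewhat

# u21 is alone in a2, so he keeps the full account factor.
u21 = norm_shares(100, [100], a2)              # 0.50
```

Running this shows the same skew as the tables: boosting u12 drops u11 from 0.25 to about 0.17, while the lone user u21 sits untouched at 0.50.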

We don't want shares acting this way, so we improved it in LEVEL_BASED.  Shares weren't the only problem we experienced in production, however.

Fairshare Calculation

I debated which post to place this in and decided to write about it in the LEVEL_BASED post.  Here's the basic problem that is discussed in detail in the other post:

While working on LEVEL_BASED, we noticed that the traditional fair share calculation itself has some big problems.  The algorithm is F = 2^(-U/S), where F is the fair share factor, U is the usage, and S is the shares.  U and S are both numbers between 0.0 and 1.0.

As S gets small, U/S becomes very large, which makes F = 2^(-U/S) very tiny.  You may be unable to differentiate between, for example, accounts with 0.30 usage and accounts with 0.15 usage.
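A quick sketch makes the compression concrete.  With a small normalized shares value (0.01 is a made-up example), the factors for 0.15 and 0.30 usage both collapse to nearly zero:

```python
# The traditional fair share factor: F = 2^(-U/S).
def fairshare(usage, shares):
    return 2 ** (-usage / shares)

s = 0.01  # a small normalized shares value (hypothetical)

f_low = fairshare(0.15, s)   # 2^-15, about 3e-5
f_high = fairshare(0.30, s)  # 2^-30, about 9e-10
```

Both factors are so close to zero that once they are scaled into integer priority points, the account at 0.30 usage and the account at 0.15 usage become effectively indistinguishable.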

TICKET_BASED algorithm

I first wish to state that I'm appreciative of the hard work that was put into TICKET_BASED.  It did a much better job than traditional multifactor (for our use case) and usually worked well during the year and a half that we used it.  However, it does not always do what we want and we couldn't figure out a way to fix it to do so.

The ticket-based plugin documentation stated:
The goal is to make sure that the priority strictly follows the account hierarchy, so that jobs under accounts with usage lower than their fair share will always have a higher priority than jobs belonging to accounts which are over their fair share.

That was exactly what we wanted.  If account A had higher priority as a whole than account B, we wanted all users of account A to have higher priority than all users of account B.  However, we found reality to be different.  Specifically, we saw that a user's fair share factor depends on the number of active users in each account.

Some of this is due to TICKET_BASED features and some is due to built-in Slurm functionality that is overridden only by LEVEL_BASED, such as the shares calculation.

The main problem that is specific to TICKET_BASED is how it distributes the tickets that end up being used for the fair share factor.  The formula for distribution to an association is T = Tparent * S * F / SUM((S*F)active_siblings).  Note that active_siblings refers to you and your siblings.  If there is only one user in an account, that user ends up with 1.0 of the account's tickets.  Accounts with more than one active user are at a disadvantage since they will each get less than 1.0 of the parent account's tickets.

Here is an example using the shares values from the first example, where all users have shares=100, except with two more users added under account a1.  Unfortunately I had to make up the F values, since calculating real ones would have been much more work.  (See the next section if you don't like hypothetical numbers.)

Account | User | Norm Shares | F
a1      | --   | 0.500       | 0.20
a1      | u11  | 0.125       | 0.20
a1      | u12  | 0.125       | 0.20
a1      | u13  | 0.125       | 0.20
a1      | u14  | 0.125       | 0.20
a2      | --   | 0.500       | 0.05
a2      | u21  | 0.500       | 0.05

T = Tparent * S * F / SUM((S*F)active_siblings)

Ta1 = MAX_TICKETS * 0.50  * 0.2 / ((0.50  * 0.2) + (0.50 * 0.05)) = MAX_TICKETS * 0.8
Ta2 = MAX_TICKETS * 0.50  * 0.05 / ((0.50  * 0.2) + (0.50 * 0.05)) = MAX_TICKETS * 0.2

Note that the "4 *" below is because there are four identical users, some of whom are omitted for brevity:
Tu11 = Tparent * 0.125 * 0.2 / ( 4 * (0.125 * 0.2) ) = Tparent * 0.25
....
Tu21 =  Tparent * 0.5 * 0.05 / (0.5 * 0.05) = Tparent * 1.0

Let's say that there are 65536 tickets to distribute.
Ta1 = MAX_TICKETS * 0.8 = 52428
Ta2 = MAX_TICKETS * 0.2 = 13107

Tu11 = 52428 * 0.25 = 13107
Tu12 = 52428 * 0.25 = 13107
Tu13 = 52428 * 0.25 = 13107
Tu14 = 52428 * 0.25 = 13107
Tu21 = 13107 * 1.00 = 13107
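The arithmetic above can be checked with a short Python sketch of the distribution formula (using the same hypothetical S and F values as the tables, not Slurm's real internals):

```python
# Sketch of the TICKET_BASED distribution:
# T = T_parent * S * F / SUM(S * F over active siblings)

MAX_TICKETS = 65536

def tickets(t_parent, s, f, active_siblings):
    """active_siblings is a list of (S, F) pairs, including this association."""
    return t_parent * s * f / sum(si * fi for si, fi in active_siblings)

accounts = [(0.5, 0.20), (0.5, 0.05)]              # (S, F) for a1 and a2
t_a1 = tickets(MAX_TICKETS, 0.5, 0.20, accounts)   # ~52429
t_a2 = tickets(MAX_TICKETS, 0.5, 0.05, accounts)   # ~13107

a1_users = [(0.125, 0.20)] * 4                     # four identical active users
t_u11 = tickets(t_a1, 0.125, 0.20, a1_users)       # ~13107
t_u21 = tickets(t_a2, 0.500, 0.05, [(0.5, 0.05)])  # ~13107, same as u11
```

Despite a2's much worse fair share factor, its lone user ends up with the same ticket count as each user in a1, because the account's entire allocation concentrates on one person.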

As you can see, all users ended up with the same number of tickets.  If one more user from account a1 submits a job, the scale would tip in u21's favor.  That's not good, but those were just made up numbers.

Not just hypothetical...

See bug 498 for a real demonstration of the effect, especially Moe's reply in comment 1.  This is by far the best explanation I have seen for why the algorithm doesn't achieve the plugin's goals.  If you have any doubts, read through comment 1 and follow along with the calculations.

While testing LEVEL_BASED, we did a comparison with TICKET_BASED.  You can clearly see here that TICKET_BASED has a problem.  fslcollab8 (account glh43) beat out all users in dhe's account even though glh43's account had 0.62 effective usage and account dhe had 0.19!!!  The last output in the file shows that LEVEL_BASED properly handles this scenario, even though several key features of LEVEL_BASED were not yet implemented at the time of the test.

Can't we just fix the shares calculations?

I keep asking myself that question, thinking that a simple fix to the shares calculation (optional via config parameter) would have been better than writing an entirely new algorithm.  However, every time I reread bug 498 comment 1, I realize that no existing Slurm algorithm does what we want or could be fixed by only touching the shares calculation.  You just can't multiply by something that includes a parent or child and have it work 100% of the time.

LEVEL_BASED

See my other post, LEVEL_BASED Slurm Prioritization Algorithm, for details of the new algorithm we designed.

UPDATE: Level-Based was replaced by Fair Tree, an even better algorithm that we created.  I need to update this post at some point to reflect that...