## Friday, August 29, 2014

### Fair Tree Slurm Fairshare Algorithm

That's right.  Levi Morrison and I created a second Slurm fairshare algorithm, Fair Tree.  Our first algorithm, LEVEL_BASED, was accepted into Slurm and became available in 14.11.0pre3 about one month ago.  Fair Tree was accepted into Slurm in time for 14.11 and replaced LEVEL_BASED.

When given the same inputs, both algorithms produce effectively equivalent outputs.  The objective of both algorithms is the same:  If accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.

So why bother writing a new algorithm three months after the first one if the first algorithm successfully solved the same problems?

### Was the LEVEL_BASED algorithm insufficient?

Well, I'm going to go with "it was probably sufficient".  It was sufficient for almost all use cases and we were quite pleased with it.  It worked very well in our environment and it should work well in most environments.  Performance was good and we have used it in production for three months.  Our slides for the upcoming Slurm User Group meeting (less than a month from now) were pretty much complete.

The former algorithm would divide a 64-bit unsigned integer into "buckets" that stored the results of the fairshare equation 2^(-Usage/Shares) between an association and its siblings.  As the algorithm traverses the association tree, the results at each level are shifted to less and less significant bits.  Sorting this number guaranteed that the calculations at higher levels of the tree always outweighed the results of lower levels.

The first problem was that we forced admins to select how many levels to support, an admittedly minor inconvenience.  The second problem is that the aforementioned 64-bit integer was subdivided into buckets small enough to allow for the proper number of levels.  The more levels in your association tree, the lower the precision at any one level.

A 64-bit unsigned integer should be enough to satisfy 2, 3, or 4 levels.  That gives you 32, 21, or 16 bits to use at each level.  With 16 bits, you can store values up to 65,535 per level.  When you have a few or even 100 siblings, that's almost certainly sufficient to differentiate between fairshare equation results.  But "almost certainly sufficient" isn't good enough for us.

As we were finishing up our slides for next month's Slurm User Group Meeting, one of the last slides we added was a list of several ideas for improvement.  We talked about a particular idea for a little while then decided to try it out.  After less than an hour, a new algorithm was born (then fine-tuned and tested for a few more days).

### Fair Tree Algorithm

The new algorithm is much easier to describe in classic computer science terminology.  It uses some of the same ideas from LEVEL_BASED, including that fairshare is calculated for each association only considering the shares and usage of it and its siblings (stored as level_fs).  A rooted plane tree [PDF], also known as a rooted ordered tree, is logically created then sorted by fairshare with the highest fairshare values on the left.  The tree is then visited in a depth-first traversal.   Users are ranked in pre-order as they are found.  The ranking is used to create the final fairshare factor for the user.

The algorithm performs a single traversal of the tree since all the steps can be combined.  The very basic idea is to start at root then:
• Calculate level_fs for the subtree's children: Shares / Usage
• Sort children of the subtree
• Visit the children in order
• If user, assign a final fairshare factor similar to (ranking-- / user_assoc_count)
• If account, descend to account

It is a little more complicated, though only slightly

#### Visual Examples

I posted an example of the tree traversal algorithm; it is from a draft of our slides for the Slurm User Group Meeting:
Animated gif
PDF (click through at your own pace)

### Ties

Ties are allowed between associations, so you can end up with rankings like 80, 80, 80, 77, 76.  Ties typically only occur when competing associations have 0.0 usage.  The raw usage of an association is stored by Slurm as a long double, so the odds of two associations having the same fairshare factor despite non-zero usage are almost zero.  Once one association accumulates any usage at all, the tie will be broken.  The odds of anyone 1) being negatively affected, (2) noticing the problem and (3) caring about the issue seem to be pretty low but we solved it anyway.

There are three different scenarios:
1. User ties user
2. Account ties account
3. User ties account

#1 was easy to solve.  #2 and #3 were difficult to solve until we came up with the right solution.  It was hard to figure out what should even happen for #3.

The solution for #1 (user ties user) was to assign the same fairshare number to each sibling user that has the same level_fs.  #2 (account ties account) was solved by creating a temporary list of child associations and appending all children of all tied accounts to the temporary list.  The list is then sorted by each child's fairshare.  Each user that is found is given a final fairshare number and accounts are handled recursively.

The solution for problem #2 may seen like overkill or maybe inefficient but it mostly had to be done anyway.  Sorting the actual association tree in-place might have bad side effects, so we had to clone the list of an account's children anyway.  The solution to ties simply adds more children to the temporary list then skips over the tied accounts.  The list is a bunch of pointers, not actual structs, so it's a fairly cheap operation.

#3 (user ties account) was handled by sorting the list with users at the left (highest ranked) and not decrementing the rank when descending to a tied sibling account.  That allows a user tied with an account to tie its highest ranked user.

The problems were complicated but the solution ended up being simple when chained together.

The tie handling can be summed up as:
1. Sibling users with the same level_fs receive the same rank
2. Sibling accounts with the same level_fs have their children lists merged before sorting
3. A user with the same level_fs as a sibling account will receive the same rank as the account's highest ranked user

### Documentation

Documentation is at http://slurm.schedmd.com/fair_tree.html.  Our presentation at the 2014 Slurm User Group Meeting is also available: http://slurm.schedmd.com/SUG14/fair_tree.pdf.

### Results

Levi and I were very pleased with our original algorithm.  We are even more pleased with this one.  It works great and it should allow for calculations with extreme amounts of precision at arbitrary tree depths.  With the exception of the tie handling code, it is a very simple algorithm with great results.  sshare also works and the debug information that goes into the logs, if requested, is very nice.

The Fair Tree algorithm was accepted as a replacement for LEVEL_BASED and is available starting in the 14.11 series.

1. Hello Ryan,

Great work with FAIR_TREE. Just curious about your "Priority*" parameters from your slurm.conf file. Do you think you can share them here?

thanks
-k

2. PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=1
PriorityMaxAge=14-0
PriorityWeightAge=0
PriorityWeightFairshare=10000
PriorityWeightQOS=1000000
PriorityWeightJobSize=0
PriorityWeightPartition=0

We may adjust the PriorityDecayHalfLife at some point as we observe the queue.

We would use Age, but it needs some work IMO. I think it should only apply to the top x jobs from a particular user. Otherwise a user can submit tons of large job arrays that are all accruing queue time, so the users who already run the most often have more of an advantage.

Please leave any comments, questions, or suggestions below. If you find a better approach than what I have documented in my posts, please list that as well. I also enjoy hearing when my posts are beneficial to others.