Monday, November 16, 2015


pam_slurm_adopt is a PAM module I wrote that adopts incoming ssh connections into the appropriate Slurm job on the node. The module allows Slurm to control ssh-launched processes as if they were launched under Slurm in the first place and provides for limits enforcement, accounting, and stray process cleanup when a job exits.

Some MPI implementations don't provide for Slurm integration or are not compiled with Slurm support, so the fallback is ssh.  Some code doesn't even use MPI and instead directly calls ssh.  The ideal solution is always to utilize properly-compiled MPI that supports Slurm, but realistically that's not going to happen all the time.  That's where pam_slurm_adopt comes in.

As I see it, this PAM module solves three main problems.  ssh-launched processes now:
  1. Have proper limit enforcement
  2. Will be accounted for in the accounting database
  3. Will be cleaned up when the job exits
We have been using the code in production for more than three weeks now and things work great, though a few minor bugs still need to be fixed before we get all the intended benefits.  See the "Inclusion in Slurm" section for details.

Extern Step

The module makes use of the extern step that is created when PrologFlags=contain is set (thanks to SchedMD for adding this feature to support pam_slurm_adopt).
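
Something like the following is the minimal configuration; exact file names and stack placement vary by distribution, and the README mentioned at the end of this post is the authoritative reference:

    # slurm.conf: create the "extern" step on every node of an allocation
    PrologFlags=contain

    # /etc/pam.d/sshd (account stack): adopt incoming ssh sessions
    account    required     pam_slurm_adopt.so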

Early versions of pam_slurm_adopt required the use of cgroups for process adoption but no longer do due to the new stepd_add_extern_pid function and RPC.  The extern step is just like any other step, though its accounting information is only recorded if there is some CPU usage in it.  The sshd process handling the incoming connection is placed into that step.  All processes launched from that ssh connection are then tracked, limited, and accounted for by Slurm as part of the extern step.
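
Once the extern step exists, it shows up in accounting output like any other step (per the note above, only after it has accrued some CPU usage).  Illustrative sacct output with a made-up job ID, headers trimmed:

    $ sacct -j 8804874 -o JobID,JobName
    8804874           myjob
    8804874.batch     batch
    8804874.extern   extern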

Identification of the Associated Job

When an ssh connection is made, there is no information available about which Slurm job initiated it.  There are also some situations where this determination would be impossible to make anyway, such as when a user on a login node connects to a compute node that is running several of that user's jobs.

There are several situations that pam_slurm_adopt has to handle: the user has zero jobs on the node, one job, multiple jobs where we can figure out which job the connection belongs to, and multiple jobs where we cannot.  Several parameters control the PAM module's behavior in these situations and are explained below.

The user has zero jobs

By default (action_no_jobs), the module rejects the connection.  It is worth noting that other PAM modules can still allow the connection through depending on your configuration.
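
The caveat about other PAM modules comes down to stack ordering and control flags.  An illustrative (not recommended) account stack where a "sufficient" module short-circuits the stack before pam_slurm_adopt ever runs:

    # /etc/pam.d/sshd (illustration only)
    account    sufficient   pam_access.so       # success here returns immediately,
                                                # so pam_slurm_adopt below never runs
    account    required     pam_slurm_adopt.so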

The user has only one job

The only reason the user has access to this node is because of the one job, therefore we assume that the incoming connection should be adopted into that job's extern step.  This approach skips other, more expensive methods.  If you really want to change this behavior (let me know so I can find out why it's useful to you), check the source code for the undocumented single_job_skip_rpc parameter.

The user has multiple jobs: Successful RPC

When multiple jobs from the user have allocations on the node, pam_slurm_adopt makes use of the new CallerID RPC that I wrote earlier this year.  The PAM module first determines the source and destination IP addresses and port numbers (IPv4 and IPv6 are supported).  It then uses the CallerID RPC to contact the slurmd that (hopefully) is listening at the source IP address and sends the network connection information to it.

The remote slurmd then uses the network connection information to track down the ssh process that initiated the connection to the other compute node.  Once it locates the process, it then looks up the Slurm job that it is a member of; Slurm's process tracking makes this possible.

After the RPC determines the remote job ID, the PAM module then adopts the ssh process into that job's extern step.

The user has multiple jobs: Unsuccessful RPC

Sometimes the RPC cannot succeed.  If the user initiated the connection from a login node, no slurmd will be listening at the IP address the RPC is sent to.  The RPC will either time out or be rejected by the host OS, preferably the latter.  If the slurmd port on the login node is firewalled in such a way that incoming packets are silently dropped rather than rejected, the RPC must wait for the configurable timeout.
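
If you firewall the slurmd port on login nodes at all, rejecting is much friendlier than silently dropping because the PAM module fails fast instead of waiting out the timeout.  A hedged iptables sketch, assuming the default SlurmdPort of 6818:

    # fail fast: reject rather than silently drop the CallerID RPC
    iptables -A INPUT -p tcp --dport 6818 -j REJECT --reject-with tcp-reset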

Additionally, there may be a situation where an ssh connection is initiated from a compute node but, for some reason, the slurmd can't associate the source process of the connection with a job.  Maybe you configured the PAM module to allow users access even without jobs, so they aren't part of a job when they ssh again elsewhere.  Maybe you restarted slurmd at the exact moment that the RPC was made from another node.  Maybe there's a bug or you're just unlucky.  For some reason, slurmd just can't figure out the job ID.

The action_unknown parameter defines the (likely controversial) action to take when all else fails.  The default "newest" option is, in my opinion, the best; I'll try to convince you by explaining why it's the least bad option.  Note that you only reach this point if node sharing is allowed (or if you set single_job_skip_rpc=0 for some reason).

The "newest" option picks the newest job on the node from the user and adopts the process into that job's extern step. The slurmd doesn't actually have access to the start time or time limit of the jobs, so the most efficient way to query that information is to compare the creation times of the various jobs' step_extern cgroups. The most recent directory mtime is chosen.  For now, it just checks the memory cgroups.  If you would prefer to use a different subsystem, modify the _indeterminate_multiple function.

Unfortunately, "newest" may result in a user running things related to job A but getting adopted into job B then having job B exit before A is finished. This will likely be a very rare event though I'm sure it will happen sometimes.

This replaced an earlier "any" option that just picked the first step it saw based on the output of a particular function call. That resulted in a somewhat random choice of job and often resulted in an older job being chosen.
The "user" option adopts the process into the slurm/uid_$UID/step_extern cgroups.  This uid_$UID cpuset cgroup contains limits that are aggregated for the jobs from that user, but the uid_$UID memory cgroup is currently not set up with an aggregated limit.  This means that the user's CPU usage can be controlled but not memory usage.  Additionally, including the ssh connection's accounting information is impossible since it's not associated with a job.  I don't recall for sure, but I don't think that stray jobs are cleaned up automatically when all user jobs exit the node.  I would not recommend this option at present.
The "user" option was removed when the switch was made from writing directly to cgroups to using stepd_add_extern_pid.  There is no mechanism in stepd_add_extern_pid for writing just to a user cgroup, and it probably wouldn't be too useful anyway.

The "allow" option just allows the process to continue without any adoption.  Limits are not enforced, accounting doesn't take place, and stray processes aren't cleaned up.  This PAM module was worthless for this situation.
The "deny" option does just what it says. This will likely cause many more user support headaches than "newest".

A user may have a single job on a node at one time and multiple jobs on a node at another.  With "deny", logging into a node where the user has only one job succeeds and the connection is adopted into that job, but logging into a node where the user has multiple jobs is denied.  From the user's point of view, this appears to be random behavior.

Even worse, there is no good way for the user to specify that a connection should be associated with a particular job. Basically, the user sees random connection failures that can't be circumvented. This seems much worse than the occasional situation with "newest" where a user process gets adopted into the newest job on a node and the newest job exits before the user finishes working with the intended job. Deny will happen every time whereas "newest" will negatively affect someone very rarely.
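
Whichever behavior you settle on, it is just a module argument in the PAM stack; for example (argument values as described above):

    account    required     pam_slurm_adopt.so action_unknown=newest
    # or, to fail closed:
    # account  required     pam_slurm_adopt.so action_unknown=deny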

Is it working?

You can enable more debug output with the PAM module's log_level parameter.  The debug levels are the same as those for SlurmctldDebug.  "debug" is probably sufficient for most debugging and "debug3" gets you slightly more from the PAM module itself.  If I recall correctly, there is also additional information from other portions of the Slurm code at "debug5" that might be useful for debugging.
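
For example, to crank up the module's own logging (the messages go wherever your sshd's PAM output normally goes, typically the auth or authpriv syslog facility):

    account    required     pam_slurm_adopt.so log_level=debug3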

Additionally, the /proc/self/cgroup file tells you what cgroups a process is a member of.  If a process is properly launched or adopted into a Slurm job, you should see something like this and maybe more cgroups depending on your configuration:
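    4:memory:/slurm/uid_1234/job_8804874/step_extern
    2:cpuset:/slurm/uid_1234/job_8804874/step_extern
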
Those paths are relative to their respective cgroup roots (see /proc/mounts).  The root is controlled by the CgroupMountpoint setting in cgroup.conf, which defaults to /cgroup.  The cgroup subsystems are mounted under that directory, so 2:cpuset:/slurm/uid_1234/job_8804874/step_extern in the example above corresponds to /cgroup/cpuset/slurm/uid_1234/job_8804874/step_extern with the default configuration.
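
So a quick manual check from inside an ssh session looks something like this (paths assume the default mountpoint; the job and uid values are from the example above):

    grep slurm /proc/self/cgroup                   # are we in a job's cgroups?
    grep cgroup /proc/mounts                       # where are the subsystems mounted?
    cat /cgroup/cpuset/slurm/uid_1234/job_8804874/step_extern/tasks   # adopted PIDs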

Inclusion in Slurm

The necessary code was included in Slurm 15.08.3.  We have been using the latest pam_slurm_adopt code in production starting with 15.08.2, prior to its inclusion in 15.08.3.

  • 15.08.3 mostly works and is safe to use
    • Processes are adopted and limited
      • I left off the "s" in "devices" so you need this patch
    • Accounting does not work
    • Stray processes are not cleaned up
  • 15.08.4 reworked to use new stepd_add_extern_pid function
    • Bug 2096: Stray processes are not cleaned up
    • Bug 2097: Accounting not working, cpuset cgroup not created
    • Accounting now works
  • 15.08.5 will hopefully be 100% functional when released
    • Bugs 2096 and 2097 still need to be fixed

Stray process cleanup can be handled by using an epilog script to kill off anything in the step_extern cgroup, such as this (assuming your cgroups are mounted at /cgroup): xargs kill < /cgroup/cpuset/slurm/uid_$UID/job_$SLURM_JOB_ID/step_extern/tasks.  I have not yet implemented or tested that code in our epilog (but I will at some point), so test it first.  It may also need to clean up task_$task cgroups under step_extern.  Once bug 2096 is resolved, this will no longer be necessary.
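
A slightly fuller sketch of that epilog fragment, still untested as noted above; it assumes cgroups mounted at /cgroup and that SLURM_JOB_UID and SLURM_JOB_ID are available in the epilog environment:

    #!/bin/bash
    # kill anything still running in the job's extern step (untested sketch)
    base=/cgroup/cpuset/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/step_extern
    for tasks in "$base"/tasks "$base"/task_*/tasks; do
        [ -r "$tasks" ] || continue
        xargs -r kill < "$tasks"
    done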

Hopefully 15.08.5 will fix everything.  I will update this post later.

More Information

More information about the mechanics of the CallerID RPC (REQUEST_NETWORK_CALLERID) can be found in a previous post.  Configuration information can be found with the source code in contribs/pam_slurm_adopt/README.

Let me know if you have any other ideas.

Wednesday, April 1, 2015

Caller ID: Handling ssh-launched processes in Slurm

I have thought long and hard about it, but I finally figured out how to handle ssh-launched processes in Slurm.  ssh may be used to launch tasks by real MPI with no Slurm support or by "poor man's MPI" that just launches tasks using the only thing the developer knows how to use: ssh.  Proper programs are written using MPI that is compiled with Slurm integration, but not all programs are proper.  Accounting and resource allocation enforcement should soon work with these attempted escapees...

UPDATE May 4, 2015:  The callerid code is now in Slurm. See "Status" section below for more information about the next steps.

UPDATE May 14, 2015:  SchedMD added an "extern" step that is activated with PrologFlags=contain.  This step will exist on all nodes in an allocation and will be the step that ssh-launched processes are adopted into.

UPDATE Oct 22, 2015:  I finally got around to coding again and I got everything working!  I submitted a patch.  I'll probably write a new post that more accurately reflects the final state of the code.

UPDATE Nov 3, 2015:  I made a minor update that was committed and will be in 15.08.3.

Friday, August 29, 2014

Fair Tree Slurm Fairshare Algorithm

That's right.  Levi Morrison and I created a second Slurm fairshare algorithm, Fair Tree.  Our first algorithm, LEVEL_BASED, was accepted into Slurm and became available in 14.11.0pre3 about one month ago.  Fair Tree was accepted into Slurm in time for 14.11 and replaced LEVEL_BASED.

When given the same inputs, both algorithms produce effectively equivalent outputs.  The objective of both algorithms is the same:  If accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.

So why bother writing a new algorithm three months after the first one if the first algorithm successfully solved the same problems?

Friday, June 20, 2014

LEVEL_BASED Slurm Prioritization Algorithm

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  To see why it is necessary, please see my other post about the problems with algorithms that existed at the time of its creation.

DEPRECATED:  This has been deprecated by our new algorithm, Fair Tree.  Yes, we really did replace this algorithm within a few months even though it worked great for us.  See the post about Fair Tree for details.

Problems with Existing Slurm Prioritization Methods

UPDATE:  Level-Based was replaced by Fair Tree, an even better algorithm that we created.

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED.  In order to understand why LEVEL_BASED is necessary, I have chosen to write this post about our issues with the existing options and a separate post about LEVEL_BASED.  If you just want to see information about LEVEL_BASED, see the post LEVEL_BASED Slurm Prioritization Algorithm.

We want users from an account that has higher usage to have lower priority than users from an account with lower usage.  There is no existing algorithm that consistently does this.

Thursday, May 22, 2014

Job Script Generator for Slurm and PBS published on Github

We published version 2.0 of our batch job script generator on Github.  It is a Javascript library (LGPLv3) that allows users to learn Slurm and PBS syntax by testing various inputs in an easy-to-understand manner.  Links: git repo, demo, other github projects of ours.

Thursday, April 17, 2014

Scheduler Limit: Remaining Cputime Per User/Account

Update May 9, 2014:  Added a link to our GrpCPURunMins Visualizer

I have discussed this with several interested people recently so it's time for me to write it up.  When running an HPC batch job scheduling system such as Slurm, Moab, Maui, or LSF, there are many ways to configure user limits.  Some of the easiest limits to understand are on the number of jobs a user can run or the maximum cores or nodes that they can use.  We have used a different limit for several years now that is worth sharing.

No one likes to see a cluster that is 60% utilized while users sit in the queue, unable to run because of a core count limit they are hitting.  At the other extreme, a site with no user limits lets one lucky user fill up the cluster with $MAX_WALLTIME-day jobs during a brief lull in usage, and no one but that user likes it.  Other users are understandably displeased when they submit jobs five minutes later and must wait up to $MAX_WALLTIME days for all of that user's jobs to finish.  This is typically solved by limiting the core or node count per user or account, but we use a limit that vastly improves the situation.
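
The limit in question is GrpCPURunMins, which caps the sum, over an account's or user's running jobs, of allocated CPUs multiplied by remaining walltime in minutes.  Setting it is a one-liner in sacctmgr; the account name and value below are made up for illustration:

    sacctmgr modify account name=physics set GrpCPURunMins=1000000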

Thursday, October 17, 2013

User Fencing Tools (UFT) on github

I just published a set of scripts, programs, config file examples, etc that I wrote for use at BYU but should be useful to other HPC sites.  I couldn't think of a better name for it, so I called it the User Fencing Tools (UFT).  It is available in our github repo at

The tools are used to control users on HPC login nodes and compute nodes in various ways.  The tools make use of cgroups, namespaces, and cputime limits to ensure that users don't negatively affect each others' work.  We limit memory, CPU, disk, and cputime for users.
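
As a tiny illustration of the cgroup piece (not UFT's actual code; the hierarchy name, uid, PID, and limit are made up), capping a login-node user's memory with a v1 memory cgroup looks roughly like this:

    # create a memory cgroup for uid 1234 and cap it at 4 GiB
    mkdir -p /cgroup/memory/users/uid_1234
    echo $((4 * 1024 * 1024 * 1024)) > /cgroup/memory/users/uid_1234/memory.limit_in_bytes
    # move the user's login shell (PID 5678) into the cgroup
    echo 5678 > /cgroup/memory/users/uid_1234/tasks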

UFT also has examples for how to control ssh-launched processes on compute nodes.  You can account for those with Torque but can't control them (just like normal).  SLURM will have accounting and resource enforcement for these in 13.12 (Dec. 2013).

Wednesday, August 21, 2013

IPMI over LAN vulnerability and some BMC "features"

I don't want to take credit or page views away from Dan Farmer's great work, but this needs more exposure...
For those of you who manage servers with IPMI over LAN enabled, there is a very severe vulnerability that may allow anyone full root access to your iLO/iDRAC/IMM/ILOM/whatever (aka BMC).  This is independent of the OS, though once the BMC is rooted the attacker can take over the OS just as if they had physical access.  They can control power, boot settings, serial over LAN, BIOS settings (via serial), and KVM, and can even read/write arbitrary system memory.

For those of you who do not have IPMI over LAN enabled, there may be some stuff that affects you too...

Wednesday, July 24, 2013

Server Room and Three Phase Power for Systems Administrators

There doesn't seem to be much educational material about server room power that is comprehensible to systems administrators.  I don't think there is a "typical" sysadmin out there, but I'm guessing that most have had little to no formal training about server room power.  Three-phase power can seem like black magic, and lots of incorrect assumptions get made, so I decided to write this post.  Hopefully it will be useful to some sysadmins out there.