Wednesday, April 1, 2015

Caller ID: Handling ssh-launched processes in Slurm

I have thought long and hard about it, but I finally figured out how to handle ssh-launched processes in Slurm.  ssh may be used to launch tasks by a real MPI implementation that was built without Slurm support, or by "poor man's MPI" that launches tasks using the only thing the developer knows how to use: ssh.  Proper programs are written using MPI that is compiled with Slurm integration, but not all programs are proper.  Accounting and resource allocation enforcement should soon work with these attempted escapees...

UPDATE May 4, 2015:  The callerid code is now in Slurm. See "Status" section below for more information about the next steps.

UPDATE May 14, 2015:  SchedMD added an "extern" step that is activated with PrologFlags=contain.  This step will exist on all nodes in an allocation and will be the step that ssh-launched processes are adopted into.

UPDATE Oct 22, 2015:  I finally got around to coding again and I got everything working!  I submitted a patch.  I'll probably write a new post that more accurately reflects the final state of the code.

UPDATE Nov 3, 2015:  I made a minor update that was committed and will be in 15.08.3.

Why do ssh-launched programs cause problems?

If you like accounting, resource limit enforcement, etc., then you don't want tasks to be launched via ssh.  ssh allows a user to escape all of that since the sshd instance has no integration with Slurm.  Unfortunately, some users rely on poorly written programs that use ssh and that's just how it is.  We could block ssh connections completely, but instead we require that users at least have a job on the node they ssh into (enforced by pam_slurm.so).
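
For reference, pam_slurm.so is typically enabled with a line in sshd's PAM stack along these lines (the exact file and placement vary by distribution, so treat this as illustrative rather than a recommended configuration):

# /etc/pam.d/sshd (illustrative)
account    required     pam_slurm.so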

Previous ideas on how to "catch" ssh-launched processes

I had several ideas of how to do this, but not all of the ideas were that good.  Here is a summary of the first two ideas and why they don't work:

Idea #1: pam + environment variables

pam is meant for this kind of thing, right?  First, I would configure AcceptEnv and SendEnv in the ssh config files so that $SLURM_JOB_ID is sent between nodes.  pam could then read those values and "adopt" the process into the right job, subject to sanity checks for $SLURM_JOB_ID values to make sure that the owner of the job is the same as the owner of the ssh process.
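
Concretely, the idea was something along these lines in the ssh client and server configs (illustrative only, since this approach was abandoned):

# /etc/ssh/ssh_config on the source node (client side)
SendEnv SLURM_JOB_ID

# /etc/ssh/sshd_config on the destination node (server side)
AcceptEnv SLURM_JOB_ID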

After some checking I saw that it was likely impossible to read user-provided environment variables in pam.  I confirmed it with an openssh developer, so this method will not work.  The pam modules are run before the environment variables are even received by sshd.

Idea #2: sshrc + environment variables

The next idea was really similar but much uglier.  There is no one script that 1) you can guarantee will be run for every ssh connection, 2) works in all shells, and 3) doesn't potentially affect user workflows.  The closest is /etc/ssh/sshrc (see the sshd(8) manpage, SSHRC section).  If you disable user rc files (available as the PermitUserRC option since openssh 6.7), you can guarantee that /etc/ssh/sshrc will run, though some people actually use ~/.ssh/rc.

sshrc runs as the end user and its parent is the process that launches the user's shell.  You could grab the pid of sshrc's parent process and "adopt" it into the job, subject to the same security constraints mentioned in idea #1.  This idea was still messy since users can wipe out or overwrite $SLURM_JOB_ID before calling ssh (accidentally or maliciously).  Accidents would be easy since all it takes is using one of the exec*e functions to wipe out the environment, or something similar.
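
As a rough sketch, the /etc/ssh/sshrc might have looked something like the following.  The slurm_adopt_pid helper is hypothetical and stands in for adoption code that was never written, and a real sshrc would also have to forward the X11 auth cookie that sshd passes on stdin along to xauth:

# /etc/ssh/sshrc -- hypothetical sketch of this (abandoned) idea
if [ -n "$SLURM_JOB_ID" ]; then
    # $PPID is the process that launches the user's shell
    slurm_adopt_pid "$SLURM_JOB_ID" "$PPID"    # hypothetical helper
fi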

Configuration would also be harder for other Slurm admins since you would have to 1) configure ssh_config, 2) configure sshd_config, 3) upgrade to the very recent openssh 6.7 release and configure PermitUserRC=false, 4) create an appropriate /etc/ssh/sshrc file that won't mess up X11 sessions, 5) work with any users who have a ~/.ssh/rc file to do something different, and 6) hope that users don't accidentally overwrite $SLURM_JOB_ID.  After all of that is done, it should actually work most of the time when the "adopt" functionality is added to Slurm.

An idea that works: pam + "Caller ID" RPC

I finally had an epiphany on how to make this all work in a much simpler way.  Think of running netstat -p on both the destination and source nodes of an ssh connection.  That shows you which process is associated with which connection on each node, so you can work out which process on the source resulted in which process on the destination.

Slurm keeps track of which processes are part of which jobs, so any ssh connection launched from within a job to another node can theoretically be traced back.  The destination node makes an RPC call to the source of the connection asking which process initiated the connection in the first place, something the slurmd on the source can look up.  A pam module would be the perfect place for the destination node to make that RPC call.

For lack of a better term, I decided to call this functionality "Caller ID" and I wrote the Slurm code for this RPC call and the pam module (pam_slurm_adopt.so).

Here is a rough diagram of how it works:

Node1                        Node2
-----                        -----
job123            
 ||
 \/
ssh  =====================>  sshd
                              ||
                              \/
                             pam_slurm_adopt.so
                              ||
                              \/
                             look in /proc/self/fd/ for sockets
                             (use inode# of socket)
                              ||
                              \/
                             look for inode# in /proc/net/tcp{,6}
                             (row contains src/dst IP/port of ssh
                             connection)
                              ||
                              \/
                             send callerid RPC call to slurmd
                             at src IP.
slurmd  <==================  (data: src/dst IP/port and IP version)
 ||
 \/
check src/dst IP/port
among TCP connections
in /proc/net/tcp{,6}
(results in socket inode#)
 ||
 \/
search /proc/[0-9]*/fd/
for process with that
inode open
 ||
 \/
find Slurm job that
contains that process
 ||
 \/
return job_id  ===========> returned to RPC call in pam_slurm_adopt.so
                              ||
                              \/
                            "adopt" this process into task,
                             accounting, and other plugins
                              ||
                              \/
                             continue


This means that every ssh connection results in an RPC call to the node that originated the ssh connection.  If there is a slurmd at that IP address, great.  Otherwise it continues on without adopting the process.  That may change slightly with some future development work (see below).
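
To make the source-node side of this more concrete, here is a rough sketch in C of the two /proc lookups described above.  This is purely illustrative and is NOT the actual callerid.c code: error handling is minimal, it is IPv4-only, and it assumes the connection endpoints have already been formatted as the hex "address:port" strings the kernel uses in /proc/net/tcp.

/*
 * Hypothetical sketch of the source-node ("caller ID") lookup -- NOT the
 * actual callerid.c code.  Given a connection's two endpoints as they appear
 * in /proc/net/tcp (hex "address:port"), find the socket's inode and then
 * the pid that has that socket open.  Minimal error handling, IPv4 only.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Scan /proc/net/tcp for a connection with the given endpoints and return
 * its socket inode, or -1 if no match is found. */
static long find_socket_inode(const char *local_hex, const char *remote_hex)
{
    FILE *fp = fopen("/proc/net/tcp", "r");
    char line[512];
    long inode = -1;

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        char local[64], remote[64];
        long ino;

        /* Fields: sl local rem st tx:rx tr:tm retrnsmt uid timeout inode.
         * The header line simply fails to parse and is skipped. */
        if (sscanf(line, "%*d: %63s %63s %*s %*s %*s %*s %*s %*s %ld",
                   local, remote, &ino) != 3)
            continue;
        if (!strcmp(local, local_hex) && !strcmp(remote, remote_hex)) {
            inode = ino;
            break;
        }
    }
    fclose(fp);
    return inode;
}

/* Walk /proc/<pid>/fd/ for every pid, looking for a symlink whose target
 * is "socket:[<inode>]".  Return the owning pid, or -1 if none is found. */
static pid_t find_pid_with_inode(long inode)
{
    char target[64], fddir[300], link[600], buf[64];
    DIR *proc = opendir("/proc");
    struct dirent *p, *f;

    if (!proc)
        return -1;
    snprintf(target, sizeof(target), "socket:[%ld]", inode);
    while ((p = readdir(proc))) {
        DIR *fds;

        if (p->d_name[0] < '0' || p->d_name[0] > '9')
            continue;                           /* not a pid directory */
        snprintf(fddir, sizeof(fddir), "/proc/%s/fd", p->d_name);
        if (!(fds = opendir(fddir)))
            continue;
        while ((f = readdir(fds))) {
            ssize_t n;

            snprintf(link, sizeof(link), "%s/%s", fddir, f->d_name);
            n = readlink(link, buf, sizeof(buf) - 1);
            if (n <= 0)
                continue;
            buf[n] = '\0';
            if (!strcmp(buf, target)) {
                closedir(fds);
                closedir(proc);
                return (pid_t)atol(p->d_name);
            }
        }
        closedir(fds);
    }
    closedir(proc);
    return -1;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <local_hex> <remote_hex>\n", argv[0]);
        return 1;
    }
    long inode = find_socket_inode(argv[1], argv[2]);
    pid_t pid = (inode < 0) ? -1 : find_pid_with_inode(inode);
    if (pid < 0) {
        fprintf(stderr, "connection not found\n");
        return 1;
    }
    /* The real code would now map this pid back to a Slurm job ID. */
    printf("pid %d owns that connection\n", (int)pid);
    return 0;
}

The remaining step on the slurmd side, mapping that pid to a job ID, is something Slurm already knows how to do from its normal process tracking.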

What do I mean by "adopt"?

Slurm tracks processes for accounting and resource allocation enforcement purposes.  It also cleans up processes when a job exits.

When I say "adopt a process", I mean that Slurm will begin accounting and enforcing resource allocations for a process that it didn't launch, such as one launched through ssh.  For cgroups plugins this means adding the process to the appropriate cgroups tasks files.

My initial plan is to never adopt root processes.  The reason for this is that you probably don't want root ssh connections to hang for (potentially) a few seconds while the RPC call is made, especially since most sites don't use root jobs.  I'll probably add a configuration option at some point for the pam module so that an admin can decide otherwise.

What step will this process be adopted into?

The pam module needs some additional Slurm code to be added before the adoption can work.  SchedMD has tentatively planned to add what's called an "allocation step".  This step will be like the batch step in that it allows other steps to coexist with it simultaneously, but the maximum resources allowed to be used on the node will not increase beyond the allocation.

Several other options were discussed and rejected, such as creating a step per ssh connection, creating a whole-node step per ssh connection, or having the first ssh connection that's launched create a whole-allocation "ssh step".  Each of these options has negative side effects that would preclude other steps from launching, since normal steps can't coexist with other normal steps in the same part of the allocation (e.g. only one normal step can have access to CPU 17 on node21).

Some MPI code may launch one ssh connection per node that then spawns all the tasks.  Other MPI code may launch one ssh connection per task.  The fact that some users may use one MPI and others a different MPI means the normal step approach won't work, hence the allocation step idea.

I'll update this post once I know more details about SchedMD's plans (probably with a link to the relevant yet-to-be-filed bug report).

Status

RPC call and PAM module

The RPC call is developed and working.  The pam module (pam_slurm_adopt.so) is developed and working, but it does not yet actually adopt the process into Slurm.  The code was committed in 3153612e6907e736158d5df3fc43409c7b2395eb on May 4, 2015.

Allocation Step

SchedMD added the code to make this happen.  When PrologFlags=contain is set, an "extern" step is created on each node in a job's allocation.  The ssh-launched processes will use the extern step.  There was a minor bug that had to be fixed for 15.08.2, but this is all in Slurm now.
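
For anyone wanting to try this once everything is merged, my understanding is that the configuration will look roughly like this (illustrative; check the documentation that ships with your Slurm version):

# slurm.conf
PrologFlags=contain

# sshd's PAM stack (file name and placement are illustrative)
account    required     pam_slurm_adopt.so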

Adoption

Also complete.  This was the last piece of code that was needed; it was submitted for inclusion on October 22, 2015.  Once it is merged, everything will be done.

Future development?

Most of the future development work has to do with failing to contact the slurmd at the source of the ssh connection, or with failing to determine the job ID once it is contacted.  I may add some of this before submitting the code; we'll see.

The most common failure scenarios will be administrator error (e.g. firewalls) or a user on a login node connecting directly to a node to monitor a job.  There are a few variations on the problem:

Indeterminate Job ID:  User has a single job on the destination node (COMMITTED)

If no job ID can be determined, such as when a user interactively connects to a node from outside of a job, the pam module currently does nothing.  I plan to add code to the pam module so that 1) if the source job ID cannot be determined and 2) there is a single job from that user on the destination node, the process will be adopted into that job (otherwise they shouldn't be there, right?).

Of course it may be a good idea to skip the RPC call in this scenario and just do the adoption; there is only one option anyway.  If this ssh connection is not supposed to be part of that job, who cares?  What was the user doing on that node anyway?

Now that I think about it, I may just add code to skip the RPC call if the user has only one job on the node. The process will be adopted into that job.

Indeterminate Job ID:  User has multiple jobs on the destination node (COMMITTED)

That idea is simple enough, but what if there are multiple jobs from the user on the node?  Should it pick a random job?  Pick the most recently launched?  Deny the connection?  Do nothing?  Present an interactive chooser?

This should most likely be configurable by the administrator.  My preference is probably to pick a random job and hope for the best.  If they do something stupid like launch a cpu- or memory-intensive application and mess up that random job, at least we can ask them the (hopefully rhetorical) question "would you prefer that it affect someone else's jobs instead?".  An interactive menu might be preferable but I don't know if it could cause issues.

If cgroup plugins are in use, a less attractive option is to place the process in the uid_$UID cgroups instead of the job or step cgroups.  The cpuset cgroup is currently set up to aggregate limits at the uid level, meaning that if the user has five cores allocated between five different jobs, the uid cgroup limits the user to five cores.  The memory cgroup setup code does not do anything similar yet, as far as I know.
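
For context, the hierarchy that Slurm's cgroup plugins build looks roughly like this (paths illustrative); the distinction above is between limits applied at the uid_* level versus the individual job_*/step_* level:

/sys/fs/cgroup/cpuset/slurm/uid_1000/job_123/step_0/
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_456/step_0/
/sys/fs/cgroup/memory/slurm/uid_1000/job_123/step_0/
/sys/fs/cgroup/memory/slurm/uid_1000/job_456/step_0/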

Indeterminate Job ID:  User has no jobs on the destination node (COMMITTED)

pam_slurm.so already handles this situation by denying access to the node.  Should this pam module do the same, or should it be used in conjunction with pam_slurm.so?  Currently both pam modules must be used to achieve this functionality, so it might make sense to just incorporate that functionality into this pam module and stop using the other (unless it has other functionality that I'm not aware of).

Set Primary GID to Job's GID

It would be useful to have an option for the pam module to set the user's primary group to the job's GID.  This would match up the environments better, but I need to investigate whether it's possible.

This became an issue when we had a user try to run gdb on a running process of his that was launched with "sbatch --gid=somegroup".  His ssh session on that node didn't automatically switch him to somegroup (as expected), but that meant he got "ptrace: Operation not permitted" messages.  We had to tell him to run "newgrp somegroup" before running gdb or strace.  It also might be why hidepid=2 on /proc kept him from seeing his own processes (before we disabled it)...

Automatically placing the user in the right primary group could be quite useful.  The only time it might get confusing is if the process is adopted into a random job because the user 1) ssh'd directly from a login node 2) to a node where he has multiple jobs that 3) were launched with different --gid settings, and 4) the pam module option was set to just pick a random job when the source job of the ssh connection is indeterminate.  That's a whole lot of conditions that have to be met for there to be problems, so I plan to enable this option by default.

UPDATE:  After looking into it, it appears that sshd overwrites the group settings that pam establishes (even when done the proper way in a pam module).  I'll have to ask on openssh-dev to see about adding an option to allow pam to override the primary group.  I could successfully modify the list of supplementary groups, just not the primary.

UPDATE2:  I submitted a patch to openssh to allow for this to happen when PermitGidOverride=yes.  Let's see if it gets accepted: mailing list thread and bug report.

Caveats

  • Multihomed nodes can cause issues if slurmd isn't listening on all IP addresses or is firewalled on different IPs.  The ssh connection must originate from an IP on which that node's slurmd is accessible.  The positive side of this is that the RPC call is made directly to the source IP address, so IPs that Slurm doesn't "know about" will still work.
  • IPv6 is supported by the caller ID code itself but Slurm does not yet support IPv6.  No RPC call can be made to an IPv6 address so IPv6 will not work until that's possible.
  • This is very much a Linux-only solution at the moment.  The RPC call itself is OS-agnostic, so the changes needed to support other operating systems are somewhat minimal.  The callerid.c file would need to be updated with $OTHER_OS's method of finding connection information and then mapping it back to a process.  If that feature is added, the same Slurm installation should handle nodes with different OSs without an issue.
  • There is currently no support in this code for running multiple slurmd's on the same node.  That is typically only used in a development environment, so someone else can worry about the code for that if it's actually needed.

Random thoughts

  • The RPC call on the slurmd side checks to see that the requestor is root, SlurmUser, or the owner of the job.  If not, it does not return the job ID.  I cannot see a valid reason to tell a non-job-owner about someone else's job.
  • This code should work as-is for network connections other than ssh and should work just fine for UDP (read from /proc/net/udp{,6}) or maybe other protocols.
  • Make sure your slurmd is available and not firewalled on all the IPs that a user might initiate an ssh connection from.
  • Hopefully users will just use real MPI with Slurm support (and hopefully someone will just send me millions of dollars for fun).

I'll post occasional updates here as things progress.
