Monday, November 16, 2015

pam_slurm_adopt

pam_slurm_adopt is a PAM module I wrote that adopts incoming ssh connections into the appropriate Slurm job on the node. The module allows Slurm to control ssh-launched processes as if they were launched under Slurm in the first place and provides for limits enforcement, accounting, and stray process cleanup when a job exits.

Some MPI implementations don't provide for Slurm integration or are not compiled with Slurm support, so the fallback is ssh.  Some code doesn't even use MPI and instead directly calls ssh.  The ideal solution is always to utilize properly-compiled MPI that supports Slurm, but realistically that's not going to happen all the time.  That's where pam_slurm_adopt comes in.

As I see it, this PAM module solves three main problems.  ssh-launched processes now:
  1. Have proper limit enforcement
  2. Are accounted for in the accounting database
  3. Are cleaned up when the job exits
We have been using the code in production for more than three weeks now and things work great, though a few minor bugs still need to be fixed to get all the intended benefits.  See the "Inclusion in Slurm" section for details.

Wednesday, April 1, 2015

Caller ID: Handling ssh-launched processes in Slurm

I thought long and hard about it, and I finally figured out how to handle ssh-launched processes in Slurm.  ssh may be used to launch tasks by real MPI with no Slurm support or by "poor man's MPI" that just launches tasks using the only thing the developer knows how to use: ssh.  Proper programs are written using MPI that is compiled with Slurm integration, but not all programs are proper.  Accounting and resource allocation enforcement should soon work with these attempted escapees...

UPDATE May 4, 2015:  The callerid code is now in Slurm. See "Status" section below for more information about the next steps.

UPDATE May 14, 2015:  SchedMD added an "extern" step that is activated with PrologFlags=contain.  This step will exist on all nodes in an allocation and will be the step that ssh-launched processes are adopted into.
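Since the extern step only exists when PrologFlags includes contain, it can be handy to check a node's configuration for it. Below is a minimal Python sketch of such a check. It is not part of Slurm: the helper name `has_contain_flag` and the sample text are hypothetical, and it assumes one `Key=Value` setting per line, `#` comments, case-insensitive keys and flag names, and that the last PrologFlags line wins.

```python
def has_contain_flag(conf_text: str) -> bool:
    """Return True if the text sets PrologFlags with 'contain' among its values.

    Hypothetical helper for illustration only; it is not a Slurm API.
    """
    found = False
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() != "prologflags":
            continue
        # PrologFlags takes a comma-separated list of flags.
        flags = {flag.strip().lower() for flag in value.split(",")}
        # Assumption: a later PrologFlags line overrides an earlier one.
        found = "contain" in flags
    return found


# Made-up configuration excerpt, used only to exercise the helper.
sample = """
# hypothetical slurm.conf excerpt
ProctrackType=proctrack/cgroup
PrologFlags=contain
"""
print(has_contain_flag(sample))
```

In practice the text would come from reading the cluster's actual slurm.conf, whose location varies by installation.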

UPDATE Oct 22, 2015:  I finally got around to coding again and I got everything working!  I submitted a patch.  I'll probably write a new post that more accurately reflects the final state of the code.

UPDATE Nov 3, 2015:  I made a minor update that was committed and will be in 15.08.3.