Monday, November 16, 2015

pam_slurm_adopt

pam_slurm_adopt is a PAM module I wrote that adopts incoming ssh connections into the appropriate Slurm job on the node. The module allows Slurm to control ssh-launched processes as if they were launched under Slurm in the first place and provides for limits enforcement, accounting, and stray process cleanup when a job exits.

Some MPI implementations don't provide for Slurm integration or are not compiled with Slurm support, so the fallback is ssh.  Some code doesn't even use MPI and instead calls ssh directly.  The ideal solution is always to use a properly compiled MPI that supports Slurm, but realistically that's not going to happen all the time.  That's where pam_slurm_adopt comes in.

As I see it, this PAM module solves three main problems.  ssh-launched processes now:
  1. Have proper limit enforcement
  2. Are accounted for in the accounting database
  3. Are cleaned up when the job exits
We have been using the code in production for more than three weeks now and things work great, though a few minor bugs still need to be fixed to get all the intended benefits.  See the "Inclusion in Slurm" section for details.