Wednesday, April 1, 2015

Caller ID: Handling ssh-launched processes in Slurm

I thought long and hard about it, and I finally figured out how to handle ssh-launched processes in Slurm.  ssh may be used to launch tasks by a real MPI implementation that has no Slurm support, or by a "poor man's MPI" that launches tasks using the only thing the developer knows how to use: ssh.  Proper programs are written using an MPI that is compiled with Slurm integration, but not all programs are proper.  Accounting and resource allocation enforcement should soon work for these attempted escapees...

UPDATE May 4, 2015:  The callerid code is now in Slurm. See the "Status" section below for more information about the next steps.

UPDATE May 14, 2015:  SchedMD added an "extern" step that is activated with PrologFlags=contain.  This step will exist on all nodes in an allocation and will be the step that ssh-launched processes are adopted into.
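
For reference, enabling that flag is a one-line change in slurm.conf. The fragment below is only a minimal sketch: it shows the single setting named above, and I am assuming the rest of your slurm.conf (process tracking, accounting, and so on) is already set up appropriately for your site.

```
# slurm.conf fragment (sketch; only the flag from the update above is shown)
# Creates the "extern" step on every node of an allocation
PrologFlags=contain
```

The daemons have to pick up the new setting before the extern step appears in new allocations.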

UPDATE Oct 22, 2015:  I finally got around to coding again and I got everything working!  I submitted a patch.  I'll probably write a new post that more accurately reflects the final state of the code.

UPDATE Nov 3, 2015:  I made a minor update that was committed and will be in 15.08.3.