Tuesday, July 16, 2013

Per-user /tmp and /dev/shm directories

Updated Oct 7, 2013: Tons of updates
Updated: March 19, 2014: The recommended configuration has been in production for months now and works great

I recently discovered a great feature in Linux that allows for per-process namespaces (aka polyinstantiation).  Different processes on the same machine can have different views of a filesystem, such as where /tmp and /dev/shm are.  You can easily make it so that each user on a shared system has a different /tmp that, to each of them, really looks like (and is) /tmp.  This isn't done by setting an environment variable; this redefines mount points on a per-process basis such that each user's processes are using their own directory as /tmp.
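To get a feel for the underlying mechanism, you can try it by hand with unshare(1) from util-linux.  This is a quick illustrative sketch (run as root); depending on your distro, you may need to mark / private inside the namespace first with "mount --make-rprivate /":

# Terminal 1: enter a private mount namespace and mount a fresh tmpfs over /tmp.
unshare -m sh -c 'mount -t tmpfs tmpfs /tmp; touch /tmp/only-here; ls /tmp; sleep 60'

# Terminal 2 (while the above is running): the global /tmp is unchanged.
ls /tmp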

For example, you can create /usertmp and have directories automatically created as /usertmp/$USER when a user logs in.  A PAM module (pam_namespace.so) is called from /etc/pam.d/sshd that does some magic to disassociate mount points and mount /usertmp/$USER as /tmp.  If you want the user's /tmp to be on the same filesystem as the "real" /tmp (e.g. /tmp is a separate partition) you can instead instantiate user directories as /tmp/usertmp/$USER.

There is also an option for per-session directories that are deleted when the user's session ends (e.g. the ssh connection closes).  SELinux contexts can also be used, and a fresh tmpfs mount can be created instead of a plain directory; see the sketch below.
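For reference, here is an untested sketch of what those variants might look like in /etc/security/namespace.conf, using the method names from the pam_namespace(8) man page (the paths are illustrative):

# Per-session directory, removed when the user's session closes:
/tmp      /tmp/usertmp/      tmpdir   root
# Fresh tmpfs mounted for each user instead of a plain directory:
/dev/shm  /dev/shm/usertmp/  tmpfs    root
# Instance directories based on SELinux context as well as user name:
/tmp      /tmp/usertmp/      context  root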

We've begun using this on login nodes and it works well.  Cleaning up these directories can be as simple as this cron script:
#!/bin/bash
# Skip the loop entirely if a glob matches nothing.
shopt -s nullglob
for dir in /tmp/usertmp/* /dev/shm/usertmp/*
do
    user=$(basename "$dir")
    # Only clean up when the user has no processes left on the machine.
    if ! pgrep -u "$user" >/dev/null
    then
        tmpwatch 3h "$dir" # can also use rm -rf "$dir"
    fi
done

(Code typed directly into the blog and not yet tested.  Using tmpwatch rather than rm -rf allows for temporary ssh disconnects before everything is deleted.)

Why use per-user /tmp and /dev/shm?

It makes it much easier to clean up after users when batch jobs exit or when they log out of login nodes.  It is much cleaner than periodically traversing every file on /tmp and hoping the users don't need those files anymore.  Now we can tie cleanup to whether a user is logged in and delete the data very quickly once they're gone.

The biggest advantage will be on compute nodes.  A simple epilog script (sketched below) can delete a user's directories when there are no more jobs from that user on the node.  I looked through the pam_namespace code and figured out how to create directories on a per-job basis, but I don't know how to get the $SLURM_JOB_ID variable (or any other ssh variable) passed to PAM.  We do have it set up such that $SLURM_JOB_ID is allowed to pass between ssh and sshd, so that's not the issue.  Most multi-node jobs are launched using MPI with SLURM support, so I may be able to make it work, but it probably isn't worth the effort at this point.
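Here is a minimal, untested sketch of such an epilog script.  It assumes SLURM exports $SLURM_JOB_USER to the epilog environment and that squeue is available on the compute node:

#!/bin/bash
user="$SLURM_JOB_USER"
# Count the user's remaining jobs on this node; -h suppresses the header line.
if [ "$(squeue -h -u "$user" -w "$(hostname -s)" | wc -l)" -eq 0 ]
then
    rm -rf "/tmp/usertmp/$user" "/dev/shm/usertmp/$user"
fi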

Configuration Examples

Below are some simple configuration settings that you can try out.  Read through the Caveats section first, since it explains why these simple examples may cause problems.

Append to /etc/pam.d/sshd and other applicable PAM files, if any, such as slurm or other scheduler equivalents:
session required pam_namespace.so

/etc/security/namespace.conf:
/tmp      /tmp/usertmp/      user  root,admin1,admin2
/dev/shm  /dev/shm/usertmp/  user  root,admin1,admin2


You can omit admin1, admin2, etc. if you make the changes described below in Caveats: sudo/su.

Also create an init script (or append to /etc/rc.local) that does the following:
mkdir -pm 000 /tmp/usertmp
mkdir -pm 000 /dev/shm/usertmp

This configuration will exclude the root, admin1, and admin2 users from namespaces.  There is also a syntax to whitelist instead of blacklist: prepend the user list with ~ (example below).
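For example, a hypothetical line that polyinstantiates /tmp only for user1 and user2 (placeholder names) would look like:
/tmp      /tmp/usertmp/      user  ~user1,user2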

The kernel support for per-process namespaces dates back to the 2.4 series (CLONE_NEWNS appeared in 2.4.19), so it's a shame I didn't hear about it sooner.  This mechanism is very useful for keeping /tmp and /dev/shm clean even on large multi-user systems.  Users won't be able to see each other's /tmp files at all, so it's also nice for security reasons.

Unfortunately, things aren't as simple as I would like so you'll definitely want to read through the caveats.  My recommended configuration is after the caveats section.


Caveats

The polyinstantiation mechanism used by pam_namespace.so completely detaches a process's view of mountpoints from future changes, such as mounting or unmounting a filesystem.  This can be problematic and may require a reboot of the server if you want to make big changes.

Here are several things that may break in ways you have probably not seen before.  I address each one individually then give my recommendation for a one-size-fits-all global fix.

Automounting

Automounting will break, but there is a fix.  If you automount directories somewhere under /autofs, run the following commands as part of an init script before anything actually uses the automount:
mount --bind /autofs /autofs
mount --make-rshared /autofs

Be sure to run those before any users log in or batch jobs run.  The first command can be replaced by the following line in /etc/fstab (only use one of the two methods):
/autofs /autofs bind defaults,bind 0 0
Does anyone know the equivalent of the second command in /etc/fstab?

There's more information about this at LWN [1] [2].  Link 1's contents are also under Documentation/filesystems/sharedsubtree.txt in the kernel documentation.  Skimming over it probably won't help much; carefully read through the examples and it should be understandable.

Basically, we need to force the /autofs mount to share itself and all of its mounts and unmounts with all namespaces.  This must be done before those namespaces are created.  Any changes made under /autofs will be reflected in all namespaces that were created after mount --make-rshared was called.
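To check that the propagation change took effect, look at /proc/self/mountinfo (available since kernel 2.6.26); a shared mount carries a shared:N tag in its optional fields:

grep autofs /proc/self/mountinfo   # the /autofs entry should show "shared:N"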


sudo / su

Read this section in its entirety before attempting to implement it or you will break your system in different but subtle ways.

Unless you configure things properly, you may break your system.  For example, suppose you don't exclude admin2 in namespace.conf.  admin2 logs in as admin2, then runs "sudo <some command that restarts a daemon>" or "su".  Anything launched as root in that scenario will retain admin2's namespace configuration; a restarted daemon, for instance, would keep using admin2's private /tmp.  That can cause problems if you try to automatically clean admin2's directories from cron after admin2 logs out.

This can be mostly fixed in the following manner:

Add the following line...
session required pam_namespace.so unmnt_only

...to the following files:
/etc/pam.d/{sudo,sudo-i,su,su-l}

There may be some other files in pam.d that apply to you, but it's not too likely.  You may also use unmnt_remnt instead of unmnt_only if it's more applicable, such as when the target user (e.g. root) should get its own polyinstantiated directories rather than the global view.

There is something important to note about this configuration: the resulting namespace is a new private one that should look very similar to the original, but it is not the same as the global view.  You can come very close to approximating a global view by running the following:
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm


This makes it so mounts are shared between namespaces by default.  The directories /tmp and /dev/shm are then set to be private so that private namespaces will work properly for those directories.

SLURM doesn't use PAM to launch jobs by default

Set UsePAM=1 in slurm.conf.  YMMV with other schedulers (if applicable).

Reboot required for new mounts/unmounts?

Okay, this can get a little ugly.  Let's say userbob logged in and will stay logged in for days.  You're selling the NAS device that is mounted via NFS at /oldjunk, so you want to unmount it.  What happens when you call "umount /oldjunk"?  It fails, because when userbob logged in, pam_namespace.so gave him a private namespace that still holds /oldjunk mounted.  Not so nice, is it?  Alternatively, userbob needs the newly mounted /newtoy, but his existing processes won't see it.  The easy way to handle this is to tell him to log out and log back in again.

There are some solutions to it.  One easy way is to mount all of your network filesystems (or USB, etc) under /mnt/.  Then run the following in an init script:
mount --bind /mnt /mnt
mount --make-rshared /mnt

You may notice that these are the same commands as in the automounting fix.  Just as with the automounter, we want changes to these filesystems to be reflected globally.  So now instead of /oldjunk, it's /mnt/oldjunk.  You can still symlink that to /oldjunk if so desired.

Let's say you misconfigured things...

You deployed your namespace configuration but didn't fully test every possible scenario (who would do such a thing... ;-) ).  A user has trouble accessing an automounted filesystem because it wasn't mounted when the job launched.  How can you help the user get work done while you're testing changes?

In an HPC environment this can cause interesting behavior.  Any necessary filesystems must be mounted before a user's job starts; specifically, the automount must happen before the job script hits the PAM stack.  This may be done via a prolog mechanism.  In SLURM, the user can also work around it: run "ls" or something to trigger the automount, then use srun to launch the job.  With UsePAM=1, the process launched via srun will hit the PAM stack and have the automounted filesystem available (see the sketch below).
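As a sketch, the workaround inside a job script might look like this (the path and program name are hypothetical):

# Trigger the automount before srun creates a fresh namespace via PAM.
ls /autofs/mystuff > /dev/null
# Processes launched by srun will now see /autofs/mystuff mounted.
srun ./myprogram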

This will be loads of fun for a user trying to diagnose messages like "ls: cannot access /autofs/mystuff: Too many levels of symbolic links" that occur in a job but not on a login node.  Of course, the user has left a shell open for months on the login node in /autofs/mystuff/dir1/test/, so the automount never expires there and they never hit that error message...

Final config with fixes for everything above (hopefully)

Add this and only this to an init script that runs before anything else that needs a private namespace (e.g. user logins, ssh, batch jobs, etc):
mkdir -pm 000 /tmp/usertmp
mkdir -pm 000 /dev/shm/usertmp
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm


If you're adventurous like me (see "Caveats: sudo/su"), configure the following in /etc/security/namespace.conf:
/tmp      /tmp/usertmp/      user  root
/dev/shm  /dev/shm/usertmp/  user  root

I foresee (but have not encountered) possible bugs in not excluding admins, so the following might be safer to use in /etc/security/namespace.conf:
/tmp      /tmp/usertmp/      user  root,admin1,admin2
/dev/shm  /dev/shm/usertmp/  user  root,admin1,admin2

Add the following line to /etc/pam.d/{sshd,slurm,...} and any other files that need namespaces:
session required pam_namespace.so

Add the following line to /etc/pam.d/{sudo,sudo-i,su,su-l}:
session required pam_namespace.so unmnt_only

Does it actually work?

Works for me!  The recommended configuration has been in production for months now and works great, even with all our users who make careers out of breaking things.  Users get their own "sandboxes" in /dev/shm and /tmp that are separate from each other and easy to clean up.  We have a SLURM epilog script that automatically cleans up those directories for each user.  It's much easier, safer, and faster than the old traverse-the-filesystem method of cleaning up after a user.

Debugging

For debugging, examine /proc/$pid/mounts (/proc/self/mounts for the current process) to see which mounts each process has access to.  Simply calling "mount" won't give you all the information you need.
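For example, to compare the global view with what a particular user's session sees (the PID here is hypothetical):

diff /proc/1/mounts /proc/12345/mounts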


(For other people searching for how to do something like this, I will include some phrases I initially searched for while researching this solution:  per user bind mounts, dynamic bind mounts, dynamic /tmp or /dev/shm, separate /tmp or /dev/shm or tmpfs per user.  I don't normally try any kind of SEO but the terms to search on aren't exactly clear: "polyinstantiation" or "pam_namespace")

