Wednesday, May 14, 2025

Securely Running Windows in HPC

We now have proper Windows support in our HPC environment. Hooray? The approach is fairly novel and is meant to be as secure as possible while remaining easy for Linux systems administrators to maintain.

Long story short, a Windows VM is launched through Open OnDemand and starts inside of an isolated network namespace. The network namespace is started by spank_iso_netns (see below), a SPANK plugin I wrote. The user's files are made available inside the Windows VM.

Network Segmentation of Users on Multi-User Servers and Networks

Early last year, I finished a paper for the Master of Science in Information Security Engineering program at SANS Technology Institute. The paper was published at SANS.org and SANS.edu a year ago, but I am only now writing about the project here since life was pretty busy at the time.

I spent several years thinking through this problem and trying different technologies to make it work. I spent a few months finalizing it and then writing it up for the academic paper (during which time I also had to, um... think about... this totally hypothetical disaster for several months...).

Rather than rewriting my paper as a blog post, I will include some of the most important parts of the paper, add some additional commentary, then refer you to the paper as needed. If you are actually interested in this solution, please read the paper since it has many more details.

Problem: Potential attack vector

As seen in real life on an HPC login node:

tcp 0 0 127.0.0.1:46614   0.0.0.0:*   LISTEN   23689/Xorg
tcp 0 0 0.0.0.0:48279     0.0.0.0:*   LISTEN   123965/mpiexec
tcp 0 0 0.0.0.0:6104      0.0.0.0:*   LISTEN   107345/Xorg
tcp 0 0 0.0.0.0:39000     0.0.0.0:*   LISTEN   213053/node
tcp 0 0 0.0.0.0:39001     0.0.0.0:*   LISTEN   212528/mongod
tcp 0 0 0.0.0.0:39002     0.0.0.0:*   LISTEN   212606/python
tcp 0 0 0.0.0.0:39003     0.0.0.0:*   LISTEN   212788/python
tcp 0 0 0.0.0.0:39004     0.0.0.0:*   LISTEN   212879/python
tcp 0 0 0.0.0.0:36846     0.0.0.0:*   LISTEN   60839/java
tcp 0 0 127.0.0.1:63342   0.0.0.0:*   LISTEN   60839/java
tcp 0 0 0.0.0.0:45621     0.0.0.0:*   LISTEN   60839/java
tcp 0 0 0.0.0.0:43727     0.0.0.0:*   LISTEN   6473/mpiexec
tcp 0 0 0.0.0.0:56733     0.0.0.0:*   LISTEN   219999/mpiexec
tcp 0 0 0.0.0.0:33374     0.0.0.0:*   LISTEN   175150/mpiexec
tcp 0 0 0.0.0.0:37987     0.0.0.0:*   LISTEN   199971/mpiexec
tcp 0 0 0.0.0.0:44228     0.0.0.0:*   LISTEN   123965/mpiexec
tcp 0 0 0.0.0.0:52294     0.0.0.0:*   LISTEN   264169/mpiexec

These processes were all owned by various HPC users and there were probably dozens of people logged into that system at that moment.
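To make the exposure in a listing like this easier to see, here is a minimal Python sketch that picks out the listeners bound to a non-loopback address, which are the ones reachable from other hosts on the network. The function name `exposed_listeners` and the `SAMPLE` constant are made up for this illustration, and the parser assumes the seven-column layout shown above. Note that a listener on 127.0.0.1 is still reachable by every other user logged into the same node, so filtering out loopback only narrows the problem to the network side.

```python
# A few lines copied from the listing above. Columns: proto, recv-q, send-q,
# local address, foreign address, state, pid/program.
SAMPLE = """\
tcp 0 0 127.0.0.1:46614 0.0.0.0:* LISTEN 23689/Xorg
tcp 0 0 0.0.0.0:48279 0.0.0.0:* LISTEN 123965/mpiexec
tcp 0 0 0.0.0.0:39001 0.0.0.0:* LISTEN 212528/mongod
tcp 0 0 127.0.0.1:63342 0.0.0.0:* LISTEN 60839/java
"""

LOOPBACK_PREFIX = "127."


def exposed_listeners(netstat_text):
    """Return (port, program) pairs for listeners not bound to loopback.

    Hypothetical helper for illustration; it only understands the
    seven-column format used in SAMPLE.
    """
    exposed = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip anything that is not a well-formed LISTEN line.
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        if host.startswith(LOOPBACK_PREFIX):
            continue
        program = fields[6].split("/", 1)[1]
        exposed.append((int(port), program))
    return exposed


if __name__ == "__main__":
    for port, program in exposed_listeners(SAMPLE):
        print(f"{program} is reachable from the network on port {port}")
```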

I'll be lazy and quote from the paper's abstract:

In High Performance Computing (HPC) environments, hundreds of users can be logged in and running batch jobs simultaneously on clusters of servers in a multi-user environment. Security controls may be in place for much of the overall HPC environment, but user network communication is rarely included in those controls. Some users run software that must listen on arbitrary network ports, exposing user software to attacks by others. This creates the possibility of account compromise by fellow users who have access to those same servers and networks. A solution was developed to transparently segregate users from each other both locally and over the network. The result is easy to install and administer.

Fixing bad T10-PI checksums

Background (in case you care)

Let's say you have some kind of hypothetical data disaster where you lose all redundancy in a RAID-6 pool through no fault or design decision of your own, maybe because of a firmware bug. Then let's say a hard drive physically bites the dust at the worst possible time.

Then let's imagine that you send the drive off to a data recovery firm that is able to recover 100% of the data (awesome!!!), but they don't know about T10-PI checksums and thus don't copy them onto the clone. Let's assume that due to various circumstances this was your one shot at it and they can't just reclone with T10-PI for some reason.

So then pretend that you get your 100% successful clone back, only to find out that your disk is unreadable in the array because the disk, which was correctly formatted with T10-PI Type 2 checksums, does not contain the correct checksums for each sector. Every read of every sector on the disk will fail.

So now your data is sitting there on the drive, just waiting for you to pull it off. Simple, right? Just find somewhere else you can read the data off by disabling the checksum verification. Well, let's say that for some reason your "RAID-6" pool is actually proprietary RAID-6 from the vendor, so you're stuck using their array to read the data and you can't disable T10-PI. Uh oh.
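For readers who have not met T10-PI before, the following Python sketch shows what the per-sector protection information looks like: an 8-byte tuple holding a 16-bit guard tag (a CRC of the sector data), a 16-bit application tag, and a 32-bit reference tag. It is only an illustration of the format, not the repair procedure from this post. The function names are made up here, the application tag of 0 is an assumption, and the reference tag is left as a caller-supplied parameter because how it is chosen depends on the protection type and the device.

```python
import struct

# CRC-16/T10-DIF parameters: polynomial 0x8BB7, initial value 0,
# no bit reflection, no final XOR.
T10_DIF_POLY = 0x8BB7


def crc16_t10dif(data: bytes) -> int:
    """Compute the 16-bit guard tag CRC over a block of sector data."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protection_info(sector: bytes, ref_tag: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard (2), application (2), reference (4).

    Hypothetical helper: app_tag=0 and the caller-chosen ref_tag are
    assumptions for illustration, not values taken from any real array.
    """
    guard = crc16_t10dif(sector)
    return struct.pack(">HHI", guard, app_tag, ref_tag & 0xFFFFFFFF)


if __name__ == "__main__":
    sector = bytes(512)  # an all-zero 512-byte sector
    print(protection_info(sector, ref_tag=5).hex())
```

A clone made without these tuples leaves every sector with a guard tag that does not match its data, which is why every read fails verification.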

pam_slurm_adopt

pam_slurm_adopt is a PAM module I wrote that adopts incoming ssh connections into the appropriate Slurm job on the node. The module allows Slurm to control ssh-launched processes as if they were launched under Slurm in the first place and provides for limits enforcement, accounting, and stray process cleanup when a job exits.

Some MPI implementations don't provide for Slurm integration or are not compiled with Slurm support, so the fallback is ssh. Some code doesn't even use MPI and instead directly calls ssh. The ideal solution is always to utilize properly-compiled MPI that supports Slurm, but realistically that's not going to happen all the time. That's where pam_slurm_adopt comes in.

As I see it, this PAM module solves three main problems. ssh-launched processes now:

Have proper limit enforcement
Will be accounted for in the accounting database
Will be cleaned up when the job exits

We have been using the code in production for more than three weeks now and things work great, except that a few minor bugs still need to be fixed to get all the intended benefits. See the "Inclusion in Slurm" section for details.

Caller ID: Handling ssh-launched processes in Slurm

I have thought long and hard about it, but I finally figured out how to handle ssh-launched processes in Slurm. ssh may be used to launch tasks by real MPI with no Slurm support or by "poor man's MPI" that just launches tasks using the only thing the developer knows how to use: ssh. Proper programs are written using MPI that is compiled with Slurm integration, but not all programs are proper. Accounting and resource allocation enforcement should soon work with these attempted escapees...

UPDATE May 4, 2015: The callerid code is now in Slurm. See "Status" section below for more information about the next steps.

UPDATE May 14, 2015: SchedMD added an "extern" step that is activated with PrologFlags=contain. This step will exist on all nodes in an allocation and will be the step that ssh-launched processes are adopted into.

UPDATE Oct 22, 2015: I finally got around to coding again and I got everything working! I submitted a patch. I'll probably write a new post that more accurately reflects the final state of the code.

UPDATE Nov 3, 2015: I made a minor update that was committed and will be in 15.08.3.

Fair Tree Slurm Fairshare Algorithm

That's right. Levi Morrison and I created a second Slurm fairshare algorithm, Fair Tree. Our first algorithm, LEVEL_BASED, was accepted into Slurm and became available in 14.11.0pre3 about one month ago. Fair Tree was accepted into Slurm in time for 14.11 and replaced LEVEL_BASED.

When given the same inputs, both algorithms produce effectively equivalent outputs. The objective of both algorithms is the same: If accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.
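One way to obtain that sibling-ordering property is to rank users with a depth-first walk of the account tree, visiting siblings in order of their shares-to-usage ratio. The Python sketch below is a simplified illustration of that idea and is not code from Slurm. The `Node` class, the function names, and the sample tree are all made up for this example, every leaf is assumed to be a user, and tie handling is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    shares: float
    usage: float = 0.0  # for accounts this is filled in from the children
    children: list = field(default_factory=list)


def total_usage(node):
    """Roll usage up the tree so each account holds the sum of its children."""
    if node.children:
        node.usage = sum(total_usage(child) for child in node.children)
    return node.usage


def level_fairshare(node, siblings):
    """Shares-to-usage ratio of a node, normalized among its siblings."""
    share_sum = sum(s.shares for s in siblings)
    usage_sum = sum(s.usage for s in siblings)
    norm_shares = node.shares / share_sum
    norm_usage = node.usage / usage_sum if usage_sum else 0.0
    return float("inf") if norm_usage == 0 else norm_shares / norm_usage


def rank_users(root):
    """Return {user: factor}, with factors in (0, 1] assigned by DFS order."""
    total_usage(root)
    ordered = []

    def visit(node):
        siblings = node.children
        best_first = sorted(
            siblings, key=lambda c: level_fairshare(c, siblings), reverse=True
        )
        for child in best_first:
            if child.children:
                visit(child)  # an account: rank its whole subtree first
            else:
                ordered.append(child.name)  # a user: take the next rank

    visit(root)
    count = len(ordered)
    return {name: (count - i) / count for i, name in enumerate(ordered)}


if __name__ == "__main__":
    tree = Node("root", 1, children=[
        Node("A", 1, children=[Node("a1", 1, 10), Node("a2", 1, 30)]),
        Node("B", 1, children=[Node("b1", 1, 100), Node("b2", 1, 1)]),
    ])
    print(rank_users(tree))
```

In the sample tree, user b2 has almost no usage of its own but still ranks below both users in account A, because account B as a whole has used far more than its sibling A.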

So why bother writing a new algorithm three months after the first one if the first algorithm successfully solved the same problems?

LEVEL_BASED Slurm Prioritization Algorithm

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED. To see why it is necessary, please see my other post about the problems with algorithms that existed at the time of its creation.

DEPRECATED: This has been deprecated by our new algorithm, Fair Tree. Yes, we really did replace this algorithm within a few months even though it worked great for us. See the post about Fair Tree for details.

Problems with Existing Slurm Prioritization Methods

UPDATE: LEVEL_BASED was replaced by Fair Tree, an even better algorithm that we created.

Levi Morrison and I have co-authored a new prioritization mechanism for Slurm called LEVEL_BASED. In order to understand why LEVEL_BASED is necessary, I have chosen to write this post about our issues with the existing options and a separate post about LEVEL_BASED. If you just want to see information about LEVEL_BASED, see the post LEVEL_BASED Slurm Prioritization Algorithm.

We want users from an account that has higher usage to have lower priority than users from an account with lower usage. There is no existing algorithm that consistently does this.

Job Script Generator for Slurm and PBS published on GitHub

We published version 2.0 of our batch job script generator on GitHub. It is a JavaScript library (LGPLv3) that allows users to learn Slurm and PBS syntax by testing various inputs in an easy-to-understand manner. Links: git repo, demo, other GitHub projects of ours.

Thursday, April 17, 2014

Scheduler Limit: Remaining Cputime Per User/Account

Update May 9, 2014: Added a link to our GrpCPURunMins Visualizer

I have discussed this with several interested people recently so it's time for me to write it up. When running an HPC batch job scheduling system such as Slurm, Moab, Maui, or LSF, there are many ways to configure user limits. Some of the easiest limits to understand are on the number of jobs a user can run or the maximum cores or nodes that they can use. We have used a different limit for several years now that is worth sharing.

No one likes to see a cluster that is 60% utilized while users sit in the queue, unable to run because of a core count limit they are hitting. Likewise, at a site with no user limits, only the lucky user enjoys being able to fill up a cluster with $MAX_WALLTIME-day jobs during a brief lull in usage. Obviously, other users are displeased when they submit jobs five minutes later that must now wait for all of that user's jobs to finish in $MAX_WALLTIME days. This is typically solved by limiting the core or node count per user/account, but we use a limit that vastly improves the situation.
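As a rough illustration of a remaining-cputime limit, the Python sketch below sums cores times remaining minutes over a user's running jobs and only lets a new job start if the total stays under a cap. This is a sketch under stated assumptions, not the scheduler's implementation: the `Job` class, the function names, and the 100,000 core-minute cap are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cores: int
    remaining_minutes: int  # time limit minus elapsed time


def cpu_run_mins(running_jobs):
    """Total core-minutes the running jobs could still consume."""
    return sum(job.cores * job.remaining_minutes for job in running_jobs)


def can_start(running_jobs, candidate, limit):
    """True if starting the candidate keeps the total within the limit."""
    needed = candidate.cores * candidate.remaining_minutes
    return cpu_run_mins(running_jobs) + needed <= limit


if __name__ == "__main__":
    LIMIT = 100_000  # hypothetical cap in core-minutes

    week_long = Job(cores=64, remaining_minutes=7 * 24 * 60)
    one_hour = Job(cores=64, remaining_minutes=60)

    print(can_start([], week_long, LIMIT))           # long job exceeds the cap
    print(can_start([one_hour] * 25, one_hour, LIMIT))  # many short jobs fit
```

The effect is that a user running short jobs can occupy many cores at once, while a user running long jobs is held to fewer cores, so a cluster filled during a lull drains quickly when others show up.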

tech.ryancox.net

Friday, January 21, 2022

Network Segmentation of Users on Multi-User Servers and Networks

Problem: Potential attack vector

Friday, December 18, 2020

Fixing bad T10-PI checksums

Background (in case you care)

Monday, November 16, 2015

pam_slurm_adopt

Wednesday, April 1, 2015