Early last year, I finished a paper for the Master of Science in Information Security Engineering program at SANS Technology Institute. The paper was published at SANS.org and SANS.edu a year ago, but I am only now writing about the project here because life was pretty busy at the time.
I spent several years thinking through this problem and trying different technologies to make it work. I then spent a few months finalizing the solution and writing it up for the academic paper (during which time I also had to, um... think about... this totally hypothetical disaster for several months...).
Rather than rewriting my paper as a blog post, I will include some of the most important parts of the paper, add some additional commentary, then refer you to the paper as needed. If you are actually interested in this solution, please read the paper since it has many more details.
Problem: Potential attack vector
As seen in real life on an HPC login node:
tcp 0 0 0.0.0.0:48279 0.0.0.0:* LISTEN 123965/mpiexec
tcp 0 0 0.0.0.0:6104 0.0.0.0:* LISTEN 107345/Xorg
tcp 0 0 0.0.0.0:39000 0.0.0.0:* LISTEN 213053/node
tcp 0 0 0.0.0.0:39001 0.0.0.0:* LISTEN 212528/mongod
tcp 0 0 0.0.0.0:39002 0.0.0.0:* LISTEN 212606/python
tcp 0 0 0.0.0.0:39003 0.0.0.0:* LISTEN 212788/python
tcp 0 0 0.0.0.0:39004 0.0.0.0:* LISTEN 212879/python
tcp 0 0 0.0.0.0:36846 0.0.0.0:* LISTEN 60839/java
tcp 0 0 127.0.0.1:63342 0.0.0.0:* LISTEN 60839/java
tcp 0 0 0.0.0.0:45621 0.0.0.0:* LISTEN 60839/java
tcp 0 0 0.0.0.0:43727 0.0.0.0:* LISTEN 6473/mpiexec
tcp 0 0 0.0.0.0:56733 0.0.0.0:* LISTEN 219999/mpiexec
tcp 0 0 0.0.0.0:33374 0.0.0.0:* LISTEN 175150/mpiexec
tcp 0 0 0.0.0.0:37987 0.0.0.0:* LISTEN 199971/mpiexec
tcp 0 0 0.0.0.0:44228 0.0.0.0:* LISTEN 123965/mpiexec
tcp 0 0 0.0.0.0:52294 0.0.0.0:* LISTEN 264169/mpiexec
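These processes were all owned by various HPC users, and there were probably dozens of people logged into that system at that moment. All but one of the sockets listen on 0.0.0.0, which means every network interface, so unless something else blocks the connection they can be reached by anyone with access to that node or its networks.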
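To get a quick sense of how exposed a node like this is, the listing above can be filtered for listeners bound to every interface. The following is only a minimal sketch in Python, not something taken from the paper: it assumes the seven-column `netstat -tlnp` style layout shown above, and the function name `exposed_listeners` and the `SAMPLE` text (three lines copied from the listing) are hypothetical names used for illustration.

```python
# Three lines copied from the listing above; SAMPLE is an illustrative name.
SAMPLE = """\
tcp 0 0 0.0.0.0:39001 0.0.0.0:* LISTEN 212528/mongod
tcp 0 0 127.0.0.1:63342 0.0.0.0:* LISTEN 60839/java
tcp 0 0 0.0.0.0:48279 0.0.0.0:* LISTEN 123965/mpiexec
"""

# Wildcard addresses: all IPv4 interfaces and all IPv6 interfaces.
WILDCARD_ADDRESSES = {"0.0.0.0", "::"}


def exposed_listeners(text):
    """Return (port, pid, program) for TCP listeners bound to all interfaces.

    Assumes each line has the columns:
    proto, recv-q, send-q, local address, foreign address, state, pid/program.
    """
    results = []
    for line in text.splitlines():
        # Split into at most 7 fields so a program name with spaces stays whole.
        fields = line.split(None, 6)
        if len(fields) < 7:
            continue
        if not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # Split on the last colon so IPv6 forms such as ":::39001" also work.
        address, _, port = fields[3].rpartition(":")
        if address not in WILDCARD_ADDRESSES or not port.isdigit():
            continue
        pid_text, _, program = fields[6].partition("/")
        # netstat prints "-" when the owning process is not visible to the caller.
        pid = int(pid_text) if pid_text.isdigit() else None
        results.append((int(port), pid, program.strip()))
    return results


if __name__ == "__main__":
    for port, pid, program in exposed_listeners(SAMPLE):
        print(f"port {port} is open on all interfaces (pid {pid}, {program})")
```

The java listener on 127.0.0.1 is skipped because it only accepts connections from the node itself. On a shared login node, though, other local users can still reach that loopback port, so this filter understates the exposure rather than overstating it.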
I'll be lazy and quote from the paper's abstract:
In High Performance Computing (HPC) environments, hundreds of users can be logged in and running batch jobs simultaneously on clusters of servers in a multi-user environment. Security controls may be in place for much of the overall HPC environment, but user network communication is rarely included in those controls. Some users run software that must listen on arbitrary network ports, exposing user software to attacks by others. This creates the possibility of account compromise by fellow users who have access to those same servers and networks. A solution was developed to transparently segregate users from each other both locally and over the network. The result is easy to install and administer.