Cluster Computing Updates

Hot off the press! Advanced Computing Facility Cluster news.

In the past two months, several users have noticed that some jobs they submit take a very long time to start. Some have noticed that the same job will fail or succeed apparently at random.

This came to a head on Tuesday; we've exerted ourselves and now have a mostly understandable set of answers. I promised news as soon as I understood it, and here it is. We will have most of this written up in detail in the hpcexample collection (https://gitlab.crmda.ku.edu/crmda/hpcexample) within a week.

To explain the intermittent problems, there are several separate items I need to mention. There are several balls in the air. Or are they spinning plates?

1. Our nodes are shared, to a certain extent.

That's the "community" in community cluster.

I asked for our nodes to be reserved so we can get quick access to them, but there is a detail I did not understand.

Our nodes in the cluster--24 systems, each with 20 cores--are open to short (6 hours or less) jobs from other users in the cluster. If one of those other users becomes active, submitting thousands of jobs, then our jobs can wait as long as 6 hours before they start. Our jobs have "first priority" when currently running jobs end.

While it is technically possible to block those other users, I believe it would be impolite to do so at the current moment. We may revisit that decision in the future.

2. We are asking for resources in the wrong way.

We have been following principles that were established in the "hpc.crmda.ku.edu" cluster we used previously. That cluster was only for CRMDA and its affiliates; it did not serve as large an audience as the current cluster does.

In hpc, if we wanted 19 cores, we were told to let the cluster scatter them across nodes if it wanted. The submission for a 19-core job would read like this:

#PBS -l nodes=19:ppn=1

However, in the new cluster that is more likely to cause a total failure, because our jobs can get scattered across many different nodes on which strangers are also running jobs.

Those strangers might not be well behaved. I mean, their jobs might not be well behaved. They may request 2GB of memory but then actually use 20GB. By gobbling up all of the memory on a node, they cause our jobs to fail.

Hence, if we want a job using 19 cores, in the brave new world it is smarter and faster to run

#PBS -l nodes=1:ppn=19:ib

That is to say, if you have the whole node, or most of its cores, you are protected from bad strangers. In the old hpc cluster, that would have made a job slower to start. In the new cluster, it is the best bet.

I inserted that special bit, ":ib", and the next sections explain why.

We are getting revised, detailed instructions on how to ask for more cores, and how to specify memory. I'll follow up with information about that later.
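
Until those instructions arrive, here is a rough sketch of what a whole-node submission script along these lines might look like. Treat it as an illustration only: the job name, walltime, memory request, and R script name are placeholders I made up, not official recommendations.

#!/bin/sh
# Sketch of a whole-node request on our reserved (Infiniband) nodes.
#PBS -N whole_node_example
#PBS -l nodes=1:ppn=19:ib
#PBS -l walltime=24:00:00
#PBS -l pmem=2gb

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch R under MPI; "my_parallel_job.R" stands in for your own script.
mpiexec -n 1 R --vanilla -f my_parallel_job.R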

3. Please understand this heterogeneous "transport layer" problem.

In the old hpc cluster, about 1/3 of the nodes had Infiniband. Jobs would fail all the time because of that diversity: nodes that had Infiniband could not interact with nodes that did not. We learned to insert ":noib" to prevent the Infiniband layer from being used. All of our nodes had Ethernet connections, so they could all use that slightly slower method.

In the new cluster, we thought all nodes had Infiniband, so we told people to get rid of ":noib". But some people did not get the message.

But now it appears that, in the new cluster, it is not true that all nodes have Infiniband. Possibly the slow/bad ones don't. Keep that in mind as you read the next point.
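
If you want to check for yourself whether a given node advertises Infiniband, and assuming the node features show up as Torque properties (an assumption on my part, not something I have verified for every node), you can ask the scheduler from a login node:

# Show the properties advertised for one node. "n301" is just an
# example hostname borrowed from the error output in point 5 below.
pbsnodes n301 | grep properties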

4. In the new cluster, we are asking for resources in an especially stupid way (as a result of the issue in part 3).

Some of us are still using submission scripts from the old hpc cluster, where we had ":noib":

#PBS -l nodes=19:ppn=1:noib

Here's why that is especially wrong. All of the nodes that CRMDA controls now DO have Infiniband, so if you insert ":noib", you are PREVENTING the job from starting on the nodes on which we actually have a reservation. That means your jobs are orphaned in the cluster, waiting for a kind stranger to take them in.

That stranger node that lets you in is likely to be a slow, old one, which explains the "my job takes forever" problem you have noted recently.

Instead, FOR SURE, we need

#PBS -l nodes=19:ppn=1:ib
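
If you are not sure whether any of your old submission scripts still carry ":noib", a quick search will tell you. The directory name here is only a placeholder for wherever you keep your scripts:

# Find submission scripts that still request :noib.
grep -rn "noib" ~/scripts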

5. Heterogeneous error type 1.

Suppose you don't put ":ib". You launch the job and ask for 50 cores. If the first node has Infiniband but an available node it finds does not, then you get this crash.

This should look familiar to a few of you.

If the node chosen to be your Rmpi master is on Infiniband, and it finds other nodes that are not, we see a crash like so:

  > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
[[41508,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
   Host: n301

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 22368 on node n301 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
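
One way to see whether this is what happened to your job is to have the submission script record which nodes the job landed on, so you can check afterwards whether they were a mismatched set. A minimal sketch, relying on the $PBS_NODEFILE and $PBS_JOBID variables the scheduler sets for each job:

# Record the nodes assigned to this job in the job's output file.
echo "Nodes assigned to job $PBS_JOBID:"
sort -u $PBS_NODEFILE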

6. We need to get better at module and R package management.

User scripts can specify which modules to load. That's possible, and it will fix this problem.

But it looks like an uphill battle for me to keep all of that straight among all of the different users. We'll try to handle it on our end so you don't have to bother.

You'll know if you hit this problem.

Some jobs fail like this:

 > library(snow)
 >
 > p <- rnorm(123, m = 33)
 >
 > CLUSTERSIZE <- 18
 >
 > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 110202 on node n404 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

That's caused by a shared library version mismatch.

The R 3.2.3 packages are built against the OpenMPI C library. When I built those packages, the OpenMPI version was 1.6.5.

Since then, OpenMPI has had updates, but I did not rebuild the packages. I thought the R-3.2.3 libraries and setup were going to remember OpenMPI 1.6.6.

However....

Our default module, Rstats, did not connect R-3.2.3 with the matching OpenMPI version. Thus, sometimes, when a job would run and use a particular OpenMPI feature, we had a compiled library mismatch.

We can fix that ourselves by altering the submission script, say by specifying

module purge
module load openmpi/1.6.5
module load Rstats/3.2.2
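
For context, here is a sketch of how those module lines fit into a submission script, combined with the whole-node request from point 2 above; the R script name is again a placeholder:

#PBS -l nodes=1:ppn=19:ib

# Pin the OpenMPI version to the one the R packages were built against.
module purge
module load openmpi/1.6.5
module load Rstats/3.2.2

cd $PBS_O_WORKDIR
mpiexec -n 1 R --vanilla -f my_parallel_job.R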

However, I don't want us to have to be that detailed. The fix, which will be in place very soon (by Monday for sure, maybe sooner), is:

1. The cluster admin will tie together the R choice with the OpenMPI choice.

2. I'll rebuild the R packages for R-3.3.1 with the newer OpenMPI, and then the admins will update the default module for crmda_remote users (that's you).
