Cluster Computing Updates

Hot off the press! Advanced Computing Facility Cluster news.

In the past 2 months, several users have noticed that some jobs they submit take a long, long time to start. Some have noticed that the same job will fail or succeed apparently at random.

This came to a head on Tuesday; we've exerted ourselves and now have a mostly understandable set of answers. I promised news as soon as I understood it, and here it is. We will have most of this written up in detail in the hpcexample collection (https://gitlab.crmda.ku.edu/crmda/hpcexample) within a week.

To explain the intermittent problems, there are several separate items I need to mention. There are several balls in the air. Or are they spinning plates?

1. Our nodes are shared, to a certain extent.

That's the "community" in community cluster.

I asked for our nodes to be reserved so we can get quick access to them, but there is a detail I did not understand.

Our nodes in the cluster--24 systems, each with 20 cores--are open to short (6 hours or less) jobs from other users in the cluster. If one of those other users gets active, submitting 1000s of jobs, then our jobs wait as long as 6 hours before they start. Our jobs have "first priority" when currently running jobs end.

While it is technically possible to block those other users, I believe it would be impolite to do so at the current moment. We may revisit that decision in the future.

2. We are asking for resources in the wrong way.

We have been following principles that were established in the "hpc.crmda.ku.edu" cluster we used previously. That cluster was only for CRMDA and its affiliates; it did not serve such a large audience as the current cluster.

In hpc, if we wanted 19 cores, we were told to allow the cluster to scatter them across nodes if it wanted. Submission for a 19-core job would read like this:

#PBS -l nodes=19:ppn=1

However, in the new cluster that's more likely to cause a total failure, because our jobs can get scattered across a lot of different nodes on which strangers are also running jobs.

Those strangers might not be well behaved. I mean, their jobs might not be well behaved. They may request 2GB memory, but then actually use 20GB memory. By gobbling up all memory, they cause our jobs to fail.

Hence, if we want a job using 19 cores, in the brave new world it is smarter and faster to run

#PBS -l nodes=1:ppn=19:ib

That is to say, if you have the whole node, or most of its cores, you are protected from bad strangers. In the old hpc cluster, that would have made a job slower to start. In the new cluster, it is the best bet.

I inserted that special bit, ":ib", and the next sections explain why.

We are getting revised, detailed instructions on how to ask for more cores, and how to specify memory. I'll follow up with information about that later.
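In the meantime, here is a minimal sketch of a whole-node submission script, assembled from the directive above and the module fix described in item 6 below. The job name and walltime are placeholders you would adjust, and the last line assumes the Rmpi/snow pattern shown later, where R is launched once and spawns its own workers. Treat this as a sketch, not an official template.

#!/bin/sh
## Job name (placeholder)
#PBS -N myjob
## One whole node, 19 cores, Infiniband
#PBS -l nodes=1:ppn=19:ib
## Placeholder walltime; set it to your expected run time
#PBS -l walltime=24:00:00

## Pin the module versions (see item 6 below)
module purge
module load openmpi/1.6.5
module load Rstats/3.2.3

## Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
## Launch R once; Rmpi/snow spawns the workers from inside R
mpiexec -n 1 R --vanilla -f myjob.R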

3. Please understand this heterogeneous "transport layer" problem.

In the old hpc cluster, only about 1/3 of the nodes had Infiniband. Jobs would fail all the time because of that diversity: nodes that had Infiniband could not interact with nodes that did not. We learned to insert ":noib" to prevent the Infiniband layer from being used. All of our nodes had ethernet connections, so they could all use that slightly slower method.

In the new cluster, we thought all nodes had Infiniband. So we told people "get rid of :noib". But some people did not get the message.

But now it appears that, in the new cluster, it is not true that all nodes have Infiniband. Possibly the slow/bad ones don't. Keep that in mind as you read the next point.

4. In the new cluster, we are asking for resources in an especially stupid way (as a result of the issue in part 3).

Some of us are still using submission scripts from the old hpc cluster, where we had ":noib".

#PBS -l nodes=19:ppn=1:noib

Here's why that is especially wrong. All of the nodes that CRMDA controls now DO have Infiniband, so if you insert ":noib", you are PREVENTING the job from starting on the nodes on which we actually have a reservation. That means your jobs are orphaned in the cluster, waiting for a kind stranger to take them in.

The stranger node that lets you in is likely to be a slow, old one. That explains the "my job takes forever" problem you have noted recently.

Instead, FOR SURE, we need

#PBS -l nodes=19:ppn=1:ib
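One way to confirm where a submitted job actually landed is qstat's node listing. These are standard Torque commands; the job id below is made up, and yours comes back from qsub:

qsub myjob.sh      ## returns a job id, for example 123456
qstat -n 123456    ## the -n flag lists the nodes assigned to the job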

5. Heterogeneous error type 1.

Suppose you don't put ":ib". You launch the job and ask for 50 cores. If the node chosen to be your Rmpi master is on Infiniband, but some of the nodes it finds are not, we see a crash like the one below. This should look familiar to a few of you:

  > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
[[41508,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
   Host: n301

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 22368 on node n301 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

6. We need to get better at module and R package management.

User scripts can specify modules; doing so will fix this.

But it looks like an uphill battle for me to keep all of that straight among all of the different users. We'll try to handle it on our end so you don't have to bother.

You'll know if you hit this problem.

Some jobs fail like this:

 > library(snow)
 >
 > p <- rnorm(123, m = 33)
 >
 > CLUSTERSIZE <- 18
 >
 > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 110202 on node n404 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

That's caused by a shared library version mismatch.

The R 3.2.3 packages were built against the OpenMPI C library. When I built those packages, the OpenMPI version was 1.6.5.

Since then, OpenMPI has had updates, but I did not rebuild the packages. I thought the R-3.2.3 libraries and setup were going to remember OpenMPI 1.6.5.

However....

Our default module, Rstats, did not tie R-3.2.3 to a particular OpenMPI version. Thus, sometimes, when a job used a particular OpenMPI feature, we had a compiled-library mismatch.

We can fix that ourselves by altering the submission script, say by specifying

module purge
module load openmpi/1.6.5
module load Rstats/3.2.3

However, I don't want everybody to have to be that detailed. The fix for that, which will be in place very soon (by Monday, for sure, maybe sooner), is:

1. The cluster admin will tie together the R choice with the OpenMPI choice.

2. I'll rebuild the R packages for R-3.3.1 with the newer OpenMPI, and then admins will update the default module for crmda_remote users (that's you).
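Until the new modules land, here is a quick way to double-check what a submission script will actually use. These are standard module and Open MPI commands; the versions named in the comments are what we expect, not guaranteed output.

module list         ## confirm openmpi/1.6.5 and Rstats are loaded
which mpirun        ## should point into the openmpi/1.6.5 installation
mpirun --version    ## Open MPI prints its version on the first line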


kutils Package for R: New Updates Available

The KRAN server, an R repository hosted by the Center for Research Methods and Data Analysis at the University of Kansas, offers packages being prepared for researchers who use R. A suite of tools for research project management, dubbed "kutils", is undergoing rapid development. This includes functions that can initialize projects, scan input data, create quick overviews of the information in the data, and guide the recoding and data refactoring process. The package includes the Variable Key System, a framework to import and revise data in a team/project oriented manner. An essay about this is available (and also is included with the package): The Variable Key Data Management Framework.

In case you might like to try this out, the KRAN server is available now. It will be necessary to pair it with a general-purpose CRAN server (and any other repositories you currently use, such as OmegaHat or Bioconductor). We suggest trying this R code:

CRAN <- "https://rweb.crmda.ku.edu/cran"
KRAN <- "https://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
## We suggest installing updates, but this next step is not required
update.packages(ask = FALSE, checkBuilt = TRUE)
## Then install our new package
install.packages("kutils", dep = TRUE)

In case you use Bioconductor, for example, here is the way we integrate it with the update process:

CRAN <- "https://rweb.crmda.ku.edu/cran"
KRAN <- "https://rweb.crmda.ku.edu/kran"
BIOC <- "https://www.bioconductor.org/packages/3.3/bioc"
options(repos = c(KRAN, CRAN, BIOC))
## We suggest installing updates, but this next step is not required
update.packages(ask = FALSE, checkBuilt = TRUE)
install.packages("kutils", dep = TRUE)

After running that, using kutils is as easy as:

library(kutils)

The functions that we suggest you check first are peek and initProject. We include with this package the first fully functional version of the Variable Key, a custom data management process developed within CRMDA. The variable key offers an enhanced project management framework along with an easier-to-use, tabular system of notation that makes it easier for non-technicians to guide and supervise research exercises. There is a vignette about the Variable Key provided with the package. After library(kutils), just run

vignette("variablekey")

and a PDF should be displayed. If your system's PDF viewer can't be found by R, you'll get an error message that points you in the right direction. The most recent changes in the package concern the Variable Key. We have streamlined the "round trip" research process. A researcher should approach data importation and re-coding in these steps.

  1. Run keyTemplate. This scans the data and creates a table that can be used for recoding.
  2. Edit the key template document. This can be done in a spreadsheet program (MS Excel) or a text editor such as Emacs, Notepad++, Sublime Text, or Textmate (basically, any programmer's file editor, NOT Microsoft Word).
  3. Run keyImport. This reviews the requested data revisions from the revised template.
  4. Run keyApply. The requested data changes in the new key are applied to the data frame being considered.
  5. Run keyUpdate. This scans the new data, checks for new variables and new values, and then incorporates them into the previously prepared variable key.

In our original design, we thought the four-step process (steps 1-4) was the end of this. However, we have run into a few cases in which the Variable Key system exposes problems in the original data frame that cause the data owners to revise their data frame. Our original thought was that the teams revising the data would repeat the original four-step process: create a new key template, revise it, import it, and apply it. However, when the key includes 100s of variables, this implies a lot of repeated work. In the most recent version of kutils, the new function keyUpdate adds the fifth step above to address this situation; a sketch of the full round trip follows.
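Here is a minimal sketch of that five-step round trip. The data frames dat and dat2 and the file name key.csv are hypothetical, and argument details may differ slightly in your installed version of kutils:

library(kutils)

## Step 1: scan the original data and write a key template to edit
keyTemplate(dat, file = "key.csv")

## Step 2 happens outside R: edit key.csv in a spreadsheet or text editor

## Steps 3 and 4: import the edited key, then apply it to the data
key <- keyImport("key.csv")
dat.clean <- keyApply(dat, key)

## Step 5: when the data owners deliver revised data (dat2), update the
## existing key instead of starting over
key2 <- keyUpdate(key, dat2)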

While keyUpdate is the newest function in a development package, we encourage researchers to try it and let us know how it works. For troubleshooting purposes, the sessionInfo output appears below.

Update 2016-10-28. A zip file variablekey-anes-20161028 is available with an example that we have used to test the variable key setup. This revealed some challenges with "fancy quotes" that we need to solve in the future. If a person edits the key and inserts fancy slanted quotes, then the re-import process fails because slanted quotes are unrecognized.

> sessionInfo()
    R version 3.3.1 (2016-06-21)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.1 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base
    
    other attached packages:
    [1] kutils_0.31
    
    loaded via a namespace (and not attached):
    [1] plyr_1.8.4     tools_3.3.1    Rcpp_0.12.7    xtable_1.8-2   openxlsx_3.0.0

Best of Lawrence Chooses CRMDA

The CRMDA team has been selected for this prestigious honor.


PSPP, Git LFS Implemented

Are we more excited to have PSPP or Git with LFS support? Let's have a show of hands!

PSPP is the GNU project's version of a user-friendly statistical package for the social sciences. It falls in line with the other easy-to-use GNU stats program, GRETL.

The other big news is that our Gitlab server has been upgraded to allow Git LFS (Large File Storage). As many of you know, Git is intended to track changes in text files. When the files being tracked are "binary" files, say photos or movies, Git cannot track their changes at all. Instead, it must save the entire file--every copy of every revision. In our guides repository, which is now mirrored online at https://crmda.dept.ku.edu/guides, this was causing Git to become slower and slower.
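For repositories you manage yourself, the basic Git LFS setup is only a few commands (the *.png pattern is just an example):

git lfs install             ## one-time setup per machine
git lfs track "*.png"       ## route PNG files through LFS
git add .gitattributes      ## the tracking rules live in this file
## after that, add and commit binary files as usual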

As of October 6, 2016, we have transitioned to Git LFS. We are holding a training workshop on October 7, 2016, at 3pm in Watson Library Room 455 if you are interested. If you can't attend, you can get the big picture by watching these 2 videos on YouTube:

What is Git LFS (by Tim Peterson)

I thought that one was more informative than the 2-minute presentation from the Git LFS team, but that one is also nice.

Git Large File Storage - How to Work with Big Files (GitHub Training Team)


Guides Going Online!

We are mirroring our guides folder on the web server, and will soon have documents linked from specific points of interest to these pages.

But don't feel bashful: you can browse our folders if you want! There's a Google search bar built into the page. Try http://crmda.dept.ku.edu/guides.

Our tech support pages will still link to these guides, so the flavor of https://crmda.ku.edu/guides should not change too much.
