R modules: Super Exciting New Updates

This is revised Monday, July 24, 2017.

Some of you have reported segmentation faults during the past week. We learned they come from 3 different problems. First, some people have R packages compiled in their user accounts. These fall out-of-date with the R packages we provide, causing incompatibility. Second, some new compute nodes came online during the past 2 weeks and some are missing support libraries. When these are missing, the R packages that rely on them (such as our beloved kutils or rockchalk) fail to load. This was a tricky problem because it only happened on some nodes, which only recently became available. Third, I did not understand the gravity and drama involved with the user account setup and the Rmpi package.

Let's cut to the chase. What should users do now?

Step 1. Remove module statements from submission scripts.

Those statements are not having the anticipated effect, and they will destroy the benefits of the changes I suggest next.

I'm told this problem does not affect all MPI jobs, just ones that use R and the style of parallelization that we understand.
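To find submission scripts that still contain module statements, a small search along these lines may help. This is only a sketch: the helper name list_module_lines is made up for illustration, and it assumes the module commands sit at the start of a line, possibly after some spaces. Commented-out lines are skipped.

```shell
# list_module_lines is a hypothetical helper, not a cluster-provided tool.
# Usage: list_module_lines FILE...
# Prints file:line:text for each line that begins with a "module" command.
# Returns non-zero when no such line is found.
list_module_lines() {
    grep -nHE '^[[:space:]]*module[[:space:]]+(purge|load|use|unload|swap)' "$@"
}
```

For example, running list_module_lines on every submission script in a project folder should print nothing once Step 1 is finished.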

Step 2. Configure your individual user account to load appropriate modules.

Some modules should be available for every session launched for your account, in every node. These have to be THE SAME in all nodes and cores launched by the job. There are 2 ways to get this done.

Option 1. The easy way: Use my R package module stanza, crmda_env.sh

In the cluster file system, I have a file /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh with contents like this:


module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
export OMPI_MCA_btl

I say "like this" because I may insert new material there. The last 2 lines were inserted July 22, 2017. The goal is to conceal all of the details from users by putting them in a module that's loaded, such as Rstats/3.3. When we are ready to transition to R-3.4, I'll change that line accordingly.

In your user accounts, there are 2 files where you can incorporate this information: ~/.bashrc and ~/.bash_profile. Add a last line in those files like this:

source /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh

I'll show you my ~/.bashrc file so you can see the larger context:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export LS_COLORS=$LS_COLORS:'di=0;33:'
# alert for rm, cp, mv
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# color and with classification
alias ls='ls -F --color=auto'
alias ll='ls -alF'

source /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh

I strongly urge all of our cluster users to include the "alert for rm, cp, mv" piece. This causes the system to ask for confirmation before deleting or replacing files. But that's up to you. I also have an adjustment to the colors of the directory listing.

I insert the same "source" line at the end of ~/.bash_profile as well.
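If you would rather not edit the two files by hand, the same result can be sketched with a small shell function. The name ensure_line is hypothetical (it is not something installed on the cluster). The idea is to append the line only when it is not already there, so running it twice does no harm.

```shell
# ensure_line is a hypothetical helper name, not a cluster-provided tool.
# Usage: ensure_line FILE LINE
# Appends LINE to FILE unless an identical line is already present.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if grep -qxF -- "$line" "$file"; then
        return 0
    fi
    # If the file does not end in a newline, add one first so the
    # new line does not get glued onto the previous one.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf '%s\n' "$line" >> "$file"
}
```

Calling ensure_line once with ~/.bashrc and once with ~/.bash_profile, each time with the "source" line shown above as the second argument, should leave both files with exactly one copy of that line.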

On 2017-07-23, I made a minor edit in my .bashrc and .bash_profile files:

export PATH=/panfs/pfs.local/work/crmda/tools/bin:$PATH
source crmda_env.sh

This is equivalent, but gives me a side benefit. Instead of adding the source function with the full path, I inserted that bin folder into my path. That means I can use any script in that folder without typing out the full path. When I find very handy shell scripts that I use often, and I think the other users should have access to them as well, then I will put them in that folder. For example, if you look there today, you should see "crmda_env-test.sh", which is the new one I'm working on. When that's ready, it will become "crmda_env.sh" and the old one will get renamed as "crmda_env-2017xxxx.sh", where xxxx is the date on which it becomes the old one.
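One small wrinkle with the PATH approach: if the export line ends up being run more than once in the same session (say, because one startup file sources the other), the same folder gets stacked onto PATH repeatedly. That is harmless but untidy. A guard like the following avoids it. This is a sketch, and path_prepend is a name I made up for illustration.

```shell
# path_prepend is a hypothetical helper name.
# Usage: path_prepend DIR
# Puts DIR at the front of PATH unless PATH already contains it.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}
```

In a startup file, path_prepend /panfs/pfs.local/work/crmda/tools/bin followed by "source crmda_env.sh" would then have the same effect as the two lines shown above.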

Option 2. Add your own module statements in ~/.bashrc and ~/.bash_profile

Make sure you put the same modules in both ~/.bashrc and ~/.bash_profile. Look at the file /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh to get ideas of what you need. For example, run

$ cat /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh

You might consider creating a file similar to /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh in your account. Then source that at the end of your ~/.bashrc and ~/.bash_profile. If you do that, they will always stay consistent.

Frequently Asked Questions that I anticipate

Can you explain why the segmentation fault happens?

Answer: Yes, I have some answers.

Here is the basic issue. Suppose you have a submission script that looks like this:

#MSUB -N RParallelHelloWorld 
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -M your-name-here@ku.edu
#MSUB -m bea


module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.3

mpiexec -n 1 R --vanilla -f parallel-hello.R 

I thought we were supposed to do that, until last week. Here's what is wrong with it.

The environment specifies Rstats/3.3, but that ONLY applies to the "master" node in the R session. It does not apply to the "child" nodes that are spawned by Rmpi. When those nodes are spawned, they are completely separate shell sessions and they are launched by settings in ~/.bash_profile. If your ~/.bash_profile does not have the required modules, then the new nodes are going to have the system default R session, and guess what you get with that? The wrong shared libraries for just about everything. Possibly you get a different version of Rmpi or Rcpp loaded, and when the separate nodes start talking to each other, they notice the difference and sometimes crash.

As a result, the submission scripts, for example, in hpcexample/Ex65-R-parallel, will now look like this:

#MSUB -N RParallelHelloWorld
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -M pauljohn@ku.edu
#MSUB -m bea


## Please check your ~/.bash_profile to make sure
## the correct modules will be loaded with new shells.
## See discussion:
## http://www.crmda.dept.ku.edu/timeline/archives/184

mpiexec -n 1 R --vanilla -f parallel-hello.R

Why is this needed for both ~/.bashrc and ~/.bash_profile?

Answer: You ask a lot of questions.

The short answer is "there's some computer nerd detail". The long answer is: when you log in on a system, the settings in ~/.bash_profile are used. That is a "login shell". If you are in already, and you run a command that launches a new shell inside your session, for example by running "bash", then your new shell is not a "login shell". It will be created with settings in ~/.bashrc.

If you will never run an interactive session, never interact with R via Emacs or RStudio, then it might be enough to change ~/.bash_profile. If you think you might ever want to log in and run a small test case, then you should have the same settings in both ~/.bashrc and ~/.bash_profile.
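If you want to see the login versus non-login distinction for yourself, bash keeps track of it in a read-only shell option named login_shell. The sketch below starts a new bash and asks it how it classifies itself. The function name report_shell_kind is made up for this example. It only reports the classification; it does not show which startup files were read.

```shell
# report_shell_kind is a hypothetical helper name.
# It starts a new bash with whatever flags are passed in and asks that
# shell whether it considers itself a login shell.
report_shell_kind() {
    bash "$@" -c 'if shopt -q login_shell; then echo login; else echo non-login; fi'
}

report_shell_kind       # like running "bash": prints non-login
report_shell_kind -l    # like logging in: prints login
```

If your startup files print messages of their own, those will appear before the "login" line in the second call.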

What are the benefits of Option 1?

Answer: Over time, the CRMDA R setup may evolve. Right now, I've already built a setup Rstats/3.4. After we do some bug-testing, then I can easily update the shell file (/panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh) and use that. If you maintain your own modules, then you have to do that yourself.

What are the dangers of Option 1?

Answer: If I get it wrong, then you get it wrong.

Does this mean you need to revise all of the code examples in the hpcexample (https://gitlab.crmda.ku.edu/crmda/hpcexample) set?

Answer: Yes. It has not been a good week. And it looks like it won't be a good week again.

Why didn't we hear about this in the old community cluster, or in CRMDA's HPC cluster?

Answer: Because "we" were in control of the cluster settings and user accounts, the cluster administrators would work all of this out for us and they inserted the settings in the shell for us. Some of you may open your ~/.bashrc or ~/.bash_profile and see the old cluster settings. When I opened mine on 2017-07-07, I noticed that I had modules loaded from the old cluster. I also noticed I'd made an error of editing ~/.bashrc and not ~/.bash_profile.

Why didn't we see these problems before?

Answer: Dumb luck.

In the new CRC-supervised cluster, some modules are loaded automatically. Because those modules were more-or-less consistent with what we need to do, the different environments were not causing segmentation faults. However, when we update R packages like Rstan, Rcpp, and, well, anything with a lot of shared libraries, we hit the crash.

I notice you don't have orterun in your submission example. Do you mean mpiexec really?

Answer: The documentation says that orterun, mpiexec, and mpirun are all interchangeable. I rather enjoyed orterun, it sounds fancy. However, it appears mpiexec is more widely used. There are more advanced tools (such as mpiexec.hydra, which we might start using).

In your submission script, why don't you specify the $PBS_NODEFILE any more?

Answer: The program mpiexec is compiled in a way that makes this no longer necessary. It is not harmful to specify $PBS_NODEFILE, but it is not needed either. The hpcexamples will get cleaned up. The CRMDA cluster documentation will need to be corrected.

Posted in Computing, R

Rstats/3.3 and Rstats/3.4 updates: dealing with OpenMPI and Infiniband library concerns.

Dear CRMDA cluster users

During the past 2 months, some of us have seen the MPI warning from parallel R programs:

An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

We have wrestled with this. Today I've decided what to do. The CRMDA modules for Rstats/3.3 and 3.4 will prevent the OpenMPI (parallel computing) framework from trying to access the Infiniband network devices. That makes the warning go away. Because the ethernet communication devices are slower than Infiniband, this is not a decision taken lightly.

The CRMDA R module stanza should "just work" with either Rstats/3.3 or Rstats/3.4. The two versions of the stanza are:

module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.3


module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.4

How is this done?

I've rebuilt openmpi-1.10.7, which is also now in our module collection, so I have power to insert the special configuration described below.

R Packages

The packages list that is kept up to date, system-wide, is the same in Rstats-3.3 or Rstats-3.4. A full list is included at the end of this announcement.

If you find that updates cause your applications to break, users are allowed to install old versions of R packages in ~/R.

Details about the OpenMPI/openib warning message.

Embarrassingly, while googling for help on this message, I've discovered that, in 2010, I was in the exact same situation setting up the CRMDA cluster that used to be in the Structural Biology Center. It had completely gone out of my mind, but with the new cluster in 2017 and fresh installs of OpenMPI, we hit the problem again.

Here is what I've learned about OpenMPI and Rmpi during the past 2 weeks.

I don't understand computer science enough to understand fully the dangers of forks and data corruption when OpenMPI uses infiniband. However, perhaps one of you can tell me.

  1. Rmpi will compile with OpenMPI >= 2.0, but it is not fully compatible. The Rmpi author has written to me directly that he is working on revisions that will make these compatible. One symptom of the problem we find is that stopCluster() does not work. It hangs the session entirely. The only way to shut down the cluster is mpi.quit(), which terminates the R session entirely.

  2. Rmpi will compile/run with OpenMPI < 2.0.

However, on systems that have Infiniband connective devices and openib libraries, there will be warnings about threads and forks as well as a danger of data corruption. The warning from OpenMPI is triggered by such innocuous R functions as sessionInfo().

Here is a session that shows the warning, using R-3.4 in the cluster.

$ R

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Microsoft R Open 3.4.0
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2017 Microsoft Corporation

Using the Intel MKL for parallel mathematical computing(using 1 cores).

Default CRAN mirror snapshot taken on 2017-05-01.
See: https://mran.microsoft.com/.

[Previously saved workspace restored]

> library(Rmpi)
> sessionInfo()
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          n410 (PID 34456)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

Matrix products: default
BLAS: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/libRblas.so
LAPACK: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rmpi_0.6-6           RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
[1] compiler_3.4.0   RevoUtils_10.0.4 parallel_3.4.0

I do not know how dangerous forks might be, but if you go read this message, it appears they can cause data corruption, and this has been known since 2010:


It is above my understanding to say whether garden variety R users will cause these problems. I do know the R parallel documentation warns against system calls and forks, possibly for the same reason. R functions that use disk--dir.create, list.files--make a system call that would fall into the dangerous fork category. Possibly. This is a little above my pay grade.

Conservative approach

My "better safe than sorry" instinct leads to this conclusion: TURN OFF INFINIBAND SUPPORT IN OpenMPI. This is the policy we adopted in 2010. It was in place on the KU community cluster. In the new cluster, it was not in place, resulting in the warning message. I had forgotten about this for a long time. With newly installed OpenMPI, I ran into the same old problem.

This can be done in the user account, by adding ~/.openmpi/mca-params.conf (or, systemwide, in the openmpi install folder etc/openmpi-mca-params.conf) with this line:

btl = ^openib

That prevents OpenMPI from using Infiniband transport layer. I am doing this in the CRMDA OpenMPI module configuration.
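For users who want the per-account version of this setting, creating the file could be scripted roughly as follows. Treat it as a sketch: disable_openib is a name invented for this example, and the optional directory argument exists only so the function can be tried out in a scratch folder before touching the real ~/.openmpi. If the file already has a btl line, the function leaves it alone so you can decide by hand what to do.

```shell
# disable_openib is a hypothetical helper name.
# Usage: disable_openib [CONF_DIR]    (CONF_DIR defaults to ~/.openmpi)
# Makes sure CONF_DIR/mca-params.conf has a btl setting. If none is
# present, the line "btl = ^openib" is appended.
disable_openib() {
    local conf_dir="${1:-$HOME/.openmpi}"
    local conf="$conf_dir/mca-params.conf"
    mkdir -p "$conf_dir"
    touch "$conf"
    if grep -qE '^[[:space:]]*btl[[:space:]]*=' "$conf"; then
        echo "$conf already has a btl setting; leaving it unchanged"
    else
        printf 'btl = ^openib\n' >> "$conf"
    fi
}
```

People who load the CRMDA Rstats modules should not need this, because the setting is handled in the module configuration.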

One can tell that an Infiniband device is detected with the shell program "ompi_info" provided by OpenMPI. Load the module Rstats/3.3 or Rstats/3.4. After running "ompi_info", look for the btl stanza. The return from ompi_info is like this if you have Infiniband.

   MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)

And like this after changing either ~/.openmpi/mca-params.conf or etc/openmpi-mca-params.conf to include btl = ^openib.

   MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)
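Rather than reading through the full ompi_info report by eye, the check for the openib line can be sketched as a filter. The function name lists_openib is hypothetical, and the pattern assumes the "MCA btl: openib" layout shown in the listings above. Other OpenMPI versions may format the report differently.

```shell
# lists_openib is a hypothetical helper name.
# It reads ompi_info-style text on standard input and succeeds (exit 0)
# only if an "MCA btl: openib" line is present.
lists_openib() {
    grep -qE 'MCA btl: openib([[:space:]]|$)'
}
```

With the modules loaded, something like "ompi_info | lists_openib && echo openib is listed" would then give a one-line answer.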

I believe it is worth mentioning that, if some of your compute nodes have Infiniband, and some do not, then OpenMPI jobs will crash if they try to integrate nodes connected with ethernet and Infiniband. That is another reason to tell OpenMPI not to try to use Infiniband at all.

If users do want to use Infiniband within OpenMPI, they can do so by editing a personal configuration file, in ~/.openmpi.

Alphabetical R package list.

As of 2017-07-05, these are the packages we install in the directory "/panfs/pfs.local/work/crmda/tools/mro/3.3" (or 3.4)

c("ADGofTest", "AER", "Amelia", "BH", "BMA", "BradleyTerry2", 
"Cairo", "Cubist", "DBI", "DCluster", "DEoptimR", "Devore7", 
"DiagrammeR", "ENmisc", "Ecdat", "Ecfun", "Formula", "GPArotation", 
"HistData", "Hmisc", "HyperbolicDist", "ISwR", "Iso", "JGR", 
"JM", "JMdesign", "JavaGD", "Kendall", "LearnBayes", "MCMCpack", 
"MCPAN", "MEMSS", "MNP", "MPV", "MatchIt", "Matching", "MatrixModels", 
"MplusAutomation", "NMF", "PASWR", "PolynomF", "R2HTML", "R2OpenBUGS", 
"RColorBrewer", "RCurl", "RGtk2", "RSvgDevice", "RUnit", "RandomFields", 
"Rcmdr", "RcmdrMisc", "Rcpp", "RcppArmadillo", "RcppEigen", "Rd2roxygen", 
"Rmpi", "SASmixed", "SemiPar", "SoDA", "SparseM", "StanHeaders", 
"StatDataML", "SweaveListingUtils", "TH.data", "TeachingDemos", 
"UsingR", "VGAM", "VIM", "XML", "Zelig", "abind", "acepack", 
"actuar", "ada", "ade4", "adehabitat", "akima", "alr3", "amap", 
"aod", "ape", "aplpack", "arm", "arules", "assertthat", "backports", 
"base64enc", "bayesm", "bcp", "bdsmatrix", "bestglm", "betareg", 
"biglm", "bit", "bit64", "bitops", "bnlearn", "brew", "brglm", 
"caTools", "cairoDevice", "car", "caret", "cellranger", "censReg", 
"chron", "clue", "clv", "cocorresp", "coda", "coin", "colorspace", 
"combinat", "copula", "corpcor", "crayon", "cubature", "data.table", 
"deldir", "descr", "dichromat", "digest", "diptest", "distr", 
"dlm", "doBy", "doMC", "doMPI", "doParallel", "doSNOW", "dotCall64", 
"dse", "e1071", "earth", "ecodist", "effects", "eha", "eiPack", 
"emplik", "evaluate", "expm", "faraway", "fastICA", "fastmatch", 
"fda", "ffmanova", "fields", "flexmix", "foreach", "formatR", 
"forward", "gam", "gamlss", "gamlss.data", "gamlss.dist", "gamm4", 
"gbm", "gclus", "gdata", "gee", "geepack", "geoR", "geoRglm", 
"ggm", "ggplot2", "glmc", "glmmBUGS", "glmmML", "glmnet", "glmpath", 
"gmodels", "gmp", "gpclib", "gridBase", "gridExtra", "gsl", "gsubfn", 
"gtable", "gtools", "hexbin", "highr", "htmltools", "htmlwidgets", 
"igraph", "ineq", "influence.ME", "inline", "iplots", "irlba", 
"iterators", "itertools", "jpeg", "jsonlite", "kernlab", "knitr", 
"kutils", "labeling", "laeken", "languageR", "lars", "latticeExtra", 
"lava", "lavaan", "lavaan.survey", "lazyeval", "leaps", "lme4", 
"lmeSplines", "lmec", "lmm", "lmtest", "locfit", "logspline", 
"longitudinal", "longitudinalData", "lpSolve", "ltm", "magrittr", 
"manipulate", "maps", "maptools", "markdown", "matrixcalc", "maxLik", 
"mboost", "mcgibbsit", "mclust", "mcmc", "mda", "memisc", "memoise", 
"mi", "micEcon", "mice", "microbenchmark", "mime", "minqa", "misc3d", 
"miscTools", "mitools", "mix", "mixtools", "mlbench", "mnormt", 
"modeltools", "msm", "multcomp", "munsell", "mvProbit", "mvbutils", 
"mvtnorm", "network", "nloptr", "nnls", "nor1mix", "norm", "nortest", 
"np", "numDeriv", "nws", "openxlsx", "ordinal", "orthopolynom", 
"pan", "partDSA", "party", "pbivnorm", "pbkrtest", "pcaPP", "permute", 
"pixmap", "pkgKitten", "pkgmaker", "plm", "plotmo", "plotrix", 
"pls", "plyr", "pmml", "pmmlTransformations", "png", "polspline", 
"polycor", "polynom", "portableParallelSeeds", "ppcor", "profileModel", 
"proto", "proxy", "pscl", "psidR", "pspline", "psych", "quadprog", 
"quantreg", "randomForest", "randomForestSRC", "rattle", "rbenchmark", 
"rbugs", "rda", "readxl", "registry", "relimp", "rematch", "reshape", 
"reshape2", "rgenoud", "rgl", "rlang", "rlecuyer", "rmarkdown", 
"rms", "rngtools", "robustbase", "rockchalk", "roxygen2", "rpart.plot", 
"rpf", "rprojroot", "rrcov", "rstan", "rstudioapi", "sandwich", 
"scales", "scatterplot3d", "segmented", "sem", "semTools", "setRNG", 
"sets", "sfsmisc", "shapefiles", "simsem", "sm", "smoothSurv", 
"sna", "snow", "snowFT", "sp", "spam", "spatialCovariance", "spdep", 
"splancs", "stabledist", "stabs", "startupmsg", "statmod", "statnet.common", 
"stepwise", "stringi", "stringr", "strucchange", "subselect", 
"survey", "systemfit", "tables", "tcltk2", "tensorA", "testthat", 
"texreg", "tfplot", "tframe", "tibble", "tidyverse", "timeDate", 
"tis", "tree", "triangle", "trimcluster", "trust", "ucminf", 
"urca", "vcd", "vegan", "visNetwork", "waveslim", "wnominate", 
"xtable", "xts", "yaml", "zipfR", "zoo", "KernSmooth", "MASS", 
"Matrix", "MicrosoftR", "R6", "RUnit", "RevoIOQ", "RevoMods", 
"RevoUtils", "RevoUtilsMath", "base", "boot", "checkpoint", "class", 
"cluster", "codetools", "compiler", "curl", "datasets", "deployrRserve", 
"doParallel", "foreach", "foreign", "grDevices", "graphics", 
"grid", "iterators", "jsonlite", "lattice", "methods", "mgcv", 
"nlme", "nnet", "parallel", "png", "rpart", "spatial", "splines", 
"stats", "stats4", "survival", "tcltk", "tools", "utils")
Posted in Data Analysis, R

kutils update

kutils, our utility package that includes the Variable Key framework, was updated to version 1.0 on CRAN last week.

Minor bug fixes will be offered in our package server KRAN, which users can access by running R code like this

CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE, checkBuilt = TRUE)

That presupposes you have kutils already, of course. If not, run install.packages instead.

I've just uploaded to KRAN version 1.10, which has a little fix in the reverse function, which is intended to reverse the ordering of factor levels. In case you wonder what this is, here is a code snippet:

##' Reverse the levels in a factor
##' Simple literal reversal. Will stop with an error message if x is
##' not a factor (or ordered) variable.
##' Sometimes people want to
##' reverse some levels, excluding others and leaving them at the end
##' of the list. The "eol" argument sets aside some levels and puts
##' them at the end of the list of levels.
##' The use case for the \code{eol} argument is a factor
##' with several missing value labels, as appears in SPSS. With
##' up to 18 different missing codes, we want to leave them
##' at the end. In the case for which this was designed, the
##' researcher did not want to designate those values as
##' missing before inspecting the pattern of observed values.
##' @param x a factor variable
##' @param eol values to be kept at the end of the list
##' @export
##' @return a new factor variable with reversed values
##' @author Paul Johnson <pauljohn@@ku.edu>
##' @examples
##' ## Consider alphabetication of upper and lower
##' x <- factor(c("a", "b", "c", "C", "a", "c"))
##' levels(x)
##' xr1 <- reverse(x)
##' xr1
##' ## Keep "C" at end of list, after reverse others
##' xr2 <- reverse(x, eol = "C")
##' xr2
##' y <- ordered(x, levels = c("a", "b", "c", "C"))
##' yr1 <- reverse(y)
##' yr1
##' ## Hmm. end of list amounts to being "maximal".
##' ## Unintended side-effect, but interesting.
##' yr2 <- reverse(y, eol = "C")
##' yr2
reverse <- function(x, eol = c("Skip", "DNP")){
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    if (length(eol) > 0){
        for (jj in eol){
            ## move any level matching jj to the end of the list
            if (length(yyy <- grep(jj, rlevels))){
                rlevels <- c(rlevels[-yyy], jj)
            }
        }
    }
    factor(x, levels = rlevels)
}

If for some reason you don't want to install/update kutils, you can just as well paste that code into your R file and use it as the example demonstrates.

Posted in Data Analysis

Cluster faster, Rstan optimized as of 2017-05-17

Special thanks to Wes Mason of the ITTC. There are 2 breakthroughs to report today.

Nodes are faster

During the spring, users reported that calculations were taking longer. I raised the problem with Wes and he did some diagnosis. It appeared the node BIOS could be adjusted to allow calculations to run faster--nearly two times faster! The CRC administrators understood the issue and they implemented the fixes on May 15, 2017.

Testing on May 16 confirmed that MCMC jobs that were taking 25 hours now take 12 hours.

Now Rstan is optimized as well

I had a lot of trouble getting the settings corrected to build Rstan in the cluster. It turns out that the user who builds Rstan needs to have special settings in a hidden file in the user account. I tried that in February and failed for various reasons, but now victory is at hand. This is one example of why we don't suggest individual users try to compile these packages--it is simply too difficult/frustrating.

To use the specially built Rstan, it is necessary to do the 5 step incantation described in the previous post, R Packages available for CRMDA cluster members.

These packages are compiled with GCC-6.3, the latest and greatest, with the C++ optimizer dialed up to "-O3".

In case you need to compile Rstan with GCC-6.3, here is what I have in the ~/.R/Makevars file:

R_XTRA_CPPFLAGS =  -I$(R_INCLUDE_DIR)   #set_by_rstan
## for OpenMx
CXX1X = g++
CXX1XSTD =  -std=c++0x
## For Rstan
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-Wno-ignored-attributes -Wno-deprecated-declarations

The Rstan installation manual suggests two other flags, "-flto -ffat-lto-objects", but these cause a compilation failure. We believe these are not compatible with GCC-6.3.

The other thing worth knowing is that the GCC compiler will demand much more memory than you expect. In February, I was failing over and over because the node was allowing me access to 500MB, but 5GB was necessary. Unfortunately, the error message is completely opaque, suggesting an internal bug in GCC, rather than exhaustion of memory. That was another problem that Wes Mason diagnosed for us.

Posted in Data Analysis

Apply for our Student Hourly Position

The link for students to apply is:


The last day students can apply is May 23, 2017, and committee members can review candidates by logging into the BrassRing system on or after May 24, 2017.

Posted in Data Analysis

R Packages available for CRMDA cluster members

This is the 20170425 update, which includes an updated module set and reports of success with Java and Tcl/Tk-based R packages. In other words, an almost complete victory is achieved. Special thanks to Wes Mason of ITTC.

To use R, here is a set of commands I run to set the environment. This is necessary every time I want to use R with Emacs. Let's call this the magic 5 line stanza, for sake of discussion.

module purge
module load legacy
module load emacs  
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3

I agree if you say "it is a pain in the rump to have to remember to do that every time I log in." In the old cluster, I was in a position to place those startup commands into all of the CRMDA user environments. That is no longer the case.

I'm checking on ways you can automate this within your own account. Details are posted at the end of this article.

Consider (strongly) obliterating your $HOME/R package folder

When you want to work with R on the CRC cluster, please consider using the R packages we install within the $WORK folder for CRMDA group members. These packages have some special features and if you try to install them in your user folder (under $HOME/R, as R invites you to do if you run "install.packages()" in a session), then they may not compile correctly.

Recently, we have had runtime errors because the R we are recommending, as described below, is not compatible with packages that users build and install with other versions of R (or the same version of R in a different build environment). In particular, if

  1. You have packages built on the old ACF cluster, or
  2. You have recently installed packages without loading the modules listed below

then you should delete the packages you have under $HOME/R. I think it is best if you let us try to install what you need, but if you install R packages in your own home folder, please do so only AFTER loading the modules listed below. Please DO NOT load the CRC-provided module "R/3.3". It does not provide the services we need.
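If deleting $HOME/R outright feels too drastic, a more cautious variant is to move the folder aside under a dated name, check that your work still runs with the CRMDA packages, and delete the copy later. That is my own suggestion for this sketch rather than a cluster policy. The function name set_aside_r_library is made up, and the optional argument is there only so it can be tried on a scratch folder first.

```shell
# set_aside_r_library is a hypothetical helper name.
# Usage: set_aside_r_library [LIB_DIR]    (LIB_DIR defaults to ~/R)
# Renames LIB_DIR to LIB_DIR.old-YYYYMMDD instead of deleting it.
set_aside_r_library() {
    local lib="${1:-$HOME/R}"
    local backup="${lib}.old-$(date +%Y%m%d)"
    if [ ! -d "$lib" ]; then
        echo "nothing to do: $lib does not exist"
        return 0
    fi
    # Refuse to clobber a backup made earlier the same day.
    if [ -e "$backup" ]; then
        echo "refusing to overwrite existing $backup" >&2
        return 1
    fi
    mv -- "$lib" "$backup" && echo "moved $lib to $backup"
}
```

Once R is confirmed to work without the old folder, the dated copy can be removed by hand.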

Background information

The module Rstats/3.3 is built by Wes Mason of ITTC and it is installed into the $WORK folder for CRMDA (hence the module use command above). We work together to make sure the OpenMPI layer is compiled correctly, so it is possible to use Rmpi and the R package parallel. The compiler used is GCC-6.3, which is quite a bit newer than the standard GCC which is provided with the cluster node operating system. This is the principal reason why the CRC-provided "R/3.3" is not acceptable. It does not make sure that the OpenMPI and GCC components are kept in lock-step with R itself. Observe, if we start with an empty session and run

module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3

we find that we actually load several modules:

$ module list

Currently Loaded Modules:
1) compiler/gcc/6.3   
2) openmpi/2.0
3) java/1.8.0_131
4) xz/5.2.3
5) icu/59.1
6) tcltk/8.6.6
7) Rstats/3.3

The openmpi version must be kept in lock-step with R and the packages we have installed in the past. gcc-6.3 is the compiler version we use for all of the packages. It is necessary to have that new version because of demands by packages like Rstan and OpenMx. The java and tcltk modules are needed by various R packages, such as rJava and tkrplot. The xz module is a decompression suite, needed to interact with source code itself. The Rstats module itself is, for the most part, a "holding company" that keeps all of this together. It simply loads the requirements of gcc, openmpi, java, xz, icu, and tcltk, and then it accesses the R provided by the CRC system maintainers from /panfs/pfs.local/software/install/MRO/3.3. The R packages provided by the base R install are found in the directory /panfs/pfs.local/work/crmda/tools/mro/3.3/site-library.

The R packages in our collection are, in most cases, going to be updates & replacements of those packages because we are building with the different compiler. Users who load Rstats/3.3 should notice that our directory comes into the user path before the system-wide folder. Inside R, we see:

> .libPaths()
[1] "/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library"
[2] "/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library"

In my basic 5 line session starter sequence, I also have modules named legacy and emacs. In my opinion, that is a little dangerous because I'm jumbling together modules from the old and new cluster. That's necessary because I need an IDE with which to interact with R. Emacs was configured with ESS by Wes Mason, and it is in the legacy module set. If you prefer, legacy also provides RStudio version 0.98.978. That is, unfortunately, outdated and unmaintained. I've filed a request with CRC to get a new version of RStudio.

What do you get if you load our environment?

After the magic 5 line stanza above, within your R session, you have access to these packages (Run "library()" to see all folders in your path, and all packages within):

Packages in library '/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library':

ADGofTest               Anderson-Darling GoF test
AER                     Applied Econometrics with R
Amelia                  A Program for Missing Data
BH                      Boost C++ Header Files
BMA                     Bayesian Model Averaging
BradleyTerry2           Bradley-Terry Models
Cairo                   R graphics device using cairo graphics library
                        for creating high-quality bitmap (PNG, JPEG,
                        TIFF), vector (PDF, SVG, PostScript) and
                        display (X11 and Win32) output
Cubist                  Rule- And Instance-Based Regression Modeling
DBI                     R Database Interface
DCluster                Functions for the Detection of Spatial Clusters
                        of Diseases
DEoptimR                Differential Evolution Optimization in Pure R
Devore7                 Data sets from Devore's "Prob and Stat for Eng
                        (7th ed)"
DiagrammeR              Create Graph Diagrams and Flowcharts Using R
ENmisc                  Neuwirth miscellaneous
Ecdat                   Data Sets for Econometrics
Ecfun                   Functions for Ecdat
Formula                 Extended Model Formulas
GPArotation             GPA Factor Rotation
HistData                Data Sets from the History of Statistics and
                        Data Visualization
Hmisc                   Harrell Miscellaneous
HyperbolicDist          The hyperbolic distribution
ISwR                    Introductory Statistics with R
Iso                     Functions to Perform Isotonic Regression
JGR                     JGR - Java GUI for R
JM                      Joint Modeling of Longitudinal and Survival
JMdesign                Joint Modeling of Longitudinal and Survival
                        Data - Power Calculation
JavaGD                  Java Graphics Device
Kendall                 Kendall rank correlation and Mann-Kendall trend
LearnBayes              Functions for Learning Bayesian Inference
MCMCglmm                MCMC Generalised Linear Mixed Models
MCMCpack                Markov Chain Monte Carlo (MCMC) Package
MCPAN                   Multiple Comparisons Using Normal Approximation
MEMSS                   Data sets from Mixed-effects Models in S
MNP                     R Package for Fitting the Multinomial Probit
MPV                     Data Sets from Montgomery, Peck and Vining's
MatchIt                 Nonparametric Preprocessing for Parametric
                        Casual Inference
Matching                Multivariate and Propensity Score Matching with
                        Balance Optimization
MatrixModels            Modelling with Sparse And Dense Matrices
ModelMetrics            Rapid Calculation of Model Metrics
MplusAutomation         Automating Mplus Model Estimation and
NMF                     Algorithms and Framework for Nonnegative Matrix
                        Factorization (NMF)
OpenMx                  Extended Structural Equation Modelling
PBSmapping              Mapping Fisheries Data and Spatial Analysis
PolynomF                Polynomials in R
R2HTML                  HTML Exportation for R Objects
R2OpenBUGS              Running OpenBUGS from R
R6                      Classes with Reference Semantics
RColorBrewer            ColorBrewer Palettes
RCurl                   General Network (HTTP/FTP/...) Client Interface
                        for R
RGtk2                   R bindings for Gtk 2.8.0 and above
RSvgDevice              An R SVG graphics device.
RandomFields            Simulation and Analysis of Random Fields
RandomFieldsUtils       Utilities for the Simulation and Analysis of
                        Random Fields
Rcmdr                   R Commander
RcmdrMisc               R Commander Miscellaneous Functions
Rcpp                    Seamless R and C++ Integration
RcppArmadillo           'Rcpp' Integration for the 'Armadillo'
                        Templated Linear Algebra Library
RcppEigen               'Rcpp' Integration for the 'Eigen' Templated
                        Linear Algebra Library
Rd2roxygen              Convert Rd to 'Roxygen' Documentation
Rmpi                    Interface (Wrapper) to MPI (Message-Passing
Rook                    Rook - a web server interface for R
SAScii                  Import ASCII files directly into R using only a
                        SAS input script
SASmixed                Data sets from "SAS System for Mixed Models"
SemiPar                 Semiparametic Regression
SoDA                    Functions and Examples for "Software for Data
SparseM                 Sparse Linear Algebra
StanHeaders             C++ Header Files for Stan
StatDataML              Read and Write StatDataML Files
SweaveListingUtils      Utilities for Sweave Together with TeX
                        'listings' Package
TH.data                 TH's Data Archive
TeachingDemos           Demonstrations for Teaching and Learning
UsingR                  Data Sets, Etc. for the Text "Using R for
                        Introductory Statistics", Second Edition
VGAM                    Vector Generalized Linear and Additive Models
VIM                     Visualization and Imputation of Missing Values
XML                     Tools for Parsing and Generating XML Within R
                        and S-Plus
Zelig                   Everyone's Statistical Software
abind                   Combine Multidimensional Arrays
acepack                 ACE and AVAS for Selecting Multiple Regression
actuar                  Actuarial Functions and Heavy Tailed
ada                     The R Package Ada for Stochastic Boosting
ade4                    Analysis of Ecological Data : Exploratory and
                        Euclidean Methods in Environmental Sciences
adehabitat              Analysis of Habitat Selection by Animals
akima                   Interpolation of Irregularly and Regularly
                        Spaced Data
alr3                    Data to accompany Applied Linear Regression 3rd
amap                    Another Multidimensional Analysis Package
aod                     Analysis of Overdispersed Data
ape                     Analyses of Phylogenetics and Evolution
aplpack                 Another Plot PACKage: stem.leaf, bagplot,
                        faces, spin3R, plotsummary, plothulls, and some
                        slider functions
arm                     Data Analysis Using Regression and
                        Multilevel/Hierarchical Models
arules                  Mining Association Rules and Frequent Itemsets
assertthat              Easy Pre and Post Assertions
backports               Reimplementations of Functions Introduced Since
base64enc               Tools for base64 encoding
bayesm                  Bayesian Inference for
bcp                     Bayesian Analysis of Change Point Problems
bdsmatrix               Routines for Block Diagonal Symmetric matrices
bestglm                 Best Subset GLM
betareg                 Beta Regression
biglm                   bounded memory linear and generalized linear
bit                     A class for vectors of 1-bit booleans
bit64                   A S3 Class for Vectors of 64bit Integers
bitops                  Bitwise Operations
bnlearn                 Bayesian Network Structure Learning, Parameter
                        Learning and Inference
brew                    Templating Framework for Report Generation
brglm                   Bias reduction in binomial-response generalized
                        linear models.
broom                   Convert Statistical Analysis Objects into Tidy
                        Data Frames
caTools                 Tools: moving window statistics, GIF, Base64,
                        ROC AUC, etc.
cairoDevice             Embeddable Cairo Graphics Device Driver
car                     Companion to Applied Regression
caret                   Classification and Regression Training
cellranger              Translate Spreadsheet Cell Ranges to Rows and
censReg                 Censored Regression (Tobit) Models
checkmate               Fast and Versatile Argument Checks
chron                   Chronological Objects which can Handle Dates
                        and Times
clue                    Cluster Ensembles
clv                     Cluster Validation Techniques
cocorresp               Co-Correspondence Analysis Methods
coda                    Output Analysis and Diagnostics for MCMC
coin                    Conditional Inference Procedures in a
                        Permutation Test Framework
colorspace              Color Space Manipulation
combinat                combinatorics utilities
commonmark              High Performance CommonMark and Github Markdown
                        Rendering in R
copula                  Multivariate Dependence with Copulas
corpcor                 Efficient Estimation of Covariance and
                        (Partial) Correlation
crayon                  Colored Terminal Output
cslogistic              Conditionally Specified Logistic Regression
cubature                Adaptive Multivariate Integration over
data.table              Extension of `data.frame`
deldir                  Delaunay Triangulation and Dirichlet (Voronoi)
desc                    Manipulate DESCRIPTION Files
descr                   Descriptive Statistics
dichromat               Color Schemes for Dichromats
digest                  Create Compact Hash Digests of R Objects
diptest                 Hartigan's Dip Test Statistic for Unimodality -
distr                   Object Oriented Implementation of Distributions
dlm                     Bayesian and Likelihood Analysis of Dynamic
                        Linear Models
doBy                    Groupwise Statistics, LSmeans, Linear
                        Contrasts, Utilities
doMC                    Foreach Parallel Adaptor for 'parallel'
doMPI                   Foreach parallel adaptor for the Rmpi package
doSNOW                  Foreach Parallel Adaptor for the 'snow' Package
dplyr                   A Grammar of Data Manipulation
dse                     Dynamic Systems Estimation (Time Series
e1071                   Misc Functions of the Department of Statistics,
                        Probability Theory Group (Formerly: E1071), TU
earth                   Multivariate Adaptive Regression Splines
ecodist                 Dissimilarity-based functions for ecological
effects                 Effect Displays for Linear, Generalized Linear,
                        and Other Models
eha                     Event History Analysis
eiPack                  eiPack: Ecological Inference and
                        Higher-Dimension Data Management
emplik                  Empirical Likelihood Ratio for
                        Censored/Truncated Data
evaluate                Parsing and Evaluation Tools that Provide More
                        Details than the Default
expint                  Exponential Integral and Incomplete Gamma
expm                    Matrix Exponential, Log, 'etc'
faraway                 Functions and Datasets for Books by Julian
fastICA                 FastICA Algorithms to perform ICA and
                        Projection Pursuit
fastmatch               Fast match() function
fda                     Functional Data Analysis
ffmanova                Fifty-fifty MANOVA
fields                  Tools for Spatial Data
flexmix                 Flexible Mixture Modeling
forcats                 Tools for Working with Categorical Variables
formatR                 Format R Code Automatically
forward                 Forward search
gam                     Generalized Additive Models
gamlss                  Generalised Additive Models for Location Scale
                        and Shape
gamlss.data             GAMLSS Data
gamlss.dist             Distributions to be Used for GAMLSS Modelling
gamm4                   Generalized Additive Mixed Models using 'mgcv'
                        and 'lme4'
gbm                     Generalized Boosted Regression Models
gclus                   Clustering Graphics
gdata                   Various R Programming Tools for Data
gee                     Generalized Estimation Equation Solver
geepack                 Generalized Estimating Equation Package
geoR                    Analysis of Geostatistical Data
geoRglm                 A Package for Generalised Linear Spatial Models
ggm                     Functions for graphical Markov models
ggplot2                 Create Elegant Data Visualisations Using the
                        Grammar of Graphics
glmc                    Fitting Generalized Linear Models Subject to
glmmBUGS                Generalised Linear Mixed Models with BUGS and
glmmML                  Generalized Linear Models with Clustering
glmnet                  Lasso and Elastic-Net Regularized Generalized
                        Linear Models
glmpath                 L1 Regularization Path for Generalized Linear
                        Models and Cox Proportional Hazards Model
gmodels                 Various R Programming Tools for Model Fitting
gmp                     Multiple Precision Arithmetic
gpclib                  General Polygon Clipping Library for R
gridBase                Integration of base and grid graphics
gridExtra               Miscellaneous Functions for "Grid" Graphics
grpreg                  Regularization Paths for Regression Models with
                        Grouped Covariates
gsl                     Wrapper for the Gnu Scientific Library
gsubfn                  Utilities for strings and function arguments.
gtable                  Arrange 'Grobs' in Tables
gtools                  Various R Programming Tools
haven                   Import and Export 'SPSS', 'Stata' and 'SAS'
hexbin                  Hexagonal Binning Routines
highr                   Syntax Highlighting for R Source Code
hms                     Pretty Time of Day
htmlTable               Advanced Tables for Markdown/HTML
htmltools               Tools for HTML
htmlwidgets             HTML Widgets for R
httpuv                  HTTP and WebSocket Server Library
httr                    Tools for Working with URLs and HTTP
igraph                  Network Analysis and Visualization
ineq                    Measuring Inequality, Concentration, and
influence.ME            Tools for Detecting Influential Data in Mixed
                        Effects Models
influenceR              Software Tools to Quantify Structural
                        Importance of Nodes in a Network
inline                  Functions to Inline C, C++, Fortran Function
                        Calls from R
iplots                  iPlots - interactive graphics for R
irlba                   Fast Truncated SVD, PCA and Symmetric
                        Eigendecomposition for Large Dense and Sparse
itertools               Iterator Tools
jpeg                    Read and write JPEG images
kernlab                 Kernel-Based Machine Learning Lab
knitr                   A General-Purpose Package for Dynamic Report
                        Generation in R
kutils                  Project Management Tools
labeling                Axis Labeling
laeken                  Estimation of indicators on social exclusion
                        and poverty
languageR               Data sets and functions with "Analyzing
                        Linguistic Data: A practical introduction to
lars                    Least Angle Regression, Lasso and Forward
latticeExtra            Extra Graphical Utilities Based on Lattice
lava                    Latent Variable Models
lavaan                  Latent Variable Analysis
lavaan.survey           Complex Survey Structural Equation Modeling
lazyeval                Lazy (Non-Standard) Evaluation
leaps                   Regression Subset Selection
lme4                    Linear Mixed-Effects Models using 'Eigen' and
lmeSplines              Add smoothing spline modelling capability to
lmec                    Linear Mixed-Effects Models with Censored
lmerTest                Tests in Linear Mixed Effects Models
lmm                     Linear Mixed Models
lmtest                  Testing Linear Regression Models
locfit                  Local Regression, Likelihood and Density
logspline               Logspline Density Estimation Routines
longitudinal            Analysis of Multiple Time Course Data
longitudinalData        Longitudinal Data
lpSolve                 Interface to 'Lp_solve' v. 5.5 to Solve
                        Linear/Integer Programs
ltm                     Latent Trait Models under IRT
lubridate               Make Dealing with Dates a Little Easier
magic                   create and investigate magic squares
magrittr                A Forward-Pipe Operator for R
manipulate              Interactive Plots for RStudio
maps                    Draw Geographical Maps
maptools                Tools for Reading and Handling Spatial Objects
markdown                'Markdown' Rendering for R
matrixcalc              Collection of functions for matrix calculations
maxLik                  Maximum Likelihood Estimation and Related Tools
mboost                  Model-Based Boosting
mcgibbsit               Warnes and Raftery's MCGibbsit MCMC diagnostic
mclust                  Gaussian Mixture Modelling for Model-Based
                        Clustering, Classification, and Density
mcmc                    Markov Chain Monte Carlo
mda                     Mixture and Flexible Discriminant Analysis
mediation               Causal Mediation Analysis
memisc                  Tools for Management of Survey Data and the
                        Presentation of Analysis Results
memoise                 Memoisation of Functions
mi                      Missing Data Imputation and Model Checking
micEcon                 Microeconomic Analysis and Modelling
mice                    Multivariate Imputation by Chained Equations
microbenchmark          Accurate Timing Functions
mime                    Map Filenames to MIME Types
minqa                   Derivative-free optimization algorithms by
                        quadratic approximation
misc3d                  Miscellaneous 3D Plots
miscTools               Miscellaneous Tools and Utilities
mitools                 Tools for multiple imputation of missing data
mix                     Estimation/Multiple Imputation for Mixed
                        Categorical and Continuous Data
mixtools                Tools for Analyzing Finite Mixture Models
mlbench                 Machine Learning Benchmark Problems
mnormt                  The Multivariate Normal and t Distributions
modelr                  Modelling Functions that Work with the Pipe
modeltools              Tools and Classes for Statistical Models
msm                     Multi-State Markov and Hidden Markov Models in
                        Continuous Time
multcomp                Simultaneous Inference in General Parametric
munsell                 Utilities for Using Munsell Colours
mvProbit                Multivariate Probit Models
mvbutils                Workspace organization, code and documentation
                        editing, package prep and editing, etc.
mvtnorm                 Multivariate Normal and t Distributions
neighbr                 Classification, Regression, Clustering with K
                        Nearest Neighbors
network                 Classes for Relational Data
nloptr                  R interface to NLopt
nnls                    The Lawson-Hanson algorithm for non-negative
                        least squares (NNLS)
nor1mix                 Normal (1-d) Mixture Models (S3 Classes and
norm                    Analysis of multivariate normal datasets with
                        missing values
nortest                 Tests for Normality
np                      Nonparametric kernel smoothing methods for
                        mixed data types
numDeriv                Accurate Numerical Derivatives
nws                     R functions for NetWorkSpaces and Sleigh
openssl                 Toolkit for Encryption, Signatures and
                        Certificates Based on OpenSSL
openxlsx                Read, Write and Edit XLSX Files
ordinal                 Regression Models for Ordinal Data
orthopolynom            Collection of functions for orthogonal and
                        orthonormal polynomials
pan                     Multiple Imputation for Multivariate Panel or
                        Clustered Data
pander                  An R Pandoc Writer
partDSA                 Partitioning Using Deletion, Substitution, and
                        Addition Moves
party                   A Laboratory for Recursive Partytioning
pbivnorm                Vectorized Bivariate Normal CDF
pbkrtest                Parametric Bootstrap and Kenward Roger Based
                        Methods for Mixed Model Comparison
pcaPP                   Robust PCA by Projection Pursuit
permute                 Functions for Generating Restricted
                        Permutations of Data
pixmap                  Bitmap Images (``Pixel Maps'')
pkgKitten               Create Simple Packages Which Do not Upset R
                        Package Checks
pkgmaker                Package development utilities
plm                     Linear Models for Panel Data
plotmo                  Plot a Model's Response and Residuals
plotrix                 Various Plotting Functions
pls                     Partial Least Squares and Principal Component
plyr                    Tools for Splitting, Applying and Combining
pmml                    Generate PMML for Various Models
pmmlTransformations     Transforms Input Data from a PMML Perspective
polspline               Polynomial Spline Routines
polycor                 Polychoric and Polyserial Correlations
polynom                 A Collection of Functions to Implement a Class
                        for Univariate Polynomial Manipulations
portableParallelSeeds   Allow Replication of Simulations on Parallel
                        and Serial Computers
ppcor                   Partial and Semi-Partial (Part) Correlation
praise                  Praise Users
profileModel            Tools for profiling inference functions for
                        various model classes
proto                   Prototype Object-Based Programming
proxy                   Distance and Similarity Measures
pscl                    Political Science Computational Laboratory,
                        Stanford University
psidR                   Build Panel Data Sets from PSID Raw Data
pspline                 Penalized Smoothing Splines
psych                   Procedures for Psychological, Psychometric, and
                        Personality Research
purrr                   Functional Programming Tools
quadprog                Functions to solve Quadratic Programming
quantreg                Quantile Regression
rJava                   Low-Level R to Java Interface
randomForest            Breiman and Cutler's Random Forests for
                        Classification and Regression
randomForestSRC         Random Forests for Survival, Regression and
                        Classification (RF-SRC)
rattle                  Graphical User Interface for Data Mining in R
rbenchmark              Benchmarking routine for R
rbugs                   Fusing R and OpenBugs and Beyond
rda                     Shrunken Centroids Regularized Discriminant
readr                   Read Rectangular Text Data
readxl                  Read Excel Files
registry                Infrastructure for R Package Registries
relimp                  Relative Contribution of Effects in a
                        Regression Model
rematch                 Match Regular Expressions with a Nicer 'API'
reshape                 Flexibly Reshape Data
reshape2                Flexibly Reshape Data: A Reboot of the Reshape
rgenoud                 R Version of GENetic Optimization Using
rgexf                   Build, Import and Export GEXF Graph Files
rgl                     3D Visualization Using OpenGL
rlecuyer                R Interface to RNG with Multiple Streams
rmarkdown               Dynamic Documents for R
rms                     Regression Modeling Strategies
rngtools                Utility functions for working with Random
                        Number Generators
robustbase              Basic Robust Statistics
rockchalk               Regression Estimation and Presentation
roxygen2                In-Line Documentation for R
rpart.plot              Plot 'rpart' Models: An Enhanced Version of
rpf                     Response Probability Functions
rprojroot               Finding Files in Project Subdirectories
rrcov                   Scalable Robust Estimators with High Breakdown
rstan                   R Interface to Stan
rstudio                 Tools and Utilities for RStudio
rstudioapi              Safely Access the RStudio API
rvest                   Easily Harvest (Scrape) Web Pages
sandwich                Robust Covariance Matrix Estimators
scales                  Scale Functions for Visualization
scatterplot3d           3D Scatter Plot
segmented               Regression Models with Breakpoints/Changepoints
selectr                 Translate CSS Selectors to XPath Expressions
sem                     Structural Equation Models
semTools                Useful Tools for Structural Equation Modeling
setRNG                  Set (Normal) Random Number Generator and Seed
sets                    Sets, Generalized Sets, Customizable Sets and
sfsmisc                 Utilities from "Seminar fuer Statistik" ETH
shapefiles              Read and Write ESRI Shapefiles
shiny                   Web Application Framework for R
simsem                  SIMulated Structural Equation Modeling
sm                      Smoothing methods for nonparametric regression
                        and density estimation
smoothSurv              Survival Regression with Smoothed Error
sna                     Tools for Social Network Analysis
snow                    Simple Network of Workstations
snowFT                  Fault Tolerant Simple Network of Workstations
sourcetools             Tools for Reading, Tokenizing and Parsing R
sp                      Classes and Methods for Spatial Data
spam                    SPArse Matrix
spatialCovariance       Computation of Spatial Covariance Matrices for
                        Data on Rectangles
spatialkernel           Nonparameteric estimation of spatial
                        segregation in a multivariate point process
spdep                   Spatial Dependence: Weighting Schemes,
                        Statistics and Models
splancs                 Spatial and Space-Time Point Pattern Analysis
stabledist              Stable Distribution Functions
stabs                   Stability Selection with Error Control
startupmsg              Utilities for Start-Up Messages
statmod                 Statistical Modeling
statnet.common          Common R Scripts and Utilities Used by the
                        Statnet Project Software
stepwise                Stepwise detection of recombination breakpoints
stringi                 Character String Processing Facilities
stringr                 Simple, Consistent Wrappers for Common String
strucchange             Testing, Monitoring, and Dating Structural
subselect               Selecting Variable Subsets
survey                  Analysis of Complex Survey Samples
survival                Survival Analysis
systemfit               Estimating Systems of Simultaneous Equations
tables                  Formula-Driven Table Generation
tcltk2                  Tcl/Tk Additions
tensorA                 Advanced tensors arithmetic with named indices
testthat                Unit Testing for R
texreg                  Conversion of R Regression Output to LaTeX or
                        HTML Tables
tfplot                  Time Frame User Utilities
tframe                  Time Frame Coding Kernel
tibble                  Simple Data Frames
tidyr                   Easily Tidy Data with 'spread()' and 'gather()'
tidyverse               Easily Install and Load 'Tidyverse' Packages
timeDate                Rmetrics - Chronological and Calendar Objects
tis                     Time Indexes and Time Indexed Series
tkrplot                 TK Rplot
tree                    Classification and Regression Trees
triangle                Provides the Standard Distribution Functions
                        for the Triangle Distribution
trimcluster             Cluster analysis with trimming
trust                   Trust Region Optimization
ucminf                  General-Purpose Unconstrained Non-Linear
urca                    Unit Root and Cointegration Tests for Time
                        Series Data
vcd                     Visualizing Categorical Data
vegan                   Community Ecology Package
viridis                 Default Color Maps from 'matplotlib'
viridisLite             Default Color Maps from 'matplotlib' (Lite
visNetwork              Network Visualization using 'vis.js' Library
waveslim                Basic wavelet routines for one-, two- and
                        three-dimensional signal processing
wnominate               Roll Call Analysis Software
xgboost                 Extreme Gradient Boosting
xml2                    Parse XML
xtable                  Export Tables to LaTeX or HTML
xts                     eXtensible Time Series
yaml                    Methods to Convert R Data to YAML and Back
zipfR                   Statistical models for word frequency
zoo                     S3 Infrastructure for Regular and Irregular
                        Time Series (Z's Ordered Observations)

Packages in library '/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library':

KernSmooth              Functions for Kernel Smoothing Supporting Wand
                        & Jones (1995)
MASS                    Support Functions and Datasets for Venables and
                        Ripley's MASS
Matrix                  Sparse and Dense Matrix Classes and Methods
MicrosoftR              Microsoft R umbrella package
R6                      Classes with Reference Semantics
RUnit                   R Unit test framework
RevoIOQ                 Microsoft R Services Test Suite
RevoMods                R Functions Modified For Revolution R
RevoUtils               Microsoft R Utility Package
RevoUtilsMath           Microsoft R Services Math Utilities Package
base                    The R Base Package
boot                    Bootstrap Functions (Originally by Angelo Canty
                        for S)
checkpoint              Install Packages from Snapshots on the
                        Checkpoint Server for Reproducibility
class                   Functions for Classification
cluster                 "Finding Groups in Data": Cluster Analysis
                        Extended Rousseeuw et al.
codetools               Code Analysis Tools for R
compiler                The R Compiler Package
curl                    A Modern and Flexible Web Client for R
datasets                The R Datasets Package
deployrRserve           Binary R server
doParallel              Foreach Parallel Adaptor for the 'parallel'
foreach                 Provides Foreach Looping Construct for R
foreign                 Read Data Stored by Minitab, S, SAS, SPSS,
                        Stata, Systat, Weka, dBase, ...
grDevices               The R Graphics Devices and Support for Colours
                        and Fonts
graphics                The R Graphics Package
grid                    The Grid Graphics Package
iterators               Provides Iterator Construct for R
jsonlite                A Robust, High Performance JSON Parser and
                        Generator for R
lattice                 Trellis Graphics for R
methods                 Formal Methods and Classes
mgcv                    Mixed GAM Computation Vehicle with GCV/AIC/REML
                        Smoothness Estimation
nlme                    Linear and Nonlinear Mixed Effects Models
nnet                    Feed-Forward Neural Networks and Multinomial
                        Log-Linear Models
parallel                Support for Parallel computation in R
png                     Read and write PNG images
rpart                   Recursive Partitioning and Regression Trees
spatial                 Functions for Kriging and Point Pattern
splines                 Regression Spline Functions and Classes
stats                   The R Stats Package
stats4                  Statistical Functions using S4 Classes
survival                Survival Analysis
tcltk                   Tcl/Tk Interface
tools                   Tools for Package Development
utils                   The R Utils Package

As usual, if these don't work right, it's something I got wrong and will fix. Email me.

As of 2017-04-25, we have solved the problems of compiling Java and tk-based R packages. In other words, we find ourselves roughly back in the place where we were in October, 2016, or perhaps a little bit ahead of that. Now that the gcc issues have been addressed, we are able to stay up to date with changes in the cutting edge packages like Rcpp, Rstan and OpenMx.

If you need other packages, I'll install them if you email me.

If you launch R and you don't find packages (in the output of library(), for example), it probably means you forgot the module magic.

If you are having trouble with Rstan, the likely sources of trouble are 1) errors in your ~/.R/Makevars file, or 2) old packages in your home folder ~/R/ that do not cooperate with the new R and the other packages we make available.

Make a module script file

I have one more lesson. Instead of re-typing that stanza whenever it is needed, put those lines in a file. I just tested this. I put the module stanza in a file named rstats.sh, saved that in $HOME/bin, and made it executable ("chmod +x rstats.sh"). After that, it seems to work to run

source rstats.sh
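
Here is a minimal sketch of that setup as shell commands. It assumes the stanza is the three-line one used for the crmda group on this page, and BIN_DIR is just a variable name I made up for the illustration. Adjust both to your account.

```shell
# Sketch: save the module stanza in a file so it need not be retyped.
# BIN_DIR is a hypothetical variable for this example; the post uses $HOME/bin.
BIN_DIR="${BIN_DIR:-$HOME/bin}"
mkdir -p "$BIN_DIR"

# Assumed stanza contents; substitute whatever module lines you actually use.
cat > "$BIN_DIR/rstats.sh" <<'EOF'
module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
EOF
chmod +x "$BIN_DIR/rstats.sh"

# Source the file (rather than execute it) so the modules load in the current shell.
# The guard skips this step on machines that have no "module" command.
if command -v module >/dev/null 2>&1; then
  . "$BIN_DIR/rstats.sh"
fi
```

The file has to be sourced. If it is run as an ordinary program, the modules are loaded in a child shell that disappears as soon as the script ends.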

Building your own packages?

If we don't build packages for you, you have to build your own. Here is a lesson from the school of hard knocks. In the new CRC cluster, the memory limits of your sessions are strictly enforced. The compiler will often use more than 2GB memory. As a result, when you try to build a package inside R with "install.packages", you may get a vague message of failure. To protect yourself against that, it is wise to ask for an interactive session with more memory. I do this, for example:

$ msub -I -X -l nodes=1:ppn=1,pmem=6144m

That is sufficient to compile Rstan, which is the most intensive package I have tried to build.

Revolutions R in new acf cluster

The cluster runs on RedHat RHEL 6, which is too old to support the new versions of R. The principal weakness is the older gcc compiler in RHEL6.

In the cluster, however, we have access to much newer Intel MKL compiler and math libraries, so the R program, and the things on which it relies, can be built with the Intel compiler. It appears as though we can stay up to date with the troublesome R modules like Rstan, Rcpp, RcppArmadillo.

Wes Mason of ITTC worked this out for us. The scheme we are testing now can be accessed as follows.

For people in the crmda user group, try this interactively

$ module purge
$ module use /panfs/pfs.local/work/crmda/tools/modules
$ module load Rstats/3.3

After that, observe

$ R

 > library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev: 
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

We are still in a testing phase on this setup, so surely there will be problems. I do not yet understand what is necessary to compile new R packages with this setup. We don't want packages built with gcc if we can avoid it, because there is always a danger of incompatibility when shared libraries are built with different compilers.

But the key message is still encouraging. Even though the OS does not have the needed parts, there is a workaround.

Why is this "Revolution R"? The company Revolution Analytics, maker of Revolution R, which was later purchased by Microsoft, popularized the use of the Intel MKL on Ubuntu Linux. A version of R built with Intel's compiler was used, with permission, on Ubuntu in 2012. The version of R we are using now goes by the moniker "MRO". Can you guess what the M and the R stand for?

Making sure fonts are embedded in LaTeX thesis and dissertation documents

KU thesis rules require that all fonts used in the submitted PDF document must be embedded in the document itself. This is required to eliminate the problem that special symbols are not legible in the document on the receiver's computer.

Making sure all fonts are embedded appears to be not so easy across platforms. When I compile the ku thesis document, I notice the Wingding and symbols are not embedded.

However, this is not a flaw in pdflatex as it currently exists. It was a pdflatex flaw in the past. So far as I can tell, all fonts needed in the pdflatex run are embedded if you use a LaTeX distribution that is reasonably modern.

The major problem arises when a document includes other PDF documents, using \includegraphics{} for example. If those included documents are lacking in embedded fonts, then pdflatex does not fix that.

In my example document, before 20160503, the fonts were missing because they were not embedded in the R plots that are included in the example chapters. I had to go back and re-run the R code to make sure the fonts are embedded in the pdf files for the graphs. After that, the pdflatex output of the thesis template is fine.

You can check for yourself. Run

$ pdffonts thesis-ku.pdf
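
The pdffonts report has one row per font, and the "emb" column says whether the font is embedded. Here is a small sketch of a filter that lists only the fonts that are not embedded. The function name list_unembedded_fonts is my own invention, and the filter assumes the usual column layout (name, type, encoding, emb, sub, uni, object ID) and font names without spaces.

```shell
# Read pdffonts output on stdin and print the names of fonts whose "emb"
# column is "no". The font type can contain spaces ("Type 1"), so the emb
# column is located by counting back from the end of the row:
#   ... emb sub uni object-number generation
list_unembedded_fonts() {
  awk 'NR > 2 && NF >= 8 && $(NF-4) == "no" { print $1 }'
}

# Typical use on the cluster (requires pdffonts to be installed):
#   pdffonts thesis-ku.pdf | list_unembedded_fonts
```

If the filter prints nothing, every font that pdffonts found is embedded.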

If we don't fix the R output files before compiling the thesis itself, we are in a somewhat dangerous situation. People suggest using various magic wands to add fonts, but all of them seem to have major flaws. They either corrupt the quality of the output or destroy its internal structure.

I found ways to embed fonts using ghostscript. This converts the document to ps and then back to pdf.

$ pdf2ps  thesis-ku.pdf test.ps
$ ps2pdf14 -dPDFSETTINGS=/prepress -dEmbedAllFonts=true test.ps

The bad news: 1) It destroys internal hyperlinks. 2) It DOES NOT embed fonts needed for material in embedded graphs (things inserted by \includegraphics, such as PDF produced by R).

In my opinion, this is a bad outcome that should not happen. But it does.

As a result, it seems necessary to fix the individual PDF graphics files before compiling the larger thesis document.

This reminds me that at one point I had a post-processing script written for R Sweave sessions that would embed fonts in all pdf output files.

The shell script would cycle through all of the R output and embed fonts. Enjoy!

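for i in *.pdf; do
  base=`basename "$i" .pdf`
  ## basenew was never defined in the script as posted; this temporary name is an assumption
  basenew="${base}-embedded.pdf"
  ##echo "$i base: $base new: $basenew"
  /usr/bin/gs -o "$basenew" -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite "$i" &&
    mv -f "$basenew" "$i"
done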

The same can be achieved inside R. Each time a PDF is created, embed the fonts with the embedFonts() function. See ?embedFonts

ACF Cluster resource limits: home file space and file quota

User home folders are limited at 100GB and no customization is allowed. To our users who were previously limited to 20GB, that's great news. To the others who had 600GB allocations, that's disaster. Oh, well. Just one among many.

When you log in on hpc.crc.ku.edu, a system status message appears. One report is the disk usage. Here's what I see today:

Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn

   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  65.04  85.00 100.00 :  136150  85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

In case you want to see the same output, the new cluster has a command called "mystats" which will display it again. In the terminal, run

$ mystats

In the output about my home folder, there is a "hard limit" at 100GB, as you can see. That is not adjustable in the current regime.

The main concern today is that I'm over the limit on the number of files. The limit is now 100,000 files but I have 136150. If I'm over the limit, I am not allowed to create new files. If I remain over the limit, the system can prevent me from doing my job.
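
To make the arithmetic explicit, here is a small sketch that reads the usage line from that report and prints how far the file count is over the hard limit. The function name file_quota_summary is hypothetical, and the field layout is assumed to be exactly the one in the report above (GB, soft, hard, a colon, files, soft, hard, a colon, path).

```shell
# Read the status report on stdin and summarize the file count on the usage line.
# Assumed layout: GB soft hard : files soft hard : path ...
# The numeric test on field 5 skips the header row, which has the same colons.
file_quota_summary() {
  awk '$4 == ":" && $8 == ":" && $5 ~ /^[0-9]+$/ {
    printf "files=%d soft=%d hard=%d over_hard=%d\n", $5, $6, $7, $5 - $7
  }'
}

echo '  65.04  85.00 100.00 :  136150  85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)' |
  file_quota_summary
# prints: files=136150 soft=85000 hard=100000 over_hard=36150
```

A negative over_hard value would mean the account is still under the hard limit.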

Wait a minute. 136,150 files? WTH? Last time I checked, there were only 135,998 files and I'm sure I did not add any. Did some make babies? Do you suppose some R files found some C++ files and made an Rcpp project? (That's programmer humor. It knocks them out at conferences.)

I probably have files I don't need any more. I'm pretty sure that, for example, when I compile R, it uses tens of thousands of files. Maybe I can move that work somewhere else.

I wondered how I could find out where I have all those files. We asked and the best suggestion so far is to run the following, which sifts through all directories and counts the files.

for i in $(find . -maxdepth 1 -type d);do echo $i;find $i -type f |wc -l;done

The return shows directory names and file counts, like this:

995 .

I'll have to sift through that. Clearly, there are some files I can live without. I've got about 20K files in TMPRlib, which is a building spot for R packages before I put them in the generally accessible part of the system. .ccache is the compiler cache, I can delete those files. They just get regenerated and saved to speed up C compiler jobs, but I have to make a choice there.
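
Here is a sketch of a variant of that file-counting loop that I find easier to read. It is not the command we were given. It is my own rewrite, with a made-up function name, that puts the count and the folder on one line, sorts the biggest offenders to the top, and tolerates folder names with spaces.

```shell
# Print "count directory" for each immediate subdirectory of the starting
# folder (default: the current one), largest count first. Hidden folders such
# as .ccache are included; symbolic links are skipped.
count_files_by_dir() {
  top="${1:-.}"
  for d in "$top"/*/ "$top"/.[!.]*/; do
    d="${d%/}"
    # Skip unmatched glob patterns and symbolic links.
    [ -d "$d" ] && [ ! -L "$d" ] || continue
    n=$(find "$d" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$n" "$d"
  done | sort -rn
}

# Typical use from the home folder:
#   cd "$HOME" && count_files_by_dir | head -n 20
```

Files that sit directly in the starting folder are not counted, since the point is to find which subfolders hold the bulk of the files.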

So far, I've obliterated the temporary build information, but I remain over the quota. I'll show the output from "mystats" so that you can see the difference:

$ mystats
Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn
   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  63.26  85.00 100.00 :  113510  85000 100000 : /home/pauljohn uid:xxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

Oh, well, I'll have to cut/move more things.

The take-aways from this post are

  1. The CRC put in place a hard, unchangeable 100GB limit on user home directories.

  2. There is a limit of 100,000 on the number of files that can be stored within that. Users will need to cut files to be under the limit.

  3. One can use the find command in the shell to find out where the files are.

How to avoid the accidental buildup of files? The main issue is that compiling software (R packages) creates intermediate object files that are not needed once the work is done. It is difficult to police these files (at least it is for me).

I don't have time to write all this down now, but here is a hint. The question is where to store "temporary" files that are needed to compile software or run a program, but are not needed after that. In many programming chores, one can link the "build" folder to a faster, temporary storage device that is not in the network file system. In the past, I've usually used "/tmp/a_folder_i_create" because that is on the disk "in" the compute node. Disk access on the local disk is much faster than the network file system. Lately, I'm told it is even faster to put temporary material in "/dev/shm", but I do not have much experience with that.

With a little clever planning, one can write the temporary files to a much faster local or memory disk, where they are easily disposed of and, so far as I can see today, do not count against the file quota. This is not to be taken lightly. I've compared the time required to compile R using the network file storage against the local temporary storage. The difference is 45 minutes versus 15 minutes.
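
As a rough sketch of the hint above, the following creates a throwaway build folder on node-local storage and cleans it up when the shell exits. The names scratch_base and BUILD_DIR are made up for the example. Whether /dev/shm is available, and whether files there count against your memory request, are things I have not verified on our cluster.

```shell
# Pick a node-local scratch location: /dev/shm (memory-backed) if it is
# writable, otherwise /tmp. Both choices are assumptions about the node setup.
scratch_base=/tmp
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  scratch_base=/dev/shm
fi

# Create a private build folder and remove it automatically when the shell exits.
BUILD_DIR=$(mktemp -d "$scratch_base/build.XXXXXX")
trap 'rm -rf "$BUILD_DIR"' EXIT

# Many tools, R included, put their temporary files under $TMPDIR.
export TMPDIR="$BUILD_DIR"
echo "temporary build files go in $BUILD_DIR"
```

Run the build from the same shell after these lines, and copy anything you want to keep to $HOME or $WORK before you log out, because the trap deletes the whole folder.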

Interactive sessions on HPC

Danger: new smaller memory default!

At the user meeting on April 12, we found out that requesting 1 core will automatically provide only 500MB of memory. This is a BIG change, because in the older cluster we received 2GB per core and that was generally sufficient. That is to say, we almost never specified memory.

The default interactive session is not likely to be sufficient, so you will need to specify memory.

As a result, the command to ask for 1 node with 1 processor (core) on that node would be

msub -X -I -l nodes=1:ppn=1,pmem=2048m 

This asks for graphics X11 forwarding (-X). The memory can also be specified as "2gb".

If you only want 1 core on 1 node, the simpler notation would be to use the flag "procs".

msub -X -I -l procs=1,pmem=2048m 

To ask for several cores on 1 node (test multicore project), run

msub -X -I -l nodes=1:ppn=5,pmem=2048m

** Specify a queue **

Interactive jobs can be run on any queue. By default, they go to the user's nodes.

The default queue is displayed with 'mystats'. If you wish to run on a node that is not in your owner group, like a GPGPU node, you will then need to specify the sixhour queue and the node name. You will only have a maximum of 6 hours on this node. There is no time limit to your default queue.

msub -X -I -l nodes=1:ppn=5,pmem=2048m -q sixhour

One can specify a particular node, "g0001", with a request like:

msub -X -I -lnodes=g001:ppn=1 -q sixhour
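
Since these requests differ only in the core count, the memory, and the queue, here is a sketch of a helper that assembles the command line for you. The function name interactive_request and its argument order are my own invention. It only prints the command, so you can inspect it before pasting it in.

```shell
# Print (do not run) an msub command for an interactive session on one node.
# Arguments: cores on the node, pmem value in MB, optional queue name.
interactive_request() {
  ppn="${1:-1}"
  pmem_mb="${2:-2048}"
  queue="${3:-}"
  cmd="msub -X -I -l nodes=1:ppn=${ppn},pmem=${pmem_mb}m"
  if [ -n "$queue" ]; then
    cmd="$cmd -q $queue"
  fi
  echo "$cmd"
}

interactive_request 5 2048 sixhour
# prints: msub -X -I -l nodes=1:ppn=5,pmem=2048m -q sixhour
```

With no arguments, it prints the one-core, 2048m request shown earlier in this post.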

CRC made a page regarding queues (originally at http://crc.ku.edu/using-hpc#Submitting) and has relocated it to http://crc.ku.edu/queues

Update 20170413

We requested a simpler way to launch the usual type of interactive session--one node, one core--as we had in the old cluster. The administrators created a script "qxlogin" which the user can run from the login node.

$ qxlogin
qsub: waiting for job 40565091.sched to start
qsub: job 40565091.sched ready

We suggest caution with this, since the new memory default limit is 500MB and CRMDA users have regularly reported frustration with unanticipated job failures.

In case you want to write your own login script, you can take an example from the new qxlogin, which I found is installed in /usr/local/bin on the new cluster.

$ cat /usr/local/bin/qxlogin


/opt/moab/bin/msub -X -I -lnodes=1:ppn=1 $ARGS

If you want more interactive nodes, or more ppn, just change the 1's. To test that, suppose you save it as "qxlogin2", then run

$ sh qxlogin2

If you enjoy the result, save that file in your $HOME/bin directory, make it executable, and then it will be more generally available within your sessions. After that, there is no need to run "sh" before "qxlogin2". Try it out, let me know if there is trouble.
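
Here is a minimal sketch of those steps, with one change I would suggest: the hypothetical "qxlogin2" below adds an explicit pmem request so that the session does not fall back to the 500MB default. The 2048m value and the BIN_DIR variable are assumptions for the illustration, and "$@" stands in for whatever extra arguments you want to pass along to msub.

```shell
# BIN_DIR is a hypothetical variable for this example; the post uses $HOME/bin.
BIN_DIR="${BIN_DIR:-$HOME/bin}"
mkdir -p "$BIN_DIR"

# Write a personal variant of the login script with a memory request added.
cat > "$BIN_DIR/qxlogin2" <<'EOF'
#!/bin/sh
# Interactive session: 1 node, 1 core, 2048MB, with X11 forwarding.
exec /opt/moab/bin/msub -X -I -l nodes=1:ppn=1,pmem=2048m "$@"
EOF

# Make it executable so it can be run by name when BIN_DIR is on the PATH.
chmod +x "$BIN_DIR/qxlogin2"
```

To ask for more cores or more memory, edit the ppn and pmem values in the file.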
