Some of you have reported segmentation faults during the past week. We learned they come from 3 different problems. First, some people have R packages compiled in their user accounts. These fall out of date relative to the R packages we provide, causing incompatibility. Second, some new compute nodes came online during the past 2 weeks and some are missing support libraries. When these are missing, the R packages that rely on them (such as our beloved kutils or rockchalk) fail to load. This was a tricky problem because it only happened on some nodes, which only recently became available. Third, I did not understand the gravity and drama involved with the user account setup and the Rmpi package.
Let's cut to the chase. What should users do now?
Those statements are not having the anticipated effect, and they will destroy the benefits of the changes I suggest next.
I'm told this problem does not affect all MPI jobs, just ones that use R and the style of parallelization that we understand.
Some module should be available for every session launched for your account, in every node. These have to be THE SAME in all nodes and cores launched by the job. There are 2 ways to get this done.
In the cluster file system, I have a file /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh with contents like this:
#!/bin/bash
module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
OMPI_MCA_btl=^openib
export OMPI_MCA_btl
I say "like this" because I may insert new material there. The last 2 lines were inserted July 22, 2017. The goal is to conceal all of the details from users by putting them in a module that's loaded, such as Rstats/3.3. When we are ready to transition to R-3.4, I'll change that line accordingly.
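For readers wondering what the last 2 lines do: OpenMPI treats an environment variable named OMPI_MCA_<parameter> as a setting for that parameter, so exporting OMPI_MCA_btl with the value ^openib asks it to leave out the openib transport. Here is a small sketch for checking whether the current session has that variable exported. The function name report_btl_env is made up for this illustration; it is not part of crmda_env.sh.

```shell
#!/bin/bash
# report_btl_env is a hypothetical helper, not part of the CRMDA tools.
# It looks only at the exported environment, which is what child processes see.
report_btl_env() {
    local value
    # printenv prints nothing (and fails) when the variable is not exported.
    value=$(printenv OMPI_MCA_btl || true)
    case "$value" in
        "")        echo "OMPI_MCA_btl is not exported" ;;
        *^openib*) echo "openib excluded via environment" ;;
        *)         echo "OMPI_MCA_btl=$value (openib not excluded)" ;;
    esac
}

report_btl_env
```

A variable that is assigned but not exported is reported as missing, which is why the env file has the separate export line.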
In your user accounts, there are 2 files where you can incorporate this information: ~/.bashrc and ~/.bash_profile. Add a last line in those files like this:
source /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh
I'll show you my ~/.bashrc file so you can see the larger context:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# User specific aliases and functions
export LS_COLORS=$LS_COLORS:'di=0;33:'
# alert for rm, cp, mv
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
# color and with classification
alias ls='ls -F --color=auto'
alias ll='ls -alF'
source /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh
I strongly urge all of our cluster users to include the "alert for rm, cp, mv" piece. This causes the system to ask for confirmation before deleting or replacing files. But that's up to you. I also have an adjustment to the colors of the directory listing.
I insert the same "source" line at the end of ~/.bash_profile as well.
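If you would rather not edit the two files by hand, the same change can be scripted. This is only a sketch: add_source_line is a hypothetical helper name, and it assumes the env file stays at the path shown above. It appends the line only when it is not already present, so running it twice does no harm.

```shell
#!/bin/bash
# add_source_line is a hypothetical helper, not part of the CRMDA tools.
# It appends the "source" line to a startup file unless it is already there.
add_source_line() {
    local file="$1"
    local line="source /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh"
    touch "$file"
    # If the file is non-empty and does not end in a newline, add one first
    # so the new line is not glued onto the previous one.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # -q quiet, -x whole-line match, -F fixed string (no regular expression)
    if ! grep -qxF "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage (uncomment to change your own account):
# add_source_line ~/.bashrc
# add_source_line ~/.bash_profile
```

Using one function for both files is a simple way to keep ~/.bashrc and ~/.bash_profile from drifting apart.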
On 2017-07-23, I made a minor edit in my .bashrc and .bash_profile files:
export PATH=/panfs/pfs.local/work/crmda/tools/bin:$PATH
source crmda_env.sh
This is equivalent, but gives me a side benefit. Instead of adding the source function with the full path, I inserted that bin folder into my path. That means I can use any script in that folder without typing out the full path. When I find very handy shell scripts that I use often, and I think the other users should have access to them as well, then I will put them in that folder. For example, if you look there today, you should see "crmda_env-test.sh", which is the new one I'm working on. When that's ready, it will become "crmda_env.sh" and the old one will get renamed as "crmda_env-2017xxxx.sh", where xxxx is the date on which it becomes the old one.
Make sure you put the same modules in both ~/.bashrc and ~/.bash_profile. Look at the file /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh to get ideas of what you need. For example, run
$ cat /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh
You might consider creating a file similar to /panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh in your account. Then source that at the end of your ~/.bashrc and ~/.bash_profile. If you do that, they will always stay consistent.
Answer: Yes, I have some answers.
Here is the basic issue. Suppose you have a submission script that looks like this:
#!/bin/sh
#
#
#MSUB -N RParallelHelloWorld
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -M your-name-here@ku.edu
#MSUB -m bea
cd $PBS_O_WORKDIR
module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
mpiexec -n 1 R --vanilla -f parallel-hello.R
I thought we were supposed to do that, until last week. Here's what is wrong with it.
The environment specifies Rstats/3.3, but that ONLY applies to the "master" node in the R session. It does not apply to the "child" nodes that are spawned by Rmpi. When those nodes are spawned, they are completely separate shell sessions, and they are launched with the settings in ~/.bash_profile. If your ~/.bash_profile does not have the required modules, then the new nodes are going to have the system default R session, and guess what you get with that? The wrong shared libraries for just about everything. Possibly you get a different version of Rmpi or Rcpp loaded, and when the separate nodes start talking to each other, they notice the difference and sometimes crash.
As a result, the submission scripts, for example, in hpcexample/Ex65-R-parallel, will now look like this:
#!/bin/sh
#
#
#MSUB -N RParallelHelloWorld
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -M pauljohn@ku.edu
#MSUB -m bea
cd $PBS_O_WORKDIR
## Please check your ~/.bash_profile to make sure
## the correct modules will be loaded with new shells.
## See discussion:
## http://www.crmda.dept.ku.edu/timeline/archives/184
mpiexec -n 1 R --vanilla -f parallel-hello.R
Answer: You ask a lot of questions.
The short answer is "there's some computer nerd detail". The long answer is: when you log in on a system, the settings in ~/.bash_profile are used. That is a 'login shell'. If you are logged in already, and you run a command that launches a new shell inside your session, for example by running "bash", then your new shell is not a 'login shell'. It will be created with the settings in ~/.bashrc.
If you will never run an interactive session, never interact with R via Emacs or Rstudio, then it might be enough to change ~/.bash_profile. If you think you might ever want to log in and run a small test case, then you should have the same settings in both ~/.bashrc and ~/.bash_profile.
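To see the login/non-login difference for yourself, here is a minimal sketch. It relies on the bash option login_shell, which bash sets only in login shells. The variable name probe is just for this illustration, and --noprofile is added so that the demonstration prints nothing except the probe result.

```shell
#!/bin/bash
# A one-line probe: bash sets the read-only option "login_shell" in login shells.
probe='if shopt -q login_shell; then echo login; else echo non-login; fi'

# A child shell started the ordinary way is not a login shell.
bash -c "$probe"

# -l asks bash to act as a login shell, the kind that reads ~/.bash_profile.
# --noprofile skips the startup files so only the probe result is printed.
bash --noprofile -l -c "$probe"
```

The first command prints non-login and the second prints login.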
Answer: Over time, the CRMDA R setup may evolve. Right now, I've already built a setup Rstats/3.4. After we do some bug-testing, then I can easily update the shell file (/panfs/pfs.local/work/crmda/tools/bin/crmda_env.sh) and use that. If you maintain your own modules, then you have to do that yourself.
Answer: If I get it wrong, then you get it wrong.
Answer: Yes. It has not been a good week. And it looks like it won't be a good week again.
Answer: Because "we" were in control of the cluster settings and user accounts, the cluster administrators would work all of this out for us and they inserted the settings in the shell for us. Some of you may open your ~/.bashrc or ~/.bash_profile and see the old cluster settings. When I opened mine on 2017-07-07, I noticed that I had modules loaded from the old cluster. I also noticed I'd made an error of editing ~/.bashrc and not ~/.bash_profile.
Answer: Dumb luck.
In the new CRC-supervised cluster, some modules are loaded automatically. Because those modules were more or less consistent with what we need, the different environments were not causing segmentation faults. However, when we updated R packages like Rstan, Rcpp, and, well, anything with a lot of shared libraries, we hit the crash.
Answer: The documentation says that orterun, mpiexec, and mpirun are all interchangeable. I rather enjoyed orterun, it sounds fancy. However, it appears mpiexec is more widely used. There are more advanced tools (such as mpiexec.hydra, which we might start using).
Answer: The program mpiexec is compiled in a way that makes this no longer necessary. It is not harmful to specify $PBS_NODEFILE, but it is not needed either. The hpcexamples will get cleaned up. The CRMDA cluster documentation will need to be corrected.
During the past 2 months, some of us have seen the MPI warning from parallel R programs:
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
We have wrestled with this. Today I've decided what to do. The CRMDA modules for Rstats/3.3 and 3.4 will prevent the OpenMPI (parallel computing) framework from trying to access the Infiniband network devices. That makes the warning go away. Because the ethernet communication devices are slower than Infiniband, this is not a decision taken lightly.
The CRMDA R module stanza should "just work", either
module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
or
module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.4
How is this done?
I've rebuilt openmpi-1.10.7, which is also now in our module collection, so I have power to insert the special configuration described below.
The packages list that is kept up to date, system-wide, is the same in Rstats-3.3 or Rstats-3.4. A full list is included at the end of this announcement.
If you find that updates cause your applications to break, users are allowed to install old versions of R packages in ~/R.
Embarrassingly, while googling for help on this message, I've discovered that, in 2010, I was in the exact same situation setting up the CRMDA cluster that used to be in the Structural Biology Center. It had completely gone out of my mind, but with the new cluster in 2017 and fresh installs of OpenMPI, we hit the problem again.
Here is what I've learned about OpenMPI and Rmpi during the past 2 weeks.
I don't understand computer science enough to understand fully the dangers of forks and data corruption when OpenMPI uses infiniband. However, perhaps one of you can tell me.
Rmpi will compile with OpenMPI >= 2.0, but it is not fully compatible. The Rmpi author has written to me directly that he is working on revisions that will make these compatible. One symptom of the problem we find is that stopCluster() does not work. It hangs the session entirely. The only way to shut down the cluster is mpi.quit(), which terminates the R session entirely.
Rmpi will compile/run with OpenMPI < 2.0.
However, on systems that have Infiniband connective devices and openib libraries, there will be warnings about threads and forks as well as a danger of data corruption. The warning from OpenMPI is triggered by such innocuous R functions as sessionInfo().
Here is a session that shows the warning, using R-3.4 in the cluster.
$ R
R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Microsoft R Open 3.4.0
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2017 Microsoft Corporation
Using the Intel MKL for parallel mathematical computing(using 1 cores).
Default CRAN mirror snapshot taken on 2017-05-01.
See: https://mran.microsoft.com/.
[Previously saved workspace restored]
> library(Rmpi)
> sessionInfo()
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: n410 (PID 34456)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)
Matrix products: default
BLAS: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/libRblas.so
LAPACK: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/libRlapack.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rmpi_0.6-6 RevoUtilsMath_10.0.0
loaded via a namespace (and not attached):
[1] compiler_3.4.0 RevoUtils_10.0.4 parallel_3.4.0
I do not know how dangerous forks might be, but if you go read this message, it appears they can cause data corruption, and this has been known since 2010:
https://www.mail-archive.com/devel@lists.open-mpi.org/msg08785.html
It is above my understanding to say whether garden variety R users will cause these problems. I do know the R parallel documentation warns against system calls and forks, possibly for the same reason. R functions that use disk--dir.create, list.files--make a system call that would fall into the dangerous fork category. Possibly. This is a little above my pay grade.
My "better safe than sorry" instinct leads to this conclusion: TURN OFF INFINIBAND SUPPORT IN OpenMPI. This is the policy we adopted in 2010. It was in place on the KU community cluster. In the new cluster, it was not in place, resulting in the warning message. I had forgotten about this for a long time. With newly installed OpenMPI, I ran into the same old problem.
This can be done in the user account, by adding ~/.openmpi/mca-params.conf (or, systemwide in the openmpi install folder etc/openmpi-mca-params.conf) with this line.
btl = ^openib
That prevents OpenMPI from using Infiniband transport layer. I am doing this in the CRMDA OpenMPI module configuration.
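For users who want the per-user version of this setting, the file can be created from the shell. What follows is a sketch under stated assumptions: disable_openib is a hypothetical helper name, the default path is the per-user file mentioned above, and the function does not try to reconcile any other btl line that may already be in the file.

```shell
#!/bin/bash
# disable_openib is a hypothetical helper, not part of OpenMPI or the CRMDA tools.
# It makes sure the line "btl = ^openib" is present in an MCA parameter file.
disable_openib() {
    local conf="${1:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Skip the append if an uncommented btl line already excludes openib.
    if ! grep -Eq '^[[:space:]]*btl[[:space:]]*=[[:space:]]*\^openib' "$conf"; then
        printf 'btl = ^openib\n' >> "$conf"
    fi
}

# Usage (uncomment to change your own account):
# disable_openib
```

If the file already holds a different btl setting, inspect it by hand instead of relying on this function.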
One can tell that an Infiniband device is detected with the shell program "ompi_info" provided by OpenMPI. Load the module Rstats/3.3 or Rstats/3.4. After running "ompi_info", look for the btl stanza. The return from ompi_info is like this if you have Infiniband.
MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)
And like this after changing either ~/.openmpi/mca-params.conf or etc/openmpi-mca-params.conf to include btl = ^openib.
MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)
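Rather than reading the whole ompi_info listing by eye, the check can be reduced to one line of output. This is a sketch only: has_openib is a hypothetical helper name, and it assumes the listing uses the "MCA btl: openib" wording shown above.

```shell
#!/bin/bash
# has_openib is a hypothetical helper, not part of OpenMPI.
# It reads ompi_info output on standard input and reports whether
# the openib component appears among the btl entries.
has_openib() {
    if grep -Eq 'MCA btl:[[:space:]]*openib' ; then
        echo "openib enabled"
    else
        echo "openib disabled"
    fi
}

# Usage, after loading Rstats/3.3 or Rstats/3.4:
# ompi_info | has_openib
```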
I believe it is worth mentioning that, if some of your compute nodes have Infiniband and some do not, then OpenMPI jobs will crash if they try to integrate nodes connected with ethernet and Infiniband. That is another reason to tell OpenMPI not to try to use Infiniband at all.
If users do want to use Infiniband within OpenMPI, they can do so by editing a personal configuration file in ~/.openmpi.
As of 2017-07-05, these are the packages we install in the directory "/panfs/pfs.local/work/crmda/tools/mro/3.3" (or 3.4)
c("ADGofTest", "AER", "Amelia", "BH", "BMA", "BradleyTerry2",
"Cairo", "Cubist", "DBI", "DCluster", "DEoptimR", "Devore7",
"DiagrammeR", "ENmisc", "Ecdat", "Ecfun", "Formula", "GPArotation",
"HistData", "Hmisc", "HyperbolicDist", "ISwR", "Iso", "JGR",
"JM", "JMdesign", "JavaGD", "Kendall", "LearnBayes", "MCMCpack",
"MCPAN", "MEMSS", "MNP", "MPV", "MatchIt", "Matching", "MatrixModels",
"MplusAutomation", "NMF", "PASWR", "PolynomF", "R2HTML", "R2OpenBUGS",
"RColorBrewer", "RCurl", "RGtk2", "RSvgDevice", "RUnit", "RandomFields",
"Rcmdr", "RcmdrMisc", "Rcpp", "RcppArmadillo", "RcppEigen", "Rd2roxygen",
"Rmpi", "SASmixed", "SemiPar", "SoDA", "SparseM", "StanHeaders",
"StatDataML", "SweaveListingUtils", "TH.data", "TeachingDemos",
"UsingR", "VGAM", "VIM", "XML", "Zelig", "abind", "acepack",
"actuar", "ada", "ade4", "adehabitat", "akima", "alr3", "amap",
"aod", "ape", "aplpack", "arm", "arules", "assertthat", "backports",
"base64enc", "bayesm", "bcp", "bdsmatrix", "bestglm", "betareg",
"biglm", "bit", "bit64", "bitops", "bnlearn", "brew", "brglm",
"caTools", "cairoDevice", "car", "caret", "cellranger", "censReg",
"chron", "clue", "clv", "cocorresp", "coda", "coin", "colorspace",
"combinat", "copula", "corpcor", "crayon", "cubature", "data.table",
"deldir", "descr", "dichromat", "digest", "diptest", "distr",
"dlm", "doBy", "doMC", "doMPI", "doParallel", "doSNOW", "dotCall64",
"dse", "e1071", "earth", "ecodist", "effects", "eha", "eiPack",
"emplik", "evaluate", "expm", "faraway", "fastICA", "fastmatch",
"fda", "ffmanova", "fields", "flexmix", "foreach", "formatR",
"forward", "gam", "gamlss", "gamlss.data", "gamlss.dist", "gamm4",
"gbm", "gclus", "gdata", "gee", "geepack", "geoR", "geoRglm",
"ggm", "ggplot2", "glmc", "glmmBUGS", "glmmML", "glmnet", "glmpath",
"gmodels", "gmp", "gpclib", "gridBase", "gridExtra", "gsl", "gsubfn",
"gtable", "gtools", "hexbin", "highr", "htmltools", "htmlwidgets",
"igraph", "ineq", "influence.ME", "inline", "iplots", "irlba",
"iterators", "itertools", "jpeg", "jsonlite", "kernlab", "knitr",
"kutils", "labeling", "laeken", "languageR", "lars", "latticeExtra",
"lava", "lavaan", "lavaan.survey", "lazyeval", "leaps", "lme4",
"lmeSplines", "lmec", "lmm", "lmtest", "locfit", "logspline",
"longitudinal", "longitudinalData", "lpSolve", "ltm", "magrittr",
"manipulate", "maps", "maptools", "markdown", "matrixcalc", "maxLik",
"mboost", "mcgibbsit", "mclust", "mcmc", "mda", "memisc", "memoise",
"mi", "micEcon", "mice", "microbenchmark", "mime", "minqa", "misc3d",
"miscTools", "mitools", "mix", "mixtools", "mlbench", "mnormt",
"modeltools", "msm", "multcomp", "munsell", "mvProbit", "mvbutils",
"mvtnorm", "network", "nloptr", "nnls", "nor1mix", "norm", "nortest",
"np", "numDeriv", "nws", "openxlsx", "ordinal", "orthopolynom",
"pan", "partDSA", "party", "pbivnorm", "pbkrtest", "pcaPP", "permute",
"pixmap", "pkgKitten", "pkgmaker", "plm", "plotmo", "plotrix",
"pls", "plyr", "pmml", "pmmlTransformations", "png", "polspline",
"polycor", "polynom", "portableParallelSeeds", "ppcor", "profileModel",
"proto", "proxy", "pscl", "psidR", "pspline", "psych", "quadprog",
"quantreg", "randomForest", "randomForestSRC", "rattle", "rbenchmark",
"rbugs", "rda", "readxl", "registry", "relimp", "rematch", "reshape",
"reshape2", "rgenoud", "rgl", "rlang", "rlecuyer", "rmarkdown",
"rms", "rngtools", "robustbase", "rockchalk", "roxygen2", "rpart.plot",
"rpf", "rprojroot", "rrcov", "rstan", "rstudioapi", "sandwich",
"scales", "scatterplot3d", "segmented", "sem", "semTools", "setRNG",
"sets", "sfsmisc", "shapefiles", "simsem", "sm", "smoothSurv",
"sna", "snow", "snowFT", "sp", "spam", "spatialCovariance", "spdep",
"splancs", "stabledist", "stabs", "startupmsg", "statmod", "statnet.common",
"stepwise", "stringi", "stringr", "strucchange", "subselect",
"survey", "systemfit", "tables", "tcltk2", "tensorA", "testthat",
"texreg", "tfplot", "tframe", "tibble", "tidyverse", "timeDate",
"tis", "tree", "triangle", "trimcluster", "trust", "ucminf",
"urca", "vcd", "vegan", "visNetwork", "waveslim", "wnominate",
"xtable", "xts", "yaml", "zipfR", "zoo", "KernSmooth", "MASS",
"Matrix", "MicrosoftR", "R6", "RUnit", "RevoIOQ", "RevoMods",
"RevoUtils", "RevoUtilsMath", "base", "boot", "checkpoint", "class",
"cluster", "codetools", "compiler", "curl", "datasets", "deployrRserve",
"doParallel", "foreach", "foreign", "grDevices", "graphics",
"grid", "iterators", "jsonlite", "lattice", "methods", "mgcv",
"nlme", "nnet", "parallel", "png", "rpart", "spatial", "splines",
"stats", "stats4", "survival", "tcltk", "tools", "utils")
Minor bug fixes will be offered in our package server KRAN, which users can access by running R code like this
CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE, checkBuilt = TRUE)
That presupposes you have kutils already, of course. If not, run install.packages instead.
I've just uploaded to KRAN version 1.10, which has a little fix in the reverse function, which is intended to reverse the ordering of factor levels. In case you wonder what this is, here is a code snippet:
##' Reverse the levels in a factor
##'
##' Simple literal reversal. Will stop with an error message if x is
##' not a factor (or ordered) variable.
##'
##' Sometimes people want to
##' reverse some levels, excluding others and leaving them at the end
##' of the list. The "eol" argument sets aside some levels and puts
##' them at the end of the list of levels.
##'
##' The use case for the \code{eol} argument is a factor
##' with several missing value labels, as appears in SPSS. With
##' up to 18 different missing codes, we want to leave them
##' at the end. In the case for which this was designed, the
##' researcher did not want to designate those values as
##' missing before inspecting the pattern of observed values.
##'
##' @param x a factor variable
##' @param eol values to be kept at the end of the list
##' @export
##' @return a new factor variable with reversed values
##' @author Paul Johnson <pauljohn@@ku.edu>
##' @examples
##' ## Consider alphabetication of upper and lower
##' x <- factor(c("a", "b", "c", "C", "a", "c"))
##' levels(x)
##' xr1 <- reverse(x)
##' xr1
##' ## Keep "C" at end of list, after reverse others
##' xr2 <- reverse(x, eol = "C")
##' xr2
##' y <- ordered(x, levels = c("a", "b", "c", "C"))
##' yr1 <- reverse(y)
##' yr1
##' ## Hmm. end of list amounts to being "maximal".
##' ## Unintended side-effect, but interesting.
##' yr2 <- reverse(y, eol = "C")
##' yr2
reverse <- function(x, eol = c("Skip", "DNP")){
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    if (length(eol) > 0){
        for (jj in eol){
            if (length(yyy <- grep(jj, rlevels))){
                rlevels <- c(rlevels[-yyy], jj)
            }
        }
    }
    factor(x, levels = rlevels)
}
If for some reason you don't want to install/update kutils, you can just as well paste that code into your R file and use it as the example demonstrates.
During the spring, users reported that calculations were taking longer. I raised the problem with Wes and he did some diagnosis. It appeared the node BIOS could be adjusted to allow calculations to run faster--nearly two times faster! The CRC administrators understood the issue and they implemented the fixes on May 15, 2017.
Testing on May 16 confirmed that MCMC jobs that were taking 25 hours now take 12 hours.
I had a lot of trouble getting the settings corrected to build Rstan in the cluster. It turns out that the user who builds Rstan needs to have special settings in a hidden file in the user account. I tried that in February and failed for various reasons, but now victory is at hand. This is one example of why we don't suggest individual users try to compile these packages--it is simply too difficult/frustrating.
To use the specially built Rstan, it is necessary to do the 5 step incantation described in the previous post, R Packages available for CRMDA cluster members.
These packages are compiled with GCC-6.3, the latest and greatest, with the C++ optimizer dialed up to "-O3".
In case you need to compile Rstan with GCC-6.3, here is what I have in the ~/.R/Makevars file:
R_XTRA_CPPFLAGS = -I$(R_INCLUDE_DIR) #set_by_rstan
## for OpenMx
CXX1X = g++
CXX1XFLAGS = -g -O2
CXX1XPICFLAGS = -fpic
CXX1XSTD = -std=c++0x
## For Rstan
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-Wno-unused-local-typedefs
CXXFLAGS+=-Wno-ignored-attributes -Wno-deprecated-declarations
The Rstan installation manual suggests two other flags, "-flto -ffat-lto-objects", but these cause a compilation failure. We believe these are not compatible with GCC-6.3.
The other thing worth knowing is that the GCC compiler will demand much more memory than you expect. In February, I was failing over and over because the node was allowing me access to 500MB, but 5GB was necessary. Unfortunately, the error message is completely opaque, suggesting an internal bug in GCC, rather than exhaustion of memory. That was another problem that Wes Mason diagnosed for us.
https://employment.ku.edu/student/8685BR
The last day students can apply is May 23, 2017, and committee members can review candidates by logging into the BrassRing system on or after May 24, 2017.
To use R, here is a set of commands I run to set the environment. This is necessary every time I want to use R with Emacs. Let's call this the magic 5 line stanza, for sake of discussion.
module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
I agree if you say "it is a pain in the rump to have to remember to do that every time I log in." In the old cluster, I was in a position to place those startup commands into all of the CRMDA user environments. That is no longer the case.
I'm checking on ways you can automate this within your own account. Details are posted at the end of this article.
When you want to work with R on the CRC cluster, please consider using the R packages we install within the $WORK folder for CRMDA group members. These packages have some special features and if you try to install them in your user folder (under $HOME/R, as R invites you to do if you run "install.packages()" in a session), then they may not compile correctly.
Recently, we have had runtime errors because the R we are recommending, as described below, is not compatible with packages that users build and install with other versions of R (or the same version of R in a different build environment). In particular, if
then you should delete the packages you have under $HOME/R. I think it is best if you let us try to install what you need, but if you install R packages in your own home folder, please do so only AFTER loading the modules listed below. Please DO NOT load the CRC-provided module "R/3.3". It does not provide the services we need.
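If deleting feels too drastic, one cautious alternative is to rename the folder so R no longer finds it, and delete it later once your jobs run cleanly. This is my own suggestion for illustration, not a CRMDA requirement: set_aside_user_library is a hypothetical helper name, and the dated backup name is just one possible convention.

```shell
#!/bin/bash
# set_aside_user_library is a hypothetical helper, not part of the CRMDA tools.
# It renames a personal R library folder instead of deleting it.
set_aside_user_library() {
    local lib="${1:-$HOME/R}"
    local backup
    backup="${lib}.bak-$(date +%Y%m%d)"
    if [ ! -d "$lib" ]; then
        echo "nothing to do: $lib does not exist"
        return 0
    fi
    # Never clobber an earlier backup made on the same day.
    if [ -e "$backup" ]; then
        echo "refusing to overwrite existing $backup" >&2
        return 1
    fi
    mv "$lib" "$backup"
    echo "moved $lib to $backup"
}

# Usage (uncomment to change your own account):
# set_aside_user_library
```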
The module Rstats/3.3 is built by Wes Mason of ITTC and it is installed into the $WORK folder for CRMDA (hence the module use command above). We work together to make sure the OpenMPI layer is compiled correctly, so it is possible to use Rmpi and the R package parallel. The compiler used is GCC-6.3, which is quite a bit newer than the standard GCC which is provided with the cluster node operating system. This is the principal reason why the CRC-provided "R/3.3" is not acceptable. It does not make sure that the OpenMPI and GCC components are kept in lock-step with R itself. Observe, if we start with an empty session and run
module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
we find that we actually load several modules:
$ module list
Currently Loaded Modules:
1) compiler/gcc/6.3
2) openmpi/2.0
3) java/1.8.0_131
4) xz/5.2.3
5) icu/59.1
6) tcltk/8.6.6
7) Rstats/3.3
The openmpi version must be kept in lock-step with R and the packages we have installed in the past. gcc-6.3 is the compiler version we use for all of the packages. It is necessary to have that new version because of demands by packages like Rstan and OpenMx. The java and tcltk modules are needed by various R packages, such as rJava and tkrplot. The xz module is a decompression suite, needed to interact with source code itself. The Rstats module itself is, for the most part, a "holding company" that keeps all of this together. It simply loads the requirements of gcc, openmpi, java, xz, icu, and tcltk, and then it accesses the R provided by the CRC system maintainers from /panfs/pfs.local/software/install/MRO/3.3. The R packages provided by the base R install are found in the directory /panfs/pfs.local/work/crmda/tools/mro/3.3/site-library.
The R packages in our collection are, in most cases, going to be updates & replacements of those packages because we are building with the different compiler. Users who load Rstats/3.3 should notice that our directory comes into the user path before the system-wide folder. Inside R, we see:
> .libPaths()
[1] "/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library"
[2] "/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library"
In my basic 5 line session starter sequence, I also have modules named legacy and emacs. In my opinion, that is a little dangerous because I'm jumbling together modules from the old and new cluster. That's necessary because I need an IDE with which to interact with R. Emacs was configured with ESS by Wes Mason, and it is in the legacy module set. If you prefer, legacy also provides RStudio version 9.98.978. That is, unfortunately, outdated and unmaintained. I've filed a request with CRC to get a new version of Rstudio.
After the magic 5 line stanza above, within your R session, you have access to these packages (Run "library()" to see all folders in your path, and all packages within):
Packages in library '/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library':
ADGofTest Anderson-Darling GoF test
AER Applied Econometrics with R
Amelia A Program for Missing Data
BH Boost C++ Header Files
BMA Bayesian Model Averaging
BradleyTerry2 Bradley-Terry Models
Cairo R graphics device using cairo graphics library
for creating high-quality bitmap (PNG, JPEG,
TIFF), vector (PDF, SVG, PostScript) and
display (X11 and Win32) output
Cubist Rule- And Instance-Based Regression Modeling
DBI R Database Interface
DCluster Functions for the Detection of Spatial Clusters
of Diseases
DEoptimR Differential Evolution Optimization in Pure R
Devore7 Data sets from Devore's "Prob and Stat for Eng
(7th ed)"
DiagrammeR Create Graph Diagrams and Flowcharts Using R
ENmisc Neuwirth miscellaneous
Ecdat Data Sets for Econometrics
Ecfun Functions for Ecdat
Formula Extended Model Formulas
GPArotation GPA Factor Rotation
HistData Data Sets from the History of Statistics and
Data Visualization
Hmisc Harrell Miscellaneous
HyperbolicDist The hyperbolic distribution
ISwR Introductory Statistics with R
Iso Functions to Perform Isotonic Regression
JGR JGR - Java GUI for R
JM Joint Modeling of Longitudinal and Survival
Data
JMdesign Joint Modeling of Longitudinal and Survival
Data - Power Calculation
JavaGD Java Graphics Device
Kendall Kendall rank correlation and Mann-Kendall trend
test
LearnBayes Functions for Learning Bayesian Inference
MCMCglmm MCMC Generalised Linear Mixed Models
MCMCpack Markov Chain Monte Carlo (MCMC) Package
MCPAN Multiple Comparisons Using Normal Approximation
MEMSS Data sets from Mixed-effects Models in S
MNP R Package for Fitting the Multinomial Probit
Model
MPV Data Sets from Montgomery, Peck and Vining's
Book
MatchIt Nonparametric Preprocessing for Parametric
Casual Inference
Matching Multivariate and Propensity Score Matching with
Balance Optimization
MatrixModels Modelling with Sparse And Dense Matrices
ModelMetrics Rapid Calculation of Model Metrics
MplusAutomation Automating Mplus Model Estimation and
Interpretation
NMF Algorithms and Framework for Nonnegative Matrix
Factorization (NMF)
OpenMx Extended Structural Equation Modelling
PASWR PROBABILITY and STATISTICS WITH R
PBSmapping Mapping Fisheries Data and Spatial Analysis
Tools
PolynomF Polynomials in R
R2HTML HTML Exportation for R Objects
R2OpenBUGS Running OpenBUGS from R
R6 Classes with Reference Semantics
RColorBrewer ColorBrewer Palettes
RCurl General Network (HTTP/FTP/...) Client Interface
for R
RGtk2 R bindings for Gtk 2.8.0 and above
RSvgDevice An R SVG graphics device.
RandomFields Simulation and Analysis of Random Fields
RandomFieldsUtils Utilities for the Simulation and Analysis of
Random Fields
Rcmdr R Commander
RcmdrMisc R Commander Miscellaneous Functions
Rcpp Seamless R and C++ Integration
RcppArmadillo 'Rcpp' Integration for the 'Armadillo'
Templated Linear Algebra Library
RcppEigen 'Rcpp' Integration for the 'Eigen' Templated
Linear Algebra Library
Rd2roxygen Convert Rd to 'Roxygen' Documentation
Rmpi Interface (Wrapper) to MPI (Message-Passing
Interface)
Rook Rook - a web server interface for R
SAScii Import ASCII files directly into R using only a
SAS input script
SASmixed Data sets from "SAS System for Mixed Models"
SemiPar Semiparametic Regression
SoDA Functions and Examples for "Software for Data
Analysis"
SparseM Sparse Linear Algebra
StanHeaders C++ Header Files for Stan
StatDataML Read and Write StatDataML Files
SweaveListingUtils Utilities for Sweave Together with TeX
'listings' Package
TH.data TH's Data Archive
TeachingDemos Demonstrations for Teaching and Learning
UsingR Data Sets, Etc. for the Text "Using R for
Introductory Statistics", Second Edition
VGAM Vector Generalized Linear and Additive Models
VIM Visualization and Imputation of Missing Values
XML Tools for Parsing and Generating XML Within R
and S-Plus
Zelig Everyone's Statistical Software
abind Combine Multidimensional Arrays
acepack ACE and AVAS for Selecting Multiple Regression
Transformations
actuar Actuarial Functions and Heavy Tailed
Distributions
ada The R Package Ada for Stochastic Boosting
ade4 Analysis of Ecological Data : Exploratory and
Euclidean Methods in Environmental Sciences
adehabitat Analysis of Habitat Selection by Animals
akima Interpolation of Irregularly and Regularly
Spaced Data
alr3 Data to accompany Applied Linear Regression 3rd
edition
amap Another Multidimensional Analysis Package
aod Analysis of Overdispersed Data
ape Analyses of Phylogenetics and Evolution
aplpack Another Plot PACKage: stem.leaf, bagplot,
faces, spin3R, plotsummary, plothulls, and some
slider functions
arm Data Analysis Using Regression and
Multilevel/Hierarchical Models
arules Mining Association Rules and Frequent Itemsets
assertthat Easy Pre and Post Assertions
backports Reimplementations of Functions Introduced Since
R-3.0.0
base64enc Tools for base64 encoding
bayesm Bayesian Inference for
Marketing/Micro-Econometrics
bcp Bayesian Analysis of Change Point Problems
bdsmatrix Routines for Block Diagonal Symmetric matrices
bestglm Best Subset GLM
betareg Beta Regression
biglm bounded memory linear and generalized linear
models
bit A class for vectors of 1-bit booleans
bit64 A S3 Class for Vectors of 64bit Integers
bitops Bitwise Operations
bnlearn Bayesian Network Structure Learning, Parameter
Learning and Inference
brew Templating Framework for Report Generation
brglm Bias reduction in binomial-response generalized
linear models.
broom Convert Statistical Analysis Objects into Tidy
Data Frames
caTools Tools: moving window statistics, GIF, Base64,
ROC AUC, etc.
cairoDevice Embeddable Cairo Graphics Device Driver
car Companion to Applied Regression
caret Classification and Regression Training
cellranger Translate Spreadsheet Cell Ranges to Rows and
Columns
censReg Censored Regression (Tobit) Models
checkmate Fast and Versatile Argument Checks
chron Chronological Objects which can Handle Dates
and Times
clue Cluster Ensembles
clv Cluster Validation Techniques
cocorresp Co-Correspondence Analysis Methods
coda Output Analysis and Diagnostics for MCMC
coin Conditional Inference Procedures in a
Permutation Test Framework
colorspace Color Space Manipulation
combinat combinatorics utilities
commonmark High Performance CommonMark and Github Markdown
Rendering in R
copula Multivariate Dependence with Copulas
corpcor Efficient Estimation of Covariance and
(Partial) Correlation
crayon Colored Terminal Output
cslogistic Conditionally Specified Logistic Regression
cubature Adaptive Multivariate Integration over
Hypercubes
data.table Extension of `data.frame`
deldir Delaunay Triangulation and Dirichlet (Voronoi)
Tessellation
desc Manipulate DESCRIPTION Files
descr Descriptive Statistics
dichromat Color Schemes for Dichromats
digest Create Compact Hash Digests of R Objects
diptest Hartigan's Dip Test Statistic for Unimodality -
Corrected
distr Object Oriented Implementation of Distributions
dlm Bayesian and Likelihood Analysis of Dynamic
Linear Models
doBy Groupwise Statistics, LSmeans, Linear
Contrasts, Utilities
doMC Foreach Parallel Adaptor for 'parallel'
doMPI Foreach parallel adaptor for the Rmpi package
doSNOW Foreach Parallel Adaptor for the 'snow' Package
dplyr A Grammar of Data Manipulation
dse Dynamic Systems Estimation (Time Series
Package)
e1071 Misc Functions of the Department of Statistics,
Probability Theory Group (Formerly: E1071), TU
Wien
earth Multivariate Adaptive Regression Splines
ecodist Dissimilarity-based functions for ecological
analysis
effects Effect Displays for Linear, Generalized Linear,
and Other Models
eha Event History Analysis
eiPack eiPack: Ecological Inference and
Higher-Dimension Data Management
emplik Empirical Likelihood Ratio for
Censored/Truncated Data
evaluate Parsing and Evaluation Tools that Provide More
Details than the Default
expint Exponential Integral and Incomplete Gamma
Function
expm Matrix Exponential, Log, 'etc'
faraway Functions and Datasets for Books by Julian
Faraway
fastICA FastICA Algorithms to perform ICA and
Projection Pursuit
fastmatch Fast match() function
fda Functional Data Analysis
ffmanova Fifty-fifty MANOVA
fields Tools for Spatial Data
flexmix Flexible Mixture Modeling
forcats Tools for Working with Categorical Variables
(Factors)
formatR Format R Code Automatically
forward Forward search
gam Generalized Additive Models
gamlss Generalised Additive Models for Location Scale
and Shape
gamlss.data GAMLSS Data
gamlss.dist Distributions to be Used for GAMLSS Modelling
gamm4 Generalized Additive Mixed Models using 'mgcv'
and 'lme4'
gbm Generalized Boosted Regression Models
gclus Clustering Graphics
gdata Various R Programming Tools for Data
Manipulation
gee Generalized Estimation Equation Solver
geepack Generalized Estimating Equation Package
geoR Analysis of Geostatistical Data
geoRglm A Package for Generalised Linear Spatial Models
ggm Functions for graphical Markov models
ggplot2 Create Elegant Data Visualisations Using the
Grammar of Graphics
glmc Fitting Generalized Linear Models Subject to
Constraints
glmmBUGS Generalised Linear Mixed Models with BUGS and
JAGS
glmmML Generalized Linear Models with Clustering
glmnet Lasso and Elastic-Net Regularized Generalized
Linear Models
glmpath L1 Regularization Path for Generalized Linear
Models and Cox Proportional Hazards Model
gmodels Various R Programming Tools for Model Fitting
gmp Multiple Precision Arithmetic
gpclib General Polygon Clipping Library for R
gridBase Integration of base and grid graphics
gridExtra Miscellaneous Functions for "Grid" Graphics
grpreg Regularization Paths for Regression Models with
Grouped Covariates
gsl Wrapper for the Gnu Scientific Library
gsubfn Utilities for strings and function arguments.
gtable Arrange 'Grobs' in Tables
gtools Various R Programming Tools
haven Import and Export 'SPSS', 'Stata' and 'SAS'
Files
hexbin Hexagonal Binning Routines
highr Syntax Highlighting for R Source Code
hms Pretty Time of Day
htmlTable Advanced Tables for Markdown/HTML
htmltools Tools for HTML
htmlwidgets HTML Widgets for R
httpuv HTTP and WebSocket Server Library
httr Tools for Working with URLs and HTTP
igraph Network Analysis and Visualization
ineq Measuring Inequality, Concentration, and
Poverty
influence.ME Tools for Detecting Influential Data in Mixed
Effects Models
influenceR Software Tools to Quantify Structural
Importance of Nodes in a Network
inline Functions to Inline C, C++, Fortran Function
Calls from R
iplots iPlots - interactive graphics for R
irlba Fast Truncated SVD, PCA and Symmetric
Eigendecomposition for Large Dense and Sparse
Matrices
itertools Iterator Tools
jpeg Read and write JPEG images
kernlab Kernel-Based Machine Learning Lab
knitr A General-Purpose Package for Dynamic Report
Generation in R
kutils Project Management Tools
labeling Axis Labeling
laeken Estimation of indicators on social exclusion
and poverty
languageR Data sets and functions with "Analyzing
Linguistic Data: A practical introduction to
statistics".
lars Least Angle Regression, Lasso and Forward
Stagewise
latticeExtra Extra Graphical Utilities Based on Lattice
lava Latent Variable Models
lavaan Latent Variable Analysis
lavaan.survey Complex Survey Structural Equation Modeling
(SEM)
lazyeval Lazy (Non-Standard) Evaluation
leaps Regression Subset Selection
lme4 Linear Mixed-Effects Models using 'Eigen' and
S4
lmeSplines Add smoothing spline modelling capability to
nlme.
lmec Linear Mixed-Effects Models with Censored
Responses
lmerTest Tests in Linear Mixed Effects Models
lmm Linear Mixed Models
lmtest Testing Linear Regression Models
locfit Local Regression, Likelihood and Density
Estimation.
logspline Logspline Density Estimation Routines
longitudinal Analysis of Multiple Time Course Data
longitudinalData Longitudinal Data
lpSolve Interface to 'Lp_solve' v. 5.5 to Solve
Linear/Integer Programs
ltm Latent Trait Models under IRT
lubridate Make Dealing with Dates a Little Easier
magic create and investigate magic squares
magrittr A Forward-Pipe Operator for R
manipulate Interactive Plots for RStudio
maps Draw Geographical Maps
maptools Tools for Reading and Handling Spatial Objects
markdown 'Markdown' Rendering for R
matrixcalc Collection of functions for matrix calculations
maxLik Maximum Likelihood Estimation and Related Tools
mboost Model-Based Boosting
mcgibbsit Warnes and Raftery's MCGibbsit MCMC diagnostic
mclust Gaussian Mixture Modelling for Model-Based
Clustering, Classification, and Density
Estimation
mcmc Markov Chain Monte Carlo
mda Mixture and Flexible Discriminant Analysis
mediation Causal Mediation Analysis
memisc Tools for Management of Survey Data and the
Presentation of Analysis Results
memoise Memoisation of Functions
mi Missing Data Imputation and Model Checking
micEcon Microeconomic Analysis and Modelling
mice Multivariate Imputation by Chained Equations
microbenchmark Accurate Timing Functions
mime Map Filenames to MIME Types
minqa Derivative-free optimization algorithms by
quadratic approximation
misc3d Miscellaneous 3D Plots
miscTools Miscellaneous Tools and Utilities
mitools Tools for multiple imputation of missing data
mix Estimation/Multiple Imputation for Mixed
Categorical and Continuous Data
mixtools Tools for Analyzing Finite Mixture Models
mlbench Machine Learning Benchmark Problems
mnormt The Multivariate Normal and t Distributions
modelr Modelling Functions that Work with the Pipe
modeltools Tools and Classes for Statistical Models
msm Multi-State Markov and Hidden Markov Models in
Continuous Time
multcomp Simultaneous Inference in General Parametric
Models
munsell Utilities for Using Munsell Colours
mvProbit Multivariate Probit Models
mvbutils Workspace organization, code and documentation
editing, package prep and editing, etc.
mvtnorm Multivariate Normal and t Distributions
neighbr Classification, Regression, Clustering with K
Nearest Neighbors
network Classes for Relational Data
nloptr R interface to NLopt
nnls The Lawson-Hanson algorithm for non-negative
least squares (NNLS)
nor1mix Normal (1-d) Mixture Models (S3 Classes and
Methods)
norm Analysis of multivariate normal datasets with
missing values
nortest Tests for Normality
np Nonparametric kernel smoothing methods for
mixed data types
numDeriv Accurate Numerical Derivatives
nws R functions for NetWorkSpaces and Sleigh
openssl Toolkit for Encryption, Signatures and
Certificates Based on OpenSSL
openxlsx Read, Write and Edit XLSX Files
ordinal Regression Models for Ordinal Data
orthopolynom Collection of functions for orthogonal and
orthonormal polynomials
pan Multiple Imputation for Multivariate Panel or
Clustered Data
pander An R Pandoc Writer
partDSA Partitioning Using Deletion, Substitution, and
Addition Moves
party A Laboratory for Recursive Partytioning
pbivnorm Vectorized Bivariate Normal CDF
pbkrtest Parametric Bootstrap and Kenward Roger Based
Methods for Mixed Model Comparison
pcaPP Robust PCA by Projection Pursuit
permute Functions for Generating Restricted
Permutations of Data
pixmap Bitmap Images (``Pixel Maps'')
pkgKitten Create Simple Packages Which Do not Upset R
Package Checks
pkgmaker Package development utilities
plm Linear Models for Panel Data
plotmo Plot a Model's Response and Residuals
plotrix Various Plotting Functions
pls Partial Least Squares and Principal Component
Regression
plyr Tools for Splitting, Applying and Combining
Data
pmml Generate PMML for Various Models
pmmlTransformations Transforms Input Data from a PMML Perspective
polspline Polynomial Spline Routines
polycor Polychoric and Polyserial Correlations
polynom A Collection of Functions to Implement a Class
for Univariate Polynomial Manipulations
portableParallelSeeds Allow Replication of Simulations on Parallel
and Serial Computers
ppcor Partial and Semi-Partial (Part) Correlation
praise Praise Users
profileModel Tools for profiling inference functions for
various model classes
proto Prototype Object-Based Programming
proxy Distance and Similarity Measures
pscl Political Science Computational Laboratory,
Stanford University
psidR Build Panel Data Sets from PSID Raw Data
pspline Penalized Smoothing Splines
psych Procedures for Psychological, Psychometric, and
Personality Research
purrr Functional Programming Tools
quadprog Functions to solve Quadratic Programming
Problems.
quantreg Quantile Regression
rJava Low-Level R to Java Interface
randomForest Breiman and Cutler's Random Forests for
Classification and Regression
randomForestSRC Random Forests for Survival, Regression and
Classification (RF-SRC)
rattle Graphical User Interface for Data Mining in R
rbenchmark Benchmarking routine for R
rbugs Fusing R and OpenBugs and Beyond
rda Shrunken Centroids Regularized Discriminant
Analysis
readr Read Rectangular Text Data
readxl Read Excel Files
registry Infrastructure for R Package Registries
relimp Relative Contribution of Effects in a
Regression Model
rematch Match Regular Expressions with a Nicer 'API'
reshape Flexibly Reshape Data
reshape2 Flexibly Reshape Data: A Reboot of the Reshape
Package
rgenoud R Version of GENetic Optimization Using
Derivatives
rgexf Build, Import and Export GEXF Graph Files
rgl 3D Visualization Using OpenGL
rlecuyer R Interface to RNG with Multiple Streams
rmarkdown Dynamic Documents for R
rms Regression Modeling Strategies
rngtools Utility functions for working with Random
Number Generators
robustbase Basic Robust Statistics
rockchalk Regression Estimation and Presentation
roxygen2 In-Line Documentation for R
rpart.plot Plot 'rpart' Models: An Enhanced Version of
'plot.rpart'
rpf Response Probability Functions
rprojroot Finding Files in Project Subdirectories
rrcov Scalable Robust Estimators with High Breakdown
Point
rstan R Interface to Stan
rstudio Tools and Utilities for RStudio
rstudioapi Safely Access the RStudio API
rvest Easily Harvest (Scrape) Web Pages
sandwich Robust Covariance Matrix Estimators
scales Scale Functions for Visualization
scatterplot3d 3D Scatter Plot
segmented Regression Models with Breakpoints/Changepoints
Estimation
selectr Translate CSS Selectors to XPath Expressions
sem Structural Equation Models
semTools Useful Tools for Structural Equation Modeling
setRNG Set (Normal) Random Number Generator and Seed
sets Sets, Generalized Sets, Customizable Sets and
Intervals
sfsmisc Utilities from "Seminar fuer Statistik" ETH
Zurich
shapefiles Read and Write ESRI Shapefiles
shiny Web Application Framework for R
simsem SIMulated Structural Equation Modeling
sm Smoothing methods for nonparametric regression
and density estimation
smoothSurv Survival Regression with Smoothed Error
Distribution
sna Tools for Social Network Analysis
snow Simple Network of Workstations
snowFT Fault Tolerant Simple Network of Workstations
sourcetools Tools for Reading, Tokenizing and Parsing R
Code
sp Classes and Methods for Spatial Data
spam SPArse Matrix
spatialCovariance Computation of Spatial Covariance Matrices for
Data on Rectangles
spatialkernel Nonparameteric estimation of spatial
segregation in a multivariate point process
spdep Spatial Dependence: Weighting Schemes,
Statistics and Models
splancs Spatial and Space-Time Point Pattern Analysis
stabledist Stable Distribution Functions
stabs Stability Selection with Error Control
startupmsg Utilities for Start-Up Messages
statmod Statistical Modeling
statnet.common Common R Scripts and Utilities Used by the
Statnet Project Software
stepwise Stepwise detection of recombination breakpoints
stringi Character String Processing Facilities
stringr Simple, Consistent Wrappers for Common String
Operations
strucchange Testing, Monitoring, and Dating Structural
Changes
subselect Selecting Variable Subsets
survey Analysis of Complex Survey Samples
survival Survival Analysis
systemfit Estimating Systems of Simultaneous Equations
tables Formula-Driven Table Generation
tcltk2 Tcl/Tk Additions
tensorA Advanced tensors arithmetic with named indices
testthat Unit Testing for R
texreg Conversion of R Regression Output to LaTeX or
HTML Tables
tfplot Time Frame User Utilities
tframe Time Frame Coding Kernel
tibble Simple Data Frames
tidyr Easily Tidy Data with 'spread()' and 'gather()'
Functions
tidyverse Easily Install and Load 'Tidyverse' Packages
timeDate Rmetrics - Chronological and Calendar Objects
tis Time Indexes and Time Indexed Series
tkrplot TK Rplot
tree Classification and Regression Trees
triangle Provides the Standard Distribution Functions
for the Triangle Distribution
trimcluster Cluster analysis with trimming
trust Trust Region Optimization
ucminf General-Purpose Unconstrained Non-Linear
Optimization
urca Unit Root and Cointegration Tests for Time
Series Data
vcd Visualizing Categorical Data
vegan Community Ecology Package
viridis Default Color Maps from 'matplotlib'
viridisLite Default Color Maps from 'matplotlib' (Lite
Version)
visNetwork Network Visualization using 'vis.js' Library
waveslim Basic wavelet routines for one-, two- and
three-dimensional signal processing
wnominate Roll Call Analysis Software
xgboost Extreme Gradient Boosting
xml2 Parse XML
xtable Export Tables to LaTeX or HTML
xts eXtensible Time Series
yaml Methods to Convert R Data to YAML and Back
zipfR Statistical models for word frequency
distributions
zoo S3 Infrastructure for Regular and Irregular
Time Series (Z's Ordered Observations)
Packages in library '/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library':
KernSmooth Functions for Kernel Smoothing Supporting Wand
& Jones (1995)
MASS Support Functions and Datasets for Venables and
Ripley's MASS
Matrix Sparse and Dense Matrix Classes and Methods
MicrosoftR Microsoft R umbrella package
R6 Classes with Reference Semantics
RUnit R Unit test framework
RevoIOQ Microsoft R Services Test Suite
RevoMods R Functions Modified For Revolution R
RevoUtils Microsoft R Utility Package
RevoUtilsMath Microsoft R Services Math Utilities Package
base The R Base Package
boot Bootstrap Functions (Originally by Angelo Canty
for S)
checkpoint Install Packages from Snapshots on the
Checkpoint Server for Reproducibility
class Functions for Classification
cluster "Finding Groups in Data": Cluster Analysis
Extended Rousseeuw et al.
codetools Code Analysis Tools for R
compiler The R Compiler Package
curl A Modern and Flexible Web Client for R
datasets The R Datasets Package
deployrRserve Binary R server
doParallel Foreach Parallel Adaptor for the 'parallel'
Package
foreach Provides Foreach Looping Construct for R
foreign Read Data Stored by Minitab, S, SAS, SPSS,
Stata, Systat, Weka, dBase, ...
grDevices The R Graphics Devices and Support for Colours
and Fonts
graphics The R Graphics Package
grid The Grid Graphics Package
iterators Provides Iterator Construct for R
jsonlite A Robust, High Performance JSON Parser and
Generator for R
lattice Trellis Graphics for R
methods Formal Methods and Classes
mgcv Mixed GAM Computation Vehicle with GCV/AIC/REML
Smoothness Estimation
nlme Linear and Nonlinear Mixed Effects Models
nnet Feed-Forward Neural Networks and Multinomial
Log-Linear Models
parallel Support for Parallel computation in R
png Read and write PNG images
rpart Recursive Partitioning and Regression Trees
spatial Functions for Kriging and Point Pattern
Analysis
splines Regression Spline Functions and Classes
stats The R Stats Package
stats4 Statistical Functions using S4 Classes
survival Survival Analysis
tcltk Tcl/Tk Interface
tools Tools for Package Development
utils The R Utils Package
As usual, if these don't work right, it's something I got wrong and will fix. Email me.
As of 2017-04-25, we have solved the problems of compiling Java and tk-based R packages. In other words, we find ourselves roughly back in the place where we were in October, 2016, or perhaps a little bit ahead of that. Now that the gcc issues have been addressed, we are able to stay up to date with changes in the cutting edge packages like Rcpp, Rstan and OpenMx.
If you need other packages, email me and I'll install them.
If you launch R and you don't find packages (in the output of library(), for example), it probably means you forgot the module magic.
If you are having trouble with Rstan, the likely sources of trouble are 1) errors in your ~/.R/Makevars file, or 2) old packages in your home folder ~/R/ that do not cooperate with the new R and the other packages we make available.
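Here is a minimal sketch of one way to spot the second problem. The function name shadowed_packages is made up for illustration, and the two folder arguments are assumptions: the first is wherever your personal library lives under ~/R/, the second is the site library shown in the library() output. It prints the names of packages that exist in both places, which are the ones most likely to be stale copies.

```shell
# Hypothetical helper: list packages found in BOTH a user library and a site library.
# Usage: shadowed_packages <user_lib_dir> <site_lib_dir>
shadowed_packages() (
    user_lib=$1
    site_lib=$2
    for p in "$user_lib"/*/; do
        # An unmatched glob stays literal, so skip anything that is not a directory
        [ -d "$p" ] || continue
        name=$(basename "$p")
        if [ -d "$site_lib/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
)
```

For example, if your personal library is in R's usual default location, you might run: shadowed_packages ~/R/x86_64-pc-linux-gnu-library/3.3 /panfs/pfs.local/work/crmda/tools/mro/3.3/site-library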
I have one more lesson. Instead of re-typing that stanza whenever it is needed, put those lines in a file. I just tested this. I put the module stanza in a file named rstats.sh, saved it in $HOME/bin, and made it executable ("chmod +x rstats.sh"). After that, it seems to succeed to run
source rstats.sh
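A sketch of what that setup might look like follows. The function name write_rstats_sh is invented for this example, and the stanza inside it is an assumption: it uses the crmda module lines (module purge, module use, module load Rstats/3.3), so substitute whatever stanza you actually use.

```shell
# Hypothetical helper: write the module stanza into <dir>/rstats.sh.
# Usage: write_rstats_sh [dir]     (dir defaults to $HOME/bin)
write_rstats_sh() (
    dir=${1:-$HOME/bin}
    mkdir -p "$dir" || exit 1
    # Quoted 'EOF' keeps the text literal; nothing is expanded while writing
    cat > "$dir/rstats.sh" <<'EOF'
# Meant to be sourced, not executed:  source rstats.sh
module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
EOF
    chmod +x "$dir/rstats.sh"
)
```

The file has to be run with "source" because the module commands change the environment of the shell that runs them. Executing the file as an ordinary script would change only a child shell, and the changes would vanish when it exits.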
If we don't build packages for you, you have to build your own. Here is a lesson from the school of hard knocks. In the new CRC cluster, the memory limits of your sessions are strictly enforced. The compiler will often use more than 2GB memory. As a result, when you try to build a package inside R with "install.packages", you may get a vague message of failure. To protect yourself against that, it is wise to ask for an interactive session with more memory. I do this, for example:
$ msub -I -X -l nodes=1:ppn=1,pmem=6144m
That is sufficient to compile Rstan, which is the most intensive package I have tried to build.
In the cluster, however, we have access to much newer Intel MKL compiler and math libraries, so the R program, and the things on which it relies, can be built with the Intel compiler. It appears as though we can stay up to date with the troublesome R modules like Rstan, Rcpp, RcppArmadillo.
Wes Mason of ITTC worked this out for us. The scheme we are testing now can be accessed as follows.
For people in the crmda user group, try this interactively
$ module purge
$ module use /panfs/pfs.local/work/crmda/tools/modules
$ module load Rstats/3.3
After that, observe
$ R
> library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev:
5fa1e80eb817)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
We are still in a testing phase on this setup, and surely there will be problems. I do not yet understand what is necessary to compile new R packages with this setup. We don't want packages built with gcc if we can avoid it, because there is always a danger of incompatibility when shared libraries are built with different compilers.
But the key message is still encouraging. Even though the OS does not have the needed parts, there is a workaround.
Why is this "Revolution R"? The company Revolution R, which was later purchased by Microsoft, popularized the use of the Intel MKL on Ubuntu Linux. A version of R built with Intel's compiler was used, with permission, on Ubuntu in 2012. The version of R we are using now goes by the moniker "MRO". Can you guess what the M and the R stand for?
Making sure all fonts are embedded appears to be not so easy across platforms. When I compile the ku thesis document, I notice the Wingding and symbols are not embedded.
However, this is not a flaw in pdflatex as it currently exists. It was a pdflatex flaw in the past. So far as I can tell, all fonts needed in the pdflatex run are embedded if you use a LaTeX distribution that is reasonably modern.
The major problem arises when a document includes other PDF documents, using \includegraphics{} for example. If those included documents are lacking in embedded fonts, then pdflatex does not fix that.
In my example document, before 20160503, the fonts were missing because they were not embedded in the R plots that are included in the example chapters. I had to go back and re-run the R code to make sure the fonts are embedded in the pdf files for the graphs. After that, the pdflatex output of the thesis template is fine.
You can check for yourself. Run
$ pdffonts thesis-ku.pdf
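If the font table is long, a small filter can pick out just the offenders. This is a sketch, not a polished tool. The function name list_unembedded_fonts is made up, and it assumes the usual pdffonts layout: two header lines, then one row per font in which the last five fields are emb, sub, uni, and the two-part object ID. Font names that contain spaces would be cut short.

```shell
# Read `pdffonts` output on stdin and print the names of fonts whose "emb"
# column says "no". Fields are counted from the END of each row because the
# "type" column may itself contain spaces (for example "Type 1").
list_unembedded_fonts() {
    awk 'NR > 2 && NF >= 7 && $(NF-4) == "no" { print $1 }'
}

# Usage:
#   pdffonts thesis-ku.pdf | list_unembedded_fonts
```

No output means every font listed by pdffonts is marked as embedded.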
If we don't fix the R output files before compiling the thesis itself, we are in a somewhat dangerous situation. People suggest using various magic wands to add fonts, but all of them seem to have major flaws. They either corrupt the quality of the output or destroy its internal structure.
I found ways to embed fonts using ghostscript. This converts the document to PostScript and then back to PDF.
$ pdf2ps thesis-ku.pdf test.ps
$ ps2pdf14 -dPDFSETTINGS=/prepress -dEmbedAllFonts=true test.ps test.pdf
The bad news: (1) it destroys internal hyperlinks, and (2) it DOES NOT embed fonts needed for material in embedded graphs (things inserted by \includegraphics, such as PDF produced by R).
See:
http://askubuntu.com/questions/50274/fonts-are-not-embedded-into-a-pdf
In my opinion, this is a bad outcome that should not happen. But it does.
As a result, it seems necessary to fix the individual PDF graphics files before compiling the larger thesis document.
This reminds me that at one point I had a post-processing script written for R Sweave sessions that would embed fonts in all pdf output files.
The shell script would cycle through all of the R output and embed fonts. Enjoy!
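# Re-write each PDF in the current directory with its fonts embedded.
for i in *.pdf; do
    base=$(basename "$i" .pdf)
    # Temporary output file, written next to the original
    basenew="${base}-newtemp.pdf"
    ## echo "$i base: $base new: $basenew"
    # Replace the original only if ghostscript succeeded
    /usr/bin/gs -o "$basenew" -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite "$i" &&
        mv -f "$basenew" "$i"
done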
The same can be achieved inside R. Each time a PDF is created, embed the fonts with the embedFonts() function. See ?embedFonts.
When you log in on hpc.crc.ku.edu, a system status message appears. One report is the disk usage. Here's what I see today:
Primary group: hpc_crmda
Default Queue: crmda
$HOME = /home/pauljohn
<GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
65.04 85.00 100.00 : 136150 85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)
$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem Size Used Avail Use% Mounted on
panfs://pfs.local/work
14T 1.6T 13T 12% /panfs/pfs.local/work/crmda/pauljohn
$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem Size Used Avail Use% Mounted on
panfs://pfs.local/scratch
55T 37T 19T 67% /panfs/pfs.local/scratch/crmda/pauljohn
In case you want to see the same output, the new cluster has a command called "mystats" which will display it again. In the terminal, run
mystats
In the output about my home folder, there is a "hard limit" at 100GB, as you can see. That is not adjustable in the current regime.
The main concern today is that I'm over the limit on the number of files. The limit is now 100,000 files but I have 136150. If I'm over the limit, I am not allowed to create new files. If I remain over the limit, the system can prevent me from doing my job.
Wait a minute. 136,150 files? WTH? Last time I checked, there were only 135,998 files and I'm sure I did not add any. Did some make babies? Do you suppose some R files found some C++ files and made an Rcpp project? (That's programmer humor. It knocks them out at conferences.)
I probably have files I don't need any more. I'm pretty sure that, for example, when I compile R, it uses tens of thousands of files. Maybe I can move that work somewhere else.
I wondered how I could find out where I have all those files. We asked and the best suggestion so far is to run the following, which sifts through all directories and counts the files.
for i in $(find . -maxdepth 1 -type d); do echo "$i"; find "$i" -type f | wc -l; done
The return shows directory names and file counts, like this:
./tmp
17365
./work
46
./.emacs.d
0
./src
25519
./texmf
1794
./packages
5041
./SVN
4321
./Software
12014
./.ccache
995
./TMPRlib-3.3
19316
I'll have to sift through that. Clearly, there are some files I can live without. I've got about 20K files in TMPRlib-3.3, which is a building spot for R packages before I put them in the generally accessible part of the system. .ccache is the compiler cache, so I can delete those files. They just get regenerated and saved to speed up C compiler jobs, but I have to make a choice there.
So far, I've obliterated the temporary build information, but I remain over the quota. I'll show the output from "mystats" so that you can see the difference:
$ mystats
Primary group: hpc_crmda
Default Queue: crmda
$HOME = /home/pauljohn
<GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
63.26 85.00 100.00 : 113510 85000 100000 : /home/pauljohn uid:xxxxx(pauljohn)
$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem Size Used Avail Use% Mounted on
panfs://pfs.local/work
14T 1.6T 13T 12% /panfs/pfs.local/work/crmda/pauljohn
$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem Size Used Avail Use% Mounted on
panfs://pfs.local/scratch
55T 37T 19T 67% /panfs/pfs.local/scratch/crmda/pauljohn
Oh, well, I'll have to cut/move more things.
The CRC put in place a hard, unchangeable 100GB limit on user home directories.
There is a limit of 100,000 on the number of files that can be stored within that. Users will need to cut files to be under the limit.
One can use the find command in the shell to find out where the files are.
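Here is a sketch of a slightly more convenient version of that find loop. The function name count_files_by_dir is invented for this example. It prints one line per top-level folder (hidden ones included), with the file count first, sorted so the biggest offenders come out on top. It skips symbolic links, and it would miscount files whose names contain newlines.

```shell
# Hypothetical helper: count regular files under each immediate subdirectory.
# Usage: count_files_by_dir [top_dir]      (defaults to the current directory)
# Output: "<count><TAB><directory>", largest count first.
count_files_by_dir() (
    top=${1:-.}
    for d in "$top"/*/ "$top"/.[!.]*/; do
        d=${d%/}
        # Skip unmatched glob patterns and symbolic links
        [ -d "$d" ] && [ ! -L "$d" ] || continue
        n=$(find "$d" -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$d"
    done | sort -rn
)

# Usage:
#   cd "$HOME" && count_files_by_dir . | head
```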
How to avoid the accidental buildup of files? The main issue is that compiling software (R packages) creates intermediate object files that are not needed once the work is done. It is difficult to police these files (at least it is for me).
I don't have time to write all this down now, but here is a hint. The question is where to store "temporary" files that are needed to compile software or run a program but are not needed after that. In many programming chores, one can link the "build" folder to a faster, temporary storage device that is not in the network file system. In the past, I've usually used "/tmp/a_folder_i_create" because that is on the disk "in" the compute node. Disk access on the local disk is much faster than the network file system. Lately, I'm told it is even faster to put temporary material in "/dev/shm", but I do not have much experience with that. With a little clever planning, one can write the temporary files to a much faster memory disk that is easily disposed of and, so far as I can see today, does not count within the file quota. This is not to be taken lightly. I've compared the time required to compile R using the network file storage against the local temporary storage. The difference is 45 minutes versus 15 minutes.
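To make the hint concrete, here is a minimal sketch. The function name make_build_dir and the "build.XXXXXX" folder name are invented for illustration. It assumes /dev/shm is present and writable on the node you land on, and falls back to $TMPDIR or /tmp otherwise. Keep in mind that on most Linux systems /dev/shm is backed by RAM, so what you put there may count against the memory of your session.

```shell
# Hypothetical helper: create a private scratch build folder and print its path.
# Prefers /dev/shm (memory-backed) and falls back to $TMPDIR or /tmp.
make_build_dir() (
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then
        base=/dev/shm
    else
        base=${TMPDIR:-/tmp}
    fi
    mktemp -d "$base/build.XXXXXX"
)

# Usage (inside an interactive session on a compute node):
#   build=$(make_build_dir)
#   trap 'rm -rf "$build"' EXIT      # dispose of the scratch files on exit
#   cd "$build"                      # then unpack and compile here
```

The trap is the part that keeps intermediate object files from piling up: the scratch folder is removed when the shell exits, whether or not the build worked.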
]]>At the user meeting on April 12, we found out that requesting 1 core will automatically provide only 500MB of memory. This is a BIG change, because in the older cluster we received 2GB per core and that was generally sufficient. That is to say, we almost never specified memory.
The default memory for an interactive session is not likely to be sufficient, so users will need to specify memory.
As a result, the command to ask for 1 node with 1 processor (core) on that node would be
msub -X -I -l nodes=1:ppn=1,pmem=2048m
This asks for an interactive session (-I) with X11 graphics forwarding (-X). The memory can also be specified as "2gb".
If you only want 1 core on 1 node, the simpler notation would be to use the flag "procs".
msub -X -I -l procs=1,pmem=2048m
To ask for several cores on 1 node (for example, to test a multicore project), run
msub -X -I -l nodes=1:ppn=5,pmem=2048m
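To see what these requests add up to, here is a small arithmetic sketch. It assumes pmem is a per-core amount, which matches how the 500MB default was described to us, so the total for a job is nodes times ppn times pmem. The function name total_mem_mb is mine, purely for illustration.

```shell
#!/bin/bash
# Total memory (in MB) implied by a request like nodes=N:ppn=P,pmem=Mm.
# Assumption: pmem is a per-core amount, so the job total is N * P * M.
total_mem_mb() {
    local nodes="$1" ppn="$2" pmem_mb="$3"
    echo $(( nodes * ppn * pmem_mb ))
}

total_mem_mb 1 5 2048   # the ppn=5 request above: prints 10240
total_mem_mb 1 1 500    # one core at the 500MB default: prints 500
```

Under that assumption, the 5-core request asks for about 10GB on the node, while a default 1-core session gets 500MB.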
** Specify a queue **
Interactive jobs can be run on any queue. By default, they go to the user's nodes.
The default queue is displayed with 'mystats'. If you wish to run on a node that is not in your owner group, such as a GPGPU node, you will need to specify the sixhour queue and the node name. You will have a maximum of 6 hours on that node. There is no time limit on your default queue.
msub -X -I -l nodes=1:ppn=5,pmem=2048m -q sixhour
One can specify a particular node, "g0001", with a request like:
msub -X -I -lnodes=g001:ppn=1 -q sixhour
CRC made a page regarding queues, originally at http://crc.ku.edu/using-hpc#Submitting, and has relocated it to http://crc.ku.edu/queues
Update 20170413
We requested a simpler way to launch the usual type of interactive session--one node, one core--as we had in the old cluster. The administrators created a script "qxlogin" which the user can run from the login node.
$ qxlogin
qsub: waiting for job 40565091.sched to start
qsub: job 40565091.sched ready
We suggest caution with this, since the new memory default limit is 500MB and CRMDA users have regularly reported frustration with unanticipated job failures.
In case you want to write your own login script, you can take an example from the new qxlogin, which I found is installed in /usr/local/bin on the new cluster.
$ cat /usr/local/bin/qxlogin
#!/bin/sh
ARGS=$@
/opt/moab/bin/msub -X -I -lnodes=1:ppn=1 $ARGS
If you want more interactive nodes, or more cores per node (ppn), just change the 1's. To test that, save your edited copy as "qxlogin2", then run
$ sh qxlogin2
If you enjoy the result, save that file in your $HOME/bin directory and make it executable, and then it will be more generally available within your sessions. After that, there is no need to run "sh" before "qxlogin2". Try it out, and let me know if there is trouble.
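The steps just described might look like the sketch below. The ppn=2 and pmem=2048m values are only illustrative choices of mine, not CRC recommendations, and BIN_DIR is a convenience variable I made up that defaults to $HOME/bin. Unlike the installed qxlogin, this version quotes "$@" so that extra arguments containing spaces are passed to msub intact.

```shell
#!/bin/bash
# Install a personal copy of the login script as "qxlogin2".
# Assumptions: BIN_DIR defaults to $HOME/bin; the ppn=2 and pmem=2048m
# values are illustrative, so edit them to suit your work.
BIN_DIR="${BIN_DIR:-$HOME/bin}"
mkdir -p "$BIN_DIR"

# The quoted 'EOF' keeps $@ literal, so it is expanded when qxlogin2 runs
cat > "$BIN_DIR/qxlogin2" <<'EOF'
#!/bin/sh
# Extra arguments (for example -q sixhour) are passed through to msub
/opt/moab/bin/msub -X -I -l nodes=1:ppn=2,pmem=2048m "$@"
EOF

chmod +x "$BIN_DIR/qxlogin2"
```

If "qxlogin2" is not found afterward, $HOME/bin is probably not in your PATH. Adding export PATH="$HOME/bin:$PATH" to ~/.bashrc takes care of that.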
]]>