GRA positions open beginning Summer 2018

We have 2 GRA positions for 2018-2019 available now. Please consider applying:

Posted in Data Analysis

stationery package for R in testing

The R package "stationery" offers themes and templates for 8 document types, along with working examples that use Rmarkdown, LaTeX/Sweave, and LaTeX/knitr to prepare guides, slides, and reports.

The 8 document types are

  1. HTML webpages prepared with Rmarkdown
  2. PDF guide documents prepared with Rmarkdown
  3. PDF report documents prepared with Rmarkdown
  4. PDF guide documents prepared with LaTeX/NoWeb using the knitr code chunk engine
  5. PDF guide documents prepared with LaTeX/NoWeb using the Sweave code chunk engine
  6. PDF report documents prepared with LaTeX/NoWeb using the knitr code chunk engine
  7. PDF report documents prepared with LaTeX/NoWeb using the Sweave code chunk engine
  8. PDF slides prepared with a KU customized Beamer slide template and the Sweave code chunk engine.

At the current time, the customization and integration of the Sweave code chunk engine with our document types is considerably further along. The knitr chunk engine documents are much more difficult to customize, so for now they remain in a mostly default state.

Today we have uploaded the first version for general testing, numbered 0.71. To install in R, run the following.

CRAN <- ""
KRAN <- ""
options(repos = c(KRAN, CRAN))
install.packages("stationery", dep = TRUE, type = "source")

Below the signature, see an R session transcript demonstrating that this really will work.

Right now, all 8 of the document types can be initialized by the function we provide called "initWriteup". Creating a new template folder is as simple as running
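For example, to start the Sweave-based slides type, the call might look like this (a sketch on my part; the "dir" argument is an assumption, check ?initWriteup for the actual interface):

```r
library(stationery)
## create a skeleton folder for the Sweave-based Beamer slides type;
## the "dir" argument here is an assumption, not taken from the post
initWriteup("rnw2pdf-slides-sweave", dir = "writeup")
```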


That will create a directory "writeup/rnw2pdf-slides-sweave". The first time you compile the document, it will create a folder named "theme" and copy some template files and logos into that folder. After that, if you want to replace the logos with your own pictures, or customize the templates, it will be quite obvious how to do so.

The vignettes directory right now has several instruction sheets that are nearing a presentable state. The vignettes "Rmarkdown", "code_chunks", and "HTML_special_features" are readable for humans, if not prize winning. The document-type-specific vignettes are going to undergo clarification this weekend.

Among the document-specific guides listed here, the asterisked writeups are most complete:

instructions-rnw2pdf-guide-knit.pdf (*)
instructions-rnw2pdf-report-knit.pdf (*)


> CRAN <-""
> KRAN <-""
> options(repos =c(KRAN,CRAN))
> install.packages("stationery", dep=TRUE, type="source")
Installing package into ‘/home/pauljohn/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL ''
Content type 'application/x-gzip' length 6786069 bytes (6.5 MB)
downloaded 6.5 MB

* installing *source* package ‘stationery’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
   ‘HTML_special_features.Rmd’ using ‘UTF-8’
   ‘Rmarkdown.Rmd’ using ‘UTF-8’
   ‘code_chunks.Rmd’ using ‘UTF-8’
   ‘instructions-rmd2pdf-guide.Rmd’ using ‘UTF-8’
   ‘instructions-rmd2pdf-report.Rmd’ using ‘UTF-8’
   ‘instructions-rnw2pdf-guide-knit.Rnw’ using ‘UTF-8’
   ‘instructions-rnw2pdf-report-knit.Rnw’ using ‘UTF-8’
   ‘instructions-rnw2pdf-guide-sweave.Rnw’ using ‘UTF-8’
   ‘instructions-rnw2pdf-report-sweave.Rnw’ using ‘UTF-8’
   ‘instructions-rnw2pdf-slides-sweave.Rnw’ using ‘UTF-8’
   ‘stationery.Rnw’ using ‘UTF-8’
** testing if installed package can be loaded
* DONE (stationery)

The downloaded source packages are in
Posted in Data Analysis

Using too many files in your cluster account? What to do!

After the cluster restart yesterday, I got this message indicating I'm over the limit on the number of files allowed in a home account. Previously I was allowed 100,000; now it appears that has been lowered to 85,000. Oh, well.

The email looks like this:

PanActive Manager Warning: User Quota Violation Soft (files)

Date:        Wed Feb 07 01:19:29 CST 2018           
System Name: pfs.local                              
System IP: 
Version      6.3.1.a-1371318.1                      
Customer ID: 1021288                                

User Quota Violation Soft (files):  Limit reached on volume /home for Unix User:   (Id:  uid:142424) Limit = 85.00 K.

The above message applies to the following component:
    Volume: /home 

If I wanted to find the directories in my account where I'm using a lot of storage, I would do this

$ du -h --max-depth=1

to see the total disk space used by each directory under my home folder. However, that does not help me count the number of files in use.

I checked with Riley at KU CRC and he gave me this magic recipe.

$ for i in $(find . -maxdepth 1 ! -path . -type d); do echo -n $i": "; (find $i -type f | wc -l); done | sort -k 2 -n
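Unrolled with comments, the recipe reads like this (same logic, reformatted, run here in a scratch directory with two subdirectories holding 1 and 3 files):

```shell
# Demo in a scratch directory: subdirectory "a" holds 1 file, "b" holds 3.
tmp=$(mktemp -d); cd "$tmp"
mkdir a b
touch a/f1 b/f1 b/f2 b/f3
# For each top-level directory, print "dir: file-count", then sort by count.
counts=$(for i in $(find . -maxdepth 1 ! -path . -type d); do
    echo -n "$i: "             # directory name
    find "$i" -type f | wc -l  # number of regular files beneath it
done | sort -k 2 -n)           # numeric sort on the count column
echo "$counts"
```

The directory with the fewest files sorts to the top, so the biggest offenders are at the bottom of the listing.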

I'd seen that before, but forgotten it. I can use the same trick to get a proper sorting of the du output, so I can also see which folders are holding the most knowledge (er, files).

$ du --max-depth=1 | sort -n

Note I removed the human-readable flag ("-h") and instructed sort to treat the first column as a number, rather than text.

Posted in Data Analysis

New update of kutils package for R available

I have uploaded kutils-1.31 on KRAN. To try it, fix your repositories like so:

CRAN <- ""
KRAN <- ""
options(repos = c(KRAN, CRAN))

If you have kutils already, then fire away with an update:

update.packages(ask = FALSE, checkBuilt = TRUE)

In case you don't have kutils yet (how could that possibly be?), run

install.packages("kutils", dep = TRUE)

Today's updates include new functions:

importQualtrics : Imports Qualtrics files (csv or xlsx), gets column names correct, and avoids corruption caused by lines 2 and 3 in most Qualtrics files. I need an example file for this one; if you have a Qualtrics file for which you have legal rights, let's see about it.

mergeCheck : reports on bad and unmatched values of the "by" (key) variable. Has examples!

anonomize : converts IDs to conceal the identity of survey participants

keyTemplateSPSS : creates a key describing the value_old, value_new changes implied by SPSS value labels

Please test mergeCheck; it is the function I've been promising you for some time (example pasted below). I need to hear what it does to mangle your own key variables and such.

This release also includes the miscellaneous little updates for the variable key that we've been implementing for the past 2 months.

keysPoolCheck : for several keys from related data sets, checks for mismatches in class

keysPool: tries to blend keys. We have had this since summer, but did not export it into the workspace because it was not well tested. So please test it.

Why now?

Soon we will have a separate package for the SEM table writer functions, semTable. I mean, we have that now, soon to be on CRAN. That other package relies on some general purpose functions in kutils, and it is necessary to upload kutils to CRAN and have it approved before we can upload semTable.

mergeCheck output to tantalize you.

> library(kutils)
> example(mergeCheck)

mrgChc> df1 <- data.frame(id = 1:7, x = rnorm(7))

mrgChc> df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))

mrgChc> mc1 <- mergeCheck(df1, df2, by = "id")
Merge difficulties detected

Unmatched cases from df1 and df2 :
  id          x
1  1 -0.9292122
7  7  0.3819611
  id          x
6  9 -1.3481932
7 10  0.6325223

mrgChc> ## Use mc1 objects mc1$keysBad, mc1$keysDuped, mc1$unmatched
mrgChc> df1 <- data.frame(id = c(1:3, NA, NaN, "", " "), x = rnorm(7))

mrgChc> df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))

mrgChc> mergeCheck(df1, df2, by = "id")
Merge difficulties detected

Unacceptable key values
    id          x
4 <NA> -0.5295721
5  NaN -0.1538291
6       0.3073913
Duplicated key values
  id           x
4  5  0.57832104
5  6 -0.01897907
6  5  1.08232745
7  6  1.88448038
Unmatched cases from df1 and df2 :
    id          x
1    1  0.9593052
4 <NA> -0.5295721
5  NaN -0.1538291
6       0.3073913
7      -0.3304785
  id           x
3  4 -0.47677841
4  5  0.57832104
5  6 -0.01897907
6  5  1.08232745
7  6  1.88448038

mrgChc> df1 <- data.frame(idx = c(1:5, NA, NaN), x = rnorm(7))

mrgChc> df2 <- data.frame(idy = c(2:6, 9:10), x = rnorm(7))

mrgChc> mergeCheck(df1, df2, by.x = "idx", by.y = "idy")
Merge difficulties detected

Unacceptable key values
  idx         x
6  NA -1.457980
7 NaN -0.707062
Unmatched cases from df1 and df2 :
  idx          x
1   1  0.8150989
6  NA -1.4579796
7 NaN -0.7070620
  idy         x
5   6 1.2318111
6   9 1.1306955
7  10 0.1466441
Posted in Data Analysis

kutils package update: more powerful SEM table maker

I have uploaded kutils-1.25, a beta testing version on KRAN. If your computer is not set to automatically pull updates from KRAN already, run this:

CRAN <- ""
KRAN <- ""
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE, checkBuilt = TRUE)

In case you don't have kutils yet (how could that possibly be?), run

install.packages("kutils", dep = TRUE)

The special feature we are testing now is the enhanced semTable function. In a nutshell, here is what this can do. Provide one or more structural equation models fitted with lavaan. Then summary tables can be presented that display the models side by side. Using the arguments columns and paramSets, the user is able to control which columns are displayed for which model, and which model sections are included.

The examples section of the help page for semTable includes 20 examples. It also demonstrates a new function called testtable, which can compile and display the PDF output from LaTeX tables.

To whet your appetite, this code fits a multi-group CFA:

 tempdir <- tempdir()
 HS.model <- ' visual  =~ x1 + x2 + x3
                   textual =~ x4 + x5 + x6
                   speed   =~ x7 + x8 + x9'
 fit1.g <- cfa(HS.model, data = HolzingerSwineford1939, = TRUE, group = "school")
 fit1.gt1 <- semTable(fit1.g, columns = c("estsestars", "p"),
                    columnLabels = c(estsestars = "Est w/stars", p = "p-value"),
                    file = file.path(tempdir, "fit1.gt1"))
if (interactive()) testtable("fit1.gt1", tempdir)

This creates a PDF file that shows the 2 groups side by side.

The default settings will display all of the groups, side by side. That might become crowded, so we allow the user to select which columns are to be displayed for the groups. It is possible also to use the groups argument to name one or more groups to be displayed in the presentation (helps when there are too many groups to fit in one table). In the multi group model, we do not allow different columns to be displayed for the various groups.
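For instance, a sketch using the multi-group fit above (the group label "Grant-White" comes from the school variable in HolzingerSwineford1939; the exact argument behavior may differ, see ?semTable):

```r
## show only one group from the multi-group fit
## (a sketch on my part, not taken from the post)
fit1.gt2 <- semTable(fit1.g, groups = "Grant-White",
                     columns = c("estsestars", "p"),
                     file = file.path(tempdir, "fit1.gt2"))
if (interactive()) testtable("fit1.gt2", tempdir)
```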

If the user has several SEMs, they can be displayed side by side, and the user can customize the list of displayed columns for each separate model. As we demonstrate here, we can fit an unstandardized SEM and a standardized model and display them side by side.

## Fit same model with standardization
fit1 <- cfa(HS.model, data = HolzingerSwineford1939,
    = TRUE, meanstructure = TRUE)
 fit1.std <- update(fit1, = TRUE, std.ov = TRUE, meanstructure = TRUE)
 ## include 2 models in table request
 fit1.t2 <- semTable(list("Ordinary" = fit1, "Standardized" = fit1.std),
                     file = file.path(tempdir, "fit1.2.1"))
 semTable(list("Ordinary" = fit1, "Standardized" = fit1.std),
     columns = list("Ordinary" = c("est", "se"), "Standardized" = c("est")),
     columnLabels = c(est = "Est", se = "SE"), file = file.path(tempdir, "fit1.2.2"))
 if (interactive()) testtable("fit1.2.2", tempdir)

These examples demonstrate the ability to create LaTeX tables, because that's what I use. However, the function is also designed to create tables in HTML and CSV formats. The magic recipe for that is to insert, say, type = c("latex", "html", "csv") into the semTable function call. If you also add file = "whatever", then three files will be created: "whatever.tex", "whatever.html", and "whatever.csv". The HTML file can be viewed within the R session by running
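One simple way to do that, using base R (an assumption on my part, not necessarily the author's method):

```r
## open the generated HTML table in the default browser;
## "whatever.html" is the illustrative file name from the text
browseURL("whatever.html")
```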


The appeal of the HTML output is that an HTML file can be opened by a word processor, say LibreOffice or MS Word, and it will generally be turned into a table object which can be edited. The CSV file may be used in the same way, in conjunction with a spreadsheet program.

This semTable edition is intended to allow one to fit different models that can be combined into one table. The standardized example above is perhaps not the most persuasive demonstration. Consider the following instead.

## Model 5 - Mediation model with equality constraints
model5 <- '
    # latent variable definitions
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + e*y2 + d*y3 + y4
    dem65 =~ y5 + e*y6 + d*y7 + y8
    # regressions
    dem60 ~ a*ind60
    dem65 ~ c*ind60 + b*dem60
    # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8

    # indirect effect (a*b)
    ## := operator defines new parameters
    ab := a*b

    ## total effect
    total := c + (a*b)
'
fit5 <- sem(model5, data = PoliticalDemocracy)
fit5boot <- sem(model5, data = PoliticalDemocracy, se = "bootstrap",
                bootstrap = 100)
## Model 5b - Revision of Model 5
model5b <- '
    # latent variable definitions
    ind60 =~ x1 + x2
    dem60 =~ y1 + e*y2 + d*y3 + y4
    dem65 =~ y5 + e*y6 + d*y7 + y8
    # regressions
    dem60 ~ a*ind60
    dem65 ~ c*ind60 + b*dem60
    # no residual correlations
    # indirect effect (a*b)
    ## := operator defines new parameters
    ab := a*b

    ## total effect
    total := c + (a*b)
'
fit5b <- sem(model5b, data = PoliticalDemocracy, se = "bootstrap",
             bootstrap = 100)
semTable(list("Model 5" = fit5boot, "Model 5b" = fit5b),
         columns = c("estsestars", "rsquare"),
         file = file.path(tempdir, "fit5.5"),
         type = c("latex", "html", "csv"),
         longtable = TRUE)
testtable("fit5.5", tempdir)

It seems to me this result is not exactly perfect. I wish it did not print "NA" on the omitted loadings. But I'm headed in the right direction. This lines up the parts of the models and makes it plain which pieces are included in each one.

Posted in Data Analysis, R

R modules: Super Exciting New Updates

This is revised Monday, July 24, 2017.

Some of you have reported segmentation faults during the past week. We learned they come from 3 different problems. First, some people have R packages compiled in their user accounts. These fall out-of-date with the R packages we provide, causing incompatibility. Second, some new compute nodes came online during the past 2 weeks and some are missing support libraries. When these are missing, the R packages that rely on them (such as our beloved kutils or rockchalk) would fail to load. This was a tricky problem because it only happened on some nodes, which only became recently available. Third, I did not understand the gravity and drama involved with the user account setup and the Rmpi package.

Let's cut to the chase. What should users do now?

Step 1. Remove module statements from submission scripts.

Those statements are not having the anticipated effect, and they will destroy the benefits of the changes I suggest next.

I'm told this problem does not affect all MPI jobs, just ones that use R and the style of parallelization that we understand.

Step 2. Configure your individual user account to load appropriate modules.

The needed modules should be loaded for every session launched by your account, on every node. They have to be THE SAME on all nodes and cores launched by the job. There are 2 ways to get this done.

Option 1. The easy way: use my R package module stanza.

In the cluster file system, I have a file /panfs/pfs.local/work/crmda/tools/bin/ with contents like this:


module purge
module load legacy
module load emacs
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
export OMPI_MCA_btl

I say "like this" because I may insert new material there. The last 2 lines were inserted July 22, 2017. The goal is to conceal all of the details from users by putting them in a module that's loaded, such as Rstats/3.3. When we are ready to transition to R-3.4, I'll change that line accordingly.

In your user accounts, there are 2 files where you can incorporate this information: ~/.bashrc and ~/.bash_profile. Add a line like this at the end of each file:

source /panfs/pfs.local/work/crmda/tools/bin/

I'll show you my ~/.bashrc file so you can see the larger context:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export LS_COLORS=$LS_COLORS:'di=0;33:'
# alert for rm, cp, mv
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# color and with classification
alias ls='ls -F --color=auto'
alias ll='ls -alF'

source /panfs/pfs.local/work/crmda/tools/bin/

I strongly urge all of our cluster users to include the "alert for rm, cp, mv" piece. This causes the system to ask for confirmation before deleting or replacing files. But that's up to you. I also have an adjustment to the colors of the directory listing.

I insert the same "source" line at the end of ~/.bash_profile as well.

On 2017-07-23, I made a minor edit in my .bashrc and .bash_profile files:

export PATH=/panfs/pfs.local/work/crmda/tools/bin:$PATH

This is equivalent, but gives me a side benefit. Instead of adding the source function with the full path, I inserted that bin folder into my path. That means I can use any script in that folder without typing out the full path. When I find very handy shell scripts that I use often, and I think the other users should have access to them as well, then I will put them in that folder. For example, if you look there today, you should see "", which is the new one I'm working on. When that's ready, it will become "" and the old one will get renamed as "", where xxxx is the date on which it becomes the old one.
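The PATH mechanics can be sketched in a throwaway shell session (the directory and script names here are invented for illustration):

```shell
# Sketch: prepending a directory to PATH makes its scripts runnable by name.
# The directory and the "hello-example.sh" script are invented for this demo.
tmp=$(mktemp -d)
printf '#!/bin/sh\necho hello-from-tools\n' > "$tmp/hello-example.sh"
chmod +x "$tmp/hello-example.sh"
export PATH="$tmp:$PATH"
out=$(hello-example.sh)   # found via PATH, no full path typed
echo "$out"
```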

Option 2. Add your own module statements in ~/.bashrc and ~/.bash_profile

Make sure you put the same modules in both ~/.bashrc and ~/.bash_profile. Look at the file /panfs/pfs.local/work/crmda/tools/bin/ to get ideas of what you need. For example, run

$ cat /panfs/pfs.local/work/crmda/tools/bin/

You might consider creating a file similar to /panfs/pfs.local/work/crmda/tools/bin/ in your account. Then source that at the end of your ~/.bashrc and ~/.bash_profile. If you do that, they will always stay consistent.

Frequently Asked Questions that I anticipate

Can you explain why the segmentation fault happens?

Answer: Yes, I have some answers.

Here is the basic issue. Suppose you have a submission script that looks like this:

#MSUB -N RParallelHelloWorld 
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -m bea


module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.3

mpiexec -n 1 R --vanilla -f parallel-hello.R 

I thought we were supposed to do that, until last week. Here's what is wrong with it.

The environment specifies Rstats/3.3, but that ONLY applies to the "master" node in the R session. It does not apply to the "child" nodes that are spawned by Rmpi. When those nodes are spawned, they are completely separate shell sessions, launched with the settings in ~/.bash_profile. If your ~/.bash_profile does not have the required modules, then the new nodes are going to have the system default R session, and guess what you get with that? The wrong shared libraries for just about everything. Possibly you get a different version of Rmpi or Rcpp loaded, and when the separate nodes start talking to each other, they notice the difference and sometimes crash.

As a result, the submission scripts, for example, in hpcexample/Ex65-R-parallel, will now look like this:

#MSUB -N RParallelHelloWorld
#MSUB -q crmda
#MSUB -l nodes=1:ppn=11:ib
#MSUB -l walltime=00:50:00
#MSUB -m bea


## Please check your ~/.bash_profile to make sure
## the correct modules will be loaded with new shells.
## See discussion:

mpiexec -n 1 R --vanilla -f parallel-hello.R

Why is this needed for both ~/.bashrc and ~/.bash_profile?

Answer: You ask a lot of questions.

The short answer is "there's some computer nerd detail". The long answer is: when you log in on a system, the settings in ~/.bash_profile are used. That is a "login shell". If you are already logged in and you run a command that launches a new shell inside your session, for example by running "bash", then your new shell is not a "login shell". It will be created with the settings in ~/.bashrc.

If you will never run an interactive session, never interact with R via Emacs or RStudio, then it might be enough to change ~/.bash_profile. If you think you might ever want to log in and run a small test case, then you should have the same settings in both ~/.bashrc and ~/.bash_profile.
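The difference between the two startup files can be demonstrated with a throwaway HOME directory (a sketch, independent of the cluster):

```shell
# Sketch: login shells read ~/.bash_profile; interactive non-login shells
# read ~/.bashrc. A fake HOME shows which file each shell type sources.
tmp=$(mktemp -d)
echo 'export CAME_FROM=bash_profile' > "$tmp/.bash_profile"
echo 'export CAME_FROM=bashrc'       > "$tmp/.bashrc"
login_out=$(HOME="$tmp" bash -l -c 'echo $CAME_FROM')
nonlogin_out=$(HOME="$tmp" bash -i -c 'echo $CAME_FROM' 2>/dev/null)
echo "login shell sourced: $login_out"
echo "non-login shell sourced: $nonlogin_out"
rm -rf "$tmp"
```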

What are the benefits of Option 1?

Answer: Over time, the CRMDA R setup may evolve. Right now, I've already built a setup for Rstats/3.4. After we do some bug-testing, I can easily update the shell file (/panfs/pfs.local/work/crmda/tools/bin/) to use that. If you maintain your own modules, then you have to do that yourself.

What are the dangers of Option 1?

Answer: If I get it wrong, then you get it wrong.

Does this mean you need to revise all of the code examples in the hpcexample set?

Answer: Yes. It has not been a good week. And it looks like it won't be a good week again.

Why didn't we hear about this in the old community cluster, or in CRMDA's HPC cluster?

Answer: Because "we" were in control of the cluster settings and user accounts, the cluster administrators would work all of this out for us and they inserted the settings in the shell for us. Some of you may open your ~/.bashrc or ~/.bash_profile and see the old cluster settings. When I opened mine on 2017-07-07, I noticed that I had modules loaded from the old cluster. I also noticed I'd made an error of editing ~/.bashrc and not ~/.bash_profile.

Why didn't we see these problems before?

Answer: Dumb luck.

In the new CRC-supervised cluster, some modules are loaded automatically. As those modules were more-or-less consistent with what we need, the different environments were not causing segmentation faults. However, when we update R packages like Rstan, Rcpp, and, well, anything with a lot of shared libraries, then we hit the crash.

I notice you don't have orterun in your submission example. Do you really mean mpiexec?

Answer: The documentation says that orterun, mpiexec, and mpirun are all interchangeable. I rather enjoyed orterun; it sounds fancy. However, it appears mpiexec is more widely used. There are more advanced tools (such as mpiexec.hydra, which we might start using).

In your submission script, why don't you specify $PBS_NODEFILE anymore?

Answer: The program mpiexec is compiled in a way that makes this no longer necessary. It is not harmful to specify $PBS_NODEFILE, but it is not needed either. The hpcexamples will get cleaned up. The CRMDA cluster documentation will need to be corrected.

Posted in Computing, R

Rstats/3.3 and Rstats/3.4 updates: dealing with OpenMPI and Infiniband library concerns.

Dear CRMDA cluster users

During the past 2 months, some of us have seen the MPI warning from parallel R programs:

An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

We have wrestled with this. Today I've made a decision about what to do. The CRMDA modules for Rstats/3.3 and 3.4 will prevent the OpenMPI (parallel computing) framework from trying to access the Infiniband network devices. That makes the warning go away. Because the ethernet communication devices are slower than Infiniband, this is not a decision taken lightly.

The CRMDA R module stanza should "just work", with either

module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.3

or

module purge 
module load legacy 
module load emacs 
module use /panfs/pfs.local/work/crmda/tools/modules 
module load Rstats/3.4

How is this done?

I've rebuilt openmpi-1.10.7, which is also now in our module collection, so I have the power to insert the special configuration described below.

R Packages

The list of packages kept up to date system-wide is the same in Rstats/3.3 and Rstats/3.4. A full list is included at the end of this announcement.

If you find that updates cause your applications to break, you are allowed to install old versions of R packages in ~/R.

Details about the OpenMPI/openib warning message.

Embarrassingly, while googling for help on this message, I've discovered that, in 2010, I was in the exact same situation setting up the CRMDA cluster that used to be in the Structural Biology Center. It had completely gone out of my mind, but with the new cluster in 2017 and fresh installs of OpenMPI, we hit the problem again.

Here is what I've learned about OpenMPI and Rmpi during the past 2 weeks.

I don't know enough computer science to fully understand the dangers of forks and data corruption when OpenMPI uses Infiniband. Perhaps one of you can tell me.

  1. Rmpi will compile with OpenMPI >= 2.0, but it is not fully compatible. The Rmpi author has written to me directly that he is working on revisions that will make them compatible. One symptom of the problem we find is that stopCluster() does not work. It hangs the session entirely. The only way to shut down the cluster is mpi.quit(), which terminates the R session entirely.

  2. Rmpi will compile/run with OpenMPI < 2.0.

However, on systems that have Infiniband connective devices and openib libraries, there will be warnings about threads and forks as well as a danger of data corruption. The warning from OpenMPI is triggered by such innocuous R functions as sessionInfo().

Here is a session that shows the warning, using R-3.4 in the cluster.

$ R

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Microsoft R Open 3.4.0
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2017 Microsoft Corporation

Using the Intel MKL for parallel mathematical computing(using 1 cores).

Default CRAN mirror snapshot taken on 2017-05-01.

[Previously saved workspace restored]

> library(Rmpi)
> sessionInfo()
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          n410 (PID 34456)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

Matrix products: default
BLAS: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/
LAPACK: /panfs/pfs.local/software/install/MRO/3.4.0/microsoft-r/3.4/lib64/R/lib/

[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rmpi_0.6-6           RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
[1] compiler_3.4.0   RevoUtils_10.0.4 parallel_3.4.0

I do not know how dangerous forks might be, but if you go read this message, it appears they can cause data corruption, and this has been known since 2010:

It is above my understanding to say whether garden variety R users will cause these problems. I do know the R parallel documentation warns against system calls and forks, possibly for the same reason. R functions that use the disk, such as dir.create and list.files, make a system call that would fall into the dangerous fork category. Possibly. This is a little above my pay grade.

Conservative approach

My "better safe than sorry" instinct leads to this conclusion: TURN OFF INFINIBAND SUPPORT IN OpenMPI. This is the policy we adopted in 2010. It was in place on the KU community cluster. In the new cluster, it was not in place, resulting in the warning message. I had forgotten about this for a long time. With the newly installed OpenMPI, I ran into the same old problem.

This can be done in the user account by creating ~/.openmpi/mca-params.conf (or, system-wide, etc/openmpi-mca-params.conf in the openmpi install folder) with this line:

btl = ^openib

That prevents OpenMPI from using the Infiniband transport layer. I am doing this in the CRMDA OpenMPI module configuration.

One can tell whether an Infiniband device is detected with the shell program "ompi_info" provided by OpenMPI. Load the module Rstats/3.3 or Rstats/3.4, run "ompi_info", and look for the btl stanza. The output from ompi_info looks like this if you have Infiniband:

   MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)

And like this after changing either ~/.openmpi/mca-params.conf or etc/openmpi-mca-params.conf to include btl = ^openib:

   MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
   MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)

I believe it is worth mentioning that, if some of your compute nodes have Infiniband and some do not, then OpenMPI jobs will crash when they try to combine nodes connected by Ethernet with nodes connected by Infiniband. That is another reason to tell OpenMPI not to try to use Infiniband at all.

If users do want to use Infiniband within OpenMPI, they can do so by editing a personal configuration file in ~/.openmpi.

Alphabetical R package list.

As of 2017-07-05, these are the packages we install in the directory "/panfs/pfs.local/work/crmda/tools/mro/3.3" (or 3.4)

c("ADGofTest", "AER", "Amelia", "BH", "BMA", "BradleyTerry2", 
"Cairo", "Cubist", "DBI", "DCluster", "DEoptimR", "Devore7", 
"DiagrammeR", "ENmisc", "Ecdat", "Ecfun", "Formula", "GPArotation", 
"HistData", "Hmisc", "HyperbolicDist", "ISwR", "Iso", "JGR", 
"JM", "JMdesign", "JavaGD", "Kendall", "LearnBayes", "MCMCpack", 
"MCPAN", "MEMSS", "MNP", "MPV", "MatchIt", "Matching", "MatrixModels", 
"MplusAutomation", "NMF", "PASWR", "PolynomF", "R2HTML", "R2OpenBUGS", 
"RColorBrewer", "RCurl", "RGtk2", "RSvgDevice", "RUnit", "RandomFields", 
"Rcmdr", "RcmdrMisc", "Rcpp", "RcppArmadillo", "RcppEigen", "Rd2roxygen", 
"Rmpi", "SASmixed", "SemiPar", "SoDA", "SparseM", "StanHeaders", 
"StatDataML", "SweaveListingUtils", "", "TeachingDemos", 
"UsingR", "VGAM", "VIM", "XML", "Zelig", "abind", "acepack", 
"actuar", "ada", "ade4", "adehabitat", "akima", "alr3", "amap", 
"aod", "ape", "aplpack", "arm", "arules", "assertthat", "backports", 
"base64enc", "bayesm", "bcp", "bdsmatrix", "bestglm", "betareg", 
"biglm", "bit", "bit64", "bitops", "bnlearn", "brew", "brglm", 
"caTools", "cairoDevice", "car", "caret", "cellranger", "censReg", 
"chron", "clue", "clv", "cocorresp", "coda", "coin", "colorspace", 
"combinat", "copula", "corpcor", "crayon", "cubature", "data.table", 
"deldir", "descr", "dichromat", "digest", "diptest", "distr", 
"dlm", "doBy", "doMC", "doMPI", "doParallel", "doSNOW", "dotCall64", 
"dse", "e1071", "earth", "ecodist", "effects", "eha", "eiPack", 
"emplik", "evaluate", "expm", "faraway", "fastICA", "fastmatch", 
"fda", "ffmanova", "fields", "flexmix", "foreach", "formatR", 
"forward", "gam", "gamlss", "", "gamlss.dist", "gamm4", 
"gbm", "gclus", "gdata", "gee", "geepack", "geoR", "geoRglm", 
"ggm", "ggplot2", "glmc", "glmmBUGS", "glmmML", "glmnet", "glmpath", 
"gmodels", "gmp", "gpclib", "gridBase", "gridExtra", "gsl", "gsubfn", 
"gtable", "gtools", "hexbin", "highr", "htmltools", "htmlwidgets", 
"igraph", "ineq", "influence.ME", "inline", "iplots", "irlba", 
"iterators", "itertools", "jpeg", "jsonlite", "kernlab", "knitr", 
"kutils", "labeling", "laeken", "languageR", "lars", "latticeExtra", 
"lava", "lavaan", "lavaan.survey", "lazyeval", "leaps", "lme4", 
"lmeSplines", "lmec", "lmm", "lmtest", "locfit", "logspline", 
"longitudinal", "longitudinalData", "lpSolve", "ltm", "magrittr", 
"manipulate", "maps", "maptools", "markdown", "matrixcalc", "maxLik", 
"mboost", "mcgibbsit", "mclust", "mcmc", "mda", "memisc", "memoise", 
"mi", "micEcon", "mice", "microbenchmark", "mime", "minqa", "misc3d", 
"miscTools", "mitools", "mix", "mixtools", "mlbench", "mnormt", 
"modeltools", "msm", "multcomp", "munsell", "mvProbit", "mvbutils", 
"mvtnorm", "network", "nloptr", "nnls", "nor1mix", "norm", "nortest", 
"np", "numDeriv", "nws", "openxlsx", "ordinal", "orthopolynom", 
"pan", "partDSA", "party", "pbivnorm", "pbkrtest", "pcaPP", "permute", 
"pixmap", "pkgKitten", "pkgmaker", "plm", "plotmo", "plotrix", 
"pls", "plyr", "pmml", "pmmlTransformations", "png", "polspline", 
"polycor", "polynom", "portableParallelSeeds", "ppcor", "profileModel", 
"proto", "proxy", "pscl", "psidR", "pspline", "psych", "quadprog", 
"quantreg", "randomForest", "randomForestSRC", "rattle", "rbenchmark", 
"rbugs", "rda", "readxl", "registry", "relimp", "rematch", "reshape", 
"reshape2", "rgenoud", "rgl", "rlang", "rlecuyer", "rmarkdown", 
"rms", "rngtools", "robustbase", "rockchalk", "roxygen2", "rpart.plot", 
"rpf", "rprojroot", "rrcov", "rstan", "rstudioapi", "sandwich", 
"scales", "scatterplot3d", "segmented", "sem", "semTools", "setRNG", 
"sets", "sfsmisc", "shapefiles", "simsem", "sm", "smoothSurv", 
"sna", "snow", "snowFT", "sp", "spam", "spatialCovariance", "spdep", 
"splancs", "stabledist", "stabs", "startupmsg", "statmod", "statnet.common", 
"stepwise", "stringi", "stringr", "strucchange", "subselect", 
"survey", "systemfit", "tables", "tcltk2", "tensorA", "testthat", 
"texreg", "tfplot", "tframe", "tibble", "tidyverse", "timeDate", 
"tis", "tree", "triangle", "trimcluster", "trust", "ucminf", 
"urca", "vcd", "vegan", "visNetwork", "waveslim", "wnominate", 
"xtable", "xts", "yaml", "zipfR", "zoo", "KernSmooth", "MASS", 
"Matrix", "MicrosoftR", "R6", "RUnit", "RevoIOQ", "RevoMods", 
"RevoUtils", "RevoUtilsMath", "base", "boot", "checkpoint", "class", 
"cluster", "codetools", "compiler", "curl", "datasets", "deployrRserve", 
"doParallel", "foreach", "foreign", "grDevices", "graphics", 
"grid", "iterators", "jsonlite", "lattice", "methods", "mgcv", 
"nlme", "nnet", "parallel", "png", "rpart", "spatial", "splines", 
"stats", "stats4", "survival", "tcltk", "tools", "utils")
Posted in Data Analysis, R | Leave a comment

kutils update

kutils, our utility package that includes the Variable Key framework, was updated to version 1.0 on CRAN last week.

Minor bug fixes will be offered on our package server KRAN, which users can access by running R code like this:

CRAN <- ""
KRAN <- ""
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE, checkBuilt = TRUE)

That presupposes you have kutils already, of course. If not, run install.packages instead.

I've just uploaded version 1.10 to KRAN; it has a small fix in the reverse function, which reverses the ordering of factor levels. In case you wonder what that is, here is a code snippet:

##' Reverse the levels in a factor
##' Simple literal reversal. Will stop with an error message if x is
##' not a factor (or ordered) variable.
##' Sometimes people want to
##' reverse some levels, excluding others and leaving them at the end
##' of the list. The "eol" argument sets aside some levels and puts
##' them at the end of the list of levels.
##' The use case for the \code{eol} argument is a factor
##' with several missing value labels, as appears in SPSS. With
##' up to 18 different missing codes, we want to leave them
##' at the end. In the case for which this was designed, the
##' researcher did not want to designate those values as
##' missing before inspecting the pattern of observed values.
##' @param x a factor variable
##' @param eol values to be kept at the end of the list
##' @export
##' @return a new factor variable with reversed values
##' @author Paul Johnson <>
##' @examples
##' ## Consider alphabetication of upper and lower
##' x <- factor(c("a", "b", "c", "C", "a", "c"))
##' levels(x)
##' xr1 <- reverse(x)
##' xr1
##' ## Keep "C" at end of list, after reverse others
##' xr2 <- reverse(x, eol = "C")
##' xr2
##' y <- ordered(x, levels = c("a", "b", "c", "C"))
##' yr1 <- reverse(y)
##' yr1
##' ## Hmm. end of list amounts to being "maximal".
##' ## Unintended side-effect, but interesting.
##' yr2 <- reverse(y, eol = "C")
##' yr2
reverse <- function(x, eol = c("Skip", "DNP")){
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    if (length(eol) > 0){
        for (jj in eol){
            if (length(yyy <- grep(jj, rlevels))){
                rlevels <- c(rlevels[-yyy], jj)
            }
        }
    }
    factor(x, levels = rlevels)
}

If for some reason you don't want to install/update kutils, you can just as well paste that code into your R file and use it as the example demonstrates.
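As a quick sanity check, here is a self-contained copy of the function (roxygen comments omitted) run on the first documented example. The explicit levels= argument is my addition, so the result does not depend on the locale's sort order:

```r
## Self-contained copy of reverse(), suitable for pasting into a script.
reverse <- function(x, eol = c("Skip", "DNP")){
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    if (length(eol) > 0){
        for (jj in eol){
            if (length(yyy <- grep(jj, rlevels))){
                rlevels <- c(rlevels[-yyy], jj)
            }
        }
    }
    factor(x, levels = rlevels)
}

## Explicit levels keep the demonstration locale-independent.
x <- factor(c("a", "b", "c", "C", "a", "c"), levels = c("a", "b", "c", "C"))
xr1 <- reverse(x)            # plain reversal: C, c, b, a
xr2 <- reverse(x, eol = "C") # reverse the rest, keep "C" last: c, b, a, C
```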

Posted in Data Analysis | Leave a comment

Cluster faster, Rstan optimized as of 2017-05-17

Special thanks to Wes Mason of the ITTC. There are 2 breakthroughs to report today.

Nodes are faster

During the spring, users reported that calculations were taking longer. I raised the problem with Wes and he did some diagnosis. It appeared the node BIOS could be adjusted to allow calculations to run faster--nearly two times faster! The CRC administrators understood the issue and they implemented the fixes on May 15, 2017.

Testing on May 16 confirmed that MCMC jobs that were taking 25 hours now take 12 hours.

Now Rstan is optimized as well

I had a lot of trouble getting the settings right to build Rstan on the cluster. It turns out that the user who builds Rstan needs special settings in a hidden file in the user account. I tried that in February and failed for various reasons, but now victory is at hand. This is one example of why we don't suggest individual users try to compile these packages--it is simply too difficult and frustrating.

To use the specially built Rstan, it is necessary to follow the 5-step incantation described in the previous post, R Packages available for CRMDA cluster members.

These packages are compiled with GCC-6.3, the latest and greatest, with the C++ optimizer dialed up to "-O3".

In case you need to compile Rstan with GCC-6.3, here is what I have in the ~/.R/Makevars file:

R_XTRA_CPPFLAGS =  -I$(R_INCLUDE_DIR)   #set_by_rstan
## for OpenMx
CXX1X = g++
CXX1XSTD =  -std=c++0x
## For Rstan
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-Wno-ignored-attributes -Wno-deprecated-declarations

The Rstan installation manual suggests two other flags, "-flto -ffat-lto-objects", but these cause a compilation failure. We believe these are not compatible with GCC-6.3.

The other thing worth knowing is that the GCC compiler will demand much more memory than you expect. In February, I was failing over and over because the node was allowing me access to 500MB, but 5GB was necessary. Unfortunately, the error message is completely opaque, suggesting an internal bug in GCC, rather than exhaustion of memory. That was another problem that Wes Mason diagnosed for us.

Posted in Data Analysis | Leave a comment

Apply for our Student Hourly Position

The link for students to apply is:

The last day students can apply is May 23, 2017, and committee members can review candidates by logging into the BrassRing system on or after May 24, 2017.

Posted in Data Analysis | Leave a comment