Revolution R in the new ACF cluster

The cluster runs Red Hat Enterprise Linux 6 (RHEL 6), which is too old to support the newest versions of R. The principal weakness is the older gcc compiler that ships with RHEL 6.

In the cluster, however, we have access to the much newer Intel compiler and MKL math libraries, so the R program, and the things on which it relies, can be built with the Intel compiler. It appears as though we can stay up to date with the troublesome R packages like rstan, Rcpp, and RcppArmadillo.

Wes Mason of ITTC worked this out for us. The scheme we are testing now can be accessed as follows.

For people in the crmda user group, try this interactively

$ module purge
$ module use /panfs/pfs.local/work/crmda/tools/modules
$ module load Rstats/3.3

After that, observe

$ R

 > library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev: 
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
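
A quick way to confirm that the whole toolchain works is to compile and sample a trivial model. This is a smoke test I suggest, not part of the original run; if the Intel-built setup is wired up correctly, it should finish without compiler errors.

library("rstan")
## Minimal smoke test: compile a one-parameter model and draw a few samples.
scode <- "parameters { real y; } model { y ~ normal(0, 1); }"
fit <- stan(model_code = scode, chains = 2, iter = 200)
print(fit)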

We are still in a testing phase on this setup, and surely there will be problems. I do not yet understand what is necessary to compile new R packages with this setup. We don't want packages built with gcc if we can avoid it; there is always a danger of incompatibility when shared libraries are built with different compilers.
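
One small check is possible even now (an assumption on my part about what is worth verifying, not a guarantee that every package was built this way): ask R which compilers it is configured to invoke for source packages.

## Ask R which compilers it will use when building packages from source.
## If the Intel toolchain is active, these should report icc and icpc rather than gcc and g++.
system("R CMD config CC")
system("R CMD config CXX")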

But the key message is still encouraging. Even though the OS does not have the needed parts, there is a workaround.

Why is this "Revolution R"? The company behind Revolution R, Revolution Analytics, which was later purchased by Microsoft, popularized the use of the Intel MKL on Ubuntu Linux. A version of R built with Intel's compiler was used, with permission, on Ubuntu in 2012. The version of R we are using now goes by the moniker "MRO". Can you guess what the M and the R stand for?

Posted in Data Analysis

Making sure fonts are embedded in LaTeX thesis and dissertation documents

KU thesis rules require that all fonts used in the submitted PDF document be embedded in the document itself. This requirement eliminates the problem of special symbols that are not legible when the document is opened on the receiver's computer.

Making sure all fonts are embedded appears to be not so easy across platforms. When I compile the KU thesis document, I notice that the Wingding and symbol fonts are not embedded.

However, this is not a flaw in pdflatex as it currently exists. It was a pdflatex flaw in the past. So far as I can tell, all fonts needed in the pdflatex run are embedded if you use a LaTeX distribution that is reasonably modern.

The major problem arises when a document includes other PDF documents, using \includegraphics{} for example. If those included documents are lacking in embedded fonts, then pdflatex does not fix that.

In my example document, before 20160503, the fonts were missing because they were not embedded in the R plots that are included in the example chapters. I had to go back and re-run the R code to make sure the fonts were embedded in the pdf files for the graphs. After that, the pdflatex output of the thesis template is fine.

You can check for yourself. Run

$ pdffonts thesis-ku.pdf

If we don't fix the R output files before compiling the thesis itself, we are in a somewhat dangerous situation. People suggest using various magic wands to add fonts, but all of them seem to have major flaws. They either corrupt the quality of the output or destroy its internal structure.

I found ways to embed fonts using Ghostscript. This converts the document to PostScript and then back to PDF.

$ pdf2ps  thesis-ku.pdf
$ ps2pdf14 -dPDFSETTINGS=/prepress -dEmbedAllFonts=true thesis-ku.ps thesis-ku.pdf

The bad news: (1) it destroys internal hyperlinks, and (2) it does not embed fonts needed for material in included graphics (things inserted by \includegraphics{}, such as PDFs produced by R).


In my opinion, this is a bad outcome and should not happen. But it does.

As a result, it seems necessary to fix the individual PDF graphics files before compiling the larger thesis document.

This reminds me that at one point I had a post-processing script written for R Sweave sessions that would embed fonts in all pdf output files.

The shell script would cycle through all of the R output and embed fonts. Enjoy!

for i in *.pdf; do
    base=`basename $i .pdf`
    basenew="${base}-embed.pdf"   ## name for the re-written copy
    ## echo "$i base: $base new: $basenew"
    /usr/bin/gs -o $basenew -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite $i
    mv -f $basenew $i
done

The same can be achieved inside R. Each time a PDF is created, embed the fonts with the embedFonts() function; see ?embedFonts.
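
For example, a minimal sketch (embedFonts() calls out to Ghostscript, which must be on the PATH):

## Create a figure, then embed its fonts in place.
pdf("fig-example.pdf")
plot(rnorm(100), main = "Example figure")
dev.off()
embedFonts("fig-example.pdf", outfile = "fig-example.pdf")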

Posted in Data Analysis

ACF Cluster resource limits: home file space and file quota

User home folders are limited to 100GB and no customization is allowed. To our users who were previously limited to 20GB, that's great news. To the others who had 600GB allocations, that's a disaster. Oh, well. Just one among many.

When you log in, a system status message appears. One part of the report is disk usage. Here's what I see today:

Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn

   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  65.04  85.00 100.00 :  136150  85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

In case you want to see the same output later, the new cluster has a command called "mystats" that will display it again. In the terminal, run

$ mystats

In the output about my home folder, there is a "hard limit" at 100GB, as you can see. That is not adjustable in the current regime.

The main concern today is that I'm over the limit on the number of files. The limit is now 100,000 files, but I have 136,150. While I'm over the limit, I am not allowed to create new files, so the system can prevent me from doing my job.

Wait a minute. 136,150 files? WTH? Last time I checked, there were only 135,998 files and I'm sure I did not add any. Did some make babies? Do you suppose some R files found some C++ files and made an Rcpp project? (That's programmer humor. It knocks them out at conferences.)

I probably have files I don't need any more. I'm pretty sure that, for example, when I compile R, it uses tens of thousands of files. Maybe I can move that work somewhere else.

I wondered how I could find out where I have all those files. We asked and the best suggestion so far is to run the following, which sifts through all directories and counts the files.

for i in $(find . -maxdepth 1 -type d);do echo $i;find $i -type f |wc -l;done

The output shows each directory name followed by its file count, like this:

.
995

I'll have to sift through that. Clearly, there are some files I can live without. I've got about 20K files in TMPRlib, which is a building spot for R packages before I put them in the generally accessible part of the system. .ccache is the compiler cache; I can delete those files, since they are saved only to speed up C compiler jobs and will be regenerated as needed, but that is a trade-off I have to weigh.

So far, I've obliterated the temporary build information, but I remain over the quota. I'll show the output from "mystats" so that you can see the difference:

$ mystats
Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn
   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  63.26  85.00 100.00 :  113510  85000 100000 : /home/pauljohn uid:xxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

Oh, well, I'll have to cut/move more things.

The take-aways from this post are

  1. The CRC put in place a hard, unchangeable 100GB limit on user home directories.

  2. There is a limit of 100,000 on the number of files that can be stored within that. Users will need to cut files to be under the limit.

  3. One can use the find command in the shell to find out where the files are.

How to avoid the accidental buildup of files? The main issue is that compiling software (R packages) creates intermediate object files that are not needed once the work is done. It is difficult to police these files (at least it is for me).

I don't have time to write all this down now, but here is a hint. The question is where to store "temporary" files that are needed to compile software or run a program, but not needed after that. In many programming chores, one can point the "build" folder at a faster, temporary storage device that is not in the network file system. In the past, I've usually used "/tmp/a_folder_i_create" because that is on the disk "in" the compute node, and access to the local disk is much faster than the network file system. Lately, I'm told it is even faster to put temporary material in "/dev/shm", although I do not have much experience with that yet. With a little clever planning, one can write the temporary files to a much faster memory disk where they are easily disposed of and, so far as I can see today, do not count against the file quota. This is not to be taken lightly: I've compared the time required to compile R using the network file storage against the local temporary storage, and the difference is 45 minutes versus 15 minutes. A rough sketch of the idea follows.
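
The following is only a sketch of how one might redirect R's package-build staging area. It assumes /dev/shm exists and is writable on the compute node, and that R CMD INSTALL honors the TMPDIR environment variable inherited from the session; both points are worth verifying on your own node.

## Point the staging area for package builds at a fast in-memory directory
## instead of the network file system (assumption: /dev/shm is available).
buildtmp <- file.path("/dev/shm", Sys.getenv("USER"), "R-build")
dir.create(buildtmp, recursive = TRUE, showWarnings = FALSE)
Sys.setenv(TMPDIR = buildtmp)       # child R CMD INSTALL processes should unpack here
install.packages("Rcpp")            # intermediate build files land in /dev/shm
unlink(buildtmp, recursive = TRUE)  # dispose of the temporary material afterward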

Posted in Programming

Interactive sessions on HPC

Danger: new smaller memory default!

At the user meeting on April 12, we found out that requesting 1 core will automatically provide only 500MB of memory. This is a BIG change, because in the older cluster we received 2GB per core, and that was generally sufficient. That is to say, we almost never needed to specify memory.

The default interactive session is not likely to be sufficient, so you will need to specify memory explicitly.

As a result, the command to ask for 1 node with 1 processor (core) on that node would be

msub -X -I -l nodes=1:ppn=1,pmem=2048m 

This asks for graphics X11 forwarding (-X). The memory can also be specified as "2gb".

If you only want 1 core on 1 node, the simpler notation would be to use the flag "procs".

msub -X -I -l procs=1,pmem=2048m 

To ask for several cores on 1 node (to test a multicore project, for example), run

msub -X -I -l nodes=1:ppn=5,pmem=2048m

** Specify a queue **

Interactive jobs can be run on any queue. By default, they go to the user's nodes.

The default queue is displayed with 'mystats'. If you wish to run on a node that is not in your owner group, like a GPGPU node, you will then need to specify the sixhour queue and the node name. You will only have a maximum of 6 hours on this node. There is no time limit to your default queue.

msub -X -I -l nodes=1:ppn=5,pmem=2048m -q sixhour

One can specify a particular node, "g0001", with a request like:

msub -X -I -lnodes=g001:ppn=1 -q sixhour

CRC made a page regarding queues and has relocated it at

Update 20170413

We requested a simpler way to launch the usual type of interactive session--one node, one core--as we had in the old cluster. The administrators created a script "qxlogin" which the user can run from the login node.

$ qxlogin
qsub: waiting for job 40565091.sched to start
qsub: job 40565091.sched ready

We suggest caution with this, since the new memory default limit is 500MB and CRMDA users have regularly reported frustration with unanticipated job failures.

In case you want to write your own login script, you can take an example from the new qxlogin, which I found is installed in /usr/local/bin on the new cluster.

$ cat /usr/local/bin/qxlogin


/opt/moab/bin/msub -X -I -lnodes=1:ppn=1 $ARGS

If you want more interactive nodes, or more ppn, just change the 1's. To test that, suppose you save it as "qxlogin2", then run

$ sh qxlogin2

If you enjoy the result, save that file in your $HOME/bin directory and make it executable; then it will be more generally available within your sessions. After that, there is no need to run "sh" before "qxlogin2". Try it out, and let me know if there is trouble.

Posted in Data Analysis

Cluster user update

We will have a cluster update meeting on Friday at 10AM in Watson Room 440D (within the suite of the Digital Humanities group).

Today the Center for Research Computing announced the re-opening of the compute cluster. A number of features we have come to depend on were removed. All of the CRMDA documentation will need to be revised. This will take some time. These changes were not well publicized during the six-month-long runup to the cluster administration changeover, so we are playing catch-up.

They have declined to support NoMachine GUI connections, and the cluster storage is not externally accessible via Windows Server or Network File System protocols. We will have to find ways to navigate around those changes.

The top priority right now is updating the hpc example collection,

Most of that work has been kindly attended to by Wes Mason at KU ITTC.

Here is a copy of the announcement.

KU Community Cluster Users,

Over the course of the last few weeks we have been working to transition the administration of the KU Community Cluster to the Center for Research Computing (CRC). We have completed testing with a subset of users and we are now restoring access for all users who are part of an owner group. If you know someone in your group that did not get this announcement, please email

We have kept the underlying legacy software environment the same to make this transition simpler, but have made some improvements and updates that you will need to be aware of to use the cluster. We will be building upon these initial improvements over the coming months to standardize, implement best practices, update and integrate the software stack, provide transparency of resources utilization, integrate with KU, and help you optimize your use of the cluster.


We have integrated with KU's identity management system so you will use your KU username and password to access the cluster. We have 2 login nodes that you will randomly be assigned to when you login to the address:



'env-selector' was removed and only 'module' is available to load different software packages.

When issuing the command:

> module avail

you will see the new software we have compiled that is optimized for the latest version of the CPUs in the cluster.

To see the software installed before this transition, you must enter:

> module load legacy

and then you can see all legacy software by entering the command:

> module avail

You must place these commands in your job submit scripts as well if you choose to use the legacy software.


'qsub' has been replaced with 'msub'. All your submit scripts will still work with 'msub'. The #PBS directives in your job submit scripts are also compatible with 'msub', but we suggest that when you create new job submit scripts you use the #MSUB directives.


Your home directory now has a 100GB quota. We have integrated the cluster with KU's identity management system so your home directory also matches the KU home directory path (e.g., /home/a123b456).

All data from /research, /projects, and /data, and any data in your own root-level directory (for example, /compbio), has been placed in

/panfs/pfs.local/work/<owner group>/<user>

If your owner group has used all their storage allocation or if your group does not have a storage allocation, some of your data had to be moved to $SCRATCH:

/panfs/pfs.local/scratch/<owner group>/<user>

We organized the data to better keep track of usage for owner groups. Scratch has been set up in the same manner. Some groups were previously allocated more storage than they purchased and you will see your quota for your $HOME, $WORK, and $SCRATCH directories when you log on. If you see any directory at 100%, then you must remove files before writing to it.

To see your quota, group, and queue stats at anytime, run:

> mystats

on the submit nodes.

NO data was deleted. If you see that you are missing something, please contact us. Please check all paths first.


Your default queue will be displayed when you log in. This is the queue you will run in if you do not specify a queue name. If you wish to run across the whole cluster, you must specify:

#MSUB -q sixhour

in your job script or from command line:

> msub -q sixhour

You may only run a maximum of 6 hours on the 'sixhour' queue, but your jobs can go across all nodes.

Most users will only have access to their owner group queue and the 'sixhour' queue. Others will be part of multiple groups and have access to other queues as well.

All of this information will be displayed when you login to the cluster for at least the first few months after coming back online.

We are continuing to write documentation and help pages about the new setup of the cluster. These pages can be found at under the HPC tab and more will be added as time goes on so check back often. We will also have an introduction to the cluster next Wednesday, March 8, at 10:30am during our regular monthly HPC meeting (location TBD).

We understand that change can sometimes be a little jarring, so if you have any questions feel free to contact us and we will get back to you as soon as we can.

Thank you, Center for Research Computing Team

Posted in Data Analysis

Long-running Mplus Bootstrapping Example

In the high performance computing example archive, we've just inserted Example 05, a long-running multi-core Mplus exercise.

This one demonstrates how I suggest we ought to keep the data, code files, and output files in separate folders, even if we are using Mplus!

Special thanks to Chong Xing, of the KU Dept. of Communications, for the example and the real-life data set that goes with it. This explores mediation in a structural equation model with the Children of Immigrants data set.

Posted in Data Analysis

Text Analysis with R

We are having a little practice session. Here are quick notes about working through the examples in Matthew L. Jockers' fine book, Text Analysis with R for Students of Literature.

Browse here:

and download the zip file that it points at here:

Save that zip file INSIDE a project folder. My folder is called R-text.

Unzip that package! (Don't just let the file manager peer inside it. You need to extract it.) It creates a directory called:


Inside there, there is a directory structure, including text files and R code. Use the file manager to change into the directory "start.up.code" and then a chapter, such as "chapter.3".

If you open the R file in an R-aware editor (e.g., RStudio), the code won't run as it is. But it is easy to fix. Change the name of the data file by inserting "../../" at the beginning, like so:

text.v <- scan("../../data/plainText/melville.txt", what="character", sep="\n")

After doing that, you can step through the example code line by line.

It appears to me the starter code has all of the basic data manipulation work. It does not include the code to manufacture graphs.
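
If we want a quick picture anyway, here is a small sketch of my own (not from the book's starter code) that continues from the text.v object created above:

## Tokenize the scanned text and plot the ten most frequent words.
novel.v <- paste(text.v, collapse = " ")
words.l <- strsplit(tolower(novel.v), "\\W")
word.v <- unlist(words.l)
word.v <- word.v[word.v != ""]
freqs.t <- sort(table(word.v), decreasing = TRUE)
barplot(freqs.t[1:10], las = 2, main = "Ten most frequent words")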

We can explore that when we meet together...

Posted in Data Analysis, R

Portable Parallel Seeds for R (and other cluster updates)

We've been reworking the high performance computing examples so that they line up with the latest and greatest advice about how to organize submissions on the ACF cluster computing system. Please update your copy of the hpcexample archive (see

In the process, we noticed that some updates are possible in our "portableParallelSeeds" package for R. Because this package alters the random number generator structure in the R environment, we are not releasing it to the CRAN system. It can be installed from our KU server, however. We suggest you try the following:

CRAN <- ""
KRAN <- ""
options(repos = c(KRAN, CRAN))
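
With the repositories set, installation should then be the usual call (assuming, as I expect, that the package is published under this name on the KU repository):

install.packages("portableParallelSeeds")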

Remember this: if you want to get updates by running "update.packages()" inside R, it is necessary to run the first 3 lines here to set your system to look for packages in KRAN.

The portableParallelSeeds package is delivered with two vignettes (essays) named "PRNG-basics" and "pps". To see all about it, run

help(package = "portableParallelSeeds")

Along with the package, a prototype design for Monte Carlo simulations is included. It is in the install folder of the package. There is a directory named "examples" and the prototype is "paramSweep-1.R".
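
To locate that prototype from within R after the package is installed, something like the following should work (a sketch; it assumes the examples directory ships inside the installed package tree):

system.file("examples", "paramSweep-1.R", package = "portableParallelSeeds")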

Posted in Data Analysis

Do not let this happen to your survey project

[survey joke cartoon]

Bring in your survey and let us take a look! Perhaps one of our GRAs can save you from this ignominious fate. We have open (free) walk-in consulting. Find out more on the Open Consulting Page.

Posted in Data Analysis

Cluster Computing Updates

Hot off the press! Advanced Computing Facility Cluster news.

In the past 2 months, several of our users have noticed that some jobs they submit take a long, long time to start. Some have noticed that the same job will fail or succeed apparently at random.

This came to a head on Tuesday; we've exerted ourselves and now have a mostly understandable set of answers. I promised news as soon as I understood it, and here it is. We will have most of this written up in detail in the hpcexample collection within a week.

To explain the intermittent problems, there are several separate items I need to mention. There are several balls in the air. Or are they spinning plates?

1. Our nodes are shared, to a certain extent.

That's the "community" in community cluster.

I asked for our nodes to be reserved so we can get quick access to them, but there is a detail I did not understand.

Our nodes in the cluster--24 systems, each with 20 cores--are open to short (6 hours or less) jobs from other users in the cluster. If one of those other users gets active, submitting 1000s of jobs, then our jobs wait as long as 6 hours before they start. Our jobs have "first priority" when currently running jobs end.

While it is technically possible to block those other users, I believe it would be impolite to do so at the current moment. We may revisit that decision in the future.

2. We are asking for resources in the wrong way.

We have been following principles that were established in the old hpc cluster we used previously. That cluster was only for CRMDA and its affiliates; it did not serve such a large audience as the current cluster.

In hpc, if we wanted 19 cores, we were told to allow the cluster to scatter them across nodes if it wanted. Submission for a 19 core job would read like this:

#PBS -l nodes=19:ppn=1

However, in the new cluster that is more likely to cause a total failure, because our jobs can get scattered across a lot of different nodes on which strangers are also running jobs.

Those strangers might not be well behaved. I mean, their jobs might not be well behaved. They may request 2GB memory, but then actually use 20GB memory. By gobbling up all memory, they cause our jobs to fail.

Hence, if we want a job using 19 cores, in the brave new world it is smarter and faster to run

#PBS -l nodes=1:ppn=19:ib

That is to say, if you have the whole node, or most of its cores, you are protected from bad strangers. In the old hpc cluster, that would have made a job slower to start. In the new cluster, it is the best bet.

I inserted that special bit, ":ib", and the next sections explain why.

We are getting revised, detailed instructions on how to ask for more cores, and how to specify memory. I'll follow up with information about that later.

3. Please understand this heterogeneous "transport layer" problem.

In the old hpc cluster, about 1/3 of the nodes had Infiniband. Jobs would fail all the time because of that diversity: nodes that had Infiniband could not interact with nodes that did not. We learned to insert ":noib" to prevent the Infiniband layer from being used. All of our nodes had ethernet connections, so they could use that slightly slower method.

In the new cluster, we thought all nodes had infiniband. So we told people "get rid of :noib". But some people did not get the message.

But now it appears that, in the new cluster, it is not true that all nodes have Infiniband. Possibly the slow/bad ones don't. Keep that in mind when you read the next point.

4. In the new cluster, we are asking for resources in an especially stupid way (as a result of the issue in part 3).

Some of us are still using submission scripts from the old hpc cluster, where we had "noib".

#PBS -l nodes=19:ppn=1:noib

Here's why that is especially wrong. All of the nodes that CRMDA controls now DO have Infiniband, and so if you insert "noib", you are PREVENTING the job from starting within the nodes on which we actually have a reservation. That means your jobs are orphaned among the cluster, waiting for a kind stranger to take them in.

That stranger node that lets you in is likely to be a slow, old one, which explains the "my job takes forever" problem you have noted recently.

Instead, FOR SURE, we need

#PBS -l nodes=19:ppn=1:ib

5. Heterogeneous error type 1.

Suppose you don't put ":ib". You launch the job and ask for 50 cores. If the node chosen to be your Rmpi master is on Infiniband, and it finds other nodes that are not, we get a crash like the one below. This should look familiar to a few of you:

  > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
[[41508,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
   Host: n301

Another transport will be used instead, although this may result in
lower performance.
orterun noticed that process rank 0 with PID 22368 on node n301 exited on signal 11 (Segmentation fault).

6. We need to get better at module and R package management.

User scripts can specify modules. That is possible, and it will fix the problem.

But it looks like an uphill battle for me to keep all of that clear among all of the different users. We'll try to handle it on our end so you don't have to bother.

You'll know if you hit this problem.

Some jobs fail like this:

 > library(snow)
 > p <- rnorm(123, m = 33)
 > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
orterun noticed that process rank 0 with PID 110202 on node n404 exited on signal 11 (Segmentation fault).

That's caused by a shared library version mismatch.

The R 3.2.3 packages are built against the OpenMPI C library. When I built those packages, the OpenMPI version was 1.6.5.

Since then, OpenMPI has had updates, but I did not rebuild the packages. I thought the R-3.2.3 libraries and setup were going to remember OpenMPI 1.6.6.


Our default module, Rstats, did not connect R-3.2.3 with the matching OpenMPI version. Thus, sometimes, when a job ran and used a particular OpenMPI feature, we had a compiled-library mismatch.

We can fix that ourselves by altering the submission script, say by specifying

module purge
module load openmpi/1.6.5
module load Rstats/3.2.2

However, I don't want to be that detailed. The fix for that, which will be in place very soon (by Monday, for sure, maybe sooner) is

1. The cluster admin will tie together the R choice with the OpenMPI choice.

2. I'll rebuild the R packages for R 3.3.1 with the newer OpenMPI, and then the admins will update the default module for crmda_remote users (that's you).

Posted in Computing