kutils update

kutils, our utility package that includes the Variable Key framework, was updated to version 1.0 on CRAN last week.

Minor bug fixes will be offered on our package server, KRAN, which users can access by running R code like this:

CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE, checkBuilt = TRUE)

That presupposes you have kutils already, of course. If not, run install.packages instead.
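
For a first-time install, the same repository list can be passed to install.packages. The following is a minimal sketch; install_if_missing is a hypothetical helper name of mine, not something kutils provides, and the commented-out call at the end needs network access.

```r
## Repository URLs copied from the post above
CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"

## install_if_missing() is a hypothetical helper, not part of kutils.
## It installs pkg only when it is not already available, and returns
## (invisibly) TRUE if an installation was attempted, FALSE otherwise.
install_if_missing <- function(pkg, repos = c(KRAN, CRAN)) {
    if (requireNamespace(pkg, quietly = TRUE)) return(invisible(FALSE))
    install.packages(pkg, repos = repos)
    invisible(TRUE)
}

## Usage (needs network access):
## install_if_missing("kutils")
```

Listing KRAN first mirrors the options(repos = ...) call above.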

I've just uploaded version 1.10 to KRAN. It has a small fix in the reverse function, which is intended to reverse the ordering of factor levels. In case you wonder what this is, here is a code snippet:

##' Reverse the levels in a factor
##'
##' Simple literal reversal. Will stop with an error message if x is
##' not a factor (or ordered) variable.
##'
##' Sometimes people want to
##' reverse some levels, excluding others and leaving them at the end
##' of the list. The "eol" argument sets aside some levels and puts
##' them at the end of the list of levels.
##'
##' The use case for the \code{eol} argument is a factor
##' with several missing value labels, as appears in SPSS. With
##' up to 18 different missing codes, we want to leave them
##' at the end. In the case for which this was designed, the
##' researcher did not want to designate those values as
##' missing before inspecting the pattern of observed values.
##' 
##' @param x a factor variable
##' @param eol values to be kept at the end of the list
##' @export
##' @return a new factor variable with reversed values
##' @author Paul Johnson <pauljohn@@ku.edu>
##' @examples
##' ## Consider alphabetization of upper and lower
##' x <- factor(c("a", "b", "c", "C", "a", "c"))
##' levels(x)
##' xr1 <- reverse(x)
##' xr1
##' ## Keep "C" at end of list, after reverse others
##' xr2 <- reverse(x, eol = "C")
##' xr2
##' y <- ordered(x, levels = c("a", "b", "c", "C"))
##' yr1 <- reverse(y)
##' yr1
##' ## Hmm. end of list amounts to being "maximal".
##' ## Unintended side-effect, but interesting.
##' yr2 <- reverse(y, eol = "C")
##' yr2
reverse <- function(x, eol = c("Skip", "DNP")){
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    if (length(eol) > 0){
        for (jj in eol){
            if (length(yyy <- grep(jj, rlevels))){
                rlevels <- c(rlevels[-yyy], jj)
            }
        }
    }
    factor(x, levels = rlevels)
}

If for some reason you don't want to install/update kutils, you can just as well paste that code into your R file and use it as the example demonstrates.
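
One caveat if you paste the function: grep treats each eol entry as a regular expression and matches substrings, so eol = "Skip" would also catch a level such as "Skipped". If exact level names are what you want, a variant along these lines may be safer. This is only a sketch; reverse_exact is a hypothetical name and is not the version shipped in kutils.

```r
## reverse_exact() is a hypothetical variant, not a kutils function.
## It reverses the factor levels, then moves any level named exactly
## in eol to the end of the level list (in the order given by eol).
reverse_exact <- function(x, eol = c("Skip", "DNP")) {
    if (!is.factor(x)) stop("your variable is not a factor")
    rlevels <- rev(levels(x))
    keep_last <- eol[eol %in% rlevels]
    rlevels <- c(setdiff(rlevels, keep_last), keep_last)
    factor(x, levels = rlevels)
}

## "Skipped" is an ordinary level here; only "Skip" goes to the end
x <- factor(c("low", "Skipped", "high", "Skip"),
            levels = c("low", "high", "Skipped", "Skip"))
levels(reverse_exact(x))
## [1] "Skipped" "high"    "low"     "Skip"
```

The levels are set explicitly in the example so the result does not depend on the locale's sort order.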

Posted in Data Analysis | Leave a comment

Cluster faster, Rstan optimized as of 2017-05-17

Special thanks to Wes Mason of the ITTC. There are two breakthroughs to report today.

Nodes are faster

During the spring, users reported that calculations were taking longer. I raised the problem with Wes and he did some diagnosis. It appeared the node BIOS could be adjusted to allow calculations to run nearly twice as fast. The CRC administrators understood the issue and implemented the fixes on May 15, 2017.

Testing on May 16 confirmed that MCMC jobs that were taking 25 hours now take 12 hours.

Now Rstan is optimized as well

I had a lot of trouble getting the settings corrected to build Rstan in the cluster. It turns out that the user who builds Rstan needs to have special settings in a hidden file in the user account. I tried that in February and failed for various reasons, but now victory is at hand. This is one example of why we don't suggest that individual users try to compile these packages: it is simply too difficult and frustrating.

To use the specially built Rstan, it is necessary to do the 5 step incantation described in the previous post, R Packages available for CRMDA cluster members.

These packages are compiled with GCC-6.3, the latest and greatest, with the C++ optimizer dialed up to "-O3".

In case you need to compile Rstan with GCC-6.3, here is what I have in the ~/.R/Makevars file:

R_XTRA_CPPFLAGS =  -I$(R_INCLUDE_DIR)   #set_by_rstan
## for OpenMx
CXX1X = g++
CXX1XFLAGS = -g -O2
CXX1XPICFLAGS = -fpic
CXX1XSTD =  -std=c++0x
## For Rstan
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-Wno-unused-local-typedefs
CXXFLAGS+=-Wno-ignored-attributes -Wno-deprecated-declarations

The Rstan installation manual suggests two other flags, "-flto -ffat-lto-objects", but these cause a compilation failure. We believe these are not compatible with GCC-6.3.

The other thing worth knowing is that the GCC compiler will demand much more memory than you expect. In February, I was failing over and over because the node was allowing me access to 500MB, but 5GB was necessary. Unfortunately, the error message is completely opaque, suggesting an internal bug in GCC, rather than exhaustion of memory. That was another problem that Wes Mason diagnosed for us.


Apply for our Student Hourly Position

The link for students to apply is:

https://employment.ku.edu/student/8685BR

The last day students can apply is May 23, 2017, and committee members can review candidates by logging into the BrassRing system on or after May 24, 2017.


R Packages available for CRMDA cluster members

This is the 20170425 update, which includes an updated module set and reports of success with Java and Tcl/Tk-based R packages. In other words, an almost complete victory is achieved. Special thanks to Wes Mason of ITTC.

To use R, here is a set of commands I run to set the environment. This is necessary every time I want to use R with Emacs. Let's call this the magic 5 line stanza, for the sake of discussion.

module purge
module load legacy
module load emacs  
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3

I agree if you say "it is a pain in the rump to have to remember to do that every time I log in." In the old cluster, I was in a position to place those startup commands into all of the CRMDA user environments. That is no longer the case.

I'm checking on ways you can automate this within your own account. Details are posted at the end of this article.

Consider (strongly) obliterating your $HOME/R package folder

When you want to work with R on the CRC cluster, please consider using the R packages we install within the $WORK folder for CRMDA group members. These packages have some special features and if you try to install them in your user folder (under $HOME/R, as R invites you to do if you run "install.packages()" in a session), then they may not compile correctly.

Recently, we have had runtime errors because the R we are recommending, as described below, is not compatible with packages that users build and install with other versions of R (or the same version of R in a different build environment). In particular, if

  1. You have packages built on the old ACF cluster, or
  2. You have recently installed packages without loading the modules listed below

then you should delete the packages you have under $HOME/R. I think it is best if you let us try to install what you need, but if you install R packages in your own home folder, please do so only AFTER loading the modules listed below. Please DO NOT load the CRC-provided module "R/3.3". It does not provide the services we need.
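
Before deleting anything, it may help to see which R version your personal packages were built under. Here is a minimal sketch; built_versions is a hypothetical helper, and it assumes your personal library is the folder R reports in the R_LIBS_USER environment variable (on Linux that is usually somewhere under $HOME/R).

```r
## built_versions() is a hypothetical helper for illustration.
## It returns a data frame of the packages found in one library
## folder, together with the R version each one was built under.
built_versions <- function(lib) {
    ip <- installed.packages(lib.loc = lib)
    data.frame(Package = ip[, "Package"], Built = ip[, "Built"],
               row.names = NULL, stringsAsFactors = FALSE)
}

## Assumption: personal packages live in the R_LIBS_USER folder.
userlib <- path.expand(Sys.getenv("R_LIBS_USER"))
if (dir.exists(userlib)) {
    print(built_versions(userlib))
}
```

Note that the Built column only shows the R version. It cannot tell you which compiler or modules were loaded at install time, so packages installed without the modules still need to be removed by hand.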

Background information

The module Rstats/3.3 is built by Wes Mason of ITTC and it is installed into the $WORK folder for CRMDA (hence the module use command above). We work together to make sure the OpenMPI layer is compiled correctly, so it is possible to use Rmpi and the R package parallel. The compiler used is GCC-6.3, which is quite a bit newer than the standard GCC which is provided with the cluster node operating system. This is the principal reason why the CRC-provided "R/3.3" is not acceptable. It does not make sure that the OpenMPI and GCC components are kept in lock-step with R itself. Observe, if we start with an empty session and run

module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3

we find that we actually load several modules:

$ module list

Currently Loaded Modules:
1) compiler/gcc/6.3   
2) openmpi/2.0
3) java/1.8.0_131
4) xz/5.2.3
5) icu/59.1
6) tcltk/8.6.6
7) Rstats/3.3

The openmpi version must be kept in lock-step with R and the packages we have installed in the past. gcc-6.3 is the compiler version we use for all of the packages. It is necessary to have that new version because of demands by packages like Rstan and OpenMx. The java and tcltk modules are needed by various R packages, such as rJava and tkrplot. The xz module is a decompression suite, needed to interact with source code itself. The Rstats module itself is, for the most part, a "holding company" that keeps all of this together. It simply loads the requirements of gcc, openmpi, java, xz, icu, and tcltk, and then it accesses the R provided by the CRC system maintainers from /panfs/pfs.local/software/install/MRO/3.3. The R packages provided by the base R install are found in the directory /panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library.

The R packages in our collection are, in most cases, going to be updates and replacements of those packages because we are building with a different compiler. Users who load Rstats/3.3 should notice that our directory comes into the user path before the system-wide folder. Inside R, we see:

> .libPaths()
[1] "/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library"
[2] "/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library"
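
To confirm which copy of a package R will actually use, something like the following may be useful. which_library is a hypothetical helper name; it relies only on find.package from base R.

```r
## which_library() is a hypothetical helper for illustration.
## It returns the library folder holding pkg: the loaded copy if pkg
## is already loaded, otherwise the first match in .libPaths().
which_library <- function(pkg) {
    dirname(find.package(pkg))
}

## "stats" ships with R itself, so this should print the base library
print(which_library("stats"))
```

On the cluster, calling which_library with the name of one of our packages should report the CRMDA site-library folder if the modules were loaded correctly.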

In my basic 5 line session starter sequence, I also have modules named legacy and emacs. In my opinion, that is a little dangerous because I'm jumbling together modules from the old and new cluster. That's necessary because I need an IDE with which to interact with R. Emacs was configured with ESS by Wes Mason, and it is in the legacy module set. If you prefer, legacy also provides RStudio version 9.98.978. That is, unfortunately, outdated and unmaintained. I've filed a request with CRC to get a new version of RStudio.

What do you get if you load our environment?

After the magic 5 line stanza above, within your R session, you have access to these packages (run "library()" to see all folders in your path, and all packages within):

Packages in library '/panfs/pfs.local/work/crmda/tools/mro/3.3/site-library':

ADGofTest               Anderson-Darling GoF test
AER                     Applied Econometrics with R
Amelia                  A Program for Missing Data
BH                      Boost C++ Header Files
BMA                     Bayesian Model Averaging
BradleyTerry2           Bradley-Terry Models
Cairo                   R graphics device using cairo graphics library
                        for creating high-quality bitmap (PNG, JPEG,
                        TIFF), vector (PDF, SVG, PostScript) and
                        display (X11 and Win32) output
Cubist                  Rule- And Instance-Based Regression Modeling
DBI                     R Database Interface
DCluster                Functions for the Detection of Spatial Clusters
                        of Diseases
DEoptimR                Differential Evolution Optimization in Pure R
Devore7                 Data sets from Devore's "Prob and Stat for Eng
                        (7th ed)"
DiagrammeR              Create Graph Diagrams and Flowcharts Using R
ENmisc                  Neuwirth miscellaneous
Ecdat                   Data Sets for Econometrics
Ecfun                   Functions for Ecdat
Formula                 Extended Model Formulas
GPArotation             GPA Factor Rotation
HistData                Data Sets from the History of Statistics and
                        Data Visualization
Hmisc                   Harrell Miscellaneous
HyperbolicDist          The hyperbolic distribution
ISwR                    Introductory Statistics with R
Iso                     Functions to Perform Isotonic Regression
JGR                     JGR - Java GUI for R
JM                      Joint Modeling of Longitudinal and Survival
                        Data
JMdesign                Joint Modeling of Longitudinal and Survival
                        Data - Power Calculation
JavaGD                  Java Graphics Device
Kendall                 Kendall rank correlation and Mann-Kendall trend
                        test
LearnBayes              Functions for Learning Bayesian Inference
MCMCglmm                MCMC Generalised Linear Mixed Models
MCMCpack                Markov Chain Monte Carlo (MCMC) Package
MCPAN                   Multiple Comparisons Using Normal Approximation
MEMSS                   Data sets from Mixed-effects Models in S
MNP                     R Package for Fitting the Multinomial Probit
                        Model
MPV                     Data Sets from Montgomery, Peck and Vining's
                        Book
MatchIt                 Nonparametric Preprocessing for Parametric
                        Casual Inference
Matching                Multivariate and Propensity Score Matching with
                        Balance Optimization
MatrixModels            Modelling with Sparse And Dense Matrices
ModelMetrics            Rapid Calculation of Model Metrics
MplusAutomation         Automating Mplus Model Estimation and
                        Interpretation
NMF                     Algorithms and Framework for Nonnegative Matrix
                        Factorization (NMF)
OpenMx                  Extended Structural Equation Modelling
PASWR                   PROBABILITY and STATISTICS WITH R
PBSmapping              Mapping Fisheries Data and Spatial Analysis
                        Tools
PolynomF                Polynomials in R
R2HTML                  HTML Exportation for R Objects
R2OpenBUGS              Running OpenBUGS from R
R6                      Classes with Reference Semantics
RColorBrewer            ColorBrewer Palettes
RCurl                   General Network (HTTP/FTP/...) Client Interface
                        for R
RGtk2                   R bindings for Gtk 2.8.0 and above
RSvgDevice              An R SVG graphics device.
RandomFields            Simulation and Analysis of Random Fields
RandomFieldsUtils       Utilities for the Simulation and Analysis of
                        Random Fields
Rcmdr                   R Commander
RcmdrMisc               R Commander Miscellaneous Functions
Rcpp                    Seamless R and C++ Integration
RcppArmadillo           'Rcpp' Integration for the 'Armadillo'
                        Templated Linear Algebra Library
RcppEigen               'Rcpp' Integration for the 'Eigen' Templated
                        Linear Algebra Library
Rd2roxygen              Convert Rd to 'Roxygen' Documentation
Rmpi                    Interface (Wrapper) to MPI (Message-Passing
                        Interface)
Rook                    Rook - a web server interface for R
SAScii                  Import ASCII files directly into R using only a
                        SAS input script
SASmixed                Data sets from "SAS System for Mixed Models"
SemiPar                 Semiparametic Regression
SoDA                    Functions and Examples for "Software for Data
                        Analysis"
SparseM                 Sparse Linear Algebra
StanHeaders             C++ Header Files for Stan
StatDataML              Read and Write StatDataML Files
SweaveListingUtils      Utilities for Sweave Together with TeX
                        'listings' Package
TH.data                 TH's Data Archive
TeachingDemos           Demonstrations for Teaching and Learning
UsingR                  Data Sets, Etc. for the Text "Using R for
                        Introductory Statistics", Second Edition
VGAM                    Vector Generalized Linear and Additive Models
VIM                     Visualization and Imputation of Missing Values
XML                     Tools for Parsing and Generating XML Within R
                        and S-Plus
Zelig                   Everyone's Statistical Software
abind                   Combine Multidimensional Arrays
acepack                 ACE and AVAS for Selecting Multiple Regression
                        Transformations
actuar                  Actuarial Functions and Heavy Tailed
                        Distributions
ada                     The R Package Ada for Stochastic Boosting
ade4                    Analysis of Ecological Data : Exploratory and
                        Euclidean Methods in Environmental Sciences
adehabitat              Analysis of Habitat Selection by Animals
akima                   Interpolation of Irregularly and Regularly
                        Spaced Data
alr3                    Data to accompany Applied Linear Regression 3rd
                        edition
amap                    Another Multidimensional Analysis Package
aod                     Analysis of Overdispersed Data
ape                     Analyses of Phylogenetics and Evolution
aplpack                 Another Plot PACKage: stem.leaf, bagplot,
                        faces, spin3R, plotsummary, plothulls, and some
                        slider functions
arm                     Data Analysis Using Regression and
                        Multilevel/Hierarchical Models
arules                  Mining Association Rules and Frequent Itemsets
assertthat              Easy Pre and Post Assertions
backports               Reimplementations of Functions Introduced Since
                        R-3.0.0
base64enc               Tools for base64 encoding
bayesm                  Bayesian Inference for
                        Marketing/Micro-Econometrics
bcp                     Bayesian Analysis of Change Point Problems
bdsmatrix               Routines for Block Diagonal Symmetric matrices
bestglm                 Best Subset GLM
betareg                 Beta Regression
biglm                   bounded memory linear and generalized linear
                        models
bit                     A class for vectors of 1-bit booleans
bit64                   A S3 Class for Vectors of 64bit Integers
bitops                  Bitwise Operations
bnlearn                 Bayesian Network Structure Learning, Parameter
                        Learning and Inference
brew                    Templating Framework for Report Generation
brglm                   Bias reduction in binomial-response generalized
                        linear models.
broom                   Convert Statistical Analysis Objects into Tidy
                        Data Frames
caTools                 Tools: moving window statistics, GIF, Base64,
                        ROC AUC, etc.
cairoDevice             Embeddable Cairo Graphics Device Driver
car                     Companion to Applied Regression
caret                   Classification and Regression Training
cellranger              Translate Spreadsheet Cell Ranges to Rows and
                        Columns
censReg                 Censored Regression (Tobit) Models
checkmate               Fast and Versatile Argument Checks
chron                   Chronological Objects which can Handle Dates
                        and Times
clue                    Cluster Ensembles
clv                     Cluster Validation Techniques
cocorresp               Co-Correspondence Analysis Methods
coda                    Output Analysis and Diagnostics for MCMC
coin                    Conditional Inference Procedures in a
                        Permutation Test Framework
colorspace              Color Space Manipulation
combinat                combinatorics utilities
commonmark              High Performance CommonMark and Github Markdown
                        Rendering in R
copula                  Multivariate Dependence with Copulas
corpcor                 Efficient Estimation of Covariance and
                        (Partial) Correlation
crayon                  Colored Terminal Output
cslogistic              Conditionally Specified Logistic Regression
cubature                Adaptive Multivariate Integration over
                        Hypercubes
data.table              Extension of `data.frame`
deldir                  Delaunay Triangulation and Dirichlet (Voronoi)
                        Tessellation
desc                    Manipulate DESCRIPTION Files
descr                   Descriptive Statistics
dichromat               Color Schemes for Dichromats
digest                  Create Compact Hash Digests of R Objects
diptest                 Hartigan's Dip Test Statistic for Unimodality -
                        Corrected
distr                   Object Oriented Implementation of Distributions
dlm                     Bayesian and Likelihood Analysis of Dynamic
                        Linear Models
doBy                    Groupwise Statistics, LSmeans, Linear
                        Contrasts, Utilities
doMC                    Foreach Parallel Adaptor for 'parallel'
doMPI                   Foreach parallel adaptor for the Rmpi package
doSNOW                  Foreach Parallel Adaptor for the 'snow' Package
dplyr                   A Grammar of Data Manipulation
dse                     Dynamic Systems Estimation (Time Series
                        Package)
e1071                   Misc Functions of the Department of Statistics,
                        Probability Theory Group (Formerly: E1071), TU
                        Wien
earth                   Multivariate Adaptive Regression Splines
ecodist                 Dissimilarity-based functions for ecological
                        analysis
effects                 Effect Displays for Linear, Generalized Linear,
                        and Other Models
eha                     Event History Analysis
eiPack                  eiPack: Ecological Inference and
                        Higher-Dimension Data Management
emplik                  Empirical Likelihood Ratio for
                        Censored/Truncated Data
evaluate                Parsing and Evaluation Tools that Provide More
                        Details than the Default
expint                  Exponential Integral and Incomplete Gamma
                        Function
expm                    Matrix Exponential, Log, 'etc'
faraway                 Functions and Datasets for Books by Julian
                        Faraway
fastICA                 FastICA Algorithms to perform ICA and
                        Projection Pursuit
fastmatch               Fast match() function
fda                     Functional Data Analysis
ffmanova                Fifty-fifty MANOVA
fields                  Tools for Spatial Data
flexmix                 Flexible Mixture Modeling
forcats                 Tools for Working with Categorical Variables
                        (Factors)
formatR                 Format R Code Automatically
forward                 Forward search
gam                     Generalized Additive Models
gamlss                  Generalised Additive Models for Location Scale
                        and Shape
gamlss.data             GAMLSS Data
gamlss.dist             Distributions to be Used for GAMLSS Modelling
gamm4                   Generalized Additive Mixed Models using 'mgcv'
                        and 'lme4'
gbm                     Generalized Boosted Regression Models
gclus                   Clustering Graphics
gdata                   Various R Programming Tools for Data
                        Manipulation
gee                     Generalized Estimation Equation Solver
geepack                 Generalized Estimating Equation Package
geoR                    Analysis of Geostatistical Data
geoRglm                 A Package for Generalised Linear Spatial Models
ggm                     Functions for graphical Markov models
ggplot2                 Create Elegant Data Visualisations Using the
                        Grammar of Graphics
glmc                    Fitting Generalized Linear Models Subject to
                        Constraints
glmmBUGS                Generalised Linear Mixed Models with BUGS and
                        JAGS
glmmML                  Generalized Linear Models with Clustering
glmnet                  Lasso and Elastic-Net Regularized Generalized
                        Linear Models
glmpath                 L1 Regularization Path for Generalized Linear
                        Models and Cox Proportional Hazards Model
gmodels                 Various R Programming Tools for Model Fitting
gmp                     Multiple Precision Arithmetic
gpclib                  General Polygon Clipping Library for R
gridBase                Integration of base and grid graphics
gridExtra               Miscellaneous Functions for "Grid" Graphics
grpreg                  Regularization Paths for Regression Models with
                        Grouped Covariates
gsl                     Wrapper for the Gnu Scientific Library
gsubfn                  Utilities for strings and function arguments.
gtable                  Arrange 'Grobs' in Tables
gtools                  Various R Programming Tools
haven                   Import and Export 'SPSS', 'Stata' and 'SAS'
                        Files
hexbin                  Hexagonal Binning Routines
highr                   Syntax Highlighting for R Source Code
hms                     Pretty Time of Day
htmlTable               Advanced Tables for Markdown/HTML
htmltools               Tools for HTML
htmlwidgets             HTML Widgets for R
httpuv                  HTTP and WebSocket Server Library
httr                    Tools for Working with URLs and HTTP
igraph                  Network Analysis and Visualization
ineq                    Measuring Inequality, Concentration, and
                        Poverty
influence.ME            Tools for Detecting Influential Data in Mixed
                        Effects Models
influenceR              Software Tools to Quantify Structural
                        Importance of Nodes in a Network
inline                  Functions to Inline C, C++, Fortran Function
                        Calls from R
iplots                  iPlots - interactive graphics for R
irlba                   Fast Truncated SVD, PCA and Symmetric
                        Eigendecomposition for Large Dense and Sparse
                        Matrices
itertools               Iterator Tools
jpeg                    Read and write JPEG images
kernlab                 Kernel-Based Machine Learning Lab
knitr                   A General-Purpose Package for Dynamic Report
                        Generation in R
kutils                  Project Management Tools
labeling                Axis Labeling
laeken                  Estimation of indicators on social exclusion
                        and poverty
languageR               Data sets and functions with "Analyzing
                        Linguistic Data: A practical introduction to
                        statistics".
lars                    Least Angle Regression, Lasso and Forward
                        Stagewise
latticeExtra            Extra Graphical Utilities Based on Lattice
lava                    Latent Variable Models
lavaan                  Latent Variable Analysis
lavaan.survey           Complex Survey Structural Equation Modeling
                        (SEM)
lazyeval                Lazy (Non-Standard) Evaluation
leaps                   Regression Subset Selection
lme4                    Linear Mixed-Effects Models using 'Eigen' and
                        S4
lmeSplines              Add smoothing spline modelling capability to
                        nlme.
lmec                    Linear Mixed-Effects Models with Censored
                        Responses
lmerTest                Tests in Linear Mixed Effects Models
lmm                     Linear Mixed Models
lmtest                  Testing Linear Regression Models
locfit                  Local Regression, Likelihood and Density
                        Estimation.
logspline               Logspline Density Estimation Routines
longitudinal            Analysis of Multiple Time Course Data
longitudinalData        Longitudinal Data
lpSolve                 Interface to 'Lp_solve' v. 5.5 to Solve
                        Linear/Integer Programs
ltm                     Latent Trait Models under IRT
lubridate               Make Dealing with Dates a Little Easier
magic                   create and investigate magic squares
magrittr                A Forward-Pipe Operator for R
manipulate              Interactive Plots for RStudio
maps                    Draw Geographical Maps
maptools                Tools for Reading and Handling Spatial Objects
markdown                'Markdown' Rendering for R
matrixcalc              Collection of functions for matrix calculations
maxLik                  Maximum Likelihood Estimation and Related Tools
mboost                  Model-Based Boosting
mcgibbsit               Warnes and Raftery's MCGibbsit MCMC diagnostic
mclust                  Gaussian Mixture Modelling for Model-Based
                        Clustering, Classification, and Density
                        Estimation
mcmc                    Markov Chain Monte Carlo
mda                     Mixture and Flexible Discriminant Analysis
mediation               Causal Mediation Analysis
memisc                  Tools for Management of Survey Data and the
                        Presentation of Analysis Results
memoise                 Memoisation of Functions
mi                      Missing Data Imputation and Model Checking
micEcon                 Microeconomic Analysis and Modelling
mice                    Multivariate Imputation by Chained Equations
microbenchmark          Accurate Timing Functions
mime                    Map Filenames to MIME Types
minqa                   Derivative-free optimization algorithms by
                        quadratic approximation
misc3d                  Miscellaneous 3D Plots
miscTools               Miscellaneous Tools and Utilities
mitools                 Tools for multiple imputation of missing data
mix                     Estimation/Multiple Imputation for Mixed
                        Categorical and Continuous Data
mixtools                Tools for Analyzing Finite Mixture Models
mlbench                 Machine Learning Benchmark Problems
mnormt                  The Multivariate Normal and t Distributions
modelr                  Modelling Functions that Work with the Pipe
modeltools              Tools and Classes for Statistical Models
msm                     Multi-State Markov and Hidden Markov Models in
                        Continuous Time
multcomp                Simultaneous Inference in General Parametric
                        Models
munsell                 Utilities for Using Munsell Colours
mvProbit                Multivariate Probit Models
mvbutils                Workspace organization, code and documentation
                        editing, package prep and editing, etc.
mvtnorm                 Multivariate Normal and t Distributions
neighbr                 Classification, Regression, Clustering with K
                        Nearest Neighbors
network                 Classes for Relational Data
nloptr                  R interface to NLopt
nnls                    The Lawson-Hanson algorithm for non-negative
                        least squares (NNLS)
nor1mix                 Normal (1-d) Mixture Models (S3 Classes and
                        Methods)
norm                    Analysis of multivariate normal datasets with
                        missing values
nortest                 Tests for Normality
np                      Nonparametric kernel smoothing methods for
                        mixed data types
numDeriv                Accurate Numerical Derivatives
nws                     R functions for NetWorkSpaces and Sleigh
openssl                 Toolkit for Encryption, Signatures and
                        Certificates Based on OpenSSL
openxlsx                Read, Write and Edit XLSX Files
ordinal                 Regression Models for Ordinal Data
orthopolynom            Collection of functions for orthogonal and
                        orthonormal polynomials
pan                     Multiple Imputation for Multivariate Panel or
                        Clustered Data
pander                  An R Pandoc Writer
partDSA                 Partitioning Using Deletion, Substitution, and
                        Addition Moves
party                   A Laboratory for Recursive Partytioning
pbivnorm                Vectorized Bivariate Normal CDF
pbkrtest                Parametric Bootstrap and Kenward Roger Based
                        Methods for Mixed Model Comparison
pcaPP                   Robust PCA by Projection Pursuit
permute                 Functions for Generating Restricted
                        Permutations of Data
pixmap                  Bitmap Images (``Pixel Maps'')
pkgKitten               Create Simple Packages Which Do not Upset R
                        Package Checks
pkgmaker                Package development utilities
plm                     Linear Models for Panel Data
plotmo                  Plot a Model's Response and Residuals
plotrix                 Various Plotting Functions
pls                     Partial Least Squares and Principal Component
                        Regression
plyr                    Tools for Splitting, Applying and Combining
                        Data
pmml                    Generate PMML for Various Models
pmmlTransformations     Transforms Input Data from a PMML Perspective
polspline               Polynomial Spline Routines
polycor                 Polychoric and Polyserial Correlations
polynom                 A Collection of Functions to Implement a Class
                        for Univariate Polynomial Manipulations
portableParallelSeeds   Allow Replication of Simulations on Parallel
                        and Serial Computers
ppcor                   Partial and Semi-Partial (Part) Correlation
praise                  Praise Users
profileModel            Tools for profiling inference functions for
                        various model classes
proto                   Prototype Object-Based Programming
proxy                   Distance and Similarity Measures
pscl                    Political Science Computational Laboratory,
                        Stanford University
psidR                   Build Panel Data Sets from PSID Raw Data
pspline                 Penalized Smoothing Splines
psych                   Procedures for Psychological, Psychometric, and
                        Personality Research
purrr                   Functional Programming Tools
quadprog                Functions to solve Quadratic Programming
                        Problems.
quantreg                Quantile Regression
rJava                   Low-Level R to Java Interface
randomForest            Breiman and Cutler's Random Forests for
                        Classification and Regression
randomForestSRC         Random Forests for Survival, Regression and
                        Classification (RF-SRC)
rattle                  Graphical User Interface for Data Mining in R
rbenchmark              Benchmarking routine for R
rbugs                   Fusing R and OpenBugs and Beyond
rda                     Shrunken Centroids Regularized Discriminant
                        Analysis
readr                   Read Rectangular Text Data
readxl                  Read Excel Files
registry                Infrastructure for R Package Registries
relimp                  Relative Contribution of Effects in a
                        Regression Model
rematch                 Match Regular Expressions with a Nicer 'API'
reshape                 Flexibly Reshape Data
reshape2                Flexibly Reshape Data: A Reboot of the Reshape
                        Package
rgenoud                 R Version of GENetic Optimization Using
                        Derivatives
rgexf                   Build, Import and Export GEXF Graph Files
rgl                     3D Visualization Using OpenGL
rlecuyer                R Interface to RNG with Multiple Streams
rmarkdown               Dynamic Documents for R
rms                     Regression Modeling Strategies
rngtools                Utility functions for working with Random
                        Number Generators
robustbase              Basic Robust Statistics
rockchalk               Regression Estimation and Presentation
roxygen2                In-Line Documentation for R
rpart.plot              Plot 'rpart' Models: An Enhanced Version of
                        'plot.rpart'
rpf                     Response Probability Functions
rprojroot               Finding Files in Project Subdirectories
rrcov                   Scalable Robust Estimators with High Breakdown
                        Point
rstan                   R Interface to Stan
rstudio                 Tools and Utilities for RStudio
rstudioapi              Safely Access the RStudio API
rvest                   Easily Harvest (Scrape) Web Pages
sandwich                Robust Covariance Matrix Estimators
scales                  Scale Functions for Visualization
scatterplot3d           3D Scatter Plot
segmented               Regression Models with Breakpoints/Changepoints
                        Estimation
selectr                 Translate CSS Selectors to XPath Expressions
sem                     Structural Equation Models
semTools                Useful Tools for Structural Equation Modeling
setRNG                  Set (Normal) Random Number Generator and Seed
sets                    Sets, Generalized Sets, Customizable Sets and
                        Intervals
sfsmisc                 Utilities from "Seminar fuer Statistik" ETH
                        Zurich
shapefiles              Read and Write ESRI Shapefiles
shiny                   Web Application Framework for R
simsem                  SIMulated Structural Equation Modeling
sm                      Smoothing methods for nonparametric regression
                        and density estimation
smoothSurv              Survival Regression with Smoothed Error
                        Distribution
sna                     Tools for Social Network Analysis
snow                    Simple Network of Workstations
snowFT                  Fault Tolerant Simple Network of Workstations
sourcetools             Tools for Reading, Tokenizing and Parsing R
                        Code
sp                      Classes and Methods for Spatial Data
spam                    SPArse Matrix
spatialCovariance       Computation of Spatial Covariance Matrices for
                        Data on Rectangles
spatialkernel           Nonparameteric estimation of spatial
                        segregation in a multivariate point process
spdep                   Spatial Dependence: Weighting Schemes,
                        Statistics and Models
splancs                 Spatial and Space-Time Point Pattern Analysis
stabledist              Stable Distribution Functions
stabs                   Stability Selection with Error Control
startupmsg              Utilities for Start-Up Messages
statmod                 Statistical Modeling
statnet.common          Common R Scripts and Utilities Used by the
                        Statnet Project Software
stepwise                Stepwise detection of recombination breakpoints
stringi                 Character String Processing Facilities
stringr                 Simple, Consistent Wrappers for Common String
                        Operations
strucchange             Testing, Monitoring, and Dating Structural
                        Changes
subselect               Selecting Variable Subsets
survey                  Analysis of Complex Survey Samples
survival                Survival Analysis
systemfit               Estimating Systems of Simultaneous Equations
tables                  Formula-Driven Table Generation
tcltk2                  Tcl/Tk Additions
tensorA                 Advanced tensors arithmetic with named indices
testthat                Unit Testing for R
texreg                  Conversion of R Regression Output to LaTeX or
                        HTML Tables
tfplot                  Time Frame User Utilities
tframe                  Time Frame Coding Kernel
tibble                  Simple Data Frames
tidyr                   Easily Tidy Data with 'spread()' and 'gather()'
                        Functions
tidyverse               Easily Install and Load 'Tidyverse' Packages
timeDate                Rmetrics - Chronological and Calendar Objects
tis                     Time Indexes and Time Indexed Series
tkrplot                 TK Rplot
tree                    Classification and Regression Trees
triangle                Provides the Standard Distribution Functions
                        for the Triangle Distribution
trimcluster             Cluster analysis with trimming
trust                   Trust Region Optimization
ucminf                  General-Purpose Unconstrained Non-Linear
                        Optimization
urca                    Unit Root and Cointegration Tests for Time
                        Series Data
vcd                     Visualizing Categorical Data
vegan                   Community Ecology Package
viridis                 Default Color Maps from 'matplotlib'
viridisLite             Default Color Maps from 'matplotlib' (Lite
                        Version)
visNetwork              Network Visualization using 'vis.js' Library
waveslim                Basic wavelet routines for one-, two- and
                        three-dimensional signal processing
wnominate               Roll Call Analysis Software
xgboost                 Extreme Gradient Boosting
xml2                    Parse XML
xtable                  Export Tables to LaTeX or HTML
xts                     eXtensible Time Series
yaml                    Methods to Convert R Data to YAML and Back
zipfR                   Statistical models for word frequency
                        distributions
zoo                     S3 Infrastructure for Regular and Irregular
                        Time Series (Z's Ordered Observations)

Packages in library '/panfs/pfs.local/software/install/MRO/3.3/microsoft-r/3.3/lib64/R/library':

KernSmooth              Functions for Kernel Smoothing Supporting Wand
                        & Jones (1995)
MASS                    Support Functions and Datasets for Venables and
                        Ripley's MASS
Matrix                  Sparse and Dense Matrix Classes and Methods
MicrosoftR              Microsoft R umbrella package
R6                      Classes with Reference Semantics
RUnit                   R Unit test framework
RevoIOQ                 Microsoft R Services Test Suite
RevoMods                R Functions Modified For Revolution R
RevoUtils               Microsoft R Utility Package
RevoUtilsMath           Microsoft R Services Math Utilities Package
base                    The R Base Package
boot                    Bootstrap Functions (Originally by Angelo Canty
                        for S)
checkpoint              Install Packages from Snapshots on the
                        Checkpoint Server for Reproducibility
class                   Functions for Classification
cluster                 "Finding Groups in Data": Cluster Analysis
                        Extended Rousseeuw et al.
codetools               Code Analysis Tools for R
compiler                The R Compiler Package
curl                    A Modern and Flexible Web Client for R
datasets                The R Datasets Package
deployrRserve           Binary R server
doParallel              Foreach Parallel Adaptor for the 'parallel'
                        Package
foreach                 Provides Foreach Looping Construct for R
foreign                 Read Data Stored by Minitab, S, SAS, SPSS,
                        Stata, Systat, Weka, dBase, ...
grDevices               The R Graphics Devices and Support for Colours
                        and Fonts
graphics                The R Graphics Package
grid                    The Grid Graphics Package
iterators               Provides Iterator Construct for R
jsonlite                A Robust, High Performance JSON Parser and
                        Generator for R
lattice                 Trellis Graphics for R
methods                 Formal Methods and Classes
mgcv                    Mixed GAM Computation Vehicle with GCV/AIC/REML
                        Smoothness Estimation
nlme                    Linear and Nonlinear Mixed Effects Models
nnet                    Feed-Forward Neural Networks and Multinomial
                        Log-Linear Models
parallel                Support for Parallel computation in R
png                     Read and write PNG images
rpart                   Recursive Partitioning and Regression Trees
spatial                 Functions for Kriging and Point Pattern
                        Analysis
splines                 Regression Spline Functions and Classes
stats                   The R Stats Package
stats4                  Statistical Functions using S4 Classes
survival                Survival Analysis
tcltk                   Tcl/Tk Interface
tools                   Tools for Package Development
utils                   The R Utils Package

As usual, if these don't work right, it's something I got wrong and will fix. Email me.

As of 2017-04-25, we have solved the problems of compiling Java and tk-based R packages. In other words, we find ourselves roughly back in the place where we were in October, 2016, or perhaps a little bit ahead of that. Now that the gcc issues have been addressed, we are able to stay up to date with changes in the cutting edge packages like Rcpp, Rstan and OpenMx.

If you need other packages, I'll install them if you email me.

If you launch R and you don't find packages (in the output of library(), for example), it probably means you forgot the module magic.

If you are having trouble with Rstan, the likely sources of trouble are 1) errors in your ~/.R/Makevars file, or 2) old packages in your home folder ~/R/ that do not cooperate with the new R and the other packages we make available.

Make a module script file

I have one more lesson. Instead of re-typing that stanza whenever it is needed, put those lines in a file. I just tested this. I put the module stanza in a file named rstats.sh, saved it in $HOME/bin, and made it executable ("chmod +x rstats.sh"). After that, the following seems to work:

source rstats.sh
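
For readers who have not made a file like that before, here is a minimal sketch. The helper name write_rstats is hypothetical and exists only for illustration; the three module lines are the stanza used for the crmda user group, and other groups may need a different module path.

```shell
# write_rstats DIR: hypothetical helper that saves the module stanza as DIR/rstats.sh
write_rstats() {
  mkdir -p "$1" || return 1
  cat > "$1/rstats.sh" <<'EOF'
# Load the cluster build of R into the current shell.
module purge
module use /panfs/pfs.local/work/crmda/tools/modules
module load Rstats/3.3
EOF
  chmod +x "$1/rstats.sh"
}
```

Calling write_rstats "$HOME/bin" creates the file. The file has to be sourced rather than executed, because the module command changes the environment of the shell it runs in, and a script started as a separate process cannot hand those changes back to your session.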

Building your own packages?

If we don't build packages for you, you have to build your own. Here is a lesson from the school of hard knocks. In the new CRC cluster, the memory limits of your sessions are strictly enforced. The compiler will often use more than 2GB memory. As a result, when you try to build a package inside R with "install.packages", you may get a vague message of failure. To protect yourself against that, it is wise to ask for an interactive session with more memory. I do this, for example:

$ msub -I -X -l nodes=1:ppn=1,pmem=6144m

That is sufficient to compile Rstan, which is the most intensive package I have tried to build.

Posted in Data Analysis | Leave a comment

Revolutions R in new acf cluster

The cluster runs on RedHat RHEL 6, which is too old to support the new versions of R. The principal weakness is the older gcc compiler in RHEL6.

In the cluster, however, we have access to much newer Intel MKL compiler and math libraries, so the R program, and the things on which it relies, can be built with the Intel compiler. It appears as though we can stay up to date with the troublesome R modules like Rstan, Rcpp, RcppArmadillo.

Wes Mason of ITTC worked this out for us. The scheme we are testing now can be accessed as follows.

For people in the crmda user group, try this interactively

$ module purge
$ module use /panfs/pfs.local/work/crmda/tools/modules
$ module load Rstats/3.3

After that, observe

$ R

 > library("rstan")
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev: 
5fa1e80eb817)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

We are still in a testing phase on this setup, and surely there will be problems. I do not understand what is necessary to compile new R packages with this setup. We don't want packages built with gcc if we can avoid it, because there is always a danger of incompatibility when shared libraries are built with different compilers.

But the key message is still encouraging. Even though the OS does not have the needed parts, there is a workaround.

Why is this "Revolution R"? The company Revolution R, which was later purchased by Microsoft, popularized the use of the Intel MKL on Ubuntu Linux. A version of R built with Intel's compiler was used, with permission, on Ubuntu in 2012. The version of R we are using now goes by the moniker "MRO". Can you guess what the M and the R stand for?

Making sure fonts are embedded in LaTeX thesis and dissertation documents

KU thesis rules require that all fonts used in the submitted PDF document must be embedded in the document itself. This is required to eliminate the problem that special symbols are not legible in the document on the receiver's computer.

Making sure all fonts are embedded appears to be not so easy across platforms. When I compile the ku thesis document, I notice the Wingding and symbols are not embedded.

However, this is not a flaw in pdflatex as it currently exists. It was a pdflatex flaw in the past. So far as I can tell, all fonts needed in the pdflatex run are embedded if you use a LaTeX distribution that is reasonably modern.

The major problem arises when a document includes other PDF documents, using \includegraphics{} for example. If those included documents are lacking in embedded fonts, then pdflatex does not fix that.

In my example document, before 20160503, the fonts were missing because they were not embedded in the R plots that are included in the example chapters. I had to go back and re-run the R code to make sure the fonts are embedded in the pdf files for the graphs. After that, the pdflatex output of the thesis template is fine.

You can check for yourself. Run

$ pdffonts thesis-ku.pdf
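
The column to watch in that output is "emb". As a rough sketch, assuming the usual pdffonts layout (two header lines, then one row per font that ends with the emb, sub, uni and object ID columns), a small filter can list only the fonts that are not embedded. The function name not_embedded is hypothetical.

```shell
# not_embedded: hypothetical filter; reads pdffonts output on stdin and
# prints the names of fonts whose "emb" column says "no".
# Assumes each row ends with: emb sub uni <object number> <generation>,
# so "emb" is the fifth field counting from the end of the line.
not_embedded() {
  awk 'NR > 2 && NF >= 5 && $(NF-4) == "no" { print $1 }'
}
```

Run it as: pdffonts thesis-ku.pdf | not_embedded. If nothing is printed, every font listed by pdffonts is embedded.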

If we don't fix the R output files before compiling the thesis itself, we are in a somewhat dangerous situation. People suggest using various magic wands to add fonts, but all of them seem to have major flaws. They either corrupt the quality of the output or destroy its internal structure.

I found ways to embed fonts using ghostscript. This converts the document over to ps and then back to pdf.

$ pdf2ps thesis-ku.pdf test.ps
$ ps2pdf14 -dPDFSETTINGS=/prepress -dEmbedAllFonts=true test.ps test.pdf

The bad news: 1) It destroys internal hyperlinks. 2) It DOES NOT embed fonts needed for material in embedded graphs (things inserted by \includegraphics, such as PDF produced by R).

See:

http://askubuntu.com/questions/50274/fonts-are-not-embedded-into-a-pdf

In my opinion, this is a bad outcome that should not happen. But it does.

As a result, it seems necessary to fix the individual PDF graphics files before compiling the larger thesis document.

This reminds me that at one point I had a post-processing script written for R Sweave sessions that would embed fonts in all pdf output files.

The shell script would cycle through all of the R output and embed fonts. Enjoy!

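for i in *.pdf; do
  base=$(basename "$i" .pdf)
  ## temporary output file, written next to the original
  basenew="${base}-newtemp.pdf"

  ##echo "$i base: $base new: $basenew"
  /usr/bin/gs -o "$basenew" -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite "$i"

  ## replace the original with the version that has embedded fonts
  mv -f "$basenew" "$i"
done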

The same can be achieved inside R. Each time a PDF is created, embed the fonts with the embedFonts() function. See ?embedFonts

ACF Cluster resource limits: home file space and file quota

User home folders are limited to 100GB and no customization is allowed. To our users who were previously limited to 20GB, that's great news. To the others who had 600GB allocations, that's a disaster. Oh, well. Just one among many.

When you log in on hpc.crc.ku.edu, a system status message appears. One report is the disk usage. Here's what I see today:

Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn

   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  65.04  85.00 100.00 :  136150  85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/work
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/scratch
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

In case you want to see the same output, the new cluster has a command called "mystats" which will display it again. In the terminal, run

mystats

In the output about my home folder, there is a "hard limit" at 100GB, as you can see. That is not adjustable in the current regime.

The main concern today is that I'm over the limit on the number of files. The limit is now 100,000 files but I have 136150. If I'm over the limit, I am not allowed to create new files. If I remain over the limit, the system can prevent me from doing my job.

Wait a minute. 136,150 files? WTH? Last time I checked, there were only 135,998 files and I'm sure I did not add any. Did some make babies? Do you suppose some R files found some C++ files and made an Rcpp project? (That's programmer humor. It knocks them out at conferences.)

I probably have files I don't need any more. I'm pretty sure that, for example, when I compile R, it uses tens of thousands of files. Maybe I can move that work somewhere else.

I wondered how I could find out where I have all those files. We asked and the best suggestion so far is to run the following, which sifts through all directories and counts the files.

for i in $(find . -maxdepth 1 -type d);do echo $i;find $i -type f |wc -l;done

The return shows directory names and file counts, like this:

./tmp
17365
./work
46
./.emacs.d
0
./src
25519
./texmf
1794
./packages
5041
./SVN
4321
./Software
12014
./.ccache
995
./TMPRlib-3.3
19316

I'll have to sift through that. Clearly, there are some files I can live without. I've got about 20K files in TMPRlib, which is a building spot for R packages before I put them in the generally accessible part of the system. .ccache is the compiler cache, I can delete those files. They just get regenerated and saved to speed up C compiler jobs, but I have to make a choice there.
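
The one-liner above trips over directory names that contain spaces, and its output is not sorted. Here is a sketch of a variation that copes with both. The function name count_files is hypothetical, and this is one way to do it, not the method the administrators suggested.

```shell
# count_files [DIR]: hypothetical helper. For each directory directly under
# DIR (default "."), print "<number of files><TAB><directory>", biggest first.
count_files() {
  top="${1:-.}"
  for d in "$top"/*/ "$top"/.[!.]*/; do
    [ -d "$d" ] || continue            # pattern matched nothing
    [ -L "${d%/}" ] && continue        # skip symbolic links to directories
    n=$(find "$d" -type f | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$n" "${d%/}"
  done | sort -rn
}
```

Run count_files "$HOME" and the directories holding the most files land at the top of the list, which is where the cleanup should start.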

So far, I've obliterated the temporary build information, but I remain over the quota. I'll show the output from "mystats" so that you can see the difference:

$ mystats
Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn
   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  63.26  85.00 100.00 :  113510  85000 100000 : /home/pauljohn uid:xxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/work
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/scratch
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

Oh, well, I'll have to cut/move more things.

The take-aways from this post are

  1. The CRC put in place a hard, unchangeable 100GB limit on user home directories.

  2. There is a limit of 100,000 on the number of files that can be stored within that. Users will need to cut files to be under the limit.

  3. One can use the find command in the shell to find out where the files are.

How to avoid the accidental buildup of files? The main issue is that compiling software (R packages) creates intermediate object files that are not needed once the work is done. It is difficult to police these files (at least it is for me).

I don't have time to write all this down now, but here is a hint. The question is where to store "temporary" files that are needed to compile software or run a program, but are not needed after that. In many programming chores, one can link the "build" folder to a faster, temporary storage device that is not in the network file system. In the past, I've usually used "/tmp/a_folder_i_create" because that is on the disk "in" the compute node. Disk access on the local disk is much faster than the network file system. Lately, I'm told it is even faster to put temporary material in "/dev/shm", but I do not have much experience with that. By a little clever planning, one can write the temporary files in a much faster memory disk that will be easily disposed of and, so far as I can see today, does not count within the file quota. This is not to be taken lightly. I've compared the time required to compile R using the network file storage against the local temporary storage. The difference is 45 minutes versus 15 minutes.
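
Here is a minimal sketch of that idea. It assumes nothing about the cluster beyond the two locations named above; the variable names and the "build-" prefix are hypothetical.

```shell
# Prefer the memory disk when it exists and is writable, otherwise use the local /tmp.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  base=/dev/shm
else
  base=/tmp
fi

# Make a private scratch folder and remove it when this shell exits,
# so intermediate object files never pile up in $HOME.
build_dir=$(mktemp -d "$base/build-XXXXXX")
trap 'rm -rf "$build_dir"' EXIT

echo "building in $build_dir"
# ... unpack the sources and compile inside "$build_dir" ...
```

Setting TMPDIR to that folder before starting R should steer R's own temporary files there as well, since R consults TMPDIR when it chooses its session temporary directory. Keep in mind that anything left in /dev/shm occupies memory on the node until it is deleted.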

Posted in Programming

Interactive sessions on HPC

Danger: new smaller memory default!

At the user meeting on April 12, we found out that requesting 1 core will automatically provide only 500MB of memory. This is a BIG change, because in the older cluster we received 2GB per core and that was generally sufficient. That is to say, we almost never specified memory.

The default interactive session is not likely to be sufficient, so it will be required to specify memory.

As a result, the command to ask for 1 node with 1 processor (core) on that node would be

msub -X -I -l nodes=1:ppn=1,pmem=2048m 

This asks for graphics X11 forwarding (-X). The memory can also be specified as "2gb".

If you only want 1 core on 1 node, the simpler notation would be to use the flag "procs".

msub -X -I -l procs=1,pmem=2048m 

To ask for several cores on 1 node (test multicore project), run

msub -X -I -l nodes=1:ppn=5,pmem=2048m

** Specify a queue **

Interactive jobs can be run on any queue. By default, they go to the user's nodes.

The default queue is displayed with 'mystats'. If you wish to run on a node that is not in your owner group, like a GPGPU node, you will then need to specify the sixhour queue and the node name. You will only have a maximum of 6 hours on this node. There is no time limit to your default queue.

msub -X -I -l nodes=1:ppn=5,pmem=2048m -q sixhour

One can specify a particular node, "g0001", with a request like:

msub -X -I -lnodes=g001:ppn=1 -q sixhour

CRC made a page regarding queues and has relocated it from http://crc.ku.edu/using-hpc#Submitting to http://crc.ku.edu/queues

Update 20170413

We requested a simpler way to launch the usual type of interactive session--one node, one core--as we had in the old cluster. The administrators created a script "qxlogin" which the user can run from the login node.

$ qxlogin
qsub: waiting for job 40565091.sched to start
qsub: job 40565091.sched ready

We suggest caution with this, since the new memory default limit is 500MB and CRMDA users have regularly reported frustration with unanticipated job failures.

In case you want to write your own login script, you can take an example from the new qxlogin, which I found is installed in /usr/local/bin on the new cluster.

$ cat /usr/local/bin/qxlogin
#!/bin/sh

ARGS=$@

/opt/moab/bin/msub -X -I -lnodes=1:ppn=1 $ARGS

If you want more interactive nodes, or more ppn, just change the 1's. To test that, suppose you save it as "qxlogin2", then run

$ sh qxlogin2

If you enjoy the result, save that file in your $HOME/bin directory, make it executable, and then it will be more generally available within your sessions. After that, there is no need to run "sh" before "qxlogin2". Try it out, let me know if there is trouble.
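
If you change the numbers often, it may be easier to build the resource request from arguments than to edit the script each time. This is only a sketch: the function name qx_resources and its defaults are hypothetical, and the 2048m default simply mirrors the memory request used in the msub examples above.

```shell
# qx_resources [PPN] [PMEM]: hypothetical helper that prints the value for msub's -l flag.
# Defaults: 1 core on 1 node, 2048m of memory per core (the new system default is only 500MB).
qx_resources() {
  printf 'nodes=1:ppn=%s,pmem=%s\n' "${1:-1}" "${2:-2048m}"
}
```

With that defined, msub -X -I -l "$(qx_resources 5)" asks for 5 cores on one node, and msub -X -I -l "$(qx_resources 1 6144m)" matches the larger request I use for compiling packages.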

Cluster user update

We will have a cluster update meeting on Friday at 10AM in Watson Room 440D (within the suite of the Digital Humanities group).

Today the Center for Research Computing announced the re-opening of the compute cluster. A number of features we have come to depend on were removed. All of the CRMDA documentation (http://crmda.ku.edu/computing) will need to be revised. This will take some time. These changes were not well publicized during the six-month-long runup to the cluster administration changeover, so we are playing catchup.

They have declined to support NoMachine GUI connections, and the cluster storage is not externally accessible via Windows Server or Network File System protocols. We will have to find ways to navigate around those changes.

The top priority right now is updating the hpc example collection,

https://gitlab.crmda.ku.edu/crmda/hpcexample

Most of that work has been kindly attended to by Wes Mason at KU ITTC.

Here is a copy of the announcement.

KU Community Cluster Users,

Over the course of the last few weeks we have been working to transition the administration of the KU Community Cluster to the Center for Research Computing (CRC). We have completed testing with a subset of users and we are now restoring access for all users who are part of an owner group. If you know someone in your group that did not get this announcement, please email crchelp@ku.edu.

We have kept the underlying legacy software environment the same to make this transition simpler, but have made some improvements and updates that you will need to be aware of to use the cluster. We will be building upon these initial improvements over the coming months to standardize, implement best practices, update and integrate the software stack, provide transparency of resources utilization, integrate with KU, and help you optimize your use of the cluster.

HOW DO I LOGIN TO THE CLUSTER?

We have integrated with KU's identity management system so you will use your KU username and password to access the cluster. We have 2 login nodes that you will randomly be assigned to when you login to the address:

> KU_USERNAME@hpc.crc.ku.edu

SOFTWARE

'env-selector' was removed and only 'module' is available to load different software packages.

When issuing the command:

> module avail

you will see the new software we have compiled that is optimized for the latest version of the CPUs in the cluster.

To see the software installed before this transition, you must enter:

> module load legacy

and then you can see all legacy software by entering the command:

> module avail

You must place these commands in your job submit scripts as well if you choose to use the legacy software.

QSUB REPLACED BY MSUB

'qsub' has been replaced with 'msub'. All your submit scripts will still work with 'msub'. The #PBS directives in your job submit scripts are also compatible with 'msub', but we suggest when you create new job submit scripts to use the #MSUB directives.

DATA

Your home directory now has a 100GB quota. We have integrated the cluster with KU's identity management system so your home directory also matches the KU home directory path (e.g., /home/a123b456).

All data from /research, /projects, /data, and if you had your own root directory (for example: /compbio), this has all been placed in

/panfs/pfs.local/work/<owner group>/<user>

If your owner group has used all their storage allocation or if your group does not have a storage allocation, some of your data had to be moved to $SCRATCH:

/panfs/pfs.local/scratch/<owner group>/<user>

We organized the data to better keep track of usage for owner groups. Scratch has been set up in the same manner. Some groups were previously allocated more storage than they purchased and you will see your quota for your $HOME, $WORK, and $SCRATCH directories when you log on. If you see any directory at 100%, then you must remove files before writing to it.

To see your quota, group, and queue stats at any time, run:

> mystats

on the submit nodes.

NO data was deleted. If something appears to be missing, please check all paths first, then contact crchelp@ku.edu.

QUEUES

Your default queue will be displayed when you log in. This is the queue you will run in if you do not specify a queue name. If you wish to run across the whole cluster, you must specify:

#MSUB -q sixhour

in your job script, or from the command line:

> msub -q sixhour

You may run for a maximum of 6 hours on the 'sixhour' queue, but your jobs can run across all nodes.
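A minimal sketch of a submit script for the 'sixhour' queue follows. Only the '-q sixhour' directive comes from the text above. The file name sixhour-job.sh, the job name, and the resource request are assumptions for illustration, and the 5-hour walltime is simply a value under the 6-hour limit.

```shell
# Write a hypothetical submit script to disk (the file name is an assumption).
cat > sixhour-job.sh <<'EOF'
#!/bin/bash
#MSUB -N sixhour-example
#MSUB -q sixhour
#MSUB -l nodes=1:ppn=1,walltime=05:00:00

# Placeholder workload: report which node the job landed on.
echo "Running on $(hostname)"
EOF
```

Because the queue is named inside the script, it can be submitted without the -q flag:

> msub sixhour-job.sh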

Most users will only have access to their owner group queue and the 'sixhour' queue. Others will be part of multiple groups and have access to other queues as well.

All of this information will be displayed when you log in to the cluster for at least the first few months after the cluster comes back online.

We are continuing to write documentation and help pages about the new setup of the cluster. These pages can be found at https://crc.ku.edu under the HPC tab. More will be added as time goes on, so check back often. We will also have an introduction to the cluster next Wednesday, March 8, at 10:30am during our regular monthly HPC meeting (location TBD).

We understand that change can sometimes be a little jarring, so if you have any questions, feel free to contact us at crchelp@ku.edu and we will get back to you as soon as we can.

Thank you, Center for Research Computing Team


Long-running Mplus Bootstrapping Example

In the high performance computing example archive, we've just inserted Example 05, a long-running multi-core Mplus exercise.

https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex05-Mplus-1

This one demonstrates how I suggest we ought to keep the data, code files, and output files in separate folders, even if we are using Mplus!

Special thanks to Chong Xing, of the KU Dept. of Communications, for the example and the real-life data set that goes with it. This explores mediation in a structural equation model with the Children of Immigrants data set.
