ACF Cluster resource limits: home file space and file quota

User home folders are limited to 100GB and no customization is allowed. To our users who were previously limited to 20GB, that's great news. To the others who had 600GB allocations, it's a disaster. Oh, well. Just one change among many.

When you log in to hpc.crc.ku.edu, a system status message appears. One part of it reports disk usage. Here's what I see today:

Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn

   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  65.04  85.00 100.00 :  136150  85000 100000 : /home/pauljohn uid:xxxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/work
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/scratch
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

In case you want to see the same output, the new cluster has a command called "mystats" which will display it again. In the terminal, run

mystats

In the output about my home folder, there is a "hard limit" of 100GB, as you can see. That is not adjustable in the current regime.

The main concern today is that I'm over the limit on the number of files. The limit is now 100,000 files, but I have 136,150. While I'm over the limit, I am not allowed to create new files, and if I stay over it, the system can prevent me from doing my job.

Wait a minute. 136,150 files? WTH? Last time I checked, there were only 135,998 files and I'm sure I did not add any. Did some make babies? Do you suppose some R files found some C++ files and made an Rcpp project? (That's programmer humor. It knocks them out at conferences.)

I probably have files I don't need any more. I'm pretty sure that, for example, compiling R creates tens of thousands of files. Maybe I can move that work somewhere else.

I wondered how to find out where all those files are. We asked around, and the best suggestion so far is to run the following, which walks through each top-level directory and counts the files in it.

for i in $(find . -maxdepth 1 -type d); do echo "$i"; find "$i" -type f | wc -l; done

The output shows directory names and file counts, like this:

./tmp
17365
./work
46
./.emacs.d
0
./src
25519
./texmf
1794
./packages
5041
./SVN
4321
./Software
12014
./.ccache
995
./TMPRlib-3.3
19316

I'll have to sift through that. Clearly, there are some files I can live without. I've got about 20K files in TMPRlib-3.3, which is a staging spot for R packages before I put them in the generally accessible part of the system. .ccache is the compiler cache; I can delete those files, since they exist only to speed up repeated C compiler jobs and get regenerated as needed, though I give up that speedup if I do.
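
If it helps to see the worst offenders first, a small variation of the loop above prints each count beside its directory and sorts the list. This is only a sketch, assuming GNU coreutils; adjust to taste:

for i in $(find . -maxdepth 1 -type d); do printf "%8d  %s\n" "$(find "$i" -type f | wc -l)" "$i"; done | sort -rn | head -20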

So far, I've obliterated the temporary build information, but I remain over the quota. I'll show the output from "mystats" so that you can see the difference:

$ mystats
Primary group: hpc_crmda
Default Queue: crmda

$HOME = /home/pauljohn
   <GB> <soft> <hard> : <files> <soft> <hard> : <path to volume> <pan_identity(name)>
  63.26  85.00 100.00 :  113510  85000 100000 : /home/pauljohn uid:xxxxx(pauljohn)

$WORK = /panfs/pfs.local/work/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/work
                       14T  1.6T   13T  12% /panfs/pfs.local/work/crmda/pauljohn

$SCRATCH = /panfs/pfs.local/scratch/crmda/pauljohn
Filesystem            Size  Used Avail Use% Mounted on
panfs://pfs.local/scratch
                       55T   37T   19T  67% /panfs/pfs.local/scratch/crmda/pauljohn

Oh, well, I'll have to cut/move more things.

The take-aways from this post are

  1. The CRC put in place a hard, unchangeable 100GB limit on user home directories.

  2. There is also a limit of 100,000 files within that space. Users over the limit will need to delete or move files to get under it.

  3. One can use the find command in the shell to find out where the files are.

How can we avoid the accidental buildup of files? The main issue is that compiling software (R packages, for instance) creates intermediate object files that are not needed once the work is done. It is difficult to police these files (at least it is for me).

I don't have time to write all this down now, but here is a hint. The question is where to store "temporary" files that are needed while compiling software or running a program, but not afterwards. For many programming chores, one can point the "build" folder at a faster, temporary storage device that is not in the network file system. In the past, I've usually used "/tmp/a_folder_i_create", because that is on the disk "in" the compute node, and access to the local disk is much faster than the network file system. Lately, I'm told it is even faster to put temporary material in "/dev/shm", though I don't have much experience with that yet. With a little clever planning, one can write the temporary files to a much faster memory-backed disk that is easily disposed of and, so far as I can see today, does not count against the file quota. This is not to be taken lightly: I've compared the time required to compile R on the network file storage against the local temporary storage, and the difference is 45 minutes versus 15 minutes.
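
For instance, a compile job might stage its build tree on the fast local (or memory-backed) storage and keep only the installed result in $WORK. The sketch below assumes a hypothetical source tarball location and R version; adjust the paths and the -j count to your situation:

# run on the compute node, e.g. inside an interactive or batch job
BUILD=/dev/shm/$USER/R-build          # or /tmp/$USER/R-build
mkdir -p "$BUILD" && cd "$BUILD"
tar xf "$WORK/src/R-3.3.1.tar.gz"     # hypothetical location of the source tarball
cd R-3.3.1
./configure --prefix="$WORK/R-3.3.1"  # installed files land in $WORK, not $HOME
make -j 4 && make install
cd && rm -rf "$BUILD"                 # the temporary objects never touch the home quota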

Posted in Programming | Tagged | Leave a comment

Interactive sessions on HPC

The "qxlogin" shortcut doesn't exist anymore, I've suggested that CRC should re-implement that convenience. Nevertheless, interactive seesions can be had. Riley Epperson kindly provided this information.

Interactive jobs can be run on any of the queues. The owner-group queues consist of the nodes that the owner groups purchased.

The new command would be:

msub -X -I -lnodes=1:ppn=1

That will run in your default queue, which is displayed by 'mystats'. If you wish to run on a node that is not in your owner group, such as a GPGPU node, you need to specify the sixhour queue and the node name. You will have a maximum of 6 hours on such a node; there is no time limit in your default queue.

That command would be:

msub -X -I -lnodes=g005:ppn=1 -q sixhour
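
If you need more than one core, or want to state a wall-clock limit for an interactive session in your default queue, the usual Torque/Moab resource list appears to extend naturally. This is only a sketch; the core count and walltime are illustrative:

msub -X -I -l nodes=1:ppn=4,walltime=4:00:00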

CRC made a page regarding queues at http://crc.ku.edu/queues

Posted in Data Analysis | Leave a comment

Cluster user update

We will have a cluster update meeting on Friday at 10AM in Watson Room 440D (within the suite of the Digital Humanities group).

Today the Center for Research Computing announced the re-opening of the compute cluster. A number of features we have come to depend on were removed. All of the CRMDA documentation (http://crmda.ku.edu/computing) will need to be revised, and that will take some time. These changes were not well publicized during the six-month-long runup to the cluster administration changeover, so we are playing catch-up.

They have declined to support NoMachine GUI connections, and the cluster storage is no longer externally accessible via Windows file sharing or Network File System protocols. We will have to find ways to work around those changes.

The top priority right now is updating the hpc example collection:

https://gitlab.crmda.ku.edu/crmda/hpcexample

Most of that work has been kindly attended to by Wes Mason at KU ITTC.

Here is a copy of the announcement.

KU Community Cluster Users,

Over the course of the last few weeks we have been working to transition the administration of the KU Community Cluster to the Center for Research Computing (CRC). We have completed testing with a subset of users and we are now restoring access for all users who are part of an owner group. If you know someone in your group that did not get this announcement, please email crchelp@ku.edu.

We have kept the underlying legacy software environment the same to make this transition simpler, but have made some improvements and updates that you will need to be aware of to use the cluster. We will be building upon these initial improvements over the coming months to standardize, implement best practices, update and integrate the software stack, provide transparency of resource utilization, integrate with KU, and help you optimize your use of the cluster.

HOW DO I LOGIN TO THE CLUSTER?

We have integrated with KU's identity management system, so you will use your KU username and password to access the cluster. We have 2 login nodes, one of which you will be randomly assigned to when you log in at the address:

> KU_USERNAME@hpc.crc.ku.edu

SOFTWARE

'env-selector' was removed and only 'module' is available to load different software packages.

When issuing the command:

> module avail

you will see the new software we have compiled that is optimized for the latest version of the CPUs in the cluster.

To see the software installed before this transition, you must enter:

> module load legacy

and then you can see all legacy software by entering the command:

> module avail

You must place these commands in your job submit scripts as well if you choose to use the legacy software.

QSUB REPLACED BY MSUB

'qsub' has been replaced with 'msub'. All your submit scripts will still work with 'msub'. The #PBS directives in your job submit scripts are also compatible with 'msub', but we suggest using #MSUB directives when you create new job submit scripts.

DATA

Your home directory now has a 100GB quota. We have integrated the cluster with KU's identity management system so your home directory also matches the KU home directory path (e.g., /home/a123b456).

All data from /research, /projects, /data, and any group-specific root directories (for example, /compbio) has been placed in

/panfs/pfs.local/work/<owner group>/<user>

If your owner group has used all their storage allocation or if your group does not have a storage allocation, some of your data had to be moved to $SCRATCH:

/panfs/pfs.local/scratch/<owner group>/<user>

We organized the data this way to better keep track of usage by owner groups. Scratch has been set up in the same manner. Some groups were previously allocated more storage than they purchased. You will see the quotas for your $HOME, $WORK, and $SCRATCH directories when you log on; if you see any directory at 100%, you must remove files before writing to it.

To see your quota, group, and queue stats at anytime, run:

> mystats

on the submit nodes.

NO data was deleted. If you see that you are missing something, please check all paths first, then contact crchelp@ku.edu.

QUEUES

Your default queue will be displayed when you log in. This is the queue you will run in if you do not specify a queue name. If you wish to run across the whole cluster, you must specify:

#MSUB -q sixhour

in your job script or from command line:

> msub -q sixhour

You may only run for a maximum of 6 hours on the 'sixhour' queue, but your jobs can go across all nodes.

Most users will only have access to their owner group queue and the 'sixhour' queue. Others will be part of multiple groups and have access to other queues as well.

All of this information will be displayed when you log in to the cluster, at least for the first few months after coming back online.

We are continuing to write documentation and help pages about the new setup of the cluster. These pages can be found at https://crc.ku.edu under the HPC tab; more will be added as time goes on, so check back often. We will also have an introduction to the cluster next Wednesday, March 8, at 10:30am during our regular monthly HPC meeting (location TBD).

We understand that change can sometimes be a little jarring, so if you have any questions, feel free to contact us at crchelp@ku.edu and we will get back to you as soon as we can.

Thank you, Center for Research Computing Team

Posted in Data Analysis | Leave a comment

Long-running Mplus Bootstrapping Example

In the high performance computing example archive, we've just inserted Example 05, a long-running multi-core Mplus exercise.

https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex05-Mplus-1

This one demonstrates my suggestion that we keep the data, code files, and output files in separate folders, even when we are using Mplus! A rough sketch of the layout appears below.
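
Something along these lines (the exact folder names in Ex05 may differ; check the repository):

Ex05-Mplus-1/
    data/       input data sets
    code/       Mplus .inp files and the cluster submit script
    output/     Mplus .out files and other results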

Special thanks to Chong Xing, of the KU Dept. of Communications, for the example and the real-life data set that goes with it. It explores mediation in a structural equation model with the Children of Immigrants data set.

Posted in Data Analysis | Leave a comment

Text Analysis with R

We are having a little practice session. Here are quick notes about working through the examples in Matthew L. Jockers' fine book, Text Analysis with R for Students of Literature.

Browse here:

http://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature

and download the zip file that it points at here:

http://www.matthewjockers.net/wp-content/uploads/2014/05/TextAnalysisWithR.zip

Save that zip file INSIDE a project folder. My folder is called R-text.

Unzip that package! (Don't just let the file manager peer inside it. You need to extract it.) It creates a directory called:

TextAnalysisWithR

Inside, there is a directory structure including text files and R code. Use the file manager to change into the directory "start.up.code" and then into a chapter folder, such as "chapter.3".

If you open the R file in an R-aware editor (e.g., RStudio), the code won't run as it is, but it is easy to fix. Change the path to the data file by inserting "../../" at the beginning, like so:

text.v <- scan("../../data/plainText/melville.txt", what="character", sep="\n")

After doing that, you can step through the example code line by line.

It appears to me that the starter code covers all of the basic data manipulation work; it does not include the code to produce the graphs.
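
If you want to see a picture right away, something like the following should work after the scan() line above. This is only a rough sketch of the kind of graph the chapter builds, not the book's code, and the crude tokenizer here differs from Jockers' approach:

words.v <- tolower(unlist(strsplit(text.v, "\\W+")))  ## split on non-word characters
words.v <- words.v[words.v != ""]                     ## drop empty strings
freq.t <- sort(table(words.v), decreasing = TRUE)     ## word frequency table
barplot(freq.t[1:10], las = 2, main = "Ten most frequent words")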

We can explore that when we meet together...

Posted in Data Analysis, R | Leave a comment

Portable Parallel Seeds for R (and other cluster updates)

We've been reworking the high performance computing examples so that they line up with the latest and greatest advice about how to organize submissions on the ACF cluster computing system. Please update your copy of the hpcexample archive (see http://crmda.ku.edu/parallel-programs).

In the process, we noticed that the package could use some updates in our "portableParallelSeeds" package for R. Because this package alters the random number generator structure in the R environment, we are not releasing it to the CRAN system. It can be installed from our KU server, however. We suggest you try the following:

CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
install.packages("portableParallelSeeds")

Remember this: if you want to get updates by running "update.packages()" inside R, you must first run the first three lines above so that your system looks for packages on KRAN as well as CRAN.
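
In other words, a maintenance session might look like this (the same repository lines as above, with update.packages() afterwards):

CRAN <- "http://rweb.crmda.ku.edu/cran"
KRAN <- "http://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
update.packages(ask = FALSE)  ## now KRAN packages are included in the update check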

The portableParallelSeeds package is delivered with two vignettes (essays) named "PRNG-basics" and "pps". To see all about it, run

help(package = "portableParallelSeeds")

Along with the package, a prototype design for Monte Carlo simulations is included. It lives in the installed package's "examples" directory, in a file named "paramSweep-1.R".
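
One way to locate and open that file from inside R (a small sketch; if the path comes back empty, check your installation):

pps.example <- system.file("examples", "paramSweep-1.R", package = "portableParallelSeeds")
file.show(pps.example)  ## or file.edit(pps.example) to open it for editing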

Posted in Data Analysis | Leave a comment

Do not let this happen to your survey project

[image: survey_joke]

Bring in your survey and let us take a look! Perhaps one of our GRAs can save you from this ignominious fate. We offer open (free) walk-in consulting. Find out more on the Open Consulting Page.

Posted in Data Analysis | Leave a comment

Cluster Computing Updates

Hot off the press! Advanced Computing Facility Cluster news.

In the past two months, several users have noticed that some jobs they submit take a long, long time to start. Some have noticed that the same job will fail or succeed apparently at random.

This came to a head on Tuesday; we've exerted ourselves and now have a mostly understandable set of answers. I promised news as soon as I understood it, and here it is. We will have most of this written up in detail in the hpcexample collection (https://gitlab.crmda.ku.edu/crmda/hpcexample) within a week.

To explain the intermittent problems, there are several separate items I need to mention. There are several balls in the air. Or are they spinning plates?

1. Our nodes are shared, to a certain extent.

That's the "community" in community cluster.

I asked for our nodes to be reserved so we can get quick access to them, but there is a detail I did not understand.

Our nodes in the cluster--24 systems, each with 20 cores--are open to short (6 hours or less) jobs from other users in the cluster. If one of those other users gets active, submitting thousands of jobs, then our jobs can wait as long as 6 hours before they start. Our jobs have "first priority" as currently running jobs end.

While it is technically possible to block those other users, I believe it would be impolite to do so at the current moment. We may revisit that decision in the future.

2. We are asking for resources in the wrong way.

We have been following principles that were established on the "hpc.crmda.ku.edu" cluster we used previously. That cluster served only CRMDA and its affiliates; it did not have such a large audience as the current cluster.

On hpc, if we wanted 19 cores, we were told to let the cluster scatter them across nodes if it wanted. The submission for a 19-core job would read like this:

#PBS -l nodes=19:ppn=1

However, in the new cluster that is more likely to cause a total failure, because our jobs can get scattered across many different nodes on which strangers are also running jobs.

Those strangers might not be well behaved. I mean, their jobs might not be well behaved. They may request 2GB memory, but then actually use 20GB memory. By gobbling up all memory, they cause our jobs to fail.

Hence, if we want a job using 19 cores, in the brave new world it is smarter and faster to run

#PBS -l nodes=1:ppn=19:ib

That is to say, if you have the whole node, or most of its cores, you are protected from bad strangers. In the old hpc cluster, that would have made a job slower to start. In the new cluster, it is the best bet.

I inserted that special bit, ":ib", and the next sections explain why.

We are getting revised, detailed instructions on how to ask for more cores, and how to specify memory. I'll follow up with information about that later.
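
In the meantime, the usual Torque/Moab way to state memory is in that same -l resource list. The following is only my best guess pending the official instructions, and the numbers are illustrative; pmem is memory per process, while mem would cap the job as a whole:

#PBS -l nodes=1:ppn=19:ib
#PBS -l pmem=2gb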

3. Please understand this heterogeneous "transport layer" problem.

In the old hpc cluster, about 1/3 of the nodes had Infiniband. Jobs would fail all the time because of that mix: nodes that had Infiniband could not interact with nodes that did not. We learned to insert ":noib" to prevent the Infiniband layer from being used; all of our nodes had Ethernet connections, so they could fall back to that slightly slower method.

In the new cluster, we thought all nodes had Infiniband, so we told people to "get rid of :noib". But some people did not get the message.

Now, however, it appears that in the new cluster it is not true that all nodes have Infiniband. Possibly the slow/bad ones don't. Keep that in mind as you read the next point.

4. In the new cluster, we are asking for resources in an especially stupid way (as a result of the issue in item 3).

Some of us are still using submission scripts from the old hpc cluster, where we had "noib":

#PBS -l nodes=19:ppn=1:noib

Here's why that is especially wrong. All of the nodes that CRMDA controls now DO have Infiniband, so if you insert "noib", you are PREVENTING the job from starting on the nodes where we actually have a reservation. That means your jobs are orphaned in the cluster, waiting for a kind stranger to take them in.

The stranger node that lets you in is likely to be a slow, old one, which explains the "my job takes forever" problem some of you have noticed recently.

Instead, FOR SURE, we need

#PBS -l nodes=19:ppn=1:ib

5. Heterogeneous error type 1.

Suppose you don't put ":ib". You launch the job and ask for 50 cores. The first one has Infiniband, but the available node it finds is not, then you get this crash.

This should look familiar to a few of you

If the node chosen to be your Rmpi master is on Infiniband, and it finds other nodes that are not, we see a crash like so:

  > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
[[41508,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
   Host: n301

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 22368 on node n301 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

6. We need to get better at module and R package management.

User scripts can specify modules. That is possible, and it will fix this problem.

But it looks like an uphill battle for me to keep all of that straight across all of the different users. We'll try to handle it on our end so you don't have to bother.

You'll know if you hit this problem.

Some jobs fail like this:

 > library(snow)
 >
 > p <- rnorm(123, m = 33)
 >
 > CLUSTERSIZE <- 18
 >
 > cl <- makeCluster(CLUSTERSIZE, type = "MPI")
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 110202 on node n404 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

That's caused by a shared library version mismatch.

The R 3.2.3 packages are built against the OpenMPI C library. When I built those packages, the OpenMPI version was 1.6.5.

Since then, OpenMPI has had updates, but I did not rebuild the packages. I thought the R-3.2.3 libraries and setup were going to remember OpenMPI 1.6.6.

However....

Our default module, Rstats, did not tie R-3.2.3 to a particular OpenMPI version. Thus, sometimes, when a job ran and used a particular OpenMPI feature, we had a compiled-library mismatch.

We can fix that ourselves by altering the submission script, say by specifying

module purge
module load openmpi/1.6.5
module load Rstats/3.2.2

However, I don't want us to have to be that detailed. The fix, which will be in place very soon (by Monday for sure, maybe sooner), is:

1. The cluster admin will tie together the R choice with the OpenMPI choice.

2. I'll rebuild the R packages for R 3.3.1 with the newer OpenMPI, and then the admins will update the default module for crmda_remote users (that's you).
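
Putting items 2, 4, and 6 together, a submit script for an 18-worker snow/Rmpi job would look roughly like the sketch below. This is not an official template: the job name, walltime, module versions, and R script name are illustrative, and the launch line may need adjusting once the module fix lands.

#!/bin/sh
#PBS -N snow-example
#PBS -l nodes=1:ppn=19:ib
#PBS -l walltime=24:00:00

## start from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

## tie R to the OpenMPI it was built against (item 6)
module purge
module load openmpi/1.6.5
module load Rstats/3.2.2

## one launching process; makeCluster(18, type = "MPI") inside the R
## script spawns the 18 workers on the cores reserved above
orterun -np 1 R --vanilla -f my-snow-job.R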

Posted in Computing | Tagged | Leave a comment

kutils Package for R: New Updates Available

The KRAN server, an R repository hosted by the Center for Research Methods and Data Analysis at the University of Kansas, offers packages prepared for researchers who use R. A suite of tools for research project management, dubbed "kutils", is undergoing rapid development. It includes functions that can initialize projects, scan input data, create quick overviews of the information in the data, and guide the recoding and data refactoring process. The package includes the Variable Key System, a framework for importing and revising data in a team/project-oriented manner. An essay about this is available (and is also included with the package): The Variable Key Data Management Framework.

In case you would like to try this out, the KRAN server is available. It will be necessary to pair it with the general-purpose CRAN server (and any other repositories you currently use, such as OmegaHat or Bioconductor). We suggest trying this R code:

CRAN <- "https://rweb.crmda.ku.edu/cran"
KRAN <- "https://rweb.crmda.ku.edu/kran"
options(repos = c(KRAN, CRAN))
## We suggest installing updates, but this next step is not required
update.packages(ask = FALSE, checkBuilt = TRUE)
## Then install our new package
install.packages("kutils", dep = TRUE)

In case you use Bioconductor, for example, here is the way we integrate it with the update process:

CRAN <- "https://rweb.crmda.ku.edu/cran"
KRAN <- "https://rweb.crmda.ku.edu/kran"
BIOC <- "https://www.bioconductor.org/packages/3.3/bioc"
options(repos = c(KRAN, CRAN, BIOC))
## We suggest installing updates, but this next step is not required
update.packages(ask = FALSE, checkBuilt = TRUE)
install.packages("kutils", dep = TRUE)

After running that, using kutils is as easy as:

library(kutils)

The functions we suggest you check first are peek and initProject. The package also includes the first fully functional version of the Variable Key, a data management framework developed within CRMDA. The variable key offers an enhanced project-management workflow along with an easier-to-use, tabular system of notation that makes it easier for non-technicians to guide and supervise research projects. There is a vignette about the Variable Key provided with the package. After library(kutils), just run

vignette("variablekey")

and a PDF should be displayed. If R cannot find your system's PDF viewer, you'll get an error message that points you in the right direction. The most recent changes in the package concern the Variable Key: we have streamlined the "round trip" research process. A researcher approaches data importation and recoding in these steps; a sketch of the whole sequence in R code appears after this list:

  1. Run keyTemplate. This scans the data and creates a table that can be used for recoding.
  2. Edit the key template document. This can be done in a spreadsheet program (MS Excel) or in a text editor such as Emacs, Notepad++, Sublime Text, or Textmate (basically, any programmer's editor, NOT Microsoft Word).
  3. Run keyImport. This reads the requested data revisions from the edited template.
  4. Run keyApply. The changes requested in the new key are applied to the data frame under consideration.
  5. Run keyUpdate. This scans revised data, checks for new variables and new values, and incorporates them into the previously prepared variable key.

In our original design, we thought the four-step process was the end of it. However, we have run into cases in which the Variable Key system exposes problems in the original data frame that lead the data owners to revise that data frame. Our original thought was that such teams would simply repeat the four steps--create a new key template, revise it, import it, and apply it--but when a key includes hundreds of variables, that implies a lot of repeated work. The most recent version of kutils therefore adds keyUpdate as a fifth step to address this situation.
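
Here is a rough sketch of the whole round trip, assuming a data frame named dat and default arguments; the real functions accept more options than shown here, so consult the help pages and the vignette before relying on it:

library(kutils)
key0 <- keyTemplate(dat, file = "dat-key.csv")  ## 1. scan the data, write a template
## 2. edit dat-key.csv in a spreadsheet or programmer's editor, then
key1 <- keyImport("dat-key.csv")                ## 3. read back the edited key
dat2 <- keyApply(dat, key1)                     ## 4. apply the requested recodes
## 5. if the data owners later send a revised data frame (here, dat.revised),
##    fold its new variables and values into the existing key
key2 <- keyUpdate(key1, dat.revised)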

While keyUpdate is the newest function in a still-developing package, we encourage researchers to try it and let us know how it works.

Update 2016-10-28: A zip file, variablekey-anes-20161028, is available with an example that we have used to test the variable key setup. It revealed some challenges with "fancy quotes" that we need to solve in the future: if a person edits the key in an editor that inserts slanted quotes, the re-import fails because slanted quotes are not recognized.

For troubleshooting purposes, here is the sessionInfo output:

> sessionInfo()
    R version 3.3.1 (2016-06-21)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.1 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base
    
    other attached packages:
    [1] kutils_0.31
    
    loaded via a namespace (and not attached):
    [1] plyr_1.8.4     tools_3.3.1    Rcpp_0.12.7    xtable_1.8-2   openxlsx_3.0.0
Posted in Uncategorized | Leave a comment

Best of Lawrence Chooses CRMDA

The CRMDA team has been selected for this prestigious honor.

Posted in Uncategorized | Leave a comment