Getting Started

This guide is designed for researchers who are new to the UVA HPC System. Throughout this guide we use the placeholder mst3k to represent the user’s login ID; substitute your own login ID wherever mst3k appears.

System Overview

Rivanna provides a high-performance computing environment for all user levels.  

Number of cores per node    RAM per node    Number of nodes
20                          128 GB          240
28                          256 GB          25
16                          1 TB            5

All nodes share a Lustre filesystem for temporary storage, with a total capacity of 1.4 PB shared by all users. Each user is assigned space in /scratch/$USER with a default quota of 10 TB. Groups may lease permanent storage from ITS, which can be mounted on Rivanna.
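
For example, a few standard shell commands for working in scratch space (the path follows the convention above; $USER expands to your login ID):

cd /scratch/$USER                 # move to your scratch directory
du -sh /scratch/$USER             # report how much scratch space your files are using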

Accessing the System

Allocations

Time on Rivanna is allocated in Service Units (SUs); one SU corresponds to one core-hour. Allocations are managed through MyGroups accounts, with the group owner serving as the Principal Investigator (PI) of the allocation. Faculty, staff, and postdoctoral associates are eligible to be PIs. Students, both graduate and undergraduate, must be members of an allocation group sponsored by a PI.

Researchers are eligible for Standard allocations of 50,000 SUs with a short justification; renewals also require a brief description of results achieved, including publications and grants. Larger Administrative allocations may be requested by submitting a short proposal: PIs affiliated with the College of Arts and Sciences or the School of Engineering and Applied Science file the proposal directly, while PIs affiliated with other units should submit their allocation requests to the Data Sciences Institute. Time can also be purchased through external funding at a rate determined by the HPC Steering Committee. Non-research staff such as LSPs are eligible for a 5,000 SU Trial allocation.

Trial, Standard, and Administrative allocation grants are for a limited time period and may be renewed. Purchased time does not expire during the active interval of the grant. Details and application forms can be found on our allocations page.

Each PI is ultimately responsible for managing the roster of users in the group, although PIs may delegate day-to-day management to one or more other members. When users are added or removed, the corresponding accounts are created or removed automatically. PIs who must keep projects separated, for example to distinguish externally funded work from internally granted time, may have more than one allocation group.

If a group exhausts its allocation, all members of the group will be unable to submit new jobs.  If an individual user exceeds the /scratch filesystem limitations, only that user will be blocked from submitting new jobs on any partition.

Logging In

The system is accessed through ssh (Secure Shell) connections to the hostname rivanna.hpc.virginia.edu. Your password is your Eservices password. We recommend MobaXterm for Windows users. macOS and Unix users may connect through a terminal using the command ssh -Y mst3k@rivanna.hpc.virginia.edu. Users working from off Grounds must run the UVA Anywhere VPN client.

Users who wish to run X11 graphical applications may prefer the FastX remote desktop client.

Software Access

The Modules Environment

User-level software is installed into a shared directory, /share/apps. The modules system lets users configure their environment to access specific software packages, or even specific versions of a package. The most commonly used commands are listed below, followed by a short example session:

  • module avail (prints a list of all software packages available through a module)
  • module avail <package> (prints a list of all versions available for <package>)
  • module load <package> (loads the default version of <package>)
  • module load <package>/<version> (loads the specific <version> of <package>)
  • module unload <package> (removes <package> from the current environment)
  • module purge (removes all loaded modules from the environment)
  • module list (prints a list of modules loaded in the user’s current environment)
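
A typical session might look like the following; the package name and version number are illustrative only and should be checked with module avail:

module avail gcc             # list the versions of gcc available as modules
module load gcc/9.2.0        # load a specific version (substitute a version shown by module avail)
module list                  # confirm which modules are currently loaded
module purge                 # return to a clean environment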

For more details about modules see the documentation.

Software Requests

Software accessed through modules is available to all users. Users may install their own software to their home directory or to shared leased space, provided they are legally permitted to do so, either because it is open source or because they have obtained their own license. Under no circumstances may user-installed software require root privileges to install or to run. User software may run daemons (services) provided that those services do not interfere with other users.
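
As a sketch only, a typical open-source package distributed with an autotools build system could be installed into a home directory as follows; the package name, version, and install prefix are hypothetical:

tar xzf mypackage-1.0.tar.gz                 # unpack the (hypothetical) source distribution
cd mypackage-1.0
./configure --prefix=$HOME/apps/mypackage    # install under the home directory; no root privileges needed
make && make install
export PATH=$HOME/apps/mypackage/bin:$PATH   # make the new binaries visible to the shell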

Users may petition ARCS to install software into the common directories. Each request will be considered on an individual basis and may be granted if it is determined that the software will be of wide interest. In other cases ARCS may help users install software into their own space.

Running Jobs

Submitting Jobs to the Compute Nodes

Rivanna resources are managed by the SLURM workload manager. The login host rivanna.hpc.virginia.edu consists of multiple dedicated frontend servers, but their use is restricted to editing, compiling, and running very short test processes. All other work must be submitted to SLURM to be scheduled onto a compute node.

SLURM divides the system into partitions, which provide different combinations of resource limits, including wallclock time, aggregate cores for all running jobs, and charging rates against the SU allocation. There is no default partition; users must specify one in each job script.

Users may run the command queues to determine which partitions are enabled for them.  This command will also show the limitations in effect on each queue.

Users may run the command allocations to view the allocation groups to which they belong and to check their balances.
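
For example, typed at the command line on a frontend:

queues          # show the partitions available to you and their limits
allocations     # show your allocation groups and remaining balances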

High-Performance Queues

Jobs submitted to these partitions are charged against the group’s allocation.

  • parallel: jobs that can take advantage of the InfiniBand interconnect. 
  • request: like parallel but users may access all high-performance cores.  Limited to intervals following maintenance.
  • largemem: jobs that require more than one core’s worth of memory per core requested.
  • serial: single-core jobs that need higher-speed access to temporary storage.
  • gpu: access to two Kepler-equipped nodes for testing general-purpose GPU (GPGPU) codes.

Job Management

SLURM jobs are shell scripts consisting of a preamble of directives, or pseudocomments, that specify the resource requests and other information for the scheduler, followed by the commands required to load any needed modules and run the user’s program. Directives begin with #SBATCH followed by options. Most SLURM options have two forms: a short form consisting of a single letter preceded by a single hyphen and followed by a space, and a long form preceded by a double hyphen and followed by an equals sign (=). In SLURM a “task” corresponds to a process; threaded applications should therefore request one task and specify the number of cpus (cores) per task.

Frequently-used SLURM Options:

Number of nodes requested:

#SBATCH -N <N>
#SBATCH --nodes=<N>

Number of tasks per node:

#SBATCH --ntasks-per-node=<n>

Total tasks (processes) distributed across nodes by the scheduler:

#SBATCH -n <n>
#SBATCH --ntasks=<n>

Number of cores per task (SLURM refers to a core as a "cpu"); this directive ensures that all of a task's cores are assigned on the same node, which is necessary for threaded programs (see the threaded example sketch after the parallel example below):

#SBATCH -c <n>
#SBATCH --cpus-per-task=<n>

Wallclock time requested:

#SBATCH -t d-hh:mm:ss
#SBATCH --time=d-hh:mm:ss

Total memory request in megabytes per node (the default is 1000 MB, about 1 GB):

#SBATCH --mem=<M>

Memory request in megabytes per core (may not be used with --mem):

#SBATCH --mem-per-cpu=<M>

Request partition <part>:

#SBATCH -p <part>
#SBATCH --partition=<part>

Specify the account to be charged for the job (this should be present even for economy jobs; the account name is the name of the MyGroups allocation group to be used for the specified run):

#SBATCH -A <account>
#SBATCH --account=<account>

Example Serial Job Script:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 12:00:00
#SBATCH -p serial
#SBATCH -A mygroup

# Run program
./myprog myoptions

Example Parallel Job Script:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -t 12:00:00
#SBATCH -p parallel
#SBATCH -A mygroup

# Run parallel program over InfiniBand using MVAPICH2

module load mvapich2/intel
mpirun ./xhpl > xhpl_out
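
Example Threaded Job Script (a sketch only, based on the --cpus-per-task guidance above; the program name, core count, and partition placeholder should be adapted to your job):

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH -t 12:00:00
#SBATCH -p <part>
#SBATCH -A mygroup

# Tell OpenMP to use the cores SLURM assigned to this task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_prog myoptions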

Submitting a Job and Checking Status

Once the job script has been prepared it is submitted with the sbatch command:

sbatch myscript.slurm

The scheduler returns the job ID, which is how the system references the job subsequently.

Submitted batch job 36598

To check the status of the job, the user may type

squeue -u mst3k

Status is indicated with PD for pending, R for running, and CG for completing (exiting).

By default SLURM saves both standard output and standard error into a file called slurm-<jobid>.out.  This file is created in the submit directory and is appended during the run.
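
For example, to watch the output of the job submitted above as it runs:

tail -f slurm-36598.out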

Canceling a Job

Queued or running jobs may be canceled with

scancel <jobid>

Note that user-canceled jobs are charged for the time used when applicable.

Usage Policies

Research computing resources at the University of Virginia are for use by faculty, staff, and students of the University and their collaborators in academic research projects.  Personal use is not permitted.  Users must comply with all University policies for access and security to University resources.  The HPC system has additional usage policies to ensure that this shared environment is managed fairly to all users.

Frontends

Exceeding the limits on the frontends will result in the offending process(es) being killed. Repeated violations will result in a warning, and users who ignore warnings risk losing access privileges.

Software Licenses

Excessive consumption of commercial software licenses, whether in duration or in number, that system and/or ARCS staff determine to be interfering with other users' fair use of the software will subject the violator's processes or jobs to termination. Staff will attempt to issue a warning before terminating processes or jobs, but an inadequate response from the violator will not be grounds for permitting the processes or jobs to continue.

Inappropriate Usage

Any violation of the University’s security policies, or any behavior that is considered criminal in nature or a legal threat to the University, will result in the immediate termination of access privileges without warning.