Storage and Data Guide
Storage Areas
Name | Mount Path | Quota Policy | Purge Policy | Snapshots
---|---|---|---|---
home | /home | 5GB soft/7GB hard; 7-day grace; no inode quota | No purge | Yes; daily for 30 days
project | /projects | Based on investment; 20 million inodes per user | No purge | Yes; daily for 30 days
scratch | /scratch | 10TB/user; no inode quota | Daily purge of files older than 30 days | No
local scratch | /scratch.local | None | Following job completion | No
Home
The /home area of the filesystem is where you land when you log into the cluster via SSH, and it is where your $HOME environment variable points.
This area has a fairly small quota and is meant to contain your configuration files, job output/error files, and smaller software installations.
Your /home area of the filesystem is automatically provisioned when your account is created, and the space is provided by the program.
It is not possible to request an expansion of the home directory quota.
The 7-day grace period on /home means that if your home directory usage stays above the 5GB soft quota for 7 days, the directory stops accepting new writes until usage drops below the 5GB threshold.
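If writes to your home directory start failing, a quick du survey can show what is consuming the quota. The commands below are a generic sketch; the globs simply surface the largest top-level items under $HOME:

```shell
# Summarize total home usage, then the five largest top-level items.
# Generic sketch; errors from unreadable or unmatched paths are ignored.
du -sh "$HOME" 2>/dev/null
du -sh "$HOME"/.[!.]* "$HOME"/* 2>/dev/null | sort -h | tail -n 5
```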
Project
The /projects area of the filesystem is where an investment group's (single faculty member, lab group, department, or an entire college) storage capacity resides.
You can have access to multiple project subdirectories if you are a member of multiple investment groups and have been granted access to the space by the investment's PI or Tech Rep.
This storage area is where the bulk of the filesystem’s capacity is allocated.
Project location for allocations starting on/after Sept. 2023: /projects/illinois/$college/$department/$pi_netid (example: /projects/illinois/eng/physics/bobsmith)
Project location for legacy allocations (pre-Sept. 2023): /projects/$custom_chosen_name (example: /projects/smith_lab)
Scratch
The /scratch area of the filesystem is where you can place data while it is under active work.
The scratch area is provisioned by the Campus Cluster program and is accessible to all users.
As noted in the summary table, files older than 30 days are purged from this space, based on a file’s last access time. The admin team maintains various tools to detect and monitor for users abusing the scratch area (such as by artificially modifying file access times) to attempt to circumvent the purge policy. Doing so is a violation of the cluster’s policy and is not allowed. If you believe you have a legitimate need to retain data in scratch for longer than 30 days, please submit a support request.
Scratch location: /scratch/users/$your_netid
Scratch — Local
The /scratch.local area is allocated on each individual compute node of the Campus Cluster or HTC system; this space is provided by the compute node's local disk, not the shared filesystem.
The size of /scratch.local varies across nodes from different investments. Be careful about assuming its size, especially when running in the secondary queue, where you have less control over which node your job lands on.
Data in /scratch.local is purged following a job's completion, before the next job begins on the node.
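A common pattern is to stage input into the node-local area at the start of a job, compute against the fast local disk, and copy results back to shared storage before the job ends (and the purge runs). The sketch below is illustrative only: on the cluster, LOCAL_BASE would be /scratch.local and SLURM_JOB_ID would come from the scheduler; here both fall back to generic values so the pattern is self-contained.

```shell
#!/bin/bash
# Illustrative staging pattern for node-local scratch.
# On the cluster, LOCAL_BASE would be /scratch.local; /tmp is a
# stand-in so this sketch runs anywhere.
LOCAL_BASE="${LOCAL_BASE:-/tmp}"
WORKDIR="$LOCAL_BASE/job_${SLURM_JOB_ID:-$$}"
mkdir -p "$WORKDIR"

# Stage input (a generated file stands in for real data).
echo "sample input data" > "$WORKDIR/input.dat"

# Compute against local disk (word count as a placeholder workload).
wc -w < "$WORKDIR/input.dat" > "$WORKDIR/result.txt"

# Copy results back to shared storage before the node-local purge.
cp "$WORKDIR/result.txt" "$PWD/"
rm -rf "$WORKDIR"
```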
Storage Policies
Scratch Purge
Files in the /scratch area of the filesystem are purged daily, based on each file's access time as recorded in the filesystem's metadata.
Once data is purged via the purge policy, it cannot be recovered; it has been permanently destroyed.
Move high-value data out of this space so it isn't forgotten and purged.
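To preview which files would be caught by the purge, find can filter on access time. The demo below backdates a file in a throwaway directory so the command has something to match; on the cluster you would point it at your own area, e.g. /scratch/users/$your_netid:

```shell
# List files not accessed in more than 30 days (the purge criterion).
# Throwaway demo directory; substitute your scratch path on the cluster.
demo=$(mktemp -d)
touch "$demo/fresh.dat"
touch -a -d "40 days ago" "$demo/stale.dat"   # backdate access time only

find "$demo" -type f -atime +30               # prints only stale.dat
```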
Filesystem Snapshots
Daily snapshots are run on the filesystem for the /home and /projects areas.
These snapshots allow you to go back to a point in time and retrieve data you may have accidentally modified, deleted, or overwritten.
These snapshots are not backups and reside on the same hardware as the primary copy of the data.
To access snapshots for /home, visit: /home/.snapshots/home_YYYYMMDD*/$USER
To access snapshots for /projects, visit:
Allocations starting on/after September 2023: /projects/illinois/$college/$department/$pi_netid/.snapshots/YYYYMMDD*/
Legacy allocations (pre-September 2023): /projects/$PROJECT_NAME/.snapshots/YYYYMMDD*/
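Retrieving an old version is an ordinary copy out of the read-only snapshot tree. The commands below simulate the layout in a temporary directory so they are self-contained; on the cluster you would use the real snapshot paths shown above:

```shell
# Simulated snapshot layout (illustrative only). On the cluster, the
# source would be e.g. /home/.snapshots/home_YYYYMMDD*/$USER/<file>.
snap=$(mktemp -d)
u="${USER:-demo}"
mkdir -p "$snap/.snapshots/home_20240101/$u"
echo "precious data" > "$snap/.snapshots/home_20240101/$u/notes.txt"

ls "$snap/.snapshots/"          # browse the available snapshot dates
cp "$snap/.snapshots/home_20240101/$u/notes.txt" "$snap/notes.restored.txt"
```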
Inode Quotas
An inode is a metadata record in the filesystem that tracks information about a file or directory (such as what blocks it lives on disk, permissions, ACL, extended attributes, etc.). There is an inode record for every file or directory in the filesystem.
For project directories, there is a 20 million inode per-user policy. Since metadata is stored on NVME for fast access, this quota ensures a tolerable ratio of data to metadata. If this quota becomes an issue for your team, please submit a support request to discuss a solution. For ways to help decrease inode usage, refer to Data Compression and Consolidation.
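Because every file and directory consumes one inode, counting entries under a path approximates its inode footprint. The demo below builds a small directory so the count is predictable; on the cluster you would point find at your project area (GNU du also offers du --inodes for a per-directory summary):

```shell
# Count files and directories (≈ inode usage) under a path.
# Demo directory with a known layout; substitute your project path.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/a" "$demo/b" "$demo/sub/c"

find "$demo" | wc -l   # 5 entries: the top dir, sub/, and three files
```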
Data Classifications
There are many special data classifications in use by researchers across campus. Some types are permitted to be stored on Illinois Research Storage (the Campus Cluster) and some are not. Below are descriptions about some of those data types. If you have any questions, please submit a support request.
International Traffic in Arms Regulation (ITAR) – ITAR data is permitted to be stored on Illinois Research Storage, however all proper procedures and notifications must be followed. For more information about those procedures and who to contact, refer to the OVCR’s documentation.
Health Insurance Portability and Accountability Act (HIPAA) / Personal Identifiable Health (PIH) – HIPAA / PIH data is not permitted to be stored on Illinois Research Storage.
Accessing Storage
There are a variety of ways to access Research Storage, and we continue to work with users to find more ways. Below is a summary table of where/how Research Storage is accessible, with further descriptions below.
Location/Method | /home | /projects | /scratch
---|---|---|---
HPC Head Nodes | Yes | Yes | Yes
HPC Compute Nodes | Yes | Yes | Yes
HTC Head Nodes | On Roadmap | Yes | No
HTC Compute Nodes | On Roadmap | On Roadmap | No
cc-xfer CLI DTN Nodes | Yes | Yes | Yes
Illinois Research Storage Globus Endpoints | Yes | Yes | Yes
Globus Shared Endpoints (external sharing) | No | Yes | No
Research Storage S3 Endpoint | No | Coming Soon | No
Lab Workstations (Research Labs) | No | Yes# | No

#Available Upon Request (refer to the NFS / SAMBA sections)
HPC Head Nodes & Compute Nodes
All filesystem areas are accessible via the Campus Cluster’s batch head nodes and compute nodes to run jobs and interact with data on the command line.
HTC Head Nodes
We are working to implement access to investor /projects areas on the head nodes of the Illinois HTC subsystem.
This work is in progress, and this guide will be updated when that capability is available.
Mounting /home on the HTC head nodes (providing a shared $HOME between the HPC and HTC subsystems) has been discussed and is in the process of being placed on the feature-delivery roadmap.
HTC Compute Nodes
Mounting /home and /projects on the Illinois HTC compute nodes has been discussed and is likely to be placed on our roadmap soon.
An upgrade and architectural shift of some subsystem components will be required to provide this in a stable fashion, and planning for that effort is underway.
This section will be updated as new information becomes available.
CLI DTN Nodes
The filesystem areas are available for access via the cluster’s cc-xfer DTN service. These DTN nodes provide a target for transferring data to and from the cluster using common command line interface (CLI) data transfer methods such as rsync, scp, sftp, and others.
The DTN nodes sit behind the round-robin alias cc-xfer.campuscluster.illinois.edu.
Globus Endpoints
Globus is a web-based file transfer system that works in the background to move files between systems with Globus endpoints. Refer to Transferring Files - Globus for complete instructions on using Globus with NCSA computing resources.
The filesystem areas are accessible via the cluster’s Globus endpoint, not just for transfers to/from the system from/to other Globus endpoints, but also Box storage and Google Drive storage via the respective Globus endpoint collections.
You can also create a shared Globus endpoint to share data with people that are not affiliated with the University of Illinois system.
POSIX Endpoint
The Campus Cluster POSIX endpoint collection name is “Illinois Research Storage”.
Box Endpoint
The Campus Cluster Box endpoint collection name is “Illinois Research Storage - Box”.
Google Drive Endpoint
The Campus Cluster Google Drive endpoint collection name is “Illinois Research Storage - Google Drive”.
Lab Workstations/Laptops
Groups can request that their /projects area be made accessible for mounting on workstations and laptops in their research labs on campus.
This access method is especially helpful for data acquisition from instruments straight onto the Research Storage system, and for viewing files remotely for use in other GUI-based software that is run on local machines instead of in a clustered environment.
This access method is only available to machines on campus, and some security restrictions must be followed.
For more information refer to the NFS / SAMBA sections.
Managing Your Data Usage
Quota Command
Use the quota command to view a summary of your usage across all areas of the filesystem, as well as your team's usage of the project space(s) you have access to.
The output of the quota command is updated on the system roughly every 15 minutes, so there can be a slight delay between creating/deleting data and seeing the change in the quota output. When the filesystem is under heavy load, the quota data can drift out of sync; the system runs a daily quota verification script to force the quota data back in sync across the system.
[testuser1@cc-login1 ~]$ quota
Directories quota usage for user testuser1:
-----------------------------------------------------------------------------------------------------------
| Fileset | User | User | User | Project | Project | User | User | User |
| | Block | Soft | Hard | Block | Block | File | Soft | Hard |
| | Used | Quota | Limit | Used | Limit | Used | Quota | Limit |
-----------------------------------------------------------------------------------------------------------
| labgrp1 | 160K | 1T | 1T | 58.5G | 1T | 6 | 20000000 | 20000000 |
| home | 36.16M | 5G | 7G | 58.5G | 1T | 1180 | 0 | 0 |
| scratch | 2.5M | 20T | 20T | 58.5G | 1T | 15292 | 0 | 0 |
| labgrp2 | 0 | 60T | 60T | 54.14T | 60T | 1 | 20000000 | 20000000 |
| labgrp3 | 0 | 107T | 107T | 88.31T | 107T | 11 | 20000000 | 20000000 |
-----------------------------------------------------------------------------------------------------------
User Block Used - How much capacity the user, testuser1, is consuming in each of the areas.
Project Block Used - How much capacity the entire team is using in their project space.
User File Used - How many inodes the user, testuser1, is consuming in each space.
The relevant soft and hard quotas (limits) for each of the areas are also shown in their respective columns. The last two columns are the inode quotas (limits).
Storage Web Dashboard Interface
The cluster overview dashboard shows an overview of the state of the cluster including:
job counts
node health numbers
number of users logged in
filesystem activity
overall usage
When an investment group purchases storage on the Research Storage service, a dashboard is created to view information related to the group's usage. The dashboard shows point-in-time usage of the storage resources, trends over time, and a per-user breakdown of capacity and inodes.
All members of a project should be able to log in and view their storage dashboard. Access to the dashboard is governed by group membership, which PIs and technical representatives control via the online User Portal. There may be a delay of a few hours between when a user is added/removed from a group and when their access to the group's dashboard is added/removed.
To access your group/project storage dashboard:
Go to the cluster overview dashboard.
Click on the Sign In button in the upper-right corner of the screen.
Log in with your campus NetID and NetID password.
In the search bar at the top of the screen, search for your project's name (for example, the "NCSA" project); it should appear in the search results.
On your team's storage dashboard, trends for individuals are displayed in the lower left for the username selected from the drop-down, and the time period of the graphs can be adjusted with the Time Picker in the upper-right corner. The utilization table in the lower right can be sorted by username (alphabetical), capacity used, or inodes used.
Guides/Tutorials
How to Relocate Your .conda Directory to Project Space
Large conda installations can exceed your home directory quota. You can avoid this by relocating your .conda directory to your project space, which has a larger quota than your home directory.
Relocate your .conda directory to your project space using the following steps:
Note, for allocations that started on/after September 2023, <your_proj_dir> will follow the syntax illinois/$college/$department/$pi_netid.
Make a .conda directory in your project space.
[testuser1@cc-login1 ~]$ mkdir -p /projects/<your_proj_dir>/<your_username>/.conda
Copy over existing .conda data.
[testuser1@cc-login1 ~]$ rsync -aAvP ~/.conda/* /projects/<your_proj_dir>/<your_username>/.conda/
Remove your .conda directory from home.
[testuser1@cc-login1 ~]$ rm -rf ~/.conda
Create a link to your new .conda directory.
[testuser1@cc-login1 ~]$ ln -s /projects/<your_proj_dir>/<your_username>/.conda ~/.conda
NFS Access to Research Storage
Investor groups can request that their /projects area be made available via NFS for mounting on machines local to their lab team.
Project PIs or technical representatives can request NFS access to their /projects area by submitting a support request with the following information:
Mount Type (Read-Only or Read-Write)
Project Area being exported
List of IPs or an IP CIDR range of the machines that need to mount the export. (These machines must have a public IP address or a campus internally routed IP address to be able to reach the NFS servers.)
NFS exports of the filesystem are root-squashed, which means that a user interacting with the storage via the remote machine's root account will have file access permissions that map to the nfsnobody user (generally UID 65534 on Linux systems).
When NFS mounting storage, it is advised to have user UIDs align with what they are on the Research Storage system. You can find your UID on the system by running the id command.
See example below:
[testuser1@cc-login ~]$ id
uid=7861(testuser1) gid=7861(testuser1) groups=7861(testuser1) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[testuser1@cc-login ~]$
Once exported, the filesystem can be NFS mounted via two methods:
autofs (preferred) - refer to the autofs section of the Red Hat guide.
Manually add the mount to the host's /etc/fstab file - refer to the /etc/fstab section of the Red Hat guide.
The round-robin DNS entry for the Research Storage NFS clustered endpoint is “nfs.campuscluster.illinois.edu”. Make sure you have the nfs-utils package installed on your machine. Our recommendations for NFS mount parameters are as follows:
[rw/ro],soft,timeo=200,vers=4,rsize=16384,wsize=16384
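As a sketch, an /etc/fstab entry using these parameters might look like the following; the export path and mount point are illustrative assumptions (the export path reuses the document's /projects/smith_lab example):

```
# /etc/fstab entry (illustrative; adjust export path and mount point)
nfs.campuscluster.illinois.edu:/projects/smith_lab  /mnt/smith_lab  nfs  rw,soft,timeo=200,vers=4,rsize=16384,wsize=16384  0 0
```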
Samba Access to Research Storage
Investor groups can request that their /projects area be made available via Samba for mounting on machines local to their lab team.
Project PIs or technical representatives can request Samba access to their /projects area by submitting a support request with the following information:
Project Area being Exported
Following the request, the Research Storage team will export that area of the filesystem to users in that project’s group as seen in the User Portal. To add/remove users who can mount the project area, add/remove them from the group in the portal.
Once exported, the round-robin DNS entry for the SAMBA node pool is "samba.campuscluster.illinois.edu". There is a guide to mounting a SAMBA share on Windows machines and a guide for mounting on Mac OS based machines available for reference. Make sure the machine is connected to the campus network and has a campus public IP address or an internally routed private IP address.
For Windows the path to the share should look like:
\\samba.campuscluster.illinois.edu\$your_project_name
For macOS, the server address to use is:
smb://samba.campuscluster.illinois.edu
For both operating systems, use your campus AD credentials (the same ones you use to access the cluster) to access these shares.
Data Compression and Consolidation
It can often be handy to bundle up a bunch of files into a single file bundle. This can make data transport easier and more efficient. It also helps reduce the space the data takes up in disks, in capacity and inodes.
As noted in the policy section on inode limits, we will discuss how to compress files together into a bundle and then zip them up to save space. To compress files with tar + gzip, see the example below where the images folder is run through tar + gz to create images_bundle.tar.gz:
## Just for illustration, this folder has 4,896 image files in it
[testuser1@cc-login hubble]~ ls images/ | wc -l
4896
## tar and compress the folder, example:
[testuser1@cc-login hubble]~ tar -zcvf images_bundle.tar.gz images
## There should now be a single archive file that contains all the images
[testuser1@cc-login hubble]~ ls
images images_bundle.tar.gz
## You can now remove the original folder as all its contents are in the tar.gz file
[testuser1@cc-login hubble]~ rm -rf images
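To get the files back later, extract the archive; listing its contents first is a cheap way to verify a bundle before deleting anything. The demo below builds a tiny archive so the commands are self-contained:

```shell
# Demo: build a small bundle, list its contents, then extract it.
demo=$(mktemp -d)
cd "$demo"
mkdir images && touch images/a.png images/b.png
tar -zcf images_bundle.tar.gz images
rm -rf images

tar -ztf images_bundle.tar.gz   # list archive contents without extracting
tar -zxf images_bundle.tar.gz   # extract back to images/
```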
CLI Transfer Method: rsync
Use rsync for small to modest transfers to avoid impacting usability of the login node. Refer to Transferring Files - rsync for instructions on how to use rsync.
CLI Transfer Method: scp
Use scp for small to modest transfers to avoid impacting usability of the login node. Refer to Transferring Files - scp for instructions on how to use scp.
CLI Transfer Method: sftp
Refer to Transferring Files - sftp for instructions on how to use WinSCP and Cyberduck.
CLI Transfer Method: bbcp
Transferring data via bbcp requires the tool to be installed on both sides of the transfer.
It is installed on the cc-xfer.campuscluster.illinois.edu endpoint.
You can download bbcp for your local machine from SLAC.
The example below shows the user testuser1 transferring their images directory to a project directory on the cluster using bbcp.
Note, for allocations that started on/after September 2023, $teams_directory will follow the syntax illinois/$college/$department/$pi_netid.
## User wants to transfer the "images" directory
[testuser1@users-machine hubble]~ ls
images
## Transfer using bbcp to a project directory
[testuser1@users-machine hubble]~ bbcp -r -w 4m images [email protected]:/projects/$teams_directory/
CLI Transfer Method: rclone
The Rclone file transfer utility is installed on the cc-xfer.campuscluster.illinois.edu endpoint.
Rclone can be configured for a variety of storage backends.
If you are transferring data between Research Storage and a local machine, it is best to set up the Research Storage endpoint using the SFTP connector on your local machine.
If you are transferring data between Research Storage and another cloud storage service (such as Amazon S3, Box, Dropbox, Google Drive, or OneDrive), it is best to configure that endpoint as a source/target on the cc-xfer.campuscluster.illinois.edu endpoint itself.
For instructions on configuring Rclone to send/receive data from your desired location, refer to the Rclone documentation.
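For the local-machine case, an SFTP remote pointing at the cluster might be configured like the fragment below; the remote name is arbitrary and the settings are illustrative, so check the Rclone SFTP documentation for the full option list:

```
# ~/.config/rclone/rclone.conf -- illustrative SFTP remote
[research-storage]
type = sftp
host = cc-xfer.campuscluster.illinois.edu
user = your_netid
# authenticate via ssh-agent or a key_file entry; see the Rclone docs
```

A transfer would then look like rclone copy ./images research-storage:/projects/$teams_directory/images, reusing the document's $teams_directory placeholder.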
Optimizing Data and I/O Access
Understanding Filesystem Block Size
The Research Storage subsystem is currently based on IBM's Spectrum Scale Filesystem (formerly known as GPFS) v5.1. This filesystem, like others, is formatted at a given block size; however, it offers the advanced feature of sub-block allocation. A filesystem's block size is the smallest unit of allocation on the filesystem.
For example, on a normal filesystem with a block size of 512KB, a file smaller than that (say 200KB) will sit in a single block, using only part of its available capacity. The remaining 312KB in that block is unusable, because a block can only be mapped to a single file and the first file is already associated with it; this leaves that block's usage efficiency below 50%. That wasted space counts against the user's quota, since no other user can leverage it. Setting a smaller block size generally improves the efficiency of the filesystem and wastes less space; however, it comes at the cost of performance. The smaller the block size, the slower the filesystem performs on high-bandwidth I/O applications.
The ability to have sub-block allocation on the filesystem allows the filesystem to be formatted at a given block size but have the minimum allocatable block size be much smaller. In this way, you can get performance benefits of large block sizes, with much less efficiency loss on small files.
The Research Storage subsystem is formatted at a 16MB block size, which allows for very high system throughput (which many user applications demand). However, that block can be subdivided into 1,024 sub-blocks, so the minimum allocatable block size is 16KB. For example, a 12KB file will render ~4KB of space unusable.
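You can observe allocation granularity directly by comparing a file's apparent size with the space allocated for it. The allocated figure depends on whatever filesystem you run this on, so only the apparent size below is predictable:

```shell
# Compare apparent size vs. allocated size for a 1-byte file.
# The allocated size reflects the local filesystem's (sub-)block size.
demo=$(mktemp -d)
printf 'x' > "$demo/tiny.dat"
du --apparent-size -B1 "$demo/tiny.dat"   # apparent size: 1 byte
du -B1 "$demo/tiny.dat"                   # allocated: at least one block
```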
Impact of I/O Size on Throughput Performance
The filesystem block size impacts the I/O performance that applications receive when running compute jobs on the HPC and HTC systems. When an application does I/O on the filesystem, the size of its I/O requests is a key trait in determining how much performance it will receive. When possible, you should configure your workflows to use larger files.
For applications that can't avoid using tiny files, HDF files can sometimes improve performance by creating a virtual filesystem within a single file that contains all the data.