---
---
# Working efficiently with data {.title}
# Outline

- Efficient file I/O in HPC systems
- Using Allas in batch scripts
- Moving data to/from Allas, IDA and LUMI-O
- Transferring data in sensitive data computing
- Cleaning and backing up data
# Parallel file systems

- A parallel file system (PFS) provides a common file system area that can be accessed from all nodes in a cluster
- Without a PFS, users would always have to copy all needed data to the compute nodes before runs (cf. local disk)
  - The results would also not be visible outside the compute node
- CSC uses the Lustre parallel file system on Puhti and Mahti
# Lustre

- What happens when you access a file?
# Managing file I/O (1/3)

- Parallel file system (Lustre):
  - Shared across all nodes in the cluster (e.g. `/scratch`)
  - Optimized for parallel I/O of large files, slow if accessing lots of small files!
- Temporary local storage (NVMe):
  - Accessible on login nodes (`$TMPDIR`) and to jobs on some compute nodes (`$LOCAL_SCRATCH`)
  - Automatically purged after the job finishes
- Availability varies slightly depending on the supercomputer (Puhti/Mahti/LUMI)
  - Check the availability of local storage in different job partitions, e.g. with the sketch below
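
One quick way to check is to query Slurm's generic resources (GRES); this is a generic Slurm sketch, and the exact GRES names vary by system:

```bash
# List partitions and their generic resources;
# partitions whose nodes have local NVMe disk show e.g. "nvme:3600"
sinfo -o "%12P %20G"
```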
# Managing file I/O (2/3)

- To avoid on Lustre:
  - Accessing lots of small files, opening/closing a single file at a rapid pace
  - Having many files in a single directory
- Use file striping to distribute large files across many OSTs (see the sketch below)
- Use more efficient file formats when possible
  - Simply using `tar` and compression is a good start
  - High-level I/O libraries and portable file formats like HDF5 or NetCDF enable fast I/O through a single file format and parallel operations
  - AI/ML example: TensorFlow's TFRecords, a simple record-oriented binary format
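
Striping is controlled with the standard Lustre `lfs` utility; a minimal sketch with a hypothetical directory (new files inherit the stripe settings of the directory they are created in):

```bash
# Show the current stripe settings of a directory or file
lfs getstripe /scratch/project_2001234/large_files

# Stripe new files in this directory across 8 OSTs, 4 MiB stripe size
lfs setstripe -c 8 -S 4M /scratch/project_2001234/large_files
```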
# Managing file I/O (3/3)

- Use fast local disk to handle file I/O with lots of small files
  - Requires staging and unstaging of the data:
  - `tar xf /scratch/<project>/big_dataset.tar.gz -C $LOCAL_SCRATCH`
- Processing data in memory gives better performance than writing to and reading from disk
  - A ramdisk (`/dev/shm`) can be used on Mahti nodes without NVMe: `export TMPDIR=/dev/shm`
- Do not use databases on `/scratch`
  - Instead, consider hosting DBs on cloud resources (e.g. Pukki DBaaS)
- A complete staging workflow in a batch job is sketched below
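
Putting this together, a batch job can request local disk, stage data in, compute, and copy the results back; a sketch assuming Puhti-style Slurm options (`--gres=nvme:<GiB>`) and a hypothetical program name:

```bash
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=02:00:00
#SBATCH --gres=nvme:100        # request 100 GiB of node-local NVMe

# Stage in: unpack the dataset onto the fast local disk
tar xf /scratch/<project>/big_dataset.tar.gz -C $LOCAL_SCRATCH

# Compute against the local copy (hypothetical program)
my_analysis --input $LOCAL_SCRATCH/big_dataset --output $LOCAL_SCRATCH/results

# Stage out: pack the results back onto the parallel file system
tar czf /scratch/<project>/results.tar.gz -C $LOCAL_SCRATCH results
```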
# Using Allas in batch jobs

- Command-line interface: use either the Swift or the S3 protocol
  - Swift: works with multiple projects, token is valid for 8 hours
  - S3: configuration is fixed to one project, connection is persistent
- `allas-conf` needs the CSC password to be typed in interactively
  - Jobs may start late, and the actual job may take longer than 8 hrs (the Swift token expires)
- Use `allas-conf -k`, which stores the password in the variable `$OS_PASSWORD`, to generate a new token automatically
  - a-tools regenerate a token using `$OS_PASSWORD` automatically
  - `rclone` requires explicitly setting the environment variable in batch jobs
- A possible submission pattern is sketched below
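
One possible pattern (a sketch, not the only way): run `allas-conf -k` on the login node before submitting, so the job inherits `$OS_PASSWORD` and the tools can renew the token; by default, `sbatch` propagates the submitting shell's environment to the job.

```bash
# On the login node: store the CSC password in $OS_PASSWORD
allas-conf -k

# Submit the job; sbatch exports the current environment
# (including $OS_PASSWORD) to the job by default
sbatch allas_job.sh     # hypothetical batch script using a-put/a-get
```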
# Configuring Allas for S3 protocol

- Open the Allas connection in S3 mode:
  - `source allas_conf --mode s3cmd`
- The connection is persistent
- Usage:
  - `s3cmd` with endpoint `s3:`
  - `rclone` with endpoint `s3allas:`
  - `a-put`/`a-get` with the `-S` flag
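
For example, after opening the S3 mode connection, the same bucket can be used with all three tools (the bucket and file names below are hypothetical):

```bash
s3cmd ls s3://my-bucket                   # list objects with s3cmd
rclone copy data.csv s3allas:my-bucket/   # upload with rclone
a-put -S data.csv                         # upload with a-tools in S3 mode
```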
# How to use LUMI-O from Puhti/Mahti?

- LUMI-O is very similar to Allas, but it supports only the S3 protocol
- In Puhti and Mahti, a connection to LUMI-O can be opened with the command: `allas-conf --lumi`
- Usage:
  - `rclone` with endpoint `lumi-o:`, e.g. `rclone lsd lumi-o:`
  - a-tools with the option `--lumi`, e.g. `a-list --lumi`
- Docs CSC: Using Allas and LUMI-O from LUMI
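
For example, uploading a result file from Puhti to LUMI-O (the bucket and file names are hypothetical):

```bash
allas-conf --lumi                           # open the LUMI-O connection
rclone lsd lumi-o:                          # list your LUMI-O buckets
rclone copy results.tar lumi-o:my-bucket/   # upload a file
a-list --lumi                               # the same listing with a-tools
```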
# Moving data between LUMI-O and Allas

- Requires activating connections to both LUMI-O and Allas at the same time:
  - `allas-conf --mode s3cmd`
  - `allas-conf --lumi`
- Use `rclone` with `s3allas:` as the endpoint for Allas and `lumi-o:` for LUMI-O:
  - `rclone copy -P lumi-o:lumi-bucket/object s3allas:allas-bucket/`
# Moving data between Fairdata IDA and Allas

- Needs transfer of data via a supercomputer (e.g. Puhti)
- Requires configuring Fairdata IDA on CSC supercomputers:
  - Load the IDA module: `module load ida`
  - Configure the IDA connection: `ida_configure`
- Upload data to IDA: `ida upload <target_in_ida> <local_file>`
- Download data from IDA: `ida download <target_in_ida> <local_file>`
- A combined IDA-to-Allas transfer is sketched below
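
Since IDA data must pass through a supercomputer, a transfer from IDA to Allas on Puhti could look like this sketch (the file names are hypothetical; an Allas connection is assumed to be open):

```bash
module load ida
ida_configure                 # one-time configuration

# Pull the file from IDA to Puhti, then push it to Allas
ida download /mydata/measurements.csv measurements.csv
a-put measurements.csv
```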
# Transferring data for sensitive data computing

- CSC sensitive data services SD Connect and SD Desktop use service-specific encryption
- SD Desktop is able to read encrypted data from Allas
- If you want to make your data available in SD Desktop, you need to use SD Connect to upload the data to Allas
  - Open an SD Connect compatible connection to Allas with `allas-conf --sdc`
  - Data can then be uploaded by using the command `a-put` with the option `--sdc`, as shown below
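
A minimal sketch (the file name is hypothetical):

```bash
allas-conf --sdc                   # SD Connect compatible connection
a-put --sdc sensitive_data.tar     # upload with SD Connect encryption
```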
# Cleaning and backing up data (1/3)

- Automatic cleaning is in force for project disk areas under `/scratch` on Puhti
  - Files older than 180 days will be removed periodically
  - Targeted files are listed in a purge list, e.g. `/scratch/purge_lists/project_2001234/path_summary.txt`
  - The LCleaner tool can help you discover which of your files have been targeted for automatic removal
- Best practice tips:
  - Don't save everything automatically
  - Use the LUE tool to analyze your disk usage (see the sketch below)
  - Avoid `du` and `find -size`; these commands are heavy on the file system
  - Move important data not in current use to Allas
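
For example, checking what is about to be purged and measuring usage with LUE (a sketch; the `lue` module and command names are assumptions based on the tool's name):

```bash
# Inspect the purge list of your project
less /scratch/purge_lists/project_2001234/path_summary.txt

# Analyze disk usage without heavy du/find scans
module load lue                 # assumed module name
lue /scratch/project_2001234    # assumed command name
```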
# Cleaning and backing up data (2/3)

- The `allas-backup` command provides an easy-to-use command-line interface to the `restic` backup tool
- Backing up differs from normal storing:
  - Incremental (efficient) and versioned (no overwriting)
  - Based on hashes and requires more computing
  - An efficient way to store different versions of a dataset
- A new restic-based "data mover" tool is coming soon!
# Cleaning and backing up data (3/3)

- Please note that Allas is intended for storing active data
  - Project lifetime is usually 1-5 years
- Commands for backing up data:
  - `allas-backup --help`
  - `allas-backup [add] file-or-directory`
  - `allas-backup list`
  - `allas-backup restore snapshot-id`
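
A typical round trip with these commands (a sketch; the directory and snapshot ID are hypothetical):

```bash
allas-backup add results/       # create a new snapshot of results/
allas-backup list               # list snapshots and their IDs
allas-backup restore 1a2b3c4d   # restore the snapshot with this ID
```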