---
---
# Working efficiently with data {.title}
# Outline

- Efficient file I/O in HPC systems
- Using Allas in batch scripts
- Moving data to/from Allas, IDA and LUMI-O
- Transferring data in sensitive data computing
- Cleaning and backing up data
# Parallel file systems

- A parallel file system (PFS) provides a common file system area that can be accessed from all nodes in a cluster
- Without a PFS, users would always have to copy all needed data to the compute nodes before runs (cf. local disk)
  - The results would also not be visible outside the compute node
- CSC uses the Lustre parallel file system on Puhti and Mahti
# Lustre

- What happens when you access a file?
# Managing file I/O (1/3)

- Parallel file system (Lustre):
  - Shared across all nodes in the cluster (e.g. `/scratch`)
  - Optimized for parallel I/O of large files, slow if accessing lots of small files!
- Temporary local storage (NVMe):
  - Accessible on login nodes (`$TMPDIR`) and to jobs on some compute nodes (`$LOCAL_SCRATCH`)
  - Automatically purged after the job finishes
- Availability varies slightly depending on the supercomputer (Puhti/Mahti/LUMI)
  - Check the availability of local storage in different job partitions, e.g. with the sketch below
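
One quick way to check is to query Slurm's generic resources (GRES); this is a generic Slurm sketch, and the exact GRES names vary by system:

```bash
# List partitions and their generic resources;
# partitions whose nodes have local NVMe disk show e.g. "nvme:3600"
sinfo -o "%12P %20G"
```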
# Managing file I/O (2/3)

- To avoid on Lustre:
  - Accessing lots of small files, opening/closing a single file at a rapid pace
  - Having many files in a single directory
- Use file striping to distribute large files across many OSTs (see the sketch below)
- Use more efficient file formats when possible
  - Simply using `tar` and compression is a good start
  - High-level I/O libraries and portable file formats like HDF5 or NetCDF enable fast I/O through a single file format and parallel operations
  - AI/ML example: TensorFlow's TFRecords, a simple record-oriented binary format
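
Striping is controlled with the standard Lustre `lfs` utility; a minimal sketch with a hypothetical directory (new files inherit the stripe settings of the directory they are created in):

```bash
# Show the current stripe settings of a directory or file
lfs getstripe /scratch/project_2001234/large_files

# Stripe new files in this directory across 8 OSTs, 4 MiB stripe size
lfs setstripe -c 8 -S 4M /scratch/project_2001234/large_files
```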
# Managing file I/O (3/3)

- Use fast local disk to handle file I/O with lots of small files
  - Requires staging and unstaging of the data:
  - `tar xf /scratch/<project>/big_dataset.tar.gz -C $LOCAL_SCRATCH`
- Processing data in memory gives better performance than writing to and reading from disk
  - A ramdisk (`/dev/shm`) can be used on Mahti nodes without NVMe: `export TMPDIR=/dev/shm`
- Do not use databases on `/scratch`
  - Instead, consider hosting DBs on cloud resources (e.g. Pukki DBaaS)
- A complete staging workflow in a batch job is sketched below
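
Putting this together, a batch job can request local disk, stage data in, compute, and copy the results back; a sketch assuming Puhti-style Slurm options (`--gres=nvme:<GiB>`) and a hypothetical program name:

```bash
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=02:00:00
#SBATCH --gres=nvme:100        # request 100 GiB of node-local NVMe

# Stage in: unpack the dataset onto the fast local disk
tar xf /scratch/<project>/big_dataset.tar.gz -C $LOCAL_SCRATCH

# Compute against the local copy (hypothetical program)
my_analysis --input $LOCAL_SCRATCH/big_dataset --output $LOCAL_SCRATCH/results

# Stage out: pack the results back onto the parallel file system
tar czf /scratch/<project>/results.tar.gz -C $LOCAL_SCRATCH results
```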
# Using Allas in batch jobs

- Command-line interface: use either the Swift or the S3 protocol
  - Swift: works with multiple projects, token is valid for 8 hours
  - S3: configuration is fixed to one project, connection is persistent
- `allas-conf` needs the CSC password to be typed in interactively
  - Jobs may start late, and the actual job may take longer than 8 hrs (the Swift token expires)
- Use `allas-conf -k`, which stores the password in the variable `$OS_PASSWORD`, to generate a new token automatically
  - a-tools regenerate a token using `$OS_PASSWORD` automatically
  - `rclone` requires explicitly setting the environment variable in batch jobs
- A possible submission pattern is sketched below
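
One possible pattern (a sketch, not the only way): run `allas-conf -k` on the login node before submitting, so the job inherits `$OS_PASSWORD` and the tools can renew the token; by default, `sbatch` propagates the submitting shell's environment to the job.

```bash
# On the login node: store the CSC password in $OS_PASSWORD
allas-conf -k

# Submit the job; sbatch exports the current environment
# (including $OS_PASSWORD) to the job by default
sbatch allas_job.sh     # hypothetical batch script using a-put/a-get
```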
# Configuring Allas for S3 protocol

- Open the Allas connection in S3 mode:
  - `source allas_conf --mode s3cmd`
- The connection is persistent
- Usage:
  - `s3cmd` with endpoint `s3:`
  - `rclone` with endpoint `s3allas:`
  - `a-put`/`a-get` with the `-S` flag
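
For example, after opening the S3 mode connection, the same bucket can be used with all three tools (the bucket and file names below are hypothetical):

```bash
s3cmd ls s3://my-bucket                   # list objects with s3cmd
rclone copy data.csv s3allas:my-bucket/   # upload with rclone
a-put -S data.csv                         # upload with a-tools in S3 mode
```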
# How to use LUMI-O from Puhti/Mahti?

- LUMI-O is very similar to Allas, but it supports only the S3 protocol
- In Puhti and Mahti, a connection to LUMI-O can be opened with the command: `allas-conf --lumi`
- Usage:
  - `rclone` with endpoint `lumi-o:`, e.g. `rclone lsd lumi-o:`
  - a-tools with the option `--lumi`, e.g. `a-list --lumi`
- Docs CSC: Using Allas and LUMI-O from LUMI
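
For example, uploading a result file from Puhti to LUMI-O (the bucket and file names are hypothetical):

```bash
allas-conf --lumi                           # open the LUMI-O connection
rclone lsd lumi-o:                          # list your LUMI-O buckets
rclone copy results.tar lumi-o:my-bucket/   # upload a file
a-list --lumi                               # the same listing with a-tools
```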
# Moving data between LUMI-O and Allas

- Requires activating connections to both LUMI-O and Allas at the same time:
  - `allas-conf --mode s3cmd`
  - `allas-conf --lumi`
- Use `rclone` with `s3allas:` as the endpoint for Allas and `lumi-o:` for LUMI-O:
  - `rclone copy -P lumi-o:lumi-bucket/object s3allas:allas-bucket/`
# Moving data between Fairdata IDA and Allas

- Needs transfer of data via a supercomputer (e.g. Puhti)
- Requires configuring Fairdata IDA on CSC supercomputers:
  - Load the IDA module: `module load ida`
  - Configure the IDA connection: `ida_configure`
- Upload data to IDA: `ida upload <target_in_ida> <local_file>`
- Download data from IDA: `ida download <target_in_ida> <local_file>`
- A combined IDA-to-Allas transfer is sketched below
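
Since IDA data must pass through a supercomputer, a transfer from IDA to Allas on Puhti could look like this sketch (the file names are hypothetical; an Allas connection is assumed to be open):

```bash
module load ida
ida_configure                 # one-time configuration

# Pull the file from IDA to Puhti, then push it to Allas
ida download /mydata/measurements.csv measurements.csv
a-put measurements.csv
```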
# Transferring data for sensitive data computing

- CSC sensitive data services SD Connect and SD Desktop use service-specific encryption
- SD Desktop is able to read encrypted data from Allas
- If you want to make your data available in SD Desktop, you need to use SD Connect to upload the data to Allas
  - Open an SD Connect compatible connection to Allas with `allas-conf --sdc`
  - Data can then be uploaded by using the command `a-put` with the option `--sdc`, as shown below
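
A minimal sketch (the file name is hypothetical):

```bash
allas-conf --sdc                   # SD Connect compatible connection
a-put --sdc sensitive_data.tar     # upload with SD Connect encryption
```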
# Cleaning and backing up data (1/3)

- Automatic cleaning is in force for project disk areas under `/scratch` on Puhti
  - Files older than 180 days will be removed periodically
  - Targeted files are listed in a purge list, e.g. `/scratch/purge_lists/project_2001234/path_summary.txt`
  - The LCleaner tool can help you discover which of your files have been targeted for automatic removal
- Best practice tips:
  - Don't save everything automatically
  - Use the LUE tool to analyze your disk usage (see the sketch below)
  - Avoid `du` and `find -size`; these commands are heavy on the file system
  - Move important data not in current use to Allas
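
For example, checking what is about to be purged and measuring usage with LUE (a sketch; the `lue` module and command names are assumptions based on the tool's name):

```bash
# Inspect the purge list of your project
less /scratch/purge_lists/project_2001234/path_summary.txt

# Analyze disk usage without heavy du/find scans
module load lue                 # assumed module name
lue /scratch/project_2001234    # assumed command name
```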
# Cleaning and backing up data (2/3)

- The `allas-backup` command provides an easy-to-use command-line interface to the `restic` backup tool
- Backing up differs from normal storing:
  - Incremental (efficient) and versioned (no overwriting)
  - Based on hashes and requires more computing
  - An efficient way to store different versions of a dataset
- A new restic-based "data mover" tool is coming soon!
# Cleaning and backing up data (3/3)

- Please note that Allas is intended for storing active data
  - Project lifetime is usually 1-5 years
- Commands for backing up data:
  - `allas-backup --help`
  - `allas-backup [add] file-or-directory`
  - `allas-backup list`
  - `allas-backup restore snapshot-id`
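
A typical round trip with these commands (a sketch; the directory and snapshot ID are hypothetical):

```bash
allas-backup add results/       # create a new snapshot of results/
allas-backup list               # list snapshots and their IDs
allas-backup restore 1a2b3c4d   # restore the snapshot with this ID
```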