---
theme: csc-eurocc-2019
lang: en
---

Working efficiently with data {.title}

![](https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png)
All materials (c) 2020-2025 by CSC – IT Center for Science Ltd. This work is licensed under a **Creative Commons Attribution-ShareAlike** 4.0 International License, [http://creativecommons.org/licenses/by-sa/4.0/](http://creativecommons.org/licenses/by-sa/4.0/)

Outline

  • Efficient file I/O in HPC systems

  • Using Allas in batch scripts

  • Moving data to/from Allas, IDA and LUMI-O

  • Transferring data in sensitive data computing

  • Cleaning and backing up data

Parallel file systems

  • A parallel file system (PFS) provides a common file system area that can be accessed from all nodes in a cluster

  • Without a PFS, users would always have to copy all needed data to the compute nodes before runs (cf. local disk)

    • Also the results would not be visible outside the compute node

  • CSC uses the Lustre parallel file system on Puhti and Mahti

Lustre

![](img/lustre1.svg){width=100%}
  • One or more metadata servers (MDS) with metadata targets (MDT) that store the file system metadata

  • One or more object storage servers (OSS) with object storage targets (OST) that store the actual file system contents

  • Connection to nodes via the high-speed interconnect (InfiniBand)

What happens when you access a file?

![](img/lustre2.svg){width=100%}
  1. Send metadata request

  2. Response with metadata

  3. Request data

  4. Data response

Managing file I/O (1/3)

  • Parallel file system (Lustre):

    • Shared across all nodes in the cluster (e.g. /scratch)

    • Optimized for parallel I/O of large files, slow if accessing lots of small files!

  • Temporary local storage (NVMe):

    • Accessible on login nodes ($TMPDIR) and to jobs on some compute nodes ($LOCAL_SCRATCH)

    • Automatically purged after the job finishes

    • Availability varies slightly depending on the supercomputer (Puhti/Mahti/LUMI)
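
A minimal sketch of reserving fast local storage on Puhti; the project number, partition and sizes are placeholders:

```bash
#!/bin/bash
#SBATCH --account=project_2001234    # placeholder project number
#SBATCH --partition=small
#SBATCH --time=00:30:00
#SBATCH --gres=nvme:100              # reserve 100 GB of node-local NVMe

# Slurm points $LOCAL_SCRATCH to the reserved fast local disk
echo "Fast local storage for this job: $LOCAL_SCRATCH"
```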

Managing file I/O (2/3)

  • To avoid on Lustre:

    • Accessing lots of small files, opening/closing a single file at a rapid pace

    • Having many files in a single directory

  • Use file striping to distribute large files across many OSTs

  • Use more efficient file formats when possible

    • Simply using tar and compression is a good start

    • High-level I/O libraries and portable file formats like HDF5 or NetCDF

  • Docs CSC: How to achieve better I/O performance on Lustre
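
As a sketch (the directory path is a placeholder and a suitable stripe count depends on your file sizes), striping and packing could look like this:

```bash
# Stripe new files written to this directory across 8 OSTs
lfs setstripe -c 8 /scratch/project_2001234/large_outputs

# Check the current striping of a directory or file
lfs getstripe /scratch/project_2001234/large_outputs

# Pack a directory full of small files into a single compressed archive
tar czf small_files.tar.gz small_files/
```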

Managing file I/O (3/3)

  • Use fast local disk to handle file I/O with lots of small files

    • Requires staging and unstaging of data

    • tar xf /scratch/<project>/big_dataset.tar.gz -C $LOCAL_SCRATCH

  • Processing data in memory allows better performance compared to writing to and reading from the disk

    • "Ramdisk" (/dev/shm) can be used on Mahti nodes without NVMe

    • export TMPDIR=/dev/shm

  • Do not use databases on /scratch

    • Instead, consider hosting DBs on cloud resources (e.g. Pukki DBaaS)
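
A minimal staging pattern inside a batch job, assuming local NVMe was reserved with --gres=nvme and with my_program standing in for your own application:

```bash
# 1. Stage in: unpack the input data to fast local disk
tar xf /scratch/project_2001234/big_dataset.tar.gz -C "$LOCAL_SCRATCH"

# 2. Compute against the local copy
my_program --input "$LOCAL_SCRATCH/big_dataset" \
           --output "$LOCAL_SCRATCH/results"

# 3. Stage out: pack the results and copy them back to /scratch
tar czf /scratch/project_2001234/results.tar.gz -C "$LOCAL_SCRATCH" results
```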

Using Allas in batch jobs

  • Command-line interface: use either Swift or S3 protocol

    • Swift: can be used with multiple projects, authentication valid for 8 hours; S3: configured for one project at a time, persistent authentication

  • allas-conf requires entering your CSC password interactively, and the token it creates is valid for only 8 hours

    • Jobs may start late in the queue and the job itself may run longer than 8 hours, so the token can expire mid-run

  • Use allas-conf -k

    • stores the password in the environment variable $OS_PASSWORD so that a new token can be generated automatically

      • a-tools regenerate a token using $OS_PASSWORD automatically

      • rclone requires explicitly renewing the connection in batch jobs:

      source /appl/opt/csc-cli-utils/allas-cli-utils/allas_conf -f -k $OS_PROJECT_NAME
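
A sketch of a batch script relying on the stored password (run allas-conf -k interactively before submitting; the project number and bucket name are placeholders):

```bash
#!/bin/bash
#SBATCH --account=project_2001234
#SBATCH --partition=small
#SBATCH --time=12:00:00

# Renew the Allas token non-interactively; assumes allas-conf -k was run
# on the login node so that $OS_PASSWORD is inherited by the job
module load allas
source /appl/opt/csc-cli-utils/allas-cli-utils/allas_conf -f -k $OS_PROJECT_NAME

# a-tools renew the token on their own; rclone needs the line above
a-put -b my-results-bucket results.tar.gz
rclone copy results.tar.gz allas:my-results-bucket/
```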

Configuring Allas for S3 protocol

  • Opening the Allas connection in S3 mode:

    • source allas_conf --mode s3cmd

  • Connection is persistent

  • Usage:

    • s3cmd with endpoint s3:

    • rclone with endpoint s3allas:

    • a-put/a-get with -S flag
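
Hedged examples with each client after opening the S3-mode connection (bucket and file names are placeholders):

```bash
# Open a persistent S3 connection to Allas (asks for your password once)
source allas_conf --mode s3cmd

# s3cmd addresses buckets with s3:// URIs
s3cmd ls
s3cmd put data.tar.gz s3://my-bucket/

# rclone uses the s3allas: remote
rclone ls s3allas:my-bucket

# a-tools switch to the S3 protocol with -S
a-put -S data.tar.gz
```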

How to use LUMI-O from Puhti/Mahti?

  • LUMI-O is very similar to Allas, but it supports only the S3 protocol

  • In Puhti and Mahti, connection to LUMI-O can be opened with command:

    • allas-conf --lumi

  • Usage:

    • Using LUMI-O with rclone (endpoint is lumi-o:)

      • e.g. rclone lsd lumi-o:

    • One can use a-tools with option --lumi

      • e.g. a-list --lumi

  • Docs CSC: Using Allas and LUMI-O from LUMI
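
For example, after opening the connection (bucket, project and file names are placeholders):

```bash
# Open the LUMI-O connection from Puhti or Mahti
allas-conf --lumi

# rclone via the lumi-o: remote
rclone lsd lumi-o:
rclone copy /scratch/project_2001234/data.tar.gz lumi-o:my-lumi-bucket/

# a-tools with the --lumi option
a-list --lumi
```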

Moving data between LUMI-O and Allas

  • Requires activating connections to both LUMI-O and Allas at the same time:

    • allas-conf --mode s3cmd

    • allas-conf --lumi

  • Use rclone with s3allas: as endpoint for Allas and lumi-o: for LUMI-O

    • rclone copy -P lumi-o:lumi-bucket/object s3allas:allas-bucket/

Moving data between Fairdata IDA and Allas

  • Data must be transferred via a supercomputer (e.g. Puhti)

  • Requires configuring Fairdata IDA in CSC supercomputers

    • Load IDA module: module load ida

    • Configure the IDA connection: ida_configure

    • Upload data to IDA: ida upload <target_in_ida> <local_file>

    • Download data from IDA: ida download <target_in_ida> <local_file>
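
A sketch of moving one object from Allas to IDA on Puhti (bucket, object and IDA target names are placeholders):

```bash
# One-time setup of the IDA client on the supercomputer
module load ida
ida_configure                 # prompts for your Fairdata credentials

# Download the object from Allas to the current directory...
a-get my-bucket/dataset.tar.gz

# ...and upload it to IDA (target in IDA first, local file second)
ida upload /my_ida_folder/dataset.tar.gz dataset.tar.gz
```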

Transferring data for sensitive data computing

  • CSC sensitive data services (SD Connect and SD Desktop) use service-specific encryption

  • SD Desktop is able to read encrypted data from Allas

    • If you want to make your data available in SD Desktop, you need to use SD Connect to upload data to Allas

    • Open SD Connect compatible connection to Allas with allas-conf --sdc

    • Data can then be uploaded by using command a-put with option --sdc

    • More information in Docs CSC
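
For example (the file name is a placeholder):

```bash
# Open an SD Connect compatible connection to Allas
allas-conf --sdc

# Upload data so that it is encrypted for use in SD Desktop
a-put --sdc sensitive_dataset.tar.gz
```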

Cleaning and backing up data (1/3)

  • Disk cleaning

    • In force for project disk areas under /scratch on Puhti

    • Files older than 180 days will be removed periodically

      • Listed in a purge list, e.g. /scratch/purge_lists/project_2001234/path_summary.txt

      • LCleaner tool can help you discover which of your files have been targeted for automatic removal

  • Best practice tips

    • Don't save everything automatically

    • Use LUE tool to analyze your disk usage

      • Avoid du and find -size; these commands put a heavy load on the file system

    • Move important data not in current use to Allas
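
For example, to review the purge list and analyze disk usage (the project number is a placeholder; check Docs CSC for the exact lue invocation):

```bash
# See which of your files are scheduled for automatic removal
less /scratch/purge_lists/project_2001234/path_summary.txt

# Summarize disk usage with LUE instead of heavy du / find -size
module load lue
lue /scratch/project_2001234
```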

Cleaning and backing up data (2/3)

  • allas-backup command provides an easy-to-use command-line interface for the restic backup tool

  • Backing up differs from normal storing:

    • Incremental (efficient) with version control (no overwriting)

    • Based on hashes and requires more computing

    • Efficient way to store different versions of a dataset

  • New restic-based "data mover" tool coming soon!

Cleaning and backing up data (3/3)

  • Please note that Allas is intended for storing active data

  • Project lifetime is usually 1-5 years

  • Commands for backing up data:

    • allas-backup --help

    • allas-backup [add] file-or-directory

    • allas-backup list

    • allas-backup restore snapshot-id
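
A short usage sketch (the directory name and snapshot id are placeholders):

```bash
# Create an incremental backup of a directory in Allas
allas-backup add important_results/

# See the stored snapshots and restore one by its id
allas-backup list
allas-backup restore 1a2b3c4d
```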