---
theme: csc-eurocc-2019
lang: en
---

How to speed up jobs {.title}

![](https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png)
All materials (c) 2020-2025 by CSC – IT Center for Science Ltd. This work is licensed under a **Creative Commons Attribution-ShareAlike 4.0 International** License, [http://creativecommons.org/licenses/by-sa/4.0/](http://creativecommons.org/licenses/by-sa/4.0/)

Outline

  • How to choose your software

  • Speeding up jobs through parallel programming

  • Benefits and pitfalls of workflows and high-throughput computing

The purpose of large computers

  • In principle, supercomputers are not much faster than laptops -- they are simply bigger

    • For fast computation, they utilize parallelism (and typically have special disk, memory and network solutions, too)

  • Parallelism simplified:

    • You split a problem into smaller subtasks and solve them all simultaneously using numerous cores

First steps for fast jobs (1/2)

  • Spend a little time to investigate:

    • Which of the available software would be the best to solve the kind of problem you have?

  • Consider:

    • The software that solves your problem the fastest might not always be the best!

      • Issues like ease-of-use and memory/disk demands are also highly relevant

    • Start simple and gradually use more complex approaches if needed

First steps for fast jobs (2/2)

Optimize the performance of your code (1/2)

  • When compiling a program, possibly one that you've written yourself, remember to use optimizing compiler flags

  • Construct a small and quick test case and run it in the test queue

    • Docs CSC: Available partitions on Puhti, Mahti and LUMI

    • Use the test case to optimize computations before starting massive ones

Optimize the performance of your code (2/2)

HPC parallel jobs

  • It is not just how your software is constructed and compiled that affects performance -- how it is parallelized and run are important factors, too!

  • Recall: a parallel job distributes the calculation over several cores in order to achieve a shorter wall-clock time

Parallel programming models

  • Parallel execution is based on threads or processes (or both) which run at the same time on different CPU cores

  • Processes

    • Interaction is based on exchanging messages between processes

      • Can form a performance bottleneck!

    • MPI (Message Passing Interface)

  • Threads

    • Shared memory, i.e. each thread can directly access data of other threads

      • Easier to program, but problems may arise with race conditions, i.e. when different threads update the same data without proper synchronization

    • OpenMP (Open Multi-Processing)

MPI and OpenMP standards


**MPI: Processes**
  • MPI launches N processes at application startup

  • Independent execution units

  • Have their own memory

  • Works over multiple nodes
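
As an illustration (a minimal sketch, not part of the original slides), an MPI program in C looks roughly like this: all processes exist from application startup, each with its own rank and memory, and any interaction would go through message-passing calls.

```c
/* Minimal MPI sketch (illustrative): N independent processes, each with its
   own memory. Compile with an MPI wrapper (e.g. mpicc) and launch with
   srun or mpirun. Communication would use e.g. MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* all processes exist from startup */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id (0..size-1)    */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes        */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```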

**OpenMP: Threads**

  • Threads are created and destroyed (parallel regions)

  • Threads share memory

  • Limited to a single node
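
A corresponding minimal OpenMP sketch (again illustrative): threads exist only inside the parallel region, share memory, and the `reduction` clause provides the synchronization needed to avoid a race condition on `sum`.

```c
/* Minimal OpenMP sketch (illustrative): threads share memory within a node.
   Compile with an OpenMP flag, e.g. -fopenmp for GCC. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    /* Threads are created for this parallel region and destroyed at its end.
       The reduction clause gives each thread a private partial sum and
       combines them safely, avoiding a race condition on the shared sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;

    printf("Sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```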

Hybrid programming: Launch threads (OpenMP) within processes (MPI)

  • Shared-memory programming inside a node, message passing between nodes → matches modern supercomputer hardware well

  • Pros and cons:

    • Fewer MPI processes for a given amount of cores

      • All-to-all communication bottlenecks alleviated

      • Decreased memory consumption if implementation uses replicated data

    • Possible to dedicate threads to specific tasks

    • More complicated to program, overhead from thread creation/destruction

  • The optimal number of MPI tasks per node depends on the application and should always be determined by experimenting!
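
A bare-bones hybrid sketch, for illustration only: MPI is initialized with thread support and each MPI process then opens an OpenMP parallel region. The right split between processes and threads per node is exactly what should be found by experimenting.

```c
/* Hybrid sketch (illustrative): OpenMP threads inside MPI processes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask MPI for thread support; MPI_THREAD_FUNNELED is enough when only
       the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```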

Parallel scaling

  • **Requesting more cores does not automatically equal faster jobs!**

  • Scalability can be limited by:

    1. Load imbalance (variation in workload among cores)
    2. Parallel overheads (additional operations that are not present in the serial calculation)
    3. Synchronization and communication
    4. The fraction of serial code, which limits the maximum speedup ([Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law), see the formula below)
![](img/scaling.png){width=100%}
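
For reference, Amdahl's law in the usual notation (the symbols p and N are not used elsewhere in these slides): with a parallelizable fraction p of the runtime and N cores, the speedup is

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
$$

so even p = 0.95 caps the achievable speedup at 20, no matter how many cores are used.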

Self-study materials for OpenMP and MPI

Graphics Processing Units (GPUs) can speed up jobs

  • GPUs are massively parallel processors originally developed for computer graphics

  • They can be used for science, but are often challenging to program

    • Not all algorithms can use the full power of GPUs!

  • Check the manual to find out if your software can utilize GPUs

  • Does your code run on AMD GPUs? LUMI has a massive GPU capacity!

GPU programming models

  • GPUs are co-processors to the CPU

  • The CPU controls the workflow:

    1. *offloads* computations to the GPU by launching *kernels*
    2. allocates and deallocates memory on the GPUs
    3. handles data transfers between the CPU and the GPUs

  • GPU kernels run multiple threads, and multiple kernels may run concurrently on the same GPU

  • When using multiple GPUs, the CPU typically runs multiple processes or threads
![](img/gpu-offload.svg){.center width=100%}
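
There are several GPU programming models (CUDA, HIP, SYCL, OpenACC, OpenMP offloading, ...). As one possible illustration in C, the OpenMP offloading sketch below follows the pattern described above: data is mapped to the device, a kernel-like region runs many GPU threads, and the result is copied back. Whether it actually runs on a given GPU depends on the compiler's offload support; treat it as a sketch, not a recipe for a specific system.

```c
/* GPU offload sketch using OpenMP target directives (one possible model).
   Requires a compiler built with GPU offload support. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The CPU offloads this loop to the GPU: x and y are copied to the
       device (map clauses), the kernel-like region runs many GPU threads,
       and y is copied back when the region ends. */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```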

Workflows and high-throughput computing

  • An **HPC workflow/pipeline** typically consists of:

    1. *Pre-processing* of input data
    2. *Computation*, often in multiple *steps*, possibly with *dependencies*
    3. *Post-processing* of output data

  • Automation can save human time, but brings additional complexity

  • You should understand how your tools work

  • Workflows designed for a laptop/cloud environment must often be modified to run efficiently on HPC

  • [See general guidelines in Docs CSC](https://docs.csc.fi/computing/running/throughput/)
![](img/dag.png 'Example Snakemake pipeline by CFDE'){width=80%}

Workflows and Slurm

  • Slurm keeps a detailed track of jobs and used resources:

    • Amount and location of CPU cores and GPUs, memory limits, local storage, job lifetime, etc.

    • Information recorded in an internal Slurm job accounting database

  • A large number of short (<30 min) jobs generates lots of accounting data, and scheduling them uses more resources than the actual jobs!

  • To avoid:

    • Large number of (short) jobs (sbatch commands)

    • Large number of job steps (srun commands)

    • Polling Slurm commands (squeue, sacct, etc. commands)

Workflows and Lustre

  • Workflows quite often involve processing large datasets, which can be problematic for parallel file systems like Lustre at CSC

  • To avoid:

    • Accessing lots of small files, having many files in a single directory

      • Recall the issues caused by Conda!

    • Large files on a single object storage target (i.e. no striping)

    • Opening and closing a file at a rapid pace

      • Includes database operations

    • Using file locks for synchronization

      • Writing to the same file from many nodes

How much is too much?

  • It is hard to give exact numbers as Slurm and Lustre are shared resources

    • When the total system load is low, it may be OK to run something that would be problematic when the system is full

  • How many jobs/steps is too many?

    • SHOULD BE OK to run tens of jobs/steps

    • PAY ATTENTION if you run hundreds of jobs/steps

    • DON'T RUN several thousands of jobs/steps

  • How many file operations is too many?

    • SHOULD BE OK to access hundreds of files

    • PAY ATTENTION if you need several thousand files

    • DON'T USE hundreds of thousands of files

It is important to understand your workflow

  • How many tasks?

  • Are there dependencies/conditionals between tasks?

  • Resource requirements and scalability (serial vs. OpenMP vs. MPI)

  • Duration

  • I/O behavior: what kind of extra/bookkeeping files are created, and how many?

  • Need for error handling and checkpointing?

  • How is your tool polling / using Slurm?

If you have lots of small jobs and/or files (1/2)

  • Task farming -- running many similar independent jobs simultaneously

    • For 100+ jobs, regroup your tasks and execute them in a single job (step) -- a rough sketch follows after this list

  • Check the tool you're using, there may be built-in support for running many tasks within a single job step (best option!)

  • External tools: Array jobs*, HyperQueue

    • Array jobs are simply a Slurm feature for submitting many jobs from a single batch script -- don't submit hundreds of short jobs!

    • HyperQueue is a meta-scheduler, which allows you to pack many (non-MPI) jobs within a single job step (recommended!)
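
A rough sketch of the regrouping idea using MPI in C: a single job step works through many independent tasks by distributing task indices over the ranks. Here `process_task()` and `NTASKS` are hypothetical placeholders for whatever one short job would have done; in practice, built-in support in your tool or HyperQueue is usually the better option.

```c
/* Task-farming sketch (illustrative): one MPI job step processes many
   independent tasks instead of submitting hundreds of tiny Slurm jobs. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS 1000   /* total number of independent tasks (placeholder) */

static void process_task(int task_id)
{
    /* Hypothetical placeholder: read input #task_id, compute, write output. */
    printf("Processing task %d\n", task_id);
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cyclic distribution: rank r handles tasks r, r+size, r+2*size, ... */
    for (int t = rank; t < NTASKS; t += size)
        process_task(t);

    MPI_Finalize();
    return 0;
}
```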

If you have lots of small jobs and/or files (2/2)

  • In more complex cases (dependencies, error handling), workflow managers such as Nextflow, Snakemake or FireWorks can be used

    • HyperQueue integration for Nextflow and Snakemake already available!

    • See Docs CSC for more details

  • When working with lots of small files:

    • Consider using local disks and containerizing e.g. Conda environments

Summary (1/2)

  • Different codes may give very different performance for a given use case

  • Before launching massive jobs, start with small/fast tests and look for the most efficient algorithms

    • Start with coarse models and parameter scans and gradually increase precision/resolution (if needed)

    • Brute-force approach vs. enhanced sampling methods or AI/ML models

  • Try to start formulating your scientific results as soon as you have a minimal amount of computational results

    • Helps to clarify what you still need to compute, what computations would be redundant and what data you need to store

Summary (2/2)

  • Simply requesting more cores does not automatically equal faster jobs

    • Parallel scaling is usually limited by many factors (e.g. communication, fraction of serial code)

    • Perform a scalability test and also check with seff/sacct if the memory was used efficiently

      • Software log/output files may also contain useful performance information

  • Avoid running lots of short Slurm jobs and mind your I/O

    • Use built-in or external tools to pack many tasks in a single job step

    • Use local disks and containerize Conda environments

  • Use checkpoints/restarts if supported by your software, especially when running long jobs
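
If the long-running code is your own, a minimal checkpoint/restart pattern can be as simple as the sketch below (illustrative; the file name and interval are made up): the state is written to disk periodically, and on startup the program resumes from the last checkpoint if one exists.

```c
/* Minimal checkpoint/restart pattern for a long-running loop (illustrative). */
#include <stdio.h>

#define NSTEPS 1000000
#define CHECKPOINT_EVERY 10000
#define CKPT_FILE "state.ckpt"      /* hypothetical checkpoint file name */

int main(void)
{
    long step = 0;
    double state = 0.0;

    /* Restart: if a checkpoint exists, resume from it. */
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f) {
        if (fread(&step, sizeof step, 1, f) != 1 ||
            fread(&state, sizeof state, 1, f) != 1) {
            step = 0; state = 0.0;  /* partial/corrupt checkpoint: start over */
        }
        fclose(f);
    }

    for (; step < NSTEPS; step++) {
        state += 1e-6;              /* stand-in for the real computation */

        /* Periodically save the next step to execute and the current state. */
        if ((step + 1) % CHECKPOINT_EVERY == 0) {
            FILE *out = fopen(CKPT_FILE, "wb");
            if (out) {
                long next = step + 1;
                fwrite(&next, sizeof next, 1, out);
                fwrite(&state, sizeof state, 1, out);
                fclose(out);
            }
        }
    }

    printf("Final state: %f\n", state);
    return 0;
}
```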