# Running Snakemake workflows at scale on Puhti
This tutorial is done on Puhti, which requires that:

- You have a user account at CSC.
- Your account belongs to a project that has access to the Puhti service.
💬 Snakemake is a popular scientific workflow manager, especially within the bioinformatics community. The workflow manager enables scalable and reproducible scientific pipelines by chaining a series of rules in a fully-specified software environment. Snakemake is available as a pre-installed module on Puhti.
## Use containers as runtime environment
💬 HPC-friendly containers like Singularity/Apptainer can be used as an alternative to native or Tykky-based installations for better portability and reproducibility. If you don't have a ready-made container image for your needs, you can build a Singularity/Apptainer image on Puhti using the `--fakeroot` option, for example as sketched below.
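A minimal build command might look like the following; the definition file name `snakemake.def` is only a hypothetical example:

```bash
# Build an Apptainer/Singularity image on Puhti without root privileges
# (snakemake.def is a hypothetical definition file; replace it with your own)
apptainer build --fakeroot snakemake.sif snakemake.def
```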
☝🏻 For the purpose of this tutorial, a pre-built container image is provided later to run Snakemake workflows at scale.
## Use HyperQueue executor to submit jobs
‼️ If a workflow manager uses `sbatch` (or `srun`) for each process execution (i.e., a rule in Snakemake terminology), and the workflow has many short processes, it's advisable to use the HyperQueue executor to improve throughput and decrease the load on the Slurm batch job scheduler.
HyperQueue and Snakemake modules on Puhti can be loaded as below:
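For example (module names as provided by CSC on Puhti):

```bash
module load hyperqueue
module load snakemake
```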
‼️ Note! If you are planning to use Snakemake on the LUMI supercomputer, you can use the CSC module installations as shown below:
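A sketch, assuming the CSC-maintained local module tree on LUMI:

```bash
# Make the CSC-provided local modules visible on LUMI, then load the tools
module use /appl/local/csc/modulefiles
module load hyperqueue
module load snakemake
```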
The HyperQueue executor settings for a Snakemake workflow differ depending on the version of Snakemake, as shown below:
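The following is a sketch of the two common cases; the exact flags depend on your Snakemake version and on the `hq submit` options you need:

```bash
# Snakemake 7.x: pass HyperQueue as a generic cluster submission command
snakemake -s Snakefile --jobs 1 --cluster "hq submit --cpus {threads}"

# Snakemake 8.x: the same via the cluster-generic executor plugin
snakemake -s Snakefile --jobs 1 \
    --executor cluster-generic \
    --cluster-generic-submit-cmd "hq submit --cpus {threads}"
```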
## Submit a Snakemake workflow on Puhti
Create and enter a suitable scratch directory on Puhti (replace `<project>` with your CSC project, e.g. `project_2001234`), and then download the tutorial material, which has been adapted from the official Snakemake documentation, from Allas. Both steps are sketched below:
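The Allas URL below is only a placeholder; use the object address given in the course material:

```bash
# Create and enter a working directory under your project's scratch area
mkdir -p /scratch/<project>/$USER/snakemake
cd /scratch/<project>/$USER/snakemake

# Download and unpack the tutorial material from Allas
# (placeholder URL; replace with the actual object address)
wget https://a3s.fi/<bucket>/snakemake_tutorial.tar.gz
tar -xzvf snakemake_tutorial.tar.gz
```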
The downloaded material includes scripts and data to run a Snakemake pipeline. You can use `snakemake_hq_puhti.sh`, the contents of which are posted below:
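The actual script is included in the downloaded material; the following is only a minimal sketch of what such a HyperQueue-driven batch script typically contains (the account, partition, time limit, and resource values are assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=snakemake_hq
#SBATCH --account=<project>        # your CSC project
#SBATCH --partition=small
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40

module load hyperqueue snakemake

# Start a HyperQueue server and a worker inside the Slurm allocation
export HQ_SERVER_DIR=$PWD/hq-server-$SLURM_JOB_ID
mkdir -p "$HQ_SERVER_DIR"
hq server start &
until hq job list &> /dev/null; do sleep 1; done
hq worker start --cpus="$SLURM_CPUS_PER_TASK" &
hq worker wait 1

# Run the workflow, submitting each rule as a HyperQueue job (one at a time by default)
snakemake -s Snakefile --jobs 1 --cluster "hq submit --cpus {threads}"

# Shut down HyperQueue cleanly
hq worker stop all
hq server stop
```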
## How to parallelize Snakemake workflow jobs?
☝🏻 The default script provided above is not optimized for throughput, as the Snakemake workflow manager just submits one job at a time to the HyperQueue meta-scheduler.
You can run multiple workflow tasks (i.e., rules) concurrently by submitting more jobs with the `snakemake` command, for example:
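A sketch, assuming the pre-Snakemake-8 `--cluster` syntax used in the batch script above:

```bash
# Allow up to 8 rules to run concurrently, each submitted as its own HyperQueue job
snakemake -s Snakefile --jobs 8 --cluster "hq submit --cpus {threads}"
```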
Apply the above modification to the `snakemake_hq_puhti.sh` batch script (and use your own project number) before submitting the Snakemake workflow job with:
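The workflow is then submitted as a normal Slurm batch job:

```bash
sbatch snakemake_hq_puhti.sh
```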
☝🏻 Note that just increasing the value of `--jobs` will not automatically make all those jobs run at the same time. This option of the `snakemake` command is just a maximum limit for the number of concurrent jobs; jobs will eventually run when resources are available. In this case, we run 8 concurrent jobs, each using 5 CPU cores, to match the reserved 40 CPU cores (one Puhti node) in the batch script. In practice, it is also a good idea to dedicate a few cores to the workflow manager itself.
💡 It is also possible to use more than one node to achieve even higher throughput as HyperQueue can make use of multi-node resource allocations. Just remember that with HyperQueue the workflow tasks themselves should be sub-node (use one node at most) as MPI tasks are poorly supported.
## Follow the progress of jobs
💡 You can already check the progress of your workflow by simply observing the current working directory where lots of new task-specific folders are being created. However, there are also formal ways to check the progress of your jobs as shown below.
Monitor the status of the submitted Slurm job:
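For example, with the standard Slurm tools:

```bash
squeue -u $USER      # all of your queued and running jobs
squeue -j <jobid>    # a specific job
```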
Monitor the progress of the individual sub-tasks using HyperQueue commands:
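A sketch, assuming `HQ_SERVER_DIR` points to the server directory created by the batch script (the path below follows the earlier sketch):

```bash
# Point hq at the server started inside the batch job
export HQ_SERVER_DIR=/scratch/<project>/$USER/snakemake/hq-server-<jobid>

hq job list             # overview of all HyperQueue jobs
hq job info <job-id>    # details of one job
hq task list <job-id>   # per-task progress of one job
```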
## How to clean task-specific folders automatically?
💭 HyperQueue creates task-specific folders (`job-<n>`) in the same directory from which you submitted the batch script. These are sometimes useful for debugging. However, if your code is working fine, the creation of many folders may be annoying, besides causing some load on the Lustre parallel file system. You can prevent the creation of such task-specific folders by setting the `stdout` and `stderr` HyperQueue flags to `none`, as shown below:
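For example, still using the pre-Snakemake-8 `--cluster` syntax from the earlier sketch:

```bash
# Suppress creation of per-task job-<n> folders by discarding task stdout/stderr
snakemake -s Snakefile --jobs 8 \
    --cluster "hq submit --cpus {threads} --stdout=none --stderr=none"
```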