---
---
# Understanding resource usage {.title}

This lecture helps you optimize your resource usage in CSC's HPC environment.
# Optimal usage on multi-user computing platforms

- The computing resources are shared among hundreds of your colleagues, who all have different resource needs
- Resources allocated to your job are not available for others to use
- It is important to request only the resources you need and to ensure that those resources are used efficiently
- Even if you *can* use more resources, should you?
    - One resource type will always be a bottleneck
# Slurm accounting: batch job resource usage 1/2

# Slurm accounting: batch job resource usage 2/2

- Not all usage is captured by Slurm accounting
    - If CPU efficiency seems too low, look at the completion time
    - Some applications also print timing data in log files
    - Jobs launched without `srun` don't record resource usage properly (e.g. `orterun`)
- More detailed queries can be tailored with `sacct`:
    - `sacct -j <slurm jobid> -o jobid,partition,state,elapsed,start,end`
    - `sacct -S 2025-01-01` will show all jobs started after that date
- **Note!** Querying data from the Slurm accounting database with `sacct` can be a very heavy operation
    - Don't query long time intervals or run `sacct` in a loop/using `watch`, as this will degrade the performance of the system for all users
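Instead of repeatedly querying the accounting database, you can capture parsable `sacct` output once and post-process it offline. A minimal sketch (the job data below is invented for illustration; `--parsable2` produces `|`-separated fields):

```python
from datetime import datetime

# Hypothetical line captured once with:
#   sacct -j 123456 --parsable2 --noheader -o jobid,partition,state,start,end
# (saved to a file, NOT polled in a loop or with watch)
sample = "123456|small|COMPLETED|2025-01-01T10:00:00|2025-01-01T11:30:00"

def elapsed_hours(line: str) -> float:
    """Compute elapsed wall time in hours from one parsable2 sacct line."""
    jobid, partition, state, start, end = line.split("|")
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

print(elapsed_hours(sample))  # 1.5
```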
# Billing units

- CPU time and storage space consume billing units (BU)
- BUs are a property of computing projects, not users
- Monitor the BU usage of your project(s) from the command line with `csc-projects`
    - For help/options, try `csc-projects -h`
- Batch job billing scheme:
    - Amount of resources allocated: all requested resources are billed, i.e. number of cores, amount of memory, NVMe, ...
    - Time allocated: resources are billed based on the actual (wall) time a job has used, not the reserved maximum time
# Applying for billing units

- Billing units can be applied for via the Projects page in MyCSC
- Please acknowledge using CSC resources in your publications
    - Please also inform us about your work by adding your publications to the resource application!
- Academic usage is one of the free-of-charge use cases
- You can estimate usage with the online billing calculator
    - The calculator can also be used to estimate the value of the resources
- For companies interested in using CSC's HPC services, please see our services for commercial use
    - LUMI has a substantial amount of affordable computing resources (especially GPUs) available for industrial use!
# BUs are also a metric for comparing usage efficiency

- Different resources have different rates:
    - 1 CPU core hour on Puhti equals 1 BU
    - 1 GPU card hour on Puhti equals 60 BU (+ allocated CPU cores)
    - 1 node hour on Mahti equals 100 BU
    - 1 GiB hour of memory on Puhti equals 0.1 BU
    - The 1st TiB of disk quota (`/scratch`, `/projappl`) is free of charge (0 BU)
        - Excess quota applied for is billed at 5 BU/TiBh (5 billing units per TiB per hour)
    - 1 used TiB hour in Allas equals 1 BU (i.e. 1 TiB of data consumes 8760 BU per year)
- For the LUMI billing policy, see the LUMI documentation
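To make the rates concrete, here is a small, unofficial sketch of how the cost of a single Puhti job and of `/scratch` quota could be estimated from the numbers above (the function names and interfaces are invented for illustration; CSC's actual accounting is authoritative):

```python
def puhti_job_bu(cores: int, gib_mem: float, gpus: int, wall_hours: float) -> float:
    """Estimate BUs for one Puhti job using the listed rates:
    1 BU per core hour, 60 BU per GPU hour, 0.1 BU per GiB hour of memory."""
    rate_per_hour = cores * 1.0 + gpus * 60.0 + gib_mem * 0.1
    return rate_per_hour * wall_hours

def scratch_quota_bu(tib_quota: float, hours: float) -> float:
    """First TiB of /scratch or /projappl quota is free; excess is 5 BU/TiBh."""
    return max(tib_quota - 1.0, 0.0) * 5.0 * hours

# 4 cores + 8 GiB of memory for 2 hours: (4 + 0.8) * 2 = 9.6 BU
print(puhti_job_bu(cores=4, gib_mem=8, gpus=0, wall_hours=2))
# 3 TiB quota for one day: (3 - 1) * 5 * 24 = 240 BU
print(scratch_quota_bu(tib_quota=3, hours=24))
```

Note how dominant the GPU rate is: one GPU hour (60 BU) costs as much as 60 core hours, which is why idle GPUs waste resources so quickly.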
# Before starting large-scale calculations

- Check how the software and your actual input perform
    - Common job errors are caused by typos in batch/input scripts
- Use short runs in the test queue (`--partition=test`) to check that the input works and that the resource requests are interpreted correctly
- Check the output of the `seff` command to ensure that CPU and memory efficiencies are as high as possible
- It's OK if a job is (occasionally) killed due to insufficient resource requests: just adjust and rerun/restart
    - It's much worse to always run with excessively large requests "just in case"
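A minimal test-run batch script might look like the following sketch (the project name, program, and input file are placeholders; adjust them for your own project):

```shell
#!/bin/bash
#SBATCH --account=project_xxxxxxx   # placeholder: your CSC project
#SBATCH --partition=test            # short test queue
#SBATCH --time=00:10:00             # keep test runs short
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=1G

srun ./my_program input.dat         # placeholder program and input

# After the job finishes, check its efficiency with:
#   seff <jobid>
```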
# Parallelizing your workflow

- There are multiple ways to parallelize your workflow
    - Maybe several smaller jobs are better than one large job (task farming)?
    - Is there a more efficient code or algorithm?
    - Is file I/O slowing you down (lots of read/write operations)?
- Optimize usage considering single-job wall time, overall CPU time used, and I/O
# Reserving and optimizing batch job resources

- Important resource requests that should be monitored with `seff` are:
    - Scaling of a job over several cores and nodes
- Parallel jobs must always benefit from all requested resources
    - When you double the number of cores, the job should run at least 1.5x faster
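The scaling rule of thumb above can be expressed as a quick check (a sketch; the timings would come from your own short test runs, and the function name is invented):

```python
def scaling_ok(time_n_cores: float, time_2n_cores: float) -> bool:
    """When doubling the cores, require at least a 1.5x speedup."""
    return time_n_cores / time_2n_cores >= 1.5

# 100 s on 4 cores vs 60 s on 8 cores -> 1.67x speedup: worth the extra cores
print(scaling_ok(100.0, 60.0))
# 100 s on 4 cores vs 80 s on 8 cores -> only 1.25x: poor scaling, stay at 4
print(scaling_ok(100.0, 80.0))
```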
# `seff` examples

- Left: GPU usage OK! (for this example, the other metrics are also OK)
- Bottom: CPU usage way too low, memory usage too high, job killed