# Understanding resource usage {.title}

This lecture helps you optimize your resource usage in CSC's HPC environment.
# Optimal usage on multi-user computing platforms

- The computing resources are shared among hundreds of your colleagues, who all have different resource needs
- Resources allocated to your job are not available for others to use
- It is important to request only the resources you need and to ensure that those resources are used efficiently
- Even if you *can* use more resources, should you?
    - One resource type will be a bottleneck
# Slurm accounting: batch job resource usage 1/2

# Slurm accounting: batch job resource usage 2/2
- Not all usage is captured by Slurm accounting
    - If the CPU efficiency seems too low, look at the completion time
    - Some applications also print timing data in log files
    - Jobs launched without `srun` don't record properly (e.g. `orterun`)
- More detailed queries can be tailored with `sacct`
    - `sacct -j <slurm jobid> -o jobid,partition,state,elapsed,start,end`
    - `sacct -S 2025-01-01` will show all jobs started after that date
- Note! Querying data from the Slurm accounting database with `sacct` can be a very heavy operation
    - Don't query long time intervals or run `sacct` in a loop or with `watch`, as this will degrade the performance of the system for all users
# Billing Units

- CPU time and storage space consume CPU and Storage Billing Units (BUs), respectively
- BUs are a property of computing projects, not users
- Monitor the BU usage of your project(s) from the command line with `csc-projects`
    - For help/options, try `csc-projects -h`
- Batch job billing scheme:
    - Amount of resources allocated: all requested resources are billed, i.e. number of cores, amount of memory, NVMe, ...
    - Time allocated: resources are billed based on the actual (wall) time a job has used, not the reserved maximum time
# Applying for Billing Units

- Billing Units can be applied for via the Projects page in MyCSC
    - Please acknowledge using CSC resources in your publications
    - Please also inform us about your work by adding your publications to the resource application!
- Academic usage is one of the free-of-charge use cases
- You can estimate usage with the online billing calculator
    - The calculator can also be used to estimate the value of the resources
- For companies interested in using CSC's HPC services, please see our services for commercial use
    - LUMI has a substantial amount of affordable computing resources (especially GPUs) available for industrial use!
- BUs are also a metric for comparing usage efficiency
- Different resources have different rates:
    - 1 CPU core hour on Puhti equals 1 CPU BU
    - 1 GPU card hour on Puhti equals 60 GPU BU (+ 1 GPU BU per allocated CPU core)
    - 1 CPU node hour on Mahti equals 100 CPU BU
    - 1 GiB hour of memory on Puhti equals 0.1 CPU BU
    - 1st TiB of disk quota (`/scratch`, `/projappl`) is free of charge (0 Storage BU)
        - Excess quota applied for is billed at 5 Storage BU per TiB per hour
    - 1 used TiB hour in Allas equals 1 Storage BU (i.e. 1 TiB of data consumes 8760 Storage BU per year)
- This and other service billing information can be found in Docs CSC
- For the LUMI billing policy, see the LUMI documentation
# Before starting large-scale calculations

- Check how the software and your actual input perform
    - Common job errors are caused by typos in batch/input scripts
- Use short runs in the test queue (`--partition=test`) to check that the input works and that the resource requests are interpreted correctly
- Check the output of the `seff` command to ensure that the CPU and memory efficiencies are as high as possible
- It's OK if a job is (occasionally) killed due to insufficient resource requests: just adjust and rerun/restart
    - It's much worse to always run with excessively large requests "just in case"
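As a rough illustration of what to look for in `seff` output, here is a small Python sketch that extracts the efficiency percentages and flags over-requested resources. The sample text and the 75%/25% thresholds are assumptions for this example, not CSC policy, and the exact `seff` output format may differ on your system.

```python
import re

# Sample text imitating `seff <jobid>` output; illustrative only, the real
# format on your system may differ.
SAMPLE = """\
Job ID: 123456
State: COMPLETED (exit code 0)
CPU Efficiency: 35.00% of 04:00:00 core-walltime
Memory Efficiency: 10.00% of 64.00 GB
"""

def efficiencies(seff_text: str) -> dict:
    """Return {'cpu': ..., 'memory': ...} efficiency percentages found in the text."""
    out = {}
    for kind in ("CPU", "Memory"):
        m = re.search(rf"{kind} Efficiency:\s*([\d.]+)%", seff_text)
        if m:
            out[kind.lower()] = float(m.group(1))
    return out

eff = efficiencies(SAMPLE)
# Thresholds below are arbitrary examples of "too low", not CSC rules.
if eff.get("cpu", 100.0) < 75.0:
    print(f"CPU efficiency {eff['cpu']}%: consider requesting fewer cores")
if eff.get("memory", 100.0) < 25.0:
    print(f"Memory efficiency {eff['memory']}%: consider requesting less memory")
```

A check like this can be scripted over a batch of finished jobs to spot habitual over-requesting before the next large run.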
# Parallelizing your workflow

- There are multiple ways to parallelize your workflow
    - Maybe several smaller jobs are better than one large one (task farming)?
    - Is there a more efficient code or algorithm?
    - Is file I/O slowing you down (lots of read/write operations)?
- Optimize usage considering single-job wall time, overall CPU time used, and I/O
# Reserving and optimizing batch job resources

- Important resource requests that should be monitored with `seff` are:
    - Scaling of a job over several cores and nodes
        - Parallel jobs must always benefit from all requested resources
        - When you double the number of cores, the job should run at least 1.5x faster
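The 1.5x rule of thumb above can be checked with a few lines of Python; the timings below are hypothetical.

```python
# Simple strong-scaling check for the rule of thumb above: when doubling the
# core count, demand at least a 1.5x speedup. Timings are hypothetical.

def worth_doubling(t_n: float, t_2n: float, threshold: float = 1.5) -> bool:
    """True if going from n to 2n cores sped the job up by at least `threshold`."""
    return t_n / t_2n >= threshold

# Example run times (seconds) for a hypothetical job on n vs 2n cores:
print(worth_doubling(1000.0, 620.0))  # 1.61x speedup -> keep the extra cores
print(worth_doubling(1000.0, 800.0))  # only 1.25x    -> stay at n cores
```

In practice you would time a short test run at a few core counts (e.g. in the test partition) and stop scaling up at the first doubling that fails this check.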
# seff examples

- Left: GPU usage OK! (for this example the other metrics are also OK)
- Bottom: CPU usage way too low, memory usage too high, job killed