Using sacct and seff to understand resource usage of finished jobs
💬 In this tutorial we look at the seff and sacct commands. The tutorial should be done on Puhti.
💭 seff shows detailed data on used resources in an easy-to-read format, but can only show one job at a time.
💭 sacct is useful when you want to look at a listing of jobs, but by default it only shows minimal data.
Get details about batch jobs
Try sacct, which by default shows the jobs you have run on the current date (i.e. since last midnight):
Try specifying the start time of the listing using the -S option. Don't query overly long time intervals, since this causes significant load on the system (the maximum queryable interval is three months).
Look for a specific job, i.e. specify the job ID using the -j option (if you can't think of one, you can use 29712904):
To print out all the available data for a job, try:
Select only the interesting data using the -o option. For example, to see job name, job ID, used memory, job state and elapsed wall-clock time, try:
Check out the list of all available data fields with:
‼️ Note, running sacct is heavy on the batch queue system.
You should not, for example, write scripts that run it repeatedly.
Running a test job
💬 Run a simple array job to practice using seff and sacct.
☝🏻 If you have limited time, you can skip to Examining the finished job and use the job ID 29925966 (it is the same job).
Create a file named array.sh and paste the following contents in it.
Replace <project> with your actual project name, e.g. project_2001234.
Submit the job with the command:
You will see a message like:
Make note of the Slurm job ID.
Follow the progress of the job with the command:
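The submit and monitoring steps above can be sketched as follows. The function names are hypothetical helpers, and listing the queue with squeue -u "$USER" is an assumption about how the progress is followed. sbatch replies with a line of the form "Submitted batch job <jobid>", where <jobid> is the number to note down.

```shell
# Hypothetical helper names wrapping the two Slurm commands used above.

# Submit the array job script; sbatch replies "Submitted batch job <jobid>"
submit_array() { sbatch array.sh; }

# List your own jobs that are pending or running in the queue
my_queue() { squeue -u "$USER"; }
```

Run `submit_array` once from the directory containing array.sh, then `my_queue` occasionally until the sub jobs disappear from the listing.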
💭 How is an array job listed in the queue?
Examining the finished job
When the job has finished (you can no longer see any of the sub jobs with squeue), you can use sacct to study it:
Get a cleaner view by omitting the job steps:
💬 sacct is especially handy here, because it is easy to spot the failed sub jobs.
Which sub jobs failed?
Can you figure out why they failed?
How do they compare to jobs that finished?
Use seff to look at individual sub jobs, e.g.:
Try sacct with the -o option (discussed above). This time add the fields reqmem (requested memory) and timelimit (requested time):
💭 Note that in this case we cannot use the -X option, as we want to see the memory usage for each step.
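The commands for examining the finished array job can be sketched as below. The function names are hypothetical helpers, and the exact field list passed to -o is an assumption built from the fields named in this tutorial. Sub jobs of an array are addressed as <jobid>_<array index>.

```shell
# Hypothetical helper names; the job ID is passed as the first argument.

# Full listing for the array job, including the job steps
array_steps() { sacct -j "$1"; }

# One line per sub job: -X shows allocations only and omits the job steps
array_overview() { sacct -X -j "$1"; }

# Efficiency report for one sub job, e.g. subjob_report 29925966 1
subjob_report() { seff "${1}_${2}"; }

# Requested vs. used resources per step (no -X, so memory use per step is shown)
array_resources() { sacct -j "$1" -o jobname,jobid,reqmem,maxrss,timelimit,elapsed,state; }
```

For example, `array_overview 29925966` gives the compact view in which failed sub jobs are easy to spot, and `subjob_report 29925966 1` runs seff on the first sub job.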
Adjusting the job-file
Look at the error messages produced by the failed jobs.
When you know which sub jobs failed and why, adjust the resource requests as necessary.
☝🏻 If you have limited time, you can skip to step 4 and use the job ID 29926087 (it is the same job with adjusted resource requests).
Change time and memory reservations:
Re-run the failed sub jobs:
Use seff and sacct to look at the jobs. How much memory and time did they use?
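Re-running only the failed sub jobs can be sketched as below, assuming the time and memory requests in array.sh have already been edited. The function name is a hypothetical helper and the indices in the comment are placeholders, to be replaced with the sub jobs that actually failed. An --array option given on the sbatch command line takes precedence over the #SBATCH --array line in the script.

```shell
# Hypothetical helper: re-submit only the given array indices.
# Example call with placeholder indices: rerun_subjobs 3,7-9
rerun_subjobs() { sbatch --array="$1" array.sh; }
```

The argument uses the usual array syntax, i.e. a comma-separated list of indices and ranges.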
More information
💡 You can read more about array jobs and seff and sacct in Docs CSC.