Path: blob/master/_slides/SRTFiles/05_BatchJobs_SRT_English_mac.srt
1 00:00:23,000 --> 00:00:25,566 On a laptop, when you run something, 2 00:00:25,566 --> 00:00:29,149 you double-click an icon and the program starts to run. 3 00:00:29,633 --> 00:00:33,316 Modern computers can run multiple tasks at the same time, 4 00:00:33,316 --> 00:00:38,566 but with too many simultaneous tasks you start to run out of memory or CPU power, 5 00:00:38,566 --> 00:00:41,083 which slows down the computer. 6 00:00:41,433 --> 00:00:45,600 In an HPC environment a great many people are using the same computer 7 00:00:45,600 --> 00:00:48,616 and they all need different amounts of resources. 8 00:00:48,933 --> 00:00:54,333 It is not possible to let everyone start their programs and run them in real time. 9 00:00:54,899 --> 00:00:58,816 A batch job tells the batch job scheduler the resource requirements, 10 00:00:58,816 --> 00:01:02,899 that is, how many resources should be made available for that particular job. 11 00:01:06,466 --> 00:01:09,849 In addition to the resource request, a batch job includes 12 00:01:09,849 --> 00:01:12,650 a script that does the actual computing. 13 00:01:13,099 --> 00:01:17,500 To ensure that there's sufficient computing power available for all users, 14 00:01:17,500 --> 00:01:19,883 the batch jobs are sent to a queue. 15 00:01:20,366 --> 00:01:24,866 Depending on what kind of resources you requested and on the load on the system, 16 00:01:24,866 --> 00:01:28,750 the job may need to queue for some time before it starts. 17 00:01:28,950 --> 00:01:33,233 On HPC systems all heavy computing needs to be done in batch jobs 18 00:01:33,233 --> 00:01:36,633 so that it gets executed on the compute nodes. 19 00:01:39,733 --> 00:01:43,383 The usage policy at docs.csc.fi says that the login nodes are 20 00:01:43,383 --> 00:01:46,200 not meant for long or heavy processes. 
21 00:01:46,633 --> 00:01:51,166 Instead, the login nodes are used for compiling, managing batch jobs, 22 00:01:51,166 --> 00:01:55,566 moving data, and light pre- and postprocessing. 23 00:01:59,716 --> 00:02:04,500 A batch job system handles the batch jobs submitted to the supercomputer. 24 00:02:05,150 --> 00:02:09,800 It keeps track of which resources exist, which requests have been made, 25 00:02:09,800 --> 00:02:14,533 which jobs to allocate resources to, and how to run those jobs with the given resources. 26 00:02:15,233 --> 00:02:18,816 The aim is to use the resources as efficiently as possible, 27 00:02:18,816 --> 00:02:21,416 but also share them in a fair way. 28 00:02:22,099 --> 00:02:24,783 For example, when a job needs a lot of memory, 29 00:02:24,783 --> 00:02:27,966 it is allocated to a node where that memory is available. 30 00:02:28,116 --> 00:02:32,099 If another job does not need that much memory, it gets allocated so 31 00:02:32,099 --> 00:02:36,783 that the more demanding jobs can run where they have enough resources. 32 00:02:42,750 --> 00:02:47,000 A fair way of allocating resources is not first come, first served, 33 00:02:47,000 --> 00:02:52,050 but instead one where everyone gets resources, at least at some point. 34 00:02:52,533 --> 00:02:57,433 Obviously a job cannot start before the requested resources are available. 35 00:02:57,866 --> 00:03:02,916 Each job has a "priority" which is used to determine the order of starting jobs. 36 00:03:03,466 --> 00:03:08,599 The "fair share" configuration has the following rules for defining the priority. 37 00:03:09,383 --> 00:03:13,400 When you submit a job, it gets an initial priority. 38 00:03:13,566 --> 00:03:18,383 This initial priority decreases if you have recently run a lot of jobs. 39 00:03:18,900 --> 00:03:23,300 That makes it possible to first run some small tests and get the results fast 40 00:03:23,300 --> 00:03:26,699 before submitting a large set of calculations. 
41 00:03:27,150 --> 00:03:31,449 Once a job is in the queue, its priority increases over time. 42 00:03:31,949 --> 00:03:36,566 At some point it will have a high enough priority and it will be its turn next. 43 00:03:37,150 --> 00:03:41,199 The priority also depends on the queue to which the job is submitted. 44 00:03:41,366 --> 00:03:43,916 For example, longrun has a lower priority 45 00:03:43,916 --> 00:03:46,883 to discourage people from using it unless necessary. 46 00:03:47,150 --> 00:03:51,033 If you do not want to queue so long for longrun resources, 47 00:03:51,033 --> 00:03:56,183 then consider refining your job so that you can use the standard three-day queues. 48 00:03:56,666 --> 00:04:02,583 The documentation at docs.csc.fi has a guide for getting started with running jobs. 49 00:04:08,716 --> 00:04:12,733 Here is an illustration of what the batch job scheduler is doing. 50 00:04:13,300 --> 00:04:17,333 The batch jobs are pictured here as two-dimensional rectangles. 51 00:04:18,100 --> 00:04:21,583 The horizontal dimension represents the number of CPUs 52 00:04:21,583 --> 00:04:24,149 and the vertical dimension represents time. 53 00:04:24,816 --> 00:04:28,399 For example, this tall but thin rectangle would correspond 54 00:04:28,399 --> 00:04:31,666 to a fairly long job which is using just one core. 55 00:04:32,083 --> 00:04:36,733 Then this short but wide rectangle would be a shorter job using many cores. 56 00:04:37,466 --> 00:04:40,750 The batch job system sees these kinds of job requests, and 57 00:04:40,750 --> 00:04:44,133 it knows about the resources that the compute nodes have. 58 00:04:44,866 --> 00:04:48,983 The aim of the batch job scheduler is to keep all the supercomputer resources 59 00:04:48,983 --> 00:04:51,133 busy with computing all of the time. 60 00:04:51,283 --> 00:04:54,199 That happens by allocating the jobs to the cores 61 00:04:54,199 --> 00:04:56,649 in a way that leaves as few gaps as possible. 
62 00:04:56,899 --> 00:05:00,949 It is like playing Tetris but with variable-size pieces. 63 00:05:01,416 --> 00:05:04,399 But if the scheduler filled each gap with small jobs, 64 00:05:04,399 --> 00:05:07,866 there would never be enough resources free for larger jobs. 65 00:05:08,583 --> 00:05:14,133 The increasing job priority ensures that everybody gets resources at some point. 66 00:05:14,966 --> 00:05:19,850 There are more resources than just CPUs that the scheduler has to take into account. 67 00:05:20,233 --> 00:05:24,800 As an example, here is a small one-core job, colored in orange. 68 00:05:25,066 --> 00:05:30,149 That job requires a lot of memory and it is using all the memory that is left in this node. 69 00:05:30,533 --> 00:05:35,933 That renders the node unavailable for new jobs to run because there's no memory left. 70 00:05:36,483 --> 00:05:40,233 It means that the other cores in that node will be idle. 71 00:05:40,550 --> 00:05:45,016 And if this job really needs and uses this memory, this is totally fine. 72 00:05:45,133 --> 00:05:51,066 That is what memory is for, and one will run out before the other - memory, or cores. 73 00:05:51,216 --> 00:05:54,283 But if the job does not need that much memory, 74 00:05:54,283 --> 00:05:58,033 then those cores will be unavailable for everyone for no reason. 75 00:05:58,283 --> 00:06:01,366 Therefore everyone will be queuing more. 76 00:06:01,649 --> 00:06:05,016 Lecture 6 will cover how to use resources effectively, and 77 00:06:05,016 --> 00:06:08,616 I hope this example shows you the importance of that topic. 78 00:06:16,250 --> 00:06:20,283 The batch job system used at CSC is called Slurm. 79 00:06:20,933 --> 00:06:24,266 If you ever wonder why you must use Slurm, 80 00:06:24,266 --> 00:06:27,766 remember that it is for sharing the awesome supercomputer resources 81 00:06:27,766 --> 00:06:29,766 in a fair and efficient way. 
82 00:06:30,283 --> 00:06:34,300 Some resources that you can request from Slurm are listed here. 83 00:06:34,966 --> 00:06:39,333 Computing time means how long the resources should be allocated to you. 84 00:06:39,733 --> 00:06:44,233 The number of cores and the amount of memory are the basic resources needed. 85 00:06:44,783 --> 00:06:48,333 Then there are a bit more special resources like GPUs 86 00:06:48,333 --> 00:06:52,583 or the NVMe disks for fast file input and output. 87 00:06:59,683 --> 00:07:03,516 There is an example script in the slides that you can use as a starting point 88 00:07:03,516 --> 00:07:05,800 when creating your first batch job. 89 00:07:06,516 --> 00:07:10,566 This one uses only one core, so it is a serial job. 90 00:07:11,583 --> 00:07:16,850 Consider a batch job as an ordinary shell script, like the ones you use with bash. 91 00:07:16,983 --> 00:07:22,516 Therefore it also starts with the line #!/bin/bash. 92 00:07:22,800 --> 00:07:27,066 The difference from plain bash scripts is that the first part of the batch job contains 93 00:07:27,066 --> 00:07:30,133 the resource requests flagged with #SBATCH. 94 00:07:30,733 --> 00:07:33,649 Remember that the hash sign is a comment symbol in bash, 95 00:07:33,649 --> 00:07:38,166 so lines starting with #SBATCH do nothing in a bash script. 96 00:07:38,833 --> 00:07:42,866 Instead, the queuing system understands those flags. 97 00:07:44,483 --> 00:07:47,416 The first one is called job name, and in this example 98 00:07:47,416 --> 00:07:49,949 we provide it with the value print-hostname. 99 00:07:50,850 --> 00:07:55,899 The following flags define the requested time and the partition in which to run the job. 100 00:07:56,816 --> 00:08:00,633 Together the number of tasks and the number of CPUs per task 101 00:08:00,633 --> 00:08:03,449 define the number of cores needed for the job. 102 00:08:04,083 --> 00:08:08,649 The last flag defines the billing project, which is very important. 
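[Editor's aside, not part of the subtitles: the point that lines starting with #SBATCH are plain bash comments can be verified locally without Slurm. The file path below is just an example.]

```shell
# Local demonstration (no Slurm needed; file path is just an example):
# a line starting with # is a bash comment, so plain bash ignores the
# #SBATCH line and only the echo command runs.
cat > /tmp/sbatch_comment_demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
echo "only the echo line runs"
EOF
bash /tmp/sbatch_comment_demo.sh
```

Only when the same file is handed to sbatch does the queuing system parse those comment lines as resource requests.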
103 00:08:09,483 --> 00:08:14,066 If you don't specify a project that you have access to, you will get an error. 104 00:08:15,383 --> 00:08:17,833 Then if you make some other mistake here, 105 00:08:17,833 --> 00:08:21,500 typically the error message will give you an idea of how to proceed. 106 00:08:22,050 --> 00:08:26,366 If you omit any of the flags, it will use some reasonable default values, 107 00:08:26,366 --> 00:08:29,233 except, of course, for the account flag. 108 00:08:30,750 --> 00:08:35,466 Then the actual commands, or computing steps, come after the resource requests. 109 00:08:36,250 --> 00:08:39,816 In general they are scripted as in any ordinary bash script. 110 00:08:40,233 --> 00:08:44,683 Some commands or variables are specific to the CSC computing environment 111 00:08:44,683 --> 00:08:48,700 and we try to provide you with the best materials to get used to those. 112 00:08:49,283 --> 00:08:53,299 Here the command srun tells the Slurm system to run a command. 113 00:08:53,833 --> 00:08:57,866 Using srun makes the resource usage reports more accurate. 114 00:08:58,799 --> 00:09:03,666 This echo command is a basic bash command that prints out the following string. 115 00:09:04,250 --> 00:09:09,000 In this example the string contains some useful environment variables. 116 00:09:10,766 --> 00:09:14,250 To use this script you can copy-paste it into a text file 117 00:09:14,250 --> 00:09:17,333 named, for example, simple_serial.bash. 118 00:09:17,899 --> 00:09:21,633 If you ran it straight in the terminal, it would run on the login node. 119 00:09:21,783 --> 00:09:24,166 Never run these on the login node! 120 00:09:24,266 --> 00:09:26,850 The right way to run this kind of script is with 121 00:09:26,850 --> 00:09:29,633 the command sbatch followed by the name of the script. 
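[Editor's aside, not part of the subtitles: a sketch of the serial batch script described above, reconstructed from the narration. The partition name "small" and the time value are assumptions, and the project placeholder is left unfilled; check the templates at docs.csc.fi for the exact values.]

```shell
#!/bin/bash
# Sketch of the serial batch job walked through in the slides.
# Assumptions: partition name "small", 5-minute time limit; replace
# <project> with a billing project you actually have access to.
#SBATCH --job-name=print-hostname   # name shown in the queue
#SBATCH --time=00:05:00             # requested run time (hh:mm:ss)
#SBATCH --partition=small           # partition (queue) to submit to
#SBATCH --ntasks=1                  # one task...
#SBATCH --cpus-per-task=1           # ...on one core: a serial job
#SBATCH --account=<project>         # billing project (no usable default)

# The computing step: srun makes the resource usage reports more accurate.
srun echo "Running on host $HOSTNAME in partition $SLURM_JOB_PARTITION"
```

Saved as, say, simple_serial.bash, this would be submitted with `sbatch simple_serial.bash`, never executed directly on a login node.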
122 00:09:29,850 --> 00:09:33,866 The documentation covers the definitions and options of the SBATCH flags. 123 00:09:42,433 --> 00:09:46,566 Whenever you start to use a new application on CSC supercomputers 124 00:09:46,566 --> 00:09:50,433 you should first consult the documentation at docs.csc.fi. 125 00:09:51,166 --> 00:09:53,950 There you might also find a batch script template 126 00:09:53,950 --> 00:09:56,833 that gives you a starting point with the application. 127 00:09:57,666 --> 00:10:00,616 The templates have some default values for the resources 128 00:10:00,616 --> 00:10:03,833 so you may try those and edit them to suit your needs. 129 00:10:04,683 --> 00:10:09,533 Of course you will need to change the actual inputs and such in the computation step. 130 00:10:10,516 --> 00:10:14,416 In general it is better to start with these application-specific templates 131 00:10:14,416 --> 00:10:16,333 than with a generic template. 132 00:10:17,250 --> 00:10:22,666 In any case, you will need to edit your batch job so that it will match your use case. 133 00:10:29,783 --> 00:10:33,816 These are the most important commands for using the queueing system. 134 00:10:34,350 --> 00:10:38,399 Submit the batch job with the command sbatch example_job.sh. 135 00:10:39,216 --> 00:10:43,600 Find all your jobs that are queuing or running with squeue -u $USER. 136 00:10:44,500 --> 00:10:49,766 Pay attention to the job ID, because it is needed for other commands, such as 137 00:10:49,833 --> 00:10:53,883 getting info on a job with scontrol show job and the job ID. 138 00:10:54,883 --> 00:10:59,433 If you want to cancel a submitted job, you can use scancel and the job ID. 139 00:11:00,450 --> 00:11:04,116 The last command in this list is seff and the job ID. 140 00:11:04,733 --> 00:11:08,600 That can be used to monitor the resources that the job used. 
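[Editor's aside, not part of the subtitles: the queueing commands just listed, collected in one place. The job ID 123456 is a placeholder for the ID that sbatch prints; these commands only work on a system running Slurm.]

```shell
sbatch example_job.sh        # submit the batch job; prints the job ID
squeue -u $USER              # list your queued and running jobs
scontrol show job 123456     # detailed info on one job (use your job ID)
scancel 123456               # cancel a queued or running job
seff 123456                  # resource usage summary for a finished job
```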
141 00:11:08,799 --> 00:11:10,966 The point is to check whether your resource request 142 00:11:10,966 --> 00:11:13,333 actually matches what the job used. 143 00:11:20,250 --> 00:11:26,383 The partitions, or job queues, have different properties that are listed at docs.csc.fi. 144 00:11:27,149 --> 00:11:31,016 The purpose of having these different job queues is that jobs can have 145 00:11:31,016 --> 00:11:35,616 very different needs, for example in terms of memory and computing time. 146 00:11:36,299 --> 00:11:39,116 So estimate your resource request with thought, 147 00:11:39,116 --> 00:11:42,516 and choose the partition that suits your job best. 148 00:11:43,166 --> 00:11:48,283 It is really bad practice to just ask for so many resources that they will always be enough. 149 00:11:48,416 --> 00:11:53,166 So please put some effort into studying how many resources your jobs actually need. 150 00:11:53,833 --> 00:11:57,933 You will also benefit yourself, because your job is likely to start a little bit earlier 151 00:11:57,933 --> 00:12:01,700 and there will be more resources for everyone using the supercomputer. 152 00:12:06,433 --> 00:12:10,466 The available partitions are listed here at docs.csc.fi. 153 00:12:11,466 --> 00:12:15,516 Follow the link in the partition name to the instructions. 154 00:12:15,799 --> 00:12:20,500 The columns in the partition spreadsheet list, for example, the limits on run time, 155 00:12:20,500 --> 00:12:25,783 the maximum number of tasks and nodes, and also the maximum memory available per job. 156 00:12:26,299 --> 00:12:29,466 Please note that there are a lot of these medium-sized nodes 157 00:12:29,466 --> 00:12:30,983 available in some queues. 158 00:12:31,149 --> 00:12:35,666 Thus if you can specify your job such that it fits on a medium-sized node, 159 00:12:35,666 --> 00:12:38,516 then there are a lot of these resources available. 
160 00:12:38,933 --> 00:12:42,750 The jobs that require more time or more memory and therefore need to use 161 00:12:42,750 --> 00:12:45,783 hugemem or longrun will probably be queuing longer. 162 00:12:46,250 --> 00:12:49,783 It is perfectly fine to ask for a lot of memory. 163 00:12:49,950 --> 00:12:52,733 This is why we have these big memory nodes. 164 00:12:52,816 --> 00:12:56,350 But if you don't need that much, then you might want to consider asking 165 00:12:56,350 --> 00:12:59,533 for resources that are faster and easier to provide. 166 00:13:04,750 --> 00:13:09,133 One way to categorize jobs is whether they use one core or multiple cores. 167 00:13:09,399 --> 00:13:13,049 A job can also be interactive or non-interactive. 168 00:13:13,266 --> 00:13:17,016 The different types of jobs are explained in the following slides. 169 00:13:17,283 --> 00:13:22,016 Serial jobs are the simplest type of job and thus a great starting point. 170 00:13:22,116 --> 00:13:25,333 It is important to know the basics of the different job types 171 00:13:25,333 --> 00:13:28,033 also when using already installed software 172 00:13:28,049 --> 00:13:31,833 - you need to know which resources to request when you start the job. 173 00:13:36,799 --> 00:13:42,133 One-core jobs are serial, because the core has to process the tasks one after another. 174 00:13:42,333 --> 00:13:46,683 Please note that you must not request more than one core for a serial job, 175 00:13:46,683 --> 00:13:49,766 because the job cannot utilise those extra cores. 176 00:13:49,950 --> 00:13:52,733 That can be seen in a resource scaling test: 177 00:13:53,000 --> 00:13:57,266 no matter how many cores you allocate, the job does not get any faster. 178 00:13:58,200 --> 00:14:01,183 You could run many serial jobs on your laptop. 179 00:14:01,250 --> 00:14:05,833 There are still reasons to run serial jobs on CSC supercomputers. 180 00:14:06,183 --> 00:14:10,399 A serial job can be part of a larger user-defined workflow. 
181 00:14:10,583 --> 00:14:14,983 The job might produce some results that you want to analyse with a supercomputer, 182 00:14:14,983 --> 00:14:19,049 or you want to share the results with your other CSC project members. 183 00:14:19,299 --> 00:14:22,299 You already know that there is a lot of preinstalled software 184 00:14:22,299 --> 00:14:24,883 available in the CSC environment. 185 00:14:25,000 --> 00:14:28,049 You don't need to install the programs, and they might even 186 00:14:28,049 --> 00:14:30,566 have a license that we have already paid for. 187 00:14:30,783 --> 00:14:35,416 Also, serial jobs can have big demands on memory or fast disk. 188 00:14:35,466 --> 00:14:40,266 This is why we have the Puhti huge memory nodes and the local NVMe disks. 189 00:14:46,049 --> 00:14:49,899 You can combine individual serial jobs to create a workflow 190 00:14:49,899 --> 00:14:52,583 that can utilise parallel resources. 191 00:14:53,416 --> 00:14:57,833 Here are two documented ways of running multiple jobs at the same time. 192 00:14:58,750 --> 00:15:03,633 Array jobs mean simply that you submit multiple jobs with a single command. 193 00:15:03,950 --> 00:15:08,183 Other high throughput workflow tools are documented in DocsCSC. 194 00:15:08,583 --> 00:15:12,500 Of course, there are hundreds of other workflow tools that somehow 195 00:15:12,500 --> 00:15:16,016 combine multiple single jobs into a bigger workflow. 196 00:15:16,833 --> 00:15:21,283 If your individual job is a serial job, then you should run it on Puhti. 197 00:15:22,266 --> 00:15:28,000 Some workflow tools make it possible to run multiple serial jobs also on Mahti. 198 00:15:28,850 --> 00:15:32,983 In that case all the jobs combined should fill at least one Mahti node, 199 00:15:32,983 --> 00:15:36,266 which means 128 cores, and they should keep it busy 200 00:15:36,266 --> 00:15:38,500 for the duration of the whole allocation. 
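[Editor's aside, not part of the subtitles: a hedged sketch of the array-job idea mentioned above. `--array` and the `$SLURM_ARRAY_TASK_ID` variable are standard Slurm; the program name, input files, partition, and project here are all hypothetical.]

```shell
#!/bin/bash
# Sketch of an array job: one sbatch submission expands into several
# independent serial jobs. Partition, project, program, and input file
# names are illustrative assumptions.
#SBATCH --job-name=array-example
#SBATCH --time=00:10:00
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --array=1-10                # ten serial tasks, indices 1..10
#SBATCH --account=<project>

# Each array member gets its own index, here used to pick an input file
# (myprog and the input files are hypothetical).
srun myprog input_${SLURM_ARRAY_TASK_ID}.dat
```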
201 00:15:39,516 --> 00:15:42,750 This is a somewhat advanced way of running jobs, so we recommend 202 00:15:42,750 --> 00:15:45,866 starting by running serial jobs on Puhti. 203 00:15:52,816 --> 00:15:56,833 A parallel job distributes the calculation over several cores. 204 00:15:57,816 --> 00:16:01,850 That means it can use many cores for the same task simultaneously. 205 00:16:02,299 --> 00:16:04,883 On the other hand, you can also use the memory of 206 00:16:04,883 --> 00:16:08,416 multiple nodes at the same time if your job requires it. 207 00:16:09,466 --> 00:16:14,633 There are two major schemes for parallelising jobs: OpenMP and MPI. 208 00:16:15,549 --> 00:16:20,450 If you run pre-installed code, then you don't need to worry about the details of these. 209 00:16:21,299 --> 00:16:25,283 Jobs parallelised with OpenMP can run only within one node, 210 00:16:25,283 --> 00:16:29,716 whereas MPI jobs can technically be spread over multiple nodes. 211 00:16:29,933 --> 00:16:33,950 In certain cases, you can even combine those two. 212 00:16:34,833 --> 00:16:40,366 The important thing is that the job resource request is different for OpenMP and MPI. 213 00:16:41,250 --> 00:16:44,166 There are instructions and example batch scripts for both 214 00:16:44,166 --> 00:16:47,666 Puhti and Mahti available at docs.csc.fi. 215 00:16:48,333 --> 00:16:52,616 Use them if there is no software-specific batch script template available. 216 00:17:00,366 --> 00:17:03,333 A GPU is a graphics processing unit. 217 00:17:03,333 --> 00:17:06,766 GPUs are developed for very efficient parallel processing, 218 00:17:06,766 --> 00:17:09,549 because graphics processing requires it. 219 00:17:09,633 --> 00:17:14,216 Hence they can also run certain kinds of HPC jobs very efficiently, 220 00:17:14,216 --> 00:17:16,683 for example machine learning jobs. 
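[Editor's aside, not part of the subtitles: returning to the remark above that the resource request differs for OpenMP and MPI, the difference can be sketched with the #SBATCH flags alone. The core counts are illustrative, not recommendations.]

```shell
# OpenMP (threads share memory): one task with several cores,
# necessarily all on the same node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# MPI (separate processes passing messages): several tasks with one
# core each; these may be spread over multiple nodes.
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
```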
221 00:17:17,633 --> 00:17:20,566 To use GPUs, a code has to be rewritten, 222 00:17:20,566 --> 00:17:24,966 compiled, and linked to libraries that can use GPU processors. 223 00:17:25,750 --> 00:17:30,583 GPU cards are also quite a bit more expensive than regular CPUs. 224 00:17:31,450 --> 00:17:35,433 Therefore, one hour on a GPU spends 60 times more 225 00:17:35,433 --> 00:17:38,233 billing units than one hour on a single CPU core. 226 00:17:39,216 --> 00:17:43,733 That means a full CPU node is a lot cheaper than one GPU. 227 00:17:44,599 --> 00:17:49,933 On Puhti you should reserve one to 10 CPU cores for each reserved GPU, 228 00:17:49,933 --> 00:17:55,500 because each Puhti GPU node has four GPU cards and 40 CPU cores. 229 00:17:56,566 --> 00:18:00,333 This means that in practice using a GPU requires more than 230 00:18:00,333 --> 00:18:04,016 60 times the billing units of a single CPU core. 231 00:18:05,016 --> 00:18:10,099 Please keep in mind that if you allocate more than 10 CPU cores for one GPU, 232 00:18:10,099 --> 00:18:15,216 the node may run out of CPUs, which renders the GPUs unavailable. 233 00:18:21,099 --> 00:18:24,750 When you log in to a supercomputer, you get a command line interface 234 00:18:24,750 --> 00:18:28,133 that enables you to execute commands on the supercomputer. 235 00:18:28,416 --> 00:18:31,866 Please remember that straight after login you are on a login node, 236 00:18:31,866 --> 00:18:34,416 which cannot be used for any computing! 237 00:18:35,333 --> 00:18:39,383 So if you want to use the powerful computational resources and to interact 238 00:18:39,383 --> 00:18:42,716 with your code, you can use interactive jobs. 239 00:18:43,716 --> 00:18:47,433 Those run on the interactive partition, which has compute nodes. 240 00:18:48,233 --> 00:18:51,683 There you can execute commands as if it were your local laptop, 241 00:18:51,683 --> 00:18:53,650 albeit a more powerful one. 
242 00:18:53,966 --> 00:18:57,400 Once you are there, you don't need to go through the queuing system. 243 00:18:57,500 --> 00:19:01,433 For example, a Jupyter notebook is something that you need to interact with, 244 00:19:01,433 --> 00:19:04,816 but which also might need some heavy computing resources. 245 00:19:05,616 --> 00:19:09,166 The tutorials about batch jobs continue from here. 246 00:19:09,333 --> 00:19:13,250 They cover the basic use cases with easy-to-follow examples. 247 00:19:13,416 --> 00:19:16,366 Remember that the batch job documentation includes some 248 00:19:16,366 --> 00:19:19,916 example batch scripts that you can start experimenting with.
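[Editor's aside, not part of the subtitles: one common way to start such an interactive session with plain Slurm is `srun --pty`, sketched below. The partition name, resource values, and project placeholder are assumptions; CSC also documents its own helper commands for interactive use at docs.csc.fi.]

```shell
# Request an interactive shell on a compute node (values illustrative).
# --pty attaches your terminal to the bash shell started on the node,
# so subsequent commands run there instead of on the login node.
srun --ntasks=1 --cpus-per-task=2 --mem=4G --time=01:00:00 \
     --partition=interactive --account=<project> --pty bash
```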