GitHub Repository: csc-training/csc-env-eff
Path: blob/master/_slides/SRTFiles/06_Resources_SRT_English_mac.srt

1
00:00:23,883 --> 00:00:26,350
Remember that there are many users doing their work on 

2
00:00:26,350 --> 00:00:29,149
CSC supercomputers simultaneously.

3
00:00:30,083 --> 00:00:35,399
To maximise resource usage, all users should know how much of each resource to request.

4
00:00:36,233 --> 00:00:38,500
Rule number one is that you should not reserve

5
00:00:38,500 --> 00:00:41,216
more resources than your job actually needs. 

6
00:00:41,799 --> 00:00:44,933
That will also minimise the queueing time.

7
00:00:46,416 --> 00:00:49,566
When it seems your jobs start to need more resources, 

8
00:00:49,566 --> 00:00:52,133
please justify your larger requests.

9
00:00:53,183 --> 00:00:55,833
In the end it is up to you to decide if it makes sense to 

10
00:00:55,833 --> 00:00:59,466
spend five hours of work to make the job run two hours faster.

11
00:01:00,450 --> 00:01:06,233
If you are going to repeat that job multiple times in the future, it really starts to pay off. 

12
00:01:06,883 --> 00:01:10,900
This lecture aims to help you optimise your resource requests.

13
00:01:16,650 --> 00:01:21,299
The picture illustrates two opposite examples of a job on a single node.

14
00:01:22,233 --> 00:01:26,033
On the left there is a job that uses all the CPUs on that node, 

15
00:01:26,033 --> 00:01:28,566
but it leaves most of the memory free.

16
00:01:29,349 --> 00:01:34,000
On the right there is a job that uses only one core but then reserves all of the memory.

17
00:01:35,950 --> 00:01:40,000
An average job needs a few cores and a fair amount of memory.

18
00:01:40,766 --> 00:01:45,500
Usually one node is capable of hosting several jobs from different users.

19
00:01:46,316 --> 00:01:49,633
At some point the resources of a single node are all in use and 

20
00:01:49,633 --> 00:01:53,533
it cannot host any additional jobs before some resources are freed.

21
00:01:56,900 --> 00:01:59,216
Typically a node runs out of cores first,

22
00:01:59,216 --> 00:02:02,483
which makes the remaining memory unavailable and wasted.

23
00:02:05,166 --> 00:02:08,733
On the other hand, if a one-core job uses all the memory,

24
00:02:08,733 --> 00:02:12,866
then all the other cores on that node remain idle and therefore wasted.

25
00:02:14,400 --> 00:02:17,816
Both of these are extreme examples.

26
00:02:18,033 --> 00:02:22,883
If your job actually uses such resources, it is fine to reserve them.

27
00:02:23,566 --> 00:02:27,599
Remember to avoid reserving resources "just in case".

28
00:02:28,083 --> 00:02:31,683
That is considered wasting resources.

29
00:02:38,000 --> 00:02:40,716
And now to the billion billing unit question:

30
00:02:40,716 --> 00:02:44,366
"how to find out how much resources my job actually used?"

31
00:02:46,099 --> 00:02:50,150
Slurm accounting is a database where every job makes an entry.

32
00:02:50,616 --> 00:02:55,250
The data can be queried with the command seff followed by the job ID.

33
00:02:55,933 --> 00:03:01,050
You can see the job ID when you submit the job or with the command squeue --me.

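For example, submitting a job and checking it afterwards could look like this (the job ID 123456 and the script name my_job.sh are illustrative placeholders):

    sbatch my_job.sh    # prints e.g. "Submitted batch job 123456"
    squeue --me         # lists your queued and running jobs with their IDs
    seff 123456         # shows usage statistics once the job has finished
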
34
00:03:02,616 --> 00:03:06,900
The seff command first prints out where the job was run, by whom,

35
00:03:06,900 --> 00:03:10,833
whether it completed or failed, and how many nodes and cores were used.

36
00:03:11,316 --> 00:03:15,349
Then there are three different times included in the seff output.

37
00:03:16,000 --> 00:03:21,300
CPU utilized tells the actual time the CPUs spent computing something.

38
00:03:22,116 --> 00:03:27,966
CPU efficiency tells how much of the total core-walltime the CPUs were active.

39
00:03:28,966 --> 00:03:32,116
The job wall-clock time tells how long the computation took

40
00:03:32,116 --> 00:03:34,816
between the actual job start and job finish

41
00:03:34,816 --> 00:03:37,716
 - note that it does not include the queueing time.

42
00:03:38,466 --> 00:03:41,816
Notice that in this example the core-walltime is about four times 

43
00:03:41,816 --> 00:03:45,550
larger than the job wall-clock time, because four cores were used.

44
00:03:46,333 --> 00:03:49,300
Moving forward, there is the utilised memory and 

45
00:03:49,300 --> 00:03:53,083
the memory efficiency, which is relative to the total memory requested.

46
00:03:53,866 --> 00:03:57,849
The rest is about the billing units consumed, which project was billed 

47
00:03:57,849 --> 00:04:01,283
and how the billing units were spent on different resources.

48
00:04:02,116 --> 00:04:07,283
Here you can also see if the job used NVMe disk or GPU resources.

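As a sketch, the generic part of a seff report looks roughly like the following (all numbers are invented for illustration; the CSC version appends the billing unit breakdown described above):

    Job ID: 123456
    Cluster: puhti
    State: COMPLETED (exit code 0)
    Nodes: 1
    Cores per node: 4
    CPU Utilized: 03:30:00
    CPU Efficiency: 87.50% of 04:00:00 core-walltime
    Job Wall-clock time: 01:00:00
    Memory Utilized: 2.50 GB
    Memory Efficiency: 62.50% of 4.00 GB
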
49
00:04:09,050 --> 00:04:12,616
Now, the important parts to note in this seff output.

50
00:04:13,266 --> 00:04:17,800
You should optimise for a high CPU efficiency and short enough wall time,

51
00:04:17,800 --> 00:04:21,066
especially if you're running on multiple cores.

52
00:04:21,633 --> 00:04:24,566
If you try to use so many cores that you reach the limit where

53
00:04:24,566 --> 00:04:27,966
the code no longer scales, it will spend more time

54
00:04:27,966 --> 00:04:30,449
communicating information between different cores and

55
00:04:30,449 --> 00:04:33,083
less time actually computing something. 

56
00:04:33,866 --> 00:04:39,633
So in that case, the wall time might drop a little bit, but the efficiency would go down.

57
00:04:40,449 --> 00:04:44,833
If file I/O is slowing the code down, that will show as lower CPU efficiency

58
00:04:44,833 --> 00:04:47,666
and you should consider using the local NVMe disk.

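On Puhti, local NVMe scratch is requested with a gres option, roughly like the following sketch (the 100 GB size is just an example):

    #SBATCH --gres=nvme:100    # request 100 GB of fast local NVMe disk
    # the local disk is then visible to the job as $LOCAL_SCRATCH
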
59
00:04:48,550 --> 00:04:51,733
Also, it is good practice to specify the number of nodes in

60
00:04:51,733 --> 00:04:56,883
the parallel job resource request so that the job will not spread over too many nodes.

61
00:04:57,233 --> 00:05:00,983
The data transfer between nodes is slower than within one node. 

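In a batch script this could be done roughly as follows (the counts are placeholders; Puhti CPU nodes have 40 cores):

    #SBATCH --nodes=1             # keep the whole job on a single node
    #SBATCH --ntasks-per-node=40  # use all cores of that node
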
62
00:05:01,250 --> 00:05:06,166
If everyone fails to specify the number of nodes, the Slurm system will spread

63
00:05:06,166 --> 00:05:10,300
parallel jobs across all nodes, and unnecessarily few free full nodes will remain available.

64
00:05:12,300 --> 00:05:16,183
Memory efficiency is a bit more tricky to evaluate.

65
00:05:16,916 --> 00:05:20,133
The default memory request for one core is one gigabyte 

66
00:05:20,133 --> 00:05:23,050
which is quite small but often a sufficient amount.

67
00:05:23,899 --> 00:05:28,100
It is recommended to have some buffer on top of the assumed memory amount.

68
00:05:28,316 --> 00:05:32,550
In this example we have 2.5 gigabytes of memory buffer.

69
00:05:32,783 --> 00:05:38,199
Depending on the size of your job, the buffer could be, for example, from 1 to 10 GB.

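In a batch script, the request from this example could look like the following sketch (assuming, for illustration, that the job itself needs about 5.5 GB, plus the 2.5 GB buffer mentioned above):

    #SBATCH --mem=8G    # assumed need ~5.5 GB + 2.5 GB buffer
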
70
00:05:38,466 --> 00:05:42,983
The nodes in Puhti have at least 192 GB of memory.

71
00:05:43,333 --> 00:05:48,050
Still it is good to keep the buffer well below 50 gigabytes to avoid the situation

72
00:05:48,050 --> 00:05:51,100
where node memory is full and the cores stay idle.

73
00:05:52,766 --> 00:05:57,433
For GPU jobs, low efficiency implies that you should use CPUs instead and

74
00:05:57,433 --> 00:06:00,516
make sure that the disk is not slowing you down. 

75
00:06:09,850 --> 00:06:13,883
Sometimes seff does not capture all the usage statistics.

76
00:06:14,633 --> 00:06:18,566
You might find that the CPU or memory usage is suspiciously low 

77
00:06:18,566 --> 00:06:20,566
although the job performed well.

78
00:06:21,283 --> 00:06:23,800
In that case you should check the job wall time or the

79
00:06:23,800 --> 00:06:26,966
timing info from your program log files. 

80
00:06:27,216 --> 00:06:31,666
You can correlate the elapsed time with how many cores or how much memory you're using.

81
00:06:32,983 --> 00:06:37,133
It's recommended to use srun to launch programs in batch scripts.

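A minimal batch script using srun could look like this sketch (the account, partition and program names are placeholders):

    #!/bin/bash
    #SBATCH --account=project_2001234   # placeholder project
    #SBATCH --partition=small
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00
    #SBATCH --mem-per-cpu=1G

    # launching via srun lets Slurm record the usage statistics seff reads
    srun ./my_program
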
82
00:06:37,383 --> 00:06:42,683
In some cases that is not feasible, but then the seff output may be missing something.

83
00:06:44,633 --> 00:06:49,166
In addition to seff, there is something called Slurm Accounting or sacct. 

84
00:06:49,416 --> 00:06:53,133
You can use sacct to find details on the jobs.

85
00:06:53,566 --> 00:06:57,216
You can also look up the job IDs of all your jobs.

86
00:06:57,766 --> 00:07:02,316
Please note that these operations are a little bit heavy on the Slurm database.

87
00:07:02,733 --> 00:07:06,666
Do not query from the beginning of last year, and never put these commands

88
00:07:06,666 --> 00:07:08,466
in scripts that you loop over.

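For example, a reasonably light query limited to a short, recent time window might look like this (the dates and output fields are illustrative):

    sacct -S 2024-05-01 -E 2024-05-03 \
          -o JobID,JobName,Partition,Elapsed,State,MaxRSS
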
89
00:07:14,366 --> 00:07:19,016
Using resources like CPU time and file storage consumes billing units.

90
00:07:19,883 --> 00:07:22,899
Billing is done per project, which means that the computing time

91
00:07:22,899 --> 00:07:25,466
and the quotas are properties of a project. 

92
00:07:25,833 --> 00:07:30,083
A user can belong to many projects and choose which project will be billed.

93
00:07:30,266 --> 00:07:34,583
All the users in the same project use the same billing unit quota.

94
00:07:35,366 --> 00:07:40,333
Use the command csc-projects to see your remaining billing units per project.

95
00:07:42,100 --> 00:07:45,566
The billing scheme takes into account the requested resources

96
00:07:45,566 --> 00:07:47,583
and the time the resources are used.

97
00:07:47,983 --> 00:07:52,183
The key here is that reserved resources are unavailable to others.

98
00:07:53,000 --> 00:07:58,066
If you reserve four cores and use only one, your project is billed for four cores,

99
00:07:58,066 --> 00:08:01,266
because no one else can use those during that time.

100
00:08:02,066 --> 00:08:06,883
On the other hand if you reserve an hour of time and the job runs only for 10 minutes, 

101
00:08:06,883 --> 00:08:10,899
your project is billed for using resources for 10 minutes.

102
00:08:11,583 --> 00:08:15,716
That means also that if your job stops immediately because of an error, 

103
00:08:15,716 --> 00:08:19,116
only a really small amount of billing units is spent.

104
00:08:25,300 --> 00:08:29,350
If you run out of billing units, you can apply for more. 

105
00:08:29,916 --> 00:08:34,766
Go to the MyCSC web page, where you can monitor and apply for billing units.

106
00:08:35,466 --> 00:08:39,750
There is a separate entry for each of the projects you are involved in. 

107
00:08:40,500 --> 00:08:43,450
Please spread the knowledge about CSC if you have used 

108
00:08:43,450 --> 00:08:45,783
CSC resources for your research.

109
00:08:46,750 --> 00:08:50,033
Remember also to inform us about all those Nature papers and

110
00:08:50,033 --> 00:08:53,466
other publications where you have acknowledged CSC.

111
00:08:53,649 --> 00:08:57,850
A convenient way of doing that is to mention them in the resource application.

112
00:08:57,966 --> 00:09:02,133
That helps us to inform our funders about the resource usage.

113
00:09:04,000 --> 00:09:09,166
You can check Docs CSC to see whether your research is considered a free-to-use case.

114
00:09:09,299 --> 00:09:12,183
For example, usage by universities has been paid for

115
00:09:12,183 --> 00:09:14,899
by the Ministry of Education and Culture.

116
00:09:15,566 --> 00:09:19,933
The online billing unit calculator will help you to estimate how many billing units 

117
00:09:19,933 --> 00:09:24,466
are needed for different types of jobs and how much that would cost in euros.

118
00:09:31,350 --> 00:09:35,799
Billing units can also be considered as a kind of measure of efficiency.

119
00:09:36,716 --> 00:09:41,899
For example, a one-hour 40-core job is cheaper than a one-hour one-GPU job.

120
00:09:42,683 --> 00:09:46,700
Of course that does not tell which of them gets more computation done

121
00:09:46,700 --> 00:09:49,350
 - that is to be determined case-by-case.

122
00:09:50,950 --> 00:09:55,566
Here is a more detailed view of the cost of different resources in billing units.

123
00:09:56,516 --> 00:10:00,566
In Puhti, one CPU core hour equals one billing unit.

124
00:10:01,350 --> 00:10:04,799
Then one GPU card hour equals 60 billing units,

125
00:10:04,799 --> 00:10:09,016
plus the billing for the CPU cores you allocate with the job.

126
00:10:09,783 --> 00:10:15,383
However, in Mahti, the resources are allocated by nodes instead of cores.

127
00:10:16,049 --> 00:10:20,816
Using one node for one hour in Mahti consumes 100 billing units. 

128
00:10:21,666 --> 00:10:25,833
In Puhti you also need to request some memory for your jobs.

129
00:10:25,950 --> 00:10:30,433
One gibibyte-hour of memory equals 0.1 billing units.

130
00:10:30,799 --> 00:10:33,850
In Mahti you don't need to request memory because you get 

131
00:10:33,850 --> 00:10:36,633
all the memory in the requested node anyway.

132
00:10:37,950 --> 00:10:43,566
Regarding disk space in Scratch or projAppl, the first terabyte of quota is free. 

133
00:10:43,916 --> 00:10:45,766
But if you need more space, 

134
00:10:45,766 --> 00:10:49,633
you can apply for more by sending e-mail to the service desk.

135
00:10:50,866 --> 00:10:54,350
The billing for the extra space is done based on the usage.

136
00:10:54,433 --> 00:10:59,500
Exceeding the first (free) terabyte costs 5 billing units per terabyte per hour.

137
00:10:59,883 --> 00:11:02,899
This means you can use more space when you need it,

138
00:11:02,899 --> 00:11:07,516
but it is still a good idea to move your files elsewhere when you don't need them.

139
00:11:08,649 --> 00:11:13,799
In Allas, the billing unit cost is based on how much data you actually have there.

140
00:11:13,899 --> 00:11:18,583
One terabyte of data in Allas equals nine thousand billing units in a year.

141
00:11:19,049 --> 00:11:22,183
That favours a workflow where you move your data to Allas

142
00:11:22,183 --> 00:11:24,500
when you are not actively using it.

143
00:11:26,049 --> 00:11:31,333
There is a link to docs.csc.fi where the billing scheme is explained in more detail.

144
00:11:31,500 --> 00:11:34,733
There is a formula that you can use to calculate the total

145
00:11:34,733 --> 00:11:37,033
billing unit consumption for a job.

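As a rough worked example using the Puhti rates above: a two-hour job that reserves four cores and 8 GiB of memory would consume about 4 × 2 × 1 + 8 × 2 × 0.1 = 9.6 billing units.
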
146
00:11:37,500 --> 00:11:40,033
The further links take you to the information about 

147
00:11:40,033 --> 00:11:42,983
cloud resource billing and quantum simulator billing. 

148
00:11:51,166 --> 00:11:55,700
The first thing to do with any new batch job script is to test that it works.

149
00:11:56,533 --> 00:12:01,383
You don't want to queue for days just to see that your tiny typo made the job fail.

150
00:12:02,233 --> 00:12:07,700
Shorter runs queue for less time, so create a short test run and submit it to the queue called "test".

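On Puhti that could mean something like the following lines in the batch script (the 15-minute limit is just an example; check the partition limits in the docs):

    #SBATCH --partition=test
    #SBATCH --time=00:15:00    # keep the test run short
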
151
00:12:08,466 --> 00:12:10,933
That queue usually has really short queueing times and

152
00:12:10,933 --> 00:12:13,566
you will quickly see how your job performs. 

153
00:12:15,533 --> 00:12:20,566
Examining the test job with seff tells you if it actually used the requested resources.

154
00:12:21,366 --> 00:12:26,450
You can use the information to refine resource requests for similar kinds of jobs. 

155
00:12:27,233 --> 00:12:31,250
If you run only one calculation, it is not so important.

156
00:12:31,250 --> 00:12:35,683
But that really pays off when you start to scale up your calculations.

157
00:12:36,733 --> 00:12:41,200
If you request too little memory or too little time, the job will fail. 

158
00:12:41,250 --> 00:12:43,666
This is normal and fine.

159
00:12:44,016 --> 00:12:47,649
Usually the explanation is provided either by the queuing system

160
00:12:47,649 --> 00:12:49,733
or somewhere in the log files. 

161
00:12:50,000 --> 00:12:54,016
Then you can adjust the parameters and preferably restart the job. 

162
00:12:54,083 --> 00:12:58,116
Or you can run a new batch job with the same batch script.

163
00:12:58,866 --> 00:13:02,799
If you run jobs with such large requests that your jobs never fail,

164
00:13:02,799 --> 00:13:06,483
most of the resources will be left unused and wasted.

165
00:13:07,166 --> 00:13:10,649
Also, your jobs will be queueing longer.

166
00:13:15,483 --> 00:13:20,983
If you want to use parallel computation resources, you should consider the workflow.

167
00:13:21,816 --> 00:13:27,366
For example, you could run multiple smaller simulations instead of one big simulation.

168
00:13:27,966 --> 00:13:32,766
Or maybe use a completely different code or algorithm if that is more efficient. 

169
00:13:33,516 --> 00:13:37,049
Typically the easy-to-use codes written by non-specialists

170
00:13:37,049 --> 00:13:40,016
can do something well enough on a small scale.

171
00:13:40,200 --> 00:13:42,866
But when you move on to running at a large scale,

172
00:13:42,866 --> 00:13:45,450
you might need to switch to something more complicated 

173
00:13:45,450 --> 00:13:47,649
that has much better performance. 

174
00:13:48,899 --> 00:13:53,683
Remember your job can be slow just because it is reading or writing a lot of files. 

175
00:13:54,000 --> 00:13:57,000
Then the solution is not adding CPUs but instead 

176
00:13:57,000 --> 00:14:00,366
using the fast local storage available in some nodes.

177
00:14:02,683 --> 00:14:05,700
When optimising and considering parallel resources 

178
00:14:05,700 --> 00:14:08,766
you should think about wall time and CPU time.

179
00:14:09,566 --> 00:14:13,616
Wall time means how long it takes before the job is finished. 

180
00:14:13,783 --> 00:14:17,066
CPU time is what consumes billing units and

181
00:14:17,066 --> 00:14:20,466
it multiplies with the number of CPUs you use.

182
00:14:21,000 --> 00:14:24,149
Adding more CPUs may reduce the wall time, 

183
00:14:24,149 --> 00:14:28,566
but at some point it becomes quite expensive in terms of billing units. 

184
00:14:36,583 --> 00:14:40,916
This summary slide also includes links to further documentation.

185
00:14:41,700 --> 00:14:46,399
The most important things to monitor when you start doing some heavy computing are:

186
00:14:46,399 --> 00:14:49,033
whether the job is using all of the memory,

187
00:14:49,033 --> 00:14:51,716
whether the job is using disk efficiently,

188
00:14:51,716 --> 00:14:54,583
whether it makes sense to use GPUs,

189
00:14:54,583 --> 00:14:58,166
and whether adding more resources speeds things up.

190
00:15:00,016 --> 00:15:03,233
With memory, it is recommended to always have some reserve 

191
00:15:03,233 --> 00:15:06,200
as instructed in the slides about Slurm accounting.

192
00:15:06,816 --> 00:15:12,000
Avoiding excessive disk workload means not burdening the Lustre parallel file system,

193
00:15:12,000 --> 00:15:15,166
but using fast local storage if necessary.

194
00:15:17,333 --> 00:15:21,216
If your application can use GPUs, check that it also gains 

195
00:15:21,216 --> 00:15:25,366
a real performance improvement compared to using CPUs.

196
00:15:25,983 --> 00:15:30,700
The documentation includes a quite detailed GPU usage policy. 

197
00:15:31,566 --> 00:15:34,516
The GPUs should be used for those applications where

198
00:15:34,516 --> 00:15:37,049
they speed up running the jobs the most.

199
00:15:37,883 --> 00:15:40,416
In some cases - typically machine learning 

200
00:15:40,416 --> 00:15:44,316
 - the speedup can be even 6-fold compared to CPUs. 

201
00:15:45,433 --> 00:15:49,633
If the speedup is barely at the minimum level allowed by the usage policy, 

202
00:15:49,633 --> 00:15:53,600
you may lose the gain if you need to queue for the resources.

203
00:15:56,316 --> 00:15:58,916
For parallel jobs it is important to check that

204
00:15:58,916 --> 00:16:01,883
adding more resources makes the job run faster. 

205
00:16:01,883 --> 00:16:05,333
Otherwise it does not make sense to run in parallel. 

206
00:16:06,216 --> 00:16:09,799
The kind of rule of thumb is that when you double the resources

207
00:16:09,799 --> 00:16:12,600
 - for example from four cores to eight cores - 

208
00:16:12,600 --> 00:16:16,216
the job should run at least one and a half times faster. 

209
00:16:17,483 --> 00:16:21,950
The documentation covers instructions on how to perform a scaling test.

210
00:16:22,966 --> 00:16:26,100
The idea is that you run a job first with two cores, 

211
00:16:26,100 --> 00:16:29,333
then four and eight cores and monitor the performance. 

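A simple way to run such a scaling test is to submit the same script with different core counts, for example (my_job.sh is a placeholder; command-line options override the script's own #SBATCH lines):

    for n in 2 4 8; do
        sbatch --ntasks=$n --job-name=scaling_$n my_job.sh
    done
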
212
00:16:30,100 --> 00:16:33,983
If the running time goes down - or the performance increases

213
00:16:33,983 --> 00:16:36,483
 - then it is okay to add more resources. 

214
00:16:42,333 --> 00:16:46,383
Here are two illustrative examples of seff results. 

215
00:16:46,683 --> 00:16:52,083
The first example job ran for only two minutes before finishing, which is fairly short.

216
00:16:52,850 --> 00:16:55,983
The CPU efficiency is really low. 

217
00:16:56,516 --> 00:16:59,883
Either the system is not logging CPU usage correctly 

218
00:16:59,883 --> 00:17:02,133
or there is something wrong with this job. 

219
00:17:03,083 --> 00:17:08,200
Memory efficiency is five percent out of five GB, which is also very low. 

220
00:17:09,099 --> 00:17:15,299
On the other hand, this is a GPU job and the GPUs are used with 83% efficiency. 

221
00:17:16,083 --> 00:17:20,599
So this job has been making really good use of the expensive resources. 

222
00:17:21,349 --> 00:17:27,016
Apparently this application does not need the CPU, but keeps the GPU busy. 

223
00:17:27,666 --> 00:17:31,700
There are four GPUs per GPU node.

224
00:17:31,916 --> 00:17:36,400
In this example the memory usage is well below one fourth of the available memory, 

225
00:17:36,400 --> 00:17:39,933
which leaves sufficient memory for other GPUs.

226
00:17:40,683 --> 00:17:43,616
Although 5% memory efficiency is low,

227
00:17:43,616 --> 00:17:47,866
the remaining "buffer size" is only about 4 GB in this case.

228
00:17:48,433 --> 00:17:52,150
So overall, the two small efficiencies here are fine, 

229
00:17:52,150 --> 00:17:56,049
because of the well utilised expensive GPU resource.

230
00:17:56,799 --> 00:18:01,849
Also, the amount of unutilised CPU and memory resources is small.

231
00:18:03,416 --> 00:18:07,450
The second example is a job that has actually failed.

232
00:18:07,799 --> 00:18:12,033
The CPU efficiency is even smaller than with the previous example.

233
00:18:12,683 --> 00:18:16,349
But it has a problem with the memory efficiency.

234
00:18:16,750 --> 00:18:20,783
It has used more than 100% of available memory. 

235
00:18:21,166 --> 00:18:24,883
That is probably the reason why the job has failed. 

236
00:18:25,366 --> 00:18:29,216
In this case the job error file might have a clear error message stating that

237
00:18:29,216 --> 00:18:33,466
the job was using too much memory and it was killed by the queueing system. 

238
00:18:34,133 --> 00:18:36,150
That is not alarming as such - 

239
00:18:36,150 --> 00:18:40,316
you might have been testing the optimal amount of resources and this happens.

240
00:18:41,216 --> 00:18:45,599
However, the CPU efficiency here is also very small. 

241
00:18:45,799 --> 00:18:50,349
It would be best to check that this job is actually doing what it is supposed to do. 

242
00:18:51,683 --> 00:18:55,066
The tutorials about resource usage continue from here. 

243
00:18:55,066 --> 00:18:58,966
They cover the basic use cases with easy-to-follow examples.

244
00:18:59,666 --> 00:19:03,349
The documentation has more information about available resources

245
00:19:03,349 --> 00:19:07,200
including technical details and example batch scripts.