1
00:00:23,883 --> 00:00:26,350
Remember that there are many users doing their work on

2
00:00:26,350 --> 00:00:29,149
CSC supercomputers simultaneously.

3
00:00:30,083 --> 00:00:35,399
To maximise resource usage, all users should know how many resources to request.

4
00:00:36,233 --> 00:00:38,500
Rule number one is that you should not reserve

5
00:00:38,500 --> 00:00:41,216
more resources than your job actually needs.

6
00:00:41,799 --> 00:00:44,933
That will also minimise the queueing time.

7
00:00:46,416 --> 00:00:49,566
When your jobs start to need more resources,

8
00:00:49,566 --> 00:00:52,133
please justify the larger requests.

9
00:00:53,183 --> 00:00:55,833
In the end it is up to you to decide if it makes sense to

10
00:00:55,833 --> 00:00:59,466
spend five hours of work to make the job run two hours faster.

11
00:01:00,450 --> 00:01:06,233
If you are going to repeat that job multiple times in the future, it really starts to pay off.

12
00:01:06,883 --> 00:01:10,900
This lecture aims to help you optimise your resource requests.

13
00:01:16,650 --> 00:01:21,299
The picture illustrates two opposite example cases of a job on a single node.

14
00:01:22,233 --> 00:01:26,033
On the left there is a job that uses all the CPUs on that node

15
00:01:26,033 --> 00:01:28,566
but leaves most of the memory free.

16
00:01:29,349 --> 00:01:34,000
On the right there is a job that uses only one core but reserves all of the memory.

17
00:01:35,950 --> 00:01:40,000
An average job needs a few cores and a fair amount of memory.

18
00:01:40,766 --> 00:01:45,500
Usually one node is capable of hosting several jobs from different users.

19
00:01:46,316 --> 00:01:49,633
At some point the resources of a single node are all in use and

20
00:01:49,633 --> 00:01:53,533
it cannot host any additional jobs until some resources are freed.

21
00:01:56,900 --> 00:01:59,216
Typically a node runs out of cores first,

22
00:01:59,216 --> 00:02:02,483
which makes the remaining memory unavailable and wasted.

23
00:02:05,166 --> 00:02:08,733
On the other hand, if a one-core job uses all the memory,

24
00:02:08,733 --> 00:02:12,866
then all the other cores on that node remain idle and are therefore wasted.

25
00:02:14,400 --> 00:02:17,816
Both of these are extreme examples.

26
00:02:18,033 --> 00:02:22,883
If your job actually uses such resources, it is fine to reserve them.

27
00:02:23,566 --> 00:02:27,599
Remember to avoid reserving resources "just in case".

28
00:02:28,083 --> 00:02:31,683
That actually counts as wasting resources.

29
00:02:38,000 --> 00:02:40,716
And now to the billion billing unit question:

30
00:02:40,716 --> 00:02:44,366
"How do I find out how many resources my job actually used?"

31
00:02:46,099 --> 00:02:50,150
Slurm accounting is a database where every job makes an entry.

32
00:02:50,616 --> 00:02:55,250
The data can be queried with the command seff followed by the job ID.

33
00:02:55,933 --> 00:03:01,050
You can see the job ID when you submit the job or with the command squeue --me.

34
00:03:02,616 --> 00:03:06,900
The seff command first prints out where the job was run, by whom,

35
00:03:06,900 --> 00:03:10,833
whether it completed or failed, and how many nodes and cores were used.

36
00:03:11,316 --> 00:03:15,349
Then there are three different times included in the seff output.

37
00:03:16,000 --> 00:03:21,300
CPU utilized tells the actual time the CPUs spent computing something.
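As an illustration of these fields, here is a sketch of submitting a job and checking it with seff. The script name, job ID, and numbers are hypothetical, the run is assumed to use four cores, and CSC's seff prints additional fields such as the billing data described next:

    $ sbatch my_job.sh
    Submitted batch job 123456
    $ squeue --me        # shows your pending and running jobs with their job IDs
    $ seff 123456
    CPU Utilized: 00:40:00
    CPU Efficiency: 83.33% of 00:48:00 core-walltime
    Job Wall-clock time: 00:12:00
    Memory Utilized: 2.50 GB
    Memory Efficiency: 62.50% of 4.00 GB

Note how the 00:48:00 core-walltime is four times the 12-minute wall-clock time because of the four cores.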
38
00:03:22,116 --> 00:03:27,966
CPU efficiency tells how much of the total core-walltime the CPUs were active.

39
00:03:28,966 --> 00:03:32,116
The job wall-clock time tells how long the computation took

40
00:03:32,116 --> 00:03:34,816
between the actual job start and finish

41
00:03:34,816 --> 00:03:37,716
- note that it does not include the queueing time.

42
00:03:38,466 --> 00:03:41,816
Notice that in this example the core-walltime is about four times

43
00:03:41,816 --> 00:03:45,550
larger than the job wall time, because four cores were used.

44
00:03:46,333 --> 00:03:49,300
Moving forward, there is the utilised memory and

45
00:03:49,300 --> 00:03:53,083
the memory efficiency, which also shows the total memory requested.

46
00:03:53,866 --> 00:03:57,849
The rest is about the billing units consumed: which project was billed

47
00:03:57,849 --> 00:04:01,283
and how the billing units were spent on different resources.

48
00:04:02,116 --> 00:04:07,283
Here you can also see if the job used NVMe disk or GPU resources.

49
00:04:09,050 --> 00:04:12,616
Now, the important parts to note in this seff output.

50
00:04:13,266 --> 00:04:17,800
You should optimise for a high CPU efficiency and a short enough wall time,

51
00:04:17,800 --> 00:04:21,066
especially if you are running on multiple cores.

52
00:04:21,633 --> 00:04:24,566
If you try to use too many cores, up to the limit where

53
00:04:24,566 --> 00:04:27,966
the code no longer scales, the code will spend more time

54
00:04:27,966 --> 00:04:30,449
communicating information between different cores and

55
00:04:30,449 --> 00:04:33,083
less time actually computing something.

56
00:04:33,866 --> 00:04:39,633
So in that case the wall time might drop a little bit, but the efficiency would go down.

57
00:04:40,449 --> 00:04:44,833
If file I/O is slowing the code down, that will show as lower CPU efficiency,

58
00:04:44,833 --> 00:04:47,666
and you should consider using the NVMe disk.

59
00:04:48,550 --> 00:04:51,733
It is also good practice to specify the number of nodes in

60
00:04:51,733 --> 00:04:56,883
the parallel job resource request so that the job will not spread over too many nodes.

61
00:04:57,233 --> 00:05:00,983
Data transfer between nodes is slower than within one node.

62
00:05:01,250 --> 00:05:06,166
If everyone fails to specify the number of nodes, the Slurm system will spread

63
00:05:06,166 --> 00:05:10,300
parallel jobs across all nodes and there will be unnecessarily few full nodes available.

64
00:05:12,300 --> 00:05:16,183
The memory efficiency is a bit more tricky to evaluate.

65
00:05:16,916 --> 00:05:20,133
The default memory request for one core is one gigabyte,

66
00:05:20,133 --> 00:05:23,050
which is quite small but often a sufficient amount.

67
00:05:23,899 --> 00:05:28,100
It is recommended to have some buffer on top of the expected memory usage.

68
00:05:28,316 --> 00:05:32,550
In this example we have 2.5 gigabytes of memory buffer.

69
00:05:32,783 --> 00:05:38,199
Depending on the size of your job, the buffer could be, for example, from one to 10 GB.

70
00:05:38,466 --> 00:05:42,983
The nodes in Puhti have at least 192 GB of memory.

71
00:05:43,333 --> 00:05:48,050
Still, it is good to keep the buffer well below 50 gigabytes to avoid the situation

72
00:05:48,050 --> 00:05:51,100
where the node memory is full and the cores stay idle.

73
00:05:52,766 --> 00:05:57,433
For GPU jobs, low efficiency implies that you should use CPUs instead and

74
00:05:57,433 --> 00:06:00,516
make sure that the disk is not slowing you down.
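Pulling the advice above together, here is a minimal sketch of a Puhti batch script that requests only what the job is expected to need; the project name, partition, time limit, and program are hypothetical, and the memory request includes a small buffer:

    #!/bin/bash
    #SBATCH --account=project_2001234   # hypothetical project
    #SBATCH --partition=small
    #SBATCH --time=02:00:00             # enough time, not days "just in case"
    #SBATCH --nodes=1                   # keep the parallel tasks on one node
    #SBATCH --ntasks=4
    #SBATCH --mem-per-cpu=2G            # expected usage plus a small buffer

    srun ./my_program                   # launching with srun lets seff see the usage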
75
00:06:09,850 --> 00:06:13,883
Sometimes seff does not capture all the usage statistics.

76
00:06:14,633 --> 00:06:18,566
You might find that the CPU or memory usage is suspiciously low

77
00:06:18,566 --> 00:06:20,566
although the job performed well.

78
00:06:21,283 --> 00:06:23,800
In that case you should check the job wall time or

79
00:06:23,800 --> 00:06:26,966
the timing info from your program's log files.

80
00:06:27,216 --> 00:06:31,666
You can correlate the elapsed time with how many cores or how much memory you are using.

81
00:06:32,983 --> 00:06:37,133
It is recommended to use srun to launch programs in batch scripts.

82
00:06:37,383 --> 00:06:42,683
In some cases that is not feasible, but then the seff output can be missing something.

83
00:06:44,633 --> 00:06:49,166
In addition to seff, there is something called Slurm accounting, or sacct.

84
00:06:49,416 --> 00:06:53,133
You can use sacct to find details on your jobs.

85
00:06:53,566 --> 00:06:57,216
You can also look up the job IDs of all your jobs.

86
00:06:57,766 --> 00:07:02,316
Please note that these are somewhat heavy operations on the Slurm database.

87
00:07:02,733 --> 00:07:06,666
Do not query from the beginning of last year, and never put these commands

88
00:07:06,666 --> 00:07:08,466
in scripts that you loop over.

89
00:07:14,366 --> 00:07:19,016
Using resources like CPU and file storage consumes billing units.

90
00:07:19,883 --> 00:07:22,899
Billing is done per project, which means that the computing time

91
00:07:22,899 --> 00:07:25,466
and the quotas are properties of a project.

92
00:07:25,833 --> 00:07:30,083
A user can belong to many projects and choose which project will be billed.

93
00:07:30,266 --> 00:07:34,583
All the users in the same project share the same billing unit quota.

94
00:07:35,366 --> 00:07:40,333
Use the command csc-projects to see your remaining billing units per project.

95
00:07:42,100 --> 00:07:45,566
The billing scheme takes into account the requested resources

96
00:07:45,566 --> 00:07:47,583
and the time the resources are used.

97
00:07:47,983 --> 00:07:52,183
The key here is that the reserved resources are unavailable to others.

98
00:07:53,000 --> 00:07:58,066
If you reserve four cores and use only one, your project is billed for four cores,

99
00:07:58,066 --> 00:08:01,266
because no one else can use them during that time.

100
00:08:02,066 --> 00:08:06,883
On the other hand, if you reserve an hour of time and the job runs for only 10 minutes,

101
00:08:06,883 --> 00:08:10,899
your project is billed for using the resources for 10 minutes.

102
00:08:11,583 --> 00:08:15,716
That also means that if your job stops immediately because of an error,

103
00:08:15,716 --> 00:08:19,116
only a really small amount of billing units is spent.

104
00:08:25,300 --> 00:08:29,350
If you run out of billing units, you can apply for more.

105
00:08:29,916 --> 00:08:34,766
Go to the MyCSC web page, where you can monitor and apply for billing units.

106
00:08:35,466 --> 00:08:39,750
There is a separate entry for each of the projects you are involved in.

107
00:08:40,500 --> 00:08:43,450
Please spread the knowledge about CSC if you have used

108
00:08:43,450 --> 00:08:45,783
CSC resources for your research.

109
00:08:46,750 --> 00:08:50,033
Remember also to inform us about all those Nature papers and

110
00:08:50,033 --> 00:08:53,466
other publications where you have acknowledged CSC.

111
00:08:53,649 --> 00:08:57,850
A convenient way of doing that is to mention them in the resource application.
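Returning to the commands mentioned in this section, a minimal sketch of checking your quota and querying the Slurm database; the job ID and date are made up, and the queries are deliberately narrow to keep the load on the database small:

    $ csc-projects                 # remaining billing units per project
    $ sacct -j 123456 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
    $ sacct --starttime=2024-05-01 # your jobs since a given date; keep the range short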
112
00:08:57,966 --> 00:09:02,133
Mentioning them there helps us to inform our funders about the resource usage.

113
00:09:04,000 --> 00:09:09,166
You can check Docs CSC to see whether your research counts as a free-of-charge use case.

114
00:09:09,299 --> 00:09:12,183
For example, usage by universities has been paid for

115
00:09:12,183 --> 00:09:14,899
by the Ministry of Education and Culture.

116
00:09:15,566 --> 00:09:19,933
The online billing unit calculator will help you estimate how many billing units

117
00:09:19,933 --> 00:09:24,466
are needed for different types of jobs and how much that would cost in euros.

118
00:09:31,350 --> 00:09:35,799
Billing units can also be considered a kind of measure of efficiency.

119
00:09:36,716 --> 00:09:41,899
For example, a one-hour 40-core job is cheaper than a one-hour one-GPU job.

120
00:09:42,683 --> 00:09:46,700
Of course, that does not tell which of them gets more computation done

121
00:09:46,700 --> 00:09:49,350
- that is to be determined case by case.

122
00:09:50,950 --> 00:09:55,566
Here is a more detailed breakdown of the cost of different resources in billing units.

123
00:09:56,516 --> 00:10:00,566
In Puhti, one CPU core hour equals one billing unit.

124
00:10:01,350 --> 00:10:04,799
Then one GPU card hour equals 60 billing units,

125
00:10:04,799 --> 00:10:09,016
plus the cost of the CPUs that you need to allocate with the job as well.

126
00:10:09,783 --> 00:10:15,383
However, in Mahti, the resources are allocated by nodes instead of cores.

127
00:10:16,049 --> 00:10:20,816
Using one node for one hour in Mahti consumes 100 billing units.

128
00:10:21,666 --> 00:10:25,833
In Puhti you also need to request some memory for your jobs.

129
00:10:25,950 --> 00:10:30,433
One gibibyte-hour of memory equals 0.1 billing units.

130
00:10:30,799 --> 00:10:33,850
In Mahti you don't need to request memory, because you get

131
00:10:33,850 --> 00:10:36,633
all the memory in the requested node anyway.

132
00:10:37,950 --> 00:10:43,566
Regarding disk space in scratch or projappl, the first terabyte of quota is free.

133
00:10:43,916 --> 00:10:45,766
But if you need more space,

134
00:10:45,766 --> 00:10:49,633
you can apply for more by sending e-mail to the service desk.

135
00:10:50,866 --> 00:10:54,350
The billing for the extra space is done based on the usage.

136
00:10:54,433 --> 00:10:59,500
Exceeding the first (free) terabyte costs 5 billing units per terabyte per hour.

137
00:10:59,883 --> 00:11:02,899
It means you can use more space when you need some,

138
00:11:02,899 --> 00:11:07,516
but it is still a good idea to move your files elsewhere when you don't need them.

139
00:11:08,649 --> 00:11:13,799
In Allas, the billing unit cost is based on how much data you actually have there.

140
00:11:13,899 --> 00:11:18,583
One terabyte of data in Allas equals nine thousand billing units in a year.

141
00:11:19,049 --> 00:11:22,183
That favours a workflow where you move your data to Allas

142
00:11:22,183 --> 00:11:24,500
when you are not actively using it.

143
00:11:26,049 --> 00:11:31,333
There is a link to docs.csc.fi where the billing scheme is explained in more detail.

144
00:11:31,500 --> 00:11:34,733
There is a formula that you can use to calculate the total

145
00:11:34,733 --> 00:11:37,033
billing unit consumption of a job.

146
00:11:37,500 --> 00:11:40,033
The further links take you to information about

147
00:11:40,033 --> 00:11:42,983
cloud resource billing and quantum simulator billing.
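To make the scheme concrete, a worked example with hypothetical numbers: a Puhti job that reserves four cores and 8 GiB of memory for two hours costs 4 x 2 x 1 = 8 billing units for the cores plus 8 x 2 x 0.1 = 1.6 billing units for the memory, 9.6 billing units in total. The same two hours on one Mahti node would consume 2 x 100 = 200 billing units.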
148
00:11:51,166 --> 00:11:55,700
The first thing to do with any new batch job script is to test that it works.

149
00:11:56,533 --> 00:12:01,383
You don't want to queue for days just to see that a tiny typo made the job fail.

150
00:12:02,233 --> 00:12:07,700
Shorter runs queue less, so create a short test run and submit it to the queue called "test".

151
00:12:08,466 --> 00:12:10,933
That queue usually has really short queueing times and

152
00:12:10,933 --> 00:12:13,566
you will quickly see how your job performs.

153
00:12:15,533 --> 00:12:20,566
Examining the test job with seff tells you whether it actually used the requested resources.

154
00:12:21,366 --> 00:12:26,450
You can use the information to refine resource requests for similar kinds of jobs.

155
00:12:27,233 --> 00:12:31,250
If you run only one calculation, this is not so important.

156
00:12:31,250 --> 00:12:35,683
But it really pays off when you start to scale up your calculations.

157
00:12:36,733 --> 00:12:41,200
If you request too little memory or too little time, the job will fail.

158
00:12:41,250 --> 00:12:43,666
This is normal and fine.

159
00:12:44,016 --> 00:12:47,649
Usually the explanation is provided either by the queueing system

160
00:12:47,649 --> 00:12:49,733
or somewhere in the log files.

161
00:12:50,000 --> 00:12:54,016
Then you can adjust the parameters and preferably restart the job.

162
00:12:54,083 --> 00:12:58,116
Or you can run a new batch job with the same batch script.

163
00:12:58,866 --> 00:13:02,799
If you run jobs with requests so large that your jobs never fail,

164
00:13:02,799 --> 00:13:06,483
most of the resources are left unused and wasted.

165
00:13:07,166 --> 00:13:10,649
Also, your jobs will queue longer.

166
00:13:15,483 --> 00:13:20,983
If you want to use parallel computation resources, you should consider the workflow.

167
00:13:21,816 --> 00:13:27,366
For example, you could run multiple smaller simulations instead of one big simulation.

168
00:13:27,966 --> 00:13:32,766
Or maybe use a completely different code or algorithm if that is more efficient.

169
00:13:33,516 --> 00:13:37,049
Typically the easy-to-use codes written by non-specialists

170
00:13:37,049 --> 00:13:40,016
can do something well enough on a small scale.

171
00:13:40,200 --> 00:13:42,866
But when you move on to running at a large scale,

172
00:13:42,866 --> 00:13:45,450
you might need to switch to something more complicated

173
00:13:45,450 --> 00:13:47,649
that has much better performance.

174
00:13:48,899 --> 00:13:53,683
Remember, your job can be slow just because it is reading or writing a lot of files.

175
00:13:54,000 --> 00:13:57,000
Then the solution is not adding CPUs, but instead

176
00:13:57,000 --> 00:14:00,366
using the fast local storage available on some nodes.

177
00:14:02,683 --> 00:14:05,700
When optimising and considering parallel resources,

178
00:14:05,700 --> 00:14:08,766
you should think about wall time and CPU time.

179
00:14:09,566 --> 00:14:13,616
Wall time means how long it takes before the job is finished.

180
00:14:13,783 --> 00:14:17,066
CPU time is what consumes billing units, and

181
00:14:17,066 --> 00:14:20,466
it multiplies with the number of CPUs you use.

182
00:14:21,000 --> 00:14:24,149
Adding more CPUs may reduce the wall time,

183
00:14:24,149 --> 00:14:28,566
but at some point it becomes quite expensive in terms of billing units.

184
00:14:36,583 --> 00:14:40,916
This summary slide also includes links to further documentation.
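To put the wall time versus CPU time trade-off into concrete, hypothetical numbers: a Puhti job running ten hours on four cores consumes 4 x 10 = 40 core-hours, so 40 billing units. On eight cores the same job might finish in six hours, consuming 8 x 6 = 48 core-hours; the result arrives four hours sooner but costs 20% more billing units, and each further doubling typically pays back less.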
185
00:14:41,700 --> 00:14:46,399
The most important things to monitor when you start doing some heavy computing are:

186
00:14:46,399 --> 00:14:49,033
whether the job is using all of the memory,

187
00:14:49,033 --> 00:14:51,716
whether the job is using the disk efficiently,

188
00:14:51,716 --> 00:14:54,583
whether it makes sense to use GPUs,

189
00:14:54,583 --> 00:14:58,166
and whether adding more resources speeds things up.

190
00:15:00,016 --> 00:15:03,233
With memory, it is recommended to always have some reserve,

191
00:15:03,233 --> 00:15:06,200
as instructed in the slides about Slurm accounting.

192
00:15:06,816 --> 00:15:12,000
Avoiding excessive disk workload means not burdening the Lustre parallel file system,

193
00:15:12,000 --> 00:15:15,166
but using fast local storage if necessary.

194
00:15:17,333 --> 00:15:21,216
If your application can use GPUs, check that it also gains

195
00:15:21,216 --> 00:15:25,366
a real performance improvement compared to using CPUs.

196
00:15:25,983 --> 00:15:30,700
The documentation includes a quite detailed GPU usage policy.

197
00:15:31,566 --> 00:15:34,516
The GPUs should be used for those applications where

198
00:15:34,516 --> 00:15:37,049
they speed up running the jobs the most.

199
00:15:37,883 --> 00:15:40,416
In some cases - typically machine learning -

200
00:15:40,416 --> 00:15:44,316
the speedup can be even 6-fold compared to CPUs.

201
00:15:45,433 --> 00:15:49,633
If the speedup is barely at the minimum level allowed by the usage policy,

202
00:15:49,633 --> 00:15:53,600
you may lose the gain if you need to queue for the resources.

203
00:15:56,316 --> 00:15:58,916
For parallel jobs it is important to check that

204
00:15:58,916 --> 00:16:01,883
adding more resources makes the job run faster.

205
00:16:01,883 --> 00:16:05,333
Otherwise it does not make sense to run in parallel.

206
00:16:06,216 --> 00:16:09,799
The rule of thumb is that when you double the resources

207
00:16:09,799 --> 00:16:12,600
- for example from four cores to eight cores -

208
00:16:12,600 --> 00:16:16,216
the job should run at least one and a half times faster.

209
00:16:17,483 --> 00:16:21,950
The documentation covers instructions on how to perform a scaling test.

210
00:16:22,966 --> 00:16:26,100
The idea is that you run a job first with two cores,

211
00:16:26,100 --> 00:16:29,333
then with four and eight cores, and monitor the performance.

212
00:16:30,100 --> 00:16:33,983
If the running time goes down - or the performance increases -

213
00:16:33,983 --> 00:16:36,483
then it is okay to add more resources.

214
00:16:42,333 --> 00:16:46,383
Here are two illustrative examples of seff results.

215
00:16:46,683 --> 00:16:52,083
The first example job ran for only two minutes before finishing, which is fairly short.

216
00:16:52,850 --> 00:16:55,983
The CPU efficiency is really low.

217
00:16:56,516 --> 00:16:59,883
Either the system is not logging CPU usage correctly,

218
00:16:59,883 --> 00:17:02,133
or there is something wrong with this job.

219
00:17:03,083 --> 00:17:08,200
Memory efficiency is five percent out of five GB, which is also very low.

220
00:17:09,099 --> 00:17:15,299
On the other hand, this is a GPU job and the GPUs are used with 83% efficiency.

221
00:17:16,083 --> 00:17:20,599
So this job has been making really good use of the expensive resources.

222
00:17:21,349 --> 00:17:27,016
Apparently this application does not need the CPU, but keeps the GPU busy.

223
00:17:27,666 --> 00:17:31,700
There are four GPUs per GPU node.
224
00:17:31,916 --> 00:17:36,400
In this example the memory usage is well below one fourth of the available memory,

225
00:17:36,400 --> 00:17:39,933
which leaves sufficient memory for the other GPUs.

226
00:17:40,683 --> 00:17:43,616
Although 5% memory efficiency is low,

227
00:17:43,616 --> 00:17:47,866
the remaining "buffer" is only about 4 GB in this case.

228
00:17:48,433 --> 00:17:52,150
So overall, the two small efficiencies here are fine,

229
00:17:52,150 --> 00:17:56,049
because the expensive GPU resource is well utilised.

230
00:17:56,799 --> 00:18:01,849
Also, the amount of unutilised CPU and memory resources is small.

231
00:18:03,416 --> 00:18:07,450
The second example is a job that has actually failed.

232
00:18:07,799 --> 00:18:12,033
The CPU efficiency is even smaller than in the previous example.

233
00:18:12,683 --> 00:18:16,349
But this job has a problem in the memory efficiency.

234
00:18:16,750 --> 00:18:20,783
It has used more than 100% of the available memory.

235
00:18:21,166 --> 00:18:24,883
That is probably the reason why the job failed.

236
00:18:25,366 --> 00:18:29,216
In this case the job error file might have a clear error message telling that

237
00:18:29,216 --> 00:18:33,466
the job was using too much memory and was killed by the queueing system.

238
00:18:34,133 --> 00:18:36,150
That is not alarming as such -

239
00:18:36,150 --> 00:18:40,316
you might have been testing the optimal amount of resources, and this happens.

240
00:18:41,216 --> 00:18:45,599
However, the CPU efficiency here is also very small.

241
00:18:45,799 --> 00:18:50,349
It would be best to check that this job is actually doing what it is supposed to do.

242
00:18:51,683 --> 00:18:55,066
The tutorials about resource usage continue from here.

243
00:18:55,066 --> 00:18:58,966
They cover the basic use cases with easy-to-follow examples.

244
00:18:59,666 --> 00:19:03,349
The documentation has more information about the available resources,

245
00:19:03,349 --> 00:19:07,200
including technical details and example batch scripts.
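As a closing sketch of the scaling test described above, you could submit the same batch script with a doubling number of cores and compare the wall times that seff reports; the script name and core counts are assumptions:

    for n in 2 4 8; do
        sbatch --ntasks=$n --job-name=scaling_$n my_job.sh
    done
    # After the jobs finish, run seff on each job ID:
    # doubling the cores should make the job at least 1.5 times faster.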