Path: blob/master/_slides/SRTFiles/05_BatchJobs_SRT_English_mac.srt
1 00:00:23,000 --> 00:00:25,566 On a laptop, when you run something, 2 00:00:25,566 --> 00:00:29,149 you double-click an icon and the program starts to run. 3 00:00:29,633 --> 00:00:33,316 Modern computers can run multiple tasks at the same time, 4 00:00:33,316 --> 00:00:38,566 but with too many simultaneous tasks you start to run out of memory or CPU power, 5 00:00:38,566 --> 00:00:41,083 which slows down the computer. 6 00:00:41,433 --> 00:00:45,600 In an HPC environment a great many people are using the same computer 7 00:00:45,600 --> 00:00:48,616 and they all need different amounts of resources. 8 00:00:48,933 --> 00:00:54,333 It is not possible to let everyone start their programs and run them in real time. 9 00:00:54,899 --> 00:00:58,816 A batch job tells the batch job scheduler the resource requirements, 10 00:00:58,816 --> 00:01:02,899 that is, how many resources should be made available for that particular job. 11 00:01:06,466 --> 00:01:09,849 In addition to the resource request, a batch job includes 12 00:01:09,849 --> 00:01:12,650 a script that does the actual computing. 13 00:01:13,099 --> 00:01:17,500 To ensure that there's sufficient computing power available for all users, 14 00:01:17,500 --> 00:01:19,883 the batch jobs are sent to a queue. 15 00:01:20,366 --> 00:01:24,866 Depending on what kind of resources you requested and on the load on the system, 16 00:01:24,866 --> 00:01:28,750 the job may need to queue for some time before it starts. 17 00:01:28,950 --> 00:01:33,233 On HPC systems all heavy computing needs to be done in batch jobs 18 00:01:33,233 --> 00:01:36,633 so that it gets executed on the compute nodes. 19 00:01:39,733 --> 00:01:43,383 The usage policy at docs.csc.fi says that the login nodes are 20 00:01:43,383 --> 00:01:46,200 not meant for long or heavy processes. 
21 00:01:46,633 --> 00:01:51,166 Instead, the login nodes are used for compiling, managing batch jobs, 22 00:01:51,166 --> 00:01:55,566 moving data, and light pre- and postprocessing. 23 00:01:59,716 --> 00:02:04,500 A batch job system handles the batch jobs submitted to the supercomputer. 24 00:02:05,150 --> 00:02:09,800 It keeps track of which resources exist, which requests have been made, 25 00:02:09,800 --> 00:02:14,533 which jobs to allocate resources to, and how to run those jobs with the given resources. 26 00:02:15,233 --> 00:02:18,816 The aim is to use the resources as efficiently as possible, 27 00:02:18,816 --> 00:02:21,416 but also share them in a fair way. 28 00:02:22,099 --> 00:02:24,783 For example, when a job needs a lot of memory, 29 00:02:24,783 --> 00:02:27,966 it is allocated to a node where that memory is available. 30 00:02:28,116 --> 00:02:32,099 If another job does not need that much memory, it gets allocated so 31 00:02:32,099 --> 00:02:36,783 that the more demanding jobs can run where they have enough resources. 32 00:02:42,750 --> 00:02:47,000 A fair way of allocating resources is not first come, first served, 33 00:02:47,000 --> 00:02:52,050 but instead one where everyone gets resources, at least at some point. 34 00:02:52,533 --> 00:02:57,433 Obviously a job cannot start before the requested resources are available. 35 00:02:57,866 --> 00:03:02,916 Each job has a "priority" which is used to determine the order of starting jobs. 36 00:03:03,466 --> 00:03:08,599 The "fair share" configuration has the following rules for defining the priority. 37 00:03:09,383 --> 00:03:13,400 When you submit a job, it gets an initial priority. 38 00:03:13,566 --> 00:03:18,383 This initial priority decreases if you have recently run a lot of jobs. 39 00:03:18,900 --> 00:03:23,300 That makes it possible to first run some small tests and get the results fast 40 00:03:23,300 --> 00:03:26,699 before submitting a large set of calculations. 
41 00:03:27,150 --> 00:03:31,449 Once a job is in the queue, its priority increases over time. 42 00:03:31,949 --> 00:03:36,566 At some point it will have a high enough priority and it will be its turn next. 43 00:03:37,150 --> 00:03:41,199 The priority also depends on the queue to which the job is submitted. 44 00:03:41,366 --> 00:03:43,916 For example, longrun has a lower priority 45 00:03:43,916 --> 00:03:46,883 to discourage people from using it unless necessary. 46 00:03:47,150 --> 00:03:51,033 If you do not want to queue so long for longrun resources, 47 00:03:51,033 --> 00:03:56,183 then consider refining your job so that you can use the standard three-day queues. 48 00:03:56,666 --> 00:04:02,583 The documentation at docs.csc.fi has a guide for getting started with running jobs. 49 00:04:08,716 --> 00:04:12,733 Here is an illustration of what the batch job scheduler is doing. 50 00:04:13,300 --> 00:04:17,333 The batch jobs are pictured here as two-dimensional rectangles. 51 00:04:18,100 --> 00:04:21,583 The horizontal dimension represents the number of CPUs 52 00:04:21,583 --> 00:04:24,149 and the vertical dimension represents time. 53 00:04:24,816 --> 00:04:28,399 For example, this tall but thin rectangle would correspond 54 00:04:28,399 --> 00:04:31,666 to a fairly long job which is using just one core. 55 00:04:32,083 --> 00:04:36,733 Then this short but wide rectangle would be a shorter job using many cores. 56 00:04:37,466 --> 00:04:40,750 The batch job system sees these kinds of job requests, and 57 00:04:40,750 --> 00:04:44,133 it knows about the resources that the compute nodes have. 58 00:04:44,866 --> 00:04:48,983 The aim of the batch job scheduler is to keep all the supercomputer resources 59 00:04:48,983 --> 00:04:51,133 busy with computing all of the time. 60 00:04:51,283 --> 00:04:54,199 That happens by allocating the jobs to the cores 61 00:04:54,199 --> 00:04:56,649 in a way that leaves as few gaps as possible. 
62 00:04:56,899 --> 00:05:00,949 It is like playing Tetris but with variable-size pieces. 63 00:05:01,416 --> 00:05:04,399 But if the scheduler filled each gap with small jobs, 64 00:05:04,399 --> 00:05:07,866 there would never be enough resources free for larger jobs. 65 00:05:08,583 --> 00:05:14,133 The increasing job priority ensures that everybody gets resources at some point. 66 00:05:14,966 --> 00:05:19,850 There are more resources than just CPUs that the scheduler has to take into account. 67 00:05:20,233 --> 00:05:24,800 As an example, here is a small one-core job, colored in orange. 68 00:05:25,066 --> 00:05:30,149 That job requires a lot of memory and it is using all the memory that is left in this node. 69 00:05:30,533 --> 00:05:35,933 That renders the node unavailable for new jobs to run because there's no memory left. 70 00:05:36,483 --> 00:05:40,233 It means that the other cores in that node will be idle. 71 00:05:40,550 --> 00:05:45,016 And if this job really needs and uses this memory, this is totally fine. 72 00:05:45,133 --> 00:05:51,066 That is what memory is for, and one will run out before the other - memory, or cores. 73 00:05:51,216 --> 00:05:54,283 But if the job does not need that much memory, 74 00:05:54,283 --> 00:05:58,033 then those cores will be unavailable for everyone for no reason. 75 00:05:58,283 --> 00:06:01,366 Therefore everyone will be queuing more. 76 00:06:01,649 --> 00:06:05,016 Lecture 6 will cover how to use resources effectively, and 77 00:06:05,016 --> 00:06:08,616 I hope this example shows you the importance of that topic. 78 00:06:16,250 --> 00:06:20,283 The batch job system used at CSC is called Slurm. 79 00:06:20,933 --> 00:06:24,266 If you ever wonder why you must use Slurm, 80 00:06:24,266 --> 00:06:27,766 remember that it is for sharing the awesome supercomputer resources 81 00:06:27,766 --> 00:06:29,766 in a fair and efficient way. 
82 00:06:30,283 --> 00:06:34,300 Some resources that you can request from Slurm are listed here. 83 00:06:34,966 --> 00:06:39,333 Computing time means how long the resources should be allocated to you. 84 00:06:39,733 --> 00:06:44,233 The number of cores and the amount of memory are the basic resources needed. 85 00:06:44,783 --> 00:06:48,333 Then there are a bit more special resources like GPUs 86 00:06:48,333 --> 00:06:52,583 or the NVMe disks for fast file input and output. 87 00:06:59,683 --> 00:07:03,516 There is an example script in the slides that you can use as a starting point 88 00:07:03,516 --> 00:07:05,800 when creating your first batch job. 89 00:07:06,516 --> 00:07:10,566 This one uses only one core, so it is a serial job. 90 00:07:11,583 --> 00:07:16,850 Consider a batch job as an ordinary shell script, like the ones you use with bash. 91 00:07:16,983 --> 00:07:22,516 Therefore it also starts with the line #!/bin/bash. 92 00:07:22,800 --> 00:07:27,066 The difference from plain bash scripts is that the first part of the batch job contains 93 00:07:27,066 --> 00:07:30,133 the resource requests flagged with #SBATCH. 94 00:07:30,733 --> 00:07:33,649 Remember that the hash sign is a comment symbol in bash, 95 00:07:33,649 --> 00:07:38,166 so lines starting with #SBATCH do nothing in a bash script. 96 00:07:38,833 --> 00:07:42,866 Instead, the queuing system understands those flags. 97 00:07:44,483 --> 00:07:47,416 The first one is called job name, and in this example 98 00:07:47,416 --> 00:07:49,949 we provide it with the value print-hostname. 99 00:07:50,850 --> 00:07:55,899 The following flags define the requested time and the partition in which to run the job. 100 00:07:56,816 --> 00:08:00,633 Together the number of tasks and the number of CPUs per task 101 00:08:00,633 --> 00:08:03,449 define the number of cores needed for the job. 102 00:08:04,083 --> 00:08:08,649 The last flag defines the billing project, which is very important. 
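[Editor's aside, not part of the subtitles: the point that lines starting with #SBATCH are plain bash comments can be verified locally without Slurm. The file path below is just an example.]

```shell
# Local demonstration (no Slurm needed; file path is just an example):
# a line starting with # is a bash comment, so plain bash ignores the
# #SBATCH line and only the echo command runs.
cat > /tmp/sbatch_comment_demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
echo "only the echo line runs"
EOF
bash /tmp/sbatch_comment_demo.sh
```

Only when the same file is handed to sbatch does the queuing system parse those comment lines as resource requests.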
103 00:08:09,483 --> 00:08:14,066 If you don't specify a project that you have access to, you will get an error. 104 00:08:15,383 --> 00:08:17,833 Then if you make some other mistake here, 105 00:08:17,833 --> 00:08:21,500 typically the error message will give you an idea of how to proceed. 106 00:08:22,050 --> 00:08:26,366 If you omit any of the flags, it will use some reasonable default values, 107 00:08:26,366 --> 00:08:29,233 except, of course, for the account flag. 108 00:08:30,750 --> 00:08:35,466 Then the actual commands, or computing steps, come after the resource requests. 109 00:08:36,250 --> 00:08:39,816 In general they are scripted as in any ordinary bash script. 110 00:08:40,233 --> 00:08:44,683 Some commands or variables are specific to the CSC computing environment 111 00:08:44,683 --> 00:08:48,700 and we try to provide you with the best materials to get used to those. 112 00:08:49,283 --> 00:08:53,299 Here the command srun tells the Slurm system to run a command. 113 00:08:53,833 --> 00:08:57,866 Using srun makes the resource usage reports more accurate. 114 00:08:58,799 --> 00:09:03,666 This echo command is a basic bash command that prints out the following string. 115 00:09:04,250 --> 00:09:09,000 In this example the string contains some useful environment variables. 116 00:09:10,766 --> 00:09:14,250 To use this script you can copy-paste it into a text file 117 00:09:14,250 --> 00:09:17,333 named, for example, simple_serial.bash. 118 00:09:17,899 --> 00:09:21,633 If you ran it straight in the terminal, it would run on the login node. 119 00:09:21,783 --> 00:09:24,166 Never run these on the login node! 120 00:09:24,266 --> 00:09:26,850 The right way to run this kind of script is with 121 00:09:26,850 --> 00:09:29,633 the command sbatch followed by the name of the script. 
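[Editor's aside, not part of the subtitles: a sketch of the serial batch script described above, reconstructed from the narration. The partition name "small" and the time value are assumptions, and the project placeholder is left unfilled; check the templates at docs.csc.fi for the exact values.]

```shell
#!/bin/bash
# Sketch of the serial batch job walked through in the slides.
# Assumptions: partition name "small", 5-minute time limit; replace
# <project> with a billing project you actually have access to.
#SBATCH --job-name=print-hostname   # name shown in the queue
#SBATCH --time=00:05:00             # requested run time (hh:mm:ss)
#SBATCH --partition=small           # partition (queue) to submit to
#SBATCH --ntasks=1                  # one task...
#SBATCH --cpus-per-task=1           # ...on one core: a serial job
#SBATCH --account=<project>         # billing project (no usable default)

# The computing step: srun makes the resource usage reports more accurate.
srun echo "Running on host $HOSTNAME in partition $SLURM_JOB_PARTITION"
```

Saved as, say, simple_serial.bash, this would be submitted with `sbatch simple_serial.bash`, never executed directly on a login node.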
122 00:09:29,850 --> 00:09:33,866 The documentation covers the definitions and options of the SBATCH flags. 123 00:09:42,433 --> 00:09:46,566 Whenever you start to use a new application on CSC supercomputers 124 00:09:46,566 --> 00:09:50,433 you should first consult the documentation at docs.csc.fi. 125 00:09:51,166 --> 00:09:53,950 There you might also find a batch script template 126 00:09:53,950 --> 00:09:56,833 that gives you a starting point with the application. 127 00:09:57,666 --> 00:10:00,616 The templates have some default values for the resources 128 00:10:00,616 --> 00:10:03,833 so you may try those and edit them to suit your needs. 129 00:10:04,683 --> 00:10:09,533 Of course you will need to change the actual inputs and such in the computation step. 130 00:10:10,516 --> 00:10:14,416 In general it is better to start with these application-specific templates 131 00:10:14,416 --> 00:10:16,333 than with a generic template. 132 00:10:17,250 --> 00:10:22,666 In any case, you will need to edit your batch job so that it will match your use case. 133 00:10:29,783 --> 00:10:33,816 These are the most important commands for using the queueing system. 134 00:10:34,350 --> 00:10:38,399 Submit the batch job with the command sbatch example_job.sh. 135 00:10:39,216 --> 00:10:43,600 Find all your jobs that are queuing or running with squeue -u $USER. 136 00:10:44,500 --> 00:10:49,766 Pay attention to the job ID, because it is needed for other commands, such as 137 00:10:49,833 --> 00:10:53,883 getting info on a job with scontrol show job and the job ID. 138 00:10:54,883 --> 00:10:59,433 If you want to cancel a submitted job, you can use scancel and the job ID. 139 00:11:00,450 --> 00:11:04,116 The last command in this list is seff and the job ID. 140 00:11:04,733 --> 00:11:08,600 That can be used to monitor the resources that the job used. 
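[Editor's aside, not part of the subtitles: the queueing commands just listed, collected in one place. The job ID 123456 is a placeholder for the ID that sbatch prints; these commands only work on a system running Slurm.]

```shell
sbatch example_job.sh        # submit the batch job; prints the job ID
squeue -u $USER              # list your queued and running jobs
scontrol show job 123456     # detailed info on one job (use your job ID)
scancel 123456               # cancel a queued or running job
seff 123456                  # resource usage summary for a finished job
```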
141 00:11:08,799 --> 00:11:10,966 The point is to check whether your resource request 142 00:11:10,966 --> 00:11:13,333 actually matches what the job used. 143 00:11:20,250 --> 00:11:26,383 The partitions, or job queues, have different properties that are listed at docs.csc.fi. 144 00:11:27,149 --> 00:11:31,016 The purpose of having these different job queues is that jobs can have 145 00:11:31,016 --> 00:11:35,616 very different needs, for example in terms of memory and computing time. 146 00:11:36,299 --> 00:11:39,116 So estimate your resource request with thought, 147 00:11:39,116 --> 00:11:42,516 and choose the partition that suits your job best. 148 00:11:43,166 --> 00:11:48,283 It is really bad practice to just ask for so many resources that they will always be enough. 149 00:11:48,416 --> 00:11:53,166 So please put some effort into studying how many resources your jobs actually need. 150 00:11:53,833 --> 00:11:57,933 You will also benefit yourself, because your job is likely to start a little bit earlier 151 00:11:57,933 --> 00:12:01,700 and there will be more resources for everyone using the supercomputer. 152 00:12:06,433 --> 00:12:10,466 The available partitions are listed here at docs.csc.fi. 153 00:12:11,466 --> 00:12:15,516 Follow the link in the partition name to the instructions. 154 00:12:15,799 --> 00:12:20,500 The columns in the partition spreadsheet list, for example, the limits on run time, 155 00:12:20,500 --> 00:12:25,783 the maximum number of tasks and nodes, and also the maximum memory available per job. 156 00:12:26,299 --> 00:12:29,466 Please note that there are a lot of these medium-sized nodes 157 00:12:29,466 --> 00:12:30,983 available in some queues. 158 00:12:31,149 --> 00:12:35,666 Thus if you can specify your job such that it fits on a medium-sized node, 159 00:12:35,666 --> 00:12:38,516 then there are a lot of these resources available. 
160 00:12:38,933 --> 00:12:42,750 The jobs that require more time or more memory and therefore need to use 161 00:12:42,750 --> 00:12:45,783 hugemem or longrun will probably be queuing longer. 162 00:12:46,250 --> 00:12:49,783 It is perfectly fine to ask for a lot of memory. 163 00:12:49,950 --> 00:12:52,733 This is why we have these big memory nodes. 164 00:12:52,816 --> 00:12:56,350 But if you don't need that much, then you might want to consider asking 165 00:12:56,350 --> 00:12:59,533 for resources that are faster and easier to provide. 166 00:13:04,750 --> 00:13:09,133 One way to categorize jobs is whether they use one core or multiple cores. 167 00:13:09,399 --> 00:13:13,049 A job can also be interactive or non-interactive. 168 00:13:13,266 --> 00:13:17,016 The different types of jobs are explained in the following slides. 169 00:13:17,283 --> 00:13:22,016 Serial jobs are the simplest type of job and thus a great starting point. 170 00:13:22,116 --> 00:13:25,333 It is important to know the basics of the different job types 171 00:13:25,333 --> 00:13:28,033 also when using already installed software 172 00:13:28,049 --> 00:13:31,833 - you need to know which resources to request when you start the job. 173 00:13:36,799 --> 00:13:42,133 One-core jobs are serial, because the core has to process the tasks one after another. 174 00:13:42,333 --> 00:13:46,683 Please note that you must not request more than one core for a serial job, 175 00:13:46,683 --> 00:13:49,766 because the job cannot utilise those extra cores. 176 00:13:49,950 --> 00:13:52,733 That can be seen in a resource scaling test: 177 00:13:53,000 --> 00:13:57,266 no matter how many cores you allocate, the job does not get any faster. 178 00:13:58,200 --> 00:14:01,183 You could run many serial jobs on your laptop. 179 00:14:01,250 --> 00:14:05,833 There are still reasons to run serial jobs on CSC supercomputers. 180 00:14:06,183 --> 00:14:10,399 A serial job can be part of a larger user-defined workflow. 
181 00:14:10,583 --> 00:14:14,983 The job might produce some results that you want to analyse with a supercomputer, 182 00:14:14,983 --> 00:14:19,049 or you want to share the results with your other CSC project members. 183 00:14:19,299 --> 00:14:22,299 You already know that there is a lot of preinstalled software 184 00:14:22,299 --> 00:14:24,883 available in the CSC environment. 185 00:14:25,000 --> 00:14:28,049 You don't need to install the programs, and they might even 186 00:14:28,049 --> 00:14:30,566 have a license that we have already paid for. 187 00:14:30,783 --> 00:14:35,416 Also, serial jobs can have big demands on memory or fast disk. 188 00:14:35,466 --> 00:14:40,266 This is why we have the Puhti huge memory nodes and the local NVMe disks. 189 00:14:46,049 --> 00:14:49,899 You can combine individual serial jobs to create a workflow 190 00:14:49,899 --> 00:14:52,583 that can utilise parallel resources. 191 00:14:53,416 --> 00:14:57,833 Here are two documented ways of running multiple jobs at the same time. 192 00:14:58,750 --> 00:15:03,633 Array jobs mean simply that you submit multiple jobs with a single command. 193 00:15:03,950 --> 00:15:08,183 Other high throughput workflow tools are documented in DocsCSC. 194 00:15:08,583 --> 00:15:12,500 Of course, there are hundreds of other workflow tools that somehow 195 00:15:12,500 --> 00:15:16,016 combine multiple single jobs into a bigger workflow. 196 00:15:16,833 --> 00:15:21,283 If your individual job is a serial job, then you should run it on Puhti. 197 00:15:22,266 --> 00:15:28,000 Some workflow tools make it possible to run multiple serial jobs also on Mahti. 198 00:15:28,850 --> 00:15:32,983 In that case all the jobs combined should fill at least one Mahti node, 199 00:15:32,983 --> 00:15:36,266 which means 128 cores, and they should keep it busy 200 00:15:36,266 --> 00:15:38,500 for the duration of the whole allocation. 
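[Editor's aside, not part of the subtitles: a hedged sketch of the array-job idea mentioned above. `--array` and the `$SLURM_ARRAY_TASK_ID` variable are standard Slurm; the program name, input files, partition, and project here are all hypothetical.]

```shell
#!/bin/bash
# Sketch of an array job: one sbatch submission expands into several
# independent serial jobs. Partition, project, program, and input file
# names are illustrative assumptions.
#SBATCH --job-name=array-example
#SBATCH --time=00:10:00
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --array=1-10                # ten serial tasks, indices 1..10
#SBATCH --account=<project>

# Each array member gets its own index, here used to pick an input file
# (myprog and the input files are hypothetical).
srun myprog input_${SLURM_ARRAY_TASK_ID}.dat
```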
201 00:15:39,516 --> 00:15:42,750 This is a somewhat advanced way of running jobs, so we recommend 202 00:15:42,750 --> 00:15:45,866 starting by running serial jobs on Puhti. 203 00:15:52,816 --> 00:15:56,833 A parallel job distributes the calculation over several cores. 204 00:15:57,816 --> 00:16:01,850 That means it can use many cores for the same task simultaneously. 205 00:16:02,299 --> 00:16:04,883 On the other hand, you can also use the memory of 206 00:16:04,883 --> 00:16:08,416 multiple nodes at the same time if your job requires it. 207 00:16:09,466 --> 00:16:14,633 There are two major schemes for parallelising jobs: OpenMP and MPI. 208 00:16:15,549 --> 00:16:20,450 If you run pre-installed code, then you don't need to worry about the details of these. 209 00:16:21,299 --> 00:16:25,283 Jobs parallelised with OpenMP can run only within one node, 210 00:16:25,283 --> 00:16:29,716 whereas MPI jobs can technically be spread over multiple nodes. 211 00:16:29,933 --> 00:16:33,950 In certain cases, you can even combine those two. 212 00:16:34,833 --> 00:16:40,366 The important thing is that the job resource request is different for OpenMP and MPI. 213 00:16:41,250 --> 00:16:44,166 There are instructions and example batch scripts for both 214 00:16:44,166 --> 00:16:47,666 Puhti and Mahti available at docs.csc.fi. 215 00:16:48,333 --> 00:16:52,616 Use them if there is no software-specific batch script template available. 216 00:17:00,366 --> 00:17:03,333 A GPU is a graphics processing unit. 217 00:17:03,333 --> 00:17:06,766 GPUs are developed for very efficient parallel processing, 218 00:17:06,766 --> 00:17:09,549 because graphics processing requires it. 219 00:17:09,633 --> 00:17:14,216 Hence they can also run certain kinds of HPC jobs very efficiently, 220 00:17:14,216 --> 00:17:16,683 for example machine learning jobs. 
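[Editor's aside, not part of the subtitles: returning to the remark above that the resource request differs for OpenMP and MPI, the difference can be sketched with the #SBATCH flags alone. The core counts are illustrative, not recommendations.]

```shell
# OpenMP (threads share memory): one task with several cores,
# necessarily all on the same node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# MPI (separate processes passing messages): several tasks with one
# core each; these may be spread over multiple nodes.
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
```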
221 00:17:17,633 --> 00:17:20,566 To use GPUs, a code has to be rewritten, 222 00:17:20,566 --> 00:17:24,966 compiled, and linked to libraries that can use GPU processors. 223 00:17:25,750 --> 00:17:30,583 GPU cards are also quite a bit more expensive than regular CPUs. 224 00:17:31,450 --> 00:17:35,433 Therefore, one hour on a GPU spends 60 times more 225 00:17:35,433 --> 00:17:38,233 billing units than one hour on a single CPU core. 226 00:17:39,216 --> 00:17:43,733 That means a full CPU node is a lot cheaper than one GPU. 227 00:17:44,599 --> 00:17:49,933 On Puhti you should reserve one to 10 CPU cores for each reserved GPU, 228 00:17:49,933 --> 00:17:55,500 because each Puhti GPU node has four GPU cards and 40 CPU cores. 229 00:17:56,566 --> 00:18:00,333 This means that in practice using a GPU requires more than 230 00:18:00,333 --> 00:18:04,016 60 times the billing units of a single CPU core. 231 00:18:05,016 --> 00:18:10,099 Please keep in mind that if you allocate more than 10 CPU cores for one GPU, 232 00:18:10,099 --> 00:18:15,216 the node may run out of CPUs, which renders the GPUs unavailable. 233 00:18:21,099 --> 00:18:24,750 When you log in to a supercomputer, you get a command line interface 234 00:18:24,750 --> 00:18:28,133 that enables you to execute commands on the supercomputer. 235 00:18:28,416 --> 00:18:31,866 Please remember that straight after login you are on a login node, 236 00:18:31,866 --> 00:18:34,416 which cannot be used for any computing! 237 00:18:35,333 --> 00:18:39,383 So if you want to use the powerful computational resources and to interact 238 00:18:39,383 --> 00:18:42,716 with your code, you can use interactive jobs. 239 00:18:43,716 --> 00:18:47,433 Those run on the interactive partition, which has compute nodes. 240 00:18:48,233 --> 00:18:51,683 There you can execute commands as if it were your local laptop, 241 00:18:51,683 --> 00:18:53,650 albeit a more powerful one. 
242 00:18:53,966 --> 00:18:57,400 Once you are there, you don't need to go through the queuing system. 243 00:18:57,500 --> 00:19:01,433 For example, a Jupyter notebook is something that you need to interact with, 244 00:19:01,433 --> 00:19:04,816 but which also might need some heavy computing resources. 245 00:19:05,616 --> 00:19:09,166 The tutorials about batch jobs continue from here. 246 00:19:09,333 --> 00:19:13,250 They cover the basic use cases with easy-to-follow examples. 247 00:19:13,416 --> 00:19:16,366 Remember that the batch job documentation includes some 248 00:19:16,366 --> 00:19:19,916 example batch scripts that you can start experimenting with.
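[Editor's aside, not part of the subtitles: one common way to start such an interactive session with plain Slurm is `srun --pty`, sketched below. The partition name, resource values, and project placeholder are assumptions; CSC also documents its own helper commands for interactive use at docs.csc.fi.]

```shell
# Request an interactive shell on a compute node (values illustrative).
# --pty attaches your terminal to the bash shell started on the node,
# so subsequent commands run there instead of on the login node.
srun --ntasks=1 --cpus-per-task=2 --mem=4G --time=01:00:00 \
     --partition=interactive --account=<project> --pty bash
```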