CoCalc -- 02_HPCenv_SRT

GitHub Repository: csc-training/csc-env-eff
Path: blob/master/_slides/SRTFiles/02_HPCenv_SRT_English.srt
1
00:00:22,666 --> 00:00:26,199
We assume that a laptop computer is a familiar thing to you, 

2
00:00:26,199 --> 00:00:29,350
so this comparison shows how the components of a laptop 

3
00:00:29,350 --> 00:00:33,083
and a supercomputer roughly relate to each other. 

4
00:00:34,066 --> 00:00:38,116
So your laptop would be something similar to one node in a cluster. 

5
00:00:38,766 --> 00:00:43,766
A cluster is a collection of nodes that are linked together with fast connections.

6
00:00:43,950 --> 00:00:48,299
That collection of nodes is what makes a supercomputer so super - 

7
00:00:48,299 --> 00:00:50,100
in terms of computing power.

8
00:00:50,583 --> 00:00:52,616
Your laptop has a processor,

9
00:00:52,616 --> 00:00:56,500
and the equivalent of that in a supercomputer is called a socket. 

10
00:00:56,783 --> 00:01:00,799
A socket or a processor has CPUs or cores.

11
00:01:01,133 --> 00:01:04,633
The terms CPU and core are used interchangeably.

12
00:01:04,633 --> 00:01:07,349
Very often it's just the core. 

13
00:01:07,349 --> 00:01:11,366
Sometimes it's a CPU core and so on. 

14
00:01:12,166 --> 00:01:15,983
Simply put, you can think of a supercomputer as a set of computers

15
00:01:15,983 --> 00:01:17,583
that are connected together. 

16
00:01:17,583 --> 00:01:20,650
Each of the nodes has its own memory.

17
00:01:20,650 --> 00:01:25,400
Some nodes also have local storage, which is useful in certain workflows.

18
00:01:29,583 --> 00:01:33,333
Typically a computing cluster contains a storage system, 

19
00:01:33,333 --> 00:01:35,966
login nodes and compute nodes. 

20
00:01:36,233 --> 00:01:41,733
The supercomputer uses the login nodes for connecting with the outside world.

21
00:01:42,083 --> 00:01:48,366
They are like fat laptops that let users, for example, browse their files in the supercomputer.

22
00:01:48,633 --> 00:01:54,033
Running heavy computations on login nodes is against the CSC usage policy.

23
00:01:56,833 --> 00:01:59,516
Then there is a central storage system 

24
00:01:59,516 --> 00:02:02,733
where users can have their codes and relevant files.

25
00:02:05,133 --> 00:02:09,733
The essence of the supercomputer is, of course, the compute nodes.

26
00:02:09,833 --> 00:02:13,216
You are not supposed to log in to those, but they are dedicated

27
00:02:13,216 --> 00:02:16,550
for actually running the jobs and doing the heavy computing.

28
00:02:17,216 --> 00:02:21,266
Accessing the compute nodes is done through a batch job system.

29
00:02:23,116 --> 00:02:28,150
On CSC machines we use a batch job scheduler called Slurm.

30
00:02:28,633 --> 00:02:33,133
There are other schedulers, for example SGE, Torque or PBS,

31
00:02:33,133 --> 00:02:35,599
and they all perform the same function.

32
00:02:36,250 --> 00:02:39,233
The syntax differs a little between them.

33
00:02:39,233 --> 00:02:42,900
If you copy-paste a batch script from the internet, you have to adapt it

34
00:02:42,900 --> 00:02:45,516
to the supercomputer you are using.

35
00:02:45,750 --> 00:02:50,883
Another Slurm cluster may also use slightly different syntax than ours.

36
00:02:56,183 --> 00:03:00,383
CSC provides these HPC resources for customers. 

37
00:03:00,383 --> 00:03:04,616
Try the links in the slides to open corresponding documentation. 

38
00:03:06,000 --> 00:03:09,250
Puhti is the general purpose supercomputer.

39
00:03:09,250 --> 00:03:14,383
It has the most pre-installed applications and it can run both serial and parallel jobs. 

40
00:03:15,833 --> 00:03:19,099
Mahti is a massively parallel supercomputer. 

41
00:03:19,516 --> 00:03:23,483
Then Lumi will have even more resources for parallel computation. 

42
00:03:25,949 --> 00:03:29,983
Pouta is an infrastructure-as-a-service cloud resource.

43
00:03:30,616 --> 00:03:33,116
There you set up virtual machines and operate them - 

44
00:03:33,116 --> 00:03:35,716
so you get root access to your system.

45
00:03:36,316 --> 00:03:38,316
Rahti is a somewhat similar system,

46
00:03:38,316 --> 00:03:41,566
but everything you run there should be deployed as containers.

47
00:03:42,433 --> 00:03:44,833
Allas is for storing data. 

48
00:03:44,900 --> 00:03:49,116
It is managed by CSC and designed for large amounts of data. 

49
00:03:49,766 --> 00:03:53,533
One of the reasons for using HPC is that you may have so much data

50
00:03:53,533 --> 00:03:56,300
that you cannot keep it on your own laptop.

51
00:04:02,416 --> 00:04:06,449
This is an open-ended question and the answer depends on your needs.

52
00:04:07,183 --> 00:04:11,233
You should figure out what kind of resources your application can use. 

53
00:04:12,266 --> 00:04:16,949
Some programs can use only one core, which makes them serial programs.

54
00:04:17,166 --> 00:04:19,250
They should run in Puhti.

55
00:04:19,866 --> 00:04:22,666
Parallel programs can use more than one core,

56
00:04:22,666 --> 00:04:25,733
which is what Mahti and Lumi are designed for. 

57
00:04:26,716 --> 00:04:29,399
Other factors to consider are the memory requirement 

58
00:04:29,399 --> 00:04:34,050
and whether the application can benefit from GPUs or a fast local disk.

59
00:04:34,833 --> 00:04:38,233
Every job has a limiting stage in the workflow.

60
00:04:38,483 --> 00:04:43,083
Identifying it may give insight into choosing a suitable supercomputer.

61
00:04:43,316 --> 00:04:46,300
We recommend that you find answers to these questions

62
00:04:46,300 --> 00:04:48,983
by investigating your application. 

63
00:04:49,933 --> 00:04:53,966
Then you should check which resources are available in the different supercomputers. 

64
00:04:54,550 --> 00:04:57,383
If the code that you want to use is already installed,

65
00:04:57,383 --> 00:05:00,566
then there are probably also usage instructions.

66
00:05:00,566 --> 00:05:03,433
Then you can get started immediately with running the code

67
00:05:03,433 --> 00:05:06,050
rather than installing and learning how it works. 

68
00:05:06,833 --> 00:05:09,516
Different machines or different partitions have 

69
00:05:09,516 --> 00:05:12,883
different maximum run times or provisioning policies.

70
00:05:13,083 --> 00:05:17,100
Some let you request a single core, others only full nodes, and so on.

71
00:05:17,966 --> 00:05:21,250
We recommend always checking the documentation.

72
00:05:21,533 --> 00:05:26,316
Feel free to contact us if the documentation does not provide you with answers. 

73
00:05:33,133 --> 00:05:38,050
Together Puhti and Mahti provide users with extensive HPC resources. 

74
00:05:38,283 --> 00:05:43,166
This comparison gives a quick overview that helps you decide which one to use.

75
00:05:43,850 --> 00:05:46,649
Puhti is a more general-purpose supercomputer

76
00:05:46,649 --> 00:05:50,916
and it also has a lot more applications preinstalled than Mahti.

77
00:05:51,683 --> 00:05:55,699
The links in the slides take you to the lists of the installed applications. 

78
00:05:56,266 --> 00:06:00,300
Of course, you can also install your own applications if you want. 

79
00:06:01,466 --> 00:06:04,050
The sizes of the nodes are quite different.

80
00:06:04,333 --> 00:06:10,199
In Puhti each node has 40 cores or CPUs, and in Mahti each has 128.

81
00:06:10,899 --> 00:06:17,399
The job size - so how many CPU cores one job can use - starts from one in Puhti.

82
00:06:17,399 --> 00:06:21,449
That means you can run both serial and parallel jobs there.

83
00:06:22,683 --> 00:06:25,600
In Mahti the resources are allocated by nodes, 

84
00:06:25,600 --> 00:06:29,083
so the minimum amount that you can get is one full node. 

85
00:06:29,316 --> 00:06:31,616
That means your job should be able to use 

86
00:06:31,616 --> 00:06:34,600
at least one hundred and twenty eight cores in parallel.

87
00:06:34,633 --> 00:06:40,116
This is what Mahti is really meant for - to run these massively parallel jobs.

88
00:06:41,350 --> 00:06:46,500
If your job needs a lot of memory, then we have these huge memory nodes in Puhti. 

89
00:06:46,699 --> 00:06:51,166
Certain kinds of applications will run a lot faster with a lot of memory.

90
00:06:52,399 --> 00:06:55,983
In Mahti, there's a lot less memory per core available,

91
00:06:55,983 --> 00:06:58,916
but that is enough for most applications. 

92
00:06:59,216 --> 00:07:03,883
In comparison, your laptop probably has 8 or 16 gigs of RAM.

93
00:07:06,566 --> 00:07:09,899
Both machines have Nvidia GPU cards. 

94
00:07:09,899 --> 00:07:13,949
These are extremely good for certain machine learning jobs.

95
00:07:14,466 --> 00:07:18,016
Later we will cover whether it makes sense to use GPUs 

96
00:07:18,016 --> 00:07:20,300
or whether you should use CPUs instead.

97
00:07:21,783 --> 00:07:25,766
In Puhti we have 120 nodes with fast local disk. 

98
00:07:26,183 --> 00:07:30,000
On Mahti only the GPU nodes have this local disk. 

99
00:07:31,133 --> 00:07:35,033
To summarize: Puhti is a general purpose supercomputer.

100
00:07:35,366 --> 00:07:39,199
One node in Puhti is around 10 times faster than your laptop, 

101
00:07:39,199 --> 00:07:41,516
and there are many of those nodes.

102
00:07:41,949 --> 00:07:44,600
Mahti is for large parallel jobs.

103
00:07:44,600 --> 00:07:49,516
But if you want to use that, then be prepared to install and optimize your code.