[[Category:Basics]]
== [[Access|Access]] or "How-to-be-allowed-onto-the-supercomputer" ==
Depending on the specific supercomputer, one has to either register to get a user account or write a project proposal and apply for computing resources that way. The respective pages are linked in [[Access|this overview]].
  
After this is done and login credentials are supplied, one can proceed to [[Getting_Started#Login_or_.22How-to-now-actually-connect-to-the-supercomputer.22|login]].
== [[Nodes#Login|Login]] or "How-to-now-actually-connect-to-the-supercomputer" ==
Most HPC systems are Unix-based environments with [[shell]] (command-line) access.

To log in, one usually uses [[ssh]] to reach the respective [[Nodes#Login|Login Nodes]] (computers reserved for people just like you who want to connect to the supercomputer). Sometimes this access is restricted, so that you can only connect when you are inside the university/facility and its network. To still access the Login Nodes externally, one can 'pretend to be inside the network' by using a [[VPN|Virtual Private Network (VPN)]].
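A minimal sketch of such a login from your local terminal (the username and hostname below are placeholders; your facility's documentation lists the real login node addresses):

<pre>
# connect to a login node (replace user and host with your own)
ssh jdoe@login.hpc.example-university.de

# the same, but explicitly selecting a private SSH key
ssh -i ~/.ssh/id_ed25519 jdoe@login.hpc.example-university.de
</pre>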
 
Once there, the user can interact with the system and run (small) programs to generally test the system/software.
  
== [[File_Transfer|File Transfer]] or "How-to-get-your-data-onto-or-off-the-supercomputer" ==
There are usually several ways to get your data (files) onto the supercomputer or back to your local machine. Sometimes there are computers specifically reserved for this purpose, called [[Nodes#Copy|copy nodes]].
  
If available to you, it is recommended to use these copy nodes to move data to or from the supercomputer, since this will result in a better connection and disturb other users less. Additionally, the tools mentioned below might only work on these nodes. If there are no dedicated copy nodes, you can usually use the [[Nodes#Login|Login Nodes]] for this purpose.

Commonly used and widely supported copying tools are [[rsync]], which mirrors directories (folders) between the supercomputer and your local machine; [[scp]], which is useful for a few single files or specified file lists; and the widely used [[ftp]] or its encrypted variants sftp and ftps.
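As a sketch of how these tools are typically invoked, assuming a hypothetical copy node named copy.hpc.example-university.de:

<pre>
# mirror a local directory onto the cluster (rsync can resume interrupted transfers)
rsync -avz ./mysimulation/ jdoe@copy.hpc.example-university.de:~/mysimulation/

# copy a single result file back to your local machine
scp jdoe@copy.hpc.example-university.de:~/mysimulation/results.dat .
</pre>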
A little bit more information can be found in the [[File_Transfer|File Transfer]] article.
  
== [[Scheduling_Basics|Schedulers]] or "How-To-Run-Applications-on-a-supercomputer" ==
To run any significant program or workload on a supercomputer, generally a [[Scheduling_Basics|Batch-Scheduler]] is employed. Alongside the above-mentioned Login Nodes there are usually far more Backend Nodes in the system (computers exclusively reserved for computing, to which you cannot connect directly, also referred to as the "batch system"). A program called the Batch-Scheduler decides who gets how many of those compute resources for which amount of time. Please use the Backend Nodes for everything that is not a simple, small test running for only a few minutes; otherwise you will block the Login Nodes for everybody when you run your calculations there. These Backend Nodes make up more than 98% of a supercomputer and can only be accessed via the scheduler.
 
When you log into a supercomputer, you can run commands on the Login Nodes interactively: you type, you hit return, and the command gets executed. Schedulers work differently. You submit a series of commands (in the form of a file) and tell the scheduler approximately how many resources they will need in terms of:
  
* time: if the specified time runs out before your application finishes and exits, your job will be terminated by the scheduler
* compute resources: how many CPUs ('calculation thingies'), sockets ('CPU houses') and nodes ('computers')
* memory resources: how much RAM ('very fast memory, similar to the few books you have at home')
  
This combination of specified commands and required resources is commonly referred to as a "(batch) job".
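As an illustration, with the widely used [[SLURM]] scheduler these requirements are stated as header lines of the submitted file; the values here are invented and have to be adapted to your job and your system:

<pre>
#SBATCH --time=01:30:00        # time: terminate the job after 1.5 hours at the latest
#SBATCH --nodes=2              # compute resources: two nodes ...
#SBATCH --ntasks-per-node=24   # ... running 24 tasks (CPU cores) each
#SBATCH --mem-per-cpu=2G       # memory resources: 2 GiB of RAM per CPU core
</pre>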
 
If compute resources that match the requirements of your job later become free, the scheduler will run your specified commands on the requested hardware. This usually does not happen instantly (sometimes you have to wait a day or two), because other users are currently using the compute resources and you have to wait until their program runs finish. Furthermore, you cannot change the series of commands after submitting; in case of an error you can only terminate the job and submit a new one.
 
The file specifying this series of commands and the required resources is called a [[jobscript]]. Its format and syntax depend on the installed scheduler. When you have this jobscript ready, with the help of [[jobscript-examples]], colleagues or your local [[support]], you can submit it to the respective [[Schedulers|scheduler of your facility]]. The scheduler then waits until a set of nodes (computers) is free and allocates those to execute your job as soon as possible. Sometimes there is an optional email notification, which is sent when your job starts or finishes running.
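Staying with SLURM as an example scheduler, a minimal jobscript and the surrounding commands might look like this; the script name, program name and job id are placeholders:

<pre>
#!/bin/bash
#SBATCH --job-name=mysim       # plus resource requests as shown above
#SBATCH --time=00:10:00
#SBATCH --mail-type=BEGIN,END  # optional email on job start and end

./my_program input.dat         # the actual series of commands to run
</pre>

<pre>
sbatch jobscript.sh    # submit the job to the scheduler
squeue -u $USER        # inspect the state of your queued/running jobs
scancel 12345          # terminate a job (the id is printed by sbatch)
</pre>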
  
Be aware that your specified requirements have to fit within the boundaries of your facility's system. If you ask for more than there is, chances are the scheduler will still accept your job, which will then wait forever for hardware that is never going to be bought and installed. Information about the available hardware can be found in the [https://gauss-allianz.de/de/hpc-ecosystem overview of the Gauss Allianz] or the [[Site-specific_documentation|documentation of the different sites]]. You can also find more information on [[Getting_Started#Parallel_Programming_or_.22How-To-Use-More-Than-One-Core.22|parallelizing programs]] and an [[Schedulers|overview of the schedulers used at the different sites]].
  
== [[Modules|Modules]] or "How-To-Use-Software-Without-installing-everything-yourself" ==
Since a lot of applications rely on third-party software, there is a program on most supercomputers called the [[Modules|Module system]]. With this system, other software, like compilers or special math libraries, is easily loadable and usable. Depending on the institution, different modules might be available, but there are usually common ones like the [[Compiler#Intel_Compiler|Intel]] or [[Compiler#Gnu_Compiler_Collection|GCC]] [[Compiler|Compilers]].
  
A few common commands to talk to the module system on the supercomputer's command line are
 
{| class="wikitable" style="width: 40%;"
| module list || lists loaded modules
|-
| module avail || lists available (loadable) modules
|-
| module load/unload x || loads/unloads module x
|-
| module switch x y || switches out module x for module y
|}
  
If you recurrently need lots of modules, this loading can be automated with an [[sh-file]], so that you only have to execute the file once and it loads all the modules you need.
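A minimal sketch of such a file (the module names and versions here are invented; check module avail for what your site actually provides):

<pre>
#!/bin/bash
# load_my_modules.sh - load the module set for my project in one go
module load intel/2021
module load openmpi/4.1
module load hdf5/1.12
</pre>

Run it once per session with "source load_my_modules.sh"; sourcing it instead of executing it keeps the loaded modules in your current shell.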
 
 
 
 
== [[Parallel_Programming|Parallel Programming]] or "How-To-Use-More-Than-One-Core" ==
Computer development is currently at a point where you cannot just make a single processor run faster (e.g. by increasing its clock frequency), because physical limits of semiconductor development have been reached. Therefore the current approach is to split the work into multiple, ideally independent parts, which are then executed in parallel. Similar to cleaning your house, where everybody takes care of a few rooms, on a supercomputer this is usually done with parallel programming paradigms like [[OpenMP|Open Multi-Processing (OpenMP)]] or [[MPI|Message Passing Interface (MPI)]]. However, just as there is only one vacuum cleaner in the whole house, which not everybody can use at the same time, there are limits on how fast you can get, even with a big number of processing units/CPUs/cores (the people in the metaphor) working on your problem (cleaning the house) in parallel.
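How such a parallel program is started depends on the paradigm; as a rough sketch (program names and core counts are placeholders, and the exact MPI launch command varies between sites):

<pre>
# OpenMP: one program, several threads sharing memory within one node
export OMP_NUM_THREADS=8
./my_openmp_program

# MPI: several processes, possibly spread across nodes, communicating via messages
mpirun -np 16 ./my_mpi_program
</pre>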
