Parallel Programming - Revision history

Kamil-braschke-0d3e@uni-wuppertal.de: /* Shared Memory */

2022-04-06T14:23:31Z

Shared Memory

Paul-kapinos-e26d@rwth-aachen.de at 22:49, 20 May 2020

2020-05-20T22:49:15Z

Paul-kapinos-e26d@rwth-aachen.de: /* Should I use Distributed Memory or Shared Memory? */

2020-05-20T21:12:27Z

Should I use Distributed Memory or Shared Memory?

Paul-kapinos-e26d@rwth-aachen.de: /* Distributed Memory */

2020-05-20T21:02:15Z

Distributed Memory

Paul-kapinos-e26d@rwth-aachen.de: /* Shared Memory */

2020-05-20T20:54:53Z

Shared Memory

Paul-kapinos-e26d@rwth-aachen.de: /* Shared Memory */

2020-05-20T20:49:03Z

Shared Memory

Paul-kapinos-e26d@rwth-aachen.de: /* Should I use Distributed Memory or Shared Memory? */

2020-05-20T20:20:04Z

Should I use Distributed Memory or Shared Memory?

Paul-kapinos-e26d@rwth-aachen.de at 20:06, 20 May 2020

2020-05-20T20:06:14Z

Paul-kapinos-e26d@rwth-aachen.de at 16:18, 20 May 2020

2020-05-20T16:18:06Z

Paul-kapinos-e26d@rwth-aachen.de at 16:10, 20 May 2020

2020-05-20T16:10:14Z

@@ Line 33: / Line 33: @@
 [[File:Shared_Memory.png|thumb|300px|Schematic of shared memory]]
-Shared Memory programming works like the communication of multiple people, who are cleaning a house, via a pin board. There is one shared memory (pin-board in the analogy) where everybody can see what everybody is doing and how far they have gotten or which results (the bathroom is already clean) they got. Similar to the physical world, there are logistical limits on many parallel units (people) can use the memory (pin board) efficiently and how big it can be.
+Shared Memory programming works like the communication of multiple people, who are cleaning a house, via a pin board. There is one shared memory (pin-board in the analogy) where everybody can see what everybody is doing and how far they have gotten or which results (the bathroom is already clean) they got. Similar to the physical world, there are logistical limits on how many parallel units (people) can use the memory (pin board) efficiently and how big it can be.
 In the computer this translates to multiple cores having joint access to the same shared memory, as depicted. This has the advantage that there is generally very little communication overhead, since every core can write to every memory location and the communication is therefore implicit. (However, due to [[NUMA]], access may not be equally fast from any core to any memory location.) Furthermore, parallelising an existing sequential (= not parallel) program is commonly straightforward and very easy to start, if the underlying problem allows parallelisation at all. As can be seen in the picture, it is not practical to attach more and more cores to the same memory, because it can only serve a limited number of cores with data efficiently at the same time. Therefore this paradigm is limited by how many cores can fit into one computer (a few hundred is a good estimate).
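The pin-board analogy above can be sketched in a few lines. This is only an illustration using Python threads from the standard library, not OpenMP; the names `clean_house`, `pin_board` and `board_lock` are made up for this example, and because of CPython's global interpreter lock the sketch shows the sharing pattern rather than any real speedup.

```python
import threading

def clean_house(rooms, num_workers=4):
    """Workers share one 'pin board' (a dict) and record finished rooms on it."""
    pin_board = {}                 # shared memory: visible to every worker
    board_lock = threading.Lock()  # protects the board against data races

    def worker(worker_id):
        # Static work distribution: worker i takes rooms i, i + num_workers, ...
        for room in rooms[worker_id::num_workers]:
            with board_lock:
                # Implicit communication: writing to shared state is enough,
                # no message has to be sent to anybody.
                pin_board[room] = worker_id

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pin_board

board = clean_house(["bathroom", "kitchen", "hall", "bedroom", "attic"])
print(sorted(board))
```

The lock is the price of shared variables: without it, concurrent updates to the same structure are exactly the kind of data race listed as a drawback of Shared Memory.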

@@ Line 8: / Line 8: @@
 * Bottlenecks in the parallel computer design, e.g. memory or network bandwidth limitations.
 * Load imbalances; one processor has more work than the others causing them to wait.
-* Serial parts in the program which may not be parallelized at all [https://en.wikipedia.org/wiki/Amdahl's_law (Amdahl's Law)].
+* Serial parts in the program which may not be parallelized at all (see [[Amdahl's_Law|Amdahl's Law]]).
 All parallelization approaches can be coarsely classified as belonging to the [[#Distributed_Memory|Distributed Memory]] (DM) or the [[#Shared_Memory|Shared Memory]] (SM) class. The Distributed Memory paradigm has the decisive feature of working beyond the boundaries of a physical node / operating system instance, opening the possibility to utilize much more (potentially cheaper) hardware. However, the network used is crucial for DM performance and scalability. Note that a Distributed Memory approach can typically also be used on a Shared Memory system. Through continuous development, some approaches previously known to be DM gain features related to SM, and on the other side attempts are made to make SM approaches runnable over DM clusters, making a clean dichotomy difficult or even impossible in many cases.
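The effect of the serial parts mentioned in the list above can be made concrete with the usual statement of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the runtime and n the number of execution units. The following is a minimal sketch; the function name `amdahl_speedup` and its parameter names are chosen for this example only.

```python
def amdahl_speedup(parallel_fraction, units):
    """Upper bound on the speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if units < 1:
        raise ValueError("units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part keeps its full cost; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / units)

# Ideal case: everything is parallel, so doubling the units halves the runtime.
print(amdahl_speedup(1.0, 2))
# With 10% serial work the speedup stays below 10, however many units are used.
print(amdahl_speedup(0.9, 1000))
```

The bound ignores synchronization overhead, bandwidth bottlenecks and load imbalance, so measured speedups are usually lower still.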

@@ Line 62: / Line 62: @@
 ! Pros || Cons || Pros || Cons
 |-
-| Easy to implement || scales only to 1 node || scales across multiple nodes || harder to implement
+| Easy to start || scales only to '''1''' node || scales across multiple nodes || harder to implement
 |-
 | shared variables || inherent data races ||  no inherent data races || no shared variables
 |-
-| low overhead || || rowspan="2"| each MPI process can utilize OpenMP,
+| low-overhead apps ''possible'' || typically bad performance of ''first attempt'' || rowspan="2"| each MPI process can utilize OpenMP,
 resulting in a hybrid application
 | some overhead
 |-
-| can be executed/started normally || || needs a library wrapper
+|  || || needs a library, complicated start-up
 |}

@@ Line 46: / Line 46: @@
 Distributed Memory is similar to the way multiple humans interact while solving problems: every process (person) 'works' on its own and can communicate with the others by sending messages (talking and listening).
-In a computer or a cluster of computers every core works on it's own and has a way (e.g. the [[MPI|Message Passing Interface (MPI)]]) to communicate with the other cores. This messaging can happen within a CPU between multiple cores, utilize a high speed network between the computers (nodes) of a supercomputer, or theoretically even happen over the internet. This sending and receiving of messages is often harder to implement for the developer and sometimes even requires a major rewrite/restructure of existing code. However, it has the advantage, that it can be scaled to more computers (nodes), since every process has it's own memory and can communicate over [[MPI]] with the other processes. The limiting factor here is the speed and characteristics of the physical network, connecting the different nodes.
+In a computer or a cluster of computers every core works on its own and has a way (e.g. the [[MPI|Message Passing Interface (MPI)]]) to communicate with the other cores. This messaging can happen within a CPU between multiple cores, utilize a high-speed network between the computers (nodes) of a supercomputer, or theoretically even happen over the internet. This sending and receiving of messages is often harder to implement for the developer and sometimes even requires a major rewrite/restructure of existing code or even modifications to the algorithms. However, it has the advantage that it can be scaled to more computers (nodes), since every process has its own memory and can communicate over [[MPI]] with the other processes. The limiting factor here is the speed and characteristics of the physical network connecting the different nodes.
 The communication pattern is depicted with a sparse and a dense network. In a sparse network, messages have to be forwarded by sometimes multiple cores to reach their destination. The more connections there are, the lower this amount of forwarding gets, which reduces average latency and overhead and increases throughput and scalability.
 Since every communication is explicitly coded, this communication pattern can be designed carefully to exploit the architecture and the available nodes to their fullest extent. It follows that in theory the application can scale as high as the underlying problem allows, being limited only by the network connecting the nodes and the overhead for sending/receiving messages.
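The explicit send/receive style described above can be mimicked with standard-library queues. This is a sketch of the pattern only and does not use MPI: the "ranks" here are Python threads, which in reality share memory, and the names `distributed_sum`, `rank_main` and `mailboxes` are hypothetical. Each rank works on its own chunk and exchanges data solely through messages.

```python
import queue
import threading

def distributed_sum(chunks):
    """Each rank owns one private chunk; partial sums travel as explicit messages."""
    size = len(chunks)
    mailboxes = [queue.Queue() for _ in range(size)]  # one inbox per rank
    result = []                                       # filled by rank 0 only

    def rank_main(rank, local_data):
        partial = sum(local_data)          # compute on private data only
        if rank == 0:
            total = partial
            for _ in range(size - 1):      # receive one message per other rank
                _sender, value = mailboxes[0].get()
                total += value
            result.append(total)
        else:
            mailboxes[0].put((rank, partial))  # explicit send to rank 0

    ranks = [threading.Thread(target=rank_main, args=(r, chunks[r]))
             for r in range(size)]
    for t in ranks:
        t.start()
    for t in ranks:
        t.join()
    return result[0]

print(distributed_sum([[1, 2], [3, 4], [5]]))
```

Every transfer is visible in the code, which is what makes the communication pattern tunable, and also what makes this style more work to write than the shared-memory variant.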
 == Should I use Distributed Memory or Shared Memory? ==

@@ Line 35: / Line 35: @@
 Shared Memory programming works like the communication of multiple people, who are cleaning a house, via a pin board. There is one shared memory (pin-board in the analogy) where everybody can see what everybody is doing and how far they have gotten or which results (the bathroom is already clean) they got. Similar to the physical world, there are logistical limits on many parallel units (people) can use the memory (pin board) efficiently and how big it can be.
-In the computer this translates to multiple cores having joint access to the same shared memory as depicted. This has the advantage, that there is generally very little communication overhead, since every core can write to every memory location and the communication is therefore implicit. (However due to [[NUMA]] access may not be equally fast from any core to an memory.) Futhermore parallelising an existing sequential (= not parallel) program is commonly straight forward and very easy to start, if the underlying problem allows parallelisation. As can be seen in the picture, it is not practical to attach more and more cores to the same memory, because it can only serve a limited number of cores with data efficiently at the same time. Therefore this paradigm is limited by how many cores can fit into one computer (a few hundred is a good estimate).
+In the computer this translates to multiple cores having joint access to the same shared memory, as depicted. This has the advantage that there is generally very little communication overhead, since every core can write to every memory location and the communication is therefore implicit. (However, due to [[NUMA]], access may not be equally fast from any core to any memory location.) Furthermore, parallelising an existing sequential (= not parallel) program is commonly straightforward and very easy to start, if the underlying problem allows parallelisation at all. As can be seen in the picture, it is not practical to attach more and more cores to the same memory, because it can only serve a limited number of cores with data efficiently at the same time. Therefore this paradigm is limited by how many cores can fit into one computer (a few hundred is a good estimate).
-For parallelizing applications, which plan on running on these kind of systems, the explicit distribution of work over the processors by compiler directives [[OpenMP|Open Memory Programming (OpenMP)]] is commonly used in the HPC community. Autoparallelization (automatic distribution of loop iterations over several processors) by the compiler is worth a try for trivial codes - it is just a compiler parameter which may give you (or may not) some speedup 'for free'.
+For parallelizing applications that are meant to run on Shared Memory systems, the explicit distribution of work over the processors by compiler directives ([[OpenMP|Open Multi-Processing (OpenMP)]]) is commonly used in the HPC community. Autoparallelization (automatic distribution of loop iterations over several processors) by the compiler is worth a try for modest codes: it is just a compiler parameter which may (or may not) give you some speedup 'for free'.
 == Distributed Memory ==

@@ Line 54: / Line 54: @@
 == Should I use Distributed Memory or Shared Memory? ==
-This really depends on the problem at hand. If the problem is parallelizable, the required computing power is a good indicator. When a few to a hundred cores should suffice, [[OpenMP]] is (for existing codes) commonly the easiest alternative. However, if thousands or even millions of cores are required, there is not really a way around [[MPI]]. To give a better overview, different pros and cons are listed in the table below:
+This really depends on the problem at hand. If the problem is parallelizable, the required computing power is a good indicator. When a few to a hundred cores should suffice, [[OpenMP]] is (for existing codes) commonly the easiest alternative. In many cases only a few lines of OpenMP code are needed, whereas MPI is a lot more tedious and often requires a complete redesign of the program or even of the algorithm used. However, if thousands or even millions of cores are required, or if the data set does not fit into the memory of the biggest single node available, there is not really a way around [[MPI]]. A combination of MPI and OpenMP may be advantageous, especially for applications with more than one level of parallelism. To give a better overview, different pros and cons are listed in the table below:
 {| class="wikitable" style=""

@@ Line 4: / Line 4: @@
 '''There are very many kinds of parallelization.'''
-In the ideal case doubling the number of processors the runtime is cut in half. There are several reasons why the ideal case usually is not met, a few of them are:
+In the ideal case, doubling the number of execution units cuts the runtime in half. There are several reasons why the ideal case is usually not met; a few of them are:
-* Overhead because of process or thread synchronization and communication.
+* Overhead because of synchronization and communication.
 * Bottlenecks in the parallel computer design, e.g. memory or network bandwidth limitations.
 * Load imbalances; one processor has more work than the others causing them to wait.
 * Serial parts in the program which may not be parallelized at all [https://en.wikipedia.org/wiki/Amdahl's_law (Amdahl's Law)].
-All of the parallelization approaches can be coarsely classified to be of [[#Distributed_Memory|Distributed Memory]] (DM) or [[#Shared_Memory|Shared Memory]] (SM) class. The Distributed Memory paradigms owns the ultimate feature to work beyond frontiers of a physical nodem / operating system instance, opening the possibility to utilize much more of (potentially cheaper) hardware. However the used network is crucial for the DM performance and scalability. Note that a Distributed Memory approach typically can be used also on a Shared Memory system. Through unremittingly development some approaches previously known to be DM get features related to SM, and on the other side attemps are made to make SM approaches be runnable over clusters, making a clead dichitomy complicated to impossible in many cases.
+All parallelization approaches can be coarsely classified as belonging to the [[#Distributed_Memory|Distributed Memory]] (DM) or the [[#Shared_Memory|Shared Memory]] (SM) class. The Distributed Memory paradigm has the decisive feature of working beyond the boundaries of a physical node / operating system instance, opening the possibility to utilize much more (potentially cheaper) hardware. However, the network used is crucial for DM performance and scalability. Note that a Distributed Memory approach can typically also be used on a Shared Memory system. Through continuous development, some approaches previously known to be DM gain features related to SM, and on the other side attempts are made to make SM approaches runnable over DM clusters, making a clean dichotomy difficult or even impossible in many cases.
 In the context of HPC those well-known approaches can be itemised (the list is not final!):
-* [[#Distributed_Memory|(DM)]]
+* [[#Distributed_Memory|Distributed Memory]]
 ** [[MPI|Message Passing Interface]] (MPI)
 ** [https://en.wikipedia.org/wiki/Partitioned_global_address_space PGAS] languages ([https://software.intel.com/content/www/us/en/develop/articles/intel-c-compiler-for-linux-building-upc-to-utilize-the-intel-c-compiler.html UPC], [https://software.intel.com/content/www/us/en/develop/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential.html Coarray Fortran])
-** accelerators and other devices, like [[Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs|GPGPU]]
+** accelerators and other devices, like [[Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs|GPGPU]] - latest [[OpenMP|OpenMP]] standards introduce offloading
-* [[#Shared_Memory|(SM)]]
+* [[#Shared_Memory|Shared Memory]]
 ** [[OpenMP|OpenMP]] (not to be confused with [https://www.open-mpi.org/ Open MPI], which is an MPI implementation)
 ** The Pthreads library