Difference between revisions of "Parallel Programming"

From HPC Wiki
Jump to navigation Jump to search
m
Line 19: Line 19:
 
Distributed Memory is similar to the way how multiple humans interact with problems: every process 'works' on it's own and can communicate with the others by sending messages (talking and listening).
 
Distributed Memory is similar to the way how multiple humans interact with problems: every process 'works' on it's own and can communicate with the others by sending messages (talking and listening).
  
In a computer or a cluster of computers every core works on it's own and has a way (e.g. the [[MPI|Message Passing Interface (MPI)]]) to communicate with the other cores. This messaging can happen within a CPU between multiple cores, utilize a high speed network between the computers (nodes) of a supercomputer, or theoretically even happen over the internet. This Sending and receiving messages is often harder to implement for the developer and sometimes even requires a major rewrite/restructure of existing code. However, it has the advantage, that it can be scaled to more computers (nodes), since every process has it's own memory and can communicate over [[MPI]] with the other processes. The limiting factor here is the speed and characteristics of the physical network, connecting the different nodes.
+
In a computer or a cluster of computers every core works on it's own and has a way (e.g. the [[MPI|Message Passing Interface (MPI)]]) to communicate with the other cores. This messaging can happen within a CPU between multiple cores, utilize a high speed network between the computers (nodes) of a supercomputer, or theoretically even happen over the internet. This sending and receiving of messages is often harder to implement for the developer and sometimes even requires a major rewrite/restructure of existing code. However, it has the advantage, that it can be scaled to more computers (nodes), since every process has it's own memory and can communicate over [[MPI]] with the other processes. The limiting factor here is the speed and characteristics of the physical network, connecting the different nodes.
  
 
The communication pattern is depicted with a sparse and a dense network. In a sparse network, messages have to be forwarded by sometimes multiple cores to reach their destination. The more connections there are, the lower this amount of forwarding gets, which reduces average latency and overhead and increases throughput/scalability.
 
The communication pattern is depicted with a sparse and a dense network. In a sparse network, messages have to be forwarded by sometimes multiple cores to reach their destination. The more connections there are, the lower this amount of forwarding gets, which reduces average latency and overhead and increases throughput/scalability.
  
 
Since every communication is explicitly coded, this communication pattern can be designed carefully to exploit the architecture and the available nodes to the fullest extend. Therefore, theoretically, it can scale as high as the underlying problem allows, being only limited by the network connecting the nodes and the overhead of sending/receiving messages.
 
Since every communication is explicitly coded, this communication pattern can be designed carefully to exploit the architecture and the available nodes to the fullest extend. Therefore, theoretically, it can scale as high as the underlying problem allows, being only limited by the network connecting the nodes and the overhead of sending/receiving messages.
 
  
 
== Should I use Distributed Memory or Shared Memory? ==
 
== Should I use Distributed Memory or Shared Memory? ==

Revision as of 13:24, 9 April 2018

In order to solve a problem faster, work can sometimes be executed in parallel, as mentioned in Getting Started. To achieve this, one usually uses either a Shared Memory or a Distributed Memory programming model. Following is a short description of the basic concept of Shared Memory and Distributed Memory. Information on how to start/run/use an exisiting parallel code can be found in the OpenMP or MPI article.


Shared Memory

Schematic of shared memory

Shared Memory programming works like the communication via a pin board. There is one shared memory (pin-board in the analogy) where everybody can see what everybody is doing and how far they have gotten or which results (the bathroom is already clean) they got. Similar to the physical world, there are logistical limits on many people can use the memory (pin board) efficiently and how big it can be.

In the computer this translates to multiple cores having equal access to the same memory as depicted. This has the advantage, that there is generally very little communication overhead, since every core can write to every memory location and the communication is therefore implicit. Futhermore parallelising an existing sequential (= not parallel) program is commonly straight forward and very easy to implement, if the underlying problem allows parallelisation. As can be seen in the picture, it is not practical to attach more and more cores to the memory and therefore this paradigm is limited by how many cores you can fit into one computer (a few hundred are a good estimate).

This paradigm is implemented by e.g. Open Memory Programming (OpenMP).


Distributed Memory

Schematic of distributed memory with sparse network
Schematic of distributed memory with dense network

Distributed Memory is similar to the way how multiple humans interact with problems: every process 'works' on it's own and can communicate with the others by sending messages (talking and listening).

In a computer or a cluster of computers every core works on it's own and has a way (e.g. the Message Passing Interface (MPI)) to communicate with the other cores. This messaging can happen within a CPU between multiple cores, utilize a high speed network between the computers (nodes) of a supercomputer, or theoretically even happen over the internet. This sending and receiving of messages is often harder to implement for the developer and sometimes even requires a major rewrite/restructure of existing code. However, it has the advantage, that it can be scaled to more computers (nodes), since every process has it's own memory and can communicate over MPI with the other processes. The limiting factor here is the speed and characteristics of the physical network, connecting the different nodes.

The communication pattern is depicted with a sparse and a dense network. In a sparse network, messages have to be forwarded by sometimes multiple cores to reach their destination. The more connections there are, the lower this amount of forwarding gets, which reduces average latency and overhead and increases throughput/scalability.

Since every communication is explicitly coded, this communication pattern can be designed carefully to exploit the architecture and the available nodes to the fullest extend. Therefore, theoretically, it can scale as high as the underlying problem allows, being only limited by the network connecting the nodes and the overhead of sending/receiving messages.

Should I use Distributed Memory or Shared Memory?

This really depends on the problem at hand. If the problem is parallelizable, the required computing power is a good indicator. When a few to a hundret cores should suffice, OpenMP is (for existing codes) commonly the easiest alternative. However, if thousands or even millions of cores are required, there is not really a way around MPI. To give a better overview, the different pro/cons are listed in this table:

Shared Memory (OpenMP) Distributed Memory (MPI)
Pros Cons Pros Cons
Easy to implement scales only to 1 node scales across multiple nodes harder to implement
shared variables inherent data races no inherent data races no shared variables
low overhead each MPI process can utilize OpenMP,

resulting in a hybrid application

some overhead
can be executed/started normally needs a library wrapper

So in sum it really depends and if you are unsure, have a chat with your local HPC division, check out the example in the References and/or head over to the Support.

References

Difference between SM und DM in a concrete C example