Python/pip
General
Python (official website, Wikipedia) is an open-source, interpreted, high-level programming language that is popular for prototyping high-end simulations and data science applications. In addition to a large standard library, many third-party libraries are available (e.g. NumPy, SciPy), as well as the built-in package manager pip. However, Python offers neither high performance nor easy parallelization out of the box, so it may not be the best choice for very large simulations.
Installation and Usage
Python and its most commonly used libraries (NumPy, SciPy, Matplotlib) are usually available on HPC clusters as an environment module, possibly in more than one version. Often Linux systems also have default Python instances that are available without loading a module, however those are typically older. If the version you need is not available, you can install it in your home directory by using pip, conda or easy_install, if available.
Tip: In our experience, using the wrong Python instance, or the wrong Pip, is a common error source for HPC users. Be aware of which Python instance you are talking to.
You can check the used version with:
$ python --version
and the corresponding install directory with
$ which python
Note that Python 2 reached end of life in 2020, and Python 3 should always be used unless Python 2 is absolutely necessary for backwards compatibility. Python 2 is often still available, however, and may even be the default. You can also specify the Python version directly:
$ python2 --version
$ python3 --version
if you need to differentiate.
In general you run a Python script with a command that looks something like the following:
$ python my_program.py
This will execute all the Python code in that script. Often, there will be a line like the one above in your job script.
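For example, a minimal Slurm job script wrapping this command might look like the following sketch. The module name and resource requests are placeholders that depend on your cluster:
#!/bin/bash
#SBATCH --time=00:10:00   # requested walltime
#SBATCH --ntasks=1        # a single serial process
module load Python        # module name varies between clusters
python my_program.py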
Learning Resources
- Tutorial from official documentation
- Python tutorials at Tutorials Point
- Python tutorial using HPC examples
- "Python for Everybody" - Good video tutorial covering basic Python programming
- Good overview of Python best practices
- Overview of common newbie mistakes
- Overview of Python topics at W3 Schools
Arrays
Arrays are a particularly important datatype to master for scientific programmers. Python has a built-in array datatype that is rarely used. The much more common approach, particularly in scientific software, is to use the NumPy package with its array type. The array syntax used in NumPy is commonly used by other software too: for example, TensorFlow and PyTorch tensors are functionally very similar to NumPy arrays, and Pandas dataframes are also similar in some respects.
An array is a container datatype that contains many entries of the same datatype (e.g. floating-point numbers). The entries are also called array elements. The main advantages of arrays are:
- They are an efficient way to store many entries, as metadata (e.g. entry datatype) only has to be stored once for the entire array.
- The elements are stored contiguously (one directly after the other) in memory. Computers can therefore potentially make use of caching and vector instructions, which speeds up array operations.
- Operations on all entries can often be expressed very concisely in code. For example, to add a constant to all elements in NumPy and most other Python array libraries, one can simply write array + 1.0 instead of having to write a loop that adds 1.0 to each element individually.
- Arrays correspond closely in many respects to mathematical structures like vectors and matrices, making them more practical for scientific use, and more intuitive to understand for many scientists.
The main disadvantage is that resizing arrays is harder, particularly when inserting elements in the middle: memory has to be added at the end, where there might not be a free memory block (possibly necessitating copying the entire array to a new, larger memory block), and all entries after the inserted element need to be moved by one position. Python offers the built-in list, set and dict datatypes if entries need to be added or removed often.
A multi-dimensional array is one where elements can be addressed with multiple indices. For example, elements in a two-dimensional NumPy array can be addressed by row and column. The number of dimensions is not limited to 2. Most third-party libraries mentioned here use multi-dimensional arrays. In NumPy, the word size typically denotes the total number of elements, while the word shape denotes the size of the individual dimensions. In NumPy arrays, they can be accessed with my_array.size and my_array.shape, respectively. Note that the latter is a Python tuple. See this tutorial in the NumPy documentation for examples.
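For example, a minimal sketch:
import numpy as np
my_array = np.zeros((3, 4))  # 2D array with 3 rows and 4 columns
print(my_array.size)         # 12, the total number of elements
print(my_array.shape)        # (3, 4), a tuple with the length of each dimension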
Learning resources
- NumPy: the absolute basics for beginners (NumPy documentation)
- SciPy user guide with linear algebra and many other algorithmic operations on NumPy arrays
- Introduction to PyTorch tensors, PyTorch documentation
- TensorFlow tensors guide, TensorFlow documentation
Key concepts for understanding arrays
The examples in this section refer to NumPy arrays, but they mostly apply analogously to PyTorch tensors and other array types.
Indexing and slicing
Elements can be indexed individually, or the array as a whole can be used in mathematical operations and other functions. However, NumPy and other array types in Python also allow addressing subsets of array elements. The syntax usually uses square brackets, e.g. my_matrix[3, 5] = 1.0. In multi-dimensional arrays, the last (rightmost) dimensions can be left out; all elements along those dimensions are then implicitly selected. For example, in a 2D array, my_array[5] is identical to my_array[5, :].
The indexing syntax allows a lot more advanced addressing of elements and groups of elements (sub-arrays).
Caution: Python uses zero-based indexing, so counting starts at zero: the first element has index 0. Additionally, the stop index of a slice (see below) is exclusive, i.e. one larger than the index of the last element one wants to address.
Slices signify a range of elements along the given axis. They can be specified with a start and stop, and optionally a step. The syntax uses colons as separators, like so: [start:stop:step]. All three elements of a slice, and even the second colon, can be left out; for example, my_array[:] addresses all elements of a 1D array. Note that Python slices are also a built-in object named slice. This syntax is used in many other Python datatypes as well, for example strings.
Boolean indexing: an array of bool values with the same shape as the array can be used as a mask, so that only the elements where the mask is true get addressed. For example, my_array[my_array < 0] = 0 will set all negative elements to zero. The expression my_array < 0 creates a temporary bool array in this case, but of course the mask can also be defined separately and its variable name passed into the square brackets.
Integer arrays and other integer sequences can also be passed into the square brackets to address multiple elements along the given dimension(s) at a time.
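A short sketch illustrating these addressing variants on a small example array:
import numpy as np
my_array = np.arange(10)    # elements 0 to 9
print(my_array[2:7:2])      # slice: elements at indices 2, 4 and 6
my_array[my_array < 3] = 0  # boolean mask: set elements smaller than 3 to zero
print(my_array[[1, 5, 8]])  # integer indexing: elements at indices 1, 5 and 8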
Memory Layout
Data in the computer's working memory (RAM) is addressed sequentially. A one-dimensional array therefore maps to memory addresses very straightforwardly, but a multi-dimensional array can map to memory in multiple different ways. NumPy arrays by default use row-major ordering, called C order in NumPy. They can optionally be set to column-major ordering, which NumPy calls Fortran order.
This mapping of multiple dimensions onto one also means that, in some circumstances, the same array can be reinterpreted as having a different shape without changing the order of the elements in memory. For example, an N-dimensional NumPy array can always be interpreted as a 1D array. NumPy has many reshaping operations predefined, for example ravel.
Tip: Understanding which transformations change the order of array elements and which do not is key to ensuring that your Python code performs well. Unnecessary re-arranging of array elements takes time and, depending on the operation, extra memory. NumPy will generally not warn you about this kind of waste.
If you retrieve a part of an array using, for example, the slicing operations used above, you will usually get a so-called array view. This is an object that looks at the same data in a different way, but can also be passed into NumPy operations as if it were a normal array. This also implies that modifying array values in the view will change them in the original. Most operations will return views, a few will return copies.
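A minimal sketch of this view behaviour:
import numpy as np
a = np.zeros(5)
v = a[1:3]         # slicing returns a view, not a copy
v[:] = 7.0         # modifies the original array as well
print(a)           # [0. 7. 7. 0. 0.]
b = a[1:3].copy()  # an explicit copy is independent of the original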
Learning Resources
- Copies and Views, NumPy documentation
- More on array memory layouts
Vectorized Operations and Broadcasting
In the context of NumPy and array operations, the term vectorization refers to operations that are defined for individual array elements, but also can be easily applied to larger arrays. Note that the term vectorization also refers to a CPU's or GPU's capability to operate on multiple values at the same time, which is a related but different concept.
NumPy has a concept called array broadcasting (see NumPy documentation): operations can not only be done on arrays with completely identical shapes, but also if one of the operand arrays has a size of one in that dimension.
The simplest case would be an operation with a scalar and, say, a 5x4 array a:
x1 = a + 5
The scalar 5 is broadcast onto the 5x4 elements of a and added to each one. This requires the array shapes to be compatible, i.e. to have either an identical length or a length of one along each dimension. In this case, they are compatible because the 5 is implicitly interpreted as a 2D array of shape 1x1.
In the same vein, a 1D array b of size 4 could be added to a because the shapes are compatible. Mathematically, this can be seen as adding a vector b to each row of a matrix a.
x2 = a + b
Broadcasting always happens from the last dimension forward. That means that, for example, a 1D array c of length 5 could not be added to a, because the last dimension of a is 4 and that of c is 5.
x3 = a + c # Error if c has shape 5
This raises the question of what to do if one wants to add a vector to each column of a matrix. NumPy gives users control over which operations to apply along which axes in several ways. The first is the numpy.newaxis object, which can be used to easily 'append' an array dimension. For example, the following would work:
x3 = a + c[:, numpy.newaxis]
The syntax [:, numpy.newaxis] reinterprets the 1D array c as a 5x1 2D array. Now the shapes of the two operands (5x4 and 5x1, respectively) are compatible.
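Putting these broadcasting examples together into one runnable sketch (the shapes match the text above):
import numpy as np
a = np.ones((5, 4))        # 5x4 matrix
b = np.ones(4)             # compatible with a along the last dimension
c = np.ones(5)             # not compatible with a as-is
x1 = a + 5                 # scalar broadcast onto every element
x2 = a + b                 # b added to each row of a
x3 = a + c[:, np.newaxis]  # c reinterpreted as 5x1, added to each column
print(x3.shape)            # (5, 4)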
Additionally, NumPy makes use of the ... (ellipsis) Python object as a placeholder for 'zero or more additional dimensions'. This allows writing code that can handle arrays with an arbitrary number of dimensions. For example, consider a use case where the code handles many arrays of points (each with x, y and z coordinates stored in the last array dimension). A function that only manipulates x coordinates might look like this:
def add_2_to_x(points):
    return points[:, 0] + 2.0
This function can only handle a list of points (a 2D array). It cannot handle a single point, because the : requires the points array to have two dimensions. Nor can it handle a higher-dimensional array (say, a grid of points). However, the ellipsis object can stand in as a placeholder:
def add_2_to_x(points):
    return points[..., 0] + 2.0
and the function can now handle 1D, 3D and higher-dimensional arrays with no additional code.
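For example, the ellipsis version defined above can be called as follows:
import numpy as np
single_point = np.array([1.0, 2.0, 3.0])  # shape (3,)
point_grid = np.zeros((10, 10, 3))        # shape (10, 10, 3)
print(add_2_to_x(single_point))           # 3.0, works on a single point
print(add_2_to_x(point_grid).shape)       # (10, 10), one x value per grid point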
Learning Resources
- Broadcasting explained in the official NumPy documentation
- Good explanation of vectorized operations
- NumPy documentation on arrays and memory layout: [1] [2]
- NumPy Python for Data Analysis - Series of short YouTube videos demonstrating NumPy arrays
Scientific and numerical libraries
NumPy
NumPy is a powerful numerical computing library in Python, specializing in large, multi-dimensional arrays and matrices. It enhances Python with efficient data structures for array computations and a vast collection of high-level mathematical functions.
A good introduction to NumPy syntax are the 100 NumPy exercises. For an introduction with a focus on efficient computing, refer to this tutorial. Additional tutorials are available on YouTube, including the 3rd and 4th videos in this series, with corresponding notebooks in this GitLab repository. Another informative YouTube video, covering NumPy in the first 1:48 minutes, is accessible here, along with related notebooks available in this GitHub repository.
However, when working with very large arrays, performance might lag. To address this, consider alternatives like NumExpr, which enables expression evaluation on large arrays with a smaller memory footprint. For further performance optimization, explore tools such as Numba (just-in-time compilation), Cython (C-level optimizations) and CuPy (GPU acceleration), each offering distinct advantages depending on the specific performance requirements and available computing resources.
SciPy
SciPy is built on top of NumPy and aims to facilitate complex mathematical operations, optimization routines, signal processing and statistical analysis. Here is a quick tour through the main SciPy modules. For more details refer to the official SciPy user guide.
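As a small illustrative sketch, solving a linear system with the scipy.linalg module:
import numpy as np
from scipy import linalg
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)  # solve the linear system A @ x = b
print(x)                # [2. 3.]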
PyTorch
Main articles: Machine and Deep Learning Frameworks and PyTorch
PyTorch is a tensor computation library (like NumPy) with strong GPU acceleration for training artificial neural networks. The installation procedure depends on the cluster: PyTorch may be provided as an environment module or as a Singularity/Apptainer image, or you may have to install it yourself. An example can be found in this HPC cluster's documentation.
PyTorch has its own distributed package (torch.distributed), which helps researchers parallelize their computations across processes and multiple nodes. More information on using torch.distributed in your Python code can be found in the PyTorch Distributed Tutorial.
TensorFlow
Main articles: Machine and Deep Learning Frameworks and TensorFlow
TensorFlow is an open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. Its installation on HPC systems depends on the cluster. Here are a few examples of the various ways to install it:
Module Environment: Installation can be provided by the module environment, for example in a similar way to this HPC cluster.
Apptainer Container: Using an Apptainer container with and without GPU, for example in a similar way to this HPC cluster or this one.
Anaconda Distribution: Install TensorFlow using the pre-installed Anaconda distribution, if your HPC cluster has one: for example here.
Home Directory Installation with Conda/Pip: Install TensorFlow in your home directory with conda/pip to create a virtual Python environment: here, here or here.
For TensorFlow tutorials, explore the official TensorFlow tutorials. For a detailed guide on how to program TensorFlow code for GPUs, refer to this section of the official documentation. To execute TensorFlow jobs on an HPC system, you can refer to an example of a TensorFlow Slurm job provided here or here.
BLAS and LAPACK
BLAS and LAPACK are two specifications of linear algebra libraries. The BLAS routines address low-level scalar, vector and matrix manipulations (such as matrix-vector and matrix-matrix multiplication), while the LAPACK package builds on the BLAS routines to, among other things, factorize matrices and solve systems of linear equations and eigenvalue problems. This YouTube lecture gives an overview of these two libraries.
BLAS and LAPACK each have multiple implementations. The most common ones are the reference implementations, OpenBLAS and the Intel MKL.
Python frameworks discussed on this page often use BLAS and LAPACK as a backend. Users normally do not need to install or configure the backends manually, but the need may arise when performance is a concern, for example to switch to an implementation that supports multithreading or GPUs. See for example this guide on configuring BLAS and LAPACK for SciPy.
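For example, to see which BLAS/LAPACK implementation your NumPy installation was built against, you can print its build configuration (the exact output format varies between NumPy versions):
import numpy as np
np.show_config()  # lists the detected BLAS/LAPACK libraries, e.g. OpenBLAS or MKL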
Pandas
Pandas offers efficient methods to load and process data, alongside robust functionality for exploration, summarization and manipulation. The main data structure used in Pandas is the DataFrame; see also this introduction to Pandas data structures.
Seamlessly integrating with Pandas, the Dask library enables leveraging the parallelism of HPC systems, allowing Pandas operations to scale to larger-than-memory datasets and to use multiple CPU cores efficiently. To optimize performance, one can use vectorized operations, avoid unnecessary data copies and parallelize computations. For additional guidance, refer to the Pandas user's guide.
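A minimal sketch of a Pandas DataFrame with a vectorized column operation (the column names and values are made up for illustration):
import pandas as pd
df = pd.DataFrame({"time": [0.0, 0.1, 0.2], "energy": [1.0, 0.8, 0.5]})
df["energy_kj"] = df["energy"] * 4.184  # vectorized operation on a whole column
print(df.describe())                    # quick statistical summary of numeric columns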
Performance
Python is a rather slow language. A naive pure-Python implementation of a numerical algorithm tends to be considerably slower than a C or Fortran implementation of the same algorithm, often by up to two orders of magnitude.
When using Python in HPC, it is therefore usually best to use pre-existing libraries like NumPy that are built to provide higher performance (often by delegating the actual computations to C or C++ code), instead of implementing basic algorithms oneself.
There are also libraries that compile or convert Python code into some other form that runs faster. Examples are Numba, a just-in-time compiler, and Cython, which generates C code from Python code. Both can greatly speed up code, but they require some additional development effort and typically cover only a subset of possible Python code.
Additional basic concepts relevant to Python performance are:
Avoiding explicit loops
As discussed above in the section on vectorized operations, NumPy (and PyTorch, TensorFlow etc.) offer the possibility to perform an operation on an entire array in one go. This should always be tried first before writing loops in pure Python. Writing a raw loop in Python is often several orders of magnitude slower than the equivalent array operation. This is because pure Python operations introduce a large overhead that dedicated array operations avoid (usually by implementing those operations in another language).
The following example compares a matrix multiplication using explicit Python loops with the NumPy function matmul. When tested, the explicit loop was 2-4 orders of magnitude slower.
import time
import numpy as np

n = 300
x = np.random.rand(n, n)
y = np.random.rand(n, n)
result = np.zeros((n, n))

# Explicit triple loop in pure Python
t0 = time.time()
for i in range(n):
    for j in range(n):
        for k in range(n):
            result[i, j] += x[i, k] * y[k, j]
t1 = time.time()
print(t1 - t0)

# The same computation as a single NumPy call
result2 = np.matmul(x, y)
t2 = time.time()
print(t2 - t1)
Data locality and CPU caches
Modern CPUs and GPUs can perform operations on multiple sequential array elements quickly or even at the same time. For this purpose, they have caches (Wikipedia): data is not read from RAM one by one, but a chunk (called a cache line) at a time. This means that if an operation (e.g. an addition) needs to be done on all entries in a chunk, the cache line has to be fetched from memory only once. Otherwise, an entire cache line would be fetched, the operation done on only one element, and later the same line would have to be fetched again just to do the same operation on the next element, and so on.
This means that the programmer should ensure that, as much as possible, operations on elements happen in the right order to make use of caching. Developing an intuitive understanding of the memory layout (see above) can make programming with data locality in mind easier. The Wikipedia article on Locality of reference contains more information on caches and more examples.
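A minimal sketch of how to inspect and enforce memory layout in NumPy:
import numpy as np
x = np.zeros((1000, 1000))      # C order (row-major) by default
print(x.flags["C_CONTIGUOUS"])  # True
y = x.T                         # the transpose is a view with a different layout
print(y.flags["C_CONTIGUOUS"])  # False: rows of y are not contiguous in memory
z = np.ascontiguousarray(y)     # explicit copy back into row-major order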
Temporary objects
Often NumPy operations will create temporary intermediate results without the user noticing. This increases memory consumption.
NumPy offers in-place operators which perform a basic operation without creating a temporary object. Consider this example, which allocates a large array and then modifies parts of it in two different ways:
x = np.zeros(5000)
y = np.zeros(10000)
y[:5000] = x + 1 # Will create a temporary array holding the result "x+1"
y[5000:] += 1 # Will not create a temporary array
Many NumPy functions also have an optional argument called out. You can use this to make the function write its result to an existing NumPy array, rather than allocating a new one every time.
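For example:
import numpy as np
a = np.ones(1000)
b = np.ones(1000)
result = np.empty(1000)
np.add(a, b, out=result)  # writes into the pre-allocated array instead of creating a new one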
Learning Resources
- Tutorial with focus on performance
- Good overview of Python profiling
- Video tutorial covering multiple topics including performance
- Corresponding Jupyter worksheets
- Another video tutorial series covering many topics related to profiling and performance
- PyTorch in-place operations
- Pandas documentation on Pandas data structures
- Numba introduction
- Cython introduction
- Real World Numba - conference talk with real-life experiences from a scientist who converted his code to Numba
Parallelization
If you run Python code, then by default it will be serial, i.e. a single process, running a single thread on a single CPU core, doing a single thing at a time. In fact, CPython (the most widely used Python implementation by far) has a feature called the Global Interpreter Lock (GIL) to ensure that only one thread is executing Python code at a time.
There are a multitude of ways of running Python in parallel however. Some of the most important ones are:
- Simply running multiple Python processes independently of each other. The GIL applies to each one separately. This can be enough if your compute problem is trivially parallelizable. The processes do not know anything about each other, however, so without extra mechanisms (such as writing files to disk) they cannot exchange data.
- Starting subprocesses from your Python code. Python comes with the built-in subprocess module; the built-in multiprocessing module additionally allows spawning Python worker processes and exchanging data with them.
- Calling code that is not written in Python, for example C code. This of course requires that code to be parallelized. There are many different ways of doing that, see for example the official Python documentation and this tutorial.
- Relying on built-in parallelization features of your framework or application. For example, some NumPy/SciPy functions are already multithreaded. PyTorch and TensorFlow also offer parallel features, see the PyTorch distributed module and TensorFlow distributed training overview.
- Using a parallelization framework. Examples are MPI4Py and Dask: the former wraps the MPI interface, the latter is a library that offers various ways of defining tasks, including parallel and distributed tasks. A minimal MPI4Py example is shown after this list.
Note that most of these ways require additional development effort to use.
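As a minimal MPI4Py sketch (assuming mpi4py and an MPI library are available on your cluster), each process prints its own rank:
from mpi4py import MPI
comm = MPI.COMM_WORLD   # communicator containing all started processes
rank = comm.Get_rank()  # number of this process
size = comm.Get_size()  # total number of processes
print(f"Hello from rank {rank} of {size}")
Such a script is started with the MPI launcher of your cluster, e.g. srun python hello_mpi.py or mpirun -n 4 python hello_mpi.py.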
Learning Resources
- Python Global Interpreter Lock explained at realpython.com
- Introduction to Python bindings, i.e. calling C or C++ code from Python
- MPI4Py tutorial
- Dask tutorial
Pip
Pip is a package management system for Python and has been included with Python since versions 2.7.9 and 3.4. It enables managing lists of packages and their versions. If you aim to install packages for Python 3, use pip3 instead of pip with the commands below.
Every pip instance is connected to a specific Python instance, so be careful that you are using the correct pip executable.
Example:
$ pip install --user theano
The option --user is necessary on HPC clusters because otherwise pip would try to install the package into the central Python/pip installation directory, to which normal users typically do not have write access. With this option, the packages will instead be installed in your home directory, specifically in a folder named ~/.local. You can check the install locations with pip list -v.
Getting the list of the installed packages:
$ pip list
Pip can also list the packages with outdated versions or available prereleases:
$ pip list --outdated
$ pip list --pre
Uninstalling packages works like this:
$ pip uninstall my-package
Upgrade the packages specified to the latest version:
$ pip install --upgrade package1 [package2 ...]
The currently installed packages can be exported to a requirements file (e.g. with pip freeze > requirements.txt). If properly formatted, this file can then be used to recreate the given environment on another system with exactly the same packages and versions:
$ pip install -r requirements.txt
Note: pip3 [...] can be used interchangeably with:
$ python3 -m pip [...]
Learning Resources
- Pip documentation: Getting Started
Virtual environments
When working with multiple coding projects, one often needs different sets of packages, possibly with conflicting requirements (e.g. different versions of the same package). Python and pip offer the possibility to create virtual environments. A virtual environment, also called a venv, enables switching between different sets of installed packages. Python comes with the built-in venv module, which allows managing venvs.
To create and activate a new venv, type the following in the Linux terminal (not the Python console):
$ python -m venv ~/.venv/myenvname
$ source ~/.venv/myenvname/bin/activate
(myenvname) $
The first command tells Python to run the venv module. The second command runs a shell script inside the new venv that applies some settings to your Linux environment, so that python and pip refer to that venv. It also adds a marker to the left of your console prompt to show which venv you are in. From this point on, all pip installs will be put into this venv. To deactivate the venv and remove the marker, simply type deactivate into the console.
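For example, packages installed while the venv is active stay local to it (the package name is just an example):
(myenvname) $ pip install numpy
(myenvname) $ which python
(myenvname) $ deactivate
The which python command will now point into ~/.venv/myenvname/bin, confirming which Python instance you are talking to.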
Conda
Conda is a package manager that is geared towards scientists. It is language-agnostic, and the repositories contain many non-Python packages; there is, however, a large overlap with the packages available via pip. The main Conda online package repositories, which are called channels in Conda, are the built-in one and Conda-Forge. Other organizations might have additional channels, for example NVIDIA has one, and there is a bioinformatics channel named Bioconda. Application software that is available in Conda typically names its recommended channel in its installation instructions.
When installing Conda, one can choose between the "Anaconda" distribution which contains a number of commonly used packages, or "Miniconda", which only contains the minimum packages to function. HPC clusters might have one or the other installed as a module.
Conda works with Conda environments (Conda envs), which function similarly to the virtual environments described above.
Typically, each Conda env brings its own Python instance. This means that you can use Conda to install a newer Python version, if the one you need is not available on your cluster.
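For example, a sketch of creating and activating an env with a specific Python version (the env name and version are placeholders):
$ conda create -n myenv python=3.11
$ conda activate myenv
(myenv) $ python --version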
Tips for use on HPC clusters
Conda is not primarily an HPC tool, and using it on a cluster sometimes causes problems. Here are some tips specific to HPC use:
Hardware detection
Make sure to install the correct variant of your package. For example, running conda install on a login node might install a non-GPU variant of a given package, because Conda detects that the login node does not have a GPU. To ensure you are installing the correct variant for your hardware, e.g. for CUDA, you can run conda install or module load inside an interactive job, e.g. on a SLURM cluster:
$ srun (...other SLURM options...) --gpus=<number_of_gpus> --pty /bin/bash
Installing Conda yourself
If neither Anaconda nor Miniconda is installed, you can install Conda without administrator rights into your home directory.
Problem with Conda and environment modules
Conda requires initialization with conda init when you first use it, which adds an entry to your .bashrc file (or the rc file of whichever other shell you are using). This complicates things for HPC users, because it circumvents any environment module system that the cluster might have: neither the active Conda env with the (envname) prompt nor the conda command are removed after unloading an Anaconda/Miniconda environment module. This can in particular mess up the order of entries in your PATH and LD_LIBRARY_PATH.
There is no general recommendation on how to fix this, but here are some tips:
- You can disable automatic activation of the base environment as described here.
- conda init changes can be removed from your .bashrc with the command conda init --reverse.
- You can deactivate and then reactivate your Conda env, e.g. at the beginning of your job script, with a conda deactivate followed by a conda activate <your env name>.
- Likewise, you can use module load, module unload and module purge as needed.
- In principle, you can create a custom environment module file which applies the .bashrc changes only temporarily upon loading.
Mixing Conda and pip
It is typically recommended not to mix Conda and pip, as they do not know about each other and might cause inconsistencies. If you absolutely must install packages with pip, the general recommendation is to install all Conda packages first, then install any pip packages, and avoid installing further Conda packages afterwards.
If you realize too late that you need to install another Conda package, one trick is to export your Conda env to a file, then throw it away and create a new one from that file, and finally do any pip installs on top of that. Exporting Conda envs is described here.
Conda package location and cache
If Conda is installed centrally on your HPC cluster, then the (base) env is in a location that you cannot write to. You will therefore get an error message if you try to install packages into the (base) env. However, any new envs you create will automatically land in your home directory.
Conda packages can take up quite a lot of storage space. If your cluster has a quota, this can cause problems with the size of your home directory. Conda is generally smart enough not to download packages that are already in the central install location, but any packages you install on top of those will take up space twice: once for the installed package and once for the cached download. You can clear the Conda caches with conda clean -a, as described here.
You can also control the location where packages get installed (with an environment variable called CONDA_ENVS_PATH) and migrate them; see for example this guide.
Learning Resources
- Getting started with conda
- Anaconda blog on using pip in a Conda environment
- In-depth Conda how-to including advanced topics like revisions and environment rollback