Machine and Deep Learning Frameworks

Frameworks for machine learning (ML) and deep learning (DL) provide many tools to facilitate the building, training and inference process of different machine learning models. This article aims to provide an overview about common frameworks and the surrounding execution environments, as well as some details about the underlying strategies.

Exisiting frameworks

As there is already an extensive number of ML/DL frameworks available and new ones targeting more specialized use-cases are actively being developed, this article only lists some of them and provides some classification as a basic overview. The choice for a suiting framework depends on multiple factors:

the type of machine learning model: e.g. classification, regression, neural networks, large language models, evolutionary algorithms
the training method: e.g. (un-)supervised learning, reinforcment learning, auto-regressive
the targeted hardware: CPU, GPU, CPU+GPU, or other accelerators
the used programming model: e.g. CUDA for Nvidia GPUs, ROCm via HiP for AMD GPUs, etc.
the used programming language: C/C++, Python, Julia, Fortran, etc.
others

In the following the focus on ML/DL frameworks lies on the Python programming language, while some of them also offer support for different programming languages like C++.

scikit-learn

scikit-learn is a Python framework for shallow machine learning. It provides both supervised and unsupervised machine learning models like regression, support vector machines, neural networks and clustering. scikit-learn only supports execution on the CPU.

PyTorch

PyTorch is a Python framework for machine and deep learning. It is build upon the torch library, which also provides a C++ interface. Both CPU and GPU execution is supported for single-node and multi-node systems. Distributed model training is possible through a PyTorch native implementation as well as horovod and can be extended with additional distributed strategies and algorithms through libraries and frameworks like DeepSpeed or PyTorch Lightning.

For more information on general setup and (distributed) machine learning, check out: PyTorch in HPC

TensorFlow

TensorFlow is a machine learning framework with focus on deep neural networks, supporting CPU and GPU execution. It uses Keras as a high-level API to help the user in constructing neural network models. Distributed model training is possible through a TensorFlow native implementation and horovod.

For more information on general setup and (distributed) machine learning, check out: TensorFlow in HPC

Others

Another framework that was popular in the past was MXNet, which is no longer in development. Many other machine and deep learning frameworks exist, where some of them are tailored more specifically towards different fields of application. Colossal-AI and Mosaic ML are other noteworthy mentions as frameworks for neural networks in general, while Megatron-LM is a framework meant for transformer-based large language models (LLM).

General setup and software environment

In most cases installing a framework through a package manager like pip, when using Python, is enough to get started. When GPU support is required, additional software and libraries are necessary (which sometimes will be installed as requirements, if not found). For Nvidia support, this includes at least CUDA for the backend and NCCL for communication and sometimes additional libraries like cuDNN for deep neural networks and others. Similar libraries exist for other types of accelerators like AMD GPUs and different ML/DL frameworks. As these libraries tend to be large in size, it is advised to use pre-installed versions if available. On HPC systems these are often available through the provided module system and should be loaded before installing an applicable framework and each time before executing a workload using that framework.

Containers

As dependencies between package, library and framework versions can be an issue, it is sometimes a challenge to find a working combinitation of those. This is were containers excel. Containers offer the option to provide a pre-configured/pre-built copy of a configuration. One source for container images is the [https://catalog.ngc.nvidia.com/containers Nvidia GPU Cloud (NGC) Catalog], which offers many container images for different softwares to use with Nvidia hardware. This also includes working environments for frameworks like TensorFlow or PyTorch that are packed together with other tools and library that improve the (GPU) performance of certain workloads within these frameworks or in case of e.g. TensorFlow contain a Horovod installation for distributed execution.

To make use of these containers check out which containerization software is available on the cluster you are using. Possible software includes Apptainer, Shifter or Docker. The latter one is most likely only available through Enroot as an unpriviledged container or managed via SLURM by the Pyxis plugin on HPC systems due to security considerations. Note that the other mentioned containerization tools typically support converting existing docker containers to their own container image type.

Consult the according documentation on how to run/execute a container depending on the used software and check for flags that might be required on the desired HPC systems. This might include the following entries

Ensuring GPU availabilty inside the container, e.g., via --nv for Apptainer
Mapping additional user-specific directories during container usage, where files can be accessed or manipulated, e.g, via --bind for Apptainer. Remember: Directories that are part of the container are typically read-only.
Checking the set environment variables that might get carried into the container environment and therefore have to be either cleaned or expanded when requiring file paths that are not part of the container.

In some cases it might be required to build own containers or build upon exisiting ones, but it is strongly recommended to use containers provided on the HPC systems to avoid unnecessary duplication of container images on file systems. Consult applying guidlines regarding the use and availability of containers on the HPC system in use.

Expanding containers without rebuilding

If a used container does not feature all required packages different options exist that do not require to rebuild the image. One option provided by tools like Apptainer are persistant overlays. While containers are typically read-only file systems, persitant overlays are sandboxed file systems lying ontop of the container enabling making additional software and packages available to the containerized software. Read more about persistant overlays here.

In case of Python, a second option is available through the additional use of virtual environments. If packages are missing inside the container they can be installed in a separate virtual environment. Ensure that the Python version used to create the environment matches the version inside the container. Otherwise compatibility issues are possible. The path to the virtual environment can then be appended to the PYTHONPATH environment variable and passed to the container when executing it. This allows the containerized software to be able to find packages installed in the virtual environment.

Virtual environments

Virtual environments allow separating different package installations to account for dependencies between package versions, enabling separation of framework installations for better maintenance and compatibility. The following will cover virtual environments for Python installations. Virtualenv is a tool that allows to create virtual environments. A version with a reduced, but for most cases sufficient, feature set is integrated in the Python venv module. Before creating a virtual environment ensure that the desired Python version is loaded. A virtual environment can be created and activated using the following commands:

   $ python -m venv path/to/venv % To create the venv
   $ source path/to/venv/bin/activate % To activate the venv

Once the environment is activated all package installations will be performed inside the virtual environment. Ensure that required dependencies like e.g. CUDA libraries are loaded before package installations, if applicable. To start an execution using a virtual environment in a job script, simply load all necessary modules, source the virtual envrionment to be activated and execute the desired command, all inside the job script.

Be aware that virtual environments create overhead in form of around 50,000 files on creation which may make it not suitable to be put on file systems like LUSTRE, where file quotas are often used. Also, resort to provided containers, if suited, to minimize unnecessary duplication of packages.

Possible workloads

This section is meant to provide rough guidlines to select available hardware suited for the desired workload.

Training and fine-tuning

Training and fine-tuning (in this section referred to as simply training) of machine learning models, especially (deep) neural networks are compute and memory intensive tasks. These kind of tasks involve the loading of (large) datasets and are well suited to be performed on GPUs, as they benefit from the accelerated computation of matrix-matrix multiplications, which are the core of many neural network computations. Depending on the field of application, like e.g. computer vision and natural language processing, models vary heavily in number of trainable parameters requiring different amounts of GPU memory to fit on a device. Methods to run models that exceed the available memory of a single GPU are covered in a sub-section for distributed training. The required amount of GPU memory depends on the chosen model, the optimizer and the precision (FP32, FP16, BF16, FP32+FP16 (mixed precision)). For example, the memory requirements for the training of a LLama2 7B large language model with 7 billion parameters in FP32 precision can be estimated to roughly 112GB (depending on the used optimizer).

Training is performed in so called batches which inputs multiple training samples into the model and aggregates the gradients of an entire batch before updating parameters to increase both training throughput and model accuracy. While a dataset is most commonly first loaded into the systems main memory, the data samples required for the training batches need to be copied to the GPU and therefore have also be taken into consideration when estimating the required memory. It is often required to experiment with different batch sizes to find a balance between memory usage, training speed and the accuracy of the final model. If a batch-size is required for a certain outcome, but does not fit into the memory, techniques like gradient accumulation can be used to trade additional computational overhead for improved model performance by aggregating gradients from multiple batches before updating the parameters, instead of updating after each batch.

For considerations regarding dataset storage and loading refer to [Machine_and_Deep_Learning_Frameworks#Handling_datasets dataset handling]

Distributed training/fine-tuning

If the model that should be trained does not fit into the memory of a single GPU or the model training takes too much time, the work can be distributed over multiple CPUs or GPUs. While distributed training with multiple CPUs is possible, the remaining part will only consider muli-GPU use-cases.

Training over multiple devices is mainly classified into two categories, model parallel and data parallel, which also can be combined to allow even better usage of distributed resources. These concepts will be briefly explained for the use-cases of neural networks. More detailed explanations can be found on: Hugging Face or Colossal AI and other.

Model parallelism

Model parallelism focuses on the problem of fitting the model parameters into GPU memory. By distributing the parameters among available GPUs reduces the memory required for the model parameters per GPU and frees memory to train larger models or allow larger batch sizes.

Splitting a network vertically distributes the different layers among the GPUs, so one GPU will only need to save the parameters of a subset of layers. This requires communication between the GPUs in both the forward and backwards path. It also leaves all GPUs idle which require other GPUs to finish the computations on their layers and exchange information. To increase usage of the devices pipeline parallelism can be used to split the batch into micro-batches, perform calculations on those micro-batches and already provide data to other GPUs while still working on the remaining micro-batches.

Tensor parallelism offers another appproach to reduce the memory requirements. By splitting the tensors along one of the dimensions a tensor can be distributed among multiple GPUs reducing the memory required for the tensor on each GPU. The results from each GPU are computed into one final result tensor at the end.

Data parallelism

Data parallelism serves the main purpose of accelerating the training process of a machine learning model. By distributing the training samples across multiple GPUs each GPU needs to process less batches resulting in lower training time per epoch. After each batch all GPUs exchange their gradients through an all-reduce pattern to calculate the gradients for the weight updates. Because all GPUs contribute with their number of samples the effective batch size is scaled by the number of used GPUs. This might require adjustments to the batch size to still achieve the required performance of a model, but can also help to achieve certain batch sizes which otherwise would not be possible on single devices. Specialized optimizer like LARS (Layer-wise Adaptive Rate Scaling) are designed to perform well on large batch sizes that are achieved through data parallel training. Basic data distributed parallelism is supported by most ML/DL frameworks like PyTorch and TensorFlow.

Hybrid parallelism

Often some degrees of model and data parallelism are combined to achieve better training performance.

ZeRO, the zero redundancy optimizer, partitions optimizer states across GPUs and CPUs to both accelerate training and lower memory requirements. Different optimization levels have different impact on communication overhead, training time and memory requirements.

FSDP, fully sharded data parallelism, is another approach to enable the training of large models across multiple GPUs by sharding parameters across multiple devices. FSDP is available in PyTorch.

Inference

Inference allows to use a trained model with previously unseen data to create predictions. It requires significantly less computational power and memory resources than training. Therefore it is suited for CPU and GPU systems, where GPU systems still outperform CPUs by a lot regarding the number of processed samples per time step. But depending on the use-cases of a trained model, CPU inference might be sufficient on smaller deployment systems if there is not a large amount of data or if the latency of the inference is negligible. As a reference for memory requirements a Llama 2 7B model would require about 28GB of memory with FP32 precision, which is significantly less than required for the model training as optimizer states and gradients don't have to be stored.

Inference can be sped up by using multiple devices. Unlike distributed training this does not require communication or specialized algorithms/strategies. Distributed inference is performed by supplying different instances with separate data samples to work on.

To deploy trained models frameworks like PyTorch and Tensorflow provide inference server, TorchServe and TensorFlow Serving, that are suited for production environments. Nvidia also provides an inference server with Triton which is optimized for Nvidia GPUs.

Additional libraries like TensorRT can increase the inference throughput by pre-compiling a trained model with optimizations before deploying it in a target environment.

Handling datasets

Being able to access a dataset for training with high bandwidth is crucial for high resource utilization during model training. As datasets tend to require up to multiple TB of storage and can span over millions of files choosing an adequate filesystem is really important on HPC systems. HPC systems use shared file systems to provide data storage to the users. Accessing datasets on those file systems can add many I/O operations on the systems and can have significant performance impact for other users. Potentially required considerations that could be made are:

Storage quota:

The large size of datasets present challenges to storage space avaiable to a user. A space-efficient approach are datasets that are provided to the cluster users from a central storage which reduces the need for own copies of the same dataset for users. Due to licensing and availability reasons some datasets can not be provided to all users and may require special permission or access groups to be available. If the targeted user group is too small it may not be feasible to store a dataset centrally. Additionally users may require different preprocessings or file formats for their datasets. This again creates the need for additional copies and the space to store them.

File systems:

HPC systems often provide multiple types of storage and file systems. To reduce the load on those systems it is advised to reduce the amount of single file operations that are performed. One way to achieve this are archives/tarballs, as this presents the file systems with a single file operation when copying to a destination. This case requires additional considerations to be accessible. If node-local storage is available those archives can be copied onto the nodes, unpacked, potentially required preprocessing can be applied and the training then started. Depending on the required availability of data samples to multiple nodes using on-demand file systems like BeeOND can improve performance and can help to reduce network load. The least amount of network load would be achieved if the datasets or data samples required for different workers fit onto the node-local storage and are only required by those devices. Depending on the disk space, the required number of nodes and the number of available devies, this may not be possible. This approach also needs to be adapted for the training needs. If sampling, shuffling or other operations are required on the data samples it might not be possible to just copy and unpack archives on different nodes as this might not provide the required distribution. Possible libraries that may help in these cases are:

Improving I/O performance for parallel single node runs

The following assumes a GPU node with multiple GPUs and that the user runs many trainings simultaneously in differnt jobs, each using one GPU. When training multiple instances of the same model, or different models, but using the same dataset, it could prove beneficial to consolidate multiple jobs into a single job. By doing so, it is possible to make use of local SSDs (if available) through on-demand file systems like BeeOND. The dataset only needs to be transfered once onto the local storage over the network and can be accessed by mutliple training instances. This reduces the load on the network and also improves the access time to the data samples. For a node with four GPUs this would mean to submit one four-GPU job instead of four one-GPU jobs, requesting the available on-demand file system, and launching multiple independent training instances on the same dataset on single GPUs.