Jannis-klinkenberg-0962@rwth-aachen.de at 07:27, 1 July 2024

2024-07-01T07:27:24Z

Jannis-klinkenberg-0962@rwth-aachen.de at 07:05, 1 July 2024

2024-07-01T07:05:04Z

Jannis-klinkenberg-0962@rwth-aachen.de at 07:03, 1 July 2024

2024-07-01T07:03:43Z

Jannis-klinkenberg-0962@rwth-aachen.de: Created page with "Category:HPC-Developer Category:HPC-User Frameworks for machine learning (ML) and deep learning (DL) provide many tools to facilitate the building, training and infere..."

2024-07-01T06:55:40Z


[[Category:HPC-Developer]] [[Category:HPC-User]]
Frameworks for machine learning (ML) and deep learning (DL) provide many tools
that facilitate building and training machine learning models and running
inference with them. This article gives an overview of common frameworks and
the surrounding execution environments, as well as some details about the
underlying strategies.

== Existing frameworks ==

As there is already an extensive number of ML/DL frameworks available and new
ones targeting more specialized use-cases are actively being developed, this
article only lists some of them and provides some classification as a basic
overview.
The choice of a suitable framework depends on multiple factors:
* the type of machine learning model: e.g. classification, regression, neural networks, large language models, evolutionary algorithms
* the training method: e.g. (un-)supervised learning, reinforcement learning, auto-regressive
* the targeted hardware: CPU, GPU, CPU+GPU, or other accelerators
* the programming model used: e.g. CUDA for Nvidia GPUs, ROCm via HIP for AMD GPUs, etc.
* the programming language used: C/C++, Python, Julia, Fortran, etc.
* others

The following sections focus on ML/DL frameworks for the Python programming
language, although some of them also support other programming languages such
as C++.

=== scikit-learn ===

[https://scikit-learn.org/stable/ scikit-learn] is a Python framework for
shallow machine learning. It provides both supervised and unsupervised machine
learning models like regression, support vector machines, neural networks
and clustering. scikit-learn only supports execution on the CPU.

=== PyTorch ===

[https://pytorch.org/ PyTorch] is a Python framework for machine and deep
learning. It is built upon the [http://torch.ch/ torch] library, which also
provides a C++ interface. Both CPU and GPU execution are supported for
single-node and multi-node systems. Distributed model training is possible
through a PyTorch native implementation as well as [https://horovod.ai/ Horovod] and can be
extended with additional distributed strategies and algorithms through
libraries and frameworks like [https://www.deepspeed.ai/ DeepSpeed] or
[https://lightning.ai/docs/pytorch/stable/ PyTorch Lightning].

For more information on general setup and (distributed) machine learning, check
out: [[PyTorch|PyTorch in HPC]]

=== TensorFlow ===

[https://www.tensorflow.org/ TensorFlow] is a machine learning framework with a
focus on deep neural networks, supporting CPU and GPU execution. It uses Keras
as a high-level API to help the user construct neural network models.
Distributed model training is possible through a TensorFlow native implementation and
[https://horovod.ai/ Horovod].

For more information on general setup and (distributed) machine learning, check
out: [[TensorFlow|TensorFlow in HPC]]

=== Others ===

Another framework that was popular in the past is
[https://mxnet.apache.org/versions/1.9.1/ MXNet], which is no longer in
development. Many other machine and deep learning frameworks exist, some
of which are tailored more specifically towards particular fields of application.
[https://colossalai.org/ Colossal-AI] and [https://docs.mosaicml.com/en/latest/ Mosaic ML]
are other noteworthy frameworks for neural networks in
general, while [https://github.com/NVIDIA/Megatron-LM Megatron-LM] is a
framework meant for transformer-based large language models (LLMs).

== General setup and software environment ==

In most cases installing a framework through a package manager like pip, when
using Python, is enough to get started. When GPU support is required,
additional software and libraries are necessary (which sometimes will be
installed as requirements, if not found). For Nvidia support, this includes at
least [https://developer.nvidia.com/cuda-toolkit CUDA] for the backend and
[https://developer.nvidia.com/nccl NCCL] for communication and sometimes
additional libraries like [https://developer.nvidia.com/cudnn cuDNN] for deep
neural networks and others. Similar libraries exist for other types of
accelerators like AMD GPUs and different ML/DL frameworks. As these libraries
tend to be large in size, it is advised to use pre-installed versions if
available. On HPC systems these are often available through the provided module
system and should be loaded before installing an applicable framework and each
time before executing a workload using that framework.

=== Containers ===

As dependencies between package, library and framework versions can be an issue,
it is sometimes a challenge to find a working combination of those. This is
where containers excel. Containers offer the option to provide a
pre-configured/pre-built copy of a configuration. One source for container
images is the [https://catalog.ngc.nvidia.com/containers Nvidia GPU Cloud (NGC)
Catalog], which offers many container images for different software packages to
use with Nvidia hardware. This also includes working environments for frameworks
like TensorFlow or PyTorch that are packaged together with other tools and
libraries that improve the (GPU) performance of certain workloads within these
frameworks or, in the case of e.g. TensorFlow, contain a Horovod installation
for distributed execution.

To make use of these containers, check which containerization software is
available on the cluster you are using. Possible software includes
[https://apptainer.org/ Apptainer], [https://github.com/NERSC/shifter Shifter]
or [https://www.docker.com/ Docker]. Due to security considerations, Docker
containers are most likely only available on HPC systems through
[https://github.com/NVIDIA/enroot Enroot] as unprivileged containers, or managed
via Slurm by the [https://github.com/NVIDIA/pyxis Pyxis] plugin. Note that the
other mentioned containerization tools typically support converting existing
Docker containers to their own container image type.

Consult the corresponding documentation on how to run a container with the
software in use and check for flags that might be required on the desired HPC system.
This might include:
* Ensuring GPU availability inside the container, e.g. via <code>--nv</code> for Apptainer.
* Mapping additional user-specific directories into the container, where files can be accessed or manipulated, e.g. via <code>--bind</code> for Apptainer. Remember: directories that are part of the container are typically read-only.
* Checking the environment variables that might get carried into the container environment and therefore have to be either cleaned or expanded when requiring file paths that are not part of the container.
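
As an illustration of how such flags combine, the following minimal Python sketch only assembles an <code>apptainer exec</code> argument list and does not run anything. It uses Apptainer's <code>--nv</code>, <code>--bind</code> (mapping a host directory into the container) and <code>--cleanenv</code> options; the image name <code>pytorch.sif</code>, the directory paths and the script name are hypothetical placeholders, and which flags are appropriate depends on the HPC system.

```python
import shlex


def build_apptainer_command(image, command, use_gpu=True, binds=(), clean_env=False):
    """Assemble an `apptainer exec` argument list (nothing is executed)."""
    args = ["apptainer", "exec"]
    if use_gpu:
        args.append("--nv")        # make Nvidia GPUs available in the container
    if clean_env:
        args.append("--cleanenv")  # do not carry host environment variables over
    for host_path, container_path in binds:
        # map a host directory to a path inside the container
        args += ["--bind", f"{host_path}:{container_path}"]
    args.append(image)
    args += list(command)
    return args


# Hypothetical image, directory and script names, for illustration only.
cmd = build_apptainer_command(
    "pytorch.sif",
    ["python", "train.py"],
    binds=[("/work/user/data", "/data")],
    clean_env=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to <code>subprocess.run</code> or copied into a job script.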

In some cases it might be required to build your own containers or to build upon
existing ones, but it is strongly recommended to use containers provided on
the HPC systems to avoid unnecessary duplication of container images on file
systems. Consult the applicable guidelines regarding the use and availability of
containers on the HPC system in use.

==== Expanding containers without rebuilding ====

If a container does not feature all required packages, different options
exist that do not require rebuilding the image. One option provided by tools
like Apptainer is persistent overlays. While containers are typically
read-only file systems, persistent overlays are writable file systems layered
on top of the container, which make additional software and packages
available to the containerized software. Read more about persistent overlays
[https://apptainer.org/docs/user/main/persistent_overlays.html here].

In the case of Python, a second option is available through the additional use of
virtual environments. If packages are missing inside the container, they can be
installed in a separate virtual environment. Ensure that the Python version
used to create the environment matches the version inside the container.
Otherwise compatibility issues are possible. The path to the
<code>site-packages</code> directory of the virtual environment can then be
appended to the <code>PYTHONPATH</code> environment variable and passed to the
container when executing it. This allows the containerized software to find
packages installed in the virtual environment.
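
A minimal sketch of this approach is shown below. It assumes the usual POSIX layout of a virtual environment (<code>lib/pythonX.Y/site-packages</code>) and the <code>--env</code> option of recent Apptainer versions; all paths, the Python version, the image name and the script name are hypothetical. The directory must also be visible inside the container, e.g. through a bind mount.

```python
from pathlib import PurePosixPath


def venv_site_packages(venv_path, python_version):
    """Return the site-packages directory of a venv on a POSIX system.

    python_version is the major.minor version inside the container, e.g. "3.10".
    """
    return PurePosixPath(venv_path) / "lib" / f"python{python_version}" / "site-packages"


def extend_pythonpath(existing, venv_path, python_version):
    """Append the venv's site-packages directory to an existing PYTHONPATH value."""
    extra = str(venv_site_packages(venv_path, python_version))
    return f"{existing}:{extra}" if existing else extra


# Hypothetical paths and version, for illustration only.
pythonpath = extend_pythonpath("", "/home/user/venvs/extras", "3.10")
command = ["apptainer", "exec", "--nv", "--env", f"PYTHONPATH={pythonpath}",
           "pytorch.sif", "python", "train.py"]
print(" ".join(command))
```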

=== Virtual environments ===

Virtual environments allow separating different package installations to
account for dependencies between package versions, enabling separation of
framework installations for better maintenance and compatibility. The following
covers virtual environments for Python installations. Virtualenv is a tool
for creating virtual environments. A version with a reduced, but for
most cases sufficient, feature set is integrated in the Python
<code>venv</code> module. Before creating a virtual environment, ensure that the
desired Python version is loaded.
A virtual environment can be created and activated using the following commands:

 $ python -m venv path/to/venv          # create the venv
 $ source path/to/venv/bin/activate     # activate the venv

Once the environment is activated, all package installations will be performed
inside the virtual environment. Ensure that required dependencies, e.g.
CUDA libraries, are loaded before installing packages, if applicable. To run
a workload with a virtual environment in a job script, load all
necessary modules, activate the virtual environment and execute
the desired command, all inside the job script.

Be aware that a virtual environment with installed packages can consist of a
large number of files (around 50,000), which may make it unsuitable for file
systems like Lustre, where file quotas are often used. Also, resort to provided
containers, if suited, to minimize unnecessary duplication of packages.
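
To check how many files an existing environment actually occupies before placing it on a quota-limited file system, a small helper such as the following can be used. This is a sketch using only the Python standard library; the path in the usage comment is a placeholder.

```python
import os


def count_files(root):
    """Count the non-directory entries (files and file symlinks) below root."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


# Usage (placeholder path):
# print(count_files("path/to/venv"))
```

Directories also count towards file quotas on many systems, so the real number of used inodes is somewhat higher than the value returned here.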

== Possible workloads ==

This section provides rough guidelines for selecting available hardware
suited to the desired workload.

=== Training and fine-tuning ===

Training and fine-tuning (in this section simply referred to as training) of
machine learning models, especially (deep) neural networks, are compute- and
memory-intensive tasks. These kinds of tasks involve loading (large)
datasets and are well suited to GPUs, as they benefit from the
accelerated computation of matrix-matrix multiplications, which are at the core
of many neural network computations. Depending on the field of application,
e.g. computer vision or natural language processing, models vary heavily in
the number of trainable parameters and therefore require different amounts of
GPU memory to fit on a device. Methods to train models that exceed the
available memory of a single GPU are covered in the sub-section on distributed
training. The required amount of GPU memory depends on the chosen model, the
optimizer and the precision (FP32, FP16, BF16, or mixed precision combining
FP32 and FP16). For example, the memory required to train a Llama 2 7B large
language model with 7 billion parameters in FP32 precision can be estimated at
roughly 112 GB (depending on the optimizer used).
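
One way to arrive at an estimate of this size is sketched below. It assumes an Adam-type optimizer that keeps two additional values per parameter, one gradient value per parameter, and 4 bytes per value in FP32. These assumptions are illustrative and are not stated in the text above; activations and input batches are not included.

```python
def training_memory_gb(num_params, bytes_per_value=4, optimizer_states=2):
    """Rough memory for weights, gradients and optimizer states in GB (10^9 bytes).

    Assumes one gradient value per parameter and `optimizer_states` extra
    values per parameter (2 for Adam: first and second moment), all stored
    with `bytes_per_value` bytes. Activations and batches are not included.
    """
    values_per_param = 1 + 1 + optimizer_states  # weight + gradient + states
    return num_params * values_per_param * bytes_per_value / 1e9


print(training_memory_gb(7e9))                      # Adam, FP32: 112.0
print(training_memory_gb(7e9, optimizer_states=0))  # plain SGD, FP32: 56.0
```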

Training is performed in so-called batches: multiple training samples are fed
into the model and the gradients of an entire batch are aggregated before the
parameters are updated, which increases both
training throughput and model accuracy. While a dataset is most commonly first
loaded into the system's main memory, the data samples required for the training
batches need to be copied to the GPU and therefore also have to be taken into
consideration when estimating the required memory. It is often necessary to
experiment with different batch sizes to find a balance between memory usage,
training speed and the accuracy of the final model. If a certain batch size is
required for a desired outcome but does not fit into memory, techniques like
gradient accumulation can be used to trade additional computational overhead
for improved model performance by aggregating gradients from multiple smaller
batches before updating the parameters, instead of updating after each batch.
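
The idea behind gradient accumulation can be illustrated without any framework. The sketch below uses a hypothetical one-parameter model <code>y = w * x</code> with a mean squared error loss and shows that accumulating the scaled gradients of equally sized micro-batches gives the same gradient as one large batch. Real frameworks apply the same principle to all model parameters.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Accumulate gradients over equally sized micro-batches before an update."""
    micro_batches = [batch[i:i + micro_batch_size]
                     for i in range(0, len(batch), micro_batch_size)]
    grad = 0.0
    for micro_batch in micro_batches:
        # scale each contribution so the sum equals the full-batch mean
        grad += mse_gradient(w, micro_batch) / len(micro_batches)
    return grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5
full = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
print(full, accumulated)  # both are -22.5
w -= 0.01 * accumulated   # a single parameter update for the whole batch
```

Only one micro-batch has to reside in GPU memory at a time, at the cost of several forward and backward passes per parameter update. The equality holds when the batch size is a multiple of the micro-batch size.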

For considerations regarding dataset storage and loading, refer to
[[Machine_and_Deep_Learning_Frameworks#Handling_datasets|dataset handling]].

==== Distributed training/fine-tuning ====

If the model to be trained does not fit into the memory of a single
GPU or the model training takes too much time, the work can be distributed over
multiple CPUs or GPUs. While distributed training with multiple CPUs is
possible, the remainder of this section only considers multi-GPU use cases.

Training over multiple devices is mainly classified into two categories, model
parallelism and data parallelism, which can also be combined to make even better
use of distributed resources.
These concepts are briefly explained below for neural networks.
More detailed explanations can be found at
[https://huggingface.co/docs/transformers/v4.15.0/en/parallelism Hugging Face]
or [https://colossalai.org/docs/concepts/paradigms_of_parallelism/ Colossal AI],
among others.

===== Model parallelism =====

Model parallelism focuses on the problem of fitting the model parameters into
GPU memory. Distributing the parameters among the available GPUs reduces the
memory required for the model parameters per GPU and frees memory to train
larger models or allow larger batch sizes.

Splitting a network vertically distributes the different layers among the GPUs,
so each GPU only needs to store the parameters of a subset of layers. This
requires communication between the GPUs in both the forward and the backward pass.
It also leaves GPUs idle while they wait for other GPUs to finish the
computations on their layers and exchange information. To increase utilization of
the devices, pipeline parallelism can be used to split the batch into micro-batches,
perform calculations on those micro-batches and already pass results on to other
GPUs while still working on the remaining micro-batches.
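
The effect of micro-batches on idle time can be made concrete with a small simulation. The sketch below is a simplified model: it only covers the forward pass, assumes that every stage needs exactly one time step per micro-batch, and ignores communication cost.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Forward-pass schedule: entry [t][s] is the micro-batch on stage s at step t."""
    steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s after s steps
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule):
    """Share of (step, stage) slots in which a stage has nothing to do."""
    slots = [slot for row in schedule for slot in row]
    return slots.count(None) / len(slots)


print(idle_fraction(pipeline_schedule(4, 1)))  # whole batch at once: 0.75
print(idle_fraction(pipeline_schedule(4, 8)))  # eight micro-batches: about 0.27
```

Under these assumptions the idle share is <code>(stages - 1) / (stages + micro_batches - 1)</code>, so more micro-batches per batch reduce the time GPUs spend waiting.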

Tensor parallelism offers another approach to reduce the memory requirements.
By splitting a tensor along one of its dimensions, it can be
distributed among multiple GPUs, reducing the memory required for the tensor on
each GPU. The partial results from each GPU are combined into one final result
tensor at the end.
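
A minimal sketch of this idea for a single matrix multiplication is shown below. It splits a weight matrix column-wise into two shards, computes the partial products independently (as separate GPUs would) and concatenates them. The matrices are made-up toy values, and plain Python lists stand in for GPU tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Split a matrix into `parts` blocks of columns (one block per GPU)."""
    width = len(matrix[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(parts)]


x = [[1, 2], [3, 4]]                   # activations, replicated on every GPU
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weight matrix, split across 2 GPUs

shards = split_columns(weight, parts=2)            # each GPU stores half the columns
partials = [matmul(x, shard) for shard in shards]  # computed independently per GPU
# gather: concatenate the partial results along the column dimension
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(combined == matmul(x, weight))  # True
```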

===== Data parallelism =====

Data parallelism serves the main purpose of accelerating the training process
of a machine learning model. By distributing the training samples across
multiple GPUs, each GPU needs to process fewer batches, resulting in a lower
training time per epoch. After each batch, all GPUs exchange their gradients
through an all-reduce pattern to calculate the gradients for the weight
updates. Because every GPU contributes its own samples, the effective
batch size is scaled by the number of GPUs used. This might require adjustments
to the batch size to still achieve the required performance of a model, but can
also help to achieve certain batch sizes which otherwise would not be possible
on single devices. Specialized optimizers like LARS (Layer-wise Adaptive Rate
Scaling) are designed to perform well on the large batch sizes that are achieved
through data parallel training. Basic distributed data parallelism is supported
by most ML/DL frameworks like PyTorch and TensorFlow.
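
The gradient exchange can be sketched in plain Python. The example below uses a hypothetical one-parameter model <code>y = w * x</code>, splits one global batch across two simulated GPUs, and replaces the all-reduce with a simple average. It is a conceptual illustration, not the implementation used by any framework.

```python
def local_gradient(w, samples):
    """Mean-squared-error gradient of y = w * x on one GPU's share of the batch."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


num_gpus = 2
per_gpu_batch = 2
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[i * per_gpu_batch:(i + 1) * per_gpu_batch] for i in range(num_gpus)]

w = 0.5
local = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(local)  # identical gradient on every GPU
effective_batch_size = per_gpu_batch * num_gpus

print(synced, effective_batch_size)  # [-22.5, -22.5] 4
```

With equally sized shards the averaged gradient is the same as the gradient of the full batch of four samples, which is why the effective batch size grows with the number of GPUs.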

===== Hybrid parallelism =====

Often some degrees of model and data parallelism are combined to achieve better
training performance.

[https://www.deepspeed.ai/tutorials/zero/ ZeRO], the Zero Redundancy Optimizer,
partitions optimizer states and, at higher stages, also gradients and parameters
across GPUs instead of replicating them, and can optionally offload them to CPU
memory, to lower the memory requirements per GPU. The different optimization
stages differ in their impact on communication overhead, training time and
memory requirements.
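
How partitioning affects the memory per GPU can be estimated with a short calculation. The sketch below follows the accounting commonly used for ZeRO with mixed-precision Adam training (2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, 12 bytes for FP32 optimizer states); these byte counts are assumptions for illustration, and activations, buffers and offloading are ignored.

```python
def zero_memory_gb(num_params, num_gpus, stage, optimizer_bytes=12):
    """Per-GPU memory for model states in GB (10^9 bytes) under ZeRO.

    Assumes mixed-precision training with 2 bytes per parameter for weights,
    2 bytes for gradients and `optimizer_bytes` for optimizer states (12 for
    Adam with FP32 master weights). Activations are not included.
    """
    weights, grads, opt = 2.0, 2.0, float(optimizer_bytes)
    if stage >= 1:
        opt /= num_gpus      # stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2: additionally partition gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3: additionally partition parameters
    return num_params * (weights + grads + opt) / 1e9


for stage in range(4):
    print(stage, round(zero_memory_gb(7e9, num_gpus=8, stage=stage), 2))
```

Stage 0 corresponds to plain data parallelism, where every GPU holds a full copy of all model states.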

[https://engineering.fb.com/2021/07/15/open-source/fsdp/ FSDP], fully sharded
data parallelism, is another approach to enable the training of large models
across multiple GPUs by sharding parameters across multiple devices. FSDP is
available in PyTorch.

===== Further readings =====

For more detailed information on distributed training for specific frameworks,
consult the pages below:
* [[PyTorch#Distributed_training|Distributed training with PyTorch]]
* [[TensorFlow#Distributed_training|Distributed training with TensorFlow]]

=== Inference ===

Inference uses a trained model to create predictions for previously unseen
data. It requires significantly less computational power and memory
than training. It is therefore suited for both CPU and GPU systems, although
GPU systems still clearly outperform CPUs in the number of samples processed
per unit of time. Depending on the use case of a trained model, CPU
inference might be sufficient on smaller deployment systems if the amount of
data is small or if inference latency is not critical. As a
reference for memory requirements, a Llama 2 7B model requires about 28 GB
of memory for its parameters in FP32 precision (7 billion parameters at 4 bytes
each), which is significantly less than required for
model training, as optimizer states and gradients do not have to be stored.
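
The weight memory for inference follows directly from the parameter count and the number of bytes per value. The following sketch only accounts for the model weights; activations and any caches are not included.

```python
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "BF16": 2}


def inference_memory_gb(num_params, precision="FP32"):
    """Memory for the model weights alone in GB (10^9 bytes)."""
    return num_params * BYTES_PER_VALUE[precision] / 1e9


for precision in BYTES_PER_VALUE:
    print(precision, inference_memory_gb(7e9, precision))
```

For 7 billion parameters this gives 28 GB in FP32 and 14 GB in FP16 or BF16.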

Inference can be sped up by using multiple devices. As long as the model fits
on a single device, this does not require communication or specialized
algorithms/strategies, unlike distributed training.
Distributed inference is then performed by supplying different instances of the
model with separate data samples to work on.

To deploy trained models, frameworks like PyTorch and TensorFlow provide
inference servers, [https://pytorch.org/serve/ TorchServe] and
[https://www.tensorflow.org/tfx/guide/serving TensorFlow Serving], that are
suited for production environments. Nvidia also provides an inference server
with [https://developer.nvidia.com/triton-inference-server Triton], which is
optimized for Nvidia GPUs.

Additional libraries like [https://developer.nvidia.com/tensorrt TensorRT] can
increase the inference throughput by pre-compiling a trained model with
optimizations before deploying it in a target environment.

== Handling datasets ==

Being able to access a dataset for training with high bandwidth is crucial for
high resource utilization during model training. As datasets can require
multiple TB of storage and can span millions of files, choosing an
adequate file system is crucial on HPC systems. HPC systems use shared
file systems to provide data storage to the users. Accessing datasets on those
file systems can add many I/O operations on the systems and can have a significant
performance impact on other users.
Aspects to consider include:

'''Storage quota:'''

The large size of datasets presents challenges to the storage space available to
a user. A space-efficient approach is to provide datasets to the cluster
users from a central storage, which reduces the need for users to keep their own
copies of the same dataset. Due to licensing and availability reasons, some
datasets cannot be provided to all users and may require special permission or
access groups to be available. If the targeted user group is too small, it may
not be feasible to store a dataset centrally. Additionally, users may require
different preprocessing steps or file formats for their datasets. This again
creates the need for additional copies and the space to store them.

'''File systems:'''

HPC systems often provide multiple types of storage and file systems. To reduce
the load on those systems, it is advised to reduce the number of individual file
operations that are performed. One way to achieve this is to use archives/tarballs,
as this presents the file systems with a single file operation when copying to
a destination. The archived data then requires additional steps before it can
be accessed. If node-local storage is available, those archives can be copied
onto the nodes and unpacked, any required preprocessing can be applied and the
training then started. Depending on the required availability of data samples to
multiple nodes, using on-demand file systems like BeeOND can improve performance
and can help to reduce network load. The least amount of network load would be
achieved if the datasets or data samples required by different workers fit
onto the node-local storage and are only required by those devices. Depending
on the disk space, the required number of nodes and the number of available
devices, this may not be possible. This approach also needs to be adapted to
the training needs. If sampling, shuffling or other operations are required on
the data samples, it might not be possible to just copy and unpack archives on
different nodes, as this might not provide the required distribution.
Possible libraries that may help in these cases are:
* [https://github.com/mxmlnkn/ratarmount ratarmount]
* [https://github.com/webdataset/webdataset webdataset]
* [https://datadings.readthedocs.io/en/stable/ datadings]
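
The archive-based approach can also be sketched with the Python standard library alone. The example below packs many small samples into a single tar file and streams them back sequentially without unpacking. It does not use the API of the libraries listed above, and the sample names and contents are made up.

```python
import io
import os
import tarfile
import tempfile


def pack_samples(archive_path, samples):
    """Write many small samples (name -> bytes) into one uncompressed tar file."""
    with tarfile.open(archive_path, "w") as archive:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


def iterate_samples(archive_path):
    """Stream samples sequentially from the archive without unpacking it."""
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.tar")
    pack_samples(path, {"sample_0.txt": b"cat", "sample_1.txt": b"dog"})
    loaded = list(iterate_samples(path))

print(loaded)
```

The shared file system only sees one large file, while the training process still receives individual samples. Random access and shuffling need extra care with this layout, as noted above.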

=== Improving I/O performance for parallel single node runs ===

The following assumes a GPU node with multiple GPUs and that the user runs many
trainings simultaneously in different jobs, each using one GPU. When training
multiple instances of the same model, or different models, on the same
dataset, it could prove beneficial to consolidate multiple jobs into a single
job. By doing so, it is possible to make use of local SSDs (if available)
through on-demand file systems like BeeOND. The dataset only needs to be
transferred once over the network onto the local storage and can then be
accessed by multiple training instances. This reduces the load on the network
and also improves the access time to the data samples. For a node with four
GPUs, this would mean submitting one four-GPU job instead of four one-GPU jobs,
requesting the available on-demand file system, and launching multiple
independent training instances on the same dataset, each on a single GPU.
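
One possible way to start such independent instances from a single job is to restrict each process to one GPU through the <code>CUDA_VISIBLE_DEVICES</code> environment variable. The sketch below only prepares the per-process environments; the launch part is shown as a comment, and <code>train.py</code> and its <code>--seed</code> argument are hypothetical.

```python
import os


def per_gpu_environments(num_gpus, base_env=None):
    """Return one environment per training instance, each restricted to one GPU."""
    base_env = dict(os.environ if base_env is None else base_env)
    environments = []
    for gpu_id in range(num_gpus):
        env = dict(base_env)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # this process sees only one GPU
        environments.append(env)
    return environments


envs = per_gpu_environments(4, base_env={})
print([env["CUDA_VISIBLE_DEVICES"] for env in envs])

# Launching could then look like this (train.py and --seed are hypothetical):
# import subprocess
# procs = [subprocess.Popen(["python", "train.py", "--seed", str(i)], env=env)
#          for i, env in enumerate(per_gpu_environments(4))]
# for proc in procs:
#     proc.wait()
```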



Machine and Deep Learning Frameworks - Revision history
