[[Category:HPC-Developer]] [[Category:HPC-User]]
[https://pytorch.org/ PyTorch] is a Python framework for machine and deep learning. It is built upon the [http://torch.ch/ torch] library and also provides a C++ interface. It supports CPU and GPU execution on single-node and multi-node systems.
== General ==
To learn how to use PyTorch, take a look at the official [https://pytorch.org/tutorials/beginner/basics/intro.html Beginner's Guide] and the official [https://pytorch.org/tutorials/ Tutorials]. PyTorch is a very versatile machine learning framework that can be further extended through various libraries built on top of it.
=== Setup ===
For the general setup of the Python environment, consult the instructions in [[Machine_and_Deep_Learning_Frameworks#General_setup_and_software_environment|Machine and Deep Learning Frameworks]]. Depending on the required platform and GPU support, visit [https://pytorch.org/ pytorch.org] for the appropriate installation instructions. This is particularly important when using pip or conda in order to get the desired GPU (Nvidia or AMD) or CPU build. Example:
<syntaxhighlight lang="bash">
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
</syntaxhighlight>
The command listed above would install PyTorch with CUDA 12.1 support as well as the torchvision and torchaudio packages. These are additional packages that supply datasets, tools and models for computer-vision and audio tasks. Choose the framework version based on the library modules available on the HPC system and the available hardware.
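After installation, a quick check like the following (a minimal sketch, not part of the official instructions) can confirm that the installed build actually matches the available hardware:
<syntaxhighlight lang="python">
import torch

print(torch.__version__)           # e.g. "2.1.0+cu121" for a CUDA 12.1 build
print(torch.cuda.is_available())   # True if a supported GPU and driver are found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
</syntaxhighlight>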
PyTorch offers multiple backends for distributed training. GLOO is mainly intended for distributed training on multiple CPUs. For Nvidia GPUs the default backend is [https://developer.nvidia.com/nccl Nvidia's Collective Communication Library] (NCCL), which handles multi-GPU and multi-node communication. AMD GPUs are supported via the [https://rocm.docs.amd.com/projects/rccl/en/docs-5.4.3/index.html ROCm Communication Collectives Library] (RCCL) when installing the ROCm build of PyTorch. PyTorch also offers MPI as a backend, but this requires PyTorch to be built from source with an MPI implementation installed and available on the system. This can be advantageous in coupled applications where a simulation or other workload already uses MPI and integrates machine learning methods.
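Which of these backends are compiled into the installed build can be queried directly, for example (a small sketch using PyTorch's availability checks):
<syntaxhighlight lang="python">
import torch.distributed as dist

# report which distributed backends this PyTorch build provides
print("GLOO available:", dist.is_gloo_available())
print("NCCL available:", dist.is_nccl_available())
print("MPI available: ", dist.is_mpi_available())
</syntaxhighlight>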
Important Note: Often, HPC centers also provide optimized, pre-defined Apptainer or Docker containers that already contain all required packages. In that case, users do not need to create their own separate virtual environment but can conveniently use the container for execution.
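As a hedged illustration, running a training script inside such a container with Apptainer could look like the following; the image path and script name are placeholders, and the exact flags depend on the HPC center:
<syntaxhighlight lang="bash">
# run a training script inside a provided container image (paths are placeholders)
# --nv makes the Nvidia GPUs and driver libraries visible inside the container
apptainer exec --nv /path/to/pytorch-container.sif python3 train.py
</syntaxhighlight>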
== Building models ==
Building machine-learning models can be done from scratch or by using high-level APIs and abstractions like [https://keras.io/ Keras] or [https://lightning.ai/docs/pytorch/stable/ PyTorch Lightning]. This offers many ways to implement a wide range of machine-learning algorithms and models. This article omits code examples of how to build models in PyTorch to avoid duplicating existing guides and instead recommends the official [https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html tutorials].
== Model training ==
Training a model with PyTorch involves multiple steps:
* deciding on the type of training: single device or distributed, CPU or GPU
* defining or loading the model to be trained
* loading and pre-processing the dataset
* if applicable, choosing and setting up a distribution strategy
These steps are explained in detail in the following example, which trains a neural network to perform image classification on the MNIST dataset of handwritten digits ([https://github.com/pytorch/examples/tree/main/mnist full code here]). It assumes familiarity with the concepts of machine learning and neural networks.
=== Defining a model ===
The following code defines a neural network with three linear layers and uses ''rectified linear units'' (ReLUs) as activation functions for the neurons.
<syntaxhighlight lang="python">
import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# create an instance of the model
model = SimpleNN()
</syntaxhighlight>
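As a quick sanity check (not part of the original example), the untrained model can be applied to a random tensor with the MNIST input shape:
<syntaxhighlight lang="python">
import torch

# feed one random 28x28 "image" through the untrained network
dummy = torch.randn(1, 1, 28, 28)
logits = model(dummy)
print(logits.shape)  # torch.Size([1, 10]), one score per digit class
</syntaxhighlight>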
=== Loading the dataset ===
Now a dataset needs to be loaded. In this case this includes defining transformations for the training data, loading the MNIST dataset and using a data loader that shuffles the samples and packs them into batches of 64 samples each.
<syntaxhighlight lang="python">
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
</syntaxhighlight>
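On HPC systems, data loading can become a bottleneck. As an optional variation of the loader above (an assumption, not part of the original example), several worker processes and pinned memory can be used; suitable values depend on the CPU cores available per task:
<syntaxhighlight lang="python">
# optional: CPU worker processes and pinned memory for faster host-to-GPU transfers
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,      # e.g. match the CPU cores reserved for this task
    pin_memory=True,
)
</syntaxhighlight>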
=== Training the model ===
A rudimentary training loop is shown below:
<syntaxhighlight lang="python">
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
</syntaxhighlight>
If a (Nvidia) GPU is available on the system, the model and data samples need to be transferred into the GPU memory. If no GPU is available, the training will be performed on the CPU instead. The training loop performs 10 epochs and in each epoch the entire training dataset is processed by the device.
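After training, it is common to store the learned parameters so that they can later be reused for evaluation or inference. A minimal sketch (the file name is an arbitrary choice):
<syntaxhighlight lang="python">
# save only the learned parameters (the recommended way to persist a model)
torch.save(model.state_dict(), "simple_nn_mnist.pt")
</syntaxhighlight>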
=== Distributed training ===
Depending on the system and the available hardware, distributed training may require more consideration than simple single-device training.
Distributed training can be enabled by different modules and libraries:
* '''torch.distributed''': PyTorch's native distributed training module
* '''[https://Horovod.ai/ Horovod]''': a distributed training library supported by various ML/DL frameworks
The previously mentioned libraries/modules enable the setup of the distributed environment and handle calls to the communication libraries. Additionally, it is necessary to think about the required type of distributed parallelism. As mentioned in the [[Machine_and_Deep_Learning_Frameworks#Distributed_training/fine-tuning|overview article]], different parallelization strategies exist for different use-cases. '''torch.distributed''' and '''Horovod''' support data parallelism and basic model parallelism. More specialized forms of model parallelism, such as [https://pytorch.org/docs/stable/distributed.pipelining.html pipeline parallelism] or [https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html FSDP], require additional modules or libraries.
The following example makes use of data parallelism to speed up training:
==== Native distributed training ====
<syntaxhighlight lang="python">
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl', init_method='env://')

model = SimpleNN().to(device)
model = DDP(model)
</syntaxhighlight>
The above code initializes a process group using NCCL as the backend and environment variables to set up the TCP connection. If no backend is specified, NCCL and GLOO instances are created and automatically chosen based on the device in use (GLOO for CPU, NCCL for GPU). The machine learning model is then wrapped by the distributed handler.
===== Setting up the environment =====
When using environment variables to set up the process group, the following have to be supplied:
* <syntaxhighlight lang="bash" inline>MASTER_ADDR</syntaxhighlight>: the IP address of the master node
* <syntaxhighlight lang="bash" inline>MASTER_PORT</syntaxhighlight>: the port to be used for communication
* <syntaxhighlight lang="bash" inline>WORLD_SIZE</syntaxhighlight>: the number of workers used for training
* <syntaxhighlight lang="bash" inline>RANK</syntaxhighlight>: the global rank of the process
On a SLURM-based system this could be done in the following way for X nodes with Y GPUs each, using the first node from the job's node list as the master node.
<syntaxhighlight lang="bash">
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=Y
#SBATCH --gpus-per-node=Y

export MASTER_PORT=12340
# one task is started per GPU, so the world size equals the total number of tasks
export WORLD_SIZE=${SLURM_NTASKS}
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# each task exports its own global rank before starting the training script
srun bash -c 'export RANK=$SLURM_PROCID && python training-script.py'
</syntaxhighlight>
If multiple GPUs are available on a node, each process must select its GPU accordingly (the example assumes CUDA devices). In a SLURM environment the local process ID can be mapped to a GPU ID inside the training script:
<syntaxhighlight lang="python">
import os
gpu = int(os.environ['SLURM_LOCALID'])
torch.cuda.set_device(gpu)
</syntaxhighlight>
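With one process per GPU, the device selected above can also be passed explicitly to the DDP wrapper; a small variation of the earlier snippet (assuming the variable gpu from above):
<syntaxhighlight lang="python">
# move the model to the GPU selected above and tell DDP which device this process owns
model = SimpleNN().to(gpu)
model = DDP(model, device_ids=[gpu])
</syntaxhighlight>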
The training code needs to be adapted for distributed training as well.
<syntaxhighlight lang="python">
from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, sampler=train_sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    train_sampler.set_epoch(epoch)
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
</syntaxhighlight>
A distributed sampler is used to distribute the data samples among the available devices. Once adjusted, the training can be executed on an arbitrary number of nodes and GPUs. Increasing the number of GPUs can speed up the training but also impacts model accuracy and convergence behavior. This requires further adjustments of hyperparameters such as batch size, learning rate and the choice of optimizer.
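A common heuristic (not a fixed rule) is to scale the learning rate linearly with the number of processes, since the effective global batch size grows accordingly; a sketch:
<syntaxhighlight lang="python">
import torch.distributed as dist

base_lr = 0.001
# linear scaling heuristic: the global batch size grows with the number of processes
lr = base_lr * dist.get_world_size()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
</syntaxhighlight>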
NOTE: When using additional libraries that wrap PyTorch's native distributed training, a different ratio of tasks to GPUs may be required to make distributed training work. For example, when using [https://github.com/mosaicml/composer composer] only one task per node is required while still requesting multiple GPUs of a single node, because composer performs an internal setup that spawns new processes handling the individual GPUs. Read the documentation of the library in use to find the correct mapping of tasks to GPUs.
==== Horovod ====
Using Horovod simplifies the setup: the environment variables required by the native approach do not have to be set. For backends like NCCL, Horovod still requires and uses MPI in addition to provide communication. This has to be taken into consideration when setting up the environment and running the training on an HPC system.
The code from the example for native distributed training can be ported to Horovod with only small changes.
Calling the initialization function sets up the distributed environment, replacing the process group initialization from the native approach.
<syntaxhighlight lang="python">
import horovod.torch as hvd
hvd.init()
</syntaxhighlight>
GPU pinning is now controlled by the local ranks assigned by Horovod.
<syntaxhighlight lang="python">
torch.cuda.set_device(hvd.local_rank())
</syntaxhighlight>
The distributed sampling is also set up using values provided by Horovod.
<syntaxhighlight lang="python">
from torch.utils.data.distributed import DistributedSampler
train_sampler = DistributedSampler(train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
</syntaxhighlight>
Lastly, the optimizer needs to be wrapped by Horovod's distributed optimizer and the parameters need to be broadcast from rank 0 to all processes before starting the first epoch.
<syntaxhighlight lang="python">
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

hvd.broadcast_parameters(model.state_dict(), root_rank=0)
</syntaxhighlight>
The script can then be called either by using the <syntaxhighlight lang="bash" inline>horovodrun</syntaxhighlight> command or by directly launching the processes through calls to <syntaxhighlight lang="bash" inline>srun</syntaxhighlight>, <syntaxhighlight lang="bash" inline>mpirun</syntaxhighlight> or similar.
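As an illustration (process counts and flags are placeholders and depend on the system's MPI and SLURM setup), launching the Horovod-adapted script on four GPUs could look like this:
<syntaxhighlight lang="bash">
# launch 4 worker processes with horovodrun ...
horovodrun -np 4 python training-script.py

# ... or launch them directly via the scheduler or MPI
srun --ntasks=4 python training-script.py
mpirun -np 4 python training-script.py
</syntaxhighlight>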
== Inference ==
Using a trained model for inference can be as straightforward as loading a pre-trained model and feeding it with data samples. Take a look at [https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_models_for_inference.html PyTorch's tutorial for inference] on how to perform basic inference with a model. Also take a look at libraries like [https://pytorch.org/serve/ TorchServe] or the [https://developer.nvidia.com/triton-inference-server Triton Inference Server] to use inference in production environments, or improve inference throughput by pre-compiling a model through [https://developer.nvidia.com/tensorrt-getting-started TensorRT].
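As a minimal sketch, assuming the SimpleNN model from the training example above and an arbitrary checkpoint file name, basic inference could look like this:
<syntaxhighlight lang="python">
import torch

# load the trained parameters into a fresh model instance
model = SimpleNN()
model.load_state_dict(torch.load("simple_nn_mnist.pt", map_location="cpu"))
model.eval()

# classify a single (here: random) 28x28 image without tracking gradients
with torch.no_grad():
    sample = torch.randn(1, 1, 28, 28)
    prediction = model(sample).argmax(dim=1)
print(prediction.item())
</syntaxhighlight>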
Distributed inference is much simpler than distributed training as there is no communication involved between the different instances. Inference can be scaled to an arbitrary number of GPUs by simply starting single-GPU inference processes on different parts of an inference dataset, but it requires a sufficiently large number of samples to scale efficiently.
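A hedged sketch of this pattern, reusing the SimpleNN model and checkpoint name from above and a placeholder inference_dataset: each independently launched process reads its rank and the total number of processes from the environment (here SLURM variables) and only handles its own slice of the dataset.
<syntaxhighlight lang="python">
import os
import torch

# each process works on a disjoint slice of the dataset; no inter-process communication is needed
rank = int(os.environ.get("SLURM_PROCID", 0))
world_size = int(os.environ.get("SLURM_NTASKS", 1))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNN().to(device)
model.load_state_dict(torch.load("simple_nn_mnist.pt", map_location=device))
model.eval()

predictions = []
with torch.no_grad():
    # inference_dataset is a placeholder for an MNIST-style dataset to be classified
    for idx in range(rank, len(inference_dataset), world_size):
        image, _ = inference_dataset[idx]
        predictions.append(model(image.unsqueeze(0).to(device)).argmax(dim=1).item())
</syntaxhighlight>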