PyTorch

PyTorch is a Python framework for machine and deep learning. It is built upon the torch library and also provides a C++ interface. It supports CPU and GPU execution on single-node and multi-node systems.

General

To learn how to use PyTorch, take a look at the official Beginners Guide and the official Tutorials. PyTorch is a very versatile machine learning framework that can be further extended through various libraries built on top of it.

Setup

For the general setup of the Python environment, consult the instructions from Machine and Deep Learning Frameworks. Depending on the required platform and GPU support, visit pytorch.org for the applicable installation instructions. When using pip or conda, this is important in order to get the desired GPU (Nvidia or AMD) or CPU build. Example:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

The command listed above installs PyTorch with CUDA 12.1 support as well as the torchvision and torchaudio packages, which supply datasets, tools and models for computer-vision and audio tasks. Choose the framework version based on the library modules available on the HPC system and the available hardware.
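
Whether the installed build matches the available hardware can be verified with a quick check (a minimal sketch; the reported CUDA version depends on the chosen build):

import torch

# installed PyTorch version and the CUDA version it was built against
# (torch.version.cuda is None for builds without CUDA support)
print(torch.__version__)
print(torch.version.cuda)
# True if a GPU is visible to PyTorch
print(torch.cuda.is_available())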

PyTorch offers multiple backends for distributed training. Gloo primarily targets distributed CPU training. For Nvidia GPUs, the recommended backend is Nvidia's Collective Communication Library (NCCL). AMD GPUs are supported via the [https://rocm.docs.amd.com/projects/rccl/en/docs-5.4.3/index.html ROCm Communication Collectives Library] (RCCL) when installing the ROCm build of PyTorch. PyTorch also offers MPI as a backend, but this requires PyTorch to be built from source with an MPI implementation installed and available on the system. This can be advantageous in coupled applications where a simulation or other workload already uses MPI and integrates machine learning methods.
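
Which of these backends are compiled into a given PyTorch build can be queried at runtime (a short sketch using the availability checks of torch.distributed):

import torch.distributed as dist

# each call returns True if the corresponding backend is included in this build
print(dist.is_gloo_available())
print(dist.is_nccl_available())
print(dist.is_mpi_available())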

Important Note: Often, HPC centers also provide optimized, pre-defined Apptainer or Docker containers that already contain all required packages. In that case, users do not need to create their own separate virtual environment but can conveniently use the container for execution.

Building models

Machine-learning models can be built from scratch or with high-level APIs and abstractions like Keras or PyTorch Lightning, which offer many ways to implement a wide range of machine-learning algorithms and models. This article omits code examples of how to build models in PyTorch to avoid duplicating existing guides and recommends the official [https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html tutorials].

Model training

Training a model with PyTorch involves multiple steps:

  • decide on the type of training: single device or distributed; CPU or GPU
  • define or load the model to be trained
  • load and pre-process the dataset
  • if applicable, choose and set up a distribution strategy

These steps are explained in detail in the following example, which uses the MNIST dataset of handwritten digits to train a neural network for image classification (Full code here). It assumes familiarity with the concepts of machine learning and neural networks.

Defining a model

The following code defines a neural network with three linear layers and uses rectified linear units (ReLU) as activation functions for the neurons.

import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# create an instance of the model
model = SimpleNN()

Loading the dataset

Next, a dataset needs to be loaded. In this case, this includes defining transformations for the training data, loading the MNIST dataset and using a data loader that shuffles the samples and packs them into batches of 64 samples each.

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

Training the model

A rudimentary training loop is shown below:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

If an Nvidia GPU is available on the system, the model and the data samples are transferred to GPU memory; otherwise the training is performed on the CPU. The training loop runs for 10 epochs, and in each epoch the entire training dataset is processed by the device.
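
Once the loop has finished, the learned parameters are typically saved to disk so that the model can later be reused for inference (a minimal sketch; the file name simple_nn.pt is arbitrary):

# store only the learned parameters, not the full model object
torch.save(model.state_dict(), 'simple_nn.pt')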

Distributed training

Depending on the system and the available hardware, distributed training may require more consideration than simple single-device training.

Distributed training can be enabled by different modules and libraries:

  • torch.distributed: PyTorch's native distributed training module
  • Horovod: A distributed training library supported by various ML/DL frameworks

The libraries/modules listed above set up the distributed environment and handle the calls to the communication libraries. Additionally, it is necessary to consider the required type of distributed parallelism. As mentioned in the overview article, different parallelization strategies exist for different use-cases. torch.distributed and Horovod support data parallelism and basic model parallelism. More specialized strategies such as pipeline parallelism or fully sharded data parallelism (FSDP) require additional libraries or more recent torch.distributed features.

The following example makes use of data parallelism to speed up training:

Native distributed training

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP 

dist.init_process_group(backend='nccl', init_method='env://')

model = SimpleNN().to(device)
model = DDP(model)

The code above initializes a process group using NCCL as the backend and environment variables to set up the TCP connection. If no backend is specified, both a Gloo and an NCCL instance are created and the appropriate one is chosen automatically based on the device in use (Gloo for CPU, NCCL for GPU). The machine-learning model is then wrapped by the distributed handler.
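
After initialization, each process can query its own rank and the total number of processes, for example to restrict logging output to a single rank (a minimal sketch):

# global rank of this process and total number of processes in the group
rank = dist.get_rank()
world_size = dist.get_world_size()

if rank == 0:
    print(f'Distributed training with {world_size} processes')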

Setting up the environment

When using environment variables to set up the process group, the following variables have to be supplied:

  • MASTER_ADDR: The IP-address of the master node
  • MASTER_PORT: The port to be used for communication
  • WORLD_SIZE: The number of workers used for training
  • RANK: The rank of a process

On a SLURM-based system, this could be done in the following way for X nodes with Y GPUs each, using the first node from the job's nodelist as the master node.

#SBATCH --nodes=X
#SBATCH --ntasks-per-node=Y
#SBATCH --gpus-per-node=Y

export MASTER_PORT=12340
# one task is started per GPU, so the total number of tasks equals the world size
export WORLD_SIZE=${SLURM_NTASKS}
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# each task exports its global rank before starting the training script
srun bash -c 'export RANK=${SLURM_PROCID}; python training-script.py'

If multiple GPUs are available on a node, each process must select its GPU accordingly (the example assumes CUDA devices). In a SLURM environment, the local process ID can be mapped to a GPU ID inside the training script:

import os

# map the local SLURM task ID on the node to a GPU ID
gpu = int(os.environ['SLURM_LOCALID'])
torch.cuda.set_device(gpu)

The training code needs to be adapted for distributed training as well.

from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, sampler=train_sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    train_sampler.set_epoch(epoch)
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

A distributed sampler is used to distribute the data samples among the available devices. Once adjusted, the training can be executed on an arbitrary number of nodes and GPUs. Increasing the number of GPUs can speed up the training but also impacts model accuracy and convergence behavior, which requires further adjustment of hyperparameters such as batch size, learning rate and the optimizer.
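
A common starting point is the linear scaling heuristic, where the learning rate grows with the number of workers because the effective batch size increases (a sketch only; the base learning rate of 0.001 is taken from the earlier example and may need further tuning):

# scale the learning rate with the number of workers (linear scaling heuristic)
base_lr = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * dist.get_world_size())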

NOTE: When using additional libraries that wrap PyTorch's native distributed training, a different ratio of tasks to GPUs may be required to make distributed training work. For example, when using composer, only one task per node is required while still requesting multiple GPUs of that node, because composer performs an internal setup that spawns new processes handling the individual GPUs. Read the documentation of the library in use to find the correct mapping of tasks to GPUs.

Horovod

Using Horovod simplifies the setup by removing the need to set environment variables as in the native approach. Even with backends like NCCL, Horovod additionally requires and uses MPI to provide communication. This has to be taken into consideration when setting up the environment and running the training on an HPC system.

The code from the native distributed training example can be ported to Horovod with only small changes.

Calling the initialization function sets up the distributed environment, replacing the process group initialization from the native approach.

import horovod.torch as hvd
hvd.init()

GPU pinning is now controlled by the local rank assigned by Horovod.

torch.cuda.set_device(hvd.local_rank())

The distributed sampler is also set up using values provided by Horovod.

from torch.utils.data.distributed import DistributedSampler
train_sampler = DistributedSampler(train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

Lastly, the optimizer needs to be wrapped by Horovod's distributed optimizer, and the parameters need to be broadcast from rank 0 to all processes before starting the first epoch.

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

hvd.broadcast_parameters(model.state_dict(), root_rank=0)

The script can then be launched either with the horovodrun command or by directly starting the processes through srun, mpirun or similar.

Inference

Using a trained model for inference can be as straightforward as loading a pre-trained model and feeding it with data samples. Take a look at [https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_models_for_inference.html PyTorch's tutorial for inference] on how to perform basic inference with a model. Also take a look at libraries like [https://pytorch.org/serve/ TorchServe] or the [https://developer.nvidia.com/triton-inference-server Triton Inference Server] to use inference in production environments or to improve inference throughput by pre-compiling a model with TensorRT.
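
A minimal sketch of single-device inference with the model defined above, assuming the trained parameters were stored in a file such as simple_nn.pt:

import torch

model = SimpleNN()
model.load_state_dict(torch.load('simple_nn.pt'))
model.eval()  # switch layers such as dropout or batch norm to inference mode

with torch.no_grad():                      # no gradients are needed for inference
    sample, _ = train_dataset[0]           # a single MNIST image as example input
    output = model(sample.unsqueeze(0))    # add a batch dimension
    print(output.argmax(dim=1).item())     # predicted digit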

Distributed inference is much simpler than distributed training, as no communication between the different instances is involved. Inference can be scaled to an arbitrary number of GPUs by simply starting single-GPU inference processes on different parts of an inference dataset, but it requires a sufficiently large number of samples to scale efficiently.
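
A sketch of this pattern, where each process picks an interleaved shard of the data based on its rank (here taken from the SLURM task ID; the MNIST training set from above is reused as stand-in inference data):

import os
import torch
from torch.utils.data import DataLoader, Subset

# process index and total number of processes, e.g. from the SLURM environment
rank = int(os.environ.get('SLURM_PROCID', 0))
world_size = int(os.environ.get('SLURM_NTASKS', 1))

# every process works on its own shard, no communication between processes is needed
indices = list(range(rank, len(train_dataset), world_size))
shard_loader = DataLoader(Subset(train_dataset, indices), batch_size=64)

model.eval()
with torch.no_grad():
    for images, _ in shard_loader:
        # assumes model and data reside on the same device
        predictions = model(images).argmax(dim=1)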