Lustre Tuning (Admin Guide)
This article describes tuning options for the Lustre file system, using the Noctua cluster at PC² as an example.
Lustre File System
Noctua's high-performance parallel file system is a Lustre file system.
This Lustre file system has three major functional units:
- one Metadata Server (MDS) with two Metadata Targets (MDTs)
  - stores namespace metadata, such as filenames, directories, access permissions, and file layout
  - stores small files on SSDs to accelerate data access
- two Object Storage Servers (OSSs), each with two Object Storage Targets (OSTs)
  - each OST manages a single local disk file system
- clients (the Noctua nodes) that access the data (read/write)
- Lustre presents all clients with a unified namespace for all files and data in the file system
  - allows concurrent and coherent read and write access to the files in the file system
- Lustre achieves high performance through parallelism
  - best performance comes from multiple clients writing to multiple OSTs
- Lustre is designed to achieve high bandwidth to/from a small number of files
  - used as a scratch file system
  - good match for scientific datasets and/or checkpoint data
- Lustre is not designed to handle large numbers of small files
  - potential bottlenecks at the MDS when files are opened
  - data will not be spread over multiple OSTs
  - not a good choice for program compilation
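To make the parallelism described above more concrete, the following minimal sketch models how a striped file is distributed round-robin over its stripe objects, which is the behaviour of Lustre's default (RAID-0 style) layout. The stripe size of 1 MiB and stripe count of 4 are illustrative assumptions, not Noctua's settings, and `stripe_index` and `objects_touched` are hypothetical helper names.

```python
# Sketch of round-robin (RAID-0 style) striping.
# stripe_size and stripe_count below are illustrative, not Noctua settings.

MIB = 1024 * 1024


def stripe_index(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Return which of the file's stripe objects holds the byte at `offset`."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe size and count must be > 0")
    return (offset // stripe_size) % stripe_count


def objects_touched(offset: int, length: int, stripe_size: int, stripe_count: int) -> set:
    """Return the stripe objects touched by one contiguous read or write."""
    if length <= 0:
        return set()
    first_chunk = offset // stripe_size
    last_chunk = (offset + length - 1) // stripe_size
    # After stripe_count consecutive chunks every object has been hit once,
    # so there is no need to walk further than that.
    last_needed = min(last_chunk, first_chunk + stripe_count - 1)
    return {
        stripe_index(chunk * stripe_size, stripe_size, stripe_count)
        for chunk in range(first_chunk, last_needed + 1)
    }


if __name__ == "__main__":
    # A 64 KiB file fits into one chunk and therefore lives on one object.
    print(objects_touched(0, 64 * 1024, MIB, 4))
    # An 8 MiB write covers eight chunks and reaches all four objects.
    print(objects_touched(0, 8 * MIB, MIB, 4))
```

The small-file case shows why striping only pays off for large files: a file smaller than one stripe never reaches a second OST, however high the stripe count is.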
A powerful Lustre utility is lfs. The tool has a built-in help system.
> lfs help
Available commands are:
setstripe getstripe setdirstripe getdirstripe mkdir rm_entry pool_list find ...
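As a rough illustration of calling lfs from a script, the sketch below wraps two common subcommands, `lfs df -h` and `lfs getstripe`. The helper names `lfs_command` and `run_lfs` and the file path are hypothetical; the wrapper returns None when lfs is not installed, so it can be tried on a machine that is not a Lustre client.

```python
import shutil
import subprocess


def lfs_command(subcommand: str, *args: str) -> list:
    """Build the argument list for an lfs call, e.g. ["lfs", "df", "-h"]."""
    return ["lfs", subcommand, *args]


def run_lfs(subcommand: str, *args: str):
    """Run lfs and return its standard output, or None if lfs is missing."""
    if shutil.which("lfs") is None:
        return None  # not on a Lustre client
    result = subprocess.run(
        lfs_command(subcommand, *args),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Free space per MDT/OST in human-readable units.
    print(run_lfs("df", "-h"))
    # Stripe layout of one file; the path is a placeholder, not a real file.
    print(run_lfs("getstripe", "/scratch/example/file.dat"))
```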
Metadata operations are expensive
- stat operations return information on file ownership, permissions, size, update times, etc.
  - obtaining the file size requires a lookup on the MDS and a query for the file size on each OST owning a stripe
- avoid ls -l and colored ls (e.g. ls --color), which stat every file
- avoid file name completion in shells
- open a file and handle the failure instead of checking first with stat (or INQUIRE in Fortran)
- do not stripe small files; Lustre has to check every OST that might own a part of the file
- open a file read-only if that is what you will do
- use tools optimized for (aware of) Lustre
  - lfs find, lfs df, ...
  - stripe-aware tar (star)
- avoid reading the same files from many processes; it is better to read on one process and use MPI communication to move the data to the other processes
- avoid large directories; organize the directory structure by processes/clients
- use open() and seek() if you know the size; otherwise try to organize the application so that only one process writes
- use the Lustre API in your application (see man lustreapi)
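Two of the tips above, "open and fail" with read-only access and one directory per process, can be sketched in a few lines of Python. This is only an illustration: the function names `read_if_present` and `rank_directory` and the `rank_00012` naming scheme are assumptions, not a PC² convention.

```python
import os


def read_if_present(path: str):
    """Return the file's bytes, or None if the file does not exist.

    The file is opened read-only and the failure is handled, instead of
    calling os.path.exists() or os.stat() first and opening afterwards.
    """
    try:
        with open(path, "rb") as handle:  # "rb" = read-only, binary
            return handle.read()
    except FileNotFoundError:
        return None


def rank_directory(base: str, rank: int) -> str:
    """Return a per-process subdirectory so no single directory grows large."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return os.path.join(base, f"rank_{rank:05d}")


if __name__ == "__main__":
    # Both paths are placeholders for illustration.
    print(read_if_present("/scratch/example/does_not_exist.dat"))
    print(rank_directory("/scratch/example/job", 12))
```

The check-then-open pattern costs an extra metadata request per file, while the version above issues only the open itself.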
More tuning tips for Lustre are in Noctua Tuning.