HPC Rebuild – 12 Months on

It’s great to be able to review work you have done in the past and re-factor it for the current climate. In 2023 we will be contracted to re-engineer and upgrade the HPC platform we designed and built for one of Australia’s most prestigious Medical Research Institutes back in 2020-2021.

The platform we built was a small HPC cluster with 2,560 cores, 3x NVIDIA V100 GPUs and a 750TB Lustre file system, interconnected by a 100G converged RDMA backbone. The scheduler was Slurm 20.11.0, and we deployed everything using PXE boot, with SaltStack managing configuration and code deployments. The raw disks from the storage were presented to ZFS and built into pools of software RAID spread across four Object Storage Servers (OSS) and two Metadata Servers (MDS).
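As a rough illustration of how the ZFS-backed Lustre targets were laid out, the sketch below builds one software-RAID pool on an OSS node and formats it as an OST. The pool name, device paths and MGS NID are placeholders rather than a record of the actual build.

    # Build a software-RAID (raidz2) pool from the raw disks on an OSS node
    zpool create -o ashift=12 ostpool0 raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # Format the pool as a Lustre OST with a ZFS backend, registered against the MGS
    mkfs.lustre --ost --backfstype=zfs --fsname=lustre0 --index=0 --mgsnode=10.10.0.1@o2ib ostpool0/ost0

    # Mount the OST to bring it into the file system
    mkdir -p /lustre/ost0
    mount -t lustre ostpool0/ost0 /lustre/ost0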

Key Drivers

There are two key drivers for the upgrade scheduled for 2023. The first is the change in software deployments within Health Care/Life Science HPC workloads and the increasing use of GPUs; the second is the largely sequential nature of the software in operation and the resulting drop in CPU core usage.

HPC Software

HPC environments are used in a number of industries, and the operational profile of each segment is different. In Life Science, and in particular in much Genetic Engineering research, jobs needing a huge number of cores and lots of parallel tasks are rare. Instead, workloads generally need a mix of CPU and GPU resources along with capacity for large files and fast sequential I/O. Large data files also imply larger memory requirements.

Our original build of 512GB of memory and dual 64-core CPUs per compute node has left many of the cores underutilized for the majority of jobs.
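To make that concrete, a typical job submitted to Slurm looks something like the hypothetical batch script below: a handful of cores, a large memory allocation and a GPU, leaving most of a 128-core node idle. The resource figures, job name and pipeline script are illustrative only.

    #!/bin/bash
    #SBATCH --job-name=variant-calling   # illustrative genomics-style job
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8            # only a handful of cores
    #SBATCH --mem=256G                   # but a large slice of node memory
    #SBATCH --gres=gpu:1                 # and a GPU
    #SBATCH --time=24:00:00

    # Placeholder pipeline command - the real workloads are mostly
    # sequential, I/O-heavy tools rather than wide MPI jobs
    srun ./run_pipeline.sh /scratch/$SLURM_JOB_ID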

Lustre Storage Issues

The Lustre storage has worked very well; the key issue has been memory corruption in a node causing the MDS/OSS servers to crash and then run crash recovery on reboot. ZFS has also worked well, but we will scope replacing the raw disks (and the ZFS layer) with a DAS-connected solution where large LUNs are presented directly from the storage array, as the I/O profile is sequential rather than random.

Moving to array-based LUNs means we can correctly configure HA servers using PCS clusters more efficiently. The current HP Smart Array cards will not allow PCS clustering to work due to disk corruption, so a physical swap/reconnect of the storage is required.
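For reference, a minimal sketch of the kind of PCS setup we expect once the array LUNs are in place is below - a two-node OSS pair with an OST as a fail-over resource. Host names, device paths and resource names are placeholders (CentOS 7 / pcs 0.9 syntax), and a real cluster would also need fencing configured for the actual hardware.

    # Authorise and create a two-node cluster for the OSS pair
    pcs cluster auth oss01 oss02 -u hacluster
    pcs cluster setup --name lustre_oss oss01 oss02
    pcs cluster start --all

    # An OST on an array LUN becomes a Filesystem resource that can fail over
    pcs resource create ost0 ocf:heartbeat:Filesystem \
        device=/dev/mapper/ost0_lun directory=/lustre/ost0 fstype=lustre

    # STONITH/fencing is mandatory for shared storage and must be configured
    # against the real BMC/IPMI details before going live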

New Hardware

Compute Node Hardware

Our proposed hardware is 1U single- or dual-CPU nodes with 1TB of RAM. Each CPU will have no more than 24 cores at the highest clock speed available (3.6GHz or better), 200G ConnectX-6 cards will interconnect everything, and each compute node will get an NVIDIA A40 GPU. The OS will still be CentOS 7.9, with no move to RHEL 8 platforms at this stage due to software availability. The OS will live on internal mirrored SSDs, and each node will have a pair of mirrored 960GB SSDs for scratch space.
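As a sketch of how the new nodes might be described to the scheduler, the snippet below shows a possible slurm.conf node definition and matching gres.conf entry for the A40s. The node names, memory figure and device path are assumptions, not the final configuration.

    # slurm.conf - hypothetical entry for the new 1U dual 24-core nodes
    NodeName=hpc-c[101-110] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=1020000 Gres=gpu:a40:1 State=UNKNOWN

    # gres.conf - map the single A40 in each node to its device file
    NodeName=hpc-c[101-110] Name=gpu Type=a40 File=/dev/nvidia0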

These nodes can be rolled into production use over time to increase the capacity and replace nodes going out of warranty.

Storage

As the existing Lustre storage approaches its end of warranty, and given that full HA was never achieved with the vendor's solution, a new storage solution based on Dell ME5084/ME484 enclosures with SAS DAS connectivity into the Lustre storage nodes will be deployed. Capacity will be 700-800TB.

If a full HA solution is required, four servers will be provided: two Metadata Servers (MDS) and two Object Storage Servers (OSS). If only partial HA is justified, then just the paired OSS servers will be provided and a single MDS with local SSD will be used.
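Moving from ZFS pools to array LUNs also changes how the Lustre targets are formatted - roughly as sketched below, with the default ldiskfs backend and placeholder multipath device names and NIDs.

    # MDT on a LUN presented from the array (combined MGS/MDT in this sketch)
    mkfs.lustre --mdt --mgs --fsname=lustre0 --index=0 /dev/mapper/mdt0_lun
    mount -t lustre /dev/mapper/mdt0_lun /lustre/mdt0

    # OST on a large LUN from the ME5084/ME484 shelves, registered against the MGS
    mkfs.lustre --ost --fsname=lustre0 --index=0 --mgsnode=10.10.0.1@o2ib /dev/mapper/ost0_lun
    mount -t lustre /dev/mapper/ost0_lun /lustre/ost0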

Backbone Networking

The existing 100G network will remain, but new nodes will get 200G ConnectX-6 cards as standard. At some point the backbone will be upgraded to 200G, depending on cost.

An analysis of the 4TB node will be made, and if it is not being fully utilised it will not be replaced and will simply run on maintenance support.

-oOo-
