HPC Re-Engineering

Overview

We have been tasked with proposing how we would re-engineer an aging SGI High Performance Computing platform. The platform is a 35-node cluster running Red Hat 6.7 on each node, with a Lustre file system mount and a /home file system presented via NFS from a Hierarchical Storage Node.

In addition to the existing cluster of Compute Nodes, there is an SGI UV2000 memory node with 4TB of RAM and 256 CPUs for processing large data sets.

The nodes are connected using a mix of 10G fibre, 40/56G FDR InfiniBand and 1G/10G Ethernet.

Current Config

The Compute Nodes are currently built around the AMD 6278: 8 cores per CPU at 2.4GHz with 2MB cache, four sockets per node, and 64GB per CPU for 256GB of RAM in total. These CPUs are very old and several generations behind, so while they still work, we propose to keep them as a “legacy” job queue and phase them out as they die. As this occurs, new high-performance Compute Nodes will be phased in to replace them.

Scheduling Software

The cluster runs PBS Pro 13, which is an old release; the current version available is v19, so there is an upgrade path, and we will look at the cost of that in due course. As scheduling software goes, however, there are plenty of alternatives in use across universities, research institutes and engineering labs. One of the more popular job scheduling packages is Slurm, and we will factor this scheduler into our proposal due to its popularity and functionality.
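
As an illustration of how the existing and new hardware might map onto Slurm, here is a minimal sketch that renders job queues as slurm.conf partition definitions, including the “legacy” queue mentioned earlier. The partition names, node ranges and time limits are assumptions for illustration only, not the client's actual layout.

    # Minimal sketch: render Slurm partition definitions for a mixed
    # legacy/new cluster. Partition names, node ranges and time limits
    # are illustrative assumptions only.

    partitions = [
        # (name,     node range (hypothetical),  default, max walltime)
        ("legacy",   "legacy[001-035]",          False,   "48:00:00"),
        ("compute",  "cn[001-040]",              True,    "24:00:00"),
        ("gpu",      "gpu[01-04]",               False,   "24:00:00"),
        ("highmem",  "mem[01-02]",               False,   "72:00:00"),
    ]

    def render_partition(name, nodes, default, max_time):
        """Return one slurm.conf PartitionName line for a job queue."""
        return (f"PartitionName={name} Nodes={nodes} "
                f"Default={'YES' if default else 'NO'} "
                f"MaxTime={max_time} State=UP")

    if __name__ == "__main__":
        for p in partitions:
            print(render_partition(*p))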

Lustre File System

The cluster runs an old implementation of the Lustre file system to provide high-speed storage shared with all nodes in the cluster. This will need to be rebuilt on new hardware with newer storage at some stage, so some research into the performance of current storage technology will be required. The older units have 8TB drives.

Current Issues

The aging cluster is not being fully utilized and this is partly due to:

  • Poor performance of the CPUs, particularly their floating-point capability.
  • The OS on the nodes is Red Hat 6.7, which means more recent software cannot be built on them until they are at a Red Hat 7 or 8 level.

Since the data sets are getting larger, processing is taking longer than the time limit allowed for researchers to process and analyse the data. A faster system will enable quicker turnaround of job results.

The client has already commissioned an upgrade of the cluster, so our changes will kick in after that work is complete.

Proposed Roadmap

After some research we came up with a viable road map to bring the HPC up to a modern, fast platform.

To help phase out much of the older equipment, our road map has addressed the following:

  • Build and commission a new Lustre file server.
  • Install a new, smaller 72-node InfiniBand EDR cluster.
  • Install new scheduler, login and build nodes.
  • Implement some high-end GPU Compute Nodes.
  • Implement some large-memory Compute Nodes.
  • Roll out some basic fast Compute Nodes.

Lustre File Server Upgrade

For the MDS nodes, we have specified two new Intel servers with 256GB of RAM and a direct-attached flash storage array featuring 24 NVMe drives. The configuration will be RAID-10, comprising an 11-drive stripe mirrored against a second 11-drive stripe, with the remaining drives as hot spares. Both servers will see the new storage and will connect to the new IB network via dual-port EDR ConnectX-5 cards.
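
As a quick sanity check on that layout, the sketch below works through the RAID-10 drive accounting for the 24-bay array: an 11-drive stripe mirrored against a second 11-drive stripe, with the remainder left as hot spares. The per-drive capacity is an assumed figure, as the exact NVMe size had not been fixed at this stage.

    # Sketch of the MDS RAID-10 drive accounting for a 24-bay NVMe array.
    # ASSUMPTION: the 3.84TB per-drive capacity is illustrative only.

    total_bays = 24
    stripe_width = 11        # drives in one RAID-0 stripe
    mirror_copies = 2        # the stripe is mirrored (RAID-10)
    drive_tb = 3.84          # assumed capacity per NVMe drive, in TB

    drives_in_raid = stripe_width * mirror_copies   # 22 drives
    hot_spares = total_bays - drives_in_raid        # 2 drives left as spares
    usable_tb = stripe_width * drive_tb             # capacity of one stripe copy

    print(f"Drives in RAID-10: {drives_in_raid}")
    print(f"Hot spares:        {hot_spares}")
    print(f"Usable capacity:   ~{usable_tb:.1f} TB (at {drive_tb} TB per drive)")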

There will be four Object Storage Servers (OSS), connected to another direct-attached storage array with about 750TB of disks; the array will present to all four servers. All of these servers will be configured with Corosync and Pacemaker: the MDS will comprise two servers in an HA configuration, while the OSSs will form two separate HA pairs.

We are also going to create the /home directory and software shares on Lustre; most jobs run from /home, so it is an easy addition.

72-Node IB Network Deployment

The original FDR network allowed for 108 nodes. After examining what exactly was connected, it was determined that many of the connections were InfiniBand storage interfaces, which are no longer supported on most new product offerings. With no InfiniBand interface on the new storage, it seemed that a large fabric was not needed. The next step down from 108 nodes is 72 nodes using six switches.

If a larger cluster is needed later, an additional 3 switches can be purchased and the links re-cabled accordingly.
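
For reference, those switch counts follow from building a non-blocking two-level fat tree out of 36-port EDR switches (assumed here to be the standard 36-port Mellanox building block): each leaf switch gives half its ports to nodes and half to spine uplinks. The sketch below reproduces the 72-node/6-switch and 108-node/9-switch figures under that assumption.

    # Sketch: non-blocking two-level fat tree built from 36-port EDR switches.
    # Each leaf switch uses half its ports for nodes and half for spine uplinks.

    SWITCH_PORTS = 36  # assumed ports per EDR switch

    def fat_tree(leaf_switches):
        """Return (node capacity, spine switches, total switches) for a
        non-blocking two-level fat tree with the given number of leaves."""
        nodes_per_leaf = SWITCH_PORTS // 2             # 18 node ports per leaf
        uplinks = leaf_switches * (SWITCH_PORTS // 2)  # total uplinks to spines
        spines = -(-uplinks // SWITCH_PORTS)           # ceiling division
        return leaf_switches * nodes_per_leaf, spines, leaf_switches + spines

    for leaves in (4, 6):
        nodes, spines, total = fat_tree(leaves)
        print(f"{leaves} leaf switches -> {nodes} nodes, "
              f"{spines} spines, {total} switches in total")

    # 4 leaf switches -> 72 nodes, 2 spines, 6 switches in total
    # 6 leaf switches -> 108 nodes, 3 spines, 9 switches in total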

New Compute Nodes

Selecting new Compute Nodes has been difficult, mainly in choosing CPUs, with the Intel 8278 series and the 64-core AMD EPYC being the top choices. There is also the combined power load of the newer high-end CPUs, which limits how many servers will fit in a rack, especially once you add 300W GPU cards to them.
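
To make the rack-density concern concrete, the sketch below estimates how many GPU nodes fit under a given rack power budget. Apart from the 300W per-GPU figure quoted above, every wattage number (CPU TDP, node overhead, rack feed) is an assumption for illustration only.

    # Rough rack power budget sketch. Apart from the 300W GPU figure from
    # the proposal text, all wattage numbers are illustrative assumptions.

    cpu_tdp_w = 225        # assumed per-socket TDP for a current high-end CPU
    sockets = 2
    gpus_per_node = 4
    gpu_w = 300            # per-GPU figure quoted in the proposal
    node_overhead_w = 300  # assumed: RAM, NVMe, fans, NICs, PSU losses

    node_w = sockets * cpu_tdp_w + gpus_per_node * gpu_w + node_overhead_w

    rack_budget_w = 12000  # assumed usable power per rack
    nodes_per_rack = rack_budget_w // node_w

    print(f"Estimated draw per GPU node: {node_w} W")
    print(f"GPU nodes per {rack_budget_w // 1000} kW rack: {nodes_per_rack}")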

We have identified three use cases for the new CPUs:

  • GPU Nodes with NVIDIA V100 cards and at least 768GB of RAM.
  • Large Memory Nodes with 4TB of RAM.
  • Either dual- or quad-socket configurations for a set of Compute Nodes with 384GB of RAM.

At this stage we are likely to specify Dell R6525 and R7525 servers with EDR InfiniBand and 10/25/100G Ethernet capability.

Final Steps

When the new scheduler nodes are up and running, we will bring the older Compute Nodes across to the new EDR InfiniBand fabric, which will allow the old InfiniBand, storage and admin nodes to be decommissioned.

The final decision will be the configuration of the job queues, made after an analysis of the workloads over the past 12 months.
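
A starting point for that analysis might look like the sketch below, which buckets a year of job records by memory, core count and walltime so that candidate queues become apparent. It assumes the accounting data has been exported to a simple CSV; the file name and column names are illustrative, not the scheduler's native accounting format.

    # Sketch: summarise 12 months of job records to inform queue design.
    # ASSUMPTION: accounting data has been exported to a CSV with columns
    # 'cores', 'mem_gb' and 'walltime_hours' - the names are illustrative.

    import csv
    from collections import Counter

    def bucket(job):
        """Map one job record onto a coarse queue candidate."""
        cores = int(job["cores"])
        mem_gb = float(job["mem_gb"])
        hours = float(job["walltime_hours"])
        if mem_gb > 768:
            return "highmem"
        if hours > 48:
            return "long"
        if cores <= 16:
            return "small"
        return "standard"

    def summarise(path):
        """Count how many historical jobs fall into each queue candidate."""
        counts = Counter()
        with open(path, newline="") as f:
            for job in csv.DictReader(f):
                counts[bucket(job)] += 1
        return counts

    if __name__ == "__main__":
        for queue, n in summarise("jobs_last_12_months.csv").most_common():
            print(f"{queue:10s} {n} jobs")

The output of a pass like this would feed straight into the final queue and walltime-limit definitions.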
