System Architecture

Scalable Storage Unit (SSU)

SSUs are the building blocks of Taiga. Two SSUs are housed per rack with room at the top for service nodes, DTN nodes, and switches. Taiga is built to scale out with more SSUs as demand for the system grows over time.

Gen 1 SSU:

  • Controller: DDN ES400NVX w/ 4 x SS9012 JBOD

  • NVMe Capacity per SSU: 100TB Usable (12 x 15.36TB)

  • HDD Capacity per SSU: 5PB Usable (352 x 18TB)

  • Performance per SSU: ~30GB/s

Gen 2 SSU:

  • Controller: DDN ES400NVX2 w/ 5 x SS9024 JBOD

  • NVMe Capacity per SSU: 100TB Usable (12 x 15.36TB)

  • HDD Capacity per SSU: 6PB Usable (440 x 18TB)

  • Performance per SSU: ~55GB/s

Special Unit:

  • Controller: DDN 18KXE w/ 10 x SS9012 JBOD

  • NVMe Capacity per SSU: 100TB Usable (24 x 7.68TB)

  • HDD Capacity per SSU: 9PB Usable (900 x 12TB)

  • Performance per SSU: ~50GB/s

Note: The Special Unit was integrated from another project and is not a standard SSU.
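The drive counts and usable capacities listed above can be cross-checked with a little arithmetic. The following is a minimal sketch, not an official sizing tool: the `Tier` class and `summarize` function are names made up for this illustration, and it assumes decimal units (1PB = 1000TB). It computes the raw capacity of each tier and the fraction that remains usable after RAID and filesystem overhead.

```python
from dataclasses import dataclass

TB_PER_PB = 1000  # assumption: decimal units, as drive capacities are usually quoted


@dataclass(frozen=True)
class Tier:
    """One storage tier of an SSU (hypothetical helper for this example)."""
    drives: int        # number of drives in the tier
    drive_tb: float    # capacity of a single drive, in TB
    usable_tb: float   # usable capacity quoted in the text, in TB

    @property
    def raw_tb(self) -> float:
        return self.drives * self.drive_tb

    @property
    def usable_fraction(self) -> float:
        return self.usable_tb / self.raw_tb


# Figures copied from the SSU lists above.
SSU_TIERS = {
    "Gen 1": {"nvme": Tier(12, 15.36, 100), "hdd": Tier(352, 18, 5 * TB_PER_PB)},
    "Gen 2": {"nvme": Tier(12, 15.36, 100), "hdd": Tier(440, 18, 6 * TB_PER_PB)},
    "Special": {"nvme": Tier(24, 7.68, 100), "hdd": Tier(900, 12, 9 * TB_PER_PB)},
}


def summarize():
    """Return (unit, tier, raw TB, usable/raw fraction) for every tier."""
    rows = []
    for unit, tiers in SSU_TIERS.items():
        for tier_name, tier in tiers.items():
            rows.append((unit, tier_name,
                         round(tier.raw_tb, 2),
                         round(tier.usable_fraction, 3)))
    return rows


if __name__ == "__main__":
    for unit, tier_name, raw_tb, fraction in summarize():
        print(f"{unit:8s} {tier_name:5s} raw={raw_tb:9.2f} TB  usable/raw={fraction:.3f}")
```

Under these assumptions, the HDD tiers come out at roughly 76% to 83% usable, while the NVMe tiers are closer to 54%. The text does not say how the NVMe capacity is provisioned, so the lower ratio should not be read as a statement about the actual layout.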

Interconnect Fabric

Taiga is built on a Mellanox HDR fabric with an aggregate bandwidth of 4TB/s. All DDN controller couplets, cluster LNET routers, and cluster export nodes connect to a pair of Mellanox QM8700 40-port Managed HDR switches. This fabric serves as the backbone for the storage infrastructure.
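The text does not say how the 4TB/s aggregate figure is derived. One plausible reading, offered here only as an assumption, is that it counts every port on both switches in both directions. The sketch below works through that arithmetic; the function name and its parameters are invented for this illustration.

```python
HDR_PORT_GBPS = 200     # HDR InfiniBand signals at 200 Gb/s per port, per direction
PORTS_PER_SWITCH = 40   # QM8700 port count, from the text
SWITCHES = 2            # "a pair of" switches, from the text


def aggregate_tbytes_per_s(switches=SWITCHES, ports=PORTS_PER_SWITCH,
                           gbps=HDR_PORT_GBPS, bidirectional=True):
    """Aggregate switch bandwidth in TB/s (decimal units).

    Assumption: the aggregate is simply the sum over all ports, doubled
    when both directions are counted.
    """
    directions = 2 if bidirectional else 1
    total_gbits = switches * ports * gbps * directions
    return total_gbits / 8 / 1000  # Gb/s -> GB/s -> TB/s


if __name__ == "__main__":
    print(aggregate_tbytes_per_s())                     # both directions
    print(aggregate_tbytes_per_s(bidirectional=False))  # one direction only
```

Counting both directions gives 4.0 TB/s, which matches the quoted figure. Counting one direction gives 2.0 TB/s. Neither number reflects what a single client or SSU can achieve, since that depends on how many ports each device uses.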

LNET routers placed in each client system translate between Taiga's HDR fabric and that system's own fabric type.

Mellanox switch connections:

  • Delta/DeltaAI LNET routers: HDR200 to Slingshot 11

  • Radiant LNET routers: HDR200 to 100GbE

  • Campus Cluster routers: HDR200 to 200GbE (coming soon)

  • Industry LNET routers: HDR200 to EDR

  • Internal/Generic LNET routers: HDR200 to 100GbE

  • Cluster export nodes (NFS, SMB, S3, Globus)

Filesystem

Taiga uses DDN’s EXAScaler Lustre appliance and currently runs EXA6, which is based on Lustre 2.15. Lustre is a highly scalable, high-performance HPC file system with a growing set of features that support bare-metal workloads, virtual machines (VMs), and containers. Taiga leverages newer Lustre features such as subdirectory mounts, Progressive File Layouts (PFL), and DNE3 (Distributed Namespace Environment), along with tools provided by DDN’s EXAScaler platform, such as the Stratagem policy engine.
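To give a sense of what a PFL layout looks like in practice, the sketch below builds an `lfs setstripe` command line in which each `-E` sets the end offset of a layout component and the following `-c` sets that component's stripe count. The directory path, the component boundaries, and the stripe counts are all hypothetical. They are not Taiga's defaults, and the helper function is invented for this illustration. The code only assembles the command and does not run it.

```python
import shlex


def pfl_setstripe_cmd(path, components):
    """Build an `lfs setstripe` argument list for a PFL layout.

    components: list of (end, stripe_count) pairs, where `end` is a size
    such as "64M" or "1G", or "-1" for end of file, and a stripe_count
    of -1 means "stripe across all available OSTs".
    """
    args = ["lfs", "setstripe"]
    for end, stripe_count in components:
        args += ["-E", str(end), "-c", str(stripe_count)]
    args.append(path)
    return args


if __name__ == "__main__":
    # Hypothetical layout: small files stay on one OST, larger files widen out.
    cmd = pfl_setstripe_cmd(
        "/taiga/example/dir",  # hypothetical path, for illustration only
        [("64M", 1), ("1G", 4), ("-1", -1)],
    )
    print(shlex.join(cmd))
```

With a layout like this, the first 64MB of a file lands on a single OST, the range up to 1GB is striped over four, and anything beyond that is striped as widely as possible, so small and large files in the same directory each get a reasonable layout without user intervention.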

System Architecture Diagram

Taiga architecture diagram showing the SSUs connected to two Mellanox switches. The switches also connect to the Delta, Radiant, Industry, and Internal/Generic LNET routers and to the cluster export nodes. Each set of LNET routers connects to its respective compute system, and the cluster export nodes connect to the NPCF core LAN routers.