The Bull NovaScale

Machine type: ccNUMA system.
Models: NovaScale 5325.
Operating system: Linux, Windows Server 2003, GCOS 8.
Connection structure: Full crossbar.
Compilers: Intel Fortran 95, C(++).
Vendor's information Web page: http://www.bull.com/novascale/
Year of introduction: 2005.


System parameters:
Model: NovaScale 5325
Clock cycle: 1.6 GHz
Theor. peak performance: 204.8 Gflop/s
No. of processors: 8–32
Comm. bandwidth:
  Point-to-point: 6.4 GB/s
  Aggregate: 25.6 GB/s
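
The peak figure in the table follows directly from the processor's floating-point capability: each Itanium 2 can complete two fused multiply-add operations, i.e. 4 floating-point operations, per cycle. For the maximal 32-processor configuration this gives

    \[
      R_{\mathrm{peak}} \;=\; 32 \times 4\,\frac{\mathrm{flop}}{\mathrm{cycle}} \times 1.6\ \mathrm{GHz}
      \;=\; 204.8\ \mathrm{Gflop/s}.
    \]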

Remarks:

The NovaScale 5005 series is the second generation of Itanium 2-based systems targeting the HPC field. Besides the model listed under System Parameters, the series also includes the 5085, 5165, and 5245 models, which we do not discuss separately as they are simply models with a maximum of 8, 16, and 24 processors, respectively. The main difference from the first generation, the 5230 system, is the doubling of the density: where the 5230 had to be housed in two 40 U racks, a 5325 system fits in one rack. In virtually all other respects the new series is identical to the first generation.

The NovaScales are therefore ccNUMA SMPs. They are built from standard Intel Quad Building Blocks (QBBs), each housing 4 Itanium 2 processors and a part of the memory. The QBBs in turn are connected by Bull's proprietary FAME Scalability Switch (FSS), which provides an aggregate bandwidth of 25.6 GB/s. For reliability reasons a NovaScale 5165 is equipped with 2 FSSes. This ensures that when any link between a QBB and a switch, or between switches, fails, the system remains operational, albeit at a lower communication performance level. As each FSS has 8 ports and only 6 of these are occupied within a 5165 system, the remaining ports can be used to couple two of these systems, thus making a 32-processor ccNUMA system. Larger configurations can be made by coupling systems via QsNetII (see section QsNet). Bull provides its own MPI implementation, which turns out to be very efficient (see "Measured Performances" below and [45]).
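
Point-to-point bandwidth figures such as the 6.4 GB/s quoted above are typically obtained with a ping-pong test between two MPI processes placed on different QBBs. The following is a minimal, generic sketch of such a test (it is not Bull's or EuroBen's benchmark; the message size and repetition count are illustrative only):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Generic ping-pong bandwidth sketch between ranks 0 and 1. */
    int main(int argc, char **argv)
    {
        const int nbytes  = 1 << 20;   /* 1 MB message, illustrative */
        const int repeats = 100;       /* illustrative repeat count  */
        int rank;
        double t0, t1;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < repeats; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* 2*nbytes bytes cross the link per iteration (there and back). */
        if (rank == 0)
            printf("bandwidth: %.2f GB/s\n",
                   2.0 * nbytes * repeats / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }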

A nice feature of the NovaScale systems is that they can be partitioned such that different nodes run different operating systems, and that repartitioning can be done dynamically. Although this is not particularly enticing for HPC users, it may be interesting for other markets, especially as Bull still has clients that use its proprietary GCOS operating system.

By mid-2006 Bull had introduced a new server series, the NovaScale 3005. The largest model, the 3045, contains 4 Montecito processors and, with a peak of 49.2 Gflop/s, as such has no place in this report. However, its form factor makes it very compact and well suited for use in heavy-node clusters. Bull offers the 3005 nodes with Quadrics QsNetII as the standard network medium for large configurations. We thought it to be of sufficient interest to mention it here.

Measured Performances:
In the spring of 2004 rather extensive benchmark experiments with the EuroBen Benchmark were performed on a 16-processor NovaScale 5160 with the 1.3 GHz variant of the processor. The MPI version of a dense matrix-vector multiplication was found to run at 13.3 Gflop/s on 16 processors, while for both the solution of a dense linear system of size N = 1,000 and a 1-D FFT of size N = 65,356, speeds of 3.3–3.4 Gflop/s were observed (see [45]). For the recently installed Tera-10 system at CEA, France, a Linpack performance of 42,900 Gflop/s out of 55,705.6 Gflop/s installed is reported, an efficiency of 77% on a linear system of unknown rank (see [49]).
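
The 77% figure quoted for the Tera-10 is simply the ratio of the measured Linpack result to the installed peak performance:

    \[
      \eta \;=\; \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}
      \;=\; \frac{42{,}900\ \mathrm{Gflop/s}}{55{,}705.6\ \mathrm{Gflop/s}}
      \;\approx\; 0.77.
    \]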