The Bull NovaScale

Machine type: ccNUMA system.
Models: NovaScale 5325.
Operating system: Linux, Windows Server 2003, GCOS 8.
Connection structure: Full crossbar.
Compilers: Intel Fortran 95, C(++).
Vendor's information Web page: http://www.bull.com/novascale/
Year of introduction: 2005.


System parameters:
Model: NovaScale 5325
Clock cycle: 1.6 GHz
Theor. peak performance: 204.8 Gflop/s
No. of processors: 8–32
Comm. bandwidth:
  Point-to-point: 6.4 GB/s
  Aggregate: 25.6 GB/s
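
The peak figure in the table follows directly from the processor's floating-point capability: each Itanium 2 can complete two fused multiply-add operations, i.e. 4 floating-point operations, per cycle. For the maximal 32-processor configuration this gives

    \[
      R_{\mathrm{peak}} \;=\; 32 \times 4\,\frac{\mathrm{flop}}{\mathrm{cycle}} \times 1.6\ \mathrm{GHz}
      \;=\; 204.8\ \mathrm{Gflop/s}.
    \]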

Remarks:

The NovaScale 5005 series is the second generation of Itanium 2-based systems targeting the HPC field. Besides the model listed under System Parameters, the series also includes the 5085, 5165, and 5245 models, which we do not discuss separately as they are simply models with a maximum of 8, 16, and 24 processors, respectively. The main difference from the first generation, the 5230 system, is the doubling of the density: where the 5230 had to be housed in two 40 U racks, a 5325 system fits in one rack. In virtually all other respects the new series is identical to the first generation.

The NovaScales are therefore ccNUMA SMPs. They are built from standard Intel Quad Building Blocks (QBBs), each housing 4 Itanium 2 processors and a part of the memory. The QBBs in turn are connected by Bull's proprietary FAME Scalability Switch (FSS), which provides an aggregate bandwidth of 25.6 GB/s. For reliability reasons a NovaScale 5165 is equipped with 2 FSSes. This ensures that when any link between a QBB and a switch, or between switches, fails, the system remains operational, albeit at a lower communication performance level. As each FSS has 8 ports and only 6 of these are occupied within a 5165 system, the remaining ports can be used to couple two of these systems, thus making a 32-processor ccNUMA system. Larger configurations can be made by coupling systems via QsNetII (see section QsNet). Bull provides its own MPI implementation, which turns out to be very efficient (see "Measured Performances" below and [45]).
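
Point-to-point bandwidth figures such as the 6.4 GB/s quoted above are typically obtained with a ping-pong test between two MPI processes placed on different QBBs. The following is a minimal, generic sketch of such a test (it is not Bull's or EuroBen's benchmark; the message size and repetition count are illustrative only):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Generic ping-pong bandwidth sketch between ranks 0 and 1. */
    int main(int argc, char **argv)
    {
        const int nbytes  = 1 << 20;   /* 1 MB message, illustrative */
        const int repeats = 100;       /* illustrative repeat count  */
        int rank;
        double t0, t1;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < repeats; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* 2*nbytes bytes cross the link per iteration (there and back). */
        if (rank == 0)
            printf("bandwidth: %.2f GB/s\n",
                   2.0 * nbytes * repeats / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }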

A nice feature of the NovaScale systems is that they can be partitioned such that different nodes run different operating systems, and that repartitioning can be done dynamically. Although this is not particularly enticing for HPC users, it may be interesting for other markets, especially as Bull still has clients that use its proprietary GCOS operating system.

By mid-2006 Bull had introduced a new server series, the NovaScale 3005. The largest model, the 3045, contains 4 Montecito processors and, with a peak of 49.2 Gflop/s, as such has no place in this report. However, its form factor makes it very compact and well suited for use in heavy-node clusters. Bull offers the 3005 nodes with Quadrics QsNetII as the standard network medium for large configurations. We thought it to be of sufficient interest to mention it here.

Measured Performances:
In the spring of 2004 rather extensive benchmark experiments with the EuroBen Benchmark were performed on a 16-processor NovaScale 5160 with the 1.3 GHz variant of the processor. The MPI version of a dense matrix-vector multiplication was found to run at 13.3 Gflop/s on 16 processors, while for both the solution of a dense linear system of size N = 1,000 and a 1-D FFT of size N = 65,356, speeds of 3.3–3.4 Gflop/s were observed (see [45]). For the recently installed Tera-10 system at CEA, France, a Linpack performance of 42,900 Gflop/s out of 55,705.6 Gflop/s installed is reported, an efficiency of 77% on a linear system of unknown rank (see [49]).
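
The 77% figure quoted for the Tera-10 is simply the ratio of the measured Linpack result to the installed peak performance:

    \[
      \eta \;=\; \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}
      \;=\; \frac{42{,}900\ \mathrm{Gflop/s}}{55{,}705.6\ \mathrm{Gflop/s}}
      \;\approx\; 0.77.
    \]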