The IBM BlueGene/L&P

Machine type: RISC-based distributed-memory multi-processor
Models: IBM BlueGene/L&P
Operating system: Linux
Connection structure: 3-D Torus, Tree network
Compilers: XL Fortran (Fortran 90), XL C, C++
Vendor's information Web page: www-1.ibm.com/servers/deepcomputing/
Year of introduction: 2004 for BlueGene/L, 2007 for BlueGene/P

System parameters:

Model                              BlueGene/L           BlueGene/P
Clock cycle                        700 MHz              850 MHz
Theor. peak performance
  Per Proc. (64-bits)              2.8 Gflop/s          3.4 Gflop/s
  Maximal                          367/183.5 Tflop/s    1.5/3 Pflop/s
Main memory
  Memory/card                      ≤ 512 MB             ≤ 2 GB
  Memory/maximal                   ≤ 16 TB              ≤ 442 TB
No. of processors                  ≤ 2×65,536           ≤ 4×221,184
Communication bandwidth
  Point-to-point (3-D Torus)       175 MB/s             350 MB/s
  Point-to-point (Tree network)    350 MB/s             700 MB/s

Remarks:

The BlueGene/L is the first in a new generation of systems made by IBM for very massively parallel computing. The individual speed of the processor has therefore been traded in favour of very dense packaging and a low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz. Two of these processors reside on a chip together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle. This is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high. Each CPU has 32 KB of instruction cache and 32 KB of data cache on board. In favourable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s because the two FPUs can each perform a fused multiply-add operation per cycle, i.e., 4 flops per cycle at 700 MHz. Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.

The packaging in the system is as follows: two chips fit on a compute card with 512 MB of memory. Sixteen of these compute cards are placed on a node board, of which in turn 32 go into one cabinet. So, one cabinet contains 1024 chips, i.e., 2048 CPUs. For a maximal configuration 64 cabinets are coupled to form one system with 65,536 chips/131,072 CPUs. In normal operation mode one of the CPUs on a chip is used for computation while the other takes care of communication tasks. In this mode the Theoretical Peak Performance of the system is 183.5 Tflop/s. It is however possible, when the communication requirements are very low, to use both CPUs for computation, doubling the peak speed to 367 Tflop/s; hence the double entries in the System Parameters table above. This doubled figure, rounded to 360 Tflop/s, is also the speed that IBM is using in its marketing material.
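The packaging arithmetic above can be checked with a few lines of Python. This is only a sketch: the function and constant names are made up for illustration, and the figure of 4 flops per cycle is an assumption derived from the two FPUs each performing one fused multiply-add per cycle.

```python
# Packaging figures quoted in the text for the BlueGene/L.
CHIPS_PER_CARD = 2
CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32
MAX_CABINETS = 64
CPUS_PER_CHIP = 2

CLOCK_HZ = 700_000_000
# Assumption: 2 FPUs x 1 fused multiply-add (2 flops) per cycle.
FLOPS_PER_CYCLE = 4


def bgl_counts(cabinets=MAX_CABINETS):
    """Return (chips per cabinet, total chips, total CPUs)."""
    chips_per_cabinet = (
        CHIPS_PER_CARD * CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_CABINET
    )
    chips = chips_per_cabinet * cabinets
    return chips_per_cabinet, chips, chips * CPUS_PER_CHIP


def peak_tflops(n_cpus):
    """Theoretical peak in Tflop/s for n_cpus CPUs doing computation."""
    return n_cpus * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12


if __name__ == "__main__":
    per_cabinet, chips, cpus = bgl_counts()
    print(per_cabinet, chips, cpus)
    # Normal mode: one compute CPU per chip.
    print(round(peak_tflops(chips), 1))
    # Both CPUs on each chip used for computation.
    print(round(peak_tflops(cpus), 1))
```

The two peak values correspond to the double entry in the System Parameters table: one for a single compute CPU per chip, one for both.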

The BlueGene/L possesses no fewer than 5 networks, 2 of which are of interest for inter-processor communication: a 3-D torus network and a tree network. The torus network is used for most general communication patterns. The tree network is used for frequently occurring collective communication patterns like broadcasting, reduction operations, etc. The hardware bandwidth of the tree network is twice that of the torus: 350 MB/s against 175 MB/s per link.
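To illustrate what a 3-D torus connection structure means for a node, the following sketch computes a node's six nearest neighbours and the minimal hop count between two nodes. The torus dimensions used here (8 x 8 x 8) and the function names are hypothetical; the text does not state the actual dimensions of a BlueGene/L partition.

```python
def torus_neighbours(coord, dims):
    """Return the six nearest neighbours of coord in a 3-D torus.

    Each dimension wraps around, so a node on the "edge" is linked
    to the node on the opposite side.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimal number of link traversals between nodes a and b."""
    hops = 0
    for pa, pb, n in zip(a, b, dims):
        d = abs(pa - pb)
        hops += min(d, n - d)  # go whichever way around is shorter
    return hops


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical partition shape
    print(torus_neighbours((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (7, 4, 1), dims))
```

Because of the wrap-around links, the farthest node in each dimension is only half the dimension size away, which keeps the worst-case hop count low for general communication patterns.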

BlueGene/P
Very recently the second-generation BlueGene system, the BlueGene/P, has been announced. A first system is expected to be installed this year. The macro-architecture of the BlueGene/P is very similar to that of the L model, except that about everything in the system is faster and bigger. The chip is a variant of the PowerPC 450 family and will run at 850 MHz. As in the BlueGene/L processor, 4 floating-point operations can be performed per cycle, so the theoretical peak performance is 3.4 Gflop/s per core. Four processor cores reside on a chip (as opposed to 2 in the L model). The L3 cache grows from 4 to 8 MB and the memory per chip increases four-fold to 2 GB. In addition the memory bandwidth in B/cycle has doubled and becomes 13.6 GB/s. Unlike the dual-core BlueGene/L chip, the quad-core model P chip can work in true SMP mode, making it amenable to the use of OpenMP.

One board in the system carries 32 quad-core chips, while again 32 boards can be fitted in one rack, giving 4,096 cores per rack. A rack therefore has a Theoretical Peak Performance of 13.9 Tflop/s. The IBM press release sets the maximum number of cores in a system to 884,736 in 216 racks, with a Theoretical Peak Performance of 3 Pflop/s. The bandwidth of the main communication networks (torus and tree) also goes up by a factor of about 2 while the latency is halved.
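The rack and system figures for the BlueGene/P follow from the same kind of arithmetic. The sketch below uses only numbers quoted in the text; the names are illustrative.

```python
# Figures quoted in the text for the BlueGene/P.
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
CORES_PER_CHIP = 4
MAX_RACKS = 216

CLOCK_HZ = 850_000_000
FLOPS_PER_CYCLE = 4  # as stated for both the L and P processors


def cores_per_rack():
    return CHIPS_PER_BOARD * BOARDS_PER_RACK * CORES_PER_CHIP


def peak_flops(n_cores):
    """Theoretical peak in flop/s for n_cores cores."""
    return n_cores * CLOCK_HZ * FLOPS_PER_CYCLE


if __name__ == "__main__":
    rack_cores = cores_per_rack()
    system_cores = rack_cores * MAX_RACKS
    print(rack_cores, system_cores)
    print(round(peak_flops(rack_cores) / 1e12, 1), "Tflop/s per rack")
    print(round(peak_flops(system_cores) / 1e15, 2), "Pflop/s maximal")
```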

Like the BlueGene/L, the P model is very energy-efficient: a 1024-processor (4096-core) rack draws only 40 kW.

Measured Performances:
Obviously no results for the BlueGene/P are available yet, but in [49] a speed of 280.6 Tflop/s on the HPC Linpack benchmark is reported for the BlueGene/L, solving a linear system of size N = 1,769,471 with 131,072 processors. Relative to the peak performance of 367 Tflop/s this amounts to an efficiency of 76%.
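The efficiency figure is simply the measured Linpack speed divided by the theoretical peak of the processors used. A minimal sketch, assuming the 700 MHz clock and 4 flops per cycle given above for the BlueGene/L (the function name is illustrative):

```python
CLOCK_HZ = 700_000_000   # BlueGene/L clock
FLOPS_PER_CYCLE = 4      # assumption: 2 FPUs x fused multiply-add


def linpack_efficiency(measured_tflops, n_processors):
    """Ratio of measured speed to theoretical peak (both in Tflop/s)."""
    peak_tflops = n_processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
    return measured_tflops / peak_tflops


if __name__ == "__main__":
    eff = linpack_efficiency(280.6, 131_072)
    print(f"{100 * eff:.1f}%")
```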