The Hitachi SR11000

Machine type RISC-based distributed memory multi-processor
Models SR11000 K1.
Operating system AIX (IBM's Unix variant).
Connection structure Multi-dimensional crossbar (see remarks)
Compilers Fortran 77, Fortran 95, Parallel Fortran, C, C++
Vendor's information Web page www.hitachi.co.jp/Prod/comp/hpc/SR_e/11ktop_e.html
Year of introduction 2005.

System parameters:

Model                     SR11000 K1
Clock cycle               2.1 GHz
Theor. peak performance
  Per proc. (64-bit)      134.4 Gflop/s
  Maximal                 68.8 Tflop/s
Main memory
  Memory/node             ≤ 128 GB
  Memory maximal          16.4 TB
No. of processors         4–512
Communication bandwidth
  Point-to-point          12 GB/s (bidirectional)

Remarks:

The SR11000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR8000 (see Systems Disappeared from the List). We discuss here the latest model, the SR11000 K1. There is also a J1 model, identical to the K1 except for its 1.9 GHz clock cycle. The J1 and K1 systems replace the H1 model, which had exactly the same structure but was based on the 1.7 GHz IBM POWER4+ processor instead of the POWER5.

The basic node processor in the K1 model is a 2.1 GHz POWER5 from IBM. Unlike in the former SR2201 and SR8000 systems, no modification of the processor has been made to fit it for Hitachi's Pseudo Vector Processing, a technique that enabled the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required. Presumably Hitachi now relies on advanced prefetching of data to bring about the same effect.
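
A minimal sketch of what such prefetching amounts to, using the generic GCC-style __builtin_prefetch (both the builtin and the prefetch distance are illustrative assumptions, not Hitachi-documented details): a streaming loop requests its operands a fixed distance ahead so that long, out-of-cache vectors can be traversed without stalling on every load.

  /* Illustrative only: software prefetching a fixed distance ahead of the
   * current iteration, roughly the effect a compiler aims for when streaming
   * long, out-of-cache vectors.  The distance of 64 elements is a tuning
   * parameter chosen for the example, not a documented value. */
  void daxpy_prefetch(long n, double a, const double *x, double *y)
  {
      const long dist = 64;                        /* prefetch distance in elements */
      for (long i = 0; i < n; i++) {
          if (i + dist < n) {
              __builtin_prefetch(&x[i + dist], 0, 0);   /* read, low temporal locality  */
              __builtin_prefetch(&y[i + dist], 1, 0);   /* write, low temporal locality */
          }
          y[i] = a * x[i] + y[i];
      }
  }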

The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions, resulting in a speed of 8.4 Gflop/s on the SR11000 K1. However, 16 basic processors are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 134.4 Gflop/s. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the preceding SR8000, the SR11000 does not contain an SP anymore, a system processor that performed system tasks and managed communication with other nodes and a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves.
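
The peak figures in the system parameters table follow directly from these numbers; the small sketch below merely reproduces that arithmetic (2 multiply/add pipes delivering 2 flops each per clock, 16 IPs per COMPAS node, at most 512 nodes):

  #include <stdio.h>

  /* Reproduces the peak-performance figures quoted for the SR11000 K1. */
  int main(void)
  {
      double clock_ghz     = 2.1;                         /* POWER5 clock            */
      double flops_per_clk = 2 * 2;                       /* 2 fused multiply/adds   */
      double per_ip        = clock_ghz * flops_per_clk;   /*   8.4 Gflop/s           */
      double per_node      = 16 * per_ip;                 /* 134.4 Gflop/s           */
      double maximal       = 512 * per_node / 1000.0;     /*  68.8 Tflop/s           */

      printf("per IP   : %6.1f Gflop/s\n", per_ip);
      printf("per node : %6.1f Gflop/s\n", per_node);
      printf("maximal  : %6.1f Tflop/s\n", maximal);
      return 0;
  }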

The SR11000 has a multi-dimensional crossbar with a single-directional link speed of 12 GB/s. Here, too, IBM technology is used: the IBM Federation switch fabric, albeit in a different topology than IBM employed for its own p690 servers. For configurations of 4–8 nodes the cross-section of the network is 1 hop, for 16–64 nodes it is 2 hops, and from 128-node systems on it is 3 hops.
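
A hypothetical helper that does no more than encode the cross-section figures just quoted; how configurations between the quoted sizes behave is an assumption here (rounded up to the next documented tier):

  /* Hypothetical: cross-section of the SR11000 crossbar in hops, encoding
   * only the tiers quoted in the text (4-8, 16-64, 128+ nodes). */
  int crossbar_hops(int nodes)
  {
      if (nodes <= 8)  return 1;    /*   4-8  nodes */
      if (nodes <= 64) return 2;    /*  16-64 nodes */
      return 3;                     /*  128+  nodes */
  }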

As in some other systems, such as the Cray XT4 and the late AlphaServer SC and NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.
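
Hitachi's own remote-memory-access primitives are not documented here; as a rough analogue of this programming style, MPI-2 one-sided communication lets one rank write straight into another rank's memory, with fences standing in for the barrier synchronisation:

  #include <mpi.h>
  #include <stdio.h>

  /* Rough analogue only (generic MPI-2, not Hitachi's primitives).
   * Each rank exposes one double in a window; rank 0 writes directly into
   * rank 1's memory.  Run with at least 2 ranks. */
  int main(int argc, char **argv)
  {
      int rank;
      double buf = 0.0;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Win_create(&buf, sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);
      if (rank == 0) {
          double val = 3.14;
          MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);  /* write into rank 1 */
      }
      MPI_Win_fence(0, win);

      if (rank == 1)
          printf("rank 1 received %.2f\n", buf);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }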

Of course the usual communication libraries like PVM and MPI are provided. When MPI is used it is possible to address individual IPs within the nodes. Furthermore, within one node it is possible to use OpenMP on individual IPs. Mostly this is less efficient than the automatic parallelisation performed by Hitachi's compiler, but when coarser-grained task parallelism is expressed via OpenMP a performance gain can be attained. Hitachi provides its own numerical libraries for solving dense and sparse linear systems, FFTs, etc. As yet it is not known whether third-party numerical libraries like NAG and IMSL are available.
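
A minimal, generic hybrid sketch of the usage described (plain MPI plus OpenMP, nothing Hitachi-specific): one MPI process per node, with a coarse-grained OpenMP region spreading work over the IPs inside that node.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  /* Generic hybrid sketch: one MPI process per node, OpenMP threads over
   * the IPs within the node (e.g. 16 threads on an SR11000-class node). */
  int main(int argc, char **argv)
  {
      int node_rank, provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &node_rank);

      #pragma omp parallel
      {
          int ip = omp_get_thread_num();
          printf("node %d, IP %d working on its own coarse task\n",
                 node_rank, ip);
      }

      MPI_Finalize();
      return 0;
  }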

Note: Large HPC configurations of the SR11000 are not sold in Europe, as Hitachi judges them to be of insufficient economic interest.

Measured Performances:
The first SR11000 was introduced by the end of 2003. A few model H1 systems have been sold in Japan in the meantime, but no performance results are available for these systems. Presently only one model K1 result is available in [49]: for an 80-node system at Hitachi's Enterprise Server Division, a Linpack speed of 9,036 Gflop/s was measured on a linear system of order 542,700, against a peak performance of 10,752 Gflop/s, thus attaining an efficiency of 84.0%.