The Cray Inc. X1E


Machine type: Shared-memory multi-vector processor.
Models: X1E (cluster).
Operating system: UNICOS (Cray Unix variant).
Connection structure: Crossbar.
Compilers: Fortran 90, C, C++, Co-Array Fortran, UPC.
Vendor's information Web page: www.cray.com/products/x1e/index.html
Year of introduction: 2004.

System parameters:

Model                     Cray X1E AC      Cray X1E LC      Cray X1E MF
Clock cycle               1.125 GHz        1.125 GHz        1.125 GHz
Frames                    1, air-cooled    1, liquid-cooled 64, liquid-cooled
Theor. peak performance
  Per proc. (SSP/MSP)     4.5/18 Gflop/s   4.5/18 Gflop/s   4.5/18 Gflop/s
  Maximal                 576 Gflop/s      2.3 Tflop/s      147.2 Tflop/s
Memory                    ≤ 128 GB         ≤ 512 GB         ≤ 32 TB
No. of proc.s (MSP, see below)  ≤ 32       ≤ 128            ≤ 8192
Memory bandwidth
  Memory–cache            34.1 GB/s        34.1 GB/s        34.1 GB/s
  Cache–CPU               76.8 GB/s        76.8 GB/s        76.8 GB/s
  Maximum aggregate       816 GB/s         3.2 TB/s         204.8 TB/s

Remarks:

Each processor board of an X1E contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 4.5 Gflop/s per CPU. However, the 4 CPUs on a board can be coupled to form a so-called Multi-Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 18 Gflop/s. The reconfiguration into MSPs and/or single-CPU combinations can be done dynamically as the workload dictates. The vector start-up time of a single CPU is smaller than that of an MSP, so for short vectors single CPUs may be preferable, while for programs containing long vectors the MSPs should be advantageous. MSP mode is regarded as the standard mode of operation. This is also visible in the processor counts given in Cray's data sheets: the maximum number within one cabinet is 32 MSPs for the air-cooled model, while it is 128 MSPs for the liquid-cooled variant. The current Cray optimisation documentation states that the Cray Programming Environment is as yet not optimised for SSP processing, which is not to say that suitable programs would not run efficiently in SSP mode.

The logical hardware structure of the Cray X1E is largely identical to that of the former Cray X1 (see "Systems disappeared from the list"). However, technology improvements made it possible to double the packaging density by placing two MSPs on one multi-chip module (MCM). Four of these MCMs are located in one X1E Compute Module, which thus comprises two (logical) 4-MSP nodes. The four MCMs are connected to routing logic in the Compute Module, which contains 32 network ports to connect it to other Compute Modules. The clock frequency of the processors was raised from 800 MHz in the X1 to 1.125 GHz in the X1E, and the amount of memory and the bandwidth to the processors were increased.

The relative bandwidth, both from memory to the CPU boards and from the cache to the CPUs, has improved in comparison to the predecessor X1: from memory to a CPU board 5.3 8-byte operands can be transferred per cycle, while from the cache the peak bandwidth to the CPUs is 12 8-byte operands per cycle, enough to sustain dyadic operations. The cache structure is rather complex: each of the 4 SSPs on a board has its own 16 KB 2-way set-associative L1 data and instruction caches. The L1 data cache only stores scalar data. The L2 cache is 2 MB in size and is shared by the SSPs on the CPU board.

New features that are less visible to the user are adherence to the IEEE 754 floating-point arithmetic standard and a new vector instruction set that can make better use of new features such as the caches and the addressability and synchronisation of remote nodes. The latter matters because every cabinet can be regarded as a node in a cluster, of which a maximum of 64 can be configured in what is called a modified 2-D torus topology. Cray itself regards a board with 4 MSPs as a "node". Each node has two connections to the outside world: odd and even nodes are connected in pairs, and the other connection from the board goes via a switch to the other boards, thus requiring at most two hops to reach any other MSP in the cabinet. The aggregate bandwidth in such a fully populated cabinet is 400 GB/s. Multi-cabinet configurations extend the 2-D torus into a 3-D torus structure, much like the late Cray T3E. Latency and bandwidth data for point-to-point communication are not provided by Cray, but various measurements in an MPI environment have been done; see the Measured Performances below.
On a 4-SSP CPU board OpenMP can be employed. When accessing other CPU boards one can use Cray's shmem library for one-sided communication, MPI, Co-Array Fortran, etc.

Measured Performances: In [49] a speed of 15,706 Gflop/s is reported for solving a 494,592-order linear system on a 1,020-MSP machine at the Korea Meteorological Administration. This amounts to an efficiency of 85%.
A more extensive evaluation of a 504-MSP system at ORNL is reported in [13]. Here a point-to-point bandwidth of 13.9 GB/s was found within a 4-MSP node and 11.9 GB/s between nodes. MPI latencies for small messages were 8.2 and 8.6 µs, respectively. For shmem and Co-Array Fortran the latencies were only 3.8 and 3.0 µs, respectively.