The Cray Inc. X1E


Machine type: Shared-memory multi-vector processor.
Models: X1E (cluster).
Operating system: UNICOS (Cray Unix variant).
Connection structure: Crossbar.
Compilers: Fortran 90, C, C++, Co-Array Fortran, UPC.
Vendor's information Web page: www.cray.com/products/x1e/index.html
Year of introduction: 2004.

System parameters:

Model                     Cray X1E AC      Cray X1E LC      Cray X1E MF
Clock cycle               1.125 GHz        1.125 GHz        1.125 GHz
Frames                    1, air-cooled    1, liquid-cooled 64, liquid-cooled
Theor. peak performance
  Per proc. (SSP/MSP)     4.5/18 Gflop/s   4.5/18 Gflop/s   4.5/18 Gflop/s
  Maximal                 576 Gflop/s      2.3 Tflop/s      147.2 Tflop/s
Memory                    ≤ 128 GB         ≤ 512 GB         ≤ 32 TB
No. of proc.s (MSP, see below)  ≤ 32       ≤ 128            ≤ 8192
Memory bandwidth
  Memory–cache            34.1 GB/s        34.1 GB/s        34.1 GB/s
  Cache–CPU               76.8 GB/s        76.8 GB/s        76.8 GB/s
  Maximum aggregate       816 GB/s         3.2 TB/s         204.8 TB/s

Remarks:

Each processor board of an X1E contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 4.5 Gflop/s per CPU. However, the 4 CPUs on a board can be coupled to form a so-called Multi-Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 18 Gflop/s. The reconfiguration into MSPs and/or single-CPU combinations can be done dynamically as the workload dictates. The vector start-up time of a single CPU is smaller than that of an MSP, so for short vectors single CPUs may be preferable, while for programs containing long vectors the MSPs should be advantageous. MSP mode is regarded as the standard mode of operation. This is also visible in the processor counts given in Cray's data sheets: the maximum number within one cabinet is 32 MSPs for the air-cooled model, while it is 128 MSPs for the liquid-cooled variant. The current Cray optimisation documentation states that the Cray Programming Environment is as yet not optimised for SSP processing, which is not to say that suitable programs would not run efficiently in SSP mode.

The logical hardware structure of the Cray X1E is largely identical to that of the former Cray X1 (see "Systems disappeared from the list"). However, technology improvements made it possible to double the packaging density by placing two MSPs on one multi-chip module (MCM). Four of these MCMs are located in one X1E Compute Module, which thus comprises two (logical) 4-MSP nodes. The four MCMs are connected to routing logic in the Compute Module, which contains 32 network ports to connect it to other Compute Modules. The clock frequency of the processors was raised from 800 MHz in the X1 to 1.125 GHz in the X1E, and the amount of memory and the bandwidth to the processors were increased.

The relative bandwidth, both from memory to the CPU boards and from the cache to the CPUs, has improved in comparison to the predecessor X1: from memory to a CPU board 5.3 8-byte operands can be transferred per cycle, while from the cache the peak bandwidth to the CPUs is 12 8-byte operands per cycle, enough to sustain dyadic operations. The cache structure is rather complex: each of the 4 SSPs on a board has its own 16 KB 2-way set-associative L1 data and instruction caches. The L1 data cache only stores scalar data. The L2 cache is 2 MB in size and is shared by the SSPs on the CPU board.

New features that are less visible to the user are adherence to the IEEE 754 floating-point arithmetic standard and a new vector instruction set that can make better use of new features such as the caches and the addressability and synchronisation of remote nodes. The latter matters because every cabinet can be regarded as a node in a cluster, of which a maximum of 64 can be configured in what is called a modified 2-D torus topology. Cray itself regards a board with 4 MSPs as a "node". Each node has two connections to the outside world: odd and even nodes are connected in pairs, and the other connection from the board goes via a switch to the other boards, thus requiring at most two hops to reach any other MSP in the cabinet. The aggregate bandwidth in such a fully populated cabinet is 400 GB/s. Multi-cabinet configurations extend the 2-D torus into a 3-D torus structure, much like the late Cray T3E. Latency and bandwidth data for point-to-point communication are not provided by Cray, but various measurements in an MPI environment have been done; see the Measured Performances below.
On a 4-SSP CPU board OpenMP can be employed. When accessing other CPU boards one can use Cray's shmem library for one-sided communication, MPI, Co-Array Fortran, etc.

Measured Performances: In [49] a speed of 15,706 Gflop/s is reported for solving a 494,592-order linear system on a 1,020-MSP machine at the Korea Meteorological Administration. This amounts to an efficiency of 85%.
A more extensive evaluation of a 504-MSP system at ORNL is reported in [13]. Here a point-to-point bandwidth of 13.9 GB/s was found within a 4-MSP node and 11.9 GB/s between nodes. MPI latencies for small messages were 8.2 and 8.6 µs, respectively. For shmem and Co-Array Fortran the latencies were only 3.8 and 3.0 µs, respectively.