System parameters:
Remarks: The macro architecture of the Cray XMT is very similar to that of the Cray XT3 and XT4. The processors, however, are completely different: they are built for massive multithreading and resemble those of the late Cray MTA-2 (see Systems disappeared from the list and [46]). Let us look at the architectural features: although the memory in the XMT is physically distributed, the system is emphatically presented as a shared-memory machine (with non-uniform access time). The latency incurred in memory references is hidden by multithreading, i.e., many concurrent program threads (instruction streams) may be active at any time. So, when for instance a load instruction cannot be satisfied because of memory latency, the requesting thread is stalled and another thread for which an operation can be performed is switched into execution. This switching between program threads takes only 1 cycle. As there may be up to 128 instruction streams per processor and 8 memory references can be issued per stream without waiting for preceding ones, a latency of 1024 cycles can be tolerated. Stalled references are retried from a retry pool. A similar construction was to be found in the late Stern Computing Systems SSP machines (see section Systems disappeared from the list). An XMT processor has 3 functional units that together can deliver 3 flops per clock cycle, for a theoretical peak performance of 1.5 Gflop/s. There is only one level of cache, for data and instructions, because for the applications at which the machine is directed more cache levels would be virtually useless: the high degree of latency hiding through massive multithreading is the mechanism of choice to combat memory latency. Unlike for the earlier MTA-2, there is no Fortran compiler anymore for the XMT.
Furthermore, the new 3-D torus network, identical to that of the Cray XT3, and the faster clock cycle of 500 MHz make the machine highly interesting for applications with very unstructured but massively parallel work, as for instance in sorting, data mining, combinatorial optimisation, and other complex pattern-matching applications. Also algorithms like sparse matrix-vector multiplication should perform well.
Measured Performances: As yet no independent performance results for this new machine are available to prove the value of this interesting architecture.