The Cray Inc. XMT

Machine type: Distributed-memory multi-processor
Models: XMT
Operating system: UNICOS/lc, Cray's microkernel Unix
Connection structure: 3-D Torus
Compilers: C, C++
Vendor's information Web page: www.cray.com/products/xmt/
Year of introduction: 2006

System parameters:

Model: Cray XMT
Clock cycle: 500 MHz
Theor. peak performance
  Per processor: 1.5 Gflop/s
  Per cabinet: 36 Gflop/s
  Max. configuration: 12.3 Tflop/s
Memory
  Per cabinet: ≤ 324 GB
  Max. configuration: 128 TB
No. of processors
  Per cabinet: 24
  Max. configuration: 8192
Communication bandwidth
  Point-to-point: ≤ 7.6 GB/s
  Bisectional/cabinet: 127.8 GB/s

Remarks:

The macro-architecture of the Cray XMT is very similar to that of the Cray XT3 and the XT4. The processors, however, are completely different: they are built for massive multithreading and resemble the processors of the late Cray MTA-2 (see Systems disappeared from the list and [46]).

Let us look at the architectural features. Although the memory in the XMT is physically distributed, the system is emphatically presented as a shared-memory machine (with non-uniform access time). Memory latency is hidden by multi-threading: usually many concurrent program threads (instruction streams) are active at any time. When, for instance, a load instruction cannot be satisfied because of memory latency, the thread that issued it is stalled and another thread that has an operation ready is switched into execution. Switching between program threads takes only 1 cycle. As there may be up to 128 instruction streams per processor, each of which can issue 8 memory references without waiting for preceding ones, up to 128 × 8 = 1024 references can be in flight and a latency of 1024 cycles can be tolerated. References that are stalled are retried from a retry pool. A similar mechanism was found in the late Stern Computing Systems SSP machines (see the section Systems disappeared from the list).
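The latency figure above is a matter of simple counting, which the following Python sketch makes explicit. It is a toy model, not a description of the actual hardware scheduler: it assumes that the processor issues at most one memory reference per cycle, and the function names are hypothetical. Only the numbers 128 and 8 come from the text.

```python
# Toy model of latency hiding by multi-threading (illustration only).
# Assumption: the processor issues at most one memory reference per cycle.

STREAMS_PER_PROCESSOR = 128   # instruction streams per processor (from the text)
REFS_PER_STREAM = 8           # references in flight per stream (from the text)


def tolerable_latency_cycles(streams, refs_per_stream):
    """Longest memory latency (in cycles) that can be fully hidden.

    With streams * refs_per_stream references in flight and one reference
    issued per cycle, a reference need not complete until that many cycles
    after it was issued.
    """
    return streams * refs_per_stream


def issue_utilization(latency_cycles,
                      streams=STREAMS_PER_PROCESSOR,
                      refs_per_stream=REFS_PER_STREAM):
    """Fraction of cycles in which a reference can still be issued.

    Once the latency exceeds the number of references that may be in
    flight, the processor has to idle part of the time.
    """
    in_flight = streams * refs_per_stream
    return min(1.0, in_flight / latency_cycles)


if __name__ == "__main__":
    print(tolerable_latency_cycles(STREAMS_PER_PROCESSOR, REFS_PER_STREAM))
    print(issue_utilization(1024))
    print(issue_utilization(2048))
```

Under this model a latency of up to 1024 cycles costs nothing as long as enough streams have work to do, while twice that latency would halve the issue rate.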

An XMT processor has 3 functional units that together can deliver 3 flops per clock cycle, which at 500 MHz gives a theoretical peak performance of 1.5 Gflop/s. There is only one level of cache (data and instruction), because for the applications at which the machine is directed more cache levels would be virtually useless. Instead, the high degree of latency hiding through massive multi-threading is the mechanism of choice here to combat memory latency.
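The peak-performance entries in the system parameters follow directly from the clock frequency, the 3 flops per cycle, and the processor counts. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical and the constants are the values quoted in this section.

```python
# Theoretical peak performance from the figures quoted in this section.

CLOCK_HZ = 500e6              # 500 MHz clock
FLOPS_PER_CYCLE = 3           # 3 functional units, 1 flop each per cycle
PROCESSORS_PER_CABINET = 24
MAX_PROCESSORS = 8192


def peak_gflops(n_processors):
    """Theoretical peak in Gflop/s for a given number of processors."""
    return n_processors * FLOPS_PER_CYCLE * CLOCK_HZ / 1e9


if __name__ == "__main__":
    print("per processor:", peak_gflops(1), "Gflop/s")
    print("per cabinet:  ", peak_gflops(PROCESSORS_PER_CABINET), "Gflop/s")
    print("max. config.: ", peak_gflops(MAX_PROCESSORS) / 1000, "Tflop/s")
```

The maximum configuration works out to 12.288 Tflop/s, which is listed rounded as 12.3 Tflop/s.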

Unlike the earlier MTA-2, the XMT no longer has a Fortran compiler. Furthermore, the new 3-D torus network, identical to that of the Cray XT3, and the faster clock of 500 MHz make the machine highly interesting for applications with very unstructured but massively parallel work, for instance sorting, data mining, combinatorial optimisation, and other complex pattern-matching applications. Algorithms such as sparse matrix-vector multiplication will also perform well.
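To illustrate what "unstructured but massively parallel" means for the last example, here is a minimal sparse matrix-vector product in Python. It is a generic sketch that assumes the common compressed sparse row (CSR) storage format, not XMT-specific code, and the function and variable names are illustrative. The indirect access `x[col_idx[k]]` is the kind of irregular memory reference that caches handle poorly, while the rows are independent of each other and could each be given to a separate thread.

```python
# Sparse matrix-vector product y = A * x with A in CSR format.
# Generic illustration; not code for the XMT itself.

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x.

    values  : non-zero entries, stored row by row
    col_idx : column index of each entry in values
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):          # rows are independent: parallel work
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # irregular, indirect access
        y[row] = acc
    return y


if __name__ == "__main__":
    # The 3 x 3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```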

Measured Performances:

As yet, no independent performance results are available for this new machine to prove the value of this interesting architecture.