System parameters:
Remarks: Each processor board of an X1E contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 4.5 Gflop/s per CPU. However, 4 CPUs can be coupled across CPU boards to form a so-called Multi Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 18 Gflop/s. The system can be reconfigured dynamically into MSPs and/or single CPUs (Single Streaming Processors, SSPs) as the workload dictates. The vector start-up time of a single CPU is smaller than that of an MSP, so single CPUs might be preferable for small vectors, while MSPs should be advantageous for programs containing long vectors. MSP mode is regarded as the standard mode of operation. This is also visible in the processor counts given in Cray's data sheets: the maximum number within one cabinet is 32 MSPs for the air-cooled model and 128 MSPs for the liquid-cooled variant. The present Cray optimisation documentation states that the Cray Programming Environment is not yet optimised for SSP processing, which is not to say that suitable programs would not run efficiently in SSP mode.
The logical hardware structure of the Cray X1E is largely identical to that of the former Cray X1 (see "Systems disappeared from the list"). However, technology improvements made it possible to double the density by placing two MSPs on one multi-chip module (MCM). Four of these MCMs are located in one X1E Compute Module, which comprises two (logical) 4-MSP nodes. The 8 MSPs are connected to routing logic in the Compute Module, which contains 32 network ports to connect it to other Compute Modules. The clock frequency of the processors was raised from 800 MHz in the X1 to 1.125 GHz in the X1E, and both the amount of memory and the bandwidth to the processors were increased.
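The peak figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It uses only numbers stated in the text (clock frequency, operations per cycle, CPUs per MSP, MSPs per MCM, MCMs per Compute Module, MSPs per cabinet). The function and constant names are hypothetical, chosen for illustration, and the per-cabinet peak rates are derived here rather than quoted from Cray.

```python
# Peak-performance arithmetic for the Cray X1E, using only figures quoted in the text.
# All names below are illustrative and not part of any Cray tool.
CLOCK_GHZ = 1.125        # X1E clock frequency in GHz
FLOPS_PER_CYCLE = 4      # peak floating-point operations per cycle per CPU
CPUS_PER_MSP = 4         # CPUs coupled into one Multi Streaming Processor
MSPS_PER_MCM = 2         # MSPs on one multi-chip module
MCMS_PER_MODULE = 4      # MCMs in one X1E Compute Module
MSPS_PER_NODE = 4        # MSPs in one (logical) node


def peak_gflops_per_cpu(clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak rate of one CPU: clock frequency (GHz) times operations per cycle."""
    return clock_ghz * FLOPS_PER_CYCLE


def peak_gflops_per_msp(clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak rate of one MSP: four coupled CPUs."""
    return CPUS_PER_MSP * peak_gflops_per_cpu(clock_ghz)


def msps_per_module() -> int:
    """Number of MSPs in one Compute Module."""
    return MSPS_PER_MCM * MCMS_PER_MODULE


def cabinet_peak_gflops(n_msp: int) -> float:
    """Derived (not quoted) peak rate of a cabinet holding n_msp MSPs."""
    return n_msp * peak_gflops_per_msp()


if __name__ == "__main__":
    print("Gflop/s per CPU:", peak_gflops_per_cpu())
    print("Gflop/s per MSP:", peak_gflops_per_msp())
    print("MSPs per Compute Module:", msps_per_module())
    print("Nodes per Compute Module:", msps_per_module() // MSPS_PER_NODE)
    print("Air-cooled cabinet (32 MSPs), Gflop/s:", cabinet_peak_gflops(32))
    print("Liquid-cooled cabinet (128 MSPs), Gflop/s:", cabinet_peak_gflops(128))
```

The module count works out to 8 MSPs, that is, the two 4-MSP nodes mentioned above.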
Compared with its predecessor, the X1, the relative bandwidth has improved both from memory to the CPU boards and from the cache to the CPUs: from memory to the CPU board 5.3 8-byte operands can be transferred. From the cache the peak bandwidth to the CPUs is 12 8-byte operands, enough to sustain dyadic operations. The cache structure is rather complex: each of the 4 SSPs on a board has its own 16 KB, 2-way set-associative L1 data and instruction cache. The L1 data cache only stores scalar data. The L2 cache is 2 MB in size and is shared by the SSPs on the CPU board.
New features that are less visible to the user are adherence to the IEEE 754
floating-point standard arithmetic and a new vector instruction set that can
make better use of the new features like caches and addressability and
synchronisation of remote nodes. This is because every cabinet can be regarded
as a node in a cluster of which a maximum of 64 can be configured in what is
called a modified 2-D torus topology. Cray itself regards a board with 4 MSPs
as a "node". Each node has two connections to the outside world. Odd and even
nodes are connected in pairs and the other connection from the board is
connected via a switch to the other boards, so that at most two hops are
required to reach any other MSP in the cabinet. The aggregate bandwidth in such
a fully populated cabinet is 400 GB/s. Multi-cabinet configurations extend the
2-D torus into a 3-D torus structure, much like the late Cray T3E. Latency and
bandwidth data for point-to-point communication are not provided, but various
measurements in an MPI environment have been done; see the Measured
Performances below.
Measured Performances:
In [49] a speed of 15,706 Gflop/s is reported for solving a linear system of
order 494,592 on a 1020-MSP machine at the Korea Meteorological
Administration. This amounts to an efficiency of 85%.
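The efficiency figure can be checked against the theoretical peak of 18 Gflop/s per MSP given in the Remarks. The sketch below is illustrative only; the function names are hypothetical and the only inputs are the numbers quoted in this section.

```python
# Efficiency = measured speed / theoretical peak of the whole machine.
# Names are illustrative; the inputs are the figures quoted in the text.
PEAK_PER_MSP_GFLOPS = 18.0   # theoretical peak per MSP (from the Remarks)


def theoretical_peak_gflops(n_msp: int) -> float:
    """Aggregate theoretical peak of a machine with n_msp MSPs."""
    return n_msp * PEAK_PER_MSP_GFLOPS


def efficiency(measured_gflops: float, n_msp: int) -> float:
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / theoretical_peak_gflops(n_msp)


if __name__ == "__main__":
    n_msp = 1020
    measured = 15706.0
    print("Theoretical peak (Gflop/s):", theoretical_peak_gflops(n_msp))
    print("Efficiency: {:.1%}".format(efficiency(measured, n_msp)))
```

With 1020 MSPs the theoretical peak is 18,360 Gflop/s, so the measured 15,706 Gflop/s corresponds to roughly 85.5% of peak, in line with the 85% stated above.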