|
The Itanium 2 is the second generation of Intel's IA-64 64-bit processor
family. The first Itanium processor came out in 2001 but did not spread widely,
primarily because the Itanium 2 was to follow quickly with projected
performance levels up to twice that of the first Itanium. The first Itanium 2
implementation ran at 0.8–1.0 GHz and was soon followed by a technology shrink
(code name Madison) to an Itanium 2 with clock frequencies in the range
1.3–1.6 GHz. At the time of writing this report the next generation has been
available for about a year. Its processor core is almost unaltered with respect
to the Madison processor, but it is now built with a 90 nm feature size instead
of 130 nm, and two cores are placed on a chip running at a clock frequency of
at most 1.66 GHz. This present processor, code name Montecito, is thus a
dual-core processor, like most other processors these days.
Another major change, not immediately obvious from the block diagram in
Figure 10, is that it is dual-threaded, albeit not as fine-grained as the IBM
POWER5+. Because of the technology shrink the power requirement, slightly over
100 W, is lower than that of the Madison processor even though there are two
processor cores on the chip. The successor of the Montecito processor,
code-named Montvale, is due in a few months. The differences with the Montecito
are small: instead of a 667 MHz frontside bus an 800 MHz frontside bus might be
offered, and the clock frequency may go up a bit to 1.7–1.8 GHz. The main
structure of the processor will remain the same, however; it will change only
with the next generation, the Tukwila processor, which will be connected to the
memory by the Common System Interface (CSI) at much higher speeds than with a
frontside bus.
Figure 10: Block diagram of an Intel Itanium 2 core.
Figure 10 shows a large number of functional units that must be kept busy. This is done with large instruction words of 128 bits that contain three 41-bit instructions and a 5-bit template that aids in steering and decoding the instructions. This idea is inherited from the Very Long Instruction Word (VLIW) machines that were on the market for some time about ten years ago. The two load/store units fetch two instruction words per cycle, so six instructions per cycle are dispatched. The Itanium also has in common with these systems that the scheduling of instructions, unlike in RISC processors, is not done dynamically at run time but by the compiler. The VLIW-like operation is enhanced with predicated execution, which makes it possible to execute instructions in parallel that normally would have to wait for the result of a branch test. Intel calls this refreshed VLIW mode of operation EPIC, Explicit Parallel Instruction Computing.
Furthermore, load instructions can be moved ahead of a branch or a store, and the loaded variable used, by placing a test at the location the load originally came from to check whether the operations were valid. These advanced loads are recorded in an Advanced Load Address Table (ALAT), of which there are two. When the validity of an operation that depends on an advanced load is checked, the ALATs are searched; when no entry is present, the operation chain leading to the check is invalidated and the appropriate fix-up code is executed. Note that this code is generated at compile time, so no control speculation hardware is needed for this kind of speculative execution. Such hardware would become exceedingly complex for the many functional units that may be in operation simultaneously.
As can be seen from Figure 10 there are four floating-point units capable of performing Fused Multiply Accumulate (FMAC) operations.
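As an aside on the instruction format described above, the following minimal sketch packs and unpacks a 128-bit instruction word. Only the field widths (three 41-bit instructions plus a 5-bit template) come from the text; placing the template in the five least significant bits, and the names `pack_bundle` and `unpack_bundle`, are assumptions made for illustration.

```python
TEMPLATE_BITS = 5            # width of the template field (from the text)
SLOT_BITS = 41               # width of one instruction slot (from the text)
SLOTS_PER_BUNDLE = 3
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer.

    Assumed layout (illustrative): template in the lowest 5 bits,
    followed by slot 0, slot 1 and slot 2 in ascending bit order.
    """
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = word & TEMPLATE_MASK
    slots = tuple(
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


# The three slots and the template together fill the word exactly.
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS
```

With two such words fetched per cycle, 2 × 3 = 6 instructions are available for dispatch, which is the figure quoted above.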
Of these four floating-point units, however, only two work at the full 82-bit precision that is the internal standard on Itanium processors, while the other two can only be used for 32-bit precision operations. When working in the customary 64-bit precision the Itanium has a theoretical peak performance of 6 Gflop/s at a clock frequency of 1.5 GHz. Using 32-bit floating-point arithmetic, the peak is doubled.
In the first-generation Itanium there were 4 integer units for integer arithmetic and other integer or character manipulations. Because the integer performance of that processor was modest, 2 integer units have been added to improve it. In addition, four MMX units are present to accommodate instructions for multimedia operations, an inheritance from the Intel Pentium processor family. For compatibility with this Pentium family there is a special IA-32 decode and control unit.
The register files for integers and floating-point numbers are large: 128 registers each. However, only the first 32 entries of these register files are fixed, while entries 33–128 are implemented as a register stack.
The primary data and instruction caches are 4-way set-associative and rather small: 16 KB each. This is the same as in earlier Itanium processors. The L1 cache runs at full speed: data and instructions can be delivered to the registers every clock cycle. An enhancement with respect to the Madison processor is that the L2 cache has been split: instead of a unified secondary cache of 256 KB, the data cache alone now has that size, while an L2 instruction cache of 1 MB has been added. The code for VLIW/EPIC processors tends to be larger than equivalent RISC code by a factor of 2–3, so this enhancement was welcome, also because the processor is now dual-threaded and the instruction cache may therefore contain instructions from both threads. Both L2 caches are 8-way set-associative. Floating-point data are loaded directly from the L2 cache into the registers. Moreover, the L3 cache resides on the chip and is no less than 12 MB/core.
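The peak floating-point figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that one FMAC counts as two floating-point operations (a multiply and an add) per cycle, and that 64-bit work uses only the two full-precision units while 32-bit work can use all four. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(fmac_units, clock_ghz, flops_per_fmac=2):
    """Theoretical peak in Gflop/s: units x flops per FMAC x clock in GHz.

    Assumption: every unit completes one fused multiply-accumulate per
    cycle, counted as two floating-point operations.
    """
    return fmac_units * flops_per_fmac * clock_ghz


# 64-bit precision: only the two full-precision units contribute.
peak_64bit = peak_gflops(fmac_units=2, clock_ghz=1.5)

# 32-bit precision: all four units contribute, doubling the peak.
peak_32bit = peak_gflops(fmac_units=4, clock_ghz=1.5)

print(peak_64bit, peak_32bit)  # 6.0 12.0
```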
The bus is 128 bits wide and operates at a clock frequency of 400 MHz or 667 MHz, giving a bandwidth of 6.4 or 10.6 GB/s, respectively. For the data-hungry Montecito the latter bandwidth is no luxury.
As remarked above, the Montecito is dual-threaded. When the control logic (located at the upper left in Figure 10) decides that no progress will be made with one thread, it dispatches the other thread to minimise the idle stages in the instructions that are executed. This will most often happen with very irregular data access patterns, where it is impossible to load all relevant data into the caches beforehand. The switch between threads is based on an "urgency level" ranging from 0 to 7. When the urgency of the active thread falls below that of the inactive one, the latter becomes active, and vice versa.
Because two cores are now present on a chip, some provisions had to be added to let them cooperate without problems. The synchronisers in the cores feed their information about read and write requests and cache line validity to the arbiter (see Figure 11). The arbiter filters out the unnecessary requests and combines the snooping information from both cores before handing the requests over to the system interface. In addition, the arbiter ensures fair access of both cores to the system interface.
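The frontside-bus bandwidths given above follow directly from the bus width and clock frequency. The sketch below assumes one 128-bit transfer per quoted clock cycle. The function name `bus_bandwidth_gb_s` is hypothetical.

```python
def bus_bandwidth_gb_s(width_bits, clock_mhz):
    """Bandwidth in GB/s (10**9 bytes/s) for a bus of the given width.

    Assumption: one full-width transfer per quoted clock cycle.
    """
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz / 1000.0


bw_400 = bus_bandwidth_gb_s(128, 400)   # 16 bytes x 400 MHz
bw_667 = bus_bandwidth_gb_s(128, 667)   # 16 bytes x 667 MHz

print(f"{bw_400:.1f} GB/s, {bw_667:.3f} GB/s")
```

The second value comes out at 10.672 GB/s, which the text quotes as 10.6 GB/s.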
Figure 11: Block diagram of 2 processor cores on a Montecito chip.
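The urgency-based switching between the two hardware threads described earlier can be illustrated with a minimal sketch. Only the rule itself (levels 0 to 7, and a switch when the active thread's urgency falls below that of the inactive one) is taken from the text. How the hardware raises and lowers the levels is not described there and is not modelled here, and the function name `next_active_thread` is hypothetical.

```python
def next_active_thread(active, urgency):
    """Return the thread (0 or 1) that should run next.

    active  : index of the currently active thread, 0 or 1
    urgency : sequence of two urgency levels, each in the range 0..7

    The inactive thread takes over only when the active thread's
    urgency has dropped strictly below that of the inactive thread.
    """
    if active not in (0, 1):
        raise ValueError("a core has exactly two hardware threads")
    if len(urgency) != 2 or any(not 0 <= u <= 7 for u in urgency):
        raise ValueError("expected two urgency levels in the range 0..7")
    inactive = 1 - active
    if urgency[active] < urgency[inactive]:
        return inactive
    return active


# Thread 0 stalls, for instance on a cache miss, and its urgency drops.
print(next_active_thread(0, [2, 5]))  # 1: the other thread takes over
print(next_active_thread(0, [5, 5]))  # 0: equal urgency, no switch
```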
The introduction of the first Itanium was deferred time and again, which dampened interest in its use in high-performance systems. With the availability of the Itanium 2 in the second half of 2002, adoption sped up. Apart from HP, also Bull, Fujitsu, Hitachi, NEC, SGI, and Unisys now offer multiprocessor systems with this processor, replacing the Alpha, PA-RISC, SPARC, and MIPS processors that were employed by HP, Fujitsu, and SGI. |