[I wrote this last year for something else - maybe sort of interesting given there's been news recently about Intel's Larrabee]
Thread-Level Parallelism and the UltraSPARC T1/T2
For a long time, increases in CPU performance were achieved by maximising sequential performance. This was done by increasing the instructions issued per cycle, using ever deeper, multi-issue pipelines and ever more speculative execution, thus extracting ever more instruction-level parallelism (ILP) from blocks of code, at ever higher clock rates. Processors such as the DEC Alpha 21264, MIPS R12000 and Intel Pentium 4 were the result. As an example, the Pentium 4 Extreme Edition featured a 4-issue, out-of-order, 31-stage pipeline and clocked at up to 3.2GHz, consuming 130W in the process.
Unfortunately, though this approach served the processor industry well throughout the 1990s and most of the 2000s, its doom had already been foretold. David Wall showed empirically that there were practical limits to the amount of ILP available. These limits were surprisingly low for many typical codes: between 4 and 10. Others observed that CPU compute power was growing so much faster than memory access times were improving that memory would increasingly govern system performance, a problem dubbed the “memory wall”. This meant all the gains made by complex, ILP-extracting logic would eventually be reduced to nothing, relative to the time spent waiting on memory.
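The memory-wall argument can be put into rough numbers with a toy model. The sketch below is only illustrative: the function names, the miss rate of 0.01 per instruction and the miss penalties are all assumptions chosen for the example, not measurements of any real processor. It treats every cache miss as fully blocking and shows how the benefit of a 4x wider core shrinks as the miss penalty grows.

```python
def cycles_per_instruction(ipc, misses_per_instr, miss_penalty_cycles):
    """Average cycles per instruction when every miss stalls the core.

    ipc: instructions per cycle the core sustains when it never misses.
    misses_per_instr: assumed fraction of instructions that miss in cache.
    miss_penalty_cycles: assumed cycles lost waiting on memory per miss.
    """
    return 1.0 / ipc + misses_per_instr * miss_penalty_cycles


def speedup_from_wider_core(ipc_old, ipc_new, misses_per_instr, miss_penalty_cycles):
    """Speedup from raising IPC while the memory behaviour stays the same."""
    old = cycles_per_instruction(ipc_old, misses_per_instr, miss_penalty_cycles)
    new = cycles_per_instruction(ipc_new, misses_per_instr, miss_penalty_cycles)
    return old / new


if __name__ == "__main__":
    # Hypothetical workload: 1 miss per 100 instructions, growing miss penalty.
    for penalty in (10, 100, 400):
        s = speedup_from_wider_core(1.0, 4.0, 0.01, penalty)
        print(f"miss penalty {penalty:>3} cycles: 4x wider core gives {s:.2f}x")
```

In this model the memory-stall term is untouched by extra ILP hardware, so as the penalty rises the speedup from a wider core tends towards 1.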
Faced with these apparent hard limits on the gains to be had from ILP, researchers investigated other levels of parallelism. Tullsen et al. argued for thread-level parallelism (TLP): extracting parallelism by issuing instructions to multiple functional units from multiple threads of execution. This is sometimes called “Chip Multi-Threading” or CMT. Olukotun et al., of the Computer Systems Laboratory at Stanford (MIPS’ birthplace), argued that multiple, simpler processor cores would be a more efficient use of a chip’s transistor budget than a single, complex, super-scalar core. This is sometimes called “Chip Multi-Processing” or CMP.
Intel added CMT to their P4 CPUs, calling it “Hyper-Threading”. While it improved performance on many workloads, it was known to reduce performance on single-threaded numerical workloads. Intel’s Core/Core 2 Duo range is not CMT, but is CMP (being multi-core).
A third, more practical factor also came into play: energy usage. If an ILP-maximising processor is less efficient than a TLP-maximising processor, then the latter can potentially save a significant amount of energy. If the memory wall means complex ILP logic will be to no avail, then ditching that logic should save power without affecting performance (in the aggregate).
In 2006, Sun released a brand new implementation of UltraSPARC, the T1, codenamed “Niagara”. Like several other server-orientated processors, the UltraSPARC T1 was both CMP and CMT, having up to 8 cores, each core handling 4 threads. The Niagara differed quite radically from those other offerings in that it was designed specifically with the “memory wall” in mind, as well as power consumption. To this end, rather than using complex, super-scalar processor cores, the T1 uses very simple ones. Each core is single-issue, in-order, non-speculative and just 6 stages deep. The pipeline is the classic 5-stage RISC pipeline, with a “Thread Select” stage added between IF and ID. Each core is therefore relatively slow compared to a traditional super-scalar design. The T1 clocks at 1 to 1.2GHz due to the short pipeline.
Rather than trying to mask memory latency with speculation, the T1 switches processing to another thread. So, when given a highly-multithreaded workload, the T1 can make very efficient use of both its compute and memory resources by amortising the memory-wait time for any one thread across the execution time of a large number of other threads. Thread selection is the only speculative logic in the T1. The T1 is aimed at highly threaded, memory/IO bound workloads (e.g. the server half of client/server).
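The effect of switching threads on a memory stall can be sketched with a very small cycle-by-cycle model. This is not a model of the real T1 pipeline or its thread-select policy: the round-robin choice, the zero-cost switch, the function names and the numbers (1 cycle of work followed by a 3-cycle stall) are all assumptions made for illustration.

```python
def ideal_utilisation(num_threads, compute_cycles, stall_cycles):
    """Upper bound on issue-slot utilisation with free thread switching."""
    return min(1.0, num_threads * compute_cycles / (compute_cycles + stall_cycles))


def simulate_core(num_threads, compute_cycles, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which a single-issue core issues an instruction.

    Each hypothetical thread issues `compute_cycles` instructions, then waits
    `stall_cycles` cycles on memory. Each cycle the core picks the next ready
    thread in round-robin order, or idles if every thread is stalled.
    """
    ready_at = [0] * num_threads                # first cycle a thread may issue
    remaining = [compute_cycles] * num_threads  # instructions before next stall
    busy = 0
    last = -1                                   # thread that issued most recently
    for cycle in range(total_cycles):
        for i in range(1, num_threads + 1):
            t = (last + i) % num_threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Thread misses: it cannot issue again until memory returns.
                    ready_at[t] = cycle + 1 + stall_cycles
                    remaining[t] = compute_cycles
                last = t
                break
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        sim = simulate_core(n, compute_cycles=1, stall_cycles=3)
        bound = ideal_utilisation(n, compute_cycles=1, stall_cycles=3)
        print(f"{n} thread(s): simulated {sim:.2f}, ideal bound {bound:.2f}")
```

With a single thread the core sits idle for most cycles. Once there are enough threads to cover the stall time, the issue slot stays busy without any speculation, which is the trade the T1 makes.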
This approach has additional benefits. Less speculation means fewer wasted cycles. Rather than spending design time on complex, stateful logic to handle speculation and out-of-order issue, the designers can instead concentrate on optimising the chip’s power usage. The many-core approach also physically distributes heat better. The T1 uses just 72W, compared to the 130W of the P4EE, despite the T1’s greater transistor count, and delivers much higher peak performance/Watt on server workloads.
The T1 has just one FPU, shared by all cores. This is a weakness for which the T1 has been much criticised. Indiscriminate use of floating-point by programmers, particularly less sophisticated ones, is not unheard of in target applications like web applications (PHP, etc.). The T1 may not perform at all well in such situations. Each T1 core also includes a ‘Modular Arithmetic Unit’ (MAU) to accelerate cryptographic operations.
In late 2007, Sun released the UltraSPARC T2. This CPU is very similar to the T1. Changes include a doubling of the number of threads per core to 8; an additional execution unit per core and 2 issues per cycle; an 8-stage pipeline; an FPU per core (fixing the major sore point of the T1 noted above); and the ability to do SMP. This allows for at least 32 cores and 256 threads in a single server (4-way SMP). A single-CPU T2 system soon held the SPECweb world record.
Intel have gone down a somewhat similar path with their Atom processor, a very simple x86 CPU with CMT, though there are no CMP versions of it (yet). Intel are also working on a massively-CMP x86 CPU, called “Larrabee”, using simple, in-order cores. This will have 16 or more cores initially. It is to be aimed at the graphics market, rather than as a general-purpose CPU. Cavium offer the ‘OCTEON’, a MIPS64-based CMP with up to 16 cores, targeted at networking/communications applications.
To date, the only simple-core, massively-CMP/CMT processors available in general-purpose systems are the UltraSPARC T1 and T2. The simple-core, massively-CMP/CMT approach they use is undoubtedly the future, given the long-standing research results and more recent practical experience. It may take longer for such chips to become prevalent as desktop CPUs though, given the need to continue to expend transistors on maintaining high sequential performance for existing, commonly minimally-threaded desktop software.
1. “Limits of Instruction-Level Parallelism”, Wall, DEC WRL research report, 1993.
2. “Hitting the Memory Wall: Implications of the Obvious”, Wulf, McKee, Computer Architecture News, 1995.
3. “The Case for a Single-Chip Multiprocessor”, Olukotun, Nayfeh, Hammond, Wilson, Chang, IEEE Computer, 1996.
4. “Performance/Watt: The New Server Focus”, Laudon, 1st Workshop on Design, Architecture and Simulation of CMP (dasCMP), 2005.
5. “Niagara: a 32-way multithreaded Sparc processor”, Kongetira, Aingaran, Olukotun, IEEE Micro, 2005.
6. “A Power-Efficient High-Throughput 32-Thread SPARC Processor”, Leon, Tam, Weisner, Schumacher, IEEE Solid-State Circuits, 2007.
8. “Simultaneous Multithreading: Maximizing On-Chip Parallelism”, Tullsen, Eggers, Levy, 22nd Annual International Symposium on Computer Architecture, 1995.
9. “Larrabee: a many-core x86 architecture for visual computing”, Seiler et al., ACM Transactions on Graphics, Aug 2008.
11. “UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC”, Shah et al., A-SSCC, 2007.