06 September 2009

IBM's 8-core POWER7 Crams an Amazing Amount of Hardware

IBM's Hot Chips presentation on its 45nm POWER7 server processor had a wealth of information on the chip, which, at 1.2 billion transistors and 567mm², is actually quite slim given what it offers. The secret is a special cache technology that IBM has been touting since 2007, but more on that in a moment.

POWER7 comes in 4-, 6-, and 8-core varieties; the standard part is probably the 8-core, with the lower core counts offered to improve yields. Each core has 4-way simultaneous multithreading (SMT), which means that the 8-core part supports a total of 32 simultaneous threads per socket. POWER7 is designed for multi-socket systems that scale up to 32 sockets, which means that a fully populated 32-socket system built from 8-core parts would support 1,024 threads.
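The thread counts above are simple multiplication, but it is worth spelling out for each core-count variant. The sketch below uses only the figures quoted in this post; the constant and variable names are my own and purely illustrative.

```python
# Figures from the article: 4-way SMT per core, up to 32 sockets per system.
SMT_THREADS_PER_CORE = 4
MAX_SOCKETS = 32

def threads_per_socket(cores):
    """Hardware threads exposed by one socket with the given core count."""
    return cores * SMT_THREADS_PER_CORE

# The three core-count varieties mentioned in the article.
for cores in (4, 6, 8):
    per_socket = threads_per_socket(cores)
    per_system = per_socket * MAX_SOCKETS
    print(f"{cores}-core: {per_socket} threads/socket, {per_system} threads in a 32-socket system")

full_system_threads = threads_per_socket(8) * MAX_SOCKETS  # 1024
```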

Feeding eight cores on one socket is a challenge, which is why each POWER7 has a pair of four-channel DDR3 controllers that support up to 100 GB/s of memory bandwidth. Also helping the situation is no less than 32MB of on-die L3 cache. IBM managed to fit that much cache by using a special embedded DRAM (eDRAM) design that cuts the transistor cost of the large cache pool by about half.
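To get a feel for how much of those resources each core sees, here is a back-of-the-envelope split. It assumes the bandwidth and cache are divided evenly across cores and threads, which is my simplification and not something IBM stated; real sharing depends on the workload.

```python
# Figures from the article; the even split below is an assumption.
PEAK_BANDWIDTH_GB_S = 100.0   # "up to 100 GB/s"
L3_CACHE_MB = 32
CORES = 8
THREADS_PER_CORE = 4

bandwidth_per_core = PEAK_BANDWIDTH_GB_S / CORES               # GB/s per core
bandwidth_per_thread = bandwidth_per_core / THREADS_PER_CORE   # GB/s per hardware thread
l3_per_core = L3_CACHE_MB / CORES                              # MB of L3 per core

print(f"{bandwidth_per_core} GB/s per core")
print(f"{bandwidth_per_thread} GB/s per thread")
print(f"{l3_per_core} MB of L3 per core")
```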

To see how much this eDRAM cache scheme saves in transistors, compare the 8-core, 32MB-cache POWER7 at 1.2 billion transistors with Intel's 2-billion-transistor, 4-core, 30MB-cache "Tukwila" Itanium. Naturally, POWER7's eDRAM is almost certainly a bit slower than Tukwila's SRAM, but in today's power-sensitive age that level of transistor savings is impressive. Also consider how POWER7 stacks up against the eight-core Nehalem EX, which has 24MB of cache and weighs in at more than 2.2 billion transistors. Once again, IBM does more with less.
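The comparison is easier to see as ratios. The sketch below divides each chip's total transistor count by its core count and by its cache size, using only the numbers quoted above. These are crude whole-chip ratios (the totals include cores, interconnect, and I/O, not just cache), and the Nehalem EX figure is a lower bound because the article says "more than 2.2 billion".

```python
# Transistor counts in millions, as quoted in the article.
chips = {
    "POWER7":     {"transistors_m": 1200, "cores": 8, "cache_mb": 32},
    "Tukwila":    {"transistors_m": 2000, "cores": 4, "cache_mb": 30},
    "Nehalem EX": {"transistors_m": 2200, "cores": 8, "cache_mb": 24},  # lower bound
}

def millions_per(chip, key):
    """Whole-chip transistors (in millions) divided by cores or by MB of cache."""
    return chip["transistors_m"] / chip[key]

for name, chip in chips.items():
    per_core = millions_per(chip, "cores")
    per_mb = millions_per(chip, "cache_mb")
    print(f"{name:<11}{per_core:>6.0f} M/core{per_mb:>7.1f} M per MB of cache")
```

By this rough measure POWER7 comes in at 150 million transistors per core, against 500 million for Tukwila and at least 275 million for Nehalem EX.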

Note that the four-way SMT design is another thing that helps with the problem of feeding all that hardware, since it acts as a latency-hiding mechanism for each core's back end. If one thread stalls waiting on memory, the core can (ideally) find instructions from another thread to feed the execution units and keep them busy. This bandwidth issue is probably one of the reasons behind IBM's decision to go with such a high level of SMT.
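A toy model shows why more threads help hide memory latency. Suppose each thread alternates a burst of useful work with a stall on memory, and the core can always switch to any thread that is ready. The cycle counts below are invented for illustration and are not POWER7 measurements; the model also ignores cache contention and the extra bandwidth that more threads demand.

```python
def utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the execution units stay busy in a toy model where
    every thread repeats: compute_cycles of work, then stall_cycles of waiting.
    One thread alone is busy compute/(compute+stall) of the time; extra threads
    fill the idle cycles until the units saturate at 100%."""
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)

# Hypothetical workload: 20 cycles of work followed by a 60-cycle memory stall.
for n in (1, 2, 4):
    print(f"{n} thread(s): {utilization(n, compute_cycles=20, stall_cycles=60):.0%} busy")
```

With these made-up numbers a single thread keeps the units busy a quarter of the time, and it takes four threads to cover the stalls completely.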

Speaking of the POWER7 core's back end, each core includes a solid set of execution resources. There are 12 execution units in total, as follows:
  • 2 integer units
  • 2 load-store units
  • 4 double-precision floating-point units
  • 1 branch unit
  • 1 condition register unit
  • 1 vector unit
  • 1 decimal floating-point unit
Those of you who have read my past articles or my book on microprocessors will know what most of the above units do, with the possible exceptions of the PowerPC-specific condition register unit (which was also present in the 970) and the decimal floating-point unit, which handles math functions commonly found in mainframe workloads.

My only comment on the above is that four double-precision floating-point units add up to plenty of floating-point horsepower. This makes streaming memory bandwidth crucial to POWER7's FP performance, so it is good that there is plenty of it.
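As a quick sanity check on the unit list, the per-core counts can be tallied and scaled up to a full 8-core chip. The dictionary keys are just my labels for the units listed above.

```python
# Per-core execution units, as listed in the article.
execution_units = {
    "integer": 2,
    "load-store": 2,
    "double-precision floating-point": 4,
    "branch": 1,
    "condition register": 1,
    "vector": 1,
    "decimal floating-point": 1,
}
CORES_PER_CHIP = 8

units_per_core = sum(execution_units.values())
dp_fp_units_per_chip = execution_units["double-precision floating-point"] * CORES_PER_CHIP

print(units_per_core)        # 12
print(dp_fp_units_per_chip)  # 32
```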

I have heard that POWER7 retains the "dispatch group" scheme that has been part of the POWER line since the POWER4 days. This scheme reduces the amount of bookkeeping logic needed to track in-flight instructions by dispatching and tracking instructions in bundles. On the POWER4 and 970, instructions were dispatched to the back end's instruction queues in groups of five, but the groups have now been widened to six slots.
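The bookkeeping savings from tracking instructions in bundles can be sketched in a few lines. This is a deliberately simplified model: it packs instructions into groups in program order and assumes every group is full, whereas real group formation has slot restrictions that often leave slots empty. The count of 120 in-flight instructions is an arbitrary number chosen for the example.

```python
import math

def form_groups(instructions, group_size):
    """Pack instructions into consecutive groups of at most group_size slots."""
    return [instructions[i:i + group_size]
            for i in range(0, len(instructions), group_size)]

def tracking_entries(in_flight, group_size):
    """Entries needed when the core tracks whole groups rather than single
    instructions, assuming every group is completely filled."""
    return math.ceil(in_flight / group_size)

IN_FLIGHT = 120  # hypothetical number of in-flight instructions

print(tracking_entries(IN_FLIGHT, 1))  # 120 entries if tracked one by one
print(tracking_entries(IN_FLIGHT, 5))  # 24 entries with five-slot groups
print(tracking_entries(IN_FLIGHT, 6))  # 20 entries with six-slot groups

groups = form_groups(list(range(13)), 6)
print([len(g) for g in groups])        # [6, 6, 1]
```

The trade-off is granularity: the core tracks less state, but it has to manage each group as a unit.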

Overall, IBM has a very impressive 32-thread chip with a ton of execution hardware and a large amount of cache and memory bandwidth, and it has done this with half the transistors of the competition. This is quite an achievement, and it reaffirms that IBM remains strong in the very lucrative mainframe market.
