An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted.
The architecture for the Intel 48-core Single Chip Cloud Computer (SCC) processor is “arbitrarily scalable,” said Intel researcher Timothy Mattson, during a talk at the SC10 (Supercomputing 2010) conference being held this week in New Orleans.
“This is an architecture that could, in principle, scale to 1,000 cores,” he said. “I can just keep adding, adding, adding cores.”
Only after 1,000 cores or so will the diameter of the mesh, the on-chip network connecting the many cores, grow to such an extent that it negatively affects performance, Mattson said.
Intel remains adamant that the future progress of microprocessors will depend on packing ever more cores onto a chip. As more cores are added, however, Intel designers must confront the problem of scalability.
Initial multicore chip architectures depended on a set of protocols that ensures each core has the same view of the system’s memory, a technique called cache coherency.
As more cores are added to chips, this approach becomes problematic insofar as “the protocol overhead per core grows with the number of cores, leading to a ‘coherency wall’ beyond which the overhead exceeds the value of adding cores,” the paper accompanying Mattson’s talk noted.
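The “coherency wall” can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes, purely for illustration, that each core loses a fixed fraction of its time to coherency traffic for every core on the chip. The function names and the 1 percent figure are hypothetical.

```python
def useful_throughput(cores, overhead_per_core=0.01):
    """Total useful work, in units of one overhead-free core.

    Assumption (not from the paper): the fraction of time each core
    spends on coherency traffic grows linearly with the core count.
    """
    overhead = min(1.0, overhead_per_core * cores)
    return cores * (1.0 - overhead)


def coherency_wall(max_cores=200, overhead_per_core=0.01):
    """Core count at which useful throughput peaks under the toy model."""
    return max(
        range(1, max_cores + 1),
        key=lambda n: useful_throughput(n, overhead_per_core),
    )


if __name__ == "__main__":
    wall = coherency_wall()
    print("throughput peaks at", wall, "cores")
    print("throughput at twice that:", useful_throughput(2 * wall))
```

Under these made-up numbers, throughput rises, peaks, and then falls as cores are added, which is the shape of the argument the paper makes. The real overhead curve depends on the coherency protocol and the workload.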
Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.
The recent work of the design team has centered on developing message-passing techniques for the chip that would scale as more cores are added.
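As a rough illustration of the message-passing style, the sketch below uses Python threads to stand in for cores, each with a private inbox. It is not the SCC’s actual mechanism or Intel’s API. The names `run_cores`, `send` and `recv` are hypothetical, and Python threads share memory underneath, so this shows only the programming model and not the hardware.

```python
import queue
import threading


def run_cores(num_cores, data):
    """Sum `data` using worker threads that communicate only by messages.

    Requires num_cores >= 2: core 0 coordinates, the rest do the work.
    """
    # One private inbox per simulated core; no core reads another's data.
    inboxes = [queue.Queue() for _ in range(num_cores)]

    def send(dest, message):
        inboxes[dest].put(message)

    def recv(me):
        return inboxes[me].get()  # blocks until a message arrives

    def worker(me):
        chunk = recv(me)      # wait for work from core 0
        send(0, sum(chunk))   # reply with a partial result

    threads = [
        threading.Thread(target=worker, args=(i,))
        for i in range(1, num_cores)
    ]
    for t in threads:
        t.start()

    # Core 0 scatters interleaved slices of the data to the workers.
    workers = num_cores - 1
    for i in range(1, num_cores):
        send(i, data[i - 1::workers])

    # Core 0 gathers one partial sum from each worker.
    total = sum(recv(0) for _ in range(workers))

    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_cores(4, list(range(100))))
```

Each worker sees only the messages addressed to it, so no protocol is needed to keep the cores’ views of memory in agreement. The cost is that the programmer must move the data explicitly.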
Designed by Intel’s TeraScale Research Program over the past several years, the chip itself is an experimental one and is not on the Intel product road map, Mattson said. A limited number of copies have been distributed to researchers and developers so they can build development tools for the design.
The chip, first fabricated with a 45-nanometer process at Intel facilities about a year ago, is actually a six-by-four array of tiles, each tile containing two cores. It has more than 1.3 billion transistors and consumes from 25 to 125 watts.
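The tile arithmetic, and the mesh diameter Mattson cited as the eventual limit on scaling, can be made concrete with a short calculation. This sketch assumes a plain two-dimensional mesh with no wraparound links, so the diameter is the hop count between opposite corners. The 25-by-20 layout for a 1,000-core chip is a hypothetical example, not an Intel design.

```python
def cores_in_mesh(rows, cols, cores_per_tile=2):
    """Total cores in a rows-by-cols array of tiles."""
    return rows * cols * cores_per_tile


def mesh_diameter(rows, cols):
    """Longest shortest path, in hops, between two tiles of a 2D mesh.

    Assumes no wraparound links: the worst case is corner to corner.
    """
    return (rows - 1) + (cols - 1)


if __name__ == "__main__":
    # The SCC as described: a six-by-four array of two-core tiles.
    print(cores_in_mesh(6, 4), mesh_diameter(6, 4))      # 48 cores, 8 hops

    # Hypothetical 1,000-core layout: 500 tiles arranged 25 by 20.
    print(cores_in_mesh(25, 20), mesh_diameter(25, 20))  # 1000 cores, 43 hops
```

In this model the core count grows with the area of the array while the worst-case hop count grows with its side lengths, so a chip can gain many cores before cross-chip latency becomes the limiting factor.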
For simplicity’s sake, the team used an off-the-shelf 1994-era Pentium processor design for the cores themselves. “Performance on this chip is not interesting,” Mattson said. The cores use a standard x86 instruction set.
The novelty of this processor is in its tiled architecture and the network and address infrastructure. Each core has a “mesh interface component” that packages data into packets and connects to an on-board router. Each tile also has a “message-passing buffer,” with 16 kilobytes of random access memory.
The team has tried various approaches to streamline the ability of the processor to pass messages among the many cores.
By running the TCP/IP protocol over the chip’s data link layer, the team was able to run a separate Linux-based operating system on each core. Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it “would be boring.”
“To make this interesting, I would have to ask, how would the programming models map onto the unique features of this chip,” he said.
The team also developed a small API (application programming interface) library for message passing among the cores, called RCCE, which Mattson pronounced “Rocky.”
In tests, the team showed that message passing among the cores could be just as speedy using RCCE as with a TCP/IP-based Linux cluster.