16 cores may well be the limit?

NormCameron · Dec 8, 2008

"Analysis: more than 16 cores may well be pointless

One of the ongoing themes of my microprocessor coverage over the past few years has been the relationship between on-chip execution bandwidth and the "memory wall." So I was intrigued to learn of new research from Sandia National Labs that indicates that the severity of the memory wall problem may be much greater than the industry generally anticipates.
In a nutshell, the "memory wall" problem is pretty straightforward, and it's by no means new to the multicore era. The problem arises when the execution bandwidth (i.e., aggregate instructions per second, either per-thread or across multiple threads and programs) available in a single socket is constrained by the amount of memory bandwidth available to that socket. As execution bandwidth increases, either because clockspeeds get faster or because the die contains more cores, memory bandwidth has to increase in order to keep up.
To put this in simple multicore terms, cramming a ton of processor cores onto a single die does you no good if you can't keep those cores fed with code and data.
But memory bandwidth isn't keeping up. Memory bus bandwidth (latency and/or throughput) hasn't increased quickly enough in proportion to Moore's Law, a fact that leaves processors starving for bytes. In this respect, the "memory wall" is a classic producer/consumer problem, and it's the reason that on-die cache sizes have ballooned in recent years. As the memory wall gets higher and higher, it takes more and more cache to get you over it. At this point, it would be fair to say that most modern server processors are really high-speed memories with some processor core stuck on the die, instead of vice versa.
The memory wall is therefore an added barrier to the success of the many-core paradigm. I say "added," because the most famous barrier is the programming model. Massively multithreaded programming isn't just a "hard problem"—rather, it's a generation's worth of Ph.D. dissertations that have yet to be written.
The work from the Sandia team, at least as it's summarized in an IEEE Spectrum article that infuriatingly omits a link to the original research, seems to indicate that 8 cores is the point where the memory wall causes a fall-off in performance on certain types of science and engineering workloads (informatics, to be specific). At the 16-core mark, the performance is the same as it is for dual-core, and it drops off rapidly after that as you approach 64 cores.
The chart included in the report is striking, and I wish I had the appropriate background to interpret it. (Again, the lack of any link, DOI, report title, deck title, or other reference information is unbelievable.) Nonetheless, despite the lack of color from the source, I'm sure the many-core skeptics in the audience—and there are quite a few—will seize on it as further validation that the maximum worthwhile core count is well below 16.
It looks like Sandia is proposing that stacking memory chips on top of the processor is the solution to this bandwidth problem. If that is indeed their proposal, then they're in good company. Both Intel and IBM have touted advances in chip-stacking techniques, and Sun has published research in the area of high-bandwidth memory interconnects that involve placing dice edge-to-edge. But, to my knowledge, these die-stacking schemes are further from down the road than the production of a mass-market processor with greater than 16 cores."

Analysis: more than 16 cores may well be pointless

pacinitaly · Aug 5, 2009

good read

MDA400 · Aug 5, 2009

Intel said back in 2007, that Nehalem would be an octocore design and hyperthreaded so it would have a total of 16 threads (16 logical cores to the OS).

Now in 2009, Nehalem has 4 cores and 8 threads Hyperthreaded (8 logical processors to the OS), with the exception of the 32nm Gulftown having 6 cores hyperthreaded with 12 threads.

It is said that by the time we invent a microarchitecture of 20-10nm fabs, Either what you stated, NormCameron, will be correct and Moore will be as well.

Or the group of scientists that claimed Moore's law false, will change the perspective on how much is too much in a processor.

roy69 · Aug 6, 2009

I was to belive that some server boards had memory for each chip:huh:.
Not dealing with server boards I dont know how true this is.
I can see this going the blade way of having cores with there own dedicated memory with controller chips for each core.
16 cores with 2GB each allows for 32GB of ram.
Or would that be to simple?

spkay · Aug 6, 2009

My interpretation of the current responses to the memory wall problem are the introduction of the LGA1366 socket and the triple-channel DDR3 memory architecture supported by X58 and i7. My guess is by the time we get to 16 core CPU's we would naturally see the introduction of the LGA1760 and the quad-channel DDR5 on the X78 chipset. I.e. more pins, more parallel multi-channel memory streaming and of course bigger caches, possibly even and 4th level on-die cache of course that would be the i9 or whatever ;-)

Blackhawk7188 · Aug 6, 2009

From using a Dual CPU Xeon Quad Core everyday of the week. I can tell you that having 8 logical cores is useless too. I doesnt run any better than a dual core. The world keeps making faster a better processors that reach their limit based on memory and bandwidth.

MDA400 · Aug 7, 2009

Blackhawk7188 said:
From using a Dual CPU Xeon Quad Core everyday of the week. I can tell you that having 8 logical cores is useless too. I doesnt run any better than a dual core. The world keeps making faster a better processors that reach their limit based on memory and bandwidth.

Right now, its not the issue with memory bandwidth, but with the QPI/FSB bandwidth.

the QPI was a great leap forward for Intel because it morally balanced the bandwidth difference from the CPU to the memory, DDR3 generally speaking.

spkay · Aug 7, 2009

Blackhawk7188 said:
From using a Dual CPU Xeon Quad Core everyday of the week. I can tell you that having 8 logical cores is useless too. I doesnt run any better than a dual core. The world keeps making faster a better processors that reach their limit based on memory and bandwidth.

I run dual quad Xeon's all day at work as well, while it may not vastly improve performance over a fast dual core for compiles, searching code tree, etc. it certainly does make a very large difference in allowing me to run multiple VM sessions alongside my boot O/S and not bring the system to it's knees. For those who think an 8 way system is going to make everything 6-8x faster you are exactly right. For those who are trying to multitask better using virtualization and/or running large databases at the same time it does provide a very significant performance improvement over dual or even quad core systems.

But that leads directly into another topic that most software developers have recognized for nearly two decades. Developing applicaton software, service and device drivers that leverage the maximum amount of parallelism is still a very slow, error prone and painstaking development task. Even given some of the newer tools (Intel's Parallel Tools, etc.) trying to develop, test and debug software is still not all that far from where it was 10-15 years ago. I'm hoping that the next 5-10 years of progress in software development tools produces tools that allow the average skilled developer to produce software the extracts a much greater level of performance from multicore architectures while keeping the programming model simple enough that programmers can become highly productive at creating new, faster applications without getting bogged down in the details of parallel, multi-threaded programming. Maybe just a pipe dream ;-)

aerofanadam · Aug 27, 2009

perhaps I'm not tech savvy enough to know the ins and outs but I've always wondered if ram could be made fast enough so that it could be inserted between processors. Let's Say you have one 3.0 clocked processor dividing original logic task between two or four 3.0 processors that do the actual math in sections then send it to a 5.0 processor for execution. Would Ram even be needed between or would current Ram configurations work? Would that speed logic processing or slow it? Would your BIOS be a mini OS at that stage? AND what is the limit to processor downsizing? will we eventually calculate on the atomic level (.1-.5nm)? I know they're trying it and hope to do so eventually. Hell compared to where we were ten years ago, 45nm is DAMN close to atomic level processing.

Ojaser6 · Aug 28, 2009

aerofanadam said:
perhaps I'm not tech savvy enough to know the ins and outs but I've always wondered if ram could be made fast enough so that it could be inserted between processors. Let's Say you have one 3.0 clocked processor dividing original logic task between two or four 3.0 processors that do the actual math in sections then send it to a 5.0 processor for execution. Would Ram even be needed between or would current Ram configurations work? Would that speed logic processing or slow it? Would your BIOS be a mini OS at that stage? AND what is the limit to processor downsizing? will we eventually calculate on the atomic level (.1-.5nm)? I know they're trying it and hope to do so eventually. Hell compared to where we were ten years ago, 45nm is DAMN close to atomic level processing.

ALERT ALERT!

You have bumped a thread that is 2 weeks old! You should be ashamed of yourself!

(No, I'm not kidding... You shouldn't bump threads that are 2 weeks old.)

aerofanadam · Aug 28, 2009

my bad, I just thought I had some legit questions and hypotheses and this thread had the correct context. Besides, it was on top page of overclock and cooling still so I figured it was fresh enough. Should I start new thread to cover same topic? Seems frivolous. Especially on such a slow moving forum as OC&C

MDA400 · Aug 30, 2009

ill bump...

16 cores may well be the limit?

My Computer

HappyGuru Lovin2Help

My Computer

Nvidia and Intel

My Computer

47,65,65,6B

My Computer

New Member

My Computer

My Computer

Nvidia and Intel

My Computer

New Member

My Computer

My Computer

My Computer

My Computer

Nvidia and Intel

My Computer