Saturday, December 4, 2010

Predicting computer architecture in five years

Has Moore's law run out? How many cores will be common in CPUs in a few years? Will GPUs and CPUs merge into one? Let's look at where technology is heading now, and make rash predictions!

I realize that the technologies we actually get in five years won't match up perfectly with what I predict here, and some of these predictions sound pretty ambitious. But predicting that things won't change much has always been an easy way to be hilariously wrong about computers.

Solid state drives

Here's a really easy prediction: we'll see more solid state drives, because they make computers so much faster, and their price per gigabyte is only going to go down. I won't even talk about this one more; anybody can see that it's going to happen.

Moore's law

Moore's law predicts that the density of transistors per square millimeter will roughly double every two years. It doesn't say anything about computer speed. Intel and TSMC and GlobalFoundries say that they can keep it up for a while longer, and they're dropping the gigabucks to make it happen, so let's take them at their word and assume a 4x increase in the density of features on a chip over the next five years.

What does this mean for us? In the 90s, Moore's law delivered big gains in speed, as miniaturization gave us faster clock speeds and we got more sophisticated architectures that took better advantage of instruction-level parallelism (ILP), fancy cache hierarchies, and so on. Those increases have tapered off over the past decade or so, with clock speeds roughly holding steady and ILP running into its limits. Sucks to be us, huh?

In order to get more speed (or at least convince customers to buy new processors), the CPU makers have been doing the easy thing and putting multiple cores onto a single chip. This works great if you've got a bunch of processes running in parallel, as in most server workloads, but it's less impressive for desktop computers. On top of that, making software that takes advantage of multiple cores to speed itself up can be hard.

More and more cores

For some workloads -- servers and supercomputers, mostly -- it makes sense to pack a lot of cores onto a single chip. Take IBM's new POWER7 processor: 8 cores, 4 threads per core, and a 32 MB L3 cache, running at 3-4.25 GHz. It's a beast, and it has a scary price tag, but for the workloads it's designed for, it delivers amazing performance for the money.

The manycore approach will certainly be helped along by Moore's law. If you assume a 4x improvement in transistor density, then five years from now we could be packing four or eight of those POWER7-style core clusters onto a chip the same size. That's 32-64 cores, and 128-256 MB of L3 cache. Cool, right?
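
To make the arithmetic behind that extrapolation explicit, here's a quick back-of-the-envelope sketch in Python. The starting point (8 cores, 32 MB of L3) is the POWER7-class chip above; the doubling period and the conservative 4x factor are assumptions from this post, not data.

```python
# Back-of-the-envelope Moore's law extrapolation (assumptions, not measurements).
DOUBLING_PERIOD_YEARS = 2.0   # classic Moore's law cadence
YEARS_AHEAD = 5

theoretical_gain = 2 ** (YEARS_AHEAD / DOUBLING_PERIOD_YEARS)  # ~5.7x on paper
density_gain = 4                                               # conservative figure used in the text

cores_today, l3_mb_today = 8, 32          # roughly POWER7-class
cores_future = cores_today * density_gain
l3_mb_future = l3_mb_today * density_gain

print(f"~{cores_future} cores, ~{l3_mb_future} MB L3 on the same die area")
# -> ~32 cores, ~128 MB L3. The high end of the 32-64 core / 128-256 MB range
#    assumes the cores also slim down a little, or the die grows.
```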

Well, there are some wrinkles. Let's talk about memory bandwidth: the amount of data you can read from or write to memory per second. When you have a few cores on one chip, memory bandwidth isn't such a huge problem. But when you have maybe 32 cores, it gets very hard for your RAM to move data back and forth fast enough. This is already a problem on servers, and it's only going to get worse as you stick more cores on a chip.
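
To put rough numbers on that, here's a hedged sketch: assume a socket with around 25 GB/s of aggregate memory bandwidth (plausible for a few DDR3 channels) and see how much each core gets as the core count grows. These figures are illustrative assumptions, not measurements.

```python
# Illustrative only: per-core share of a fixed memory bandwidth budget.
AGGREGATE_BANDWIDTH_GBPS = 25.0   # assumed: a few DDR3 channels per socket

for cores in (4, 8, 16, 32, 64):
    per_core = AGGREGATE_BANDWIDTH_GBPS / cores
    print(f"{cores:2d} cores -> {per_core:5.2f} GB/s per core")

# With 32+ cores each core gets well under 1 GB/s unless the memory system
# scales up too -- hence the pressure on pins, caches, and (below) optics.
```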

One partial solution is to keep as much of the communication on the chip as possible. Really big L3 caches help with this. By replacing traditional shared buses and point-to-point interconnections with packet-switched networks on a chip, we can get a lot of bandwidth for on-chip communication, while saving quite a bit of power.

That still doesn't solve the problem of communicating with off-chip RAM. Current server chips already have as many as four DDR3 channels. You might wonder why they don't just add more. The problem is that each additional memory bus takes up more pins on the chip, and there are hard limits on how many pins you can realistically put on a package and route on a circuit board. Which brings me to:

Optical interconnect

Wouldn't it be great if we could use optical signals instead of electric ones? Instead of spending a bunch of pins on connecting to RAM, use a single fiber-optic channel! Unholy gobs of bandwidth, lower latency, easier routing -- what's not to like?

The problem here is that optical interconnect technology isn't quite there yet. It's being used for connections between processors in some recent supercomputers, and Intel is working on an optical replacement for everything from USB to SATA to HDMI called Light Peak, but we're not yet to the point where we can economically put optical transceivers onto a CPU.

That may be changing, though. There's been a lot of progress in on-chip photonics recently (PDF), and according to IBM, it's just a matter of time before they can even use optical communications within the same chip. In five years, they're predicting that we'll at least be using it for communicating with other chips. If their predictions are realistic, then we're going to see server chips with huge numbers of cores, and enough memory bandwidth to keep them pumped with data.

3D integration

We can already make three-dimensional integrated circuits, but the manufacturing and design techniques still need a lot of work. In a 3D chip, you make a few dies conventionally, then stack them on top of one another. You either connect tiny wires between them or use through-silicon vias to connect them directly. I'll assume the latter approach.

So what can we do with this? Theoretically we could make a CPU spread out over several layers, but that would take improvements to design tools that I don't see happening in five years. A more realistic option would be to have the processor cores on one layer, and the cache on another, stacked on top. This has some advantages over having both the cores and the cache on one chip:
  • Wires are shorter, so cache latency will be smaller. Memory is already the big bottleneck in processor speed, so this is a big deal.
  • Each die is smaller. This improves yield, the percentage of dies which turn out to actually work. Chip manufacturers end up throwing out a lot of dies because of defects, and this is a major part of price. Making chips in several smaller pieces could make them quite a bit cheaper.
  • Massive bandwidth.
In fact, chips for laptops and tablets and such could even have all their memory stacked on top of the processor. That would be fast.

Summary 1: What will server chips look like?

Loads of cores. I would predict between 16 and 64 cores per chip. If nothing else, cloud computing providers like Amazon will love it. More memory per server, and optical connections all over the place for massive bandwidth. 40 or 100 Gbit Ethernet seems a safe bet, at least in data centers. Networks on chip, probably electrical, but converting to electric-optical hybrids at some point in the next decade. At least some of the cache will migrate to another layer, making it faster and (eventually) cheaper. Also, virtualization should be faster.

Summary 2: How about desktops, laptops, tablets, smartphones, etc.?

Several cores, though nowhere near as many as on servers. We may see heterogeneous multicore systems, where we might have one slow but power-efficient core alongside a few faster, power-hungrier cores, and let the operating system divide processes between them based on priority.
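
As a rough illustration of what that division of labor could look like from user space, here's a sketch using Linux's CPU affinity API from Python. The core layout (core 0 as the slow, efficient core) and the priority threshold are purely hypothetical; a real OS scheduler would do this for you.

```python
import os

# Hypothetical layout: core 0 is the slow/efficient core,
# cores 1-3 are the fast, power-hungry ones.
EFFICIENT_CORES = {0}
FAST_CORES = {1, 2, 3}

def place_process(pid: int, nice_value: int) -> None:
    """Pin low-priority (high nice) work to the efficient core,
    everything else to the fast cores. Linux-only sketch."""
    if nice_value > 10:                       # arbitrary threshold
        os.sched_setaffinity(pid, EFFICIENT_CORES)
    else:
        os.sched_setaffinity(pid, FAST_CORES)

# Example: push the calling process's own background work onto the efficient core.
place_process(0, nice_value=15)   # pid 0 means "the current process"
```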

Graphics will be integrated with the CPU. We're already seeing the start of this with integrated systems-on-a-chip in smartphones and tablets, which combine one or more processor cores with some memory, video decoding hardware, and a graphics accelerator on a single chip. Making a CPU-GPU combination is a natural outgrowth of such integration, and seems only to be expected now that AMD and ATI are the same company. Such a combination could have several CPU cores, many smaller GPU cores, and a shared memory pool. This would make general-purpose GPU (GPGPU) computing more practical, by all but eliminating one of the biggest sources of overhead: transferring memory between the CPU and the GPU.
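
To see why that transfer overhead matters, here's a hedged back-of-the-envelope calculation: moving a buffer over a PCIe-class link to a discrete GPU versus reading it from a shared on-chip memory pool. Both bandwidth figures are assumptions for illustration only.

```python
# Illustrative: cost of shipping data to a discrete GPU vs. sharing memory.
BUFFER_MB = 256
PCIE_GBPS = 8.0        # assumed: PCIe-class link to a discrete GPU
SHARED_GBPS = 100.0    # assumed: shared on-chip memory pool

def transfer_ms(megabytes, gigabytes_per_second):
    return megabytes / 1024 / gigabytes_per_second * 1000

print(f"discrete GPU copy:  {transfer_ms(BUFFER_MB, PCIE_GBPS):.1f} ms each way")
print(f"shared memory read: {transfer_ms(BUFFER_MB, SHARED_GBPS):.1f} ms")
# For short GPU kernels, those copies can easily dominate the total runtime,
# which is exactly the overhead an integrated CPU-GPU design avoids.
```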

In more tightly integrated systems, I can easily imagine stacking all the memory on top of the processor. It takes a very long time to read or write memory in current computers, and if you can stack your memory directly on top of your processor, it will certainly make the computer faster by cutting down on memory latency.
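
To make "a very long time" concrete, here's a small hedged calculation: at a 3 GHz clock, a single trip to off-chip DRAM (assume roughly 70 ns) costs a couple of hundred cycles. The stacked-memory latency is likewise just an assumed figure to show the shape of the win.

```python
# Rough illustration of how expensive a memory access is in CPU cycles.
CLOCK_GHZ = 3.0
DRAM_LATENCY_NS = 70      # assumed: off-chip DRAM round trip
STACKED_LATENCY_NS = 30   # assumed: memory stacked on top of the CPU

cycles = lambda ns: ns * CLOCK_GHZ
print(f"off-chip DRAM:  ~{cycles(DRAM_LATENCY_NS):.0f} cycles stalled")
print(f"stacked memory: ~{cycles(STACKED_LATENCY_NS):.0f} cycles stalled")
```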

What does this mean for programmers?

Oddly enough, not as much as you might think.

If you write server software, taking advantage of multiple cores is clearly a good thing. Luckily, this is actually pretty easy to do with most server software. Take web servers, for instance: fire up several instances, and the OS automatically uses multiple processor cores. In general, if your software is designed to work on clusters of more than one server, then it can use multiple cores on a single server. If you work in high-performance computing, things are harder, but if you work in HPC then you already know that.
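
As a tiny sketch of the "fire up several instances" approach, here's what a pre-forked worker setup might look like in Python. The worker count and the work being done are placeholders; real servers (nginx, Apache's prefork model, and so on) do essentially this for you.

```python
import multiprocessing
import os

def worker(worker_id: int) -> None:
    # Placeholder for a real request-handling loop; each worker is a separate
    # process, so the OS is free to spread them across all available cores.
    print(f"worker {worker_id} running in process {os.getpid()}")

if __name__ == "__main__":
    # Assumed policy: one worker per available core.
    procs = [multiprocessing.Process(target=worker, args=(i,))
             for i in range(multiprocessing.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```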

Programmers can expect computers five years from now to be significantly but not dramatically faster on single-threaded workloads, thanks to the technologies that can cut down on memory latency. You can also expect higher-speed processors to get cheaper and less power-hungry, so if you're planning on programming for power-constrained devices, I think you can count on the processors being respectably zippy.

But that's all a little underwhelming, isn't it? None of the changes I predict here require dramatic overhauls to how we think about programming. There's nothing really radical here. I'm not even mentioning graphene electronics, or spintronics, or any of the other really cool things that may swoop in from out of nowhere and make my predictions look laughably conservative.

What do you think?