Saturday, December 4, 2010

Predicting computer architecture in five years

Has Moore's law run out? How many cores will be common in CPUs in a few years? Will GPUs and CPUs merge into one? Let's look at where technology is heading now, and make rash predictions!

I realize that the technologies we actually get in five years are not going to match up perfectly with what I predict here, and some of these sound pretty ambitious, but predicting that things won't change much has always been an easy way to be hilariously wrong about computers.

Solid state drives

Here's a really easy prediction: we'll see more solid state drives, because they make computers so much faster, and their price per gigabyte is only going to go down. I won't even talk about this one more; anybody can see that it's going to happen.

Moore's law

Moore's law predicts that transistor density will roughly double every two years. Note that it says nothing about computer speed. Intel, TSMC, and GlobalFoundries all say they can keep it up for a while longer, and they're dropping the gigabucks to make it happen, so let's take them at their word and assume a 4x increase in the density of features on a chip over the next five years.

What does this mean for us? In the 90s, Moore's law delivered big gains in speed, as miniaturization gave us faster clock speeds and we got more sophisticated architectures that took better advantage of instruction-level parallelism (ILP), fancy cache hierarchies, and so on. Those increases have tapered off over the past decade or so, with clock speeds roughly holding steady and ILP running into its limits. Sucks to be us, huh?

In order to get more speed (or at least convince customers to buy new processors), the CPU makers have been doing the easy thing and putting multiple cores onto a single chip. This works great if you've got a bunch of processes running in parallel, as in most server workloads, but it's less impressive for desktop computers. On top of that, making software that takes advantage of multiple cores to speed itself up can be hard.

More and more cores

For some workloads -- servers and supercomputers, mostly -- it makes sense to pack a lot of cores onto a single chip. Take IBM's new POWER7 processor: 8 cores, 4 threads per core, and a 32 MB L3 cache, running at 3-4.25 GHz. It's a beast, and has a scary price tag, but for the workloads for which it's designed, it delivers amazing performance for relatively modest costs.

The manycore approach will certainly be helped along by Moore's law. If you assume a 4x improvement in transistor density, then five years from now we could be packing four or eight of those on a chip the same size. That's 32-64 cores, and 128-256 MB of L3 cache. Cool, right?

Well, there are some wrinkles. Let's talk about memory bandwidth: the amount of data you can read from or write to memory per second. When you have a few cores on one chip, memory bandwidth isn't such a huge problem. But when you have maybe 32 cores, it gets very hard for your RAM to move data back and forth fast enough. This is already a problem on servers, and it's only going to get worse as you stick more cores on a chip.
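A quick back-of-the-envelope calculation shows the squeeze. All the numbers here are illustrative assumptions, not measurements: 32 cores at 3 GHz, each wanting (say) 4 bytes per cycle from memory, against four channels of DDR3-1333.

```javascript
// Illustrative back-of-the-envelope numbers -- assumptions, not measurements.
var cores = 32;
var clockHz = 3e9;           // 3 GHz
var bytesPerCycle = 4;       // assume each core wants 4 bytes/cycle from RAM

// Aggregate demand: 32 * 3e9 * 4 = 384 GB/s
var demand = cores * clockHz * bytesPerCycle;

// Supply: four channels of DDR3-1333, 8 bytes per transfer per channel,
// about 4 * 10.7 GB/s = ~42.7 GB/s total
var supply = 4 * 1333e6 * 8;

var ratio = demand / supply;   // ~9 -- demand outruns the bus by nearly 10x
```

Even with generous rounding, demand outruns the memory bus by close to an order of magnitude, which is why just adding cores stops helping.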

One partial solution is to keep as much of the communication on the chip as possible. Really big L3 caches help with this. By replacing traditional shared buses and point-to-point interconnections with packet-switched networks on a chip, we can get a lot of bandwidth for on-chip communication, while saving quite a bit of power.

That still doesn't solve the problem of communicating with off-chip RAM. Current server chips do things like having four DDR3 channels, and you might wonder why they don't just add more. The problem is that each additional memory bus takes up more pins on the chip, and there are hard limits on how many pins you can realistically put on a chip and route on a circuit board. Which brings me to:

Optical interconnect

Wouldn't it be great if we could use optical signals instead of electrical ones? Instead of spending a bunch of pins on connecting to RAM, use a single fiber-optic channel! Unholy gobs of bandwidth, lower latency, easier routing -- what's not to like?

The problem here is that optical interconnect technology isn't quite there yet. It's being used for connections between processors in some recent supercomputers, and Intel is working on an optical replacement for everything from USB to SATA to HDMI called Light Peak, but we're not yet to the point where we can economically put optical transceivers onto a CPU.

That may be changing, though. There's been a lot of progress in on-chip photonics recently, and according to IBM, it's just a matter of time before they can use optical communication even within a single chip. Their prediction is that within five years we'll at least be using it to talk to other chips. If they're right, we're going to see server chips with huge numbers of cores, and enough memory bandwidth to keep them fed with data.

3D integration

We can already make three-dimensional integrated circuits, but the manufacturing and design techniques still need a lot of work. In a 3D chip, you make a few dies conventionally, then stack them on top of one another. You either connect tiny wires between them or use through-silicon vias to connect them directly. I'll assume the latter approach.

So what can we do with this? Theoretically we could make a CPU spread out over several layers, but that would take improvements to design tools that I don't see happening in five years. A more realistic option would be to have the processor cores on one layer, and the cache on another, stacked on top. This has some advantages over having both the cores and the cache on one chip:
  • Wires are shorter, so cache latency will be smaller. Memory is already the big bottleneck in processor speed, so this is a big deal.
  • Each die is smaller. This improves yield, the percentage of dies which turn out to actually work. Chip manufacturers end up throwing out a lot of dies because of defects, and this is a major part of price. Making chips in several smaller pieces could make them quite a bit cheaper.
  • Massive bandwidth.
In fact, chips for laptops and tablets and such could even have all their memory stacked on top of the processor. That would be fast.

Summary 1: What will server chips look like?

Loads of cores; I'd predict between 16 and 64 per chip. If nothing else, cloud computing providers like Amazon will love it. More memory per server, and optical connections all over the place for massive bandwidth; 40 or 100 Gbit Ethernet seems a safe bet, at least in data centers. Networks on chip, probably electrical at first, converting to electrical/optical hybrids at some point in the next decade. At least some of the cache will migrate to another layer, making it faster and (eventually) cheaper. Also, virtualization should be faster.

Summary 2: How about desktops, laptops, tablets, smartphones, etc.?

Several cores, though nowhere near as many as on servers. We may see heterogeneous multicore systems, where we might have one slow but power-efficient core, and a few faster but less power-efficient cores, and let the operating system divide processes between them based on priority.

Graphics will be integrated with the CPU. We're already seeing the start of this with integrated systems-on-a-chip in smartphones and tablets, which combine one or more processor cores with some memory, video decoding hardware, and a graphics accelerator on a single chip. A CPU-GPU combination is a natural outgrowth of that kind of integration, and seems only to be expected now that AMD and ATI are the same company. Such a chip could have several CPU cores, many smaller GPU cores, and a shared memory pool. That would make general-purpose GPU (GPGPU) computing much more practical, by all but eliminating one of the biggest sources of overhead: transferring data between the CPU and the GPU.

In more tightly integrated systems, I can easily imagine stacking all the memory on top of the processor. It takes a very long time to read or write memory in current computers, and if you can stack your memory directly on top of your processor, it will certainly make the computer faster by cutting down on memory latency.

What does this mean for programmers?

Oddly enough, not as much as you might think.

If you write server software, taking advantage of multiple cores is clearly a good thing. Luckily, this is actually pretty easy to do with most server software. Take web servers, for instance: fire up several instances, and the OS automatically spreads them across processor cores. In general, if your software is designed to run on clusters of more than one server, it can use multiple cores on a single server. If you work in high-performance computing, things are harder, but if you work in HPC then you already know that.

Programmers can expect computers five years from now to be significantly but not dramatically faster on single-threaded workloads, thanks to the technologies which can cut down on memory latency. You can also expect higher-speed processors to get cheaper and less power-hungry, so if you're planning on programming for power-constrained devices, I think you can count on the processors being respectably zippy.

But that's all a little underwhelming, isn't it? None of the changes I predict here require dramatic overhauls to how we think about programming. There's nothing really radical here. I'm not even mentioning graphene electronics, or spintronics, or any of the other really cool things that may swoop in from out of nowhere and make my predictions look laughably conservative.

What do you think?

Saturday, September 18, 2010

Sets in JavaScript

JavaScript does not have a set data type, but it really should. Sadly, I'm not the boss of the world, so I can't require that they add one -- but what I can do is write my own. This is a bit tricky, actually. The standard way of faking sets in JavaScript is to use object properties; something like this:
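A minimal sketch of that idiom (the helper names here are mine, just for illustration):

```javascript
// Faking a set with object properties -- the standard JavaScript trick.
// Helper names are illustrative, not from any particular library.
var set = {};

function add(s, item)      { s[item] = true; }
function remove(s, item)   { delete s[item]; }
function contains(s, item) { return Object.prototype.hasOwnProperty.call(s, item); }

add(set, 'apple');
add(set, 'banana');
remove(set, 'banana');

contains(set, 'apple');   // true
contains(set, 'banana');  // false
```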

This is fast, and only a bit ugly. There are a few problems with it, though. First, you don't have the usual set-theoretic operations like intersection, difference, union, and so on. There's no easy way to find the cardinality of the set, unless you maintain a separate counter. You can fix all these by writing a custom set class, but that would take effort.

The second problem with this approach is more serious: object properties in JavaScript must be strings. Anything that's not a string is cast to a string. A proper set data type should be able to have the number 3 and a string '3' in the same set, as distinct members, but both of those get cast to the same string.
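You can see the collision directly:

```javascript
// Object property keys are always strings, so 3 and '3' collide.
var fake = {};
fake[3] = true;      // the key is coerced to the string '3'
fake['3'];           // true -- it's the same property
Object.keys(fake);   // ['3'] -- one member, not two
```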

What we need, to make a general-purpose set data type, is to use unordered arrays for storage. At least, that's the best option I could come up with that didn't involve making an extension in C. I've written a library to do this, and tested it with Node.js, although it should work on any web browser as well. Here's the repository on GitHub. The code can be very simple:
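Here's a sketch of the array-backed approach. This is not the simplesets API itself, just the core idea: linear scans make membership tests O(n), but strict-equality comparison means members keep their types.

```javascript
// Sketch of an array-backed set (illustrative names, not the simplesets API).
// indexOf uses strict equality, so the number 3 and the string '3' stay
// distinct members. (NaN is the one value indexOf can't find; ignored here.)
function ArraySet() { this.items = []; }

ArraySet.prototype.has = function (x) {
  return this.items.indexOf(x) !== -1;
};
ArraySet.prototype.add = function (x) {
  if (!this.has(x)) this.items.push(x);
};
ArraySet.prototype.size = function () {
  return this.items.length;
};
ArraySet.prototype.union = function (other) {
  var result = new ArraySet();
  this.items.concat(other.items).forEach(function (x) { result.add(x); });
  return result;
};

var s = new ArraySet();
s.add(3);
s.add('3');
s.size();   // 2 -- the number and the string are distinct members
```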

It supports pretty much the same API as Python's sets module. If you use Node.js, and you have the Node Package Manager, you can install it with "npm install simplesets". If you have any feedback, or want to contribute changes, please email me or fork it on GitHub.

Friday, June 18, 2010

Search pages should not use POST

This'll be a short post, because it's a simple thing. If you're writing a web site search form, it should not use HTTP POST to submit the query. This breaks the back button ("do you want to resubmit?") and is a poor fit for the semantics of the HTTP requests. What you want to use is a regular GET request.

The usual objection is that searches can be computationally expensive, so you don't want people resubmitting them willy-nilly. But the solution is not to make searching more painful; the solution is to cache results on the server for a few minutes with memcached. It's easy to install and use, very fast, and popular enough to be widely available. Use memcached and HTTP GET requests, and your web site will be closer to Just Working.
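The shape of that caching logic, sketched with a plain in-process object standing in for memcached (a real deployment would talk to a memcached server; all names here are illustrative):

```javascript
// Sketch: cache search results for a few minutes. A plain object stands
// in for memcached; the logic is the same either way.
var cache = {};
var TTL_MS = 5 * 60 * 1000;   // five minutes

function cachedSearch(query, runSearch, now) {
  now = now || Date.now();    // injectable clock, to make this testable
  var entry = cache[query];
  if (entry && now - entry.time < TTL_MS) {
    return entry.results;     // cache hit: skip the expensive search
  }
  var results = runSearch(query);   // cache miss: do the real work
  cache[query] = { time: now, results: results };
  return results;
}
```

With this in front of the search backend, hammering the back button just replays cheap cache hits.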

Wednesday, June 9, 2010

91 Free Icons from Dmitry Baranovskiy

Dmitry Baranovskiy, the author of the excellent Raphaël JavaScript vector graphics library, made a bunch of nice icons. Then he gave them away to everybody, but not in the formats most people would like. I wrote some Python code to convert them to SVG and PNG files in various sizes.

If you want to convert them to some other image size, you'll need librsvg and optipng. I tried using ImageMagick, but it wouldn't convert some of the SVG files correctly. As for optipng, I assume you'll want the PNG files to be small and load quickly; if you don't have optipng or don't want to use it, just comment out the line in the script that runs it.

Thursday, March 11, 2010

Threefish encryption

For one of my classes, I'm trying to figure out how to build hardware acceleration for the Threefish cipher. Step one was actually figuring out how to calculate it. That was surprisingly hard. Threefish may be state-of-the-art, really hot stuff, but there are no clearly written programs that show how you calculate it. There are programs written for greasy-fast speed, and programs written for smallness, but none that are actually clear about what they're doing.

So I had to write my own, obviously. I went to the submission to NIST, which explains Threefish clearly but puts everything in a mathematical notation quite unlike most programming languages. I translated that description as closely as possible into Haskell, which resulted in a small, elegant, correct program that would have taken days to finish a single calculation.

The problem is that the mathematical description has roughly the same performance characteristics as the naive Fibonacci function: it explodes exponentially. I fixed this with a really dirty trick: adding some list indirection to get Haskell to memoize values. I'd tell you more about it, but I'm still not entirely sure what I did myself.
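To see the general idea, here it is transplanted into JavaScript rather than Haskell (this is plain memoization, not the list-indirection trick itself): the naive recursion recomputes the same subproblems exponentially many times, and a memo table collapses that to linear work.

```javascript
// Naive version: recomputes fib(n-2) all over the call tree -- exponential.
function slowFib(n) {
  return n < 2 ? n : slowFib(n - 1) + slowFib(n - 2);
}

// Memoized version: each value is computed once, then looked up.
var memo = [0, 1];
function fastFib(n) {
  if (memo[n] === undefined) {
    memo[n] = fastFib(n - 1) + fastFib(n - 2);
  }
  return memo[n];
}

fastFib(40);   // 102334155, instantly; slowFib(40) takes noticeably longer
```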

Anyway, triumph! Hurrah! The code is here. Be warned, it's pretty rough.