Sunday, July 24, 2011

Port of Murmur3 hash to C

Murmur3 is a non-cryptographic hash function, stupendously fast, with excellent mathematical properties. It's just the thing to use if you want to make a hash table, or a Bloom filter, or do any number of other things that require fast, high-quality hashing. Unfortunately, the original code was in C++ and designed mostly for Microsoft Visual Studio, making it not the tool of choice for those of us who would prefer C code, compiled with gcc or something similar. So I ported the code to C and cleared out the MSVC-specific parts.


It's public domain, so do whatever you want with it. There's just a single C file and a short header file, so it should be really easy to incorporate into your own projects. See the readme file on GitHub for more details, or the example program to see how it's used.

Happy hashing!

Sunday, February 6, 2011

Storing passwords securely in Haskell

If you need to store and verify passwords, the usual advice is to use bcrypt. It neatly handles all the issues, with a simple API. But Haskell doesn't have bcrypt bindings, so people are tempted to roll their own password storage methods, and perhaps get it wrong. I decided to fix that. So now here's how you store passwords securely in Haskell:

Just use pwstore. It handles the details for you.

What this library does is this:
  1. Combine the user's password with a randomly-generated salt.
  2. Hash this slowly. By iterating SHA-256 a few thousand times, we make brute-force guessing a lot less practical.
  3. Store this has along with the salt.
This scheme is essentially an implementation of the PBKDF1 key derivation function (as specified in RFC 2898) with some convenience code around it to make it easy to use and really hard to mess up.

How to use it

Just follow the instructions in the documentation. You store a single string containing the hashed password, the salt, and some extra book-keeping information:
"sha256|12|Ge9pg8a/r4JW356Uux2JHg==|Fdv4jchzDlRAs6WFNUarxLngaittknbaHFFc0k8hAy0="
You can make one of these with the makePassword function:

>>> makePassword "hunter2" 12
"sha256|12|lMzlNz0XK9eiPIYPY96QCQ==|1ZJ/R3qLEF0oCBVNtvNKLwZLpXPM7bLEy/Nc6QBxWro="
To verify a password given by a user, use the verifyPassword function:

>>> verifyPassword "wrong guess" passwordHash
False
>>> verifyPassword "hunter2" passwordHash
True
There are also functions for adjusting the number of iterations of the hash function, so you can make the hashing slower as Moore's law progresses. This works in essentially the same way as bcrypt's work factor, except that the API here is a bit nicer than bcrypt's.

The theory

There are many wrong ways to store a password, most of them all too common. The most obvious way to go wrong is to store users' passwords in plain text. Then, when an attacker manages to get their hands on this file, they have the passwords for every user's account. One step up, but still wrong, is to simply hash all passwords with SHA1 or something. This is vulnerable to rainbow table and dictionary attacks. One step up from that is to hash the password along with a unique salt value. This is still vulnerable to dictionary attacks, since guessing a password is very fast.

The right thing to do is to use a slow hash function, to add some small but significant delay, that will be negligible for legitimate users but prohibitively expensive for someone trying to guess passwords by brute force. That is what this library does.

Where to get it

There are two versions of the package available from Hackage:
  1. pwstore-fast is the preferred version.
  2. pwstore-purehaskell has the same API, but only pure Haskell dependencies. It's usable, but about 25 times slower than pwstore-fast.
The source code is on GitHub. If anybody has suggestions for improvements, I'd love to hear them.

Discussions: Reddit [22 comments].

Thursday, February 3, 2011

My strings can beat up your strings

Strings! they're a fundamental data type, and every language more modern than C has some kind of built-in support for them. Naturally, we want them to work efficiently and correctly. But that turns out to be surprisingly complicated.

C represented strings as null-terminated arrays of ASCII characters. This meant that you couldn't include '\0' in your strings, and if you forgot the null terminator, you were screwed. I hope you like buffer overflows. And the fact that strings can be changed in-place by any piece of code with a pointer to it should make anybody a little nervous. Strings, in order to behave the way we expect, should be immutable by default.

More recent languages generally agree on these things. Java, Python, and C# (among many others) all represent strings as immutable objects, with the length recorded explicitly -- no null-termination shenanigans. You can share memory among substrings, and pass string values around without having to worry about them being modified from some other thread somewhere. This sort of thing pleases a compiler greatly, and makes programs less bug-prone. This also makes it possible to store only a single copy of short strings, to improve memory efficiency, cache locality, and so on. Plus, you get ridiculously fast string comparisons, since two equal strings will always be represented by the same pointer.

Unicode headaches

There's one more big wrinkle here, though, and that's Unicode. If we want to represent more characters than the ones that ASCII gives us, we need to be able to use more than one byte per character. And that's a pain in the ass.

You can represent any Unicode character in 32 bits. This gives you a very simple representation, just an array of 32-bit values. It also uses several times more memory than alternatives like UTF-16 and UTF-8, which encode characters with a variable number of bytes. For people using mostly ASCII characters, UTF-8 is going to be the most compact, by far. It also makes random string indexing take O(n) time, because you don't know how wide the characters are until you look at them, so you can't just use O(1) array indexing like you can with fixed-width encodings. Another option, that several systems use, is to use a fixed-width encoding of 16 bits per character, and just ignore the less commonly-used characters. That's vaguely adequate-ish, but it's overkill for ASCII characters, and doesn't support all of Unicode.

So what do we do? For small strings, in European languages, UTF-8 is probably just about right. Having O(n) random indexing doesn't really matter when n is small; cache locality matters more. Most strings are small, so it's important that they be efficient. First rule of optimization: make the common case fast.

How about longer strings? What if we have a 10 MB string, and we want to do random indexing, substring slicing, maybe insert a few characters in the middle... how do we do it efficiently? UTF-32 is preposterously memory-heavy. UCS-2 doesn't support all of Unicode. UTF-8 is memory-efficient, but variable-width. UTF-16 is usually less memory-efficient than UTF-8, and still variable width. That's the worst of both worlds. What do we do?

Heavyweight strings

At this point, we really should stop thinking of strings as arrays of bytes. They're sequences of characters. It's a subtle distinction -- but there are other ways of representing sequences.

Haskell's default strings, for example, are laughably inefficient linked lists of of 32-bit characters. No, seriously. There's a reason people use libraries for more efficient strings. But while this isn't exactly a practical string encoding, it at least illustrates that strings don't have to be arrays.

What would a practical heavyweight string look like? One way to do it would be as a tree of character array chunks. For example, the string "hello, world" might be represented like this:



This is a balanced binary tree, but of course you can pick whatever representation you like. This kind of data structure is called a rope. If you go with ropes as your string representation of choice, you get cheap concatenation and substring slicing.

Ropes also make string modification cheaper. With immutable array-based strings, any time you want to change so much as a single character, you have to allocate a new copy. With any array-based strings, if you want to add or remove characters, you need to copy a bunch of other characters around. With ropes, you can share most of the blocks of characters between the old and new versions of the string, and the insertion and deletion operations are both cheap. Suppose you add an exclamation mark to the end of our string, turning it into "hello, world!" and keep both versions around. You just need to allocate a new block to hold "rld!":

This is the kind of string representation often used in text editors, where they need to support fast and easy string modification. This allows constant- or logarithmic-time concatenation, logarithmic-time modification, and good memory efficiency. The drawback is that it makes our strings more complicated, and makes random indexing an O(lg n) time operation -- worse than the O(1) that people get with fixed-width array-based strings.

But wait! That O(lg n) is still better than the O(n) we would get with variable-width array-based strings, if we went with (say) UTF-8. How can this be?

The answer is that, at each node in the tree, we store the length of the string formed by its children. The tree actually looks something like this:

This way, we can combine efficient string indexing with efficient Unicode encodings. If you encode each of the chunks in UTF-8, then that doesn't change a thing about how you traverse the tree to find the chunk with the nth character. And once you've narrowed it down to a single chunk, the time left for string indexing is proportional to the chunk size -- usually not a very large overhead.

Thus, oddly enough, heavyweight string data types can actually use less memory than simpler ones, by making compact variable-length encodings fast. They give better asymptotic time bounds on most operations. They allow cheap modification of strings, without throwing away immutability. I think that's pretty cool.

(I'm working on an implementation of this in Haskell using the Data.ByteString.UTF8 and Data.FingerTree libraries. It's proving surprisingly straightforward and easy. You just need to combine UTF-8 array strings with a balanced tree that supports caching the value of a monoid at the leaves. That's exactly what a finger tree is all about, so it all fits together nicely. I'll put it up on GitHub sometime in the not-too-distant future.)

Wednesday, February 2, 2011

Newton's method, in very few lines of Haskell

I hesitate to even write this blog post, because when I saw the code, it seemed trivial. But then, maybe that's the best kind of code. I'm going to assume that you know about Newton's method for approximately solving equations of the form f(x) = 0. The short explanation is that you start out with a guess point, x0, and then iterate from there with this equation:

x_{n+1} = x_n + \frac{f(x_n)}{f^\prime(x_n)}

This requires you to know the derivative of the function. Let's write this in Haskell:


next f f' xn = xn - ((f xn) / (f' xn))


Apply this to some starting number, and you get the next one. For example, suppose that you're trying to solve sin(x) = 0, and your starting guess is 3:

*Main> next sin cos 3
3.142546543074278

Now try applying it again a couple more times:

*Main> let iter = next sin cos
*Main> iter 3
3.142546543074278
*Main> iter (iter 3)
3.141592653300477
*Main> iter (iter (iter 3))
3.141592653589793
*Main> pi
3.141592653589793

After just three iterations, we get pi, correct to at least fifteen decimal places. Since this is Haskell, let's make a list of all the iterations:

newton f f' x0 = iterate next x0
where next xn = xn - ((f xn) / (f' xn))

The iterate function takes a function f and an initial value x, and returns the list [x, f(x), f(f(x)), ...]. This is exactly what we need to express the iteration in Newton's method. Let's try it out:

*Main> take 4 $ newton sin cos 3
[3.0, 3.142546543074278, 3.141592653300477, 3.141592653589793]

Score! Let's try it out on a more complicated function, f(x) = 12x4 - 5x3. Here's what it looks like:


graph


Let's see if we can find that root a little past x = 0.4, with some wild guess.

*Main> mapM_ print $ take 50 $ newton (\x -> 12*x^4 - 5*x^3) (\x -> 48*x^3 - 15*x^2) 400
400.0
300.02606202762576
225.04561534004728
168.81028938363653
126.63380700188723
95.00146134126632
71.2772236173846
53.48407405669417
40.13925026128285
30.130683698176643
22.62432736171233
16.994651928171486
12.772518440796631
9.606083627985576
7.231480047805808
5.4508278904529295
4.115746372269005
3.1149912022861663
2.365188920791852
1.8038979233235874
1.384421748669219
1.071949969695422
0.8407198212811359
0.6719880367106018
0.5526705013094125
0.47442889059722404
0.4321200911656781
0.4181639590095718
0.4166825795028489
0.4166666684895603
0.4166666666666667
0.41666666666666663
0.41666666666666663
...

Not bad! Starting it with an even higher guess doesn't hurt the speed much, either. So there you have it: Newton's method in two lines of Haskell code. I feel like there ought to be more to it, but there just isn't.

(A word about speed, in case you were wondering: the list construction will get optimized away, and this will turn into essentially the same compiled code as if you had written it in C. The compiler is allowed to be smart, sometimes.)