Thursday, February 3, 2011

My strings can beat up your strings

Strings! They're a fundamental data type, and every language more modern than C has some kind of built-in support for them. Naturally, we want them to work efficiently and correctly. But that turns out to be surprisingly complicated.

C represented strings as null-terminated arrays of ASCII characters. This meant that you couldn't include '\0' in your strings, and if you forgot the null terminator, you were screwed. I hope you like buffer overflows. And the fact that strings can be changed in-place by any piece of code with a pointer to them should make anybody a little nervous. Strings, in order to behave the way we expect, should be immutable by default.

More recent languages generally agree on these things. Java, Python, and C# (among many others) all represent strings as immutable objects, with the length recorded explicitly -- no null-termination shenanigans. You can share memory among substrings, and pass string values around without having to worry about them being modified from some other thread somewhere. This sort of thing pleases a compiler greatly, and makes programs less bug-prone. Immutability also makes it possible to store only a single copy of short strings (interning), which improves memory efficiency, cache locality, and so on. Plus, for interned strings you get ridiculously fast comparisons, since two equal strings will always be represented by the same pointer.

Unicode headaches

There's one more big wrinkle here, though, and that's Unicode. If we want to represent more characters than the ones that ASCII gives us, we need to be able to use more than one byte per character. And that's a pain in the ass.

You can represent any Unicode character in 32 bits. This gives you a very simple representation -- UTF-32, just an array of 32-bit values. It also uses several times more memory than alternatives like UTF-16 and UTF-8, which encode characters with a variable number of bytes. For people using mostly ASCII characters, UTF-8 is going to be the most compact, by far. But it also makes random string indexing take O(n) time, because you don't know how wide the characters are until you look at them, so you can't just use O(1) array indexing like you can with fixed-width encodings. Another option, which several systems use, is a fixed-width encoding of 16 bits per character (UCS-2) that simply ignores the less commonly-used characters. That's vaguely adequate-ish, but it's overkill for ASCII characters, and doesn't support all of Unicode.
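
To make the variable-width point concrete, here's a tiny example using the utf8-string package's Data.ByteString.UTF8 module (the sample strings are my own): character counts and byte counts diverge as soon as you leave ASCII, which is exactly why you can't jump straight to the nth character.

    import qualified Data.ByteString as B
    import qualified Data.ByteString.UTF8 as U

    main :: IO ()
    main = do
      let ascii = U.fromString "hello"       -- pure ASCII
          mixed = U.fromString "héllo 世界"   -- Latin-1 and CJK characters mixed in
      print (U.length ascii, B.length ascii)  -- (5,5): characters and bytes coincide
      print (U.length mixed, B.length mixed)  -- (8,13): 8 characters, 13 bytes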

So what do we do? For small strings, in European languages, UTF-8 is probably just about right. Having O(n) random indexing doesn't really matter when n is small; cache locality matters more. Most strings are small, so it's important that they be efficient. First rule of optimization: make the common case fast.

How about longer strings? What if we have a 10 MB string, and we want to do random indexing, substring slicing, maybe insert a few characters in the middle... how do we do it efficiently? UTF-32 is preposterously memory-heavy. UCS-2 doesn't support all of Unicode. UTF-8 is memory-efficient, but variable-width. UTF-16 is usually less memory-efficient than UTF-8, and still variable width. That's the worst of both worlds. What do we do?

Heavyweight strings

At this point, we really should stop thinking of strings as arrays of bytes. They're sequences of characters. It's a subtle distinction -- but there are other ways of representing sequences.

Haskell's default strings, for example, are laughably inefficient linked lists of 32-bit characters. No, seriously. There's a reason people use libraries for more efficient strings. But while this isn't exactly a practical string encoding, it at least illustrates that strings don't have to be arrays.

What would a practical heavyweight string look like? One way to do it would be as a tree of character array chunks. For example, the string "hello, world" might be represented like this:
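
As a rough sketch in Haskell (the Rope type and this particular chunking are just illustrative):

    -- A minimal rope: leaves hold small chunks of characters, and internal
    -- nodes just glue two subtrees together.
    data Rope = Leaf String     -- a contiguous chunk of characters
              | Node Rope Rope  -- concatenation of two ropes

    -- One plausible balanced chunking of "hello, world":
    hello :: Rope
    hello = Node (Node (Leaf "hel") (Leaf "lo,"))
                 (Node (Leaf " wo") (Leaf "rld"))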



This is a balanced binary tree, but of course you can pick whatever representation you like. This kind of data structure is called a rope. If you go with ropes as your string representation of choice, you get cheap concatenation and substring slicing.
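
With that toy Rope type, concatenation really is a single allocation (ignoring the rebalancing a production rope would eventually want):

    -- O(1) concatenation: allocate a single Node pointing at the two
    -- existing ropes; no character data is copied.
    concatRope :: Rope -> Rope -> Rope
    concatRope = Node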

Ropes also make string modification cheaper. With immutable array-based strings, any time you want to change so much as a single character, you have to allocate a new copy. With any array-based string, if you want to add or remove characters, you need to copy a bunch of other characters around. With ropes, you can share most of the blocks of characters between the old and new versions of the string, and insertion and deletion are both cheap. Suppose you add an exclamation mark to the end of our string, turning it into "hello, world!", and keep both versions around. You just need to allocate a new block to hold "rld!":
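
Continuing the toy sketch from above: the new version reuses the old left subtree and the " wo" leaf as-is, and only one new leaf gets allocated.

    -- "hello, world!" built from the rope above: 'left' and 'wo' are the
    -- very same heap objects as in 'hello'; only the "rld!" leaf is new.
    helloBang :: Rope
    helloBang = case hello of
      Node left (Node wo _rld) -> Node left (Node wo (Leaf "rld!"))
      _                        -> hello  -- shape didn't match; illustrative only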

This is the kind of string representation often used in text editors, which need to support fast and easy string modification. It allows constant- or logarithmic-time concatenation, logarithmic-time modification, and good memory efficiency. The drawback is that it makes our strings more complicated, and makes random indexing an O(lg n) operation -- worse than the O(1) that people get with fixed-width array-based strings.

But wait! That O(lg n) is still better than the O(n) we would get with variable-width array-based strings, if we went with (say) UTF-8. How can this be?

The answer is that, at each node in the tree, we store the length of the string formed by its children. The tree actually looks something like this:
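
In sketch form (the names and the per-chunk sizes are illustrative), every node carries the character count of what's underneath it:

    -- A size-annotated rope: every constructor records how many
    -- characters live underneath it.
    data SizedRope = SLeaf Int String               -- chunk length, chunk
                   | SNode Int SizedRope SizedRope  -- total length of children

    sizeOf :: SizedRope -> Int
    sizeOf (SLeaf n _)   = n
    sizeOf (SNode n _ _) = n

    helloSized :: SizedRope
    helloSized =
      SNode 12 (SNode 6 (SLeaf 3 "hel") (SLeaf 3 "lo,"))
               (SNode 6 (SLeaf 3 " wo") (SLeaf 3 "rld"))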

This way, we can combine efficient string indexing with efficient Unicode encodings. If you encode each chunk in UTF-8 and have the nodes count characters rather than bytes, that doesn't change a thing about how you traverse the tree to find the chunk with the nth character. And once you've narrowed it down to a single chunk, the time left for string indexing is proportional to the chunk size -- usually not a very large overhead.
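
Indexing then just descends the tree, comparing against the cached counts, and only scans the final chunk. Here's a sketch against the SizedRope type above, with plain String chunks for simplicity; with UTF-8 chunks you'd decode within the chunk instead of using !!:

    -- O(lg n) to find the right chunk, then O(chunk size) inside it.
    indexRope :: SizedRope -> Int -> Char
    indexRope (SLeaf _ s) i = s !! i
    indexRope (SNode _ l r) i
      | i < sizeOf l = indexRope l i
      | otherwise    = indexRope r (i - sizeOf l)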

Thus, oddly enough, heavyweight string data types can actually use less memory than simpler ones, by making compact variable-length encodings fast. They give better asymptotic time bounds on most operations. They allow cheap modification of strings, without throwing away immutability. I think that's pretty cool.

(I'm working on an implementation of this in Haskell using the Data.ByteString.UTF8 and Data.FingerTree libraries. It's proving surprisingly straightforward and easy. You just need to combine UTF-8 array strings with a balanced tree that caches a monoidal measure of its leaves at each node. That's exactly what a finger tree is all about, so it all fits together nicely. I'll put it up on GitHub sometime in the not-too-distant future.)
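
For what it's worth, here is roughly the shape such an implementation might take -- a finger tree of UTF-8 chunks, measured by character count. The names here (URope, Chunk, charAt) and the details are my own sketch, not the author's actual code:

    {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

    import qualified Data.ByteString.UTF8 as U
    import           Data.FingerTree (FingerTree, Measured (..), ViewL (..), (><))
    import qualified Data.FingerTree as F
    import           Data.Monoid (Sum (..))

    -- Leaf chunks are UTF-8 encoded; the cached measure is the number of
    -- characters (not bytes) in the chunk.
    newtype Chunk = Chunk U.ByteString

    instance Measured (Sum Int) Chunk where
      measure (Chunk b) = Sum (U.length b)   -- character count

    newtype URope = URope (FingerTree (Sum Int) Chunk)

    -- O(lg n) concatenation, courtesy of the finger tree.
    append :: URope -> URope -> URope
    append (URope a) (URope b) = URope (a >< b)

    -- Split off the prefix holding characters [0, i), then look inside
    -- the chunk that character i falls into.
    charAt :: URope -> Int -> Maybe Char
    charAt (URope t) i
      | i < 0     = Nothing
      | otherwise =
          let (before, rest) = F.split (> Sum i) t
              Sum skipped    = measure before
          in case F.viewl rest of
               Chunk b :< _ -> fst <$> U.uncons (snd (U.splitAt (i - skipped) b))
               EmptyL       -> Nothing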

8 comments:

  1. Java's substring sort of does this in a limited fashion.

    http://www.jroller.com/dbrosius/entry/the_curious_case_of_stringbuilder

  2. I've thought that this ought to be the default string representation for most high-level languages ever since I learned what a rope was. I look forward to seeing your code. Thanks for the post.

  3. Any reason you're using Data.ByteString.UTF8 and not Data.Text? It's basically the same as ByteStrings but it knows about encoding, which means you could have ropes with any encoding that Text supports (utf-{8, 16LE, 16BE, 32LE, 32BE})

  4. I remember that 'rope' was one of the built-in C++ string types for one of the early C++ compilers (Ah, a little googling shows this was for SGI). But it never really caught on. There are not many places where a complex data structure for strings outperforms the old 'sequence of bytes' implementation.

  5. There's no particular reason that I'm using Data.ByteString.UTF8 instead of Data.Text. I really just wanted to make a proof-of-concept, and for that, D.BS.UTF8 is working fine. You're right that it would be good to be able to support UTF-16 as well, since some languages are significantly more space-efficient in that encoding.

  6. Nice idea! I think I'm going to try something similar for ATS, which pretty much just has C style strings available.

  7. BTW, for completeness' sake: Java's String no longer does this as of 1.7_45. :)

  8. Thanks for the post. Shouldn't this be the default string representation for, well, all languages? Correct me if I'm wrong.
