Find out about the latest issue
Never miss an update! Get PC Plus in your RSS reader and follow us on Twitter

Learning The Ropes

When a string just can’t help you out with your programming quandary, ropes come to the rescue.

Since the dawn of the computer age, programming languages have treated text as an array of characters. Procedural languages like C assume that the text in the character array is terminated with a null character, whereas Pascal variants maintain a length count as part of the array, usually as a hidden field that the compiler knows about.

No matter how these arrays are implemented, they have become known as ‘strings’. The compiler and the runtime provide methods and routines for operating on these strings, giving you instant access to functions responsible for finding the length of the text, concatenating two strings, finding a substring in a string, iterating through the characters and so on. These library routines are usually written in collaboration with the compiler, in the sense that the compiler knows about strings and can optimise such routines.

In modern languages like Java, C#, Ruby and JavaScript, strings are important enough to be included as a primitive data type.

Looking at strings

It’s clear that the implementation of string types will vary according to the language, but in general it’s safe for us to make the following assumptions:

1. The string data type stores its text as a continuous array of characters.
2. The string data type is a complex type in the sense that there is other information stored alongside the text. This information is usually hidden and includes things like a field for the length, a field for the size of the memory holding the string and so on. Some string implementations also ensure that the text is terminated with a null character (operating system APIs tend to use such strings).
3. Strings are ‘immutable’, so they cannot be modified after they have been created.

The first assumption here is a consequence of the history of strings: it’s just the simplest data structure with which to hold a chunk of text. The advantage of treating the string as an ordinary array is that you can easily do things like parsing the text. On the flip side, large strings stored this way aren’t very memory efficient. Imagine processing a text log file that’s megabytes in size. You’d end up allocating or deallocating large amounts of contiguous memory, resulting in serious fragmentation issues.

The second assumption is a reflection of the need to treat strings as a primitive type rather than just an array. Once you’ve done this, it makes sense to make some operations on strings work more efficiently by using information stored in extra memory (since we’re already hiding the internals of the string based on its history as an array).

The third assumption is easily the most important. When you pass strings around, you don’t want to have to worry about one routine altering the string as another uses it. After all, you don’t have to worry about some part of the program changing the literal 0 to something else, such as a 1. In fact, even thinking about such a possibility hurts the brain. If strings are to be a primitive type, it makes sense to enforce the rule that their values are immutable. Then you don’t even have to worry about such a possibility. This assumption is even more important with regard to concurrency in programs, where you have multiple threads operating on the same data at the same time.

Once we have these assumptions in place, we’d also like the following couple of characteristics to apply to our string type:

1. Common operations should be very efficient. Finding the length of a string, iterating over the characters in a string, concatenating strings and finding or retrieving a substring must be quick and easy operations.
2. These operations should remain efficient no matter how many characters the given string contains.

In our enthusiasm for strings, the second characteristic here is often forgotten. The strings we deal with are usually fairly small: text for labels on-screen, dates for display and so on. But when we talk about really long strings, like a voluminous text log file or a string representation of a strand of DNA, we find that the usual operations don’t scale.

Introducing ropes

In 1994, Hans J Boehm – one of the foremost researchers into memory garbage collection in applications and inventor of the ‘Boehm collector’ algorithm – joined fellow researchers Russ Atkinson and Michael Plass in publishing a paper called Ropes: An Alternative to Strings, which detailed an alternative data structure that could be used to hold and manipulate text. This new structure was called a ‘rope’ because of its relationship to a string. Just as real rope is used for more heavyweight purposes than real string, computing rope is a more complex structure than a computing string but useful for similar purposes.

Let’s break it down. A rope is an example of a binary tree. Its internal nodes are known as concatenation nodes, and the ‘leaves’ are nodes containing chunks of text. Figure 1, below, compares the text ‘PC Plus is the UK’s premier technology magazine’ when it’s stored in a normal string with when it’s stored in a rope. From this figure you can see that in order to read the full text you have to scan the chunks in ‘infix’ order. This means that once you’re given a node you have to read the left sub-tree and then the right sub-tree in order.

At this point, you may well be wondering where the benefits are to such a representation. Well, let’s imagine you have two blocks of text that you want to concatenate, first represented as strings and then as ropes. Furthermore, imagine that both blocks are 100MB in size.

For a string, the operation would work like this. Find out the length of both strings; allocate a new block of string memory that can hold both these strings joined together; copy the first string to this new block; and then copy the second string to follow the first block. Oh, and after all that, you’ll need to deallocate the memory used by the first and the second string (although if you’re working in a garbage

-collected environment, this will occur later on – sometimes much later). To put things mildly, this is a large amount of processing. Furthermore, in the middle of this operation, before any deallocation has taken place, we’ll need over 400MB of memory allocated, twice as much as the two strings together.

The rope operation is much simpler. We allocate a new concatenation node, set the first rope as the left child and the second rope as the right child, and we’re done. Compared to the string case, the amount of processing is virtually zero and the extra memory required is almost nothing. Figure 2, below, shows the process.

Substring operations

In order to perform substring-type operations, we have to make this binary tree a little more complex by including a ‘position’ value in each node. This turns the binary tree into a search tree. If you look at Figure 1, we’ve included the position value in the concatenation nodes. By using this position value, it becomes simple to find the nth character in the string.

For example, let’s find the 20th character in Figure 1. The first node is at position 29. That is, characters 0 to 28 are to the left of the node, and characters 29 to the end are to the right. To find character 20, we go left. The next node is at position 13. Since 20 is greater than 13, we go right to the leaf node. At that point we can easily calculate the correct character in that node.

This procedure points out the main overhead that a rope has in comparison to a string: in order to find the nth character, you have to start at the root node, compare the position values and work your way down the concatenation nodes in the tree until you reach the correct leaf node with the text containing your character. This process takes an amount of time proportional to the depth of the tree. For a perfectly balanced tree, this time will be proportional to the logarithm of the number of leaves in the tree. Indeed, if you know anything about binary search trees, you’ll already know that we have to take into account the rebalancing of the tree during insertion. For a string, the time needed to get the nth character is constant.

Luckily, in normal programming, getting the nth character is usually done in a process of iterating over all the characters in the tree in a sequential manner. We can easily provide a special method for the rope that iterates over the characters in the text, and this method can maintain its position in the tree very easily to avoid the time wasted by continually tracing the path from the root. In fact, amortised over all the characters in the text, the sequential retrieval of a single character is constant for a rope, just like the string.

Dealing with concatenations

Although concatenation of text runs in constant time for a rope, the method we use can have important implications for efficient operations on the rope later on. The most common reason for using concatenation is to append text.

For the first concatenation in a sequence, we can imagine having two strings. For this we can create a simple rope with a single concatenation node. For the subsequent append operations, we look at the right child of the rope. If it’s a leaf node, we create a new node, set the left child to this leaf node, set the right child to the new bit of text and then attach this new concatenation node as the right child of the original root.

Figure 3, above, demonstrates what happens when a progression of concatenations is completed in this manner. Note that we can put off rebalancing the tree for longer than we would have had to originally. Also notice that this algorithm will work just as easily for the prepend operation, as well as for the occasions where we continually add strings to the start of a rope.

In brief, a rope is more efficient than a string at creating text through concatenation, insertion and deletion, especially if the amount of text is large. However, a string is still more efficient at accessing the characters in the text, even though such an operation is an amortised constant within the rope framework.

Another scenario that favours a string is when you’re using regular expressions to match text: the rope has too much complexity to work well with such an operation, and the ability to quickly perform random access of characters in the text becomes very important. This is especially true because regular expressions generally involve some kind of backtracking over characters in the text.

If the rope library has been written to share leaf nodes, there’s another scenario where a rope implementation shines: a text editor with an undo facility. A text editor is all about insertion and deletion of text. In such an application, using a rope is much more efficient, especially for large texts. Furthermore, managing the undo stack becomes extremely simple and efficient. For every change, store the previous rope. Since every rope that’s created is immutable, and will share leaf nodes with all the other ropes that you’ve been creating through editing the text, the undo stack will take a minimal amount of memory. Therefore it makes sense to store it as a rope.

This 'ropes' are just Erlang's list :)

Daniel Kwiecinski's picture

Post new comment

The content of this field is kept private and will not be shown publicly.
If you have a Gravatar account, used to display your avatar.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <p>

More information about formatting options

CAPTCHA
We apologise for making you prove your humanity...
Find out about the latest issue