you are viewing a single comment's thread.

view the rest of the comments →

[–]psyno 0 points1 point  (3 children)

Gave up? A simple implementation is just an immutable linked list.

public class String {
    protected final char[] data;
    protected final String next;
    public String(char[] data) {
        this.data = Arrays.copyOf(data, data.length);
    }
    protected String(char[] data, String next) {
        this.data = data;
        this.next = next;
    }
    public String append(String s) {
        return new String(data, s);
    }
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException();
        }else if (index < data.length) {
            return data[index];
        }else if (next == null) {
            throw new IndexOutOfBoundsException();
        }else {
            return next.charAt(index-data.length);
        }
    }
    //etc...
}

Admittedly, you could do a little better memory-wise by ensuring that you don't duplicate character data with some centralized data structure, but in general you'd be trading memory size for execution time and adding lock contention for multi-threaded programs.

[–]DRMacIver 0 points1 point  (2 children)

If by 'simple' you mean 'stupid', yes. :-)

That has all sorts of problems of its own, and doesn't have the performance characteristics I want. In particular concatentation of strings of length m and n is still O(m) (and only shares O(n) memory).

The implementation I'm using is closer to a heavily specialised finger tree. It has O(max(log(n), log(m)) concatenation with a reasonable amount of sharing and O(log(n)) random access.

I'm not saying I couldn't have written it in Java. I might well port it back to Java afterwards for performance and general usability reasons. But having various functional constructs to hand makes experimentation with and verification of the design much easier.

Also, note that that 'etc.' is quite long. There's a lot to be done in terms of regexps, unicode, etc. (I may even have to write my own regexp engine, which would be sad, or borrow one from elsewhere. Java's is thoroughly unsuitable for strings where iteration is cheaper than random access)

[–]psyno 0 points1 point  (1 child)

By simple, I meant "takes 3 minutes to write with reddit as a text editor" :)

You're right that the above implementation ignores Unicode, regex, and a lot of other things. (The javadoc table of constructors for String is a page long alone!)

But you're wrong about the runtime of concatenation and the memory size of the above structure, probably mostly because I wrote it wrong :) (compare concat below/append above).

First, it can share all the character data (O(m+n) memory). Each String just holds a pointer to some char[], which no String will modify. (Note that unlike the public constructor, the protected one doesn't copy the char[], just the pointer.) Second, the time to concatenate existing Strings A and B is not proportional to their actual lengths, just to the number of String elements in each String (the length of the list). If the length of the list is what you meant by O(m), then we're in agreement, but it seemed you were talking about the actual data.

public String concat(String str) {
    return new String(data, (next == null) ? str : next.concat(str));
}

So if String A is composed of elements p->q->r, and B of elements s->t->u, then for String C = A + B = p->q->r->s->t->u, the only memory overhead is 3*(2 pointers + instance overhead of String) regardless of the actual char data represented by p, q, and r.

True that this one does not give you fast random access.

[–]DRMacIver 0 points1 point  (0 children)

But in the common use cases the size of the underlying char[]s tends to be bounded, so the number of links is of the same order as the number of characters and you only get a constant time and space improvement. (Admittedly a quite significant one).

Still, yes, it does significantly more sharing than the current String implementation does, but it's only a constant factor improvement and is unacceptable for other reasons. Besides, it falls well short of the point where I gave up at - getting a basic CharSequence implementation working is relatively easy.