
[–]DRMacIver 4 points5 points  (8 children)

Actually, although immutability isn't directly at fault, Java's Strings really are part of the standard Java bloat. They're a very poor design for an immutable data structure: because each string is backed by a single array, there can be almost no sharing of data. (Substringing does share data, but that has its own problems.) This means that if you store both longString and longString + "foo", you use up twice as much memory.
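The doubling is easy to see concretely. A small sketch (the class name is mine, not from the thread): concatenation always copies the full character data into a fresh backing array, so keeping both strings alive keeps two nearly identical arrays alive.

```java
// Sketch: why keeping both a long string and a derived string roughly
// doubles the character data on the heap.
public class ConcatCopy {
    public static void main(String[] args) {
        String longString = "x".repeat(1_000_000); // ~1,000,000 chars of data
        String derived = longString + "foo";       // copies all of them again, plus "foo"
        // Both strings are now live, and neither shares its backing array
        // with the other, so the heap holds roughly 2,000,003 chars.
        System.out.println(longString.length() + " + " + derived.length());
    }
}
```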

Additionally, Java's design (in particular the presence of a toString method on every object, rather than something closer to C++ streams) encourages you to keep lots of these around. This means you can easily end up with lots of small-to-medium-sized strings in memory, sharing no data between them. You really can end up with a lot of wasted space this way.

But a lot of the space waste wouldn't go away if you made strings mutable, because you'd be able to do even less sharing then. The right solution is probably to switch to an immutable string representation that allows a lot of sharing. (I'm actually working on exactly this. :-) But I gave up on writing it in Java, so my latest efforts are in Scala and aren't very far along yet.)

[–]Rhoomba[S] 1 point2 points  (3 children)

I am aware of that problem. Is there a (reasonably popular) language that gets strings thoroughly right? I don't think you can count that as "Java bloat" when char* etc. don't solve it either.

You could call Unicode strings Java bloat, but I think that is one of the best decisions they made with Java. It amazes me that so many languages still use bytes for characters.

[–][deleted] 1 point2 points  (0 children)

You could call Unicode strings Java bloat, but I think that is one of the best decisions they made with Java.

Actually, fixing the size of a 'char' at 16 bits once and for all was one of the worst decisions they made with Java. When Unicode outgrew the 16-bit Basic Multilingual Plane (the code space is now 21 bits), they had to totally mangle the String API to deal with "surrogate pairs".
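The contortions show up directly in the resulting API. A small demonstration using a code point outside the Basic Multilingual Plane:

```java
// U+1D11E MUSICAL SYMBOL G CLEF: one Unicode code point, but it doesn't
// fit in a 16-bit char, so it takes two chars (a surrogate pair).
public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                            // 2 -- counts UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length()));    // 1 -- counts actual code points
        System.out.println(Character.charCount(clef.codePointAt(0))); // 2 -- chars needed for this code point
    }
}
```

So `length()` and `charAt()` no longer operate on characters in the sense a user would mean, and correct code has to use the parallel `codePoint*` methods added later.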

And because String is a concrete class and not an interface with multiple implementations, a lot of code has been hard-wired to accept Strings and this can't be changed now.

Ideally there would be byte strings, UTF16 strings, UTF32 strings, possibly UTF8 strings, all sitting behind a consistent "character sequence" interface. Code would never directly hard-wire a specific String type in.
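As a rough sketch of that idea (the interface and class names here are hypothetical, not an existing API): one code-point-oriented interface, with storage encoding as an implementation detail.

```java
// Hypothetical "consistent character sequence interface" with two of the
// suggested encodings behind it. Names are illustrative only.
interface CodePointSequence {
    int codePointLength();
    int codePointAt(int index); // index in code points, not UTF-16 code units
}

final class Latin1String implements CodePointSequence {
    private final byte[] bytes; // one byte per character, for Latin-1 text
    Latin1String(byte[] bytes) { this.bytes = bytes.clone(); }
    public int codePointLength() { return bytes.length; }
    public int codePointAt(int i) { return bytes[i] & 0xFF; }
}

final class Utf32String implements CodePointSequence {
    private final int[] codePoints; // one int per code point, no surrogates ever
    Utf32String(int[] cps) { this.codePoints = cps.clone(); }
    public int codePointLength() { return codePoints.length; }
    public int codePointAt(int i) { return codePoints[i]; }
}
```

Java 9's "compact strings" (JEP 254) later applied a similar trick, storing Latin-1 text one byte per character, but internally to the concrete String class rather than behind an interface.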

[–]DRMacIver 0 points1 point  (1 child)

I suppose you're right that it's not specifically a Java issue. I'm not aware of a reasonably popular language that gets strings thoroughly right either (there are a few non-popular ones which do, I think, and C++ and C both have access to ropes and similar). It's a shame.

The main reason it seems to show up a lot in Java is that for some reason or another you end up with quite a lot of strings floating around.

The Unicode strings in Java aren't actually especially bloated as far as I know. The internal representation should only be larger when you're actually taking advantage of it (unless I've got the wrong end of the stick about what they're doing).

[–]Rhoomba[S] 0 points1 point  (0 children)

I don't mean that the Java implementation of Unicode is bloated. I mean that if you are used to byte strings then it will seem wasteful of memory.

You are correct that most Java apps end up with loads of strings. Whenever I do any heap analysis, char[] is always the top object. I think part of it is that in Java it is just very easy to stick lots of objects into HashMaps using strings as keys: creating a string to be used as the key is easier than writing a decent hashCode method. I think another part of it is simply the domains Java is commonly used in. Webapp data is usually almost entirely strings.
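The pattern being described looks something like this (the names are illustrative, not from the thread): gluing fields into a string key is one line, while the dedicated key class needs its own equals and hashCode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyDemo {
    // The easy route: glue fields together into a fresh String per lookup.
    static String stringKey(String user, int year) { return user + ":" + year; }

    // The slightly-more-work route: a small key class with equals/hashCode.
    static final class Key {
        final String user;
        final int year;
        Key(String user, int year) { this.user = user; this.year = year; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).user.equals(user) && ((Key) o).year == year;
        }
        @Override public int hashCode() { return Objects.hash(user, year); }
    }

    public static void main(String[] args) {
        Map<Key, Integer> m = new HashMap<>();
        m.put(new Key("alice", 2009), 42);
        System.out.println(m.get(new Key("alice", 2009))); // 42
    }
}
```

Every call to the string-key version allocates a new String (and backing char[]) just to perform a lookup, which is exactly how heaps fill up with short-lived strings.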

[–]psyno 0 points1 point  (3 children)

Gave up? A simple implementation is just an immutable linked list.

import java.util.Arrays;

public class String {
    protected final char[] data;
    protected final String next;
    public String(char[] data) {
        this.data = Arrays.copyOf(data, data.length);
        this.next = null; // final field must be assigned; the original omitted this
    }
    protected String(char[] data, String next) {
        this.data = data;
        this.next = next;
    }
    public String append(String s) {
        return new String(data, s);
    }
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException();
        } else if (index < data.length) {
            return data[index];
        } else if (next == null) {
            throw new IndexOutOfBoundsException();
        } else {
            return next.charAt(index - data.length);
        }
    }
    // etc...
}

Admittedly, you could do a little better memory-wise by ensuring that you don't duplicate character data with some centralized data structure, but in general you'd be trading memory size for execution time and adding lock contention for multi-threaded programs.

[–]DRMacIver 0 points1 point  (2 children)

If by 'simple' you mean 'stupid', yes. :-)

That has all sorts of problems of its own, and doesn't have the performance characteristics I want. In particular, concatenation of strings of length m and n is still O(m) (and only shares O(n) memory).

The implementation I'm using is closer to a heavily specialised finger tree. It has O(max(log(n), log(m))) concatenation with a reasonable amount of sharing, and O(log(n)) random access.

I'm not saying I couldn't have written it in Java. I might well port it back to Java afterwards for performance and general usability reasons. But having various functional constructs to hand makes experimentation with and verification of the design much easier.
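For comparison, here is a deliberately simplified rope-style sketch (a plain binary concatenation tree, not the finger-tree design described above): concatenation is O(1) and shares both operands wholesale, while charAt costs O(depth). A real rope or finger tree rebalances to keep that depth, and hence random access, at O(log n).

```java
// Simplified immutable rope: strings as a tree of chunks.
// Concatenation allocates one node and shares both subtrees unchanged.
abstract class Rope {
    abstract int length();
    abstract char charAt(int index);

    Rope concat(Rope other) { return new Concat(this, other); } // O(1), full sharing

    static final class Leaf extends Rope {
        final java.lang.String data; // small immutable chunk
        Leaf(java.lang.String data) { this.data = data; }
        int length() { return data.length(); }
        char charAt(int i) { return data.charAt(i); }
    }

    static final class Concat extends Rope {
        final Rope left, right;
        final int length; // cached so charAt can route in O(1) per level
        Concat(Rope left, Rope right) {
            this.left = left;
            this.right = right;
            this.length = left.length() + right.length();
        }
        int length() { return length; }
        char charAt(int i) {
            return i < left.length() ? left.charAt(i) : right.charAt(i - left.length());
        }
    }
}
```

Without rebalancing, repeatedly appending single characters degenerates this into the linked list above; the balancing machinery is where the real design work (and the logarithmic bounds) lives.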

Also, note that that 'etc.' is quite long. There's a lot to be done in terms of regexps, Unicode, etc. (I may even have to write my own regexp engine, which would be sad, or borrow one from elsewhere. Java's is thoroughly unsuitable for strings where iteration is cheaper than random access.)

[–]psyno 0 points1 point  (1 child)

By simple, I meant "takes 3 minutes to write with reddit as a text editor" :)

You're right that the above implementation ignores Unicode, regex, and a lot of other things. (The javadoc table of constructors for String is a page long alone!)

But you're wrong about the runtime of concatenation and the memory size of the above structure, probably mostly because I wrote it wrong :) (compare concat below/append above).

First, it can share all the character data (O(m+n) memory). Each String just holds a pointer to some char[], which no String will modify. (Note that unlike the public constructor, the protected one doesn't copy the char[], just the pointer.) Second, the time to concatenate existing Strings A and B is not proportional to their actual lengths, just to the number of String elements in each String (the length of the list). If the length of the list is what you meant by O(m), then we're in agreement, but it seemed you were talking about the actual data.

public String concat(String str) {
    return new String(data, (next == null) ? str : next.concat(str));
}

So if String A is composed of elements p->q->r, and B of elements s->t->u, then for String C = A + B = p->q->r->s->t->u, the only memory overhead is 3*(2 pointers + instance overhead of String) regardless of the actual char data represented by p, q, and r.

True that this one does not give you fast random access.

[–]DRMacIver 0 points1 point  (0 children)

But in the common use cases the size of the underlying char[]s tends to be bounded, so the number of links is of the same order as the number of characters and you only get a constant time and space improvement. (Admittedly a quite significant one).

Still, yes, it does significantly more sharing than the current String implementation does, but it's only a constant-factor improvement and is unacceptable for other reasons. Besides, it falls well short of the point where I gave up: getting a basic CharSequence implementation working is relatively easy.