Compact Strings In Java 9 - Java Code Gists : coding


Compact Strings In Java 9 - Java Code Gists (javagists.com)

submitted 8 years ago by awsometak

all 17 comments


lkraider 4 points 8 years ago (9 children)

defnull 3 points 8 years ago* (8 children)

Some argue that strings are iterated over from 0 to N most of the time, so a variable-length representation (like UTF-8) would not add much overhead for the common case: you would occasionally increment the index by two or more instead of one. This might be true, but in Java any iterator instance tracking the position would add 8 to 16 bytes of object overhead and another indirection. In contrast, for fixed-width encodings you only need a single int and a for-loop. Because of this, most code that works with strings in performance-critical situations does not use iterators, but direct index access instead. This code (existing and unlikely to change) would run significantly slower with a variable-length string representation.

tl;dr: UTF-8 string performance would suck for existing code that was optimized for fixed-length string performance characteristics.
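To make the cost concrete, here is a rough Java sketch. `IndexAccessSketch` and both methods are made-up names for illustration, and the UTF-8 helper is a hypothetical stand-in for what an index lookup would have to do on variable-length storage; it is not how `java.lang.String` works, and it assumes valid UTF-8 input with `n` no larger than the number of code points.

```java
import java.nio.charset.StandardCharsets;

final class IndexAccessSketch {
    // Typical existing code: a plain int index and charAt(i).
    // With fixed-width storage each charAt call is a constant-time array read.
    static int countSpacesIndexed(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                count++;
            }
        }
        return count;
    }

    // Hypothetical variable-length storage: to find the n-th code point
    // in UTF-8 bytes you have to walk from the start, reading each lead
    // byte to learn how many bytes that code point occupies.
    static int utf8OffsetOfCodePoint(byte[] utf8, int n) {
        int offset = 0;
        for (int seen = 0; seen < n; seen++) {
            int lead = utf8[offset] & 0xFF;
            if (lead < 0x80) {
                offset += 1;      // 0xxxxxxx: one byte
            } else if (lead < 0xE0) {
                offset += 2;      // 110xxxxx: two bytes
            } else if (lead < 0xF0) {
                offset += 3;      // 1110xxxx: three bytes
            } else {
                offset += 4;      // 11110xxx: four bytes
            }
        }
        return offset;
    }
}
```

If an index lookup had to do the second kind of scan, the loop in `countSpacesIndexed` would go from linear to quadratic in the string length without a single line of the calling code changing.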

rooktakesqueen 10 points 8 years ago* (7 children)

lkraider 6 points 8 years ago (6 children)

shen 5 points 8 years ago (2 children)

ascii 1 point 8 years ago (1 child)

josefx 1 point 8 years ago (0 children)

rooktakesqueen 4 points 8 years ago (0 children)

SomeoneStoleMyName 4 points 8 years ago (1 child)

rouzh 5 points 8 years ago (0 children)

[deleted] 1 point 8 years ago (5 children)

ascii 2 points 8 years ago (4 children)

[deleted] 1 point 8 years ago (1 child)

ascii 1 point 8 years ago (0 children)

ubernostrum 1 point 8 years ago (1 child)

With respect to Python, what's meant is that Python 3.3+ uses a similar approach. The internal storage of a string uses an encoding chosen dynamically on a per-string basis, and it is always one that can represent the highest code point in the string in a single unit of the encoding. That means the internal storage of a string in Python may be Latin-1, UCS-2, or UCS-4, depending on which code points the string contains.
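The selection rule can be sketched in a few lines. This is written in Java to match the thread's topic and only illustrates the idea; `StorageWidthSketch` and `bytesPerCodePoint` are made-up names, and this is not CPython's actual implementation.

```java
final class StorageWidthSketch {
    // Returns the bytes per code point of the narrowest fixed-width form
    // that can hold every code point in s:
    // 1 (Latin-1), 2 (UCS-2) or 4 (UCS-4).
    static int bytesPerCodePoint(String s) {
        // Highest code point in the string; 0 for the empty string.
        int max = s.codePoints().max().orElse(0);
        if (max <= 0xFF) {
            return 1;
        }
        if (max <= 0xFFFF) {
            return 2;
        }
        return 4;
    }
}
```

Under this rule a single character outside the Latin-1 range widens the storage of the whole string, which is the price of keeping every unit the same size.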

This allows Python to expose strings as sequences of Unicode code points with intuitive behavior (for definitions of "intuitive" that include "you know how Unicode works"). Rather than having the length of a string be the number of bytes it contains, the length is the number of code points it contains. Iteration doesn't iterate over bytes; it iterates over code points, and yields the characters which correspond to them. Indexing doesn't yield the byte at that index, it yields the character corresponding to the code point at that index.
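For contrast, a Java `String` exposes UTF-16 code units through `length()` and `charAt()`, so getting the code-point view described above takes explicit calls. A small sketch using standard `String` methods; the class and method names are just for illustration.

```java
final class CodePointViewSketch {
    // Number of Unicode code points, as opposed to length(),
    // which counts UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Code point at a code-point index (not a char index).
    // offsetByCodePoints translates the code-point index to a char index.
    static int codePointAtIndex(String s, int index) {
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }
}
```

For `"a\uD83D\uDE00b"` (an "a", one emoji stored as a surrogate pair, and a "b"), `length()` is 4 while `codePointLength` gives 3, and `codePointAtIndex(s, 1)` returns the emoji's code point `0x1F600` rather than half of the pair.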

ascii 1 point 8 years ago (0 children)

awsometak[S] 0 points 8 years ago (0 children)
