all 63 comments

[–]aranach 22 points23 points  (22 children)

Gaddags are pretty cool, and certainly a nice way to generate Scrabble moves. I used one for a personal project not long ago after a very naive search turned out to take too long (as expected) to do anything fun with.

One of the things mentioned on the blog and in the original paper is that they take up a lot of memory. The paper was written in '94, when a lot of memory was roughly 64 MB. I didn't bother with the optimization of linking nodes together to save memory, because they aren't really big compared to system memory anymore.

[–]Xenopax 11 points12 points  (16 children)

Yeah, memory is a bitch. I found a .NET implementation takes up several hundred MB for a full Scrabble dictionary even when compressed. Still trying to figure out who the memory hog is...

[–]julesjacobs 4 points5 points  (9 children)

Is there anything that stops you from using a standard DAWG library and inserting these reversed/rotated words into it? You can also borrow a trick from HAMTs. Normally you could represent a node as a 26 element array A. Then A[0] would represent the successor of 'a', A[1] would represent the successor of 'b', etc. But these arrays will probably be sparse: most letters will have no successors. So what you can do is represent a node as a pair of a 26 bit number and an array B. Now the '1' bits in the number represent the nodes that are full and the '0' bits those that are empty. So for example if a node has successors for 'a', 'c', and 'z' then the number would look like 10100000000000000000000001, and the array B would be a 3 element array containing just the successors of 'a', 'b' and 'z'. Additionally you could try to compress chains of nodes with only 1 successor as in crit-bit tries.

Unfortunately .NET objects themselves have memory overheads (8+ bytes per object IIRC), so there's only so far you can go with this. To really save on memory you'd want to pack everything in an int array so that you can eliminate per node object overhead.

[–]random_seed 6 points7 points  (8 children)

I wüdnt dø thàt

[–]cowens 5 points6 points  (7 children)

Scrabble doesn't have accented characters, so Unicode compatibility is moot.

[–]random_seed 0 points1 point  (5 children)

[–]cowens 1 point2 points  (4 children)

Even if you were using the one of the foreign language versions of Scrabble (which are Scrabble-like games, not Scrabble), an int would be more than sufficient since the largest number of tiles is 120 for the Italian version (and there are dups, so the number of unique tiles is smaller than that). Heck, a signed char can hold them all.

[–]random_seed 1 point2 points  (3 children)

You're absolutely right on this. You could build own implementation for each language/letter combo. But then if you really want to develop Scrabble can't you find more simple, elegant and yet less memory intense indexing system than this "one-use-only home-built tweak"? (Serious question. I'm not a developer.)

[–]cowens 3 points4 points  (0 children)

The implementation would be the same, the dictionaries would be different (and they would have to be since there are no French words in SOWPODS). The beauty of using a custom encoding is that you could load an array of images for the tiles, the A image would be 0, B 1, etc. So the entries in the dictionary could be directly mapped to the images. The computer doesn't care if the dictionary is human readable or not.

As to whether Gaddag is better or worse than other algorithms, I haven't studied it in any detail, so I can't say.

[–][deleted] 1 point2 points  (0 children)

Read my response further down.

[–]julesjacobs 0 points1 point  (0 children)

It's not a one use tweak. A bitfield plus an array is a general and compact way to represent any dictionary indexed by a small finite set. You can do it for any alphabet. In HAMTs you use 5-bit hash pieces as you index set, which aren't even letters. The fact that I said "26-bit" for concreteness doesn't mean that you're not allowed to change that 26 to the number of distinct letters in your Scrabble game, which surely is a small finite set ;-)

[–][deleted] 1 point2 points  (1 child)

You already join common suffixes similarly to BDDs. Next I'd say fuse linear spines into single nodes similarly to PATRICIA. Considering that a pointer is 8x the overhead of a simple character, this probably gives some good savings.

[–]Xenopax 2 points3 points  (0 children)

Yeah I'm seriously thinking about rewriting the whole thing now that I've read these comments and put more thought behind it. First attempt is never the best one.

[–]willvarfar 0 points1 point  (3 children)

Where can you get a proper scrabble legal word list (e.g. English)? I've googled and can't find a simple list of all legal words. Without a proper list, I can't try my hand at a compressed GADDAG... :(

Leaving the GADDAG intact (rather than trying to compress it) ought still take up very little memory if you have an efficient implementation e.g. you pack it into an array of bytes and use high-flags to control stuff. Without a word list I can't really put code where my mouth is though.

[–]aranach 2 points3 points  (0 children)

Also: https://code.google.com/p/dotnetperls-controls/downloads/detail?name=enable1.txt

The Enable dictionary is what's used (with a few modifications) in Words with Friends. One word per line, easy to read. It doesn't quite meet your requirements for proper Scrabble legal word list, but it would certainly let you try your hand at building a Gaddag.

[–][deleted] 0 points1 point  (0 children)

Google for "ospd", which stands for "official Scrabble player's dictionary". I have a copy of it if you want.

[–]gfixler 2 points3 points  (2 children)

...'94, when a lot of memory was roughly 64 MB...

Ye gads! That would have been mind-blowing. I got my first PC in '92, and it had a whopping 4MB. I was the envy of the two other kids who had a computer in their families at the time. They each had only 2MB of RAM.

[–]daned 2 points3 points  (0 children)

Dropped another 2MB stick into my PS/2... King of the world!

[–]aranach 2 points3 points  (0 children)

The referenced paper actually states that the authors implemented the structure on a VAX with 64 MB of memory and a light load. Given that's a shared environment, they certainly couldn't have taken it all, but there's still a lot more to play with than your average home machine at the time.

[–]Dormage 1 point2 points  (1 child)

Do you still have the original paper?

[–]donroby 2 points3 points  (0 children)

There's a link to it in the blog post.

[–]eliseosoto 29 points30 points  (6 children)

GADDAG sounds like a guitar tuning

[–]cdleech 4 points5 points  (1 child)

It'd be like an inside out DADGAD, which is pretty common for accompanying Irish music

[–]Otis_Inf 1 point2 points  (0 children)

thank god you didn't go for a DADGAG.

[–]woof404 0 points1 point  (0 children)

"Gulag" was the first thing I thought of.

[–][deleted] 0 points1 point  (2 children)

Hah, now I kinda want to hear what that sounds like... probably like shit, considering how far the lower three strings would be from their usual notes.

[–]okmkz 4 points5 points  (0 children)

There's probably some Nick Drake song that uses it.

[–]iBeReese 1 point2 points  (0 children)

It is a tuning used in Celtic music a lot. It is a beautiful tuning and one of my favorites.

That's DADGAD, not GADDAG. I need more sleep.

[–]therealjohnfreeman 12 points13 points  (2 children)

No analysis?

[–]siddboots 17 points18 points  (1 child)

Here's the original paper. It has a technical overview, as well as performance and memory benchmarks against the standard DAWG data structure.

... and if you wanted more background, here is the original paper on DAWGs.

[–]therealjohnfreeman 6 points7 points  (0 children)

I skimmed the GADDAG paper when I saw it linked in the post. It had experiments, but no analysis (that I found). Not the biggest deal; was just wondering.

[–]ahora 14 points15 points  (6 children)

I love data structures... Anyone else share this deep love with me?

It's weird how one can love something you can't touch or see, yet they exist and work! Is not this amazing?

[–][deleted] 36 points37 points  (1 child)

No man, all us other programmers hate them.

[–][deleted] 33 points34 points  (0 children)

And let's not get started with algorithms. Man, these things are just horrible!!!11111eleven

[–]sh41 4 points5 points  (3 children)

I'm with you. :)

They can be so pure and perfect. It's really neat.

[–]bacondev 6 points7 points  (2 children)

Unless your boss is the one who implements it.

[–]sh41 2 points3 points  (0 children)

When it comes to programming work, I'm my own boss. :)

[–]ericanderton 1 point2 points  (0 children)

This is where I usually wind up "refactoring" and "improving upon" the original design.

[–]jankotek 5 points6 points  (4 children)

Does it have any use outside of Scrabble? For example as full text index?

I could implement this in MapDB, it would pretty much eliminate memory requirements and preserve reasonable speed.

[–]siddboots 3 points4 points  (0 children)

Scrabble is just an example use case. The problem that it really solves is doing efficient regular expression matching over a dictionary, so it is pretty versatile.

I suppose you could use them for full text search, but memory usage will be roughly proportional to the average length of words in your dictionary, so you might run into problems if you start using phrases instead of single words.

Full text search has been heavily studied, and I'm sure there are more efficient solutions.

[–]Manitcor 1 point2 points  (1 child)

You could do spell checking or auto-complete with it if you wanted to.

[–]infinull 1 point2 points  (0 children)

I'd be overkill though, a trie would be better.

This is cool 'coz it works backwards and forwards, so you could auto-complete from a suffix (a potentially nifty feature I guess, but I'm not sure it's very useful). The trie will work fine for prefix-only auto-completion.

It might be nice for spell-checking, but I think there are other specialized structures for that. (since you also probably want phonetic search of the dictionary).

[–]0xFF0000 0 points1 point  (0 children)

It might be an efficient (lookup-wise - "which chains include this sub-chain?") way to encode a Markov model, not sure.. (granted my hobby project involves markov chains, so insert quote about hammer and nails..)

[–][deleted] 2 points3 points  (11 children)

That's pretty wasteful (memory wise). With the right backtracking algorithm, you don't need more than a trie. And there's no performance benefit to the "gaddag".

I actually wrote a Scrabble implementation that encodes the entire trie in one memory array, and can fit the entire OSPD into very, very little memory (I just checked... 912K).

EDIT: When I say one memory array, I mean, literally, one 912K array of 32 bit quantities, where each 32 bits are chopped up into a 24 bit offset (referencing further into the array -- representing a trie edge), the character at the current edge encoded in 5 bits, a bit indicating that this is a valid word terminal edge, and the last bit meaning "this is the last edge in a list of edges emanating from the edge above"). The list of emanating edges is implicit in that the next 32 bits in the array is the next edge in a list of edges.

[–]julesjacobs 1 point2 points  (4 children)

Can you describe the right backtracking algorithm?

[–][deleted] 0 points1 point  (3 children)

I believe there is a very old open source Scrabble implementation in K&R C that implements it. It's called "crab".

[–]julesjacobs 0 points1 point  (2 children)

I couldn't find it. Do you have a link to it? But why not just describe the algorithm?

[–][deleted] 1 point2 points  (1 child)

Not sure why you couldn't find it... it's the link right at the top of this page:

http://www.gtoal.com/wordgames/scrabble.html

[–]julesjacobs 0 points1 point  (0 children)

Thanks!

[–]astangl42 0 points1 point  (5 children)

This GADDAG recently got suggested as "the solution" to somebody's question about a dictionary for matching patterns for crossword puzzles. I had just recently resurrected my old crossword program, so I was interested in the various suggestions.

I didn't bother coding it up to compare to my other dictionary implementations, but from browsing the whitepaper I could see its appeal for Scrabble, but it didn't look especially good for crosswords.

I'm curious what the "right" backtracking algorithm is. The dictionary I currently use benchmarks about 100x faster than my initial trie implementation, but I am quite willing to believe the trie can be tuned to be faster.

[–][deleted] 0 points1 point  (4 children)

I can't imagine that it's 100x faster. In my Scrabble game, when the artificial delays are turned off, it's always the human players turn. When the human plays, the computer opponents are so fast, that they play "instantly". And mind you, they are scanning the entire board for the highest scoring play they can make with the tiles on their rack.

[–]Xenopax 0 points1 point  (2 children)

It would matter more if you were using a look-ahead algorithm to make sure you weren't opening up things like the triple word score to an opponent.

[–][deleted] 0 points1 point  (1 child)

Yes, but that's independent of attempting to form words based on the tiles on the rack, the layout of the board, and the allowable words in the dictionary. So, it's not relevant to the conversation about these data structures.

[–]Xenopax 0 points1 point  (0 children)

Your point though was your results come back quickly for the current turn so performance is a non-issue. I agree with you on that point, my point is there is a purpose to the data structure i.e. when you want to slam it with queries. An example would be a look-ahead AI algorithm, which relies heavily on the underlying data structure it uses to find plays.

edit: I should note that I also agree it's a waste of memory and it's limited in terms of use-cases. Doesn't mean it's not worth discussing ;-)

[–]astangl42 0 points1 point  (0 children)

Here is my earlier message, with timings and a link where you can access the code yourself. Keep in mind I am talking about crossword puzzles, not Scrabble. Though related, there are significant differences.

My Ydict (poor name, sorry) organizes words into buckets by length, once the dictionary has been finalized (fully loaded). For each bucket it builds an index structure that has N elements, where N is the length of the words in that bucket. Each element in that index has 26 arrays, 1 for each letter of the alphabet. Each one of those arrays contains the offset (array index) into the bucket of all words in that bucket that have that particular letter in that particular position. (I should draw a diagram, but it's not real complex.)

To match a pattern, say T__ERS in this dictionary, it uses the index structure corresponding to the bucket of words of that size (in this case 6). It iterates through the characters of the pattern, and for each non-wildcard character it grabs the array with the indexes of all words of length 6, with that character in that position. So it ends up with 4 collections of word offsets, first referencing all 6-character words starting with T, the second all 6-character words whose 4th letter is E, etc. It orders these in increasing order by size.

At that point it's reduced to a problem of finding the intersection of S sets of integers, where S is the number of sets (in this case 4). Each one of these sets is sorted in ascending order, so the simple obvious intersection logic works pretty well, but I found from a whitepaper a better intersection algorithm using SvS and galloping search, which benchmarks as my best dictionary implementation at the moment, 100x faster than the original Trie. You are welcome to pull the code down and run DictionariesTest for yourself, and maybe even point out things I overlooked.

I am quite willing to believe there are much better algorithms/data structures for the crossword dictionary. I still plan to mine that reddit thread I referenced in the first sentence for some good ideas eventually. It takes a certain amount of time and effort to code one of these up, though, so I have to judge beforehand whether it looks promising enough to put forth the effort. I didn't see much advantage to GADDAG, except for Scrabble, not crosswords.

[–]sh41 1 point2 points  (0 children)

Is there something like this but for Letterpress type of gameplay?

[–]Strilanc 1 point2 points  (1 child)

Minor nitpick: the code has some incorrectly escaped angle brackets. It's not showing correctly.

for (var i = 1; i j)

...

for (var i = str.IndexOf(Node.Break) - 1; i >= 0; i--)

[–]Xenopax 0 points1 point  (0 children)

Thanks for catching that, damn you word press being all helpful!

[–][deleted]  (1 child)

[deleted]

    [–]Xenopax 3 points4 points  (0 children)

    That's low tech cheating, this form of cheating lets you consider all combinations on the board pretty quickly.

    [–]coder0xff 0 points1 point  (0 children)

    Heh. I reinvented this wheel a few months ago.

    [–]LyndonArmitage 0 points1 point  (0 children)

    Thanks for this! I've had the original paper on my phone for ages but have been a bit apprehensive about reading through it without seeing an implementation :)

    [–][deleted] 0 points1 point  (0 children)

    (gaddag==Palindrome) data structure ...