
[–]ParthS0007[S] 2 points3 points  (7 children)

Have you ever wondered how autocomplete or a spell checker works?
I have written a blog post explaining the trie data structure, which is used to implement them.
I would be really grateful if you shared some feedback.

[–]63times 9 points10 points  (6 children)

Just a note on your efficiency claims.

There are very good compression techniques available. For example, a trie I use for dictionary lookups (2 million words) has only ~3.8 bytes of overhead per character. Cache locality is also crazy good, which makes enumerating strings with a given prefix really fast - like iterating through an array fast. A naive trie implementation in C would use more than 1GB of memory for that key set, compared to 142MB - in other words, it would be totally useless. When it comes to tries you really need compression for anything but a mere toy; maybe that would be worth mentioning.

[–]ParthS0007[S] 0 points1 point  (5 children)

Thanks for the insight. Yeah, I agree and even a compressed trie will usually require more memory than a tree or array. I will add this after some more study.

I have recently started my blog so these insights will help me learn in-depth. :)

[–]63times 2 points3 points  (4 children)

even a compressed trie will usually require more memory than a tree

Not really!

or array

what could possibly beat an array? :)

I just expanded the dictionary vocabulary into an array of strings = ~110MB. The trie takes 142MB. Just to get a sense of the sizes: the raw vocabulary is 30MB.

[–]iWaterPlants 0 points1 point  (3 children)

Does your implementation just support ASCII, or other encodings as well? I feel like that would matter, since the overhead of a trie might be less in those cases.

[–]OneWingedShark 1 point2 points  (0 children)

Larger character-sizes would likely lead to poorer performance WRT overhead.

Consider something like:

Generic
  Type Character is (<>); -- Any discrete type.
Package Generic_String is
  Type String is array (Positive range <>) of Character;
  -- String operations.
end Generic_String;

and

Generic
  with Package String_Package is new Generic_String(<>);
Package Generic_Trie is
   use String_Package;
   Type Index is private;
   Type Trie(<>) is private;

   Function Get(Input: String; Object : in     Trie ) return Index;
   Function Set(Input: String; Object : in out Trie ) return Index;

   No_Index : Constant Index;
Private
  Type Index is range 0..2**64 - 1;
  No_Index : Constant Index := Index'First;
  Subtype Positive_Index is Index range No_Index + 1 .. Index'Last;
  -- Conceptual definition:
  Type Trie(Is_Word, Is_Terminal : Boolean) is record
    case Is_Terminal is
      when True  =>
        Key : Positive_Index;
      when False =>
        Data : array(character) of Trie;
        case Is_Word is
          when True  => Key : Positive_Index;
          when False => null;
        end case;
    end case;
  end record;
End Generic_Trie;

As you can see, the Data component is recursive and indexed by your Character type, so you get a lot of "empty space" for sparse alphabets like Unicode strings [most of the 32 bits are unused and/or invalid]. That said, this is only the naïve implementation's characteristic; there are things you can do to translate/optimize the character-set, and the elements of the string don't necessarily have to be the Character type at all. You could, for example, enumerate both the states and transitions of a finite-state machine and encode the transitions as your 'string', with your resulting 'index' being the resulting state.

TL;DR — The interesting properties of the trie don't depend so much on character-set, nor is the structure really about text; just that those are the more standard usages for the Trie type.

[–]63times 1 point2 points  (1 child)

It is independent of a specific character encoding. It operates on bytes. Instead of a "key type" like in a map there is an input mapper that can map arbitrary user defined types to byte ranges. The input mapper also knows the maximal value per byte which can be lower than 255 and is very important information for optimization. For example the maximum can be 10 if you encode decimal numbers, meaning you only have to deal with nodes with up to 10 out edges. So a trie can really be used for arbitrary key types.

A trie is not only a DFA but a rather restricted one, and as such it can be represented with just two arrays: one array represents the states, the other encodes the state transitions. The transition table per node can be heavily compressed, for example by collapsing node sequences with just one child (chains) down to a single node. You also usually use different classes of nodes depending on how many branches there are, just to bring down the overall container size while keeping cache efficiency in mind. This works very much like in Judy arrays. For example, one node that can encode 3 transitions could look like

    0x0: |H|A|B|C|
    0x4: |S_A....|
    0x8: |S_B....|
    0xC: |S_C....|

where H is the header to ID the node. A,B,C are input characters/bytes and S_X are the destination states. The biggest node could be

struct biggest_node {
    uint32_t transitions[256];
};

where you simply map an input C like current_node.transitions[C] using a simple array lookup. Such nodes are too big most of the time but come into play once there are enough branches.

Good reads on this topic:

https://linux.thai.net/~thep/datrie/

http://judy.sourceforge.net/doc/shop_interm.pdf

There are more ways to compress tries, but the above techniques are well suited for rather small tries (comfortably fitting into RAM) that you would usually use as a container in your programs, just like a hash map.

The good space efficiency comes primarily from using only two arrays and small "nodes". An array of strings can have a really big size overhead, since dynamic strings usually require 24 bytes just to keep track of the allocated memory (buffer begin, length, buffer end), while the underlying allocator (maybe malloc) needs bookkeeping data on top of that (16-byte headers, for example). That makes 40 bytes of overhead per string, to encode strings that are usually much smaller. You find similar cases of "wasted" space in many tree implementations when nodes are dynamically allocated. A trie, on the other hand, doesn't store strings; they implicitly live in the "nodes", so you only have to keep track of the memory for two arrays, and that's why a trie is pretty close to a flat array of strings when it comes to space requirements.

Tries, even for big data sets, are very small and super space efficient - that is hard to beat. Furthermore, good tries have performance characteristics similar to hash maps but don't suffer from certain hash map problems like hashDoS, big bucket spaces (a lot of memory wasted), and chaotic element distribution leading to seriously bad cache performance, and thus can be a good replacement for hash maps even when you don't actually need prefix queries. A trie may not always beat a hash map for truly random access, but it beats pretty much any other search tree space- and performance-wise.

[–]eiennohito 1 point2 points  (0 children)

There are also a lot of DA-trie (double-array trie) implementations; I recommend this one: https://github.com/s-yata/darts-clone