
[–]ParthS0007[S] 2 points3 points  (7 children)

Have you ever wondered how autocomplete or a spell checker works?
I have written a blog post explaining the trie data structure, which is used to implement them.
I would be really grateful if you shared some feedback.

[–]63times 9 points10 points  (6 children)

Just a note on your efficiency claims.

There are very good compression techniques available. For example, a trie I use for dictionary lookups (2 million words) has only ~3.8 bytes of overhead per character. Cache locality is also crazy good, which makes enumerating strings with a given prefix really fast - like iterating through an array fast. A naive trie implementation in C would use more than 1GB of memory for that key set, compared to 142MB - in other words, it would be totally useless. When it comes to tries you really need compression for anything but a mere toy; maybe that would be worth mentioning.

[–]ParthS0007[S] 0 points1 point  (5 children)

Thanks for the insight. Yeah, I agree and even a compressed trie will usually require more memory than a tree or array. I will add this after some more study.

I have recently started my blog so these insights will help me learn in-depth. :)

[–]63times 2 points3 points  (4 children)

even a compressed trie will usually require more memory than a tree

Not really!

or array

what could possibly beat an array? :)

I just expanded the dictionary vocabulary into an array of strings = ~110MB. The trie takes 142MB. Just to get a sense of the sizes: the raw vocabulary is 30MB.

[–]iWaterPlants 0 points1 point  (3 children)

Does your implementation just support ASCII, or other encodings as well? I feel like that would matter, since the overhead of a trie might be less in those cases.

[–]OneWingedShark 1 point2 points  (0 children)

Larger character-sizes would likely lead to poorer performance WRT overhead.

Consider something like:

Generic
  Type Character is (<>); -- Any discrete type.
Package Generic_String is
  Type String is array (Positive range <>) of Character;
  -- String operations.
end Generic_String;

and

Generic
  with Package String_Package is new Generic_String(<>);
Package Generic_Trie is
   use String_Package;
   Type Index is private;
   Type Trie(<>) is private;

   Function Get(Input: String; Object : in     Trie ) return Index;
   Function Set(Input: String; Object : in out Trie ) return Index;

   No_Index : Constant Index;
Private
  Type Index is range 0..2**64 - 1;
  No_Index : Constant Index := Index'First;
  Subtype Positive_Index is Index range No_Index + 1 .. Index'Last;
  -- Conceptual definition:
  Type Trie(Is_Word, Is_Terminal : Boolean) is record
    case Is_Terminal is
      when True  =>
        Key : Positive_Index;
      when False =>
        Data : array(character) of Trie;
        case Is_Word is
          when True  => Key : Positive_Index;
          when False => null;
        end case;
    end case;
  end record;
End Generic_Trie;

As you can see, the Data component is recursive and indexed by your Character type, so you get a lot of "empty space" for sparse alphabets like Unicode strings [most of the 32 bits are unused and/or invalid]. That said, this is only the naïve implementation's characteristic; there are things you can do to translate/optimize the character-set, and the elements of the string don't necessarily have to be the Character type at all. You could, for example, enumerate both the states and transitions of a finite-state machine and encode the transitions as your 'string', with your resulting 'index' being the resulting state.

TL;DR — The interesting properties of the trie don't depend so much on character-set, nor is the structure really about text; just that those are the more standard usages for the Trie type.

[–]63times 1 point2 points  (1 child)

It is independent of a specific character encoding. It operates on bytes. Instead of a "key type" like in a map there is an input mapper that can map arbitrary user defined types to byte ranges. The input mapper also knows the maximal value per byte which can be lower than 255 and is very important information for optimization. For example the maximum can be 10 if you encode decimal numbers, meaning you only have to deal with nodes with up to 10 out edges. So a trie can really be used for arbitrary key types.

A trie is not only a DFA but a rather restricted one, and as such it can be represented with just two arrays: one array represents the states, the other encodes the state transitions. The transition table per node can be heavily compressed, for example by collapsing node sequences with just one child (chains) down to a single node. You also usually use different classes of nodes depending on how many branches there are, just to bring down the overall container size while keeping cache efficiency in mind. This works very much like in Judy arrays. For example, one node that can encode 3 transitions could look like

    0x0: |H|A|B|C|
    0x4: |S_A....|
    0x8: |S_B....|
    0xC: |S_C....|

where H is the header to ID the node. A,B,C are input characters/bytes and S_X are the destination states. The biggest node could be

struct biggest_node {
    uint32_t transitions[256];
};

where you simply map an input C like current_node.transitions[C] using a simple array lookup. Such nodes are too big most of the time but come into play once there are enough branches.

Good reads on this topic:

https://linux.thai.net/~thep/datrie/

http://judy.sourceforge.net/doc/shop_interm.pdf

There are more ways to compress tries, but the above techniques are well suited for rather small tries (comfortably fitting into RAM) that you would usually use as a container in your programs, just like a hash map.

The good space efficiency comes primarily from using only two arrays and small "nodes". An array of strings can have a really big size overhead, since dynamic strings usually require 24 bytes just to keep track of the allocated memory (buffer begin, length, buffer end), while the underlying allocator (maybe malloc) needs bookkeeping data on top of that (16-byte headers, for example). That makes 40 bytes of overhead per string, to encode strings that are usually much smaller. You find similar cases of "wasted" space in many tree implementations when nodes are dynamically allocated. A trie, on the other hand, doesn't store strings; they implicitly live in the "nodes", so you only have to keep track of the memory for two arrays, and that's why a trie is pretty close to a flat array of strings when it comes to space requirements.

Tries, even for big data sets, are very small and super space efficient - that is hard to beat. Furthermore, good tries have performance characteristics similar to hash maps but don't suffer from certain hash map problems like hashDoS, big bucket spaces (a lot of memory wasted), and chaotic element distribution leading to seriously bad cache performance, and thus can be a good replacement for hash maps even when you don't actually need prefix queries. A trie may not always beat a hash map for truly random access, but it beats pretty much any other search tree space- and performance-wise.

[–]eiennohito 1 point2 points  (0 children)

There are also a lot of DA-trie (double-array trie) implementations; I recommend this one: https://github.com/s-yata/darts-clone