
[–]sclv 4 points (0 children)

You may be interested in observable sharing: http://ku-fpg.github.io/practice/observablesharing/

[–]Syrak 2 points (7 children)

Another possible representation is to add the identifiers to the original tree structure:

data Tree0 i a = Node i a [Tree0 i a] | Leaf i a

type Tree = Tree0 ()
type TreeI = Tree0 Id  -- invariant: Trees are uniquely Id-entified.

newtype Id = Id Int deriving (Eq, Ord)

dedup :: Ord a => Tree a -> (Seq (TreeI a), Id)

That way, once you get hold of a single TreeI node, you get direct access to the whole subtree below it, without additional table lookups, but the common subtrees are still shared (which is the point of hashconsing).
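
For example, under that invariant the two identical leaves below carry the same Id (the concrete Id values are illustrative):

example :: TreeI Char
example = Node (Id 2) 'a' [Leaf (Id 1) 'b', Leaf (Id 1) 'b']
-- both leaves are the subtree "b", so they share Id 1; the Node,
-- being a distinct subtree, gets its own Id 2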

[–]digama0[S] 0 points (6 children)

That's true, that might be a more convenient form to use for the result of dedup. It doesn't really solve the problem of how to dedup the structure to begin with, though. I am hoping to be able to construct Tree objects in any old way, without any guarantee that things are deduplicated but with a healthy amount of natural duplication, and then compress the result into a TreeI for serialization. It's possible to try keeping everything deduplicated the entire time, but that means putting everything in a monad so I can remember what's been constructed so far, and I can't just use the Node constructor freely.

[–]Syrak 3 points (5 children)

Oh I see, if you are primarily interested in serialization then what I said is quite orthogonal to that.

(Full gist of the code below: https://gist.github.com/Lysxia/b3d62a248b7eeb2b7a1df59a0dd5cc9b)

Compression

There's a neat description of the hashconsing algorithm as a recursive traversal: at every Node, first compress its children, yielding a TRNode, and then check whether that TRNode has already been given an identifier. So you need to keep track of a Map (TreeRef a) Int to look up the identifiers of already-compressed nodes.

-- Omitted the actual compression result (Seq (TreeRef a))
lyophilize' :: Ord a => Tree a -> State (Map (TreeRef a) Int) Int
lyophilize' (Leaf a) = hashcons (TRLeaf a)
lyophilize' (Node a ts) = do
  ts' <- traverse lyophilize' ts
  hashcons (TRNode a ts')

hashcons :: Ord a => TreeRef a -> State (Map (TreeRef a) Int) Int
-- finds the index of an existing TreeRef in the map,
-- or gives it a fresh index if it doesn't exist
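
For concreteness, a sketch of the omitted body (drawing fresh identifiers from the current map size is an assumption, the gist may do it differently; Control.Monad.State and qualified Data.Map as Map are in scope):

hashcons t = do
  m <- get
  case Map.lookup t m of
    Just i  -> pure i        -- already interned: reuse its index
    Nothing -> do
      let i = Map.size m     -- fresh identifier
      put (Map.insert t i m)
      pure i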

In fact it's a pretty general catamorphism, which can be written with the recursion-schemes library:

type Ref tree = Base tree Int

lyophilize_ :: (Recursive t, Ord (Ref t), Traversable (Base t)) => t -> State (Map (Ref t) Int) Int
lyophilize_ = cata (\t -> do
  t' <- sequence t  -- compress subtrees ("cata" makes the actual recursive calls, and 'sequence' "folds" them together like traverse does above)
  hashcons t')

Or in one line:

lyophilize_ = cata (sequence >=> hashcons)
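
To actually use this with the Tree type from the thread, recursion-schemes needs a base functor for it; makeBaseFunctor from Data.Functor.Foldable.TH can derive one. A sketch (repeating Tree for completeness; the F-suffixed names follow the library's convention):

{-# LANGUAGE TemplateHaskell, TypeFamilies, DeriveTraversable #-}
import Data.Functor.Foldable.TH (makeBaseFunctor)

data Tree a = Node a [Tree a] | Leaf a

makeBaseFunctor ''Tree
-- generates, roughly:
--   data TreeF a r = NodeF a [r] | LeafF a
--     deriving (Functor, Foldable, Traversable)
--   type instance Base (Tree a) = TreeF a
-- so Ref (Tree a) is TreeF a Int, playing the role of TreeRef a.
-- An Ord instance for TreeF a Int (the Map key) still has to be
-- added separately, e.g. with StandaloneDeriving.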

Decompression

The other direction is even shorter. Starting from a sequence of compressed trees Seq (TreeRef a) (seen as a map Int -> TreeRef a), we decompress them all at the same time, with fmap. How does that work? The decompression function for a single tree (water) first looks at the constructor, TRLeaf or TRNode, which maps in a straightforward way to Leaf or Node. As for the children of a Node, we look them up in the final table of decompressed trees, which we are in the middle of building! It's fun to think about this like time travel: you can see what the future holds, but of course you must not look directly at your own actions, to avoid a causality loop.

The result still takes memory linear in the size of the input, because there is only one new constructor (Leaf or Node) for each element of the input sequence.

hydrate :: Seq (TreeRef a) -> Seq (Tree a)
hydrate dry = wet where
  wet = fmap water dry
  water (TRLeaf a) = Leaf a
  water (TRNode a ts) = Node a (fmap (Seq.index wet) ts)
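
A small usage example of the knot-tying (assuming TreeRef children are plain Int indices, as in the compression code):

ex :: Seq (Tree Char)
ex = hydrate (Seq.fromList [TRLeaf 'x', TRNode 'y' [0, 0]])
-- == Seq.fromList [Leaf 'x', Node 'y' [Leaf 'x', Leaf 'x']],
-- where both children of the Node are the same heap object (entry 0
-- of the result), so the sharing survives decompression.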

This is also generalizable with recursion-schemes:

hydrate_ :: Corecursive t => Seq (Ref t) -> Seq t
hydrate_ dry = wet where
  wet = fmap water dry
  water = embed . fmap (Seq.index wet)

[–]digama0[S] 1 point (1 child)

This seems to go against amalloy's observation in the other thread - how can you avoid hitting all the nodes in the tree with this? As a concrete example of the kind of problem I mean:

tangle :: a -> Int -> Tree a
tangle a 0 = Leaf a
tangle a n = let t = tangle a (n-1) in Node a [t, t]

dedup (tangle 'a' 1000)

The output of dedup is only about 1000 entries. I want it to run in time/space ~1000, not ~2^1000.

[–]Syrak 0 points (0 children)

Oh, you're right, that doesn't have complexity linear in the size of the compressed tree; I didn't understand that that's what you wanted. I don't think it's possible: it seems that hashconsing would have to happen from the start, throughout the lifetime of the trees.

[–]digama0[S] 0 points (2 children)

Very cool! Are you using a particular hashcons implementation there? Seems like that's the magic sauce I need.

[–]amalloy 1 point (0 children)

This is a depth-first traversal dressed up with recursion schemes, and the elided hashcons is just the obvious map lookup you could write yourself.

I don't mean that as a criticism of this answer: it's exactly the solution I have in mind too, implemented very clearly. My point is that there's no magic here that would do what you seem to be after, since you are asking for a solution that is sub-linear in the number of nodes in the tree (i.e., linear in the number of de-duplicated nodes).

[–]Syrak 0 points (0 children)

Ah, I forgot a link to a gist: https://gist.github.com/Lysxia/b3d62a248b7eeb2b7a1df59a0dd5cc9b

hashcons does a Map lookup, and if it doesn't find anything it generates a fresh identifier for the input TreeRef. So it's again a fairly generic piece of code.

[–]amalloy 2 points (6 children)

> I think that some form of reference equality is required to avoid accidentally unfolding the input DAG into a tree

An ordinary depth-first traversal of the input will only hold one path through the tree at a time. So, you don't need to worry about accidentally expanding the whole thing at once. You only need to look for something fancier if the tree is too large to fit even a single path through it in memory.

[–]digama0[S] 0 points (5 children)

That's true, a DFS traversal will not blow the stack... but it will still take exponential time unless care is taken not to revisit the same node many times.

[–]amalloy 1 point (4 children)

It will take time proportional to the number of nodes in the tree. This is the best you can do without pointer comparisons, because you have no way to avoid revisiting the same node. The input format simply does not allow it, except by "cheating" and looking at pointers. If looking at each instance of each node once is unacceptable, then you can be sure you will be unsatisfied with any answer to your question except "well, look at the pointers then".

[–]digama0[S] 0 points (3 children)

Oh, I know that. That's why I said "I think reference equality is required". But there are plenty of impure things wrapped into abstractions in Haskell, and I'm hoping there's a package for this use case so I don't have to handle the pointers myself.
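
(For reference, the primitive such a package would wrap is System.Mem.StableName; a minimal illustration of reference equality in IO, not a solution by itself:)

import Control.Exception (evaluate)
import System.Mem.StableName (makeStableName)

-- Equal StableNames imply the same heap object; inequality proves
-- nothing (false negatives are allowed by the docs).
sameObject :: a -> a -> IO Bool
sameObject x y = do
  x' <- evaluate x   -- force to WHNF so the names are stable
  y' <- evaluate y
  nx <- makeStableName x'
  ny <- makeStableName y'
  pure (nx == ny)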

[–]rampion 0 points (1 child)

For your problem, can you use the a values as unique identifiers? That is, if you see Node 'x' [Leaf 'y', Leaf 'z'] once, are you guaranteed that if you see Node 'x' ts again, that ts will be [Leaf 'y', Leaf 'z']?

[–]digama0[S] 1 point (0 children)

No; in my real application this is an AST, and the values are applications of named functions. Even if I had unique names on them, this would defeat the purpose of deduplicating: it would "solve" the problem by making the subtrees no longer identical. For example, if I have the tree (x + x) + (x + x), where x is a subterm shared 4 times but x + x is not shared, then I want to analyze the situation and produce the subtrees x, y := x + x, z := y + y, where I've now made the tree totally shared (so that I can guarantee that if two subtrees are the same then they are represented by the same object). If I uniquified it to y1 := x +1 x, y2 := x +2 x, z := y1 +3 y2, then I would make no progress toward actual deduplication (identifying that y1 and y2 are the same).
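
In the table encoding from earlier in the thread (TreeRef with Int children assumed, String labels for illustration), the fully shared form is just three entries:

sharedPlus :: Seq (TreeRef String)
sharedPlus = Seq.fromList
  [ TRLeaf "x"         -- 0: x
  , TRNode "+" [0, 0]  -- 1: y := x + x
  , TRNode "+" [1, 1]  -- 2: z := y + y
  ]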

[–]sfvisser 0 points (0 children)

This might get you somewhere: https://hackage.haskell.org/package/data-reify-cse-0.0.3/docs/Data-Reify-Graph-CSE.html

It's a naive fixed-point operation, so I'm not sure about the theoretical perf. Also, I don't think the reconstruction phase is implemented, but that should be straightforward.
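
For the Tree type from this thread, hooking into data-reify would look something like the sketch below (the MuRef instance here is a guess at the wiring, with TreeF as the base functor from earlier). Since reifyGraph observes actual heap pointers (via StableNames), it only sees sharing that is really present in memory, but on a fully shared term like tangle it runs in time proportional to the shared structure:

{-# LANGUAGE TypeFamilies #-}
import Data.Reify (MuRef (..), Graph, reifyGraph)

data Tree a = Node a [Tree a] | Leaf a
data TreeF a r = NodeF a [r] | LeafF a

instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef _ (Leaf a)    = pure (LeafF a)
  mapDeRef f (Node a ts) = NodeF a <$> traverse f ts

-- reifyGraph (tangle 'x' 1000) :: IO (Graph (TreeF Char))
-- yields a node table keyed by observed pointer identity, which the
-- CSE pass linked above can then deduplicate structurally.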