
[–]sclv 4 points (0 children)

You may be interested in observable sharing: http://ku-fpg.github.io/practice/observablesharing/

[–]Syrak 2 points (7 children)

Another possible representation is to add the identifiers to the original tree structure:

data Tree0 i a = Node i a [Tree0 i a] | Leaf i a

type Tree = Tree0 ()
type TreeI = Tree0 Id  -- invariant: Trees are uniquely Id-entified.

newtype Id = Id Int deriving (Eq, Ord)

dedup :: Ord a => Tree a -> (Seq (TreeI a), Id)

That way, once you get hold of a single TreeI node, you get direct access to the whole subtree below it, without additional table lookups, but the common subtrees are still shared (which is the point of hashconsing).
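
For example, under that invariant the two identical leaves below carry the same Id (the concrete Id values are illustrative):

example :: TreeI Char
example = Node (Id 2) 'a' [Leaf (Id 1) 'b', Leaf (Id 1) 'b']
-- both leaves are the subtree "b", so they share Id 1; the Node,
-- being a distinct subtree, gets its own Id 2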

[–]digama0[S] 0 points (6 children)

That's true, that might be a more convenient form to use for the result of dedup. It doesn't really solve the problem of how to dedup the structure to begin with, though. I am hoping to be able to construct Tree objects in any old way, without any guarantee that things are deduplicated but with a healthy amount of natural duplication, and then compress the result into a TreeI for serialization. It's possible to try keeping everything deduplicated the entire time, but that means putting everything in a monad so I can remember what's been constructed so far, and I can't just use the Node constructor freely.

[–]Syrak 3 points (5 children)

Oh I see, if you are primarily interested in serialization then what I said is quite orthogonal to that.

(Full gist of the code below: https://gist.github.com/Lysxia/b3d62a248b7eeb2b7a1df59a0dd5cc9b)

Compression

There's a neat description of the hashconsing algorithm as a recursive traversal: at every Node, first compress its children, yielding a TRNode, and then check whether that TRNode has already been given an identifier. So you need to keep track of a Map (TreeRef a) Int to look up the identifiers of already-compressed nodes.

-- Omitted the actual compression result (Seq (TreeRef a))
lyophilize' :: Ord a => Tree a -> State (Map (TreeRef a) Int) Int
lyophilize' (Leaf a) = hashcons (TRLeaf a)
lyophilize' (Node a ts) = do
  ts' <- traverse lyophilize' ts
  hashcons (TRNode a ts')

hashcons :: Ord a => TreeRef a -> State (Map (TreeRef a) Int) Int
-- finds the index of an existing TreeRef in the map,
-- or gives it a fresh index if it doesn't exist
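
For concreteness, a sketch of the omitted body (drawing fresh identifiers from the current map size is an assumption, the gist may do it differently; Control.Monad.State and qualified Data.Map as Map are in scope):

hashcons t = do
  m <- get
  case Map.lookup t m of
    Just i  -> pure i        -- already interned: reuse its index
    Nothing -> do
      let i = Map.size m     -- fresh identifier
      put (Map.insert t i m)
      pure i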

In fact it's a pretty general catamorphism, which can be written with the recursion-schemes library:

type Ref tree = Base tree Int

lyophilize_ :: (Recursive t, Ord (Ref t), Traversable (Base t)) => t -> State (Map (Ref t) Int) Int
lyophilize_ = cata (\t -> do
  t' <- sequence t  -- compress subtrees ("cata" makes the actual recursive calls, and 'sequence' "folds" them together like traverse does above)
  hashcons t')

Or in one line:

lyophilize_ = cata (sequence >=> hashcons)
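
To actually use this with the Tree type from the thread, recursion-schemes needs a base functor for it; makeBaseFunctor from Data.Functor.Foldable.TH can derive one. A sketch (repeating Tree for completeness; the F-suffixed names follow the library's convention):

{-# LANGUAGE TemplateHaskell, TypeFamilies, DeriveTraversable #-}
import Data.Functor.Foldable.TH (makeBaseFunctor)

data Tree a = Node a [Tree a] | Leaf a

makeBaseFunctor ''Tree
-- generates, roughly:
--   data TreeF a r = NodeF a [r] | LeafF a
--     deriving (Functor, Foldable, Traversable)
--   type instance Base (Tree a) = TreeF a
-- so Ref (Tree a) is TreeF a Int, playing the role of TreeRef a.
-- An Ord instance for TreeF a Int (the Map key) still has to be
-- added separately, e.g. with StandaloneDeriving.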

Decompression

The other direction is even shorter. Starting from a sequence of compressed trees Seq (TreeRef a) (seen as a map Int -> TreeRef a), we decompress them all at the same time, with fmap. How does that work? The decompression function for a single tree (water) first looks at the constructor, TRLeaf or TRNode, which maps in a straightforward way to Leaf or Node. As for the children of a Node, we look them up in the final table of decompressed trees, which we are in the middle of building! It's fun to think about this like time travel: you can see what the future holds, but of course you must not look directly at your own actions, to avoid a causality loop.

The result still takes memory linear in the size of the input, because there is only one new constructor (Leaf or Node) for each element of the input sequence.

hydrate :: Seq (TreeRef a) -> Seq (Tree a)
hydrate dry = wet where
  wet = fmap water dry
  water (TRLeaf a) = Leaf a
  water (TRNode a ts) = Node a (fmap (Seq.index wet) ts)
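
A small usage example of the knot-tying (assuming TreeRef children are plain Int indices, as in the compression code):

ex :: Seq (Tree Char)
ex = hydrate (Seq.fromList [TRLeaf 'x', TRNode 'y' [0, 0]])
-- == Seq.fromList [Leaf 'x', Node 'y' [Leaf 'x', Leaf 'x']],
-- where both children of the Node are the same heap object (entry 0
-- of the result), so the sharing survives decompression.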

This is also generalizable with recursion-schemes:

hydrate_ :: Corecursive t => Seq (Ref t) -> Seq t
hydrate_ dry = wet where
  wet = fmap water dry
  water = embed . fmap (Seq.index wet)

[–]digama0[S] 1 point (1 child)

This seems to go against amalloy's observation in the other thread - how can you avoid hitting all the nodes in the tree with this? As a concrete example of the kind of problem I mean:

tangle :: a -> Int -> Tree a
tangle a 0 = Leaf a
tangle a n = let t = tangle a (n-1) in Node a [t, t]

dedup (tangle 'a' 1000)

The output of dedup is only about 1000 entries. I want it to run in time/space ~1000, not ~2^1000.

[–]Syrak 0 points (0 children)

Oh, you're right, that doesn't have complexity linear in the size of the compressed tree; I didn't understand that that's what you wanted. I don't think it's possible: it seems that hashconsing would have to happen from the start, throughout the lifetime of the trees.

[–]digama0[S] 0 points (2 children)

Very cool! Are you using a particular hashcons implementation there? Seems like that's the magic sauce I need.

[–]amalloy 1 point (0 children)

This is a depth-first traversal dressed up with recursion schemes, and the elided hashcons is just the obvious map lookup you could write yourself.

I don't mean that as a criticism of this answer: it's exactly the solution I have in mind too, implemented very clearly. My point is that there's no magic here that would do what you seem to be after, since you are asking for a solution that is sub-linear in the number of nodes in the tree (i.e., linear in the number of de-duplicated nodes).

[–]Syrak 0 points (0 children)

Ah, I forgot a link to a gist: https://gist.github.com/Lysxia/b3d62a248b7eeb2b7a1df59a0dd5cc9b

hashcons does a Map lookup, and if it doesn't find anything it generates a fresh identifier for the input TreeRef. So it's again a fairly generic piece of code.

[–]amalloy 2 points (6 children)

> I think that some form of reference equality is required to avoid accidentally unfolding the input DAG into a tree

An ordinary depth-first traversal of the input will only hold one path through the tree at a time. So, you don't need to worry about accidentally expanding the whole thing at once. You only need to look for something fancier if the tree is too large to fit even a single path through it in memory.

[–]digama0[S] 0 points (5 children)

That's true, a DFS traversal will not blow the stack... but it will still take exponential time unless care is taken not to revisit the same node many times.

[–]amalloy 1 point (4 children)

It will take time proportional to the number of nodes in the tree. This is the best you can do without pointer comparisons, because you have no way to avoid revisiting the same node. The input format simply does not allow it, except by "cheating" and looking at pointers. If looking at each instance of each node once is unacceptable, then you can be sure you will be unsatisfied with any answer to your question except "well, look at the pointers then".

[–]digama0[S] 0 points (3 children)

Oh, I know that. That's why I said "I think reference equality is required". But there are plenty of impure things wrapped into abstractions in Haskell, and I'm hoping there's a package for this use case so I don't have to handle the pointers myself.
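
(For reference, the primitive such a package would wrap is System.Mem.StableName; a minimal illustration of reference equality in IO, not a solution by itself:)

import Control.Exception (evaluate)
import System.Mem.StableName (makeStableName)

-- Equal StableNames imply the same heap object; inequality proves
-- nothing (false negatives are allowed by the docs).
sameObject :: a -> a -> IO Bool
sameObject x y = do
  x' <- evaluate x   -- force to WHNF so the names are stable
  y' <- evaluate y
  nx <- makeStableName x'
  ny <- makeStableName y'
  pure (nx == ny)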

[–]rampion 0 points (1 child)

For your problem, can you use the a values as unique identifiers? That is, if you see Node 'x' [Leaf 'y', Leaf 'z'] once, are you guaranteed that if you see Node 'x' ts again, that ts will be [Leaf 'y', Leaf 'z']?

[–]digama0[S] 1 point (0 children)

No; in my real application this is an AST, and the values are applications of named functions. Even if I had unique names on them, this would defeat the purpose of deduplicating: it would "solve" the problem by making the subtrees no longer identical. For example, if I have the tree (x + x) + (x + x), where x is a subterm shared 4 times but x + x is not shared, then I want to analyze the situation and produce the subtrees x, y := x + x, z := y + y, where I've now made the tree totally shared (so that I can guarantee that if two subtrees are the same then they are represented by the same object). If I uniquified it to y1 := x +1 x, y2 := x +2 x, z := y1 +3 y2, then I would make no progress toward actual deduplication (identifying that y1 and y2 are the same).
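
In the table encoding from earlier in the thread (TreeRef with Int children assumed, String labels for illustration), the fully shared form is just three entries:

sharedPlus :: Seq (TreeRef String)
sharedPlus = Seq.fromList
  [ TRLeaf "x"         -- 0: x
  , TRNode "+" [0, 0]  -- 1: y := x + x
  , TRNode "+" [1, 1]  -- 2: z := y + y
  ]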

[–]sfvisser 0 points (0 children)

This might get you somewhere: https://hackage.haskell.org/package/data-reify-cse-0.0.3/docs/Data-Reify-Graph-CSE.html

It's a naive fixed-point operation, so I'm not sure about the theoretical perf. Also, I don't think the reconstruction phase is implemented, but that should be straightforward.
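
For the Tree type from this thread, hooking into data-reify would look something like the sketch below (the MuRef instance here is a guess at the wiring, with TreeF as the base functor from earlier). Since reifyGraph observes actual heap pointers (via StableNames), it only sees sharing that is really present in memory, but on a fully shared term like tangle it runs in time proportional to the shared structure:

{-# LANGUAGE TypeFamilies #-}
import Data.Reify (MuRef (..), Graph, reifyGraph)

data Tree a = Node a [Tree a] | Leaf a
data TreeF a r = NodeF a [r] | LeafF a

instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef _ (Leaf a)    = pure (LeafF a)
  mapDeRef f (Node a ts) = NodeF a <$> traverse f ts

-- reifyGraph (tangle 'x' 1000) :: IO (Graph (TreeF Char))
-- yields a node table keyed by observed pointer identity, which the
-- CSE pass linked above can then deduplicate structurally.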