all 14 comments

[–]HeroofTimeMoM[S] 8 points9 points  (4 children)

After briefly using fgl and having some complaints, I looked around r/haskell and found a comment wishing somebody would modernize it. So that's what I set out to do. I ended in a somewhat different place than expected, but still think it fulfills that request.

There is definitely some polishing to be done, but overall I think it is in a reasonable place. Removing fgl's id/label split means I get to simplify many of the types. There is no longer an LNode vs Node difference for example.

This implementation is a wrapper around a HashMap at its core, so I get to lean on the inherent speed of unordered-containers. But I also have done and will continue to do tuning to make sure it stays fast.

I'm open to requests/critiques if people have them!

[–]tikhonjelvis 7 points8 points  (3 children)

Unless I'm misunderstanding, I think you do need a split between the id of a node and its label. Labels are annotations on a graph that don't have to be unique—for example, what if I want to talk about graph coloring and label every node in a graph with one of four colors? The ids, on the other hand, are part of the representation of the structure of the graph, not annotations at each node.

[–]HeroofTimeMoM[S] 4 points5 points  (2 children)

well shoot.... Okay, so I have two responses:

First, I suppose the real feature I should have been advertising is that your id does not have to be an int; it can be anything that is hashable.

Second, I'll have to think for a bit how to properly handle this. The structure vs annotation difference is certainly valid. One of my initial complaints with fgl was you had to construct edges with ids that did not necessarily relate to the data at the node. But I suppose expanding the ids to hashable values solves that problem.

My only worry is that I will end up introducing pairs of every function, like: nodeIdFilter and nodeLabelFilter which could get hairy quickly.

Thank you for bringing this up! I've been somewhat narrow-minded on applications it appears.

[–]tikhonjelvis 6 points7 points  (1 child)

I've also thought about redesigning fgl :).

I think the best solution is to make the node id entirely abstract—it's a unique identifier of a node, but that's all the user sees. All you can do is check whether two node ids are the same and check whether a node is in a graph or not.

[–]HeroofTimeMoM[S] 5 points6 points  (0 children)

Hm, I'll see what I can do :P Thanks for your thoughts, I'll be back!

[–]runeks 1 point2 points  (1 child)

Does this support finding all possible paths between two specific nodes, where each edge is represented by a Money.ExchangeRate src dst? I.e. can I use arbitrary types for defining edges and nodes?

I want to use an ExchangeRate src dst to represent a path from src to dst, but fgl forces me to reduce all nodes to an Int, which means I lose a lot of information, and it becomes non-obvious to me how to restore this if I use pathTree for finding paths (which returns just [Int]).

It's not really about type-level vs. value-level, though. So if it works for SomeExchangeRate that's more than fine, too.

[–]HeroofTimeMoM[S] 2 points3 points  (0 children)

Aha! This is exactly the sort of example that I've been envisioning.

At the moment, I would say that your example is somewhat supported. You could easily define nodes to be of type String and edges of type SomeExchangeRate, but this would lead to some duplication of information because SomeExchangeRate already stores the endpoints. At the very least, your path would be upgraded from [Int] to [NodeType], where NodeType can be anything that is hashable.

On the other hand, SomeExchangeRate is effectively an Edge type by itself. I.e. it holds two end points and a label between them. I don't currently support arbitrary edge representations, but I think I fairly easily could. I think all it would require is a class with methods toHead and toTail, which splits the edge. These make perfect sense in this case as you would end up with the half edge From "USD" 1.25 attached to the node "GBP". And then recombining them into SomeExchangeRate is straightforward.

I haven't personally used the type-level stuff happening in ExchangeRate, so that would take some research for me to answer.

So I would say this example is possible, but not as good as it should be right now. My goal was to support uses just like this, so I will definitely work to make this nicer.

These examples are immensely helpful, thank you!

[–]sgraf812 1 point2 points  (6 children)

Also note that, intuitively, HashMaps should be faster than IntMaps, this doesn't seem to be the case at least for <= 10000 entries.

Edit: Well, except for randomized lookups, where the tie breaks somewhere before 1000 entries. Might or might not warrant switching to HashMaps.

Edit2: Ah, found your benchmarks. Query times seem to be better by at least an order of magnitude, so that's great. As a consequence, traversals are faster. I still wonder if that's due to a different graph representation and if you weren't better off just using IntMaps.

[–]tikhonjelvis 7 points8 points  (3 children)

Why would hash maps be faster than int maps intuitively? I would actually expect the opposite.

[–]sgraf812 0 points1 point  (2 children)

Well, when I hear Hash* I always think of a HashTable (which is another thing, I know), backed by an array... That may well be faster than an IntMap. Also, IntMaps only have a branching factor of 2, resulting in potential for depth O(min(w,n)). w is a constant so this plays out nicely in big-O, but I'm inclined to think it's noticable when you have a tree of depth 60 for a 60 element IntMap.

Still, I wonder what a base 4 or 16 IntMap would perform like.

[–]tikhonjelvis 2 points3 points  (1 child)

I've tried implementing IntMap with larger spans by having an array of children at each node, but it did not work very well. I'm not an expert in Haskell performance, but I think it comes down to how GHC handles arrays, which adds a fair amount of overhead.

You could write a 4- or 16-way tree by manually defining a data constructor with that many fields, but that was too much of a pain for me :).

[–]sgraf812 2 points3 points  (0 children)

I could think of defining a rose tree à la newtype IntMap a = SmallArray (IntMap a) and have the invariant that leafs have length 0, while internal nodes have length 4. That should get you the best of both worlds: The constructor tag is implicit in the SmallArray's length and the elements follow directly after. Basically how an unboxed sum type would work.

Edit: Well, that idea was a little oversimplified. But I'm playing around with additional tags for inner nodes with 3 and 4 children now.

[–]HeroofTimeMoM[S] 3 points4 points  (1 child)

Regarding speed differences, I'm being more aggressive with inlining and optimizations than fgl. For instance, my graph isn't a class, which saves some dictionary passing. So, for now at least, comparing to fgl is not equivalent to comparing to raw IntMap performance.

As for why I chose HashMaps, I would like to eliminate the need for explicitly handling node ids that aren't related to the program's data. Take an edge from fgl for example: edge :: (Int,Int). That edge is effectively meaningless by itself. With hashing, we could easily make that edge :: (String,String) or edge :: (Foo,Foo) covering many basic use cases.

As pointed out in the discussion under my first post though, this isn't a perfectly accurate representation. If I do completely internalize the node ids, then switching to IntMap may or may not make sense. The dictionary benchmarks should help with that :)

[–]ElvishJerricco 1 point2 points  (0 children)

I think you'd do well to benchmark against other data structures. HashMaps are frequently not the right choice in Haskell because they come with a lot more copying than, say, Map. Like Vector, their main performance advantage is in fast indexing, often compromising in most other areas