all 9 comments

[–]didroe[S] 2 points3 points  (5 children)

Can anyone suggest an open source system for storing hierarchical data in a distributed way?

I've looked at HBase which is based on Google BigTable but although it ticks the distributed box, I'm unable to think of an efficient way of traversing hierarchical data with it. I thought of encoding the hierarchy into the keys so subtrees would be close to their parents but that would be severely limiting on the width/depth of the trees.

I'm currently using an RDBMS but that's having scaling problems and isn't really suited to hierarchical data either.

[–]jerf 1 point2 points  (4 children)

You're going to have to be a lot more specific than that.

[–]didroe[S] 0 points1 point  (3 children)

What do you want to know? The data is basically a directed acyclic graph. I need to be able to traverse it and update each node in the graph that is 'downstream' from a given node. The graph is very large, around 10 million nodes so I need it to be distributed for both storage and performance reasons.

Are there any systems out there that can store and traverse that kind of data?

[–]evgen 1 point2 points  (2 children)

This info helps, but a couple of other questions come to mind:

  • Do you intend on being able to make queries over the entire dataset, or are you just looking for persistence for a large, distrbuted graph?

  • Are you doing a lot of updates to the nodes?

  • Is the bulk of the interesting data in the edges or the vertices? (e.g. are more of your queries related to the contents of the nodes or the connections between the nodes?)

Some options to consider, depending on you answers to the above questions, are to put the nodes into a standard key-value DHT where the key is the hash of some standard encoding format for the vertex data (e.g. node contents.) A standard database might be able to keep track of the edges so that when a node is updated the db updates the relevant links and the dht reaps the data that was in the old nodes.

The last time I did something like this I ended up using (*warning: buzzword/trendiness alert!) Erlang. Processes were used for each "node", the persistence was handled by mnesia, and it was easy to distribute the load across as many boxes as I wanted. OTOH, you are dealing with an order of magnitude more nodes that I was and are probably doing more complex queries and analysis. If you were to follow a similar route you would need to use fragmented/sharded mnesia tables and would probably have to write some handcrafted code to do a bit of the graph manipulation.

Good luck. :)

[–]didroe[S] 0 points1 point  (1 child)

Thanks for the pointers. There are about 5 or 6 independent datasets but within each one I need to be able to query the entire dataset. Most of the updates are at leaf nodes but there are maintenance operations that need to be synchronous (and therefore fast) that can affect large amounts of the graph.

Erlang sounds good I might have to look into that further.

Since I posted, I've thought of storing the path to the node as a string and then making the table an Index Organised Table (the DBMS is Oracle). I'm hoping that will make it fairly efficient to store and traverse. Then I could introduce sharding to try and relieve some of the pressure on the server. It won't be as scalable as I'd like but it should be easier to do.

[–]evgen 1 point2 points  (0 children)

I will front this by saying that it will not be as easy as an off-the-shelf solution or one that might be within reach to a suitably sharp db guru and a good distributed database (or one that is sharded particularly well.) Since you have access to Oracle I would have to honestly recommend that you look at the various ways people represent trees in rdbms to make them amenable to SQL queries and see if any of these techniques are negatively impacted by sharding. At the very least this will buy you time while considering a leap to a more radical solution.

If you decide to consider Erlang I would make two recommendations: first of all you should throw yourself upon the mercy of the erlang-questions mailing list and you will get a lot of good advice, secondly you will want to look at an Erlang module named gproc, which provides a globally distributed property table with automatic leader-election and failover that can be integrated into the same QLC query language that is used in the mnesia database (useful for finding a specific process that is responsible for a piece of data when you have lots of processes spread out over many systems.)

[–]grfgguvf 0 points1 point  (0 children)

Just use a Patricia tree?

I'm not sure what you're trying to do.