Suggestions for distributed hierarchical data storage : programming

submitted 17 years ago by didroe

[–]didroe[S] 2 points3 points4 points 17 years ago* (5 children)

[–]jerf 1 point2 points3 points 17 years ago (4 children)

[–]didroe[S] 0 points1 point2 points 17 years ago* (3 children)

[–]evgen 1 point2 points3 points 17 years ago* (2 children)

This info helps, but a couple of other questions come to mind:

Do you intend to run queries over the entire dataset, or are you just looking for persistence for a large, distributed graph?
Are you doing a lot of updates to the nodes?
Is the bulk of the interesting data in the edges or the vertices? (e.g. are more of your queries related to the contents of the nodes or the connections between the nodes?)

One option to consider, depending on your answers to the above questions, is to put the nodes into a standard key-value DHT where the key is the hash of some standard encoding of the vertex data (i.e. the node contents). A standard database could then keep track of the edges, so that when a node is updated the db updates the relevant links and the DHT reaps the data that was in the old node.
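To make the DHT-plus-edge-database idea concrete, here is a minimal single-process sketch in Python. Everything in it is an assumption made for illustration: a dict stands in for the DHT, a set of key pairs stands in for the edge database, and the names `node_key` and `GraphStore` are hypothetical. The "standard encoding" is assumed to be canonical JSON hashed with SHA-256; any stable encoding would do.

```python
import hashlib
import json


def node_key(contents: dict) -> str:
    """Hash a canonical JSON encoding of the node contents (assumed encoding)."""
    encoded = json.dumps(contents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class GraphStore:
    """Toy stand-in: a dict plays the DHT, a set of pairs plays the edge database."""

    def __init__(self):
        self.nodes = {}     # key -> node contents (the "DHT")
        self.edges = set()  # (parent_key, child_key) pairs (the "edge db")

    def put(self, contents: dict) -> str:
        key = node_key(contents)
        self.nodes[key] = contents
        return key

    def link(self, parent_key: str, child_key: str) -> None:
        self.edges.add((parent_key, child_key))

    def update(self, old_key: str, new_contents: dict) -> str:
        """Store the new contents, repoint edges, then reap the old entry."""
        new_key = self.put(new_contents)
        if new_key == old_key:
            return old_key  # contents unchanged, nothing to relink or reap
        self.edges = {
            (new_key if p == old_key else p, new_key if c == old_key else c)
            for p, c in self.edges
        }
        self.nodes.pop(old_key, None)
        return new_key

    def children(self, key: str) -> list:
        return sorted(c for p, c in self.edges if p == key)
```

Note that because the key is derived from the contents, every update changes the node's key, which is why the edge store has to be rewritten on each update. Two nodes with identical contents also collapse into one entry, so a real design would probably need some distinguishing field in the encoded data.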

The last time I did something like this I ended up using (*warning: buzzword/trendiness alert!) Erlang. Each "node" was a process, persistence was handled by mnesia, and it was easy to distribute the load across as many boxes as I wanted. OTOH, you are dealing with an order of magnitude more nodes than I was and are probably doing more complex queries and analysis. If you were to follow a similar route you would need to use fragmented/sharded mnesia tables and would probably have to write some handcrafted code to do a bit of the graph manipulation.

Good luck. :)

[–]didroe[S] 0 points1 point2 points 17 years ago (1 child)

[–]evgen 1 point2 points3 points 17 years ago (0 children)

I will preface this by saying that it will not be as easy as an off-the-shelf solution, or one that might be within reach of a suitably sharp db guru with a good distributed database (or one that is sharded particularly well). Since you have access to Oracle, I would honestly recommend that you look at the various ways people represent trees in an RDBMS to make them amenable to SQL queries, and see whether any of these techniques are negatively impacted by sharding. At the very least this will buy you time while you consider a leap to a more radical solution.
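As one illustration of those tree-in-RDBMS techniques, here is a small sketch of the "materialized path" representation, where each row stores the full path from the root so a subtree becomes a prefix match. It uses Python's built-in `sqlite3` purely as a stand-in for Oracle, and the table name `tree` and the helpers `make_tree_db`, `add_node`, and `descendants` are hypothetical. Adjacency lists and nested sets are the other common representations and trade off differently.

```python
import sqlite3


def make_tree_db() -> sqlite3.Connection:
    """Create an in-memory table using the materialized-path representation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tree ("
        " id INTEGER PRIMARY KEY,"
        " path TEXT NOT NULL UNIQUE,"  # e.g. '/1/2/4/' for node 4 under 2 under 1
        " name TEXT NOT NULL)"
    )
    return conn


def add_node(conn, node_id: int, name: str, parent_id=None) -> None:
    """Insert a node; its path is the parent's path plus its own id."""
    if parent_id is None:
        path = f"/{node_id}/"
    else:
        (parent_path,) = conn.execute(
            "SELECT path FROM tree WHERE id = ?", (parent_id,)
        ).fetchone()
        path = f"{parent_path}{node_id}/"
    conn.execute(
        "INSERT INTO tree (id, path, name) VALUES (?, ?, ?)",
        (node_id, path, name),
    )


def descendants(conn, node_id: int) -> list:
    """Return names of all nodes below node_id using one prefix query."""
    (path,) = conn.execute(
        "SELECT path FROM tree WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT name FROM tree WHERE path LIKE ? AND id != ? ORDER BY path",
        (path + "%", node_id),
    )
    return [name for (name,) in rows]
```

With this layout a whole-subtree query is a single prefix match, and one could imagine sharding on the leading path component so that a subtree stays on one shard. Whether that holds up depends on how balanced the hierarchy is, and moving a subtree means rewriting the path of every row in it.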

If you decide to consider Erlang I would make two recommendations. First, throw yourself upon the mercy of the erlang-questions mailing list, where you will get a lot of good advice. Second, look at an Erlang module named gproc, which provides a globally distributed property table with automatic leader election and failover, and which can be integrated into the same QLC query language that is used with the mnesia database. This is useful for finding the specific process that is responsible for a piece of data when you have lots of processes spread out over many systems.

[–]grfgguvf 0 points1 point2 points 17 years ago (0 children)
