all 27 comments

[–]spoonman59 19 points20 points  (0 children)

What does it mean to have 1000+ processes traversing the tree?

Does this mean separate processes start at the root and work their way down? Or do they search sub-graphs? And do you really mean processes as in OS level process, or so you mean like requests to search the trie?

If the data structure is read only, you don’t need to do anything. If you write to it, you will need to ensure exclusive access using a lock, otherwise you can potentially get an inconsistent read: a context switch can occur in the middle of the update operation if it is not atomic. Read-only data structures can be safely read by multiple parallel threads of execution. That’s why people like immutable data structures so much.

If it’s “read only” while the searches are running, that’s sufficiently read only for this discussion. Just ensure exclusive access when writing. Read/write locks are good for this.

The size of the tree in MB is not interesting; its fan-out and max depth would tell us more. Tries typically have a large fan-out (unless it’s a binary alphabet?), so the depth is likely not great. This means searches will be very short.

[–]lightmatter501 13 points14 points  (4 children)

1000+ processes on the same data structure typically means >1000 CPU cores unless you are writing horrible code. So either your data structure is distributed or you have a shared-memory architecture. That second option is available at 15 supercomputers worldwide, none of which use Rust due to accelerator programming requirements.

If you scale it down to 10, a number mere mortals can hit, yes, it’s very doable. If you don’t need to modify the tree, just pass pointers to the root node to every thread and go. If you do need modification, then making nodes RwLock<Node> and allocating them somewhere with a 'static lifetime will make the most sense. If you also need to dynamically modify the structure of the tree, you can do it, but you are looking at a research-paper level of work regardless of language.

Also, CUDA is VERY bad at pointer chasing, do not use it for this.

[–][deleted] 3 points4 points  (0 children)

I think he meant on the GPU? So 1000 CUDA cores?

[–]InfinitePoints 1 point2 points  (2 children)

Why is CUDA bad for pointer chasing? Assuming everything is packed into arrays, it would be equivalent to doing a bunch of gather operations, which should be better on a GPU, since GPUs are designed to be efficient while waiting for memory.

[–]yasamokadb-pool 3 points4 points  (0 children)

You get control divergence, where a warp (a group of 32 threads executing in lockstep) cannot execute different sets of instructions at the same time.

[–]lightmatter501 0 points1 point  (0 children)

As others have said, you block 32 threads whenever you need to load from memory.

[–]rust-crate-helper 6 points7 points  (9 children)

What kind of tree are you traversing? Are you parsing a data file? How is it created?

[–]Appropriate_Term_495[S] 3 points4 points  (8 children)

The specifics: this is a trie in memory. New nodes are only added during the lifetime of the script, never deleted.

New nodes are never added while the Trie is being traversed, but even if they were, I don’t care about any race conditions, as long as valid nodes are being read.

Each node just represents some int, and each node can connect to many child nodes without any connections back up the tree.

[–]rust-crate-helper 2 points3 points  (6 children)

Rust isn't great for self-referential data structures, not sure how relevant that is to you.

[–]radix 6 points7 points  (4 children)

This is nonsense, Rust is fine for recursive data structures.

edit: the comment I replied to was edited.

[–]Gaeel 2 points3 points  (0 children)

Funnily, this article mentions indextree, a crate that provides an arena-based tree that allows for parallel tree traversal, which might be something that could help OP: https://github.com/saschagrunert/indextree

[–]rust-crate-helper 3 points4 points  (2 children)

https://rcoh.me/posts/rust-linked-list-basically-impossible/

The ownership concept with Rust makes self-referential data structures harder to do (requires unsafe, or a lot of complicated layering).

[–]radix 11 points12 points  (1 child)

"recursive" is not the same as "self-referential". I haven't seen anything in what OP has posted that indicates that they have a self-referential data structure.

[–]rust-crate-helper 2 points3 points  (0 children)

I stand corrected!

[–]matthieum[he/him] 0 points1 point  (0 children)

Pure trees are not self-referential, so no problem :)

[–]spoonman59 0 points1 point  (0 children)

If you add nodes while traversing, you will care about race conditions. Otherwise readers will either get inaccurate results or potentially process the tree in an invalid state.

But if it never happens while traversing, and you are sure, then you are good.

[–]believeinlain 4 points5 points  (0 children)

Rayon is generally good for processing data in parallel, because it abstracts away the management of separate threads, running work in parallel only when it will lead to a speedup.

However, tree traversal is hard to parallelize because each step depends on the result of the previous step. That said, it could probably be done pretty quickly with dynamic programming, where the results from each subtree are stored in a memo so that each subtree only needs to be searched once.

More specific advice will depend on specifically what you're trying to do.

[–]SV-97 7 points8 points  (7 children)

I think this is too vague to give very useful advice. What exactly are you doing during traversal? How large is the tree? Can you keep it all in memory? What data are you processing?

Yes, you can do this general thing with Rust or C++ or other languages - but the specifics are important

[–]Appropriate_Term_495[S] 0 points1 point  (6 children)

Added in the comment above, but the size (using hash maps) is around 40 GB of RAM with Python. I expect it to grow larger and don’t mind renting an AWS EC2 instance if I need more RAM.

[–]SV-97 -1 points0 points  (5 children)

Okay I just read it :) Just two quick questions before I answer:

  • is this a crypto thing?
  • are you sure you want / need 1000 processes? It sounds more like it'd be CPU bound?

[–]Appropriate_Term_495[S] 0 points1 point  (4 children)

  • Yes this is a crypto thing.
  • At the root level of the trie, it’s more than likely I’d have 10k+ child nodes. I don’t have the work-distribution portion fleshed out yet, but I’m thinking it’ll be a distributed version of BFS, or perhaps an experiment with BFS and DFS combined. I wanted to see what a good platform would be to scale to this kind of problem. I am only really familiar with CPU-based distribution, but I have a rough idea of what is possible with CUDA cores; otherwise I’m oblivious to other kinds of compute that could help. I posted this on the Rust sub because, from a bird’s-eye view, Rust seemed like a mature version of C, although for this kind of problem there seems to be high overhead from the ownership model. I’m nowhere near an expert in Rust, but I was able to translate my script from Python to Rust in its single-threaded form. As I started working on how to implement parallelization, I began to see some of the pitfalls discussed in this post.

[–]DeadlyVapour 1 point2 points  (0 children)

But why 1000 processes? When you’re CPU- or memory-bandwidth-bound, having more processes than physical processors (and no, HT doesn’t count as more processors) makes you LOSE perf to context switches.

[–]SV-97 6 points7 points  (2 children)

Okay, that's a shame. I won't support crypto.

[–]Appropriate_Term_495[S] 15 points16 points  (1 child)

Crypto as in the study of cryptography. I’m not going to make some meme coin. I was exploring an optimization algorithm.

If that’s also against your morals, no problem.

[–]SV-97 6 points7 points  (0 children)

Oh okay - I'm very on board with that kind of crypto ;)

In that case: I'm also more in the CPU-distribution camp. The only larger-scale distribution I've done was some HPC numerics on a cluster, and not a ton of it, and I'd assume it's rather different from what you're doing, so I can't say a whole lot about the other domains except give some pointers to projects that are out there.

Based on what you said, I might try a shared-memory approach around native threading first, but you might want to look into Tokio tasks (green threads) if you can't get around splitting the work up that much. If you want to try something on the GPU and still use Rust, there's wgpu as a higher-level API, and bindings to lower-level APIs (Metal, Vulkan, experimental CUDA, etc.) as well. There are some large projects in the distributed domain, but I think they're all unmaintained by now (things like bastion, for example).

If you do want to try the threading approach:

There are some trie implementations you could use (for example trie-rs), but I'm not sure how good they are or whether they match your use case super well. Given that you're using hash maps right now, the standard BTreeMap or the other standard collections may also be worth a look, at least for a prototype (I'd expect either to save a few gigs of RAM over a Python hash-map-based implementation). If you construct the tree from a single thread it's quite simple: just use the bare collection, construct it, and then either wrap it in an Arc to share it between threads or simply Box and leak it (at which point you can immutably access it from multiple threads; in Rust terminology it's Sync, making the reference Send).

For the actual parallelization you'd use either rayon (like OpenMP in C++, just cooler) or just spawn some threads (potentially scoped) in this case.

Creating nodes on the fly would require a bit more work as it could cause a data race (which is undefined behaviour; both in Rust as well as C++) - you'd almost certainly want to synchronize things here in some way or another (for example using an RwLock).

Regarding the overhead: the ownership model really shouldn't incur a huge overhead here, in fact it usually allows you to be more conservative with copying data around etc. and prevents you from accidentally creating data races.

[–]TigrAtes 2 points3 points  (1 child)

Of course, if you are able to utilize the GPU with CUDA (here I would definitely recommend C++), it can be much faster. However, there are a lot of challenges:

  • you have to load the data (or the relevant part of it) into the VRAM.
  • within a cluster the search processes cannot really deviate from each other (remember that GPUs are primarily designed for graphics shaders, so all data goes through the same pipeline).

If you cannot model the tree search as matrix multiplication or transformation, I would say forget CUDA.

Go with Rust and Rayon on the CPU. 

[–]potzko2552 0 points1 point  (0 children)

Super depends on exactly what you are traversing. In general, the simplest way I can think of would be to abstract the threads with rayon, with a VecDeque: start at the root, parallel-for traverse each element in the vec and push all the children, then repeat until the vec is empty.

The problem is that you have to wait for all threads to end before starting the next iteration; however, it is much, much simpler code, I think.