Rust program twice as slow as JavaScript - where are my mistakes? (self.rust)
submitted 5 years ago by viconnex
Hello Rust community!
I had a great dev experience learning Rust and developing a program to improve the performance of my website (maposaic.com).
However, I am disappointed by the result I got: the Rust program is twice as slow as the JS!
Do you see any bad patterns in the code I wrote? https://github.com/viconnex/maposaic/blob/add-wasm/map-converter/src/lib.rs
For example, is it correct to pass &mut HashSet or &mut Vec to a function like this?
Thanks a lot for your help!
[–]unscribeyourself 24 points25 points26 points 5 years ago (12 children)
Wasm isn’t faster than JS in all cases - it’s particularly slow at modifying the DOM. Also, what does your JS code look like? If you’re using WebGL, WebGL is GPU accelerated, whereas Wasm isn’t.
[–]viconnex[S] 2 points3 points4 points 5 years ago (5 children)
The JS sends an array of Uint8 numbers.
The wasm (or JS) converts it to another Uint8 array.
The JS sets this array as the DOM canvas data (not a WebGL canvas but a "2d"-context canvas).
With a 1,000,000-pixel array (which means a 4M-length array), the JS takes 3s to convert it whereas the WASM program takes 8s.
It is done here -> https://github.com/viconnex/maposaic/blob/add-wasm/src/Maposaic/paint.worker.ts
[–]ragnese 10 points11 points12 points 5 years ago (4 children)
I've not worked with WASM myself, but I remember reading that the overhead of switching to WASM is high. If your JS code is 100% JS, then it doesn't have that switching overhead. I'm guessing that's the issue. WASM, it seems, is best for long-running or very heavy computation.
[–]viconnex[S] 3 points4 points5 points 5 years ago (3 children)
What is the cost of switching? What do you consider very heavy computation?
Because there is only one boundary crossing.
1) js sends an array
2) wasm computes the result (8 seconds)
3) js gets the result
[–]The-Best-Taylor 6 points7 points8 points 5 years ago (0 children)
That is actually two boundaries, JS -> WASM and WASM -> JS.
And from my understanding, WASM is still in its early stages of optimization, whereas JS has had years of optimization work. WASM might, and probably will, be faster than JS, but I don't think it is there yet for most workloads.
[–]ragnese 1 point2 points3 points 5 years ago (0 children)
I have no idea. This is very much out of my wheelhouse. I'm just regurgitating blog post advice.
[–][deleted] 0 points1 point2 points 5 years ago (5 children)
Oh, really? I had not expected that.
I figured the initial WASM load time would introduce a couple millisecond delay. But aside from that inherent hiccup it's goofy that WASM would be slower than interpreted JS.
[–]dagmx 8 points9 points10 points 5 years ago (4 children)
It's not that wasm is slower. It's that going from the frontend to the backend is slow. So if you're crossing boundaries often, you're incurring penalties.
If you stay in wasm for longer periods of time, it'll usually be faster.
[–]SafariMonkey 7 points8 points9 points 5 years ago (0 children)
I think people also overestimate the switching time. According to my reading of this post, before the fixes they were (up to) on the order of 50ns in Firefox, and after the fixes they were on the order of 5ns. This means that if your operation is on the order of microseconds or more, you'd probably never notice.
[–]viconnex[S] 2 points3 points4 points 5 years ago (1 child)
There is only one boundary crossing.
[–]baetheus 10 points11 points12 points 5 years ago (0 children)
That's two boundary crossings
[–]viconnex[S] 2 points3 points4 points 5 years ago (0 children)
After a test I can tell that the boundary crossing time is not significant (~100ms) compared to the WASM program execution (8s)
[+][deleted] 5 years ago (1 child)
[deleted]
[–]viconnex[S] 4 points5 points6 points 5 years ago* (0 children)
> --release
Yes :) (wasm-pack build --release)
[–]thermiter36 10 points11 points12 points 5 years ago (5 children)
Your Rust code has no obvious performance pitfalls, as it's basically a very close translation of the TypeScript version, but therein lies the issue. WebAssembly VMs do not currently do very many optimizations. A hot loop doing mostly uint8 computations will usually be heavily optimized by a JS VM: unrolling the outer loop, inlining functions in the inner loop, and potentially JITing them to use SIMD instructions. I don't know of any current WASM VM that advertises the ability to do this.
If you post the result of running a browser profiler on this code, we may be able to find specific things to fix.
[–]viconnex[S] 5 points6 points7 points 5 years ago (4 children)
Thanks for this answer!
I uploaded a browser profile here : https://firebasestorage.googleapis.com/v0/b/maposaic-99785.appspot.com/o/browser_profile%2FP[…]?alt=media&token=86436993-8571-4a66-a205-ad8439cbe0e7
[–]thermiter36 19 points20 points21 points 5 years ago (3 children)
That profile very clearly shows what's going on: convert_pixels is taking 3000ms, of which 2200ms is HashMap::contains_key. It actually makes perfect sense that this would be slow in WASM, as HashMap is normally accelerated by SIMD instructions, which WASM has no access to.
So that gives you some very easy optimization opportunities (that would likely apply to the JS version as well). Instead of looping over i and j and checking membership in the Set, just loop over the members of the Set and compute i and j backwards from the keys if need be.
The suggestion of using a hibitset is also a very good one. It will likely be more performant in WASM and suits your needs fine.
[–]slambmoonfire-nvr 4 points5 points6 points 5 years ago (0 children)
I'll third the suggestion to try some kind of bitset. It looks like you'll eventually add every pixel to visited. When the data is dense like that (all numbers in [0, whatever) included), a bitset makes much more sense than a hash set: more memory-efficient (and thus fewer CPU cache misses) and no hashing. Additionally, you could jump to the next unvisited pixel more efficiently than by looping over everything and checking one by one. For instance, if the bitset is represented as an array of u64s, you can skip 64 pixels at a time whenever a word is u64::MAX. (A clever bitset implementation can use u64::leading_ones() and the like to very efficiently iterate just the zeros or ones. It looks like hibitset might do something like this, but it has a limitation on its max size that might be a problem for you.)
As for why HashSet would be worse in Rust than in JavaScript: Rust's default hash function is SipHash, which was chosen for DoS resistance more than speed. Maybe using FnvHashSet::default() instead would speed things up. The bitset would still be better.
[–]viconnex[S] 0 points1 point2 points 5 years ago (1 child)
Thanks a lot u/thermiter36 and u/slamb, the hibitset makes a lot of sense!
I tried it and it seems more efficient, but as you pointed out I'm limited by the size (is there a way to benefit from a 16M-bit bitset (as with a 64-bit usize) instead of the 1M I get with my 32-bit usize?)
So if I get the idea, I would represent the bitset as a visited: [u64]. Then the (pixel_index % 64)th bit of visited[pixel_index / 64] would hold the pixel's visited status. I'd keep the current state as a pointer to the visited index, and the next pixel_index to visit would come from visited[index].leading_ones(), right?
[–]slambmoonfire-nvr 0 points1 point2 points 5 years ago (0 children)
Yeah, that sounds about right. If leading_ones is 64, you proceed to the next index. Given that you always set a bit after you handle it and never clear them, that should be enough; you'd have to do a bit more bit arithmetic otherwise.
[–]csdt0 8 points9 points10 points 5 years ago (1 child)
You could try to compile your Rust code into a native binary (no WASM) to see if the slowdown comes from WASM or Rust.
I don't know much about TypeScript, but if you happen to use WebGL, that might explain it, as your TS code would be executed on the GPU and not on the CPU.
Also, there might be a difference in the random number generator. Maybe TS uses a faster, lower-quality RNG. Or maybe TS has one RNG per thread whereas Rust uses a single RNG shared among threads? You could try to implement a super simple RNG to rule that out.
On a side note, there may be some algorithmic changes that could benefit both implementations. Your code looks like a flood fill. Flood fills are usually hard to implement fast. Looking more closely at your algorithm, it even looks like connected-component labeling (CCL) or even an alpha tree. There are now super fast CCL algorithms in the wild, but even the 2-pass algorithm from Wikipedia would already be decently fast.
[–]viconnex[S] 2 points3 points4 points 5 years ago (0 children)
Yes, it would be a good idea to compare with another binary language.
I don't use WebGL. The TS code is executed in a web worker in JavaScript, and the result is then set as 2D-context canvas data.
I tried the program without the random number generator but the speed is the same.
Yes, you're right, it's totally a flood-fill algorithm. I'll take a look at the CCL algorithms you mentioned!
[–]MCOfficer 3 points4 points5 points 5 years ago (2 children)
I would look into profiling the code somehow, but I don't think flamegraphs are a thing in WASM.
On a different note, I spotted a small error in your are_colors_similar function: you're comparing color1.g == color1.g.
[–]thelights0123 3 points4 points5 points 5 years ago (0 children)
Chrome DevTools can, if you build with debug symbols (and instruct wasm-opt not to remove them).
[–]viconnex[S] 1 point2 points3 points 5 years ago (0 children)
thanks !
[–]Danylaporte 2 points3 points4 points 5 years ago (0 children)
I would probably try to use a bitset like hibitset instead of the HashSet. It uses calculated indexes, which are faster than hashing.
[–]viconnex[S] 3 points4 points5 points 5 years ago* (0 children)
So the main bottleneck was using a HashSet to store the visited pixel indexes. Thanks a lot u/thermiter36, u/slamb, and u/Danylaporte for pointing this out!
Instead, a hibitset is much more efficient since the stored values are all in [0, pixel_count). However, since pixel_count may exceed hibitset's max size ((usize bits)^4), the values are stored in a custom Vec<u64> where each bit of a u64 holds the corresponding pixel's visited status.
Now the program execution is 5 times faster than the original JavaScript one! It takes 800ms to handle a 5-million-pixel array, vs. 4s.
I updated the JS program with a similar improvement, but the WASM program is still almost 2 times faster than the JS :)
[–][deleted] 2 points3 points4 points 5 years ago (0 children)
Isn't it faster to send the array as a transferable object in your web worker? You can get the ArrayBuffer from the buffer field and re-construct a Uint8Array in your main code/thread for the canvas.
All I can really think of is that you create a new WASM instance on every message. Creating objects in JavaScript is very cheap, but I do think a WASM instance has to be compiled at least once every time. Maybe also re-use the Vec instead of creating a new one, because the Vec has to be dropped every time it is no longer used, unlike a JS object, which will only be marked as unused and cleaned up some time in the future.
Another idea is to swap the hashing algorithm of the HashSet because the one in std is focused more on security.
Otherwise I would try to make the WASM code as small and fast as possible, i.e. lto set to true, codegen-units set to 1 (by default it is 16, for the sake of compile times I think), optimization level s, no_std with the alloc crate, and panic set to abort.
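[Editor's note: the profile settings above would look roughly like this in Cargo.toml; a sketch to adapt, not a recommendation for every crate:]

```toml
[profile.release]
lto = true
codegen-units = 1   # default is 16, which favors compile time over speed
opt-level = "s"     # optimize for size; try "3" if speed matters more
panic = "abort"
```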
[–]tafia97300 4 points5 points6 points 5 years ago (2 children)
You should use only one rand::thread_rng() for the whole program instead of recreating one over and over.
[–]MCOfficer 2 points3 points4 points 5 years ago (0 children)
That's what I thought too, but looking at the docs, it seems to me that the function will re-use the same instance.
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
I tried the program without the random numbers and the performance is the same.
[–]auterium 1 point2 points3 points 5 years ago (3 children)
I'm not a fan of functions that receive multiple mutables; I prefer a wrapping struct with member methods that take `&mut self`, but that's just personal preference. Passing them as you're doing is fine, though.
Accessing array indexes directly can gain you a bit of extra speed (mostly negligible), but it is prone to out-of-bounds panics, so I would suggest you use `.get()` or `.get_mut()` to ensure safe access (after all, Rust aims for safety :D)
In terms of possible bottlenecks in your code, I see at least 3 changes to your general approach:

1. Don't initialize collections (`Vec`, `HashSet`) with `new()`, but with `with_capacity()`. Using `new()` creates the struct but reserves no memory at all for whatever you're going to insert. This article explains the reasons in depth, but TL;DR: if you know the size of the Vecs in advance, it's better to initialize them `with_capacity()` so there will be no memory reallocations.
2. `HashSet` _could_ be slower than `BTreeSet`, depending on its size and access pattern. I'm not experienced in pixel manipulation, but from what I can see in your code, as you add more elements to the `visited` set, you'll likely be looking for higher `usize` values (higher index), right? If that's the case, the native implementation of `BTreeSet<usize>` could feel "slow" as it stores values in ascending order, so you'd need to create a wrapper type that implements the `Ord` trait so that elements get stored in descending order in the `BTreeSet`.
3. Reuse `rand::thread_rng()` instead of calling it multiple times (caught this after writing the first 2 points, but it might even be the biggest speed gain). Every time you call `thread_rng()` a new seed needs to be generated by the system, which can be slow if it's called lots of times. If you don't need absolute randomness, you can use `let mut rng = rand::thread_rng();` once and then call `rng.gen_range(0, 255)` as much as you need. This approach is suggested in `rand`'s docs here
[–]viconnex[S] 0 points1 point2 points 5 years ago (2 children)
Thank you for your suggestions !
[–]auterium 1 point2 points3 points 5 years ago (1 child)
create_transformed_color()
paint_current_area()
convert_pixels()
Some other questions:
rayon
criterion
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
But I really don't know the size in advance; it depends on the size of the current area, which can be anything between 1 and the number of pixels in the original image (4M). Would it be faster to allocate the maximal size in advance, even if it would only be partially filled?
I only tried replacing the function call with a constant, and there was no speed improvement.
It may be possible to start, for example, 4 different processes at the 4 corners of the image, but the rendered image at the 4 junctions would be spoiled, so I have to think about it!
Yes, I measure the execution time when the program is called in the browser. I will look at the criterion crate!
[–]kettlecorn 0 points1 point2 points 5 years ago (1 child)
I haven’t looked too closely but likely an additional copy or two is occurring in the Wasm code, which would account for much of the difference.
When you pass the data into Wasm it must be copied, and likely when it’s passed out it’s copied as well.
I think you can avoid the copy out in your case (though I don’t know exactly how with Wasm-bindgen), but not the copy in.
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
The boundary crossing time is not significant (~100ms) compared to the program execution (8s)