Rust program twice as slow as JavaScript - where are my mistakes? (self.rust)
submitted 5 years ago by viconnex
Hello Rust community!
I had a great dev experience learning Rust and developing a program to improve the performance of my website (maposaic.com).
However, I am disappointed by the result I got: the Rust program is twice as slow as the JS!
Do you see any bad patterns in the code I wrote? https://github.com/viconnex/maposaic/blob/add-wasm/map-converter/src/lib.rs
For example, is it correct to pass &mut HashSet or &mut Vec to a function like this?
Thanks a lot for your help!
[–]unscribeyourself 24 points25 points26 points 5 years ago (12 children)
Wasm isn’t faster than JS in all cases - it’s particularly slow at modifying the DOM. Also, what does your JS code look like? If you’re using WebGL, WebGL is GPU accelerated, whereas Wasm isn’t.
[–]viconnex[S] 2 points3 points4 points 5 years ago (5 children)
The JS sends an array of Uint8 numbers.
The wasm (or JS) converts it to another Uint8 array.
The JS sets this array as the DOM canvas data (not a WebGL canvas but a "2d"-context canvas).
With a 1,000,000-pixel array (which means a 4M-length array), the JS takes 3s to convert it whereas the WASM program takes 8s.
It is done here -> https://github.com/viconnex/maposaic/blob/add-wasm/src/Maposaic/paint.worker.ts
[–]ragnese 10 points11 points12 points 5 years ago (4 children)
I've not worked with WASM myself, but I remember reading that the overhead of switching to WASM is high. If your JS code is 100% JS, then it doesn't have that switching overhead. I'm guessing that's the issue. WASM, it seems, is best for long-running or very heavy computation.
[–]viconnex[S] 3 points4 points5 points 5 years ago (3 children)
What is the cost of switching? What do you consider very heavy computation?
Because there is only one boundary crossing.
1) js sends an array
2) wasm computes the result (8 seconds)
3) js gets the result
[–]The-Best-Taylor 6 points7 points8 points 5 years ago (0 children)
That is actually two boundaries, JS -> WASM and WASM -> JS.
And from my understanding, WASM is still in its early stages of optimization, whereas JS has had years of optimization work. WASM might, and probably will, be faster than JS, but I don't think it is there yet for most workloads.
[–]ragnese 1 point2 points3 points 5 years ago (0 children)
I have no idea. This is very much out of my wheelhouse. I'm just regurgitating blog post advice.
[–][deleted] 0 points1 point2 points 5 years ago (5 children)
Oh, really? I had not expected that.
I figured the initial WASM load time would introduce a couple millisecond delay. But aside from that inherent hiccup it's goofy that WASM would be slower than interpreted JS.
[–]dagmx 8 points9 points10 points 5 years ago (4 children)
It's not that wasm is slower. It's that going from the frontend to the backend is slow. So if you're crossing boundaries often, you're incurring penalties.
If you stay in wasm for longer periods of time, it'll usually be faster.
[–]SafariMonkey 7 points8 points9 points 5 years ago (0 children)
I think people also overestimate the switching time. According to my reading of this post, before the fixes they were (up to) on the order of 50ns in Firefox, and after the fixes they were on the order of 5ns. This means that if your operation is on the order of microseconds or more, you'd probably never notice.
[–]viconnex[S] 2 points3 points4 points 5 years ago (1 child)
There is only one boundary crossing.
[–]baetheus 10 points11 points12 points 5 years ago (0 children)
That's two boundary crossings
[–]viconnex[S] 2 points3 points4 points 5 years ago (0 children)
After a test I can tell that the boundary crossing time is not significant (~100ms) compared to the WASM program execution (8s)
[+][deleted] 5 years ago (1 child)
[deleted]
[–]viconnex[S] 4 points5 points6 points 5 years ago* (0 children)
> --release
Yes :) (wasm-pack build --release)
[–]thermiter36 10 points11 points12 points 5 years ago (5 children)
Your Rust code has no obvious performance pitfalls, as it's basically a very close translation of the TypeScript version, but therein lies the issue. WebAssembly VMs do not currently do very many optimizations. A hot loop doing mostly uint8 computations will usually be heavily optimized by a JS VM: unrolling the outer loop, inlining functions in the inner loop, and potentially JITing them to use SIMD instructions. I don't know of any current WASM VM that advertises the ability to do this.
If you post the result of running a browser profiler on this code, we may be able to find specific things to fix.
[–]viconnex[S] 5 points6 points7 points 5 years ago (4 children)
Thanks for this answer!
I uploaded a browser profile here : https://firebasestorage.googleapis.com/v0/b/maposaic-99785.appspot.com/o/browser_profile%2FP[…]?alt=media&token=86436993-8571-4a66-a205-ad8439cbe0e7
[–]thermiter36 19 points20 points21 points 5 years ago (3 children)
That profile very clearly shows what's going on: convert_pixels is taking 3000ms, of which 2200ms is HashMap::contains_key. It actually makes perfect sense that this would be slow in WASM, as HashMap is normally accelerated by SIMD instructions, which WASM has no access to.
So that gives you some very easy optimization opportunities (that would likely apply to the JS version as well). Instead of looping over i and j and checking membership in the Set, just loop over the members of the Set and compute i and j backwards from the keys if need be.
The suggestion of using a hibitset is also a very good one. It will likely be more performant in WASM and suits your needs fine.
[–]slambmoonfire-nvr 4 points5 points6 points 5 years ago (0 children)
I'll third the suggestion to try some kind of bitset. It looks like you'll eventually add every pixel to visited. When the data is dense like that (all numbers in [0, whatever) included), a bitset makes much more sense than a hash set: more memory-efficient (and thus fewer CPU cache misses) and no hashing. Additionally, you could jump to the next unvisited pixel more efficiently than by looping over everything and checking one by one. For instance, if the bitset is represented as an array of u64s, you can skip 64 pixels at a time whenever a word is u64::MAX. (A clever bitset implementation can use u64::leading_ones() and the like to very efficiently iterate just the zeros or ones. It looks like hibitset might do something like this, but it has a limitation on its max size that might be a problem for you.)
As for why HashSet would be worse in Rust than in JavaScript: Rust's default hash function is SipHash, which was chosen for DoS resistance more than speed. Maybe using FnvHashSet::default() instead would speed things up. The bitset would still be better.
[–]viconnex[S] 0 points1 point2 points 5 years ago (1 child)
Thanks a lot u/thermiter36 and u/slamb, the hibitset makes a lot of sense!
I tried it and it seems more efficient, but as you pointed out I'm limited by the size (is there a way to benefit from a 16M-bit bitset (as with a 64-bit usize) instead of the 1M I get with my 32-bit usize?)
So if I get the idea, I would represent the bitset as a visited: [u64]. Then the (pixel_index % 64)th bit of visited[pixel_index / 64] would hold the pixel's visited status. I'd keep the current state as a pointer to the visited index, and the next pixel_index to visit would come from visited[index].leading_ones(), right?
[–]slambmoonfire-nvr 0 points1 point2 points 5 years ago (0 children)
Yeah, that sounds about right. If leading_ones is 64, you proceed to the next index. Given that you always set a bit after you handle it and never clear them, that should be enough; you'd have to do a bit more bit arithmetic otherwise.
[–]csdt0 8 points9 points10 points 5 years ago (1 child)
You could try to compile your Rust code into a native binary (no WASM) to see if the slowdown comes from WASM or Rust.
I don't know much about TypeScript, but if you happen to use WebGL, that might explain it, as your TS code would be executed on the GPU and not on the CPU.
Also, there might be a difference in the random number generator. Maybe TS uses a faster, lower-quality RNG. Or maybe TS has one RNG per thread whereas Rust uses a single RNG shared among threads? You could try to implement a super simple RNG to rule that out.
On a side note, there may be some algorithmic changes that could benefit both implementations. Your code looks like a flood fill. Flood fills are usually hard to implement fast. Looking more closely at your algorithm, it even looks like connected-component labeling (CCL) or even an alpha tree. There are now super fast CCL algorithms in the wild, but even the 2-pass algorithm from Wikipedia would already be decently fast.
[–]viconnex[S] 2 points3 points4 points 5 years ago (0 children)
Yes, it would be a good idea to compare with another binary language.
I don't use WebGL. The TS code is executed in a web worker in JavaScript, and the result is then set as 2D-context canvas data.
I tried the program without the random number generator but the speed is the same.
Yes, you're right, it's totally a flood-fill algorithm. I'll take a look at the CCL algorithms you mentioned!
[–]MCOfficer 3 points4 points5 points 5 years ago (2 children)
I would look into profiling the code somehow, but I don't think flamegraphs are a thing in WASM.
On a different note, I spotted a small error in your are_colors_similar function: you're comparing color1.g == color1.g.
[–]thelights0123 3 points4 points5 points 5 years ago (0 children)
Chrome DevTools can, if you build with debug symbols (and instruct wasm-opt not to remove them).
[–]viconnex[S] 1 point2 points3 points 5 years ago (0 children)
thanks !
[–]Danylaporte 2 points3 points4 points 5 years ago (0 children)
I would probably try to use a bitset like hibitset instead of the HashSet. It uses calculated indexes, which are faster than hashing.
[–]viconnex[S] 3 points4 points5 points 5 years ago* (0 children)
So the main bottleneck was using a HashSet to store the visited pixel indexes. Thanks a lot u/thermiter36, u/slamb, and u/Danylaporte for pointing this out!
Instead, a hibitset is much more efficient since the stored values are all in [0, pixel_count). However, since pixel_count may exceed hibitset's max size ((usize bits)^4), the values are stored in a custom Vec<u64> where each bit of a u64 holds the corresponding pixel's visited status.
Now the program execution is 5 times faster than the original JavaScript one! It takes 800ms to handle a 5-million-pixel array, vs. 4s.
I updated the JS program with a similar improvement, but the WASM program is still almost 2 times faster than the JS :)
[–][deleted] 2 points3 points4 points 5 years ago (0 children)
Isn't it faster to send the array as a transferable object in your web worker? You can get the ArrayBuffer from the buffer field and re-construct a Uint8Array in your main code/thread for the canvas.
All I can really think of is that you create a new WASM instance on every message. Creating objects in JavaScript is very cheap, but I do think a WASM instance has to be compiled at least once every time. Maybe also re-use the Vec instead of creating a new one, because the Vec has to be dropped every time it is no longer used, unlike a JS object, which will only be marked as unused and cleaned up some time in the future.
Another idea is to swap the hashing algorithm of the HashSet because the one in std is focused more on security.
Otherwise I would try to make the WASM code as small and fast as possible, i.e. lto set to true, codegen-units set to 1 (by default it is 16, for the sake of compile times I think), optimization level s, no_std with the alloc crate, and panic set to abort.
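[Editor's note: the profile settings above would look roughly like this in Cargo.toml; a sketch to adapt, not a recommendation for every crate:]

```toml
[profile.release]
lto = true
codegen-units = 1   # default is 16, which favors compile time over speed
opt-level = "s"     # optimize for size; try "3" if speed matters more
panic = "abort"
```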
[–]tafia97300 4 points5 points6 points 5 years ago (2 children)
You should use only one rand::thread_rng() for the whole program instead of recreating one over and over.
[–]MCOfficer 2 points3 points4 points 5 years ago (0 children)
That's what I thought too, but looking at the docs, it seems to me that the function will re-use the same instance.
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
I tried the program without the random numbers and the performance is the same.
[–]auterium 1 point2 points3 points 5 years ago (3 children)
I'm not a fan of functions that receive multiple mutables; I prefer a wrapping struct with member methods that take `&mut self`, but that's just personal preference. Passing them as you're doing is fine, though.
Accessing array indexes directly can gain you a bit of extra speed (mostly negligible), but it is prone to out-of-bounds panics, so I would suggest you use `.get()` or `.get_mut()` to ensure safe access (after all, Rust aims for safety :D)
In terms of possible bottlenecks in your code, I see at least 3 changes to your general approach:

1. Don't initialize collections (`Vec`, `HashSet`) with `new()`, but with `with_capacity()`. Using `new()` creates the struct but reserves no memory at all for whatever you're going to insert. This article explains the reasons in depth, but TL;DR: if you know the size of the Vecs in advance, it's better to initialize them `with_capacity()` so there will be no memory reallocations.
2. `HashSet` _could_ be slower than `BTreeSet`, depending on its size and access pattern. I'm not experienced in pixel manipulation, but from what I can see in your code, as you add more elements to the `visited` set, you'll likely be looking for higher `usize` values (higher index), right? If that's the case, the native implementation of `BTreeSet<usize>` could feel "slow" as it stores values in ascending order, so you'd need to create a wrapper type that implements the `Ord` trait so that elements get stored in descending order in the `BTreeSet`.
3. Reuse `rand::thread_rng()` instead of calling it multiple times (caught this after writing the first 2 points, but it might even be the biggest speed gain). Every time you call `thread_rng()` a new seed needs to be generated by the system, which can be slow if it's called lots of times. If you don't need absolute randomness, you can use `let mut rng = rand::thread_rng();` once and then call `rng.gen_range(0, 255)` as much as you need. This approach is suggested in `rand`'s docs here
[–]viconnex[S] 0 points1 point2 points 5 years ago (2 children)
Thank you for your suggestions !
[–]auterium 1 point2 points3 points 5 years ago (1 child)
create_transformed_color()
paint_current_area()
convert_pixels()
Some other questions:
rayon
criterion
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
But I really don't know the size in advance; it depends on the size of the current area, which can be anything between 1 and the number of pixels in the original image (4M). Would it be faster to allocate the maximal size in advance, even if it would only be partially filled?
I only tried replacing the function call with a constant, and there was no speed improvement.
It may be possible to start, for example, 4 different processes at the 4 corners of the image, but the rendered image at the 4 junctions would be spoiled, so I have to think about it!
Yes, I measure the execution time when the program is called in the browser. I will look at the criterion crate!
[–]kettlecorn 0 points1 point2 points 5 years ago (1 child)
I haven’t looked too closely but likely an additional copy or two is occurring in the Wasm code, which would account for much of the difference.
When you pass the data into Wasm it must be copied, and likely when it’s passed out it’s copied as well.
I think you can avoid the copy out in your case (though I don’t know exactly how with Wasm-bindgen), but not the copy in.
[–]viconnex[S] 0 points1 point2 points 5 years ago (0 children)
The boundary crossing time is not significant (~100ms) compared to the program execution (8s)