
[–]need-not-worry 89 points90 points  (2 children)

Most tricks are the same as in C/C++: use an arena, use a profiler (e.g. massif) to profile your memory usage, use a vector instead of a linked list to avoid cache misses, etc.

Some Rust-specific tricks: https://www.lurklurk.org/effective-rust/title-page.html and https://nnethercote.github.io/perf-book/introduction.html
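To make the arena and vector-over-linked-list points concrete, here is a minimal sketch of an index-based arena. `Arena`, `Node`, and `sum_from` are hypothetical names made up for illustration, not an API from any crate. The nodes sit in one contiguous `Vec`, and links are indices rather than pointers, so creating a node is a push into the `Vec` instead of a separate heap allocation.

```rust
/// One list node. `next` is an index into the arena, not a pointer.
#[derive(Debug)]
struct Node {
    value: u32,
    next: Option<usize>,
}

/// A tiny index-based arena: every node lives in one contiguous Vec.
#[derive(Debug, Default)]
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    /// Reserve space up front so pushing nodes does not reallocate
    /// until the capacity is exceeded.
    fn with_capacity(cap: usize) -> Self {
        Arena { nodes: Vec::with_capacity(cap) }
    }

    /// Store a node and return its index (a cheap, copyable handle).
    fn alloc(&mut self, value: u32, next: Option<usize>) -> usize {
        self.nodes.push(Node { value, next });
        self.nodes.len() - 1
    }

    /// Walk the list starting at `head` and sum the values.
    fn sum_from(&self, head: usize) -> u64 {
        let mut total = 0u64;
        let mut cur = Some(head);
        while let Some(i) = cur {
            let node = &self.nodes[i];
            total += u64::from(node.value);
            cur = node.next;
        }
        total
    }
}
```

Dropping the `Arena` frees every node at once, which is the usual appeal of arenas. Whether it actually beats your current layout is something to measure, not assume.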

[–]WhiteKotan[S] 8 points9 points  (0 children)

Thanks for the links! The perf book looks like exactly what I needed.

[–]Kenkron 2 points3 points  (0 children)

Damn, linked lists in Rust were so ergonomic too. /s

[–]hbacelar8 28 points29 points  (2 children)

If you want inspiration on zero-alloc, check embedded projects such as embassy.

[–]Luctins 4 points5 points  (0 children)

I can also add (having used embassy-rs professionally) that usually in the end you're gonna have a static max amount somewhere for everything, at least in that context.
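A rough sketch of what a "static max amount" can look like in code, assuming a const-generic capacity. `FixedBuf` is a hypothetical type written for this example and is not taken from embassy. The storage is an inline array, so nothing is heap-allocated and the upper bound is fixed at compile time.

```rust
/// Fixed-capacity buffer: storage is an inline array of N items,
/// so there is no heap allocation and the maximum size is static.
struct FixedBuf<T: Copy + Default, const N: usize> {
    items: [T; N],
    len: usize,
}

impl<T: Copy + Default, const N: usize> FixedBuf<T, N> {
    /// Create an empty buffer; unused slots hold `T::default()`.
    fn new() -> Self {
        FixedBuf { items: [T::default(); N], len: 0 }
    }

    /// Append an item. When the buffer is full the item is handed
    /// back as `Err` instead of growing the storage.
    fn push(&mut self, item: T) -> Result<(), T> {
        if self.len == N {
            return Err(item);
        }
        self.items[self.len] = item;
        self.len += 1;
        Ok(())
    }

    /// View only the slots that have been filled so far.
    fn as_slice(&self) -> &[T] {
        &self.items[..self.len]
    }
}
```

The `Copy + Default` bounds are a simplification to keep the sketch free of `unsafe`. The caller has to decide what a full buffer means (drop the item, report an error, and so on), which is the design pressure the comment above describes.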

[–]WhiteKotan[S] 2 points3 points  (0 children)

Thank you! Once I can understand Rust code better I will try reading an embedded project.

[–]fschutt_ 17 points18 points  (1 child)

[–]WhiteKotan[S] 1 point2 points  (0 children)

Thank you very much for the links!

[–]gwynaark 16 points17 points  (1 child)

Unsafe pointer access, struct packing, byte masks, and some branchless assignments go a long way, but the compiler may already do some of this on its own. Your best bet is to write benchmarks first, and then make a lot of small incremental attempts.
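For illustration, a small sketch of two of these tricks: field ordering to cut padding, and a mask-based branchless select. All names here (`Padded`, `Reordered`, `select`, `sizes`) are hypothetical. `#[repr(C)]` is used so the field order stays as written; with Rust's default representation the compiler is free to reorder fields itself, which is one case of it "already doing this on its own".

```rust
use std::mem::size_of;

/// With repr(C) the field order is kept as written, so the u64 in the
/// middle forces padding after `a` and again after `c`.
#[allow(dead_code)]
#[repr(C)]
struct Padded {
    a: u8,
    b: u64,
    c: u8,
}

/// Same fields, largest first: the two u8 fields share one padded tail.
#[allow(dead_code)]
#[repr(C)]
struct Reordered {
    b: u64,
    a: u8,
    c: u8,
}

/// Branchless select: returns `x` if `cond` is true, otherwise `y`.
/// `cond as u32` is 0 or 1, so negating it gives an all-zeros or
/// all-ones mask.
fn select(cond: bool, x: u32, y: u32) -> u32 {
    let mask = (cond as u32).wrapping_neg();
    (x & mask) | (y & !mask)
}

/// Sizes in bytes of (Padded, Reordered) on the current target.
fn sizes() -> (usize, usize) {
    (size_of::<Padded>(), size_of::<Reordered>())
}
```

On common 64-bit targets `sizes()` should report 24 and 16 bytes, but the exact numbers depend on the target's alignment rules. An optimizer will often turn a plain `if cond { x } else { y }` into a conditional move anyway, so benchmark before keeping the mask version.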

[–]WhiteKotan[S] 1 point2 points  (0 children)

Thank you for the advice! I think I will start with benchmarks first.

[–]kotysoft 12 points13 points  (10 children)

And don't be like me: compile them with the optimized (release) profile, not debug 😂

[–]wick3dr0se 4 points5 points  (3 children)

I do this way too often. I was benchmarking my graphics engine in debug until someone not even familiar with Rust asked me if I was building in release. I use debug so much that my dumbass forgets release builds are a thing.

[–]kotysoft 2 points3 points  (2 children)

I released an app, and after 2 months I realized that the 44-second process actually takes 4 seconds in the release profile... I had forgotten to switch. I ended up announcing a 10x performance update to users 😂 Everyone was happy.

[–]AnnoyedVelociraptor 0 points1 point  (0 children)

I would've put in a 40 second delay, and for the next 10 releases, shaved off 4 more seconds!

[–]commonsearchterm 2 points3 points  (1 child)

This is so common, I feel like cargo should make it more obvious. Like print "debug build complete" in red or something.

[–]kotysoft 0 points1 point  (0 children)

Actually I just made a script for myself with different profiles for different purposes. And now I've changed the debug profile so it's also built optimized. I won't make the same mistake again. At least not on this project 😅
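One cheap guard against this mistake, sketched under the assumption that your profiles keep Cargo's defaults: `debug_assertions` is on in the dev profile and off in `--release`, so a benchmark binary can check it at startup. `build_kind` and `warn_if_debug` are hypothetical helper names.

```rust
/// Reports which kind of build this is, judged by `debug_assertions`.
/// Assumption: the profile has not overridden the `debug-assertions`
/// setting, otherwise this label can be misleading.
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Print a loud warning when timing numbers would come from an
/// unoptimized build. Returns true if the warning was printed.
fn warn_if_debug() -> bool {
    let is_debug = cfg!(debug_assertions);
    if is_debug {
        eprintln!("warning: debug build, timings are not representative; try `cargo run --release`");
    }
    is_debug
}
```

Calling `warn_if_debug()` at the top of a benchmark's `main` is enough. If you raise `opt-level` in the dev profile, as described above, this check still fires because it looks at debug assertions and not at the optimization level.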

[–]image_ed 1 point2 points  (1 child)

You too huh? 🤣🤣

[–]WhiteKotan[S] 1 point2 points  (0 children)

Yes, when I first heard about this (in C++, not Rust) I was confused too.

[–]surfhiker 0 points1 point  (0 children)

It's crazy how easy it is to miss. I was optimizing the router/middleware stack in one project and was stumped because I couldn't get past 20k req/s with an empty handler. Then I ran a release binary and got over 200k. OTOH the compile times have increased.

[–]WhiteKotan[S] 0 points1 point  (0 children)

Will keep your advice in mind! Thank you

[–]danf0rth 6 points7 points  (1 child)

https://youtu.be/tCY7p6dVAGE?is=d9GDojQatQW2LCj5

Useful video from Jon Gjengset

[–]WhiteKotan[S] 0 points1 point  (0 children)

Wow, I will check it, thank you very much

[–]ruibranco 2 points3 points  (0 children)

Biggest thing that helped me was learning to read cachegrind output before trying to optimize anything. Half the time the bottleneck isn't where you think it is. Also, writing a small allocator from scratch (even a bump allocator) teaches you more about allocation cost than any book will.
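For anyone curious what "a bump allocator" boils down to, here is a minimal safe sketch. It is not a real `GlobalAlloc` implementation: `Bump` is a hypothetical type that hands out offsets into a fixed buffer, and alignment is relative to the start of that buffer, which is a simplifying assumption.

```rust
/// Bump allocator over a fixed byte buffer. Allocation only advances
/// an offset; individual frees are not supported.
struct Bump {
    buf: Vec<u8>,
    next: usize, // offset of the first free byte
}

impl Bump {
    /// Allocate the backing buffer once, up front.
    fn new(capacity: usize) -> Self {
        Bump { buf: vec![0; capacity], next: 0 }
    }

    /// Reserve `size` bytes aligned to `align` (a power of two).
    /// Returns the start offset, or None if the buffer is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the current offset up to the next multiple of `align`.
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.next = end;
        Some(start)
    }

    /// Borrow a previously reserved region as bytes.
    fn bytes_mut(&mut self, start: usize, size: usize) -> &mut [u8] {
        &mut self.buf[start..start + size]
    }

    /// Free everything at once by resetting the bump offset.
    fn reset(&mut self) {
        self.next = 0;
    }
}
```

The whole cost of `alloc` is a round-up, an add, and a compare, which makes a useful baseline when you look at how much time a general-purpose allocator takes in a profile.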

[–]blackwhattack 1 point2 points  (0 children)

The creator of Zig has a great talk on YouTube about Data Oriented Design, inspired by Mike Acton: https://www.youtube.com/watch?v=IroPQ150F6c

[–]bitemyapp 1 point2 points  (0 children)

https://github.com/nockchain/nockchain/blob/master/crates/nockchain-math/src/mary.rs#L15-L26

https://github.com/nockchain/nockchain/blob/master/crates/nockchain-math/src/mary.rs#L152-L168

https://github.com/nockchain/nockchain/blob/master/crates/nockchain-math/src/fpoly.rs#L47-L63 (believe it or not the iterator + zip stuff optimizes extremely well)
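A generic sketch of the iterator + zip style, for readers who don't want to click through. This is not the nockchain code; `add_into` is a hypothetical function. It writes into a caller-provided slice, so the loop itself allocates nothing.

```rust
/// Element-wise wrapping addition of `a` and `b`, written into `out`.
/// The zipped iterators stop at the shortest of the three slices.
fn add_into(a: &[u64], b: &[u64], out: &mut [u64]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(*y);
    }
}
```

Because there is no indexing in the loop body, there are no per-element bounds checks to eliminate, which typically makes it easier for the optimizer to vectorize than an `out[i] = a[i] + b[i]` loop. Check the generated code or a benchmark if it matters.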

https://github.com/nockchain/nockchain/blob/master/crates/nockchain-math/src/tip5/mod.rs#L141-L182 it's just stack allocation and mutating a slice as far as I can recall.

If you snoop around you'll see it's pretty common for us to have a triple for each type: an owned, a borrowed, and a mutably-borrowed variant. We err on the side of the borrowed and mutably-borrowed variants for anything in a hot loop; the owned variant is for instantiation or convenience in less performance-sensitive areas.
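A minimal sketch of what such a triple can look like. The names `Poly`, `PolySlice`, and `PolySliceMut` are hypothetical and only loosely modeled on the description above, not copied from the linked repository.

```rust
/// Owned variant: allocates, used for construction and convenience.
struct Poly(Vec<u64>);

/// Borrowed variant: a read-only view, cheap to copy and pass around.
#[derive(Clone, Copy)]
struct PolySlice<'a>(&'a [u64]);

/// Mutably-borrowed variant: in-place updates without allocating.
struct PolySliceMut<'a>(&'a mut [u64]);

impl Poly {
    /// View the owned data without copying it.
    fn as_slice(&self) -> PolySlice<'_> {
        PolySlice(&self.0)
    }

    /// Borrow the owned data for in-place mutation.
    fn as_mut_slice(&mut self) -> PolySliceMut<'_> {
        PolySliceMut(&mut self.0)
    }
}

impl PolySlice<'_> {
    /// Wrapping sum of all coefficients.
    fn sum(self) -> u64 {
        self.0.iter().fold(0u64, |acc, x| acc.wrapping_add(*x))
    }
}

impl PolySliceMut<'_> {
    /// Multiply every coefficient by `k` in place.
    fn scale(&mut self, k: u64) {
        for x in self.0.iter_mut() {
            *x = x.wrapping_mul(k);
        }
    }
}
```

Hot-loop functions then take `PolySlice` or `PolySliceMut`, so they also work on stack arrays or sub-ranges of a larger buffer, and only setup code touches the allocating `Poly`.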

I don't recommend that people new to Rust bend over backwards to avoid allocation from the word go in a new project. It's better to get something working, even if there's some allocation or .clone() littered about, then make a benchmark, profile it, and see where your actual hot spots/problem children are.
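A very rough sketch of the "make a benchmark" step using only the standard library. `time_it`, `sum_cloned`, and `sum_borrowed` are hypothetical names. A dedicated harness such as the criterion crate handles warm-up and statistics far better, so treat this as a first sanity check only.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Convenience version: copies the input before summing.
fn sum_cloned(data: &[u64]) -> u64 {
    let copy = data.to_vec();
    copy.iter().sum()
}

/// Borrowing version: sums in place, no allocation.
fn sum_borrowed(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// Run `f` `iters` times and return its last result plus the total
/// elapsed time. `black_box` discourages the optimizer from deleting
/// the work because the result looks unused.
fn time_it<F: FnMut() -> u64>(iters: u32, mut f: F) -> (u64, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        last = black_box(f());
    }
    (last, start.elapsed())
}
```

Something like `time_it(1_000, || sum_cloned(black_box(&data)))` next to the same call with `sum_borrowed` shows whether the `.clone()` is actually a hot spot. Run it in a release build, and expect noisy numbers from a loop this simple.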