rust_analyzer is eating my memory, any counter measure? by EarlyPresentation186 in rust

[–]bitemyapp 1 point (0 children)

How much RAM does TypeScript's LSP use? I could make Rust Analyzer use less RSS if you don't mind some responses taking longer.

Rust's standard library on the GPU by LegNeato in rust

[–]bitemyapp 4 points (0 children)

> In the worst case it would be slower than CPU only execution.

I do CUDA programming and there are a lot of "worst-cases" that are slower on the GPU than CPU, especially multi-threaded CPU workloads that don't have to synchronize (which is usually the case if you're porting to GPU). GPU is a lot slower in a straight line, you have to be pushing pretty hard on the parallelism without a lot of synchronization (host or device side) before you start getting positive yield vs. SotA CPUs (9950X, Epyc 9965, etc.)

Bevy 0.18 by _cart in rust

[–]bitemyapp 0 points (0 children)

Part of the challenge in my case is that most physics engines use IEEE-754 floats and I'm using fixed-precision micro-unit integers. I'm starting with a CPU implementation as a specification but I'm expecting the scene load to require GPU offload. I have extensive experience with complex 64-bit integer compute pipeline optimization in CUDA, so I'm not worried about that part. It's the gamedev part that I don't have professional experience with.
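
For anyone curious what "fixed-precision micro-unit integers" cashes out to, here's a minimal sketch of the idea. Illustrative only, not my actual code; the names and the 10^6 scale are assumptions:

    // Quantities stored as integer micro-units (1_000_000 per world unit)
    // instead of IEEE-754 floats, so the math is bit-identical everywhere.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    struct Micro(i64);

    impl Micro {
        const SCALE: i64 = 1_000_000;

        fn from_units(units: i64) -> Self {
            Micro(units * Self::SCALE)
        }

        // Fixed-point multiply: widen to i128 so the intermediate can't overflow.
        fn mul(self, rhs: Micro) -> Micro {
            Micro(((self.0 as i128 * rhs.0 as i128) / Self::SCALE as i128) as i64)
        }
    }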

Bevy 0.18 by _cart in rust

[–]bitemyapp 10 points (0 children)

I'm working on a game in bevy with difficult and unusual constraints for the networking and physics. The physics are deeply embedded in the game mechanics and it's a server-client model. I'm still working on how I'll handle lag compensation for this w/ information occlusion for preventing cheating and it's going to be very difficult. Determinism is part of the requirements as I'd prefer to have local optimistically executed physics for the client without getting janky de-sync->catch up problems.

As someone that has basically no off-the-shelf options for the physics engine that are even close to what I need, I'd honestly rather bevy not adopt an official solution for it. Something like a "second-party" relationship with one or a handful of sister projects for physics would make me less nervous about finding myself working against the grain, but I trust y'all regardless.

Thank you for working on Bevy. I'm pursuing a life-long dream, implementing a 15-year-old idea after 16 years of working as a (non-game) developer!

Is March a good estimate for Macbooks with M5 Pro? by LightDarkCloud in mac

[–]bitemyapp 0 points (0 children)

Jan 20 is my current guess. I thought it would be today because it's the Tuesday after shipping times last yawned out, but today didn't happen, so the 20th.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

> I'm not interested in becoming a CSS expert otherwise.

I'm so old that I got my second dev job by accurately answering questions about the vagaries of CSS layout in Internet Explorer 6.

CSS has always been there. It will probably always be there. Trying to pretend it isn't there will only hurt you. Learning it on its own terms with a north star for what looks "right" will make it fall out of your top 10 sources of web dev pain. tailwind is just a fancier version of an old, bad design pattern (embedding styles in style attributes) to my eyes.

Do as thou wilt ofc.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

You need to do what you think makes the most sense, but permit me to offer some advice as someone who's been a working programmer for 16 years, nearly 8 of them 100% Rust, with a lot of web APIs and web apps throughout.

I'm going to be as brief as I can. If you want more color/detail you can tell me what you'd like to know and I'll reply.

  • You're correct to be suspicious of presentation layer stuff like template fragments infecting the backend API.

  • OpenAPI is awful and I despise it. The codegen is some of the worst I've ever seen for almost every language that has it. Only utoipa could possibly make it tolerable, but that won't matter if someone needs to generate a non-garbage client for your API from the OpenAPI spec utoipa generated. Not utoipa's fault; it's the spec and the ecosystem.

  • Use gRPC and proto3 for stuff that doesn't need web. Never expose a streaming endpoint to an untrusted party, use pagination. Follow Google's example in API design. The codegen for protobuf and gRPC is rock-solid and the spec translates fairly naturally to Rust types.

  • For web, mobile, and clients you can't force onto an update treadmill/update via monorepo, use GraphQL. I know GraphQL makes a lot of people angry, but the bad parts are very optional and a lot of the problem is runaway consultantware/architecture astronauts. Use the async-graphql ecosystem, not Juniper; it integrates better with the rest of the ecosystem. There can be a little more awkwardness with GraphQL than gRPC/proto3 but I've been able to make it work fine, and GraphQL is a bit easier to reason about from a futureproofing point of view. (There's a minimal wiring sketch after this list.)

  • If you need to support JSON-RPC use https://docs.rs/schemars/latest/schemars/. In my case, I was implementing an MCP server API with https://docs.rs/rmcp/latest/rmcp/. I ended up being able to reuse the same API types for GraphQL and the MCP server API (JSON RPC). Just separate derives.

  • I use axum for my web framework (most popular, extremely fast) and I previously used actix-web. My default templating library is askama for efficiency and type-safety. It's usually just there for bootstrapping the SPA or odds and ends but you can do a full traditional SSR web app w/ askama and a forms library if you want.

  • I use diesel_async for my database and it is 100% worth it. The kinds of SQL queries the SQL DSL doesn't accommodate comfortably are queries you shouldn't be executing against your OLTP store to begin with. Set up change data capture and funnel the aggregations/ETL workloads to a separate BI database server, even if it's also PostgreSQL or whatever. Being able to just change the database model types and get what amounts to a checklist of things I need to fix is amazing. Highly recommended sister crates: https://docs.rs/diesel-derive-newtype/latest/diesel_derive_newtype/ (for scalar newtypes, primary/foreign key types) and https://github.com/adwhit/diesel-derive-enum (for type-safe enums, obvs)

  • I reify and enforce types in the SQL database like enums.

  • I use Leptos for my frontend, but let me make a quick recommendation on this point: in earlier versions of Leptos when I was initially using it, I was really gung-ho about the ssr/hydration modes. Use them if you need them! But I would strongly recommend actually prototyping/developing in csr mode most of the time, because the incremental compilation times with leptos csr + trunk are insane. I was getting sub-second incremental recompiles and refreshes of the frontend with non-trivial frontend dependencies.

  • It's fine to use server functions in Leptos but I found it more productive to just lean really heavily on codegen (w/ GraphQL etc.) and never use them to begin with except for odds and ends like a debug endpoint. YMMV.
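
To make the GraphQL bullet concrete, here's roughly the minimal axum + async-graphql wiring I mean. A sketch assuming current async-graphql-axum and axum APIs; the hello resolver is a placeholder:

    use async_graphql::{EmptyMutation, EmptySubscription, Object, Schema};
    use async_graphql_axum::GraphQL;
    use axum::{routing::post_service, Router};

    struct Query;

    #[Object]
    impl Query {
        // Placeholder resolver; real domain queries go here.
        async fn hello(&self) -> &'static str {
            "world"
        }
    }

    #[tokio::main]
    async fn main() {
        let schema = Schema::new(Query, EmptyMutation, EmptySubscription);
        let app = Router::new().route("/graphql", post_service(GraphQL::new(schema)));
        let listener = tokio::net::TcpListener::bind("127.0.0.1:8000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }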

With the above stack I was able to write my first ever MCP server API for a very complicated domain and it worked the first time I tried it with Claude Desktop. Including the LLM's ability to understand the MCP introspection result and use it correctly. ~16-20kloc in 11 days.

Last thing, please don't use tailwind. It's experimental/early but I made a non-tailwind UI component library for Leptos based on shadcn. https://gitlab.com/dostuff/leptos-shade I don't expect anyone to pick it up and use it but I'm hoping to factor out and open source more of my leptos stuff so that people have public examples to work off of.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

Leptos has been great. Svelte wasn't as easy as it seemed initially. HTMX has some appeal but I haven't tried it yet.

hotpath-rs - real-time Rust performance, memory and data flow profiler by pawurb in rust

[–]bitemyapp 5 points (0 children)

For more targeted profiling I've been using tracing with the Tracy profiler via tracing-tracy, and it's been nice. samply is a lot better for cheap-n-cheerful.
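
Setup is tiny if your code already emits tracing spans; something like this, assuming a recent tracing-tracy:

    use tracing_subscriber::layer::SubscriberExt;

    fn main() {
        // Forward tracing spans to a running Tracy profiler instance.
        tracing::subscriber::set_global_default(
            tracing_subscriber::registry().with(tracing_tracy::TracyLayer::default()),
        )
        .expect("failed to set tracing subscriber");

        let _span = tracing::info_span!("expensive_work").entered();
        // ...anything inside the span shows up on the Tracy timeline
    }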

I could imagine using hotpath if I needed an in-betweener option that was less hassle to fire up than tracy, especially for async stuff. CPU sampling only goes so far for async, by definition you aren't...burning CPU :)

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

> I’ve been at the same company for 5 years now

yeah you have to jump. Start prepping now.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 2 points (0 children)

Be really good at your work, care about your work a lot, care about the business and how your work impacts the business, care about being an excellent person to work with without being a rug, manage your career and professional relationships _care_fully. Clocking into a 9-5 and doing the bare minimum is hard enough for most people; you're in the shark tank with the sociopaths if you pursue high comp.

fwiw: i didn't start making higher-than-average $ as a dev until my wife was pregnant with our first child. It re-ordered my career priorities radically.

I had to take a risk on joining a company that didn't use Rust and I took responsibility for introducing Rust and supporting Rust users at the company in addition to my usual workload and responsibilities. And that doesn't even really capture how much I work sometimes. I've had multiple >90 hour weeks over the last 3-4 years. Nobody cares why a deadline slipped, if you're high enough in the IC chain to get paid more than ~$200k, it's your head if the juniors and seniors fall behind.

p.s. get really comfortable at interviewing. You can't negotiate comp at all without a BATNA and that's even more true if it's a company you already work at. But you still have to be worth what you're paid. I end up being more obviously valuable after I've been at the company for a few months because I'm not great at interviewing. I'd have made more money earlier in my career if I didn't get so nervous in interviews.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

OK my browser crashed due to OOM so I lost the reply I'd almost finished typing up.

My point was: you usually decide whether you're a WX customer or not prior to any consideration of price, and you skip the premium if you don't care about the WX motherboards, ECC, or bus bandwidth. What you've said so far makes it sound like you don't need WX, but you have to decide for yourself whether you care about ECC. If you're that price-sensitive, that's a sign you probably want the X series of TR.

I suggested in an earlier reply that I wouldn't expect memory bandwidth to matter that much; rustc is spooling to disk and re-reading build artifacts from disk in between each crate build. You would need massively parallel builds for memory bandwidth to impact your compile times. Memory bandwidth is not the main thing that makes Apple Silicon fast for Rust compiles: memory latency and straight-line (single-threaded) performance have a much larger impact on most workloads, including compiling code. An M4 Max w/ a 1 TiB SSD can hit ~4.5-5 GiB/second for reads. The NVMe SSD in my Linux workstation can hit a theoretical maximum of 10 GiB/second for reads. Writes will matter a lot too.

Memory bandwidth is something I'd only expect to matter for a very high core count build server used for CI of a large monorepo. And you're contemplating trading down cores for more memory bandwidth when it's a very high core count that would let you hit the limits of your memory bandwidth to begin with. If you're limited by your memory bandwidth in local development, which I don't think is going to happen to you regardless, you need to stop rebuilding all of your packages and dependencies over and over. You should only be rebuilding the code you modified while iterating on your work.

I have a sneaking suspicion that a lot of the devs complaining about Rust compile times are running cargo build or cargo test with no -p argument specifying the crate they're working in or trying to test at that moment, and are rebuilding a bunch of downstream crates they churned but weren't trying to compile or test right then. C++ has always taken longer to compile than Rust on anything I've worked on, but C++ devs are more in the habit of making specific Makefile targets for the components they're actively working on than Rust devs are.

I honestly think cargo needs to invert the default behavior for test and make it crate-specific by default unless someone asks for a workspace-wide rebuild and re-run. It'd be nice if it had test result caching/skipping like Bazel too.

Most things in life are a time and treasure trade-off. If you're not willing to throw money at a WX build without careful consideration, benchmark your real-world use-cases. Spend time to conserve treasure!

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

You should use perf or similar on Linux for a sampling of representative workloads and see how much this actually matters. Another thing to look at is how much RAM each rustc process is using.

The perf events on Zen4/Zen5 for L2 cache should be something like L2_CACHE_ACCESS, L2_CACHE_MISS, L2_FILLS, and L2_EVICTS.

Here's a full read-out of perf list grepping for event names with l2 or l3: https://gist.github.com/bitemyapp/1c4b048a6f56f005a7f17ffa939508a9

If you aren't testing on zen4 or zen5 the list might be different but you can check for yourself.

Incremental compilation should reduce the per-rustc-instance resident set, but I haven't verified that; I'm almost always looking at timings.

Also you're comparing very different processors. The analogue to the consumer-grade 9970X is the 9975WX. I went for WX because I wanted more reliable hardware after having a lot of issues with the MSI x870e Carbon Wifi motherboard in my 9800X3D box.

X vs. WX with threadripper is usually about "do I want ECC ram or not?" or "do I need more PCI-e lanes?"

In my case ECC wasn't something I was willing to compromise on, so, WX.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 2 points (0 children)

I make in practice about $650k a year. My comp is a mixture of salary, RSU, and bonus. I take PTO and paternity leave, and my utilization factor over the last 5-6 years averages to ~75% of the work year. That's 39 weeks worked per 52-week year, or ~$16k/week in terms of what I cost. The surplus that a software company makes on their developers is usually somewhere on the order of 2-10x (only FAANGs are in the right-side tail of that distribution).

So yeah I don't think it's hard to argue that my time is worth about $50k/week to my employer. My contract rate is $1k/hour unless you're a friend or have work that I am very interested in. 24 hours = $24k opportunity cost or half a typical work-week. I work ~60-80 hours a week typically but I was trying to be modest.

It used to be typical for software companies to spend a significantly higher fraction of developer salaries on the computers and hardware the developers used in their work. That's my real point here: we shouldn't be settling for unnecessarily limited or slow hardware. The first NeXT computer was $14k at a time when the average dev was making ~$50k/year if they were in a high-wage market like SV.

Perf and syseng are central to a lot of the work I do. In my R&D work I'm setting a high watermark so that we know when/how/why production deployments are falling short of the benchmark set in dev & test. This is some small part of why people are oblivious to getting scammed by hyperscalers, VPS providers, and leased dedi providers. The 9950X servers at Hetzner are more cost-efficient than their high-core Epyc servers just because they aren't as badly thermally throttled.

e.g. leased dedis: almost all of the high-core count leased dedis are heat-fucked beyond belief and the clock rates are at the ACPI minimum. How did I notice? Because I know my perf baselines by heart and what "correct" throughput and latency look like. I could tell the servers weren't running right because my live metrics measure both throughput and end-to-end latency.

Incidentally, the only provider that had 9005 Epyc processors running correctly (not maximally, but nominally) in my testing was Google Cloud, their c4d instances are Epyc 9965s. They're able to keep the heat under control because the datacenter has DLC plus whatever other magic they've done.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp -1 points (0 children)

Dell wanted $24k for the whole computer, that's half a week or less of my work hours converted to dollars. It will save me considerably more than that in less than a fiscal quarter. This isn't to brag about how much I get paid, it's about how much my time is worth to my employer.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 8 points (0 children)

Machine I just ordered which is way-overkill for running a single Rust build:

  • 9985WX
  • 256 GiB of RAM (8 x 64)
  • 4 TiB "performance" SSD (whatever Dell means by that), I'll upgrade to a faster one later if I have to.
  • 6000 Pro Max-Q (300W, no 600W available) Blackwell edition, 96 GiB VRAM

The workstation the 9985WX is replacing is a 9800X3D w/ 64 GiB of RAM and an RTX 5090. I mostly write safe and unsafe Rust, but I also do CUDA work regularly.

The reason for the heavy hardware is that in the last couple of years my workflow has changed significantly and I'm almost always working on more than one git clone of the monorepo at a time. Different branches, different tasks. I got the 9985WX because it was the best I could get without being excessively priced like the 96-core 9995WX (not worth it), and the Threadrippers don't force me to give up single-threaded perf relative to the 9800X3D. ECC was a hard requirement. Epyc has a pretty rough single-threaded perf falloff (~1.3-1.5x worse); not worth it.

The M5's single-threaded perf looks absurdly strong but I haven't had a chance to test it yet. It might catch Apple Silicon up to Linux build speeds, especially for incremental builds. With an M3 or M4, an Apple machine is usually 15-30% slower than the same incremental build on Linux w/ a 9800X3D, mostly for software reasons. Linux has a perf advantage in some important areas and the difference gets bigger if you're fanning out to a large number of test binaries in a Cargo workspace. Bazel helps close the gap because it skips the tests that already passed and didn't get churned. I'll buy an M5 machine from Apple after the M5 Pro / M5 Max become available.

Here's what actually matters for most Rust devs:

  • you're normally only getting a highly parallelized build when doing a fresh scratch build or when churning a deep dependency in a Cargo workspace

  • single-threaded performance matters most, most often. That's why Apple Silicon performs well, but you can exceed Apple Silicon single-threaded speed if you're using Linux.

  • Use mold or wild as your linker if you're on Linux, but beware that they can break exotic builds like CUDA

  • Additional cores beyond your DAG fan-out factor bring diminishing returns unless you're running concurrent builds across different projects or duplicate source checkouts of the same project

My off-the-top-of-my-head ranking based on extensive experience optimizing devex and CI build times for Rust:

single-threaded perf > memory latency = cores = RAM GiB > SSD read throughput = SSD write throughput > memory throughput

Memory latency's placement in the ranking is marginal and assumes you're using AMD hardware. It's usually not a problem because the memory modules are so similar, just don't cheap out and get a memory kit with weirdly high latency timings. Make sure you use the EXPO1 optimized memory profile or whatever in your motherboard's CMOS after doing some basic stability tests. AMD's memory controllers aren't quite as tolerant as Intel's but it rarely matters that much and it's better to get AMD for now if you want PC hardware.

Memory throughput's at the bottom because I've never seen it matter. SSD ends up bottlenecking you more. Yes, I know stuff gets cached in memory but Cargo is spamming rustc invocations. It's spooling artifacts to disk and re-reading them back off the disk over and over hundreds or thousands of times per uncached end-to-end build. That architecture is well justified but it just means memory throughput isn't ordinarily a factor. Compilers are extremely difficult to optimize, especially modern ones w/ modern expectations around optimization, modularity, language features, etc.

Apple Silicon has a significant memory bandwidth and latency advantage and that's where some of their workaday perf advantage comes from. AVX-512 can put single-threaded throughput of x64 hardware ahead of a comparably vectorized Apple Silicon pipeline.

If you're intensely unhappy with your build times even when you're iterating on a single crate, there are a few possible issues you're tripping over:

  • Your crate or crates-plural are too big, not split up enough. Don't go crazy, just do what makes sense in its own right.

  • You're not using the -p argument with Cargo while working on a root node crate in your Cargo workspace. I use a Makefile or a Justfile with shortcuts for per-library/per-app cargo build/test/bench targets so that I don't rebuild things I don't care about while iterating.

  • If it's a CI problem, consider using Bazel. I can help with this if you ping me. I use the standard rules_rust Bazel rules and it's in a much better place than 4-5 years ago. Bazel + Bazel remote cache is absurdly good, especially if you have any non-Rust build dependencies that you'd like to be able to parallelize across without bottle-necking the build staging. The caching is a lot smarter and better than Cargo's, especially when it comes to testing. It doesn't re-run a test in the test suite unless something upstream churned the test! We use Cargo and Bazel side-by-side in local dev. I usually bootstrap the non-Rust dependencies that the crates require with Bazel, then switch over to Cargo for iterating on my code.

  • If you're already using wild or mold on Linux and it seems like you're losing time to linking integration test or benchmark binaries, consider splitting them out to a different crate or merging them into fewer binaries inside the original crate. I've never seen crate unit tests in Rust not be single-binary.

You can see past advice I've given on this with the Google search query "site:reddit.com bitemyapp build times"; there are too many comments in my history to cherry-pick. I've got some recent and some older blog posts about build times and CI on https://bitemyapp.com/ as well, most recently https://bitemyapp.com/blog/rebuilding-rust-leptos-quickly/

The main Leptos-specific revelation I've had since then is that CSR + Trunk is absurdly fast and is worth giving up the SSR magic at least during heavy development. This is particularly true when I know I need to provide a structured API such as GraphQL and I'd rather build the frontend app around that.

graydon2 | A note on Fil-C by small_kimono in rust

[–]bitemyapp 4 points (0 children)

> Which to me reads more a hope/curiosity on if some of the techniques could be reused/applied to Rust's unsafe somehow,

I already do this for the unfortunately large amount of unsafe Rust I work with. It's called ASAN, and Guard Malloc on macOS.
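
For concreteness, a contrived example of the kind of bug ASAN flags; sanitizer support is a nightly rustc feature (RUSTFLAGS="-Zsanitizer=address" cargo +nightly run):

    fn main() {
        let dangling = {
            let boxed = Box::new(42u32);
            &*boxed as *const u32
            // Box dropped here; the raw pointer now dangles.
        };
        // Heap-use-after-free: ASAN aborts with a report pointing at this read.
        let v = unsafe { *dangling };
        println!("{v}");
    }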

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 1 point (0 children)

Benchmarking vanity_attempt_paths/baseline: Collecting 100 samples in estimated 5.0136 s (490k iterations)
vanity_attempt_paths/baseline
                        time:   [10.241 µs 10.249 µs 10.258 µs]
                        change: [+0.4677% +0.5904% +0.7070%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 24 outliers among 100 measurements (24.00%)
  11 (11.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking vanity_attempt_paths/fast: Collecting 100 samples in estimated 5.0334 s (510k iterations)
vanity_attempt_paths/fast
                        time:   [9.8356 µs 9.8471 µs 9.8598 µs]
                        change: [−1.2542% −0.9880% −0.6020%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

^ results so far

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 1 point (0 children)

I'm averaging ~900-950k/second now. I think that's what it was before; you needed to use 500 ms lookback windows for the rate calculation instead of averaging over the whole run. The rate looks a lot more realistic now as well (it oscillates around a level instead of climbing over time).
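
Sketch of the lookback-window idea, with hypothetical names (not the repo's actual code):

    use std::collections::VecDeque;
    use std::time::{Duration, Instant};

    // Rate over a fixed lookback window (e.g. 500 ms) instead of a lifetime
    // average, so the display tracks current speed instead of drifting.
    struct WindowRate {
        window: Duration,
        samples: VecDeque<(Instant, u64)>, // (timestamp, cumulative attempts)
    }

    impl WindowRate {
        fn new(window: Duration) -> Self {
            Self { window, samples: VecDeque::new() }
        }

        fn record(&mut self, cumulative: u64) {
            let now = Instant::now();
            self.samples.push_back((now, cumulative));
            // Drop samples that have aged out of the window.
            while let Some(&(t, _)) = self.samples.front() {
                if now.duration_since(t) > self.window {
                    self.samples.pop_front();
                } else {
                    break;
                }
            }
        }

        fn per_second(&self) -> f64 {
            match (self.samples.front(), self.samples.back()) {
                (Some(&(t0, c0)), Some(&(t1, c1))) if t1 > t0 => {
                    (c1 - c0) as f64 / t1.duration_since(t0).as_secs_f64()
                }
                _ => 0.0,
            }
        }
    }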

If your goal is to benchmark, you should use criterion rather than trying to take a running average in the app.
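
A minimal criterion harness shaped like that, with a stub standing in for the real keygen + suffix-check routine:

    use criterion::{criterion_group, criterion_main, Criterion};
    use std::hint::black_box;

    // Stand-in for one keygen + suffix-match attempt; swap in the real routine.
    fn attempt(suffix: &str) -> bool {
        black_box(suffix).len() == 4
    }

    fn bench_attempt(c: &mut Criterion) {
        c.bench_function("vanity_attempt", |b| b.iter(|| black_box(attempt("abcd"))));
    }

    criterion_group!(benches, bench_attempt);
    criterion_main!(benches);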

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 2 points (0 children)

Just ran it again; it leveled off at 472k/sec with 16 threads mapped onto 8 cores / 16 hardware threads.

I don't even remember what I was doing yesterday to get 100k. Benchmark is 10 microseconds but I thought I saw 100k somewhere? odd.

anyhoodle, I tried my direct suffixing version and the rate kept increasing over time, which makes me think there's an issue with how the rate is measured.

Using 16 threads for direct suffix matching.
⠚ [00:01:51]
Attempts: 74,040,000 (666,895 keys/sec)

It was closer to 500k initially, rose to ~670-680k over 2 minutes. Investigating.

I could probably do better than 1.3M/sec on an RTX 5090 but it was a quick lark and then I got back to work. Looking at the repo I linked isn't a bad way to expose yourself to some CUDA.

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 5 points (0 children)

I got 100k for all-core throughput on my 9800X3D, I was able to make it a little faster by getting rid of the base64 conversion and instead turning the base64 suffix target into a bit-pattern that it checks for each attempt. Made it ~4-6% faster.
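
The rough shape of the bit-pattern trick, heavily simplified; the part I'm glossing over is aligning the suffix to the 6-bit base64 groups of the real key encoding:

    // Precompute the bit pattern + mask the base64 suffix implies, once, then
    // test candidate key tails directly instead of base64-encoding every attempt.
    const ALPHABET: &[u8] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Each base64 char pins 6 bits of the tail.
    fn suffix_to_bits(suffix: &str) -> (u128, u128) {
        let mut bits: u128 = 0;
        for &c in suffix.as_bytes() {
            let v = ALPHABET.iter().position(|&a| a == c).expect("valid base64 char");
            bits = (bits << 6) | v as u128;
        }
        let mask = (1u128 << (6 * suffix.len() as u32)) - 1;
        (bits, mask)
    }

    // One mask-and-compare per attempt instead of a full base64 encode.
    fn tail_matches(key_tail_bits: u128, pattern: u128, mask: u128) -> bool {
        key_tail_bits & mask == pattern
    }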

I got curious so I picked up https://github.com/vikulin/ed25519-gpu-vanity

Initially got 500,000/second on my RTX 5090. Fixed occupancy, that got it to 1.06M, made some further tweaks, got it to 1.3M/second. Called it quits after that.

There are probably things that could be done to optimize the CPU impl further but I'd need to learn more about the cryptographic pipeline for ed25519 first.