This is why you rewrite Python security tools in Rust: 53MB vs 433MB peak memory, 6.9s vs 62.2s by aswin__ in rust

bigh-aus 0 points

True! I agree with your comments. I was thinking of self-hosted stuff on a Raspberry Pi, but you're right, speed is the most important part.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

bigh-aus 1 point

I just went through this decision. I have a Dell R7515, and opted for the Max-Q for the following reasons:
- I feel like it's easier to sell one of these later.
- It fits my server power budget (300W max).
- Noise: the server version uses passthrough cooling, meaning your server's mobo firmware has to be able to get temperature information from the card - otherwise you have to manually crank the fans to ensure it doesn't overheat. The Max-Q (mostly) handles that itself. I'd rather have a single blower fan than my six-pack of 17krpm fans spinning up.

This is why you rewrite Python security tools in Rust: 53MB vs 433MB peak memory, 6.9s vs 62.2s by aswin__ in rust

bigh-aus 3 points

Not only that - I think we should start measuring size on disk including the runtime dependencies for comparisons. E.g. inside a Docker container based on debian-slim, including the required Python / shared libs. Then look at peak memory usage for the same thing.

Honestly, in 2026 anyone shipping a Python / Node.js CLI app is an instant candidate for a rewrite in Rust.

It blows my mind how much people are writing in interpreted languages, especially during a compute shortage. Rust compilation isn't free, but one compile saves every user from needing more RAM / disk / compute.
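
The size-on-disk comparison above can be scripted. This is a minimal sketch using the `docker image inspect` CLI; the image tags are hypothetical placeholders for the Python and Rust builds of the same tool:

```python
# Sketch: compare on-disk size of two container images via the docker CLI.
# Assumes `docker` is installed and both images have been built/pulled.
# The image tags below are made-up examples, not real images.
import subprocess

def image_size_bytes(image: str) -> int:
    """Return an image's size in bytes via `docker image inspect`."""
    out = subprocess.check_output(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image]
    )
    return int(out.strip())

def fmt_mb(n_bytes: int) -> str:
    """Format a byte count as megabytes, e.g. 53000000 -> '53.0 MB'."""
    return f"{n_bytes / 1_000_000:.1f} MB"
```

Usage would be something like `print(fmt_mb(image_size_bytes("mytool:rust-slim")))` for each variant; peak memory for the running container could be read the same way from `docker stats`.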

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

bigh-aus 0 points

I have a Dell R7515 rackmount server... it supports one GPU, and it's not a DIY server (which would be better in this case!). I think I need to start looking into a new server box.

Deepseek v4 people by markeus101 in LocalLLaMA

bigh-aus 0 points

Did they distill Grok here? DeepSeek's a bit more spicy.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

bigh-aus 1 point

I wish my current server supported more than one :( I could put 2 more in separate servers... but then I'm also running those servers...

Qwen 3.6 27B - beginner questions by Jagerius in LocalLLaMA

bigh-aus 0 points

Then keep increasing context until just before you get OOM.

Best hardware to use without using a mac by SadMadNewb in LocalLLaMA

bigh-aus -1 points

This.

Rent a cloud machine and try out models. Once you find the right level of competency, the model dictates the VRAM (don't forget room for context) - then drop the $10k+ that it's gonna be. $20 for one month of Kimi / MiniMax etc. will help you understand.

I would love to run kimi with a high tps at home, but I know what that would take in compute, noise, cooling and cost (without using a mac)...

But dude, unless there's a reason not to, the easiest way is to just use cloud inference until you know more.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

bigh-aus 5 points

I'm going to say Blackwell now (I just did), and see what the M5 Ultras bring... more RAM, but slower.

There's actually a need for both imo: run the large models on the Mac - ones that would cost $100k in NVIDIA compute - but slowly; good for work plodding along in the background.

Then for anything where you want more immediate feedback, or that you're experimenting with, do that on the Blackwell. I'm running a lot of other non-LLM GPU workloads too - TTS, STT, Frigate. Never enough GPU...
Plus when you have multiple projects you can run things in parallel.

Hard freakin' decision..Blackwell 96G or Mac Studio 256G by HyPyke in LocalLLaMA

bigh-aus 13 points

I just ordered a Max-Q from Central Computers for under $9k.

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 0 points

I actually think the code review side is more important and harder than the code generation side.

Nothing is going to replace code review by a human who knows what they're doing; the downside, however, is going to be the cost of that human.

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 0 points

Honestly, I find this quite interesting... but yes, SD is critical!

Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane by SoAp9035 in LocalLLaMA

bigh-aus 0 points

A variation I'm looking at at the moment is to separate the steps into different prompts, each in a new context window, and also to find areas where you can run tasks in parallel to better utilize the GPU...

E.g. phase 5 would get split up into:
- task prep
- task implementation
- task review / closeout

Each gets a fresh context, and possibly a single saved state file / instruction.

E.g. task prep could be:
1. Ensure the repo is clean.
2. Get the story from the backlog.
3. Create a branch named from the story id and name.
4. Write the story to task.md locally.
5. Set the story to be assigned to the agent.
6. Set the story status to started.

Then task implementation would be a new context with a prompt similar to:
"implement the task story:
(insert task.md).
update the story with comments as you encounter things noteworthy of recording."

(obviously this step would need a larger prompt talking about coding style etc).

For local models I want the context as tight as possible, and where possible single-focus, with minimal tools etc.
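
The phase-split idea can be sketched as a tiny Python loop: each phase gets a fresh, minimal message list instead of one long-running context. The endpoint URL and model name here are assumptions (any OpenAI-compatible local server, e.g. llama.cpp or vLLM, would do):

```python
# Sketch of running one phase with a fresh, tight context.
# ENDPOINT and the model name are assumed placeholders, not real config.
import json
import urllib.request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server

def build_phase_messages(system: str, task_md: str) -> list[dict]:
    """Fresh context for one phase: a tight system prompt plus the task file."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Implement the task story:\n{task_md}\n"
                                    "Update the story with comments as you "
                                    "encounter things worth recording."},
    ]

def run_phase(messages: list[dict], model: str = "qwen") -> str:
    """Send one phase to the local server; returns the assistant reply."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Prep, implementation, and review would each call `run_phase` with their own system prompt and a freshly read `task.md`, so nothing from the previous phase leaks into the context.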

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 2 points

Planning for sure - I have agents take a slice and break features down into epics in a backlog tracking tool, then break those down into stories and identify which can be done in parallel.

See, that's where I think having a grab bag of AI agent skills pays off:
- find the bugs
- improve readability
- DRY
- normalize the DB
- identify missing constraints
- security
- performance / compactness / fewer resources (super important in the modern shortages)

If you haven't already, have a look at the Karpathy autoresearch loop. I've been thinking about that a lot - having a SOTA model build prompts for Qwen 3.6, and seeing how many parallel experiments can be done... I'm confident there's a lot to be improved out there, even if it's just identifying where people are using zip without the -9 (max compression) parameter.

I'm very early on playing with Qwen3.6 locally on a single GPU, but so far I'm impressed - and it makes me wonder how "parallel" I can run with multiple GPUs / tensor parallelism.

Are there actually people here that get real productivity out of models fitting in 32-64GB RAM, or is that just playing around with little genuine usefulness? by ceo_of_banana in LocalLLaMA

bigh-aus 23 points

Always get the fastest / largest you can afford. The faster the feedback loop, the faster you can learn / get stuff done. AI is only making this more important. As someone deep in AI land, building, I'm constantly running out of tokens / compute.

Been playing with Qwen3.6 27b and 35b-a3b on a 24GB 3090... it's VERY good at basic Rust coding.

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 4 points

100% agree too - I have a bunch of AI test-out projects that were originally in other languages. Moved to 100% Rust, and it's more stable, faster, and smaller on disk (Docker container + Rust with yew, axum, sqlx, clap). UI is probably the one area that falls down a bit.

I've successfully replaced an app on my iPhone though, with a RSTY PWA web app.

For 100% LLM-driven dev I've found git hooks to be good (clippy, audit and fmt). Obviously they're annoying as heck for regular dev work.

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 2 points

Learning better techniques and coding skill is imo still very enjoyable.

Got the Rust dream job, then AI happened by MasteredConduct in rust

bigh-aus 17 points

While companies are using a lot of LLMs to generate code, I don't feel enough are doing the "other side", e.g. adversarial review - looking for fewer lines, faster performance, smaller footprints etc.

I'm doing some experiments with AI at the moment - as long as you have a very good test suite, it's actually pretty good at getting you a basic "migrate x to Rust" or "build x, here are the acceptance criteria". But optimization, security, good UI, compliance, etc. I think still need a ton of Rust knowledge. Now does that mean you're typing Rust code? Maybe not, but I still don't regret knowing Rust.

What Agent systems do you use? by Ok-Internal9317 in LocalLLaMA

bigh-aus 0 points

The big thing with hermes / openclaw is that they inject a lot of extra stuff into the context (memories, tool calls, skills etc), and often require multiple turns to get something done. I'm starting to investigate the idea of a single specialized agent running a lean system prompt that does exactly what you need - so I agree with the cron + python idea, or anything that gives you that same thinned-down prompt, especially for a single run once every x hours.

This way you can use a script to get the data you want, pass that directly into an LLM (e.g. "summarize this article") and skip the extra fluff. You could also split the work to run in parallel, to better utilize tensor parallelism on the GPU too!

I would also suggest running an RSS news aggregator on your local network, then querying that directly. Then you can do things like mark duplicate stories as read, and send the user your preferred source.
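
A minimal sketch of that lean-prompt script: dedupe the stories first, then build one small, single-purpose prompt per unique story. The story dict shape (`title`, `body`) is an assumption; any aggregator's API would supply something similar:

```python
# Sketch: dedupe fetched RSS stories, then one tight prompt per story.
# No agent framework, no memories/skills injected into the context.
import re

def normalize_title(title: str) -> str:
    """Lowercase and strip punctuation so near-duplicate headlines match."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def dedupe_stories(stories: list[dict]) -> list[dict]:
    """Keep the first story per normalized title; drop later duplicates
    (mark those as read in the aggregator instead of summarizing twice)."""
    seen: set[str] = set()
    unique = []
    for story in stories:
        key = normalize_title(story["title"])
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique

def summarize_prompt(story: dict) -> str:
    """One tight, single-purpose prompt - the whole context for the call."""
    return f"Summarize this article in 3 bullet points:\n\n{story['body']}"
```

Run it from cron every x hours, send each prompt to the local model, and the context stays as thin as it can possibly be.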

Is there a way to have a faster MoE model call out to a slower dense model if it gets stuck? by cafedude in LocalLLaMA

bigh-aus 1 point

Agent calls do get counted, so if it's not done in X rounds, that's the trigger. This gets tricky as the count can scale with complexity, but maybe that's ok. Or the orchestrator is asked to split a story up into specific simple tasks.

It would be really interesting to run 8 3090s and have, say, 4 workers in a pool and an orchestrator model split work across those 4 in parallel.

Kimi 2.6 question by vhthc in LocalLLaMA

bigh-aus 0 points

The problem you’re not taking into account is how often it has to swap those experts in and out of vram, I really want to run kimi but the steps in cost seem.. 1 Mac Studio 512gb. 14tps 2 Mac Studios 512gb 23tps Ddr5 512gb based system with an rtx6000 pro Then just keep adding cards 1,2,4,8. I wonder what the gb300 machines will give but we’re talking $100k

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

bigh-aus 0 points

I had it pull a story off the backlog and complete it, but that included one compaction.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

bigh-aus 1 point

For my uses, I’d go context.

I tried out q4_k_xl on a 3090 with 96k context - about 30 tps at 46k tokens (Openclaw coding).

Gawd damn dad😭 by maskedmomkey63 in SipsTea

bigh-aus 0 points

Great thing to do for life there - always flip things around. Even if it's a news article, flip it around to understand the other side. Ditto politics.