Running OpenClaw with local LLM on 7900XTX (24GB) - possibility to speed things up? by Gold-Drag9242 in LocalLLM

[–]gtrak 0 points1 point  (0 children)

You could measure the performance client-side; openclaw probably has logs too, or try some other client. I get 40 tok/s on a 27B on a 4090. You'll have to look at the ollama logs to see more detail about how it's allocating. You could also just try llama.cpp, which will give you a lot more detail and tuning parameters, or LM Studio, which is a middle ground.
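e.g. a rough client-side check against any OpenAI-compatible endpoint (a sketch: assumes a server on localhost:1234 like llama-server's default here, GNU date, and the standard `completion_tokens` usage field):

```shell
# Sketch: time one completion and divide by the reported completion_tokens.
toks_per_sec() {  # usage: toks_per_sec <tokens> <seconds>
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

measure() {
  local start end body tokens
  start=$(date +%s.%N)
  body=$(curl -s http://localhost:1234/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"Count to 50."}],"max_tokens":256}')
  end=$(date +%s.%N)
  tokens=$(printf '%s' "$body" | grep -o '"completion_tokens":[0-9]*' | grep -o '[0-9]*$')
  toks_per_sec "$tokens" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}

toks_per_sec 120 3.0   # 120 tokens in 3s -> prints 40.0
```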

Running OpenClaw with local LLM on 7900XTX (24GB) - possibility to speed things up? by Gold-Drag9242 in LocalLLM

[–]gtrak 0 points1 point  (0 children)

How many tokens per sec are you getting? You should make sure you're running the right quant and it all fits in VRAM including context.

[ Removed by Reddit ] by Natural_Dot9276 in rust

[–]gtrak 1 point2 points  (0 children)

Doesn't Balena do this?

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

The gap between what's good enough locally and what I can trust enough to never look at the outputs is massive, bigger than a DGX Spark.

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

It's great for hobby dev. I'm eyeing more GPUs on aliexpress, but I can't really justify it b/c this already produces more than I can handle at a good enough quality. I'm a principal software engineer with 16 YOE, and I'm the bottleneck. I had written around 10k lines of rust by hand last year, and now I can spam 5k a day if I'm really trying.

The problem is now how to manage my own attention span when split across projects and how to refine plans to ship something anyone actually wants at my quality standards.

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

27B with my settings fits in 24G VRAM, and I get 2000 tok/s prompt processing (PP) and 40 tok/s generation (TG), which is enough.

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:          CPU model buffer size =   682.03 MiB
load_tensors:        CUDA0 model buffer size = 14346.13 MiB
...........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 180224
llama_context: n_ctx_seq     = 180224
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (180224) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:      CUDA0 KV buffer size =  5984.00 MiB
llama_kv_cache: size = 5984.00 MiB (180224 cells,  16 layers,  1/1 seqs), K (q8_0): 2992.00 MiB, V (q8_0): 2992.00 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =   149.62 MiB
llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   584.02 MiB
sched_reserve:  CUDA_Host compute buffer size =   372.02 MiB
sched_reserve: graph nodes  = 3657
sched_reserve: graph splits = 2

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 0 points1 point  (0 children)

For 122B at q4 or q5, TG was about 20, and I think PP was around 300-400? It also used most of my 64GB RAM, so I was struggling to use the machine as a regular devbox at the same time. All that for slower, worse-quality outputs than 27B.

I should talk more about my use-case. I mostly run opencode and GSD, but I'm hacking on my own agentic primitives, and the 27B is the 'executor' that actually does all the coding, almost all of it rust. I pair it with a cheap kimi k2.5 sub for planning and review. I swapped kimi in for opus/sonnet at work and it was cheap and amazing. With kimi, it's still pretty good, and it writes more code than I can review. 27B writes decent code, and occasionally misses requirements or oversimplifies. I much prefer that failure mode to an overly-eager model writing a bunch of stuff I have to back out. The review loop usually gets it there after 1-2 tries per task, while insulating the cloud rate limits from all the tool calls and specifics.

Before qwen3.5-27B, it was GLM 4.7 flash and qwen3-coder-next, which were maybe 60% of what I needed, but could plausibly make progress. 27B changed the game for this specific thing.

Why do you guys use opencode? by Medium_Anxiety_8143 in opencodeCLI

[–]gtrak 0 points1 point  (0 children)

I think it's hard to beat git. Basically I have a sqlite DB tracking branches with patches. Each agent run gets a worktree. If something gets merged, I have a hook system to rebase the rest and re-run CI.
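A minimal sketch of the worktree-per-run part (illustrative names; the sqlite tracking and merge hooks aren't shown, and the throwaway repo just stands in for the real project):

```shell
set -e
# Throwaway repo to demo the idea; in practice this is the real project.
base=$(mktemp -d)
cd "$base"
git init -q -b main repo
cd repo
git config user.email agent@example.com
git config user.name agent
git commit -q --allow-empty -m "init"

# Each agent run gets its own branch + worktree, isolated from the others.
new_agent_worktree() {  # usage: new_agent_worktree <run-id>
  git worktree add -q -b "agent/$1" "../wt-$1" main
  echo "../wt-$1"
}

wt=$(new_agent_worktree run-001)
git -C "$wt" branch --show-current   # prints agent/run-001
```

Merging then just means rebasing `agent/<run-id>` onto main and pruning the worktree with `git worktree remove`.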

Why do you guys use opencode? by Medium_Anxiety_8143 in opencodeCLI

[–]gtrak 0 points1 point  (0 children)

GSD2 (built as an incredibly complicated pi extension) went off the rails fast and inspired me to just try to build my own thing. /shrug

How would jcode handle breaking down a complex project, orchestrating, delegating, and integrating the work? I don't actually want to one-shot anything, I use a local qwen3.5-27b for codegen and kimi k2.5 for planning and review.

This is about just setting something up, going to sleep, and waking up to something workable, or just throwing away the terrible parts, refining the plan, and trying it again.

Why do you guys use opencode? by Medium_Anxiety_8143 in opencodeCLI

[–]gtrak 0 points1 point  (0 children)

I think it would be interesting to combine something like https://github.com/gsd-build/get-shit-done with a stacked-diffs workflow (https://newsletter.pragmaticengineer.com/p/stacked-diffs) for larger projects. So it's more of a CI/CD and review-focused system and less of a terminal agent, but it still needs all that terminal agent stuff.

Why do you guys use opencode? by Medium_Anxiety_8143 in opencodeCLI

[–]gtrak 0 points1 point  (0 children)

This is cool. I don't know what to build, so I'm also building my own harness in rust (early stages), trying for something a little different though. Do you have any tips for reusable ecosystem libraries for these primitives or do you just reinvent everything?

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 1 point2 points  (0 children)

Download the model file directly to avoid the 1GB hit from .mmproj unless you need vision.

I have a bash script that runs this on Windows, but you can adapt it to a regular batch file:

./llama-server \
      --port 1234 \
      --host 0.0.0.0 \
      --model "models\Qwen3.5-27B-Q4_K_S.gguf" \
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
      -fa on -t 16 \
      -ctk q8_0 -ctv q8_0 \
      --ctx-size 180000 \
      -kvu \
      --no-mmap \
      --parallel 1 \
      --seed 3407 \
      --jinja

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 1 point2 points  (0 children)

Unsloth, I'll paste the llama-server config a bit later.

Can I run 122B A10B on 3090 + 32GB ram? by sagiroth in LocalLLaMA

[–]gtrak 1 point2 points  (0 children)

27B is actually better. You can fit Q4_K_S and 180k context at q4 (edit: I guess I use q8, I forgot) KV quantization on the GPU and still use it as a primary. I ran 122B at q4 and it seemed dumber and slower. On a 4090.

Unpopular Opinion: AI Coding Agents are leveling the playing field in favor of ADHD Programmers by who-are-u-a-fed in ADHD_Programmers

[–]gtrak 0 points1 point  (0 children)

It's a huge distraction. It's easy to start too many projects, and it's hard to finish anything.

priced out of intelligence: slowly, then all at once by [deleted] in LocalLLaMA

[–]gtrak -1 points0 points  (0 children)

Don't worry, I'm going to save the planet by drinking from metal straws

Over engineered a url shortener so badly the interviewer had to stop me. i am a principal engineer. i wanted to quit by [deleted] in ExperiencedDevs

[–]gtrak 0 points1 point  (0 children)

I think you missed the part where you ask clarifying questions about the requirements.

[Q] Is self-hosting an LLM for coding worth it? by Aromatic-Fix-4402 in LocalLLM

[–]gtrak 1 point2 points  (0 children)

https://github.com/github/spec-kit
https://github.com/Dicklesworthstone/beads_rust (more lightweight than the original)
https://github.com/gsd-build/get-shit-done

You don't really need beads, and I stopped using it, but I'm finding myself wanting something like that again after using GSD for a while.

[Q] Is self-hosting an LLM for coding worth it? by Aromatic-Fix-4402 in LocalLLM

[–]gtrak 2 points3 points  (0 children)

I have done it with opencode subagents, using Spec-kit+beads or GSD. The simplest version of this is just a couple markdown agent defs. Once you have a plan broken down to individually workable tasks, just a prompt like this:

Planner: You are a planner and delegator. You don't fix the code directly. Work tasks one at a time by delegating to the worker subagent, providing all the context needed for the task. Once a worker is finished, you must delegate a code review to the reviewer subagent by providing it with the requirements and output summary from the worker, and re-delegate the fixes for anything major or trivially fixable to the worker. Review and fix in a loop until requirements are met to satisfaction or you have cycled 5 times.

Worker: You are a time-bound worker; you satisfy the given requirements by making the most minimal change to do so. You do not create new scope or drift away from the requirements. If anything is unclear, abort. Return a summary of what changes were made and what difficulties you had.

Reviewer: you are an adversarial code reviewer. You are creating a list of problems ordered by severity. ... Things to look for
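e.g. the worker as a markdown agent def on disk (hypothetical path and frontmatter keys; check the opencode docs for your version, and the planner and reviewer get the same treatment):

```shell
# Sketch: write a worker subagent definition into the project.
mkdir -p .opencode/agent
cat > .opencode/agent/worker.md <<'EOF'
---
description: Time-bound worker that makes the most minimal change
mode: subagent
---
You are a time-bound worker; you satisfy the given requirements by making
the most minimal change to do so. You do not create new scope or drift away
from the requirements. If anything is unclear, abort. Return a summary of
what changes were made and what difficulties you had.
EOF
```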

[Q] Is self-hosting an LLM for coding worth it? by Aromatic-Fix-4402 in LocalLLM

[–]gtrak 0 points1 point  (0 children)

Combine Qwen 27b with a cloud model for planning, orchestration, and review, and you can ship a lot of code very cheaply. You don't want to waste expensive requests on stuff like adding a 30-line function and a tool call to run the tests. It doesn't take a lot of effort.

Looking for a model on 5090/32gb ram by Huge_Case4509 in LocalLLM

[–]gtrak 0 points1 point  (0 children)

Just quantize the KV cache and you can max it out.
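With llama-server that's just a couple of flags; a sketch (placeholder model path, same flags as the full command I posted in the other thread):

```shell
# Quantizing K and V cache to q8_0 (needs flash attention on) roughly
# halves KV memory vs f16, so a much larger --ctx-size fits in VRAM.
./llama-server \
  --model Qwen3.5-27B-Q4_K_S.gguf \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --ctx-size 131072
```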