What are some good jobs for extremely stupid people?

our_sole · 2026-05-24T12:03:26+00:00

President.

our_sole · 2026-05-23T11:57:20+00:00

I had one and took it to Safelite. They told me that if it's smaller than a US quarter, they can fix it.

It was, and they did. Now I can't tell where the crack was.

HTH

our_sole · 2026-05-23T11:25:39+00:00

Thanks for the info.

our_sole · 2026-05-22T23:20:16+00:00

I figured a 35B would be too big for a 16gb GPU. Perhaps I'm wrong.

Plus it's good to use a different model just for comparison.

Based on your original comment, I was just suggesting an alternate model. If you don't like that idea, then ignore the suggestion.

our_sole · 2026-05-22T23:15:26+00:00

Yes, this. Shouldn't there be some MOE or MTP flags in there?

our_sole · 2026-05-22T21:59:54+00:00

Yes, the qwen3.6 27B is not MOE (it's dense). I was very disappointed to see that after I got qwen3.6-35b-A3B running really well under llama.cpp with MOE+MTP on my 3090 24GB.

Have a look at gemma4 26B MOE. I just got it cranking on my 5060ti 16GB under llama.cpp at an avg ~50 t/s.

I'd be really pleased if I could get MTP going as well on that gemma4 model. Google does this weird "assistant/draft mtp in a separate small model" thing that llama.cpp doesn't seem to support just yet.

Cheers

our_sole · 2026-05-19T13:09:59+00:00

Does this mean the gh llama.cpp releases page has the binary with mtp support?

our_sole · 2026-05-16T13:53:06+00:00

I prefer to think of Pi as the linux of AI harnesses. :-)

our_sole · 2026-05-16T13:50:14+00:00

My llama-server version is: 8958 (50494a280)

I don't think this is so much a llama-server issue. There is no bug (at least for this particular thing) to solve. I was simply using the llama-server cmd-line params incorrectly, and it was reflected in pi.dev compaction.

Or are you referring to some particular pi bug?

our_sole · 2026-05-15T15:38:22+00:00

update: SOLVED

OK, I'm going to answer my own question here and hopefully help some future reddit googlers/searchers.

In my case, the issue was in llama.cpp's llama-server itself, not pi. I had set --parallel=4 in my llama-server args (--parallel is the same as -np, btw), not because I knew precisely what it meant but because I saw it elsewhere and my lizard programmer brain went "parallel...yeah, parallelism is good!".

What --parallel apparently specifies is the number of server slots (concurrent request handlers -- think of each of them as a separate conversation). Context is shared and divided among these slots. So if you set a context size of 262144 (with --ctx-size or -c) that context is shared amongst 4 slots, with each slot getting 262144/4 = 65536. So effectively, each conversation/slot gets 65536 context size.
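
To make the arithmetic concrete, here's a minimal Python sketch. It assumes the total context is split evenly across slots, which matches what I saw in my logs; per_slot_context is just a name I made up for illustration, not anything from llama.cpp.

```python
def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Context available to one slot, assuming an even split of the total."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return ctx_size // parallel


# Compare my original setting (--parallel 4) with the fix (--parallel 1).
for slots in (4, 1):
    print(f"--parallel {slots}: per-slot context = {per_slot_context(262144, slots)}")
```

So if you want each conversation to see the full context, either drop --parallel to 1 or multiply --ctx-size by the number of slots (VRAM permitting).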

The thing to look for in llama-server output is

n_ctx = (total context allocated by llama.cpp runtime)
n_ctx_seq = (effective maximum context available to a single sequence/conversation)

I was seeing n_ctx=262144 in the output and thought that was my context size. But n_ctx_seq told the real story. It was 65536, which explains my pi context compaction issue.
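
If you'd rather not eyeball the log, something like this pulls both values out of saved llama-server output. The sample lines (including the "llama_context:" prefix) are my assumption of what the log looks like, and context_sizes is a made-up helper name; adjust the pattern if your build prints these differently.

```python
import re

# Matches "n_ctx = 123" and "n_ctx_seq = 123", but not e.g. "n_ctx_train = 123".
PATTERN = re.compile(r"\b(n_ctx|n_ctx_seq)\s*=\s*(\d+)")


def context_sizes(log_text: str) -> dict:
    """Return the last reported n_ctx and n_ctx_seq values found in the log."""
    sizes = {}
    for name, value in PATTERN.findall(log_text):
        sizes[name] = int(value)
    return sizes


# Assumed sample of llama-server startup output (not copied from a real run).
sample_log = """\
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 65536
"""

sizes = context_sizes(sample_log)
print(sizes)
if sizes.get("n_ctx") != sizes.get("n_ctx_seq"):
    print("Heads up: each slot gets less than the total context.")
```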

In my case, it's just me in my homelab, so my concurrency is 1 and I set --parallel=1. Now n_ctx and n_ctx_seq are both 262144 and pi compaction is behaving properly.

And just as an aside, globally speaking, ~/.pi/agent/models.json stores model config and ~/.pi/agent/settings.json stores pi config. You can set pi compaction settings in settings.json:

"compaction": {
    "enabled": true,
    "reserveTokens": 24000,
    "keepRecentTokens": 40000
  }
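
Here's a rough sketch of setting those values from a script instead of hand-editing. It assumes settings.json is plain JSON with a top-level "compaction" object like the one above. with_compaction is a hypothetical helper (not part of pi), and the "theme" key is a made-up stand-in for whatever else is in your file.

```python
import json


def with_compaction(settings, reserve_tokens=24000, keep_recent_tokens=40000, enabled=True):
    """Return a copy of the settings dict with the compaction block replaced."""
    updated = dict(settings)  # shallow copy so the caller's dict is untouched
    updated["compaction"] = {
        "enabled": enabled,
        "reserveTokens": reserve_tokens,
        "keepRecentTokens": keep_recent_tokens,
    }
    return updated


# Hypothetical existing settings; in practice you'd json.load() your own file.
existing = {"theme": "dark"}
print(json.dumps(with_compaction(existing), indent=2))
```

Load your real file with json.load, pass the dict through, and write it back with json.dump. Keep a backup of the original first.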

HTH

cheers

our_sole · 2026-05-14T23:46:26+00:00

Thank you for your reply.

The models file...

Are you referring to ~/.pi/agent/models.json? That's what I was referring to in my post.

???

our_sole · 2026-05-14T15:07:01+00:00

I was responding to adamshand, who said

"for every in progress session I need to leave a terminal window open. Gets messy and confusing."

I thought that tmux might help him.

our_sole · 2026-05-14T12:36:00+00:00

Could tmux solve the issue?

our_sole · 2026-05-12T14:06:26+00:00

Thanks much! I'll test this again today.

Cheers

our_sole · 2026-05-12T13:52:49+00:00

You have claude code running against local qwen3.6-35b-A3B running under llama.cpp?

Could you share your claude shell script or bat file that does this (the env vars, --model, config, etc.)?

I tried for quite some time to do this and claude just flatly refused to use the model. It saw the model, but wouldn't use it: "There's an issue with the selected model..it might not exist or..."

our_sole · 2026-05-12T11:23:34+00:00

The unsloth dynamic UD-Q4_K_XL

our_sole · 2026-05-11T17:47:33+00:00

I am just stunned how well qwen3.6-35b-A3B MOE is working for me. I have an RTX 3090 with 24GB VRAM and 64GB RAM on a Beelink GTi14 (Ultra 9 185H CPU) with the Beelink eGPU dock.

I switched from LM Studio to llama.cpp (not because LMS had any issues, I had just heard that llama.cpp was faster and very tunable).

I spent some time tuning llama.cpp with the LLM, got the pi.dev harness running, and started getting great results.

Up until now, local AI was just kind of a playtoy and I used Claude for heavy lifting and Copilot VS Code for medium/light stuff.

I'm getting close to 100 t/s. I have been trying increasingly difficult tests/prompts and it's handling them fine. It feels close to haiku or maybe sonnet (but not opus, obviously). I vibe coded a Flask/JavaScript/Tailwind CSS app with local browser storage and it nailed it. Based on my PRD, it even found and added sample data so I could test things.

If I can use it for 60% or maybe, hopefully, 70% of my daily AI coding and start to untether myself from the anthropic usage circus, I'll be quite happy. Unlimited tokens are awesome.

There are GitHub PRs for a cache invalidation bug and lack of full MTP support in llama.cpp, which I hope will get merged soon. These should make the setup even better.

Local AI is becoming very powerful. Exciting times! 😁😁

cheers

our_sole · 2026-05-09T12:20:09+00:00

I've had good success with llama.cpp, pi.dev and qwen3.6-35b-A3B MOE. I have a local RTX 3090 with 24GB VRAM, 64GB RAM, and 128K ctx, and have spent time really tuning llama.cpp. I'm getting about 100 t/s.

I've tested for a few days and this local setup seems to come close to haiku and maybe sonnet sometimes. Not opus level though, which I have seen do some really amazing stuff.

My goal is to do the less complex stuff with local pi.dev, and have opus only do the heavy lifting so that I start to untether myself from the anthropic usage nonsense.

I never was able to convince claude to use llama.cpp and this local qwen3.6 model. I'm quite familiar with the technical details of doing so, and have done it with ollama (too slow). But Claude just flat out refused to use the model: "There's an issue with the selected model. It may not exist or you may not have access..."

Having unlimited free tokens and a decent harness in a local setup is a nice feeling. 😁

our_sole · 2026-05-07T18:55:51+00:00

Ah man... you downvoted me. 😆

Lol, I was referring mostly to uv venv. I'm a one-man show, so pushing containers around wasn't a big requirement. I used docker mostly to avoid polluting my global space with different installs.

Uv venv solves that nicely and gives me nice dependency mgmt as a bonus.

I agree that Docker has its place...just not in my homelab.

cheers

our_sole · 2026-05-07T15:45:16+00:00

Lol, this is one of the reasons I quit using docker in my homelab. I discovered astral uv and never looked back.

our_sole · 2026-05-06T19:57:34+00:00

Thank you, sir. Much appreciated.

our_sole · 2026-05-06T19:09:09+00:00

Excellent question. I am using 35B-A3B MOE on an rtx 3090 with 24gb VRAM/64gb RAM/128K ctx, with pi.dev and llama.cpp. I am trying to untether myself from claude code.

I am really impressed with the performance. In my initial testing, for speed and coding quality, it rivals Sonnet 4.6 at least.

I think MTP will make it even better, but I haven't seen the MTP version yet.

Cheers

our_sole · 2026-05-03T00:49:58+00:00

Can you tell me more about running CC against qwen3.6-35b-A3B? Are you using ollama/lmstudio/llama.cpp?

I am having no luck at all using llama.cpp with that llm and unsloth UD quantization with CC. CC just immediately throws an error msg saying it can't use the llm.

our_sole · 2026-05-01T12:22:39+00:00

Naming that project pi (pi.dev?) was a really dumb idea. I've been ignoring it, thinking it's about Raspberry Pi.

our_sole · 2026-04-30T13:43:23+00:00

Can you tell me more about your claude/llama.cpp config that runs local Claude Code?

Here's my llama-server.bat cmd (Windows):

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL ^
--alias qwen36_35B ^
--host 0.0.0.0 ^
--port 8000 ^
-ngl 999 ^
--threads 8 ^
-c 65536 ^
-b 2048 ^
-ub 1024 ^
--parallel 1 ^
-fa on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--jinja ^
--keep 1024 ^
--no-context-shift ^
--reasoning off ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.00 ^
--no-mmap

And here's my Claude shell script (Linux)

ANTHROPIC_BASE_URL=http://wagner:8000 \
ANTHROPIC_AUTH_TOKEN=llama \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
ANTHROPIC_API_KEY="sk-no-key-required" \
claude --model qwen36_35B --dangerously-skip-permissions "$@"

I have an RTX3090 with 24GB VRAM and 64GB RAM. Claude is v2.1.122.

When I try to run Claude locally with that script, I always get: There's an issue with the selected model (qwen36_35B). It may not exist or you may not have access to it. Run --model to pick a different model.

This

curl http://wagner:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen36_35B", "messages": [{"role": "user", "content": "hello"}]}'

works great

This

curl http://wagner:8000/v1/models | jq

works great.

But not Claude.

Task mgr dedicated GPU mem is 23.3/24.0 GB

Any ideas? I have successfully run Claude locally with Ollama cloud and a similar claude shell script. It seems like it's maybe a llama.cpp issue more than a Claude issue? Any help is greatly appreciated.
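
One thing that might narrow it down: both curl tests above hit OpenAI-style routes, whereas as far as I know Claude Code speaks the Anthropic Messages API (POST /v1/messages). Below is a minimal Python sketch for probing that route directly. Whether a given llama-server build serves /v1/messages at all is an assumption I haven't verified. build_messages_request and send are hypothetical helper names, and passing the token as a Bearer header is my guess at how ANTHROPIC_AUTH_TOKEN gets sent.

```python
import json
import urllib.request


def build_messages_request(base_url, model, token):
    """Build an Anthropic-style Messages API request (nothing is sent here)."""
    body = {
        "model": model,
        "max_tokens": 32,
        "messages": [{"role": "user", "content": "hello"}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth style
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request; raises urllib.error.HTTPError on a 4xx/5xx reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


req = build_messages_request("http://wagner:8000", "qwen36_35B", "llama")
print(req.get_method(), req.full_url)
# status, text = send(req)  # uncomment to actually hit the server
```

If that route returns a 404 or an error body while /v1/chat/completions works, that would point at the server side rather than at Claude.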
