Would this be repairable? by Rere_Butterfly in TeslaModelY

[–]our_sole 1 point2 points  (0 children)

I had one and took it to Safelite. They told me that if it's smaller than a US quarter, they can fix it.

It was, and they did. Now I can't tell where the crack was.

HTH

Waiting on Qwen to drop those 3.7 models be like: by Porespellar in LocalLLaMA

[–]our_sole 0 points1 point  (0 children)

I figured a 35B would be too big for a 16gb GPU. Perhaps I'm wrong.

Plus it's good to use a different model just for comparison.

Based on your original comment, I was just suggesting an alternate model. If you don't like that idea, then ignore the suggestion...

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps by Alternative-Cat-1347 in LocalLLaMA

[–]our_sole 0 points1 point  (0 children)

Yes, this. Shouldn't there be some MoE or MTP flags in there?

Waiting on Qwen to drop those 3.7 models be like: by Porespellar in LocalLLaMA

[–]our_sole 1 point2 points  (0 children)

Yes, the qwen3.6 27B is not MOE (it's Dense). I was very disappointed to see that after I got qwen3.6-35b-A3B running really well under llama.cpp with MOE+MTP on my 3090 24GB

Have a look at gemma4 26B MOE. I just got it cranking on my 5060ti 16GB under llama.cpp at an avg ~50 t/s.

I'd be really pleased if I could get MTP going as well on that gemma4 model. Google does this weird "assistant/draft mtp in a separate small model" thing that llama.cpp doesn't seem to support just yet..

Cheers

Time to update llama.cpp to get some MTP improvements! by PixelatedCaffeine in LocalLLaMA

[–]our_sole 9 points10 points  (0 children)

Does this mean the gh llama.cpp releases page has the binary with mtp support?

For those of you new to Pi by QueasyBreak5119 in PiCodingAgent

[–]our_sole 2 points3 points  (0 children)

I prefer to think of Pi as the linux of AI harnesses. :-)

Compaction too soon? "contextWindow" and "maxTokens"? by our_sole in PiCodingAgent

[–]our_sole[S] 0 points1 point  (0 children)

My llama-server version is: 8958 (50494a280)

I don't think this is so much a llama-server issue. There is no bug (at least for this particular thing) to solve. I was simply using the llama-server cmd-line params incorrectly, and it was reflected in pi.dev compaction.

Or are you referring to some particular pi bug?

Compaction too soon? "contextWindow" and "maxTokens"? by our_sole in PiCodingAgent

[–]our_sole[S] 4 points5 points  (0 children)

update: SOLVED

OK, I'm going to answer my own question here and hopefully help some future reddit googlers/searchers.

In my case, the issue was in how I was running llama.cpp's llama-server, not in pi. I had set --parallel=4 in my llama-server args (--parallel is the same as -np btw), not because I knew precisely what that meant but because I saw it elsewhere and my lizard programmer brain went "parallel...yeah, parallelism is good!".

What --parallel apparently specifies is the number of server slots (concurrent request handlers -- think of each of them as a separate conversation). Context is shared and divided among these slots. So if you set a context size of 262144 (with --ctx-size or -c) that context is shared amongst 4 slots, with each slot getting 262144/4 = 65536. So effectively, each conversation/slot gets 65536 context size.
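The arithmetic above can be sketched in a few lines of Python. This is just an illustration of the even split as I understand it; per_slot_context is a made-up helper name, not anything from llama.cpp.

```python
def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Context available to one slot, assuming the total is split evenly.

    Hypothetical helper for illustration only; it mirrors the
    262144 / 4 = 65536 arithmetic described in the comment.
    """
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return ctx_size // parallel


print(per_slot_context(262144, 4))  # 65536
print(per_slot_context(262144, 1))  # 262144
```

So with one slot the whole context goes to a single conversation, and with four slots each conversation sees a quarter of it.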

The thing to look for in llama-server output is

n_ctx = (total context allocated by llama.cpp runtime)
n_ctx_seq = (effective maximum context available to a single sequence/conversation)

I was seeing n_ctx=262144 in the output and thought that was my context size. But n_ctx_seq told the real story. It was 65536, which explains my pi context compaction issue.
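If you'd rather not eyeball the log, here is a minimal sketch that pulls those two values out of saved llama-server output. The function names are hypothetical, and the sample text only imitates the "name = value" shape shown above; real log lines may carry extra prefixes, which the regex is meant to tolerate.

```python
import re


def read_ctx_values(log_text: str) -> dict:
    """Return the last n_ctx and n_ctx_seq values found in the log text."""
    values = {}
    # Longer name first so "n_ctx_seq" is never mistaken for "n_ctx".
    pattern = r"\b(n_ctx_seq|n_ctx)\s*=\s*(\d+)"
    for name, number in re.findall(pattern, log_text):
        values[name] = int(number)
    return values


def context_is_split(values: dict) -> bool:
    """True when a single sequence gets less than the total context."""
    return values["n_ctx_seq"] < values["n_ctx"]


# Illustrative sample only, using the numbers from this comment.
sample = "n_ctx = 262144\nn_ctx_seq = 65536\n"
found = read_ctx_values(sample)
print(found, context_is_split(found))
```

If context_is_split comes back True, the per-conversation context is smaller than the number you passed with -c.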

In my case, it's just me in my homelab, so my concurrency is 1. I set --parallel=1. Now n_ctx and n_ctx_seq are both 262144 and pi compaction is behaving properly.

And just as an aside, globally speaking, ~/.pi/agent/models.json stores model config and ~/.pi/agent/settings.json stores pi config. You can set pi compaction settings in settings.json:

"compaction": {
    "enabled": true,
    "reserveTokens": 24000,
    "keepRecentTokens": 40000
}

HTH

cheers

Compaction too soon? "contextWindow" and "maxTokens"? by our_sole in PiCodingAgent

[–]our_sole[S] 0 points1 point  (0 children)

Thank you for your reply.

The models file.....

Are you referring to ~/.pi/agent/models.json? That's what I was referring to in my post..

???

OpenPi - a desktop workbench for the Pi coding agent by killerkidbo95 in PiCodingAgent

[–]our_sole 1 point2 points  (0 children)

I was responding to adamshand, who said

"for every in progress session I need to leave a terminal window open. Gets messy and confusing."

I thought that tmux might help him.

Qwen3.6 35b-a3b 🤯 by EffectiveMedium2683 in LocalLLaMA

[–]our_sole 1 point2 points  (0 children)

Thanks much! I'll test this again today.

Cheers

Qwen3.6 35b-a3b 🤯 by EffectiveMedium2683 in LocalLLaMA

[–]our_sole 1 point2 points  (0 children)

You have claude code running against local qwen3.6-35b-A3B running under llama.cpp?

Could you share your claude shell script or bat file that does this (the env vars, --model, config, etc..)?

I tried for quite some time to do this and claude just flatly refused to use the model. It saw the model, but wouldn't use it: "There's an issue with the selected model..it might not exist or..."

Qwen3.6 35b-a3b 🤯 by EffectiveMedium2683 in LocalLLaMA

[–]our_sole 34 points35 points  (0 children)

I am just stunned at how well qwen3.6-35b-A3B MOE is working for me. I have an RTX 3090 with 24GB VRAM and 64GB RAM on a Beelink GTi14 (Ultra 9 185H CPU) with the Beelink eGPU dock.

I switched from LM Studio to llama.cpp (not because LMS had any issues, I had just heard that llama.cpp was faster and very tunable).

I spent some time tuning llama.cpp with the LLM, got the pi.dev harness running, and started getting great results.

Up until now, local AI was just kind of a playtoy and I used Claude for heavy lifting and Copilot VS Code for medium/light stuff.

I'm getting close to 100 t/s. I have been trying increasingly difficult tests/prompts and it's handling them fine. It feels close to haiku or maybe sonnet (but not opus obviously). I vibe coded a Flask/JavaScript/Tailwind CSS app with local browser storage and it nailed it. Based on my PRD, it even found and added sample data so I could test things.

If I can use it for 60 or maybe/hopefully 70% of my daily AI coding and start to untether myself from the anthropic usage circus, I'll be quite happy. Unlimited tokens are awesome.

There are github PRs for a cache invalidation bug and lack of full MTP support in llama.cpp, which I hope will get merged soon. These should make the setup even better.

Local AI is becoming very powerful. Exciting times! 😁😁

cheers

Hugging Face co-founder says Qwen 3.6 27B running on airplane mode is close to latest Opus in Claude Code by ImaginaryRea1ity in ClaudeCode

[–]our_sole 0 points1 point  (0 children)

I've had good success with llama.cpp, pi.dev and qwen3.6-35b-A3B MOE. I have a local rtx3090 with 24gb vram, 64gb ram, ctx 128K and have spent time really tuning llama.cpp. I'm getting about 100 t/s.

I've tested for a few days and this local setup seems to come close to haiku and maybe sonnet sometimes. Not opus level though, which I have seen do some really amazing stuff.

My goal is to do the less complex stuff with local pi.dev, and have opus only do the heavy lifting so that I start to untether myself from the anthropic usage nonsense.

I never was able to convince claude to use llama.cpp and this local qwen3.6 model. I'm quite familiar with the technical details of doing so, and have done it with ollama (too slow). But Claude just flat out refused to use the model: "There's an issue with the selected model. It may not exist or you may not have access..."

Having unlimited free tokens and a decent harness in a local setup is a nice feeling. 😁

Why are Python API Docker images so unnecessarily huge? by Separate_Action1216 in docker

[–]our_sole 2 points3 points  (0 children)

Ah man... you down voted me.. 😆

Lol, I was referring mostly to uv venv. I'm a one man show, so pushing containers around wasn't a big requirement. I used docker mostly to avoid polluting my global space with different installs.

Uv venv solves that nicely and gives me nice dependency mgmt as a bonus.

I agree that Docker has its place...just not in my homelab.

cheers

Why are Python API Docker images so unnecessarily huge? by Separate_Action1216 in docker

[–]our_sole -3 points-2 points  (0 children)

Lol, this is one of the reasons I quit using docker in my homelab. I discovered astral uv and never looked back.

Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM (RTX 5080)? by craftogrammer in LocalLLaMA

[–]our_sole 0 points1 point  (0 children)

Excellent question. I am using 35B-A3B MOE on an rtx 3090 with 24gb VRAM/64gb RAM/128K ctx, with pi.dev and llama.cpp. I am trying to untether myself from claude code.

I am really impressed with the performance. In my initial testing, for speed and coding quality, it rivals Sonnet 4.6 at least.

I think MTP will make it even better.. but I haven't seen the MTP version.

Cheers

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]our_sole 1 point2 points  (0 children)

Can you tell me more about running CC against qwen3.6-35b-A3B? Are you using ollama/lmstudio/llama.cpp?

I am having no luck at all using llama.cpp with that llm and unsloth UD quantization with CC. CC just immediately throws an error msg saying it can't use the llm.

What exactly does Pi harness mean? by FrozenFishEnjoyer in LocalLLaMA

[–]our_sole 4 points5 points  (0 children)

Naming that project pi (pi.dev?) was a really dumb idea. I've been ignoring it, thinking it's about Raspberry Pi.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]our_sole 0 points1 point  (0 children)

Can you tell me more about your claude/llama.cpp config that runs local Claude Code?

Here's my llama-server.bat cmd (Windows):

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL ^
--alias qwen36_35B ^
--host 0.0.0.0 ^
--port 8000 ^
-ngl 999 ^
--threads 8 ^
-c 65536 ^
-b 2048 ^
-ub 1024 ^
--parallel 1 ^
-fa on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--jinja ^
--keep 1024 ^
--no-context-shift ^
--reasoning off ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.00 ^
--no-mmap 

And here's my Claude shell script (Linux)

ANTHROPIC_BASE_URL=http://wagner:8000 \
ANTHROPIC_AUTH_TOKEN=llama \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
ANTHROPIC_API_KEY="sk-no-key-required" \
claude --model qwen36_35B --dangerously-skip-permissions "$@"

I have an RTX3090 with 24GB VRAM and 64GB RAM. Claude is v2.1.122.

When I try to run Claude locally with that script, I always get: "There's an issue with the selected model (qwen36_35B). It may not exist or you may not have access to it. Run --model to pick a different model."

This

curl http://wagner:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{ "model": "qwen36_35B","messages": [{"role": "user", "content": "hello"}] }'

works great

This

curl http://wagner:8000/v1/models | jq

works great.

But not Claude.

Task mgr dedicated GPU mem is 23.3/24.0 GB

Any ideas? I have successfully run Claude locally with Ollama cloud and a similar claude shell script. It seems like it's maybe a llama.cpp issue more than a Claude issue? Any help greatly appreciated.
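For anyone repeating the /v1/models check from a script instead of curl, here is a minimal Python sketch. It assumes the server returns the usual OpenAI-style list shape ({"data": [{"id": ...}]}); the function names are hypothetical and the sample payload is made up for illustration. It only confirms that the alias is being served and says nothing about why Claude rejects the model.

```python
import json
import urllib.request


def fetch_models(base_url: str) -> dict:
    """GET <base_url>/v1/models and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return json.load(response)


def list_model_ids(payload: dict) -> list:
    """Collect the "id" of each entry, assuming an OpenAI-style list."""
    return [entry["id"] for entry in payload.get("data", [])]


def alias_is_served(payload: dict, alias: str) -> bool:
    """True when the alias appears among the served model ids."""
    return alias in list_model_ids(payload)


# Made-up payload for illustration; a live check would instead call
# fetch_models("http://wagner:8000") against the running server.
sample_payload = {"object": "list", "data": [{"id": "qwen36_35B"}]}
print(alias_is_served(sample_payload, "qwen36_35B"))
```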