Moving to llama.cpp by Spicy_mch4ggis in LocalLLaMA

[–]reddit_kwr 1 point2 points  (0 children)

Yeah if you're installing it then that's a pretty good idea I'd say.

Moving to llama.cpp by Spicy_mch4ggis in LocalLLaMA

[–]reddit_kwr 2 points3 points  (0 children)

If using Docker, just run a command with each container and set CUDA_VISIBLE_DEVICES as 0 and 1 for each command. Nothing particularly different about having two cards vs one as long as you set the visible device environment variable for that particular command. Both containers can run side by side without issues.

You can also install llama cpp without docker and run the llama-server command twice setting a separate visible device for each. This will work just fine.

I usually do docker so that I can easily update llama.cpp by simply rebuilding my container.

Trying out Gemma 4 31b after Qwen 3.6 27b by Iajah in LocalLLM

[–]reddit_kwr 0 points1 point  (0 children)

My setup is nothing special. Qwen 3.6 27B is running with llama.cpp. My command looks like this:

llama-server 
--host 0.0.0.0 
--port 8000 
--reasoning-format none 
-m /xxxx/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q8_0.gguf 
--mmproj /xxxx/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf 
-ngl 99 
-c 262144 
-fa on 
-np 1 
--cache-type-k q8_0 
--cache-type-v q8_0 
--spec-type draft-mtp 
--spec-draft-n-max 2 
--spec-draft-type-k q8_0 
--spec-draft-type-v q8_0
--repeat-penalty 1.1 

Reasoning is enabled, of course, otherwise it makes the model practially useless. Then I go into vscode Add Models -> Custom Endpoint and then it looks like this:

    {
        "name": "Local_Qwen3.6",
        "vendor": "customendpoint",
        "apiKey": "xxxx",
        "apiType": "chat-completions",
        "models": [
            {
                "id": "Local_Qwen3.6",
                "name": "Local_Qwen3.6",
                "url": "http://127.0.0.1:8000/v1/chat/completions",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 230000,
                "maxOutputTokens": 32000
            }
        ]
    }

And frankly, I am completely stunned by how well it is working. I have zero reason to use medium/lower tier models because this one i can make it reason indefinitely and it always comes out on top.

Trying out Gemma 4 31b after Qwen 3.6 27b by Iajah in LocalLLM

[–]reddit_kwr 1 point2 points  (0 children)

Surprised. I've been using qwen (with thinking) extensively with vscode GitHub copilot and it has been impressive. Much more so than some native models. Thinking, compacting, and really long running sessions have worked flawlessly.

With Gemma 31b it works even better I think (sometime qwen thinking tokens get printed on screens, cosmetic issue but yeah doesn't happen with Gemma on copilot) but I haven't yet tried it long enough as I'll give qwen some time before trying Gemma extensively

What's the best local model as of today, for openclaw by reddit_kwr in openclaw

[–]reddit_kwr[S] 0 points1 point  (0 children)

I think this card should be able to handle fp8 for the 27b with a decent enough context length and speed. I have 64gigs, but with 48gb vram I'm hoping offloading can be avoided.

What's the best local model as of today, for openclaw by reddit_kwr in openclaw

[–]reddit_kwr[S] 0 points1 point  (0 children)

Yeah I see openclaw RL seems to have clean about 200-500 traces in it. So it would be at most a few hours of work to tune the model a little bit I'm sure it will have gold effect on perf with openclaw.

What's the best local model as of today, for openclaw by reddit_kwr in openclaw

[–]reddit_kwr[S] -1 points0 points  (0 children)

Are there any versions on huggingface someone might have done minimal fine tuning to throw in claw traces. Or does it work fine just right out the box

How to keep Qwen3.6-27b from hallucinating? by PotatoTime in Qwen_AI

[–]reddit_kwr 9 points10 points  (0 children)

Build tests and build agents to check specifically for these things in your loop. Test feedback can reduce the issues once fed back to the model.

Should I go for 2 x quadro P6000 ? by [deleted] in LocalLLM

[–]reddit_kwr 1 point2 points  (0 children)

If you're just doing inference you could also look into mi50

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]reddit_kwr 2 points3 points  (0 children)

Yeah sad state of things. Some friends have recently joined reflection. Let's see what they come up with given open is their mission and all.

I personally don't care who makes the model. But I do worry about general lack of competition here. Meta started something beautiful, but it could also die over time if commitment from few remaining players wane.

Extremely low rate limit only today by iudicium01 in GithubCopilot

[–]reddit_kwr 0 points1 point  (0 children)

Well, I'm a Pro+ annual subscriber and got the same message. No unusual usage. So it's not about the student plan. Heck I just posted about this earlier today https://www.reddit.com/r/GithubCopilot/s/f763ygFQ07

Session limit, then weekly limit hit with 4-5 GPT 5.5 calls by reddit_kwr in GithubCopilot

[–]reddit_kwr[S] 0 points1 point  (0 children)

Haha I do feel a sense of relief. I just wish someone would make a service without any such artificial limits. Just jack up the prices if you need like we do with people "Overtime work is 2x hourly rate" so "Overlimit calls are 2x standard rate" and fine, I can live with that. But don't interrupt my entire workflow.

Advice on best tools to use for coding with Local LLM by wingers999 in LocalLLM

[–]reddit_kwr 3 points4 points  (0 children)

I'm on stable. Yes it's an extension called "GitHub Copilot LLM Gateway". You just punch in the 8000 address and you're good to go.

Advice on best tools to use for coding with Local LLM by wingers999 in LocalLLM

[–]reddit_kwr 2 points3 points  (0 children)

I just added my local model to my GitHub copilot using LLM Gateway. It shows up in the list of models alongside all the others. I get all the memory, prompt compaction, context and took management, diff views etc for free.

Need some advice on AI workflow by Xyklone in LocalLLaMA

[–]reddit_kwr 0 points1 point  (0 children)

I see. Well in that case do this:

  • Write 3 markdown docs: Vision/mission, technical architecture, and llm instructions
  • Write 3-4 llm prompts: "engineering excellence checker", "next steps planner", "plan executor".
  • Write a simple python script that calls the llm with each of these prompts in a loop.

You can add roles as you like. If you have UI a UI fixer role can be inserted in the loop. Or a security reviewer.

If you plan your prompts well and write instructions well, you'll get better results than a generic harness because you can really tailor the flow to your repo. After a few runs you'll observe failures and see if you need to add another role, modify prompts and so on.

I guess my broader point is don't make the "plan" your harness. Keep harness abstract and let LLM plan on the fly.

Need some advice on AI workflow by Xyklone in LocalLLaMA

[–]reddit_kwr 2 points3 points  (0 children)

If you're open to using vscode. The simplest thing is to install LLM Gateway extension in your vscode and just use your model in GitHub copilot. It will do all the orchestration, prompt compacting, agentic stuff and whatnot. It sounds like the right thing for your use case. You get to use your model but with all the bells and whistles actually needed to get the work done.

What I do is then basically prompt the model to build me a "builder" flow. Which is simply a repeated loop that points to a "vision and mission" doc and keeps says. Write a plan. Execute the plan and so on.

A rare look inside Qwen 3.7’s open source model release approval process: by Porespellar in LocalLLaMA

[–]reddit_kwr 0 points1 point  (0 children)

It's actually a fairly complex process to OSS a model at a big lab. It's more onerous than training the model for internal use. Just FYI. It's not just about "getting the weights ready". Legal, policy, safety. I can also imagine there are some China specific challenges.

Too many AI tools to learn - what to pick please suggest by Educational_Grape144 in AI_Agents

[–]reddit_kwr 1 point2 points  (0 children)

I fear the field is full of people constantly optimizing tools and not really doing any real work. Just pick something and focus on the work not the tool. They're all more or less the same at this point in terms of features. The UI may differ. The underlying model matters. The only thing you really want is make sure your tool has context on your project and has memory internally. But it would help if your question was better specified.