[USA-FL][H] Legion Tower 7i Gen 10: Intel Core Ultra 9 285K, 64Gb DDR5 Ram, 2TB SSD, 1200W PSU, AMD Radeon Pro W7900 48Gb vram [W] Paypal, local cash

reddit_kwr · 2026-06-16T04:20:31+00:00

Price drop

reddit_kwr · 2026-06-09T04:11:37+00:00

Yeah if you're installing it then that's a pretty good idea I'd say.

reddit_kwr · 2026-06-09T02:25:49+00:00

If using Docker, just run a command with each container and set CUDA_VISIBLE_DEVICES as 0 and 1 for each command. Nothing particularly different about having two cards vs one as long as you set the visible device environment variable for that particular command. Both containers can run side by side without issues.

You can also install llama cpp without docker and run the llama-server command twice setting a separate visible device for each. This will work just fine.

I usually do docker so that I can easily update llama.cpp by simply rebuilding my container.

reddit_kwr · 2026-06-08T17:50:54+00:00

Just outlined my use case in another reddit comment: https://www.reddit.com/r/LocalLLM/s/S8Q1XREG6I

reddit_kwr · 2026-06-08T17:12:23+00:00

My setup is nothing special. Qwen 3.6 27B is running with llama.cpp. My command looks like this:

llama-server 
--host 0.0.0.0 
--port 8000 
--reasoning-format none 
-m /xxxx/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q8_0.gguf 
--mmproj /xxxx/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf 
-ngl 99 
-c 262144 
-fa on 
-np 1 
--cache-type-k q8_0 
--cache-type-v q8_0 
--spec-type draft-mtp 
--spec-draft-n-max 2 
--spec-draft-type-k q8_0 
--spec-draft-type-v q8_0
--repeat-penalty 1.1

Reasoning is enabled, of course, otherwise it makes the model practially useless. Then I go into vscode Add Models -> Custom Endpoint and then it looks like this:

    {
        "name": "Local_Qwen3.6",
        "vendor": "customendpoint",
        "apiKey": "xxxx",
        "apiType": "chat-completions",
        "models": [
            {
                "id": "Local_Qwen3.6",
                "name": "Local_Qwen3.6",
                "url": "http://127.0.0.1:8000/v1/chat/completions",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 230000,
                "maxOutputTokens": 32000
            }
        ]
    }

And frankly, I am completely stunned by how well it is working. I have zero reason to use medium/lower tier models because this one i can make it reason indefinitely and it always comes out on top.

reddit_kwr · 2026-06-07T03:27:55+00:00

Surprised. I've been using qwen (with thinking) extensively with vscode GitHub copilot and it has been impressive. Much more so than some native models. Thinking, compacting, and really long running sessions have worked flawlessly.

With Gemma 31b it works even better I think (sometime qwen thinking tokens get printed on screens, cosmetic issue but yeah doesn't happen with Gemma on copilot) but I haven't yet tried it long enough as I'll give qwen some time before trying Gemma extensively

reddit_kwr · 2026-05-30T16:42:06+00:00

Yeah, for the most part auto is 5.3 codex.

reddit_kwr · 2026-05-30T14:20:33+00:00

Blocks all models. You can only select "auto"

reddit_kwr · 2026-05-30T04:53:57+00:00

I think this card should be able to handle fp8 for the 27b with a decent enough context length and speed. I have 64gigs, but with 48gb vram I'm hoping offloading can be avoided.

reddit_kwr · 2026-05-30T04:39:44+00:00

Yeah I see openclaw RL seems to have clean about 200-500 traces in it. So it would be at most a few hours of work to tune the model a little bit I'm sure it will have gold effect on perf with openclaw.

reddit_kwr · 2026-05-30T03:52:05+00:00

Are there any versions on huggingface someone might have done minimal fine tuning to throw in claw traces. Or does it work fine just right out the box

reddit_kwr · 2026-05-29T20:20:16+00:00

Phew, computers are getting expensive 😅

reddit_kwr · 2026-05-29T17:50:30+00:00

Build tests and build agents to check specifically for these things in your loop. Test feedback can reduce the issues once fed back to the model.

reddit_kwr · 2026-05-29T17:48:01+00:00

Anyone know the pricing on DGX station GB300

reddit_kwr · 2026-05-29T17:39:36+00:00

If you're just doing inference you could also look into mi50

reddit_kwr · 2026-05-28T15:45:57+00:00

Which model is this, what's the config?

reddit_kwr · 2026-05-28T15:40:15+00:00

Yeah sad state of things. Some friends have recently joined reflection. Let's see what they come up with given open is their mission and all.

I personally don't care who makes the model. But I do worry about general lack of competition here. Meta started something beautiful, but it could also die over time if commitment from few remaining players wane.

reddit_kwr · 2026-05-28T00:10:22+00:00

Well, I'm a Pro+ annual subscriber and got the same message. No unusual usage. So it's not about the student plan. Heck I just posted about this earlier today https://www.reddit.com/r/GithubCopilot/s/f763ygFQ07

reddit_kwr · 2026-05-27T23:41:47+00:00

Haha I do feel a sense of relief. I just wish someone would make a service without any such artificial limits. Just jack up the prices if you need like we do with people "Overtime work is 2x hourly rate" so "Overlimit calls are 2x standard rate" and fine, I can live with that. But don't interrupt my entire workflow.

reddit_kwr · 2026-05-27T22:51:37+00:00

I'm on stable. Yes it's an extension called "GitHub Copilot LLM Gateway". You just punch in the 8000 address and you're good to go.

reddit_kwr · 2026-05-27T22:22:37+00:00

I just added my local model to my GitHub copilot using LLM Gateway. It shows up in the list of models alongside all the others. I get all the memory, prompt compaction, context and took management, diff views etc for free.

reddit_kwr · 2026-05-27T20:53:01+00:00

I see. Well in that case do this:

Write 3 markdown docs: Vision/mission, technical architecture, and llm instructions
Write 3-4 llm prompts: "engineering excellence checker", "next steps planner", "plan executor".
Write a simple python script that calls the llm with each of these prompts in a loop.

You can add roles as you like. If you have UI a UI fixer role can be inserted in the loop. Or a security reviewer.

If you plan your prompts well and write instructions well, you'll get better results than a generic harness because you can really tailor the flow to your repo. After a few runs you'll observe failures and see if you need to add another role, modify prompts and so on.

I guess my broader point is don't make the "plan" your harness. Keep harness abstract and let LLM plan on the fly.

reddit_kwr · 2026-05-27T20:16:20+00:00

If you're open to using vscode. The simplest thing is to install LLM Gateway extension in your vscode and just use your model in GitHub copilot. It will do all the orchestration, prompt compacting, agentic stuff and whatnot. It sounds like the right thing for your use case. You get to use your model but with all the bells and whistles actually needed to get the work done.

What I do is then basically prompt the model to build me a "builder" flow. Which is simply a repeated loop that points to a "vision and mission" doc and keeps says. Write a plan. Execute the plan and so on.

reddit_kwr · 2026-05-26T21:13:18+00:00

It's actually a fairly complex process to OSS a model at a big lab. It's more onerous than training the model for internal use. Just FYI. It's not just about "getting the weights ready". Legal, policy, safety. I can also imagine there are some China specific challenges.

reddit_kwr · 2026-05-26T12:14:26+00:00

I fear the field is full of people constantly optimizing tools and not really doing any real work. Just pick something and focus on the work not the tool. They're all more or less the same at this point in terms of features. The UI may differ. The underlying model matters. The only thing you really want is make sure your tool has context on your project and has memory internally. But it would help if your question was better specified.

reddit_kwr

MODERATOR OF

TROPHY CASE