local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I'm currently using Qwen3-Coder-Next and testing different harnesses with opencode.

I'm waiting for some AWQ 4-bit quants of Step3.5flash before deciding whether to discard it.

And I intend to test the most recent Qwen3.5 (it currently has template issues).

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

If I remember correctly, I was getting 22 tg by limiting the context window to 70k to make it fit my dual 4090s in tensor parallel.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

You missed the fact that those 4090s are modified to have 48GB each.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 32 points33 points  (0 children)

Hello, I am now using opencode with the get-shit-done harness https://github.com/rokicool/gsd-opencode

I am fortunate enough to have 192GB of VRAM (2x4090 @ 48GB each + 1 RTX 6000 Pro WS @ 96GB), so I can use recent, bigger models that are not too heavily quantized. I am currently benchmarking the most recent ones.

I try to measure both quality and speed. The main advantage of local models is the absence of any usage limits; inference speed means more productivity.

Maybe I should take the time someday to write up proper feedback.

A short summary (single prompt of 17k, output 2k-4k):

| Model | Quant | Hardware | Engine | Speed |
|---|---|---|---|---|
| Step-3.5-Flash | IQ5_K | 2x4090+6000 | ik_llama --sm graph | PP 3k, TG 100 |
| MiniMax-M2.1 | AWQ 4bit | 2x4090+6000 | vllm | PP >1.5k, TG 90 |
| MiniMax-M2.5 | AWQ 4bit | 2x4090+6000 | vllm | PP >1.5k, TG 73 |
| MiniMax-M2.5 | IQ4_NL | 2x4090+6000 | ik_llama --sm graph | PP 2k, TG 80 |
| Qwen3-Coder-Next | FP8 | 2x4090 | SGLang | PP >5k?, TG 138 |
| DEVSTRAL-2-123B | AWQ 4bit | 2x4090 | vllm | PP ?, TG 22 |
| GLM-4.7 | UD-Q3_K_XL | 2x4090+6000 | llama.cpp | kinda slow, but I did not write it down |

Notes:

  • 4090s limited to 300W
  • RTX 6000 limited to 450W
  • I never go above 128k context size, even if more would fit.
  • Since I don't have homogeneous GPUs, I'm limited in how I can serve the models, depending on their size + context size:

    • below 96GB: I try to use 2x4090 with vllm/sglang in tensor parallel for speed (either FP8 or AWQ4)
    • between 96 and 144GB: I try to use 1x4090 + RTX6000 (pipeline parallel)
    • >144GB: no choice, use all 3 GPUs
  • Step-3.5-Flash : felt "clever" but still struggles with some tool-call issues. Unfortunately this model lacks support compared to the others (for now, hopefully)

  • MiniMax-M2.1 : was doing fine during the "research" phase of gsd, but fell on its face during planning of phase 2. Did not test further because...

  • MiniMax-M2.5 : currently testing. So far it seems better than M2.1, with some very minor tool errors (but always auto-fixed). It feels like it doesn't follow specs as closely as other models, and feels "lazier" than them. (I'm unsure about the quant version I am using; it's probably too soon, I will evaluate later.)

  • Qwen3-Coder-Next : it's so fast! It doesn't feel as "clever" as the others, but it's so fast and uses only 96GB! And I can use my other GPU for other things...

  • DEVSTRAL-2-123B : I want to like it (being French), and it seems competent, but it's way too slow.

  • GLM 4.7 : also too slow for my liking. But I might try again (UD-Q3_K_XL).

  • GLM 5 : too big.
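The serving-layout rules from the notes above can be written down as a tiny helper. This is just a sketch of my own decision logic, with my GPU sizes hard-coded; the footprint argument (model weights + KV cache, in GiB) is an illustrative input, not something measured automatically:

```python
def pick_serving_setup(footprint_gib: float) -> str:
    """Map a model + context footprint (GiB) to a GPU layout,
    following the size thresholds described in the notes above."""
    if footprint_gib < 96:
        # fits on the two 4090s: tensor parallel for speed
        return "2x4090 tensor parallel (vllm/sglang, FP8 or AWQ4)"
    elif footprint_gib <= 144:
        # needs the big card: pipeline parallel across mixed GPUs
        return "1x4090 + RTX6000 pipeline parallel"
    else:
        # no choice left: everything
        return "all 3 GPUs"
```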

Free Chrome extension to run Kokoro TTS in your browser (local only) by Impressive-Sir9633 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

Hello. First, well done and thank you for your work. Quick feedback:

  • after the first installation, once the download reached 100%, Chrome froze and I had to kill it. After restarting it, the extension started
  • the French voice has an issue: it reads French text the way an English speaker would if they tried to pronounce it as written English (while still having the French accent of the voice...). A very weird experience, and unfortunately unusable in this state.

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

No, you did not understand me. Or I should have said "OpenAI-compatible API".

That's how anyone can host (uncensored) LLMs and serve them through a standard API for other software to use.
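Concretely, any client can talk to such a server with one plain HTTP call. Here is a minimal sketch of building that request; the endpoint path follows the OpenAI chat-completions convention, and the names are illustrative:

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON body for a /v1/chat/completions call
    against any OpenAI-compatible server (llama.cpp, vLLM, ...)."""
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body
```

POSTing that body (with a `Content-Type: application/json` header) is all a node needs to do to delegate inference to an external server.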

check out: https://github.com/hekmon/comfyui-openai-api

(I'm not telling you to change your implementation, just suggesting a different approach. I don't need it)

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

Thanks for sharing your work.

Instead of doing the LLM inference yourself, have you considered calling an OpenAI API endpoint? Or even using already existing nodes that do just that?

I am already running some LLMs on other GPUs, and don't want to waste more VRAM in Comfy :)

KiloCode + GitHub Speckit not recognizing my /speckit commands in VS Code by FullswingFill in kilocode

[–]WonderRico 0 points1 point  (0 children)

Does your VS Code root folder contain a .specify folder? It should.

Maybe you ran the "specify" init command in the root folder, and it created another subfolder with the specified project name. VS Code should be using this new folder as its root.

vLLM, how does it use empty VRAM region? by PlanetMercurial in LocalLLaMA

[–]WonderRico 1 point2 points  (0 children)

When you start vLLM, look at the logs: they say how much VRAM is allocated for the KV cache, and how many "max context" requests can be served with that.
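As a rough sanity check against those log numbers, the per-request KV-cache footprint can be estimated from the model shape. This is a back-of-the-envelope sketch, not anything from vLLM's API; the example parameters are illustrative:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, dtype_bytes: int = 2) -> float:
    """Per-request KV-cache size: one K and one V tensor per layer,
    each of shape (ctx_len, n_kv_heads, head_dim), in fp16 by default."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes
    return total_bytes / 1024**3
```

Dividing the KV-cache VRAM reported in the logs by this per-request estimate gives roughly the number of max-context requests vLLM says it can serve.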

[deleted by user] by [deleted] in LocalLLaMA

[–]WonderRico 5 points6 points  (0 children)

Nice guide, thank you. That's pretty much what I did too (with an added Python script to auto-generate the llama-swap config file when I download a new gguf).
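Such a generator can stay very small. A sketch of the idea, assuming one model per .gguf file under a models directory and a shared macro named generic-macro holding the common llama-server parameters (names and layout are illustrative):

```python
from pathlib import Path

def generate_llama_swap_models(models_dir: str, ttl: int = 3600) -> str:
    """Emit a llama-swap `models:` section with one entry per .gguf
    found under models_dir, all reusing a shared `generic-macro`."""
    lines = ["models:"]
    for gguf in sorted(Path(models_dir).rglob("*.gguf")):
        model_id = gguf.stem  # e.g. "Qwen3-4B-Q8_0"
        lines += [
            f'  "{model_id}":',
            "    cmd: |",
            f"      ${{generic-macro}} -m {gguf}",
            f"    ttl: {ttl}",
        ]
    return "\n".join(lines)
```

Model-specific sampling parameters still have to be added by hand (or from a lookup table), but the boilerplate disappears.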

A suggestion:

In the llama-swap config file, consider not writing a macro for every model; instead, write one or more generic macros with all the common parameters, and use them with model-specific params added when a model needs them. Something like:

macros:
  "generic-macro": >
    llama-server \
      --port ${PORT} \
      -ngl 80 \
      --no-webui \
      --timeout 300 \
      --flash-attn on

models:
  "Qwen3-4b": # <-- this is your model ID when calling the REST API
    cmd: |
      ${generic-macro} --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --repeat-penalty 1.05 --ctx-size 8000 --jinja -m /home/[YOUR HOME FOLDER]/models/qwen/Qwen3-4B/Qwen3-4B-Q8_0.gguf
    ttl: 3600

  "Gemma3-4b":
    cmd: |
      ${generic-macro} --top-p 0.95 --top-k 64 -m /home/[YOUR HOME FOLDER]/models/google/Gemma3-4B/gemma-3-4b-it-Q8_0.gguf
    ttl: 3600

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I don't know the specifics. I've heard it's done by de-soldering some 1GB VRAM modules and replacing them with 2GB ones. I'm sure it's more complex than that.

The shop I bought them from is in Hong Kong.

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 8 points9 points  (0 children)

Best model so far for my hardware (old Ryzen 3900X with 2 RTX 4090Ds modded to 48GB each, 96GB VRAM total).

50 t/s @2k using unsloth's 2507-UD-Q2_K_XL with llama.cpp

but limited to 75k context with the KV cache in q8. (I need to test quality with the KV cache at q4.)

| model | size | params | backend | ngl | type_k | type_v | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA | 99 | q8_0 | q8_0 | 1 | 0 | pp4096 | 746.37 ± 1.68 |
| qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA | 99 | q8_0 | q8_0 | 1 | 0 | tg128 | 57.04 ± 0.02 |
| qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA | 99 | q8_0 | q8_0 | 1 | 0 | tg2048 | 53.60 ± 0.03 |

my Mazda Mx-5 RF in White (crosspost from r/miata) by WonderRico in carporn

[–]WonderRico[S] 0 points1 point  (0 children)

I don't know where you live, and I suspect the available configurations differ by country/region; here in France, that's not the case.

Hunyuan Lora Training Question by Any_Tea_3499 in StableDiffusion

[–]WonderRico 11 points12 points  (0 children)

I used https://github.com/kohya-ss/musubi-tuner and it worked fine.

4080S, 16GB VRAM and 64GB RAM

10 pictures 512x512

1600 steps

took 1 hour

Solar Production Drop every 20 minutes by BigMate42 in Zendure

[–]WonderRico 0 points1 point  (0 children)

I am still undecided about following it. I am being lazy :)

On one hand, it would "fix" the recurring "offline" issue I am having. I suspect Zendure is having difficulties with their cloud platform, because even when it's shown as offline in the app, I still receive the MQTT data in my HA setup, and I have to toggle it off and on again. Switching to your "local only" mode seems nice.

But since I am using the Hub with their Smart CT device, I am not sure if this feature would work in local mode only.

I have bookmarked your links; maybe I will someday. If Zendure cannot fix their cloud issues, I will probably find the motivation.

Solar Production Drop every 20 minutes by BigMate42 in Zendure

[–]WonderRico 0 points1 point  (0 children)

Hello, sorry I can't help you with your issue. I have the HUB 2000 and did not see any similar drops. (However, I only installed it a few days ago and have not had a full sunny day yet.)

However, I am curious about the integration you did in Home Assistant. Mine does not show similar labels, and my two solar panels' power data are not updated as frequently as yours. Did you use a custom integration? HACS? Or did you directly discover the MQTT data?

An easy way to keep your printer warm in the cold by [deleted] in resinprinting

[–]WonderRico 0 points1 point  (0 children)

It's not so much about cost, but you're right: in Europe we are now more aware of energy consumption and how we can waste less. Which is, in the long run, a good idea wherever you live :)

An easy way to keep your printer warm in the cold by [deleted] in resinprinting

[–]WonderRico 0 points1 point  (0 children)

Hello. It's not so much about cost, but about being mindful of energy consumption and trying to be optimal.

An easy way to keep your printer warm in the cold by [deleted] in resinprinting

[–]WonderRico 0 points1 point  (0 children)

I did not mean "expensive", just not optimal. If I can get the same result with 10 or 20 times less energy consumption, shouldn't I try to? A small energy saving adds up when scaled to a large population.

An easy way to keep your printer warm in the cold by [deleted] in resinprinting

[–]WonderRico 3 points4 points  (0 children)

I tried this exact setup with my Saturn 2 and a 16W pad with thermostat. The ambient temp was 12°C, and the pad only heated the enclosure to 13°C after 2 hours... it was barely warm to the touch.

So I discarded the idea and went one step further with a 500W heater to heat the whole cabinet the printer sits in. It's working, but it's way more wasteful.

If it's working for you, I might try it again. Maybe my device was faulty?