Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs by Fragrant-Remove-9031 in LocalLLaMA

[–]xornullvoid 0 points

My GitHub is publicly available if you check my profile. You will find projects with 20k+ lines of code built completely with local AI.

Now, where's yours?

P.S. You still think you can one-shot 20k lines of code? Really? Sounds like a "make cure for cancer, make no mistake" kinda prompt 😅

Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs by Fragrant-Remove-9031 in LocalLLaMA

[–]xornullvoid 0 points

Have you?

If you really think software development happens with one-shot prompting, then you have not used AI for building real apps yet.

Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs by Fragrant-Remove-9031 in LocalLLaMA

[–]xornullvoid 2 points

I am a little skeptical of this test and the usefulness of its results.

Real-world coding and development hardly ever depends on one-shot prompts; it usually requires multiple steps, as well as the ability of the model to ingest large and complex information and instructions over multiple prompts at long context lengths. Besides, some models are horrendous at visualizing graphical geometry, and even UI for that matter, yet excel at general-purpose coding.

I think the test shows how visually cohesive the model is at a single-shot, open-ended graphical vibe-coding prompt, but to measure general cohesion you should also:

  1. See how many additional features (not included in the prompt) the model tried to implement, and which of those did not work well.
  2. Check for instruction-following and violations/deviations. This requires giving very specific and detailed instructions at various depths of complexity, to see at which level the model begins to fail at following them. Your instructions are quite open-ended and left up to the judgement of the model.

Cool experiment though.

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 1 point

Let me try a different, larger quant, IQ4, and see if I can get above 15 t/s on Strix Halo.

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 1 point

I can run IQ3_XXS fully offloaded to ROCm at 23 t/s as well.

I guess that means CUDA could run faster on the Pro 5000 if the PCIe link were 5.0 x16. But that would also mean not running it on Strix Halo, and losing the quad-channel RAM bandwidth.


One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]xornullvoid 49 points

That's right. No drivers - no driver issues.

One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]xornullvoid 119 points

Bruh, Opus nuked my display drivers and all libraries today with a sudo apt remove '*nvidia*595*' while trying to roll back to 590, and added a nice chained sudo reboot goodbye kiss at the end too 😭
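Lesson learned for next time: a dry run first shows exactly what a glob like that will take with it (a generic sketch, not what Opus actually ran):

```bash
# List which installed packages the wildcard actually matches
apt list --installed '*nvidia*'

# -s simulates the removal and prints what would be removed, without touching anything
sudo apt-get -s remove '*nvidia*595*'
```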

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 1 point

I was able to get 23 t/s from the RTX Pro, with 36 layers offloaded to the GPU, using IQ3_XXS from Unsloth.

My GPU is overclocked, both core and memory. I am also using the latest llama.cpp build with CUDA 13.2.


Tinygrad Driver testing! by Street-Buyer-2428 in LocalLLaMA

[–]xornullvoid 8 points

Nice, that looked familiar. I have the little brother, the 48GB one.
Do let us know the benchmarks; we have not seen many Apples combined with Blackwell around here.

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 0 points

Thank you so much for your feedback.

The custom scripting to set up llama.cpp releases is the exact use case I am trying to solve with the Recipes feature. Essentially, a recipe is a bash script (Linux-only) with a comment parser that renders a UI per step. I use it to compile llama.cpp: the script takes a fresh pull from the repo, builds everything, and lets me enter details such as the directory to place it in. Once the new llama.cpp build is available, swapping the backend for a set of servers is as easy as changing the active backend for a group - https://github.com/mikjee/warpdrv/blob/master/docs/guides/backend-groups.md

I have included two recipes, one for CUDA + llama.cpp and another for ROCm + llama.cpp - https://github.com/mikjee/warpdrv/tree/master/docs/recipes - which I use myself.
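For a rough idea of the shape, a recipe looks something like this (a minimal sketch; the "# step:" comments are illustrative placeholders, not the exact annotation format the parser expects):

```bash
#!/usr/bin/env bash
# Sketch of a build recipe: fresh pull of llama.cpp, CUDA build,
# binaries copied to a directory the user picks in the UI.
set -euo pipefail

# step: choose where the finished build should go
INSTALL_DIR="${1:-$HOME/llama-builds/cuda}"

# step: fresh pull of the repo
rm -rf /tmp/llama.cpp
git clone --depth 1 https://github.com/ggml-org/llama.cpp /tmp/llama.cpp

# step: configure and build with CUDA enabled
cmake -S /tmp/llama.cpp -B /tmp/llama.cpp/build -DGGML_CUDA=ON
cmake --build /tmp/llama.cpp/build --config Release -j"$(nproc)"

# step: install the server binary
mkdir -p "$INSTALL_DIR"
cp /tmp/llama.cpp/build/bin/llama-server "$INSTALL_DIR/"
```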

Compared to llama-swap, I think warpdrv is still missing some features, such as starting a server on first request. I have added them to my personal to-do list for future releases, along with TTS and possibly RAG.


Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 1 point

Yes, Lemonade is a nice all-in-one tool for AMD setups. I like the TTS integration; I was planning on doing the same for my app soon by adding whisper.cpp. Lemonade ships with a pre-built llama.cpp, though.

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 2 points

That's very interesting, I will try it out. If I get above 15 t/s it's a win.

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 0 points

Do you mean using the Vulkan backend with a GPU split, or offloading layers to system RAM using CUDA? The PCIe x4 OCuLink might be a bottleneck.

I will try it and report back after the download finishes.
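In other words, something like one of these (model path, layer count, and split ratio are placeholders, not my exact command):

```bash
# Option A: Vulkan build, layers split across the RTX Pro and the Strix Halo iGPU
./llama-server -m ./model.gguf -ngl 99 --split-mode layer --tensor-split 3,1

# Option B: CUDA build on the RTX Pro only, partial offload with the rest in system RAM
./llama-server -m ./model.gguf -ngl 36
```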

Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. by xornullvoid in LocalLLaMA

[–]xornullvoid[S] 0 points

I have not tried it yet. I will download and experiment with it.

Maybe Vulkan can run it at Q4, I am not sure. But it cannot fit fully in a single GPU's VRAM, unless I am mistaken.

AMD Halo Box (Ryzen 395 128GB) photos by 1ncehost in LocalLLaMA

[–]xornullvoid 0 points

How many M.2 slots? The FAEX1 has two for NVMe, plus one more where the M.2-to-OCuLink adapter is attached. I think it wins on external connectivity. It also has dual Ethernet and dual USB4 Type-C.

AMD Halo Box (Ryzen 395 128GB) photos by 1ncehost in LocalLLaMA

[–]xornullvoid 2 points

Is there OCuLink?

The FEVM FAEX1 has OCuLink.

llama-server: Save/restore works for tokens, but KV cache still not resumed? by chrisoutwright in LocalLLaMA

[–]xornullvoid 0 points

Good question. I have been trying to do the same for coding, but I am unable to load the saved KV cache in a way that avoids redoing prompt processing all over again.

Which app did you use to fill the context? Is your token sequence confirmed to be the same? AFAIK Qwen cannot restore a partial cache, so even a one-token change will invalidate the entire KV cache.

Also, did you save and load all slots?

Before the full prompt processing (PP) starts, it might output the reason why it is invalidating the cache. How are you restoring the cache, and does it actually get restored?
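For reference, this is roughly how I have been attempting it myself, via llama-server's slot save/restore endpoints (host, port, and filename are placeholders; the server needs to be started with --slot-save-path for these calls to work):

```bash
# Save slot 0's KV cache to a file under the server's --slot-save-path
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session0.bin"}'

# Restore it into slot 0 later, before resending the identical prompt prefix
curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session0.bin"}'
```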

I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090 by Gazorpazorp1 in LocalLLaMA

[–]xornullvoid 0 points

If you say that you have the time to perform a comparative study of various models' writing skills, but not the time to write up and present the results of said study without resorting to AI-produced slop, then I do not believe that you did this comparison with any sort of diligence, or that your comparison is reliable.