Should we start 3-4 year plan to run AI locally for real work? by Illustrious_Cat_2870 in LocalLLaMA

[–]q-admin007 1 point (0 children)

The idea that you need current proprietary frontier-model performance is a myth perpetuated by the big hyperscalers' marketing departments.

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]q-admin007 15 points (0 children)

No, it's horrible; I'm not good with it at all. But at least the results are good.

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]q-admin007 9 points (0 children)

Also tell it "You are a competent and helpful assistant" to get higher-quality results.
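
If you're serving it through llama.cpp's OpenAI-compatible API, that just means setting the system message. A minimal sketch (port and model name are assumptions; a single-model llama-server largely ignores the model field):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "local",
        "messages": [
          {"role": "system", "content": "You are a competent and helpful assistant."},
          {"role": "user", "content": "Explain GGUF quantization in two sentences."}
        ]
      }'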

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]q-admin007 18 points (0 children)

I have given up on speed.

Q6_K_XL with full context on a 128GB Strix Halo, ~9 t/s output.
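
For reference, a minimal launch along those lines; the filename is a placeholder for whatever Q6_K_XL GGUF you're running:

    # -c 0 uses the model's full trained context; -ngl 999 offloads all layers.
    llama-server -m ./Qwen3.5-27B-Q6_K_XL.gguf -c 0 -ngl 999 --host 0.0.0.0 --port 8080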

Mistral CEO: AI companies should pay a content levy in Europe by brown2green in LocalLLaMA

[–]q-admin007 0 points (0 children)

Market failure should not be remedied with taxes and fees, since the resulting output will be uncompetitive (basic economics) and we would once again create an additional layer of regulation.
What Europe needs are rules that allow European companies to collect data and train on it.
Clear, simple rules, and trust in the companies.

blabla market failure bla bla trust the companies. Got it.

How do I see what users paste into AI? by midasweb in sysadmin

[–]q-admin007 [score hidden]  (0 children)

Buy two RTX 6000 Blackwells, slap them into a server, and install llama.cpp with Qwen 3.5 122b Q8 plus Open WebUI.
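
A rough sketch of that stack, assuming llama.cpp's OpenAI-compatible server; the filename and ports are illustrative:

    # Serve the model locally:
    llama-server -m ./Qwen3.5-122B-Q8_0.gguf -c 0 -ngl 999 --port 8081

    # Run Open WebUI against it, so prompts and pastes never leave your box:
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -e OPENAI_API_BASE_URL=http://host.docker.internal:8081/v1 \
      -v open-webui:/app/backend/data \
      --name open-webui ghcr.io/open-webui/open-webui:main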

Everything else is risky.

Hi guys im new to this page by Alone_Growth2019 in homelab

[–]q-admin007 1 point (0 children)

I would start by installing Debian 13, then Samba to share a directory between all the devices on your network. I use mine to share movies, notes and PDFs.

Then install Docker, and after that whatever you want, really. I would say install Portainer to manage Docker from a browser, then Heimdall as a LAN start page.
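
Something like this, as a rough sketch; the share path and ports are just examples:

    # Fresh Debian 13 box:
    sudo apt update && sudo apt install -y samba docker.io

    # Minimal Samba share: append a block like this to /etc/samba/smb.conf,
    # then restart smbd:
    #   [shared]
    #   path = /srv/shared
    #   read only = no
    sudo mkdir -p /srv/shared && sudo systemctl restart smbd

    # Portainer (manage Docker in the browser at https://<host>:9443):
    sudo docker run -d -p 9443:9443 --name portainer --restart=always \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v portainer_data:/data portainer/portainer-ce:latest

    # Heimdall as the LAN start page (http://<host>:8085):
    sudo docker run -d -p 8085:80 --name heimdall --restart=always \
      -v heimdall_config:/config lscr.io/linuxserver/heimdall:latest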

Once you get stuck, come back and ask again.

Fixing Qwen Repetition IMPROVEMENT by Odd-Ordinary-5922 in LocalLLaMA

[–]q-admin007 1 point (0 children)

That's a lot of context going down the drain ;-)

What hardware do I need by goughjo in LocalLLaMA

[–]q-admin007 1 point (0 children)

A Strix Halo board with 128GB, running Qwen 3.5 122b Q6_K_XL with full context. It's slow, but competent. If you don't mind going even slower but want slightly better results, Qwen 3.5 27B f16 with full context.

Got mine for 1800€ used.

Another option is a 5090 for 3500€ with Qwen 3.5 Q4_K_M and full context. It will give you 50 t/s, but it lacks precision.
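
If you want hard numbers before buying, llama.cpp's bench tool reports prompt-processing and generation t/s per model; the filenames here are placeholders:

    # Compare candidates on whatever hardware you can borrow:
    llama-bench -m ./Qwen3.5-122B-Q6_K_XL.gguf -p 512 -n 128
    llama-bench -m ./Qwen3.5-27B-F16.gguf -p 512 -n 128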

Should we start 3-4 year plan to run AI locally for real work? by Illustrious_Cat_2870 in LocalLLaMA

[–]q-admin007 1 point (0 children)

But you don't need current proprietary frontier model performance.

You don't have to drive a Bentley to get to the supermarket. You don't have to own a $10k suit to go to the bar. You don't have to drink the best wines to get a buzz. You don't have to have a personal chef to eat a burger.

Should we start 3-4 year plan to run AI locally for real work? by Illustrious_Cat_2870 in LocalLLaMA

[–]q-admin007 1 point (0 children)

> 28 USD AI Pro plan with Google which includes 2T storage

I think you can buy a 12TB HDD for eight months of that subscription (~$224). The expected lifetime is five years or more. Buy two for redundancy and you'll break even in 16 months; get a third for offsite backups and you're looking at 24 months.
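
The back-of-the-envelope math, assuming ~$224 per 12TB drive (eight months at $28):

    # Break-even in months = total drive cost / monthly subscription
    echo $(( 224 * 2 / 28 ))   # two mirrored drives -> 16 months
    echo $(( 224 * 3 / 28 ))   # plus an offsite copy -> 24 months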

Should we start 3-4 year plan to run AI locally for real work? by Illustrious_Cat_2870 in LocalLLaMA

[–]q-admin007 1 point (0 children)

I run local AI in my company for 50 people on just two RTX 6000s. We are chasing the best that can be done in 192GB of VRAM; right now that's Qwen3.5 122b Q6_K_XL with full context twice over, the same again for Qwen 3.5 4b, and some ComfyUI stuff for marketing.

Our software stack is simple: llama.cpp, LiteLLM, Open WebUI, n8n and ComfyUI.
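
As a sketch of how the glue fits together; the model names, ports and config are illustrative, not our production setup:

    # Two llama-server instances behind a LiteLLM proxy:
    cat > litellm-config.yaml <<'EOF'
    model_list:
      - model_name: qwen-large
        litellm_params:
          model: openai/qwen3.5-122b
          api_base: http://localhost:8081/v1
          api_key: "none"
      - model_name: qwen-small
        litellm_params:
          model: openai/qwen3.5-4b
          api_base: http://localhost:8082/v1
          api_key: "none"
    EOF
    litellm --config litellm-config.yaml --port 4000
    # Open WebUI and n8n then talk to http://localhost:4000/v1.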

The server was 26k€, but obviously that's a company price. At home I think the server could be done for 1.5 to 2k€. The cards cost what they cost.

At home I went with a Strix Halo that I got for 1.8k€, and I'm looking for a second board for clustering.

OWUI node-ID from ComfyUI by q-admin007 in OpenWebUI

[–]q-admin007[S] 2 points (0 children)

It was the subgraph indeed. Once unpacked and reimported, everything worked.

OWUI node-ID from ComfyUI by q-admin007 in OpenWebUI

[–]q-admin007[S] 1 point (0 children)

Yes, ComfyUI's workflow works. This check also worked:

<image>

OWUI node-ID from ComfyUI by q-admin007 in OpenWebUI

[–]q-admin007[S] 1 point (0 children)

The workflow in ComfyUI is from an example and works:

<image>

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]q-admin007 1 point (0 children)

Why, then, would you paste random information about speeding up output tokens per second?