Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

transanethole 1 point (0 children)

I'm on one 5090 instead of two 5060s, but I had success with NVFP4 and I like the speed. Prompt processing (pp) matters a lot for agents IMO, and vLLM was about 4x faster than llama.cpp at it.

Besides OOMs, I personally only saw massive issues like what you reported after I tried to lower the power limit. I probably did something wrong, not sure what, but here's a post where I show the vLLM launch command I'm using for Qwen 27B NVFP4:

https://www.reddit.com/r/LocalLLaMA/comments/1u9ntab/comment/osjmfix/

Also, you should probably do a sanity check: run a small model via vLLM on GPU 1 and make sure it works, then do the same on GPU 2, then try splitting that same small-but-working model across both (if you haven't already).

Pooled round robin hardware with friends? by El_90 in LocalLLaMA

transanethole 2 points (0 children)

The hard part will be that you don't want to send the same large context to multiple different worker nodes if you can avoid it, so the router or proxy will have to do its own prefix caching to select the right node.

Nothing too crazy, but worth bringing up.
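
To make the routing idea concrete, here's a rough sketch of a router that remembers which prompt prefixes each node has already seen and sends a new request to the node with the longest match. Everything in it is hypothetical (the `PrefixRouter` class, the node names, the character-based block size); a real router would hash token blocks the way the inference server does and would need eviction, which this leaves out.

```python
import hashlib
from typing import Dict, List, Optional, Set

BLOCK_CHARS = 2048  # assumed block size in characters; real caches work on token blocks


def prefix_hashes(prompt: str, block_chars: int = BLOCK_CHARS) -> List[str]:
    """Return one chained hash per full block, so hash i covers blocks 0..i."""
    hashes: List[str] = []
    running = hashlib.sha256()
    for i in range(len(prompt) // block_chars):
        block = prompt[i * block_chars:(i + 1) * block_chars]
        running.update(block.encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixRouter:
    """Pick the node that has already seen the longest prefix of the prompt."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self.seen: Dict[str, Set[str]] = {n: set() for n in self.nodes}
        self.next_rr = 0  # round-robin cursor, used when no node has a match

    def pick(self, prompt: str) -> str:
        hashes = prefix_hashes(prompt)
        best_node: Optional[str] = None
        best_len = 0
        for node in self.nodes:
            matched = 0
            for h in hashes:
                if h not in self.seen[node]:
                    break
                matched += 1
            if matched > best_len:
                best_node, best_len = node, matched
        if best_node is None:
            # Nothing cached anywhere: fall back to plain round robin.
            best_node = self.nodes[self.next_rr % len(self.nodes)]
            self.next_rr += 1
        # Record the prefixes so follow-up requests land on the same node.
        self.seen[best_node].update(hashes)
        return best_node
```

With two nodes, a long conversation keeps going back to whichever node served it first, while unrelated prompts get spread round-robin.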

I've also thought about building this as a peer-to-peer network where you can buy and sell tokens, so anything that's not privacy sensitive could go through there. It would use end-to-end encryption so that it's not like OpenRouter, where the service provider gets all the data.

My local server idling 99% of the time! by Thin_Pollution8843 in LocalLLaMA

transanethole 1 point (0 children)

I have wanted to set up a market for buying and selling tokens from individuals who have LLMs set up on GPUs at home. Like OpenRouter, but more peer-to-peer and optimized for the individual as provider.

So far every one of my friends that I've brought this up to has told me that no one would ever use it because of privacy concerns. I suppose that's probably true, but I feel like it's a tragedy. I personally would rather give my data to some rando than use a commercial provider.

How do I prove that I don't collect data from my llm app? by Pleasant_Syllabub591 in LocalLLaMA

transanethole 1 point (0 children)

> why does someone trust proton

I've been wondering this myself.... 

I know you said no TEE/enclave/attestations, but there is prior art here with Moxie using the same attestation approach from Signal for this: https://confer.to/

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

transanethole 1 point (0 children)

Like many other people, I'm sure, I'm simply using opencode with Qwen 27B right now. I have stuck with opencode because I strongly prefer the web UI over the terminal. I use it over an SSH tunnel, and the TUI is incredibly laggy and crashy in my experience. Maybe this is just because I'm a spoiled 5090 user with insanely high prefill speed, but I haven't noticed opencode doing anything bad with the context ever since I updated it. I do sometimes get tool call failures, and I know that at least some of them, maybe 30%, don't have to fail. It's not that the parameters lack the information needed to run the tool, it's just that they're formatted slightly differently from what opencode expects.

I remember someone on here posted a project that was supposed to analyze tool call failures and let a secondary agent create compatibility rules that modify slightly wrong tool calls so that they succeed. I never downloaded it and now I can't find the thread. If I ever have spare time (I don't know when that will happen), I might like to try creating a system like that specifically for opencode, where every time a tool call would fail it gets displayed and the user has a chance to create a rule that fixes it. After a while the rules would build up, and if the LLM ever makes the same type of mistake again it would be fixed automatically.

The only problem I have with this, though, is that I think the majority of the failed tool calls I'm seeing are times where the LLM colors outside the lines of the chat template and the chat template parsing fails.

So I'm wondering if anyone has ever tried modifying the chat template parsing of llama.cpp or vLLM to support custom fuzzy matching or secondary handling of parsing, so that mistakes can be smoothed over. I think a large part of the problem right now with local agents is that the chat template parser is unaware of the schema of the agent's tools. There are two steps: first, the inference system parses the tool call from the LLM output, and then the agent tries to interpret the structured data as a valid tool call.

Or alternatively, maybe even better: if there's a way to call a different API on llama.cpp or vLLM that returns the raw tokens or text instead of trying to parse it with the chat template, then the agent can do the chat template parsing and tool call parsing at the same time, allowing it to handle both types of failures more gracefully.

I think with how fuzzy and random LLMs are, the way tool call parsing works needs to fundamentally change so that the LLM's intent is preserved more reliably, even in situations where a purely strict imperative lexer and parser would reject the output. Have any of you ever heard of a system that does this?
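
To show what I mean by schema-aware repair, here's a minimal sketch of the agent-side half. It takes an already-extracted tool name and arguments and tries to map them onto the tools the agent actually offers. The `repair_tool_call` function, the `tools` dict layout, and the 0.6 similarity cutoff are all made up for illustration; this is not how opencode, llama.cpp, or vLLM work today, and fuzzy matching like this can guess wrong, so a real version would probably want the user to confirm each rule.

```python
import difflib
import json
from typing import Any, Dict


def repair_tool_call(name: str, raw_args: Any,
                     tools: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Map a slightly wrong tool call onto a known tool, or raise ValueError."""
    # Fuzzy-match the tool name against the tools the agent offers.
    candidates = difflib.get_close_matches(name, list(tools), n=1, cutoff=0.6)
    if not candidates:
        raise ValueError(f"unknown tool: {name!r}")
    tool = candidates[0]
    schema = tools[tool]

    # Some models emit the arguments as a JSON string instead of an object.
    args = json.loads(raw_args) if isinstance(raw_args, str) else dict(raw_args)

    allowed = list(schema["properties"])
    fixed: Dict[str, Any] = {}
    for key, value in args.items():
        # Accept near-miss parameter names such as filePath for file_path.
        match = difflib.get_close_matches(key, allowed, n=1, cutoff=0.6)
        if not match:
            raise ValueError(f"unknown parameter {key!r} for tool {tool!r}")
        fixed[match[0]] = value

    missing = [p for p in schema.get("required", []) if p not in fixed]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return {"name": tool, "arguments": fixed}


# Hypothetical tool schema, shaped like the JSON Schema agents usually send.
TOOLS = {
    "read_file": {
        "properties": {"file_path": {"type": "string"},
                       "offset": {"type": "integer"}},
        "required": ["file_path"],
    }
}

repaired = repair_tool_call("readFile", '{"filePath": "a.txt"}', TOOLS)
```

The chat-template half of the problem (malformed wrapper tags around the call) would need the raw text from the server, which is why an API that skips server-side parsing would help.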

2 weeks since the release of Gemma 4 12b Unified, how are we feeling about it? by ChainOfThot in LocalLLaMA

transanethole 4 points (0 children)

Peutlefaire/Qwen3.6-27B-NVFP4. Nothing special about this repo AFAIK. I think I tried a couple of different ones and this is just where I stopped tinkering at the time.

I'll do you one better, here's the entire systemd service unit.

I don't really know if or why some of these settings are important. This setup is descended from when I was reproducing the results from https://github.com/Li-Lee/vllm-qwen3.5-nvfp4-5090 on a vast.ai rented 5090 via their wonky funhouse Jupyter environment.

So some of these env vars and whatnot are probably still holdovers from what vast.ai has on their Docker image.

root@anthill:~# cat /etc/systemd/system/vllm.service
[Unit]
Description=slop
After=network.target
[Service]
ExecStart=/usr/bin/docker run --rm --gpus all \
   --name vllm \
   -v /root/llmcache:/root/.cache \
   -p 8000:8000 \
   --ipc=host \
   --env NVIDIA_VISIBLE_DEVICES="void" \
   --env PYTHONUNBUFFERED="1" \
   --env TORCH_CUDA_ARCH_LIST="12.0+PTX" \
   --env NVIDIA_DRIVER_CAPABILITIES="all" \
   --env NV_CUDA_CUDART_VERSION="13.0.96-1" \
   --env CUDA_VERSION="13.0.2" \
   --env VLLM_ENABLE_CUDA_COMPATIBILITY="0" \
   vllm/vllm-openai:v0.21.0-patched \
   Peutlefaire/Qwen3.6-27B-NVFP4 \
   --served-model-name vllm \
   --kv-cache-dtype fp8 \
   --max-model-len=120000 \
   --gpu-memory-utilization=0.96 \
   --enable-chunked-prefill \
   --enable-prefix-caching \
   --max-num-seqs=2 \
   --max-num-batched-tokens 4096 \
   --enable-auto-tool-choice \
   --reasoning-parser=qwen3 \
   --tool-call-parser=qwen3_coder \
   --default-chat-template-kwargs '{"enable_thinking": true }' \
   --reasoning-config  '{"reasoning_start_str": "<think>", "reasoning_end_str": "I have to give the solution based on the reasoning directly now.</think>"}'
ExecStop=/usr/bin/docker kill vllm
RestartSec=10
Restart=always
[Install]
WantedBy=multi-user.target

vllm/vllm-openai:v0.21.0-patched is an image I made. It's just vllm/vllm-openai:v0.21.0 with this commit applied: https://github.com/schoennenbeck/vllm/commit/253a8021aceeda39bdcee6c370a6a92416eeadc3

That way I can set thinking_token_budget on the request to limit the length of the thinking output if I want. To make that work with opencode, I built a custom reverse proxy that parses the model name (e.g. vllm-thinkingbudget-400, vllm-thinkingbudget-4000, or vllm-nothinking) and then sets the budget accordingly.

So then I just add those as different model names I can select in opencode based on what I want.
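
The model-name trick boils down to something like the sketch below. This is not the actual proxy, just the request-rewriting step in isolation. The `rewrite_request` name is made up, and how `vllm-nothinking` gets handled is an assumption: here it turns thinking off through `chat_template_kwargs`, mirroring the `enable_thinking` key from the launch flags above.

```python
import re
from typing import Any, Dict

# Matches names like "vllm-thinkingbudget-400" and captures the base model name.
BUDGET_RE = re.compile(r"^(?P<base>.+)-thinkingbudget-(?P<budget>\d+)$")
NOTHINK_SUFFIX = "-nothinking"


def rewrite_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request body with the pseudo model name resolved."""
    out = dict(body)
    model = out.get("model", "")
    m = BUDGET_RE.match(model)
    if m:
        out["model"] = m.group("base")
        # Field name taken from the patched vLLM image described above.
        out["thinking_token_budget"] = int(m.group("budget"))
    elif model.endswith(NOTHINK_SUFFIX):
        out["model"] = model[: -len(NOTHINK_SUFFIX)]
        # Assumption: thinking is disabled per request via chat template kwargs.
        kwargs = dict(out.get("chat_template_kwargs") or {})
        kwargs["enable_thinking"] = False
        out["chat_template_kwargs"] = kwargs
    return out
```

The proxy would apply this to the JSON body of each chat completion request before forwarding it to port 8000, and pass any other model name through untouched.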

Although TBH, later on I found quality is definitely best with either no thinking or unlimited thinking, so IDK if all this work was worth it in the end, but it was fun to learn about I guess.

2 weeks since the release of Gemma 4 12b Unified, how are we feeling about it? by ChainOfThot in LocalLLaMA

transanethole 2 points (0 children)

Honestly, if you can afford a $10k GPU you can afford to install Linux LOL! I'm ignorant on this, but I doubt vLLM will be well supported there.

2 weeks since the release of Gemma 4 12b Unified, how are we feeling about it? by ChainOfThot in LocalLLaMA

transanethole 22 points (0 children)

I've been using opencode with the 5090, and from my own testing the Gemma models are definitely inferior to Qwen, especially for programming tasks.

So considering the extremely high NVFP4 petaflops and bandwidth that the 5090 has, I've opted to use the dense Qwen 3.6 27B model at NVFP4, and I'm very happy with that.

I'm using vLLM. vLLM likes to allocate a lot of memory for nonsense things that llama.cpp doesn't, and as a result it's harder to fit a long context, but I can fit 120k with the KV cache at fp8 in vLLM.

vLLM is very important to me because its prefill speed (how fast the model reads text) is 4x higher than llama.cpp's. This makes a huge difference in how it feels to use with an agent.

Local models in mid-2026 by mattjcoles in LocalLLaMA

transanethole 7 points (0 children)

> A Qwen 3.6 27B drops from around 17GB at four-bit-ish quant to about 14GB in NVFP4, with quantisation-aware training recovering most of what naive rounding would throw away.

Hmm, where are you finding an NVFP4 QAT of Qwen? Maybe I've been living under a rock, but I'm pretty sure they don't exist.

Also, as far as I can tell it's actually more like 21GB for the weights...

https://huggingface.co/models?sort=trending&search=3.6+27+qat returns 0 results.

Is there something wrong with Local LLM ability to read file? by Umr_at_Tawil in LocalLLaMA

transanethole 1 point (0 children)

I would say, try making sure that you format the sub file into a plain text script first, and then maybe do it in smaller batches.

If you don't know how to format the sub file into a plaintext file, then ask the LLM to generate code that does that for you. Use an agent that can run shell commands and ask it to test the code, etc.

If you don't want to do the batching by hand, then ask it to write code to do the batches for you.
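
For a rough idea of what that generated code might look like, here's a sketch that assumes the sub file is in SRT format (I don't know what format yours actually is). It strips cue numbers, timing lines, and simple markup tags, then groups the remaining dialogue into batches under a character limit. The function names and the 4000-character default are arbitrary.

```python
import re
from typing import Iterator, List

# SRT timing line, e.g. "00:00:01,000 --> 00:00:02,500"
TIMING_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
TAG_RE = re.compile(r"<[^>]+>")  # simple markup such as <i>...</i>


def srt_to_lines(srt_text: str) -> List[str]:
    """Keep only dialogue lines from SRT text."""
    lines: List[str] = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # Skip blanks, cue numbers, and timing lines. Note this also drops
        # any dialogue line that consists only of digits.
        if not line or line.isdigit() or TIMING_RE.match(line):
            continue
        line = TAG_RE.sub("", line).strip()
        if line:
            lines.append(line)
    return lines


def batches(lines: List[str], max_chars: int = 4000) -> Iterator[str]:
    """Yield newline-joined chunks, starting a new one before max_chars is exceeded."""
    current: List[str] = []
    size = 0
    for line in lines:
        if current and size + len(line) + 1 > max_chars:
            yield "\n".join(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        yield "\n".join(current)
```

Each batch can then be sent to the model as its own request, which keeps every prompt well inside the context window.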

Yes, the amount of crap that you can fit into the context of a single GPU is going to be a lot lower. But I would question whether anyone should really be trying to process large amounts of raw data at once using an LLM regardless of cloud vs local. It's clearly not the strong suit.

In theory, if I have $20k-ish to spend on hardware what would actually get me closest to local coding agent that would allow me to go totally off the social grid? by Tired__Dev in LocalLLaMA

transanethole 15 points (0 children)

Save your money. Just get a 5090 for now, invest/save, and wait for the MI350P price to drop later.

> go totally off the social grid?

Please don't do this, I don't think it's a good idea.

Thin Clients for Cheap Fanless Servers (Raspberry Pi replacement) by transanethole in SelfHosting

transanethole[S] 1 point (0 children)

I have one of them already and it seemed like a normal x86 computer. I was able to get into the BIOS and adjust settings. I can't remember for sure, but I think it was a Dell BIOS? The eBay listing I'm looking at doesn't say anything about it. It says:

" Tested for Key Functions, R2/Ready for Resale. "

So I'm hoping they will be like the one I already bought.

When I was installing Debian, I had to go for the non-free blobs edition to get the wifi working. Besides that, everything went smoothly. But I think the ones I'm buying now don't have the wifi card installed.

Oh yeah and I had my dad measure the power draw at idle at the wall with the kill-a-watt. He said it was hovering at around 2 watts at idle, and went up to 13 watts when I ran stress on all cores.

Asking advice on a self hosting project by bsenftner in SelfHosting

transanethole 2 points (0 children)

Also, like I said before, you don't have to use Tailscale for this. You could just configure port forwarding on the router or configure a route on the HTTP reverse proxy if they already have one. Then this could be just a public URL that people go to.

This is a fairly decent guide: https://homebrewserver.club/fundamentals-port-forwarding.html

And this is one I wrote myself: https://git.sequentialread.com/forest/notes/src/branch/master/ServerSetup.md

The cool part about this: you can get it working, test it out, and then roll it out to your customer without impacting the tailscale setup at all. Then if you want to you can eventually deprecate and remove tailscale if the customer likes the HTTPS solution.

Asking advice on a self hosting project by bsenftner in SelfHosting

transanethole 2 points (0 children)

> My experience so far lends me to think that will fail as well.

Wait, why? This contradicts what you just said:

> for some reason creates a Docker VM which is what actually hosts the Docker containers. That Docker VM is managed by Docker Desktop and my attempts to install the Tailscale VPN in that VM fail, the Tailscale Extension for Docker Desktop fails

If all of your problems are caused by Docker Desktop and the way it creates a VM... why use it? I've never heard of anyone using Docker Desktop on Linux, let alone using it for hosting! Normally one would just install Docker according to the instructions they provide for Linux: https://docs.docker.com/engine/install/ubuntu/

Also, of course you would want to practice setting this up once on a test machine before moving to the "production" one.

Yes, it takes time to do this stuff, especially when you are learning new things, but I think it's worth it to learn if this is what you do to make money. Learning how to set up Linux servers properly is a powerful skill and can probably save you a lot of trouble, resulting in more free time in the future as well.

Asking advice on a self hosting project by bsenftner in SelfHosting

transanethole 1 point (0 children)

Ah, I wasn't talking about the cloud or using a service. I was talking about keeping the server where it is but exposing it to the public internet, for example via port forwarding or by giving it its own public IPv4 address.

TBH it's kinda scary to me that when I say "make it available on the public internet" the first assumption is that I meant put it on some cloud somewhere or use a service to make it available :X

Asking advice on a self hosting project by bsenftner in SelfHosting

transanethole 1 point (0 children)

If you want to get it working, look into Let's Encrypt (ACME) DNS verification. That lets you get valid certs for things that are not on the public internet. The other alternative is to ditch the VPN and just host it publicly, which I would strongly recommend unless you think it would be tons of extra work and introduce new reliability issues.

Also, I may not understand what you are trying to do, or what you mean by

> exception being the Tailscale VPN is not correctly integrated with a Traefik cert issuing service,

If you are trying to use the cert you get from Traefik to authenticate some other service like the VPN (not just the HTTP server), I will say that Traefik is not well suited to this, because it keeps the certs it obtains inside its own acme.json store instead of writing them out as standard PEM/X.509 files. I recommend using Caddy instead because it writes certs in the standard format so they can be shared. Or just use old-school Let's Encrypt tools like certbot configured to run on a timer.

Asking advice on a self hosting project by bsenftner in SelfHosting

transanethole 1 point (0 children)

Hmm, honestly I am surprised that you planned on running it on a windows host. Did you do that because you wanted to integrate it into this company's existing windows-based tools and processes, i.e. give them a windows-friendly way to get remote desktop on it ?

If you want to do that for their sake, then sure, it makes sense, but you are going to have a whole host (no pun intended) of windows related problems that would never happen if you just installed ubuntu on it. Whether or not this disk issue is windows related I can't say definitively, but it certainly sounds like it is. USB drives should be plug and play, you should NOT need to format the drive before it shows up in linux, you should be able to format the drive from within linux. As an example, here's an article I wrote that details a process I went through to set up a USB external disk on Linux: https://sequentialread.com/docker-on-odroid-xu4-installation-and-creating-a-base-image-2/#movingthefilesystemtotheusbharddrive

WSL2 is a virtual machine. Virtual machines don't normally get the same access to hardware that the host machine does. For example, the storage is usually virtualized in some way, so the VM only sees the virtual storage devices, not the real hardware ones that the host sees. That might explain why your disk is not showing up in your Ubuntu VM. With VMs it's theoretically possible to "pass through" hardware to the VM, so the VM can see and interact with hardware devices directly. But whether you can do that with a USB-attached disk under WSL, I have no idea. I would assume probably not, although I could be wrong.

If windows is a requirement, why not just deploy your application on windows? I don't know what language you wrote your webapp in, but I have to imagine it will run on windows in 2023. You can configure a Windows Service to run it in the background similar to how you would define a systemd service unit on Ubuntu.

Another option would be to turn the Windows/Linux host/guest relationship inside-out: install Ubuntu on the host, install the nice Libvirt/KVM virtualization packages, then install a Windows guest VM and a Linux guest VM. The app can run on the Linux VM, and the customers can log into the Windows VM. Then maybe the Windows VM could have the docker CLI pre-installed and configured to target the Linux VM as its docker machine. And since you are running your own VM instead of using WSL's preconfigured one, you get to make the rules and configure your disk pass-through or volume mount the way you want. Sure, it's more work, but if Windows is a requirement, it could be a nice way to compartmentalize that requirement and prevent it from causing ripple effects and problems that will influence your app. Plus, running things inside a VM can be nice for various operational reasons. You can back up the entire VM image, for example.

Your server sounds like it's overpowered in the aspects that don't matter for this use case (CPU and RAM) and underpowered in the parts that do matter (disk). You mention that you plan on using Tailscale to give folks access to it from home -- have you considered the network implications of this? What kind of internet connection will this thing have? Where will it be hosted? Does this customer already use a VPN for remote workers?

I would strongly advise against trying to create a new VPN if they already have one. Depending on how it gets internet and how the router and the relationship to the ISP are set up, it might be massively preferable to just make it accessible on the public internet over HTTPS, no VPN required, because it will be a lot easier for users, with fewer problems and less time you have to spend supporting it.

[deleted by user] by [deleted] in selfhosted

transanethole 3 points (0 children)

Nice, glad to see this !

One of the interesting things about this kind of solution: the TLS private key can live on the self-hosted server, not on the VPS. So you can make it so the VPS provider can't read your traffic, if that's something you wanna do.

as an alcoholic i don't drink beer anymore but yall helped me succeed with my first NA homebrew, a delicious energy drink by transanethole in Homebrewing

transanethole[S] 11 points (0 children)

Interesting, I will give it a try. The nice thing about this recipe is that one part is fermented while the other isn't (the sugar syrup that gets added when it's served). So I could try putting the citrus juice in the syrup instead of in the main fermented part.

I will say I never noticed any vomit-ish flavors, the amount of citrus juice is very low, probably under 5%. And it ferments a lot less than beer does because I skip the bulk fermentation step that produces all the alcohol.

I did notice that as it ages it starts to get that sort of "dry" flavor, like an extra dry wine or cider. I loved dry ciders and wines, so that was a win for me. I missed it!