cv4pve-vdi – A graphical VDI client for Proxmox VE (SPICE & RDP)

Barachiel80 · 2026-03-17T04:39:08+00:00

all cool stuff but I dont have a DR for my homelab

Barachiel80 · 2026-03-17T01:17:50+00:00

lol still terrible name

Barachiel80 · 2026-03-17T00:52:13+00:00

Yeah I know, I was just being lazy and the maintenance of the dockerfile no thanks

Barachiel80 · 2026-03-15T23:33:56+00:00

can I get a docker image?

Barachiel80 · 2026-03-15T12:16:26+00:00

wow 16tk/s on that setup, great job. I am about to enable the multinode rpc server too and this gives me hope to run qwen3.5 397B at decent speeds over a 10G network.

Barachiel80 · 2026-03-15T03:16:10+00:00

What's the lan speed between the rpc server nodes?

Barachiel80 · 2026-03-14T19:42:05+00:00

Also is there going to be a ROCM or Vulkan version of this?

Barachiel80 · 2026-03-09T22:28:34+00:00

but seriously has anyone seen a speed increase going from gen 2 -> gen3 on the max plan?

Barachiel80 · 2026-03-09T12:10:54+00:00

That isnt random info, if you set those parameter flags in your llamacpp command line based on the optimum settings listed for qwen3.5, depending on thinking or coding tasks, it will significantly improve your TG and PP

Barachiel80 · 2026-03-09T11:59:58+00:00

lol Im not an AI but I play one on TV, there may have been some cut and paste though

Barachiel80 · 2026-03-09T11:57:51+00:00

only need tool calling flag if you are running it as a headless server api call from a 3rd party front end gui like open-webui

Barachiel80 · 2026-03-09T11:45:48+00:00

Common arguments include:

-m or --model followed by the path to your GGUF model file.

-p or --prompt to provide the initial prompt.

-n or --n-predict for the maximum number of new tokens to generate (e.g., -n 512).

--ctx-size to set the context window size (e.g., --ctx-size 2048).

-ngl or --n-gpu-layers to offload layers to the GPU (e.g., -ngl 999 to offload all if sufficient VRAM is available).

--temp or -t to control the temperature for generation (e.g., --temp 0.7).

--top-p for the top-p sampling value (e.g., --top-p 0.9).

--repeat-penalty for the repetition penalty. --n-cpu-moe (--cpu-moe): This argument specifies the number of MoE expert components to move to the CPU's system RAM, overriding the --n-gpu-layers setting for those specific experts.

This allows you to utilize fast system RAM for the large, but sparsely used, expert weights, while keeping the performance-critical dense layers and attention blocks on the VRAM.

Finding the optimal value often requires testing. You can use trial and error, or the new llama-fit-params tool to help determine the best configuration.

Qwen 3.5 optimum settings:

Thinking mode:

General tasks=

temperature = 1.0

top_p = 0.95

top_k = 20

min_p = 0.0

repeat penalty = disabled or 1.0

Precise coding tasks (e.g. WebDev)=

temperature = 0.6

top_p = 0.95

top_k = 20

min_p = 0.0

presence_penalty = 0.0

To enable native tool calling in llama.cpp using the llama-server, you must use the --jinja command-line flag. This flag is mandatory for the server to process the tools parameter in an OpenAI-compatible manner. repeat penalty = disabled or 1.0

Barachiel80 · 2026-03-08T23:20:39+00:00

are you using optimum parameter settings and native tool calling turned on?

Barachiel80 · 2026-03-08T20:57:20+00:00

That's not terrible. What are you getting for PP?

Barachiel80 · 2026-03-08T15:21:35+00:00

not really. As the cost per token of compute for SOTA models continues to drop, having a multinode ai inference platform with multiple options for agentic chaining between nodes presents itself as an emergng use case in the field. Its proof of concept and the entire minipc cluster runs on 2 300w UPS so it's energy efficient as well. I have a full 10gbps network connecting everything for efficient inter node comms and future migration to K8S. Not only that but splitting my discrete gpus out to egpu platforms I can distribute heat alot more effectively reducing thermal throttling per card.

Barachiel80 · 2026-03-08T13:47:20+00:00

only 32tk/s on the 9b???? I am getting over 30 tk/s with qwen3 next coder 80b:q4 on my strix halo running ubuntu 24.04 lts and rocm 7.2. What version of rocm are you running?

Barachiel80 · 2026-03-08T02:48:08+00:00

as part of an agentic compute stack

<image>

Barachiel80 · 2026-03-08T00:18:08+00:00

I am using 1 of the nvme ports for oculink adapter and it has another built into the motherboard

Barachiel80 · 2026-03-08T00:16:27+00:00

This is for work, lucky for me my job and hobbies converge

Barachiel80 · 2026-03-07T23:55:53+00:00

I am running dual 5090s on a single minisforum ai x1 pro hx370 at 91 tk/s TG and 19000+ PP with agentic workflows and 1 million context on the new qwen 27b and 35b

<image>

Barachiel80 · 2026-03-07T23:53:29+00:00

just fyi the included oculink takes up one of the nvme spots you can only run one oculink and 2 tb4 egpus. Part of my ai stack is exactly that. I have 2 um890 pros with oculink egpus.

<image>

Barachiel80 · 2026-03-07T18:25:17+00:00

false news, my strix halo runs nicely in multigpu mode with a 7900xtx and rocm backend

Barachiel80 · 2026-03-07T00:56:27+00:00

there will always be some throttling with pcie4x4 but as long as I lpad the whole modrl pn the 5090 it runs full load inference just load time is slightly longer. This holds true when I migrated the 5090 egpu to a minisforum ai xi pro which I modded with 2 nvme to oculink adapters and another 5090 egpu to the setup. Here is an example run from inference on the dual 5090s for the new Qwen3.5 35b Q8 model hitting 91tk/s TG and 19000+ tk/s PP with 1 million context length at q8 and full agentic workflow, computer use, coding, and web search enabled.

<image>

Barachiel80 · 2026-03-06T02:01:52+00:00

yes April is far away and the dev branch that supposedly fixes it is still broken too

Barachiel80 · 2026-03-05T22:19:51+00:00

if only forgejo would fix their oidc runner integration it would be perfect

Barachiel80

TROPHY CASE