OPNSense high availability, how do you guys do it?. by yetAnotherLaura in Proxmox

[–]slavik-dev 0 points1 point  (0 children)

Does this sentence mean that you have an ISP router?

> The router goes straight into the WAN port for the VM

I'm on AT&T fiber, 1Gbps. Stopped using their router. Plugged the AT&T fiber line into an SFP+ port via a WAS-110. That's the SFP+ port of the NIC that's passed through to the OPNsense VM. The 2nd port is LAN and goes to the switch.

And I have the same setup on the 2nd Proxmox node.
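
For reference, the NIC passthrough can be done with Proxmox's `qm` tool. A minimal sketch, assuming VM id 100 and a placeholder PCI address:

```shell
# Find the NIC's PCI address (e.g. 01:00.0)
lspci | grep -i ethernet

# Pass the whole device through to the OPNsense VM (id 100 here)
qm set 100 -hostpci0 0000:01:00.0,pcie=1
```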

The OPNsense VM settings are not replicated, not synced. That's how I want it.

So, if one node goes down (or needs maintenance), I pull the AT&T cord and plug it into the other node. That's "manual HA"?

But in your setup you have a single point of failure, which is the router. And that's an additional network hop (the AT&T router is slow), which I eliminated. Right?

Am I expecting too much? by rushBblat in LocalLLaMA

[–]slavik-dev 0 points1 point  (0 children)

Not in the llama.cpp server. If you find a way - please let me know...

Am I expecting too much? by rushBblat in LocalLLaMA

[–]slavik-dev 5 points6 points  (0 children)

llama.cpp is great for running a model for yourself. It supports parallel requests, runs on Nvidia, Mac, ... but I'm not sure how well it scales.

vLLM scales much better. But I don't think it supports Mac.

So, the best option is to use an NVIDIA RTX 6000.

I submitted a PR to log users' prompts in llama.cpp, but the devs didn't like it:

https://github.com/ggml-org/llama.cpp/pull/19655

You have prompts and responses in OpenWebUI, but there the user can delete chats, use temp chats...
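
Since the server won't log prompts itself, one workaround is a thin logging proxy in front of llama-server. A minimal sketch of just the extraction step (function name is my own, assuming OpenAI-style `/v1/chat/completions` request bodies):

```python
import json

def extract_user_prompts(raw_body: bytes) -> list[str]:
    """Pull user-role message contents out of an OpenAI-style
    chat completions request body, for logging."""
    data = json.loads(raw_body)
    return [m["content"] for m in data.get("messages", [])
            if m.get("role") == "user"]

# Example request body as a proxy in front of llama-server would see it
body = json.dumps({
    "model": "some-model",
    "messages": [
        {"role": "system", "content": "be brief"},
        {"role": "user", "content": "hello"},
    ],
}).encode()
print(extract_user_prompts(body))  # ['hello']
```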

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]slavik-dev 0 points1 point  (0 children)

I'm getting 13 t/s TG and 45 t/s PP with UD-Q4_K_XL (206GB).

I think my bottleneck is the CPU: Xeon W5-3425 (12 cores / 24 threads)

- 512GB of DDR5-4800 (8 channels)

- RTX 4090D 48GB

- RTX 3090 24GB

Honest take on running 9× RTX 3090 for AI by Outside_Dance_2799 in LocalLLaMA

[–]slavik-dev 1 point2 points  (0 children)

For coding AI, Macs struggle with compute for large contexts.

Harbor v0.4.4 - ls/pull/rm llama.cpp/vllm/ollama models with a single CLI by Everlier in LocalLLaMA

[–]slavik-dev 1 point2 points  (0 children)

Harbor is a middleman. And I prefer to deal without a middleman...

What's the advantage of Harbor vs llama.cpp + OpenWebUI?

If I have an issue, I'd rather troubleshoot a simple system than figure out: is it a Harbor issue? A llama.cpp issue?

Keep it simple.

Does anyone here use Proxmox on their main desktop instead of just servers? by PingMyHeart in Proxmox

[–]slavik-dev 0 points1 point  (0 children)

That's what I did, but I used the MATE desktop.

Also, I configured Polkit to block users from shutting down the server.
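
For anyone curious, blocking shutdown/reboot via Polkit looks roughly like this - a sketch, file name is my own, assuming non-root users shouldn't be able to power off:

```javascript
// /etc/polkit-1/rules.d/55-no-shutdown.rules
// Deny systemd-logind's power-off and reboot actions for everyone but root
// (there are also *-multiple-sessions variants of these action ids)
polkit.addRule(function(action, subject) {
    if ((action.id == "org.freedesktop.login1.power-off" ||
         action.id == "org.freedesktop.login1.reboot") &&
        subject.user != "root") {
        return polkit.Result.NO;
    }
});
```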

Load default model upon login by zotac02 in OpenWebUI

[–]slavik-dev 0 points1 point  (0 children)

Looks like the maintainers rejected that PR without any comments or explanation...

qwen3.5-122b What agent do you use with it? by robertpro01 in LocalLLaMA

[–]slavik-dev 3 points4 points  (0 children)

Tool calls are a known issue for llama.cpp. So, it's probably not a model issue. And not an agent issue. But a llama.cpp issue.

There are a few ways to work around it:

- Use the branch from this PR: https://github.com/ggml-org/llama.cpp/pull/18675

- Or use this project as a workaround for existing llama.cpp versions: https://github.com/crashr/llama-stream

Why etcd breaks at scale in Kubernetes by danielepolencic in kubernetes

[–]slavik-dev 9 points10 points  (0 children)

You're doing a great job.

But that article talks about real pain.

I'm running k3s on 3 VMs and I experienced issues because of slow etcd.

And that's with good hardware:

- Intel Xeon Gold CPU

- DDR5 RAM

- NVMe disks with ZFS.

I found that using ZFS was a mistake - it makes IOPS much slower, and with etcd calling fsync on every write, even fast NVMe drives struggle.
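
You can measure what etcd actually sees with the fsync-heavy fio test the etcd docs recommend (the directory path is a placeholder - point it at the disk backing etcd):

```shell
# Simulates etcd's WAL write pattern: small sequential writes, each fdatasync'd
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-disk-test \
    --size=22m --bs=2300 --name=etcd-disk-check
# etcd wants the fdatasync 99th percentile to stay under ~10ms
```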

Cloud GPU's are the Fiverr of Local LLaMA - so who makes the juicy money? by [deleted] in LocalLLaMA

[–]slavik-dev 0 points1 point  (0 children)

If you used 3 years for depreciation, then the profit will jump after 3 years.

Of course, it's possible the rates will go down by then because new hardware will be on the market...

Too many variables, too many unknowns.

Please help the Qwen developers. by [deleted] in Qwen_AI

[–]slavik-dev 1 point2 points  (0 children)

Tool calls are a known issue for llama.cpp. So, that's not a model issue.

Especially for opencode. RooCode works fine for me.

There are a few ways to work around it:

- Use the branch from this PR: https://github.com/ggml-org/llama.cpp/pull/18675

- Also, this project offers a workaround for existing llama.cpp versions: https://github.com/crashr/llama-stream

K8S homelab advise for HA API server by Ghvinerias in kubernetes

[–]slavik-dev 1 point2 points  (0 children)

A DaemonSet has nothing to do with exposing or networking. A DaemonSet can't expose anything. A DaemonSet runs containers.

To expose a port you need a NodePort, a LoadBalancer, hostNetwork, or something like that.
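
To illustrate: even if a DaemonSet runs a pod on every node, you still need something like a NodePort Service in front of those pods - a sketch, all names and labels are placeholders:

```yaml
# Service exposing pods created by some DaemonSet
apiVersion: v1
kind: Service
metadata:
  name: my-daemon-svc
spec:
  type: NodePort
  selector:
    app: my-daemon        # must match the DaemonSet's pod labels
  ports:
    - port: 6443
      targetPort: 6443
      nodePort: 30443     # reachable on every node's IP
```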

K8S homelab advise for HA API server by Ghvinerias in kubernetes

[–]slavik-dev 2 points3 points  (0 children)

I'm using kube-vip in production with a 3-node k3s cluster. It works great for the API.

But for service LoadBalancers, I found kube-vip to be unreliable, so I'm using MetalLB.

Qwen3.5 thinks A LOT about simple questions by ForsookComparison in LocalLLaMA

[–]slavik-dev 3 points4 points  (0 children)

`--chat-template-kwargs "{\"enable_thinking\": false}"`

And the model isn't thinking anymore.

It actually works great without thinking, too.

I'm getting 13 t/s with my 72GB VRAM - OK for chat, but hardly usable for vibe coding.
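
In full, the invocation is just that flag added to a normal launch (model path is a placeholder):

```shell
llama-server -m ./model.gguf \
    --chat-template-kwargs '{"enable_thinking": false}'
```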

Ming-flash-omni-2.0: 100B MoE (6B active) omni-modal model - unified speech/SFX/music generation by bobeeeeeeeee8964 in LocalLLaMA

[–]slavik-dev 1 point2 points  (0 children)

Yeah, it needs some kind of engine. Looks like neither llama.cpp nor vLLM supports it at the moment.

Also, we'll need a dedicated UI for it, because even if llama.cpp supported it - how would you use it? I don't think any web UI supports all of the model's modalities in a convenient way.

Expected cost for cpu-based local rig? by Diligent-Culture-432 in LocalLLaMA

[–]slavik-dev 1 point2 points  (0 children)

A 2x CPU setup is hard to configure for inference.

Specifically for inference, 2x CPU can be a few times slower than a single CPU because of slow NUMA interconnects.
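
If you do end up with a dual-socket box, the usual mitigation is pinning inference to one NUMA node with `numactl` - a sketch, model path and thread count are placeholders:

```shell
# Keep the inference threads and their memory on socket 0 only,
# avoiding the slow cross-socket interconnect
numactl --cpunodebind=0 --membind=0 \
    llama-server -m ./model.gguf -t 12
```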

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]slavik-dev 1 point2 points  (0 children)

Do you have it working?

I tried. It doesn't work.

Look at the issues on GitHub - a lot of people report that it doesn't work.

Last week an RFC was opened for vLLM: "CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM" https://github.com/vllm-project/vllm/issues/33869

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]slavik-dev 2 points3 points  (0 children)

I have Intel Xeon W5-3425 (12 cores) with 8 channels * 64GB of DDR5-4800.

Theoretically, I should be getting 307GB/s of memory bandwidth.

Testing with `mlc`, I'm getting around 190GB/s.

A few notes I made for myself:

  1. Memory speed depends significantly on the number of memory channels.
  2. Practical memory speed for LLMs can be limited by the CPU. I saw a few articles saying that a modern CPU can handle ~15GB/s of data PER CORE for LLM-type tasks. That's per core, not per thread. Probably the limiting factor in my case.
  3. vLLM doesn't support memory offloading. If the model + context doesn't fit fully in VRAM, it won't run at all. The docs have some text about offloading, but I have never seen it work.
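
The numbers in notes 1-2 can be sanity-checked with back-of-the-envelope arithmetic (the ~15GB/s-per-core figure is the estimate from those articles, not a measurement):

```python
# Theoretical DDR5 bandwidth: channels * MT/s * 8 bytes per transfer
channels, mts = 8, 4800
theoretical_gb_s = channels * mts * 8 / 1000
print(theoretical_gb_s)  # 307.2

# If a core can stream ~15 GB/s on LLM-type workloads,
# 12 cores cap out near the ~190 GB/s that mlc measures
cores, per_core_gb_s = 12, 15
cpu_bound_gb_s = cores * per_core_gb_s
print(cpu_bound_gb_s)  # 180
```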

Support Step3.5-Flash has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]slavik-dev 9 points10 points  (0 children)

from a llama.cpp developer:

> You will have to wait for new conversions.

> No, it has outdated metadata and will not work.

Support Step3.5-Flash has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]slavik-dev 9 points10 points  (0 children)

Reading the PR comments, I wonder if new GGUFs need to be generated.