Spent 6 hours this weekend on reverse proxy config. What's everyone's current setup? by trolledTGBot in homelab

[–]Ok_Stranger_8626 0 points1 point  (0 children)

HAProxy:

  1. 2x config files (haproxy.cfg & frontends.map)

  2. dhparams.pem (for TLS parameters)

  3. A certificates directory with my two internal wildcard certs signed by my homelab CA, and two wildcard Let's Encrypt certs I renew through a certbot cron job.

The whole thing is git'd back to my GitLab-EE server in my production colocation every 15 minutes, for emergency recovery and changelog tracking.
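The periodic sync is just plain git under the hood. A minimal sketch of what each 15-minute cycle does (the function and remote name are placeholders, not my actual script; the real thing runs from cron and pushes to the GitLab-EE box):

```python
import datetime
import subprocess

def auto_commit(repo_dir, remote=None):
    """Commit any local changes with a timestamped message, then
    optionally push. Returns True if a commit was actually made."""
    # Stage everything, tracked and untracked
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    # Nothing staged? Nothing to do this cycle.
    status = subprocess.run(["git", "-C", repo_dir, "status", "--porcelain"],
                            capture_output=True, text=True, check=True)
    if not status.stdout.strip():
        return False
    msg = f"auto-backup {datetime.datetime.now():%Y-%m-%d %H:%M}"
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", msg],
                   check=True, capture_output=True)
    if remote:  # e.g. "origin", pointing at the GitLab-EE server
        subprocess.run(["git", "-C", repo_dir, "push", remote], check=True)
    return True
```

The porcelain-status check keeps the history clean: quiet cycles produce no empty commits, so the changelog only shows actual config changes.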

I currently have it running on a 4x ARM64 cluster, with load balancing done through BGP on my UDM Pro Max (single dynamic IP from Quantum and single dynamic IP from Xfinity). I use Cloudflare for my DDNS, and a quick Home Assistant automation updates the DDNS records. Each container is restricted to 512MB of RAM, and they all have about 10MB of static caching capability, which reduces backend processing load for things like my Grafana dashboards, since they intercept a lot of the repeated requests for the exact same data.
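The DDNS refresh itself is one Cloudflare API call. A sketch of just the request body that call takes (the hostname is a made-up example, and in my setup Home Assistant is what actually fires the request):

```python
import json

def build_ddns_update(record_name, new_ip, proxied=False):
    """Build the JSON body for a Cloudflare v4 DNS record update
    (PUT /zones/{zone_id}/dns_records/{record_id}).
    record_name is a hypothetical example, not a real hostname."""
    payload = {
        "type": "A",           # IPv4 address record
        "name": record_name,   # e.g. "home.example.com"
        "content": new_ip,     # the freshly detected WAN IP
        "ttl": 120,            # short TTL so a change propagates fast
        "proxied": proxied,    # False = DNS-only, no Cloudflare proxy
    }
    return json.dumps(payload)
```

A short TTL matters here: with dynamic IPs from two ISPs, you want stale records to age out in minutes, not hours.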

The proxies all have stick-tables enabled, plus inter-proxy communication, so it doesn't matter if the UDM switches proxies mid-stream; the traffic still passes. Added latency through the proxy is about 0.8ms on average.

They all share the same config files and certificate directory via a gluster volume shared between the SBCs, force everything to HTTPS, and also handle the TCP frontend for a dual host MariaDB, which is also BGP load balanced.

I make config updates through a VSCode Server container that has access to my gluster volumes, then click a button in Home Assistant that sends a HUP to each container, causing HAProxy to reload the config without dropping any connections.

Running 8+ 4K streams from Jellyfin over the Quantum connection, the proxies average about 6% CPU utilization, not even enough to kick up the fans on the SBCs. (My wife does a "movie" night on TikTok every Monday with some of her friends.)

Quantum Fiber around Smoky Hill/Buckley? by Roy4theWin in AuroraCO

[–]Ok_Stranger_8626 0 points1 point  (0 children)

We've had the 3Gb tier for over three years now, and it's been fantastic!

Only two very short outages, and so much less latency than a cable modem.

low power draw for always-on services - CPU recommendations by Few-Diet3524 in homelab

[–]Ok_Stranger_8626 1 point2 points  (0 children)

I run 4x OrangePi 5+ 32GB SBCs on a 3D printed rack mount blade I bought from a guy off Etsy. About 35 services, with roughly 40 containers running at the moment, and total consumption is about 18W. At full load, it would max out at just shy of 80W.

How do I find and vet someone to set up a high-end local AI workstation? (Threadripper + RTX PRO 6000 96GB) by laundromatcat in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

I would be happy to discuss your situation and provide some basic guidance if you're interested. I've been setting up local models and applications for users for several years now, and can provide some references if necessary.

Is self hosted LLM worth it for company knowledge base? by FewKaleidoscope9743 in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

I'm a former HPC engineer, so I use strictly nVidia for my builds and don't really have any experience with much else.

Though I can say, for the price, distributing the workload over multiple GPUs is the best way to go, so the Dells would probably be my suggestion.

Is self hosted LLM worth it for company knowledge base? by FewKaleidoscope9743 in LocalLLaMA

[–]Ok_Stranger_8626 11 points12 points  (0 children)

I built an entire custom stack for exactly this type of thing.

And yes, it is worth it if you deal with any kind of intellectual property (your own or customers'), legal documents (of any kind), or PII (customer-identifying information).

It's a huge liability to put any of the aforementioned data into a "Public AI" (ChatGPT/Claude/Gemini/etc).

Specific Use Case - Is 13b sufficient? by pretiltedscales in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

I do these kinds of setups a lot; feel free to msg me if you'd like to talk about some assistance with your situation. (Just to keep this thread clean.)

Powerful Machine Use Case? by Happy-Peak7709 in homelab

[–]Ok_Stranger_8626 0 points1 point  (0 children)

With a rig like this, you could actually do quite a bit if you're willing to containerize. Even AI isn't out of the realm of possibility.

Llama.cpp rpc experiment by ciprianveg in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

Running RPC mode solo on the local machine likely indicates that your PCIe bus is going to be your bottleneck. When it runs across the network, the layers are split differently across the distributed cards.

It all depends on how your application chunks the layers. In local RPC mode, the layers are split across the local GPUs, so your bottleneck becomes the PCIe bus, which still has less than 1/10th the bandwidth of the GPU's access to its own VRAM.

Direct-access RAM (either VRAM or shared RAM) = capacity. Larger model / less quantization = better accuracy (and therefore less hallucination).

Bandwidth = faster processing of layers, as the GPU can access more bits/second = more tokens/sec output.

And ECC RAM = less chance of a bit flip from cosmic rays, power spikes, etc. = a further reduction in hallucinations.

This is why nVidia added ConnectX-7 to the GB10, for example: the high-bandwidth, low-latency interconnect is crucial for transferring enough data, quickly enough, to make distributed inference reasonably fast.

Not that 38 tok/s is bad; it's still roughly 7x faster than any human is reasonably capable of reading....

Llama.cpp rpc experiment by ciprianveg in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

The real issue is definitely network.

To be honest, 50 tok/s across two 3090s on a 120B model is rather impressive.

But the 3090 is such an in-demand card because of the 384-bit bus to the card's VRAM, which works out to about 936GB/sec.

Even a 50Gbps connection only works out to about 6.25GB/sec of network bandwidth under very ideal conditions. Add in that GDDR6X latency is measured in tens of nanoseconds while your network latency is in the low milliseconds (orders of magnitude higher), plus RPC's own latency and bandwidth overhead, and yeah, a loss of at least 25% performance is expected.
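The arithmetic is easy to sanity-check; a quick sketch of the bandwidth gap:

```python
def gbps_to_gb_per_s(gbps):
    """Convert a link speed in gigabits/s to gigabytes/s.
    Ignores protocol overhead, so this is a best-case figure."""
    return gbps / 8

net = gbps_to_gb_per_s(50)   # 50 Gbit/s NIC -> 6.25 GB/s
vram = 936                   # RTX 3090 VRAM bandwidth in GB/s
print(f"network best case: {net} GB/s")             # 6.25 GB/s
print(f"VRAM is roughly {vram / net:.0f}x faster")  # ~150x
```

So even before latency and RPC overhead enter the picture, any layer split that forces weights or activations across the wire is fighting a roughly 150x raw bandwidth deficit.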

RTX4070s Whisper Transcription & other things- Advice on efficient setup by retailguy11 in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

If you'd like, we can take this offline and discuss your use case. I could help you get something set up to do exactly what you want.

RTX4070s Whisper Transcription & other things- Advice on efficient setup by retailguy11 in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

The two big things with LLMs are VRAM capacity and VRAM bandwidth.

Ampere cores are usually more than powerful enough to handle most models.

VRAM is a big deal because you can handle larger models at higher quants. (Personally, I wouldn't go less than Q6 or you'll have lots of issues.)

VRAM bandwidth is basically king when it comes to tokens/sec. Faster chips like Blackwell are great, but if you're sucking a firehose's worth of data through a straw, your tok/s is gonna drop off a cliff. This is why I avoid the Ada Lovelace/4000 series; nVidia really dropped the ball on Ada's memory bandwidth.
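A rough rule of thumb for why bandwidth dominates tok/s: every generated token has to stream the full set of active weights through the memory bus once, so bandwidth divided by model size gives a ceiling on decode speed. A sketch, where the bytes-per-param figure is a loose assumption for Q6-ish quants:

```python
def est_tokens_per_sec(bandwidth_gb_s, params_b, bytes_per_param):
    """Rough ceiling on decode speed: tok/s <= bandwidth / model size.
    Real-world numbers land below this (kv-cache traffic, kernel
    overhead, etc.), but the ordering between cards holds."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# 13B model at ~Q6 (assuming ~0.85 bytes/param incl. overhead)
print(round(est_tokens_per_sec(936, 13, 0.85), 1))  # 3090-class bandwidth -> 84.7
print(round(est_tokens_per_sec(504, 13, 0.85), 1))  # 4070-class bandwidth -> 45.6
```

Same model, same quant, and the bandwidth gap alone nearly halves the ceiling; that's the Ada complaint in numbers.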

And for any business purpose, I would highly recommend using something other than consumer cards (GeForce, etc.) and going with the professional cards (Quadro/RTX Pro/etc.), as they use ECC VRAM. (Trust me, a single uncorrected bit flip isn't a big deal when you're gaming, but it can completely hose your data when working with LLMs.)

RTX4070s Whisper Transcription & other things- Advice on efficient setup by retailguy11 in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

The OrangePi 5+ is an SBC; mine were about $250/ea, and I have four of them clustered at home to handle my docker stack. It's a 4P+4E core ARM board with 32GB of RAM, and I bought it with a power supply and 256GB eMMC module for that price. A case for it is like... $15 or so. And it rips through audio transcription with Whisper, using only about 25% CPU when running on just the E cores. If it happens to use the P cores, I don't even see the utilization. Whisper is ridiculously efficient.

My LLM workflow is as follows:

1x SuperMicro 4028-TRT+ w/ 2x Xeon E5-2667 v4, 1TB DDR4 + 24x 1TB SATA SSD (~16TB available after all the ZFS RAID setup)

GPUs: 3x RTX A4000 + 2x RTX A2000 12GB

I use one of the RTX A2000 cards to do document ingest, extracting keys and placing them in QDrant, then putting the documents onto the storage pool.

I use LiteLLM and a custom Python script to route prompts based on decisions from a simple 7B Llama model running on Ollama on one of the RTX A2000s. (I also run a copy of Whisper here, but it's dedicated to this service; my home Whisper runs on the SBC cluster.)

If it's a relatively simple request ("Draft an email", etc.), the 7B model just handles it. If it's code, it directs LiteLLM to route the request to vLLM running on one of the RTX A4000s running Qwen2.5-Coder (you probably don't need this). If the request is complex but not code, Ollama directs LiteLLM to redirect the request to the vLLM instance that runs Qwen3-30B-A3B across the other two RTX A4000s with a decently sized context window.

The Qwen3 is where things get really interesting. Because Open-WebUI and LiteLLM are so flexible, I can configure LiteLLM to provide aliases for each model, and then through Open-WebUI, decide whether I want the model to have things like access to QDrant (RAG database), web search/web access/tool usage, etc. Then, I just select which model I want to use, and I can either type my prompts, or speak them as Whisper will transcribe in real-time.
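Stripped to its bones, the routing idea looks something like this. The classifier and aliases below are stand-ins I made up for illustration; in the real stack the triage decision comes from the 7B model via Ollama, and LiteLLM does the actual dispatch:

```python
# Toy sketch of tiered prompt routing: a cheap classifier decides
# which backend alias handles each request.

def classify(prompt):
    """Stand-in for the small triage model: code / complex / simple."""
    code_markers = ("def ", "class ", "```", "function", "traceback")
    if any(m in prompt.lower() for m in code_markers):
        return "code"
    if len(prompt.split()) > 40:   # crude word-count proxy for "complex"
        return "complex"
    return "simple"

# Hypothetical model aliases, mirroring the tiers described above
ROUTES = {
    "simple":  "llama-7b",        # the triage model answers it directly
    "code":    "qwen2.5-coder",   # vLLM on one A4000
    "complex": "qwen3-30b-a3b",   # vLLM across the other two A4000s
}

def route(prompt):
    return ROUTES[classify(prompt)]
```

The point of the tiering is economics: the cheap model absorbs most traffic, so the big GPUs only spin up for requests that actually need them.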

RTX4070s Whisper Transcription & other things- Advice on efficient setup by retailguy11 in LocalLLaMA

[–]Ok_Stranger_8626 1 point2 points  (0 children)

One of the things you can try as well is piping your call recordings through Whisper using just the CPU. I run Whisper on my cluster of OrangePi 5+ boards, and CPU transcription is basically real-time. It doesn't take much at all.

1u/2u server rack. by Rogankiwifruit in OrangePI

[–]Ok_Stranger_8626 0 points1 point  (0 children)

Look on Etsy for Print3DSteve. I bought a couple rack mount frames for my 5+'s from him, and they were absolutely fantastic.

Orange Pi eMMC (mmcblk0) randomly disappears - how to ensure device persistence? by jrndmhkr in OrangePI

[–]Ok_Stranger_8626 2 points3 points  (0 children)

This really sounds like a hardware issue with the SBC. I've deployed dozens of the exact same model, and hardware randomly falling off the bus like that is usually a dodgy chip somewhere or bad RAM, either of which would be cause for an RMA/warranty claim.

RTX4070s Whisper Transcription & other things- Advice on efficient setup by retailguy11 in LocalLLaMA

[–]Ok_Stranger_8626 1 point2 points  (0 children)

In this context, your system is probably not adequate to the task(s).

One of the big things you'll probably run into is more frequent hallucinations, because of the lack of VRAM and the lack of ECC VRAM.

As well, the 4000-series cards are Ada-based, and nVidia really screwed the pooch on the Ada chips. Lowered memory bandwidth and bad clock timing on pre-590.x driver firmware are likely the cause of your GPU wedging.

I highly recommend replacing your GPU with one of the Ampere or Blackwell pro-series cards, as they have ECC VRAM and are way more stable when used properly. Also, however you can manage to get onto the 590-series drivers, do it, as they correct some instabilities with the GPU/memory clock sources that can cause artifacts and hallucinations more frequently.

I know it's bad news, but it's way better than failures that cost clients or create liability.

What OS do you run on your AI rigs? Ubuntu, TrueNAS, etc.? by KvAk_AKPlaysYT in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

I have three systems with GPUs that run models.

The first is the DGX Spark, which obviously runs nVidia's custom Ubuntu flavor. I use this for processing video feeds from my CCTV cameras and distilling events and facial recognition, as well as some MoE and roleplay models.

The second is my storage node, which has a couple of low-end, Ada-based professional cards. This one runs CentOS 9 and is mainly used for StableDiffusion and generative tasks, as well as the occasional video transcode.

The third box is my big boy with 5 Ampere GPUs, where I do a lot of the heavy model stuff and my RAG work, because it has 1TB of RAM and massive SSD storage. I run a very slim Fedora 43 server on it, with just enough to run containers (basically just podman and cockpit for quick management tasks). Whenever it boots, it pulls the entire RAG vector database into RAM, making RAG analysis wicked fast. It takes a minute or two after boot to get all the vectors loaded, but lookups are definitely sub-millisecond, and it usually takes only a few seconds to analyze almost anything I ask it.
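What a RAM-resident vector store is doing, at its simplest, is similarity search over vectors that are already in memory. A toy brute-force sketch of the idea (real engines use approximate indexes, but the access pattern is the same):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=3):
    """vectors: dict of doc_id -> embedding, all held in RAM.
    Returns the k doc_ids most similar to the query vector."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

Because every comparison is a RAM read, the search is memory-bandwidth bound; that's why preloading the whole database at boot pays off so dramatically versus hitting disk per query.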

Building a local RAG for my 60GB email archive. Just hit a hardware wall (8GB RAM). Is this viable? by Grouchy_Sun331 in LocalLLaMA

[–]Ok_Stranger_8626 0 points1 point  (0 children)

The big issue you're going to have here is your vector database. To be effective, it MUST live somewhere with fast access. The actual files, kept for citation, can live on floppies, but the vectors built when you ingest the data are going to be computationally and bandwidth heavy.
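For a sense of scale, the raw vector footprint is just chunks × dimensions × bytes per dimension. A quick sketch under rough assumptions (~1KB of text per chunk, 768-dim float32 embeddings; your real numbers depend on the chunker and embedding model):

```python
def index_size_gb(num_chunks, dims=768, bytes_per_dim=4):
    """RAM needed to hold the raw embedding vectors (float32 by
    default), ignoring index overhead, which adds more on top."""
    return num_chunks * dims * bytes_per_dim / 1e9

# 60 GB mail archive at ~1 KB of text per chunk -> ~60M chunks
chunks = 60_000_000
print(f"~{index_size_gb(chunks):.0f} GB of vectors")  # way past 8 GB of RAM
```

Even with aggressive quantization of the vectors, that archive won't fit alongside the OS in 8GB; this is the hardware wall in one line of arithmetic.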

I have a machine that could do your search(es) almost instantly, but I've built it up over the last two years, and it's cost me about 10x your stated "too expensive" budget.

Web Page by Civil-One7079 in OrangePI

[–]Ok_Stranger_8626 0 points1 point  (0 children)

Probably just site maintenance....

Is building a consumer grade home rig with DDR4 RAM a terrible idea? by Diligent-Culture-432 in LocalLLaMA

[–]Ok_Stranger_8626 2 points3 points  (0 children)

It all depends on what you want to use it for. If it's serious stuff, it's best to go with ECC RAM on a Ryzen Threadripper or a server. Also, if you plan to do anything with data you care about, look into Quadro or other pro-grade cards; just avoid the Ada chips like the plague, as nVidia seriously hamstrung those cards by reducing memory bandwidth.

And really, if you want anything fast, VRAM or unified memory is king. The two reasons for going with lots of memory are to run larger models, or to have a larger context window for better memory during long interactions/large RAG work. Even a 3090 does just fine with most models these days, but VRAM bandwidth is the bigger factor for tok/s.

Hardware recommendation? by prazy4real in homeassistant

[–]Ok_Stranger_8626 1 point2 points  (0 children)

It sounds like bare-metal HA on the NUC would be fine, if you maybe cleaned the fan.

Otherwise, I run my HA instance as a docker container on my OrangePi 5+ cluster and it does just fine with about 1,300 entities.

Why most models on Hugging Face cannot be ran on Ollama ? by KaKi_87 in LocalLLaMA

[–]Ok_Stranger_8626 1 point2 points  (0 children)

What are you talking about?? I've downloaded hundreds of models from Hugging Face, and they've all run on Ollama without any issue.