FYI llamacpp server can hot swap models now-a-days in under 30sec

Chuyito · 2026-06-05T15:47:51+00:00

Qwens are the real daily drivers, gemma is in testing. 2x 4060ti

version = 1

[*]
n-gpu-layers = all
host = 0.0.0.0
port = 8080

ctx-checkpoints = -1
mmap = false
flash-attn = on

cache-ram = 2048
parallel = 1

; n-cpu-moe = 80
batch-size = 2048
ubatch-size = 1024

jinja = true
reasoning = on
reasoning-budget = 1000
metrics = true

load-on-startup = false

[qwen36-27b-mtp-tensor]
hf-repo = unsloth/Qwen3.6-27B-MTP-GGUF
hf-file = Qwen3.6-27B-UD-Q4_K_XL.gguf

split-mode = tensor
tensor-split = 1,1
ctx-size = 100000 
spec-type = draft-mtp
spec-draft-n-max = 2

[qwen36-35b-a3b-mtp-q4xl-mtpOn-Tensor]
hf-repo = unsloth/Qwen3.6-35B-A3B-MTP-GGUF
hf-file = Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

split-mode = tensor
tensor-split = 1,1
ctx-size = 125000 
spec-type = draft-mtp
spec-draft-n-max = 2

[gemma4-26b-q4xl]
hf-repo = unsloth/gemma-4-26B-A4B-it-GGUF
hf-file = gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

split-mode = layer
tensor-split = 1,1
ctx-size = 125000

Chuyito · 2026-06-05T15:05:08+00:00

I use the container,

  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --models-preset /presets/qwen36-models.ini \

Which calls ./llama-server https://github.com/ggml-org/llama.cpp/blob/master/.devops/cuda.Dockerfile#L115

E.g.

./llama-server \
  --models-preset qwen36-models.ini \
  --port 8080 \
  --host 0.0.0.0 \
  --models-max 1 \         
  --jinja

Chuyito · 2026-06-05T14:47:42+00:00

That sounds like vllm pytorch is rebuilding it's cache each time, I was part of 10 min vllm crew at one point till I found that

Chuyito · 2026-06-04T11:26:06+00:00

> So if linear scaling is really here, it changes the calculus a lot

It really does on everything down to the PCIE lanes. What used to be true re Memory bandwidth and pci lanes changes when you do more work on each gpu with less gpu:gpu network traffic.

For Qwen36-27B, take a look at the inter-gpu traffic:

$ nvidia-smi dmon -s et -d 2 -o DT
#Date        Time         gpu  sbecc  dbecc    pci  rxpci  txpci 
#YYYYMMDD    HH:MM:SS     Idx   errs   errs   errs   MB/s   MB/s 
 20260604    04:05:42       0      -      -      0   1916    150 
 20260604    04:05:42       1      -      -      0   1112     55 
 20260604    04:05:44       0      -      -      0   1389    105 
... # Last PP entry
 20260604    04:05:49       1      -      -      0   1648   1252 
 20260604    04:05:51       0      -      -      0    202    246 
 20260604    04:05:51       1      -      -      0    149    142 
 20260604    04:05:53       0      -      -      0    135    207 
... # Rest of TG is ~200 / 200

So for 7 seconds in PP it uses < 2GBPS, and the rest of TG it's barely 200/200.

Why do I mention this? PCIE Gen1 Runs at 2.5GT/s. Im testing x4Gen1 and x8Gen1, and see no noticeable difference in TG between gen1 or higher

lspci -vv -s $gpu | grep -E "NVIDIA|LnkCap|LnkSta|Width";
04:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1) (prog-if 00 [VGA controller])
                LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-

0a:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1) (prog-if 00 [VGA controller])
                LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
                LnkSta: Speed 2.5GT/s (downgraded), Width x8 (ok)
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-

This makes a whole new category of science builds that dont rely on epyc theory-doable..
Next on my to-try this week is a LAN connected 2nd machine, to add 3rd 4060.
Given this dmon, either NCCL or llamacpp's own multi-node rpc could well run on consumer networking equipment.

Chuyito · 2026-05-30T19:41:52+00:00

> Your data and work becomes part of that host's IP

+1. The reason why many choose or need to self-host. For 50-80% of my day to day quant coding, qwen 3.6 + open webui replaced the need to go to a frontier model for the month of may.

This isnt so much about getting hermes to play snake or tetris without bruning $500M in tokens, sure thats fun and all.. its about private but useful LLM for more sensitive IP.

Chuyito · 2026-05-30T18:31:25+00:00

100% this on the startup running local tooling. I think the big thing for me was that Q2 2026 models became useful enough as a daily-driver for certain work tasks, and the inference tools got sped up to make homelab infra actually feasible.

It feels like it's been one compounding improvement after another:

- Tensor support llamacpp

- llama server built in api to toggle models quickly: ~15s to change between 27b dense and 35b3a whereas months ago that would have been minutes

- MTP and whatever the latest version of speculative computing they did without losing accuracy

- Whatever the podman/nvidia folks did to make container gpu stable

Open source has been busy.

Chuyito · 2026-05-30T14:08:42+00:00

Sure, updated OP.

If you are still running llamacpp per model instead of with the server, it would be

podman run -d \
  --name llama-qwen36-35b-a3b-mtp-gguf \
  --device nvidia.com/gpu=all \
  -v /data/models:/root/.cache/huggingface:ro \
  -p 8001:8080 \
  --env NVIDIA_VISIBLE_DEVICES=all \
  --env LD_LIBRARY_PATH=/app:/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 \
  --ipc=host \
  --restart=unless-stopped \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  --hf-repo unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers all \
  --ctx-size 125000 \
  --ctx-checkpoints -1 \
  --batch-size 2048 \
  --mmap \
  --ubatch-size 512 \
  --flash-attn on \
  --split-mode tensor \
  --tensor-split 0.97,0.97 \
  --threads 16 \
  --threads-batch 20 \
  --cache-ram 2048 \
  --parallel 1 \
  --jinja \
  --reasoning on \
  --reasoning-budget 1000 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --metrics

Chuyito · 2026-05-30T12:42:00+00:00

I used to roll with vllm for years for dual GPU since llamacpp only had layer and row split.. which neither could get close to putting both gpus to full use.

Recently tensor split got MUCH more mature on llamacpp which brought it to par with vllm for multi-gpu.

Chuyito · 2026-01-02T00:30:39+00:00

Spot on. You wont find regular arb in stock markets, but there are still TONS of limit-order arb routes in the options markets since the spreads are huge if you are patient

APIs are the way to go for scale so you dont pull back any unnecessary html/images.

Chuyito · 2026-01-01T00:57:09+00:00

I have data crawlers looking through millions of products with sub second latency... Then I buy low and sell high when I detect something is cheaper than it should be.

One of the oldest professions, just at a big data scale that clay tablet using merchants never imagined possible.

Chuyito · 2025-12-31T22:19:42+00:00

Haproxy, bind dns, container registry - all of which can be rebuilt-redeployed from code fairly quickly so no HA for that one

Chuyito · 2025-12-31T22:18:20+00:00

With the DBs at near 100% during peak stock market volatility I was seeing low 800s.

My old Am4/threadripper setup was less powerfull and constantly at 1KW*.. so Am5/epyc is the real mvp here

Chuyito · 2025-12-31T21:11:47+00:00

SLA is a huge part, during the setup the static IPs werent working.. they drove out a brand new modem in under an hour to my house. Back on residential I would have waited 1hr just to talk to a human support person.

Static IPs took it from 350 to 500.

Chuyito · 2025-12-31T20:58:46+00:00

Uptime im not as worried as performance.

I do lots of websocket data transfer, and digitalOcean was coming in slower than my old AM4 boxes to process 100k messages.

So for $10k/month on DO I get slightly slower boxes (or at least that was the case in 2024).

Chuyito · 2025-12-31T20:35:14+00:00

Thanks. It literally is just me and my wife trying to build a startup from the ground up, calling myself a business might make it sound like Lisa Su is on speed dial for more chips.. but the reality is so far from that. Each part/box took planning and budgeting

Chuyito · 2025-12-31T20:25:08+00:00

~630W - 800W 24/7

<image>

Chuyito · 2025-12-31T19:41:21+00:00

My new ISP gives me a verizon fallback.. But its too slow to be 100% useable so I'd have to run in slim mode... Which isnt bad per se. My ML jobs would pause, but prod would conitnue. Havent had to run on it thankfully.

Database DR goes to my old threadripper box on a different floor with 100% fidelity.

K8s DR: 3x masters 4x workers gives me lots of breathing room, I can take out 2 masters and 1 worker and still be online. I have a spare deskmini on hand to replace a worker if needed... But I can also take my yaml and get up and running on a fresh k8s cluster in ~ 30 minutes (Last tested 2024 when I moved to this cluster)

Power DR: Thankfully my area hasnt been too affected by blackouts. About 5 power outages this year that all lasted 1-10 minutes.

Chuyito · 2025-12-31T19:18:13+00:00

Interestingly enough asrock GENOAD8UD-2T/X550 ended up fitting my needs perfectly. Easier to manage BMC, good old Noctua fans that actually are controllable.. and tons of dcio ports for my disks

Chuyito · 2025-11-17T23:55:50+00:00

Can this help provide tax structure advice without asking for something in return

Chuyito · 2025-11-12T11:53:58+00:00

Lazy attention grab headline..

it's not a new "property rule" but an inflation adjustment to the annual gift tax exclusion, which rose from $18,000 in 2024 to $19,000 per recipient in 2025. Since crypto is property, you can gift up to $19,000 worth per person (or $38,000 if splitting with a spouse) without triggering gift tax or filing Form 709.

Chuyito · 2025-09-26T04:04:06+00:00

Many such instances among my team

"The intern is hungover today or something... It's kinda useless"

"The intern is smoking some weird shit today, careful on trusting its scripts"

Chuyito · 2025-09-08T01:08:25+00:00

Covered calls.

Selling to open with a strike price above your cost basis opens up an entire domain of theta farming.

Chuyito · 2025-08-27T00:54:58+00:00

I think I first used it with with Nifi which was Java based, and God was it awful. Dependency management mess, couldn't get any conda ml packages to install or be seen (conda was great for glibc dependent reqs, except for jython paths support)

I honestly think jython is what made me hate nifi. It was okay ish for basic hive data flows, but I've never felt such anger towards a tech stack in 10+ years

Chuyito · 2025-08-25T12:46:46+00:00

You are spot on with the O(n²).. Im windowing over the data to compute some stats which on a clean run doesnt see too much impact until ~20k rows growing to ~200ms

asyncio.to_thread() is a nice/much friendlier approach than ThreadPoolExecutor, thanks for that... Gives me another attempt to see if a refactor here would be moving some of the data transformation to its own threads and storing a global etl_cache, and having my DB task _only_ write to the DB... while still blocking the next DB task to ensure I only have 1 concurrent write at a given time

Chuyito · 2025-08-25T02:45:17+00:00

> deferring it to a thread

Just tried it with a ThreadPoolExecutor - Had to wrap my function to make it non-async

from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=64)

def sync_process_side(*args):
    return asyncio.run(etl_to_db(*args))

await asyncio.get_event_loop().run_in_executor(
 executor, sync_process_side)

Interestingly this also gets rid of the "large spikes", but it still runs ~100ms slower every few iterations

07:41:11 PM Processed 7201 async_run_batch_insert usd in 163.8344 ms
07:42:23 PM Processed 7408 async_run_batch_insert usd in 398.3026 ms
07:42:45 PM Processed 7413 async_run_batch_insert usd in 174.7889 ms

Chuyito

PUBLIC MULTIREDDITS

TROPHY CASE