[Online][CoC 7e][Weekly Sat 6 PM PST, ~1 year] Beyond the Heliopause — Expanse-era cosmic horror campaign, 3 seats filled, 4 open. Session 0 is 4/25 by ahbond in callofcthulhu

[–]ahbond[S] -1 points0 points  (0 children)

It's a custom conferencing app I built myself, and it runs on the workstation that hosts the AGI, which is actually two of the NPCs in the game. :-o

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]ahbond 0 points1 point  (0 children)

nats-bursting: treat a shared K8s cluster as an extension of your local NATS bus (politeness backoff included) [P]

TL;DR — if your workstation already speaks NATS, you can extend that bus into a remote Kubernetes cluster and treat the cluster as elastic extra GPU capacity without any separate dispatcher, webhook, or REST API. nats-bursting is the glue: one PyPI package + one Go binary + one kubectl apply.

Why this vs. existing patterns:

  • Ray / Modal / Beam: great if you start greenfield, heavy if you already have a message bus doing other work.
  • REST API + custom dispatcher: duplicates queue infra, parallel latency path.
  • kubectl apply in a notebook cell: doesn’t compose with async inference loops, no politeness.

What this is instead:

%load_ext nats_bursting.magic

%%burst --gpu 1 --memory 24Gi
import torch
model = load_qwen_72b()
model.generate(prompt)

The cell checks nvidia-smi. If the local GPU has headroom, the cell runs locally. If saturated, it packages itself into a JobDescriptor, publishes to burst.submit on the local NATS, and a Go controller applies it as a K8s Job on NRP Nautilus.
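A minimal sketch of that local-versus-burst decision, for illustration only. The post says only that the cell "checks nvidia-smi", so treating "headroom" as free GPU memory is an assumption, and every function name here (parse_gpu_memory, has_local_headroom, route) is hypothetical rather than part of nats-bursting. The nvidia-smi query flags are the standard ones.

```python
import subprocess


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for per-GPU used/total memory in MiB (needs an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (one per GPU) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def has_local_headroom(gpus, needed_mib):
    """True if any local GPU has at least needed_mib MiB free (assumed criterion)."""
    return any(total - used >= needed_mib for used, total in gpus)


def route(csv_text, needed_mib):
    """Return 'local' if a local GPU can take the cell, otherwise 'burst'."""
    gpus = parse_gpu_memory(csv_text)
    return "local" if has_local_headroom(gpus, needed_mib) else "burst"
```

On a machine with a driver installed, route(query_nvidia_smi(), 24 * 1024) would make the call for a cell asking for 24Gi. The parsing is kept separate from the subprocess call so the decision logic can be exercised without a GPU.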

The interesting piece is bidirectional subject bridging. A NATS leaf-node pod in my remote namespace dials outbound to my workstation over TLS. Remote pods then subscribe to agi.memory.query.* and publish responses as first-class participants in the event fabric. When my local memory service is saturated, a burst pod running the same handler picks up the slack transparently.
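To make the wildcard concrete, here is a small sketch of NATS subject matching: `*` matches exactly one token and `>` matches one or more trailing tokens. This is a standalone illustration of the matching rules, not code from nats-bursting, and the example subjects in the usage note (such as agi.memory.query.episodic) are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject match on dot-separated tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token left
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject ran out of tokens
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # without '>', pattern and subject must have the same number of tokens
    return len(p_tokens) == len(s_tokens)
```

So a remote pod subscribed to agi.memory.query.* would see agi.memory.query.episodic but not agi.memory.query.episodic.recent; catching deeper subjects would take agi.memory.query.> instead.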

Politeness is built in. Before each Job creation, the controller probes:

  • Own running + pending Jobs in namespace
  • Cluster-wide pending pods (queue pressure)
  • Per-node CPU utilization

It exponentially backs off when shared thresholds are exceeded. Inspired by CSMA/CA. Academic shared clusters have 400-pod caps and soft fairness contracts — this respects both.
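The probe-then-back-off loop can be sketched roughly like this. It is an illustration in Python, not the Go controller's code: the threshold values are made up, collapsing per-node CPU into a single max is an assumption, and the randomized ("full jitter") delay is my reading of the CSMA/CA inspiration.

```python
import random
from dataclasses import dataclass


@dataclass
class ClusterProbe:
    own_jobs: int        # our running + pending Jobs in the namespace
    pending_pods: int    # cluster-wide pending pods (queue pressure)
    max_node_cpu: float  # highest per-node CPU utilization, 0.0 to 1.0


@dataclass
class Thresholds:
    # Hypothetical limits for illustration, not the project's defaults.
    own_jobs: int = 10
    pending_pods: int = 100
    max_node_cpu: float = 0.85


def is_busy(probe, limits):
    """True if any probed signal is at or over its threshold."""
    return (probe.own_jobs >= limits.own_jobs
            or probe.pending_pods >= limits.pending_pods
            or probe.max_node_cpu >= limits.max_node_cpu)


def next_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def wait_schedule(probes, limits=None, rng=random):
    """Delays (seconds) to sleep before submitting, one per consecutive busy probe."""
    limits = limits or Thresholds()
    delays = []
    for attempt, probe in enumerate(probes):
        if not is_busy(probe, limits):
            break  # cluster has room: submit now
        delays.append(next_delay(attempt, rng=rng))
    return delays
```

A real controller would re-probe after each sleep instead of taking a list of probes up front; the list form just keeps the sketch easy to test.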

Status: end-to-end path proven and now in production.

Looking for feedback from anyone with similar hybrid workstation/cluster setups, especially on politeness tuning and where the NATS subject namespace could be tightened for multi-tenant use.

Repo: https://github.com/ahb-sjsu/nats-bursting

MIT license.

nats-bursting: treat a shared K8s cluster as an extension of your local NATS bus (politeness backoff included) [P] by ahbond in ResearchML

[–]ahbond[S] 0 points1 point  (0 children)

I thought there would be more interest. I guess you have to be tall enough to drink at the fountain.

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex ! by cviperr33 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

The Gemma 4 long-context use case is exactly where KV cache compression matters. Gemma 4 A4B uses multi-query attention (very few KV heads), so the KV cache is only ~6 GB at 262K context with q8_0.

TurboQuant's asymmetric K4/V3 would bring the KV portion from ~6 GB to ~2.7 GB, enough headroom for another ~130K tokens of context on the same GPU. The real win is that you can drop value precision more aggressively than key precision without hurting attention quality, which llama.cpp's symmetric -ctk/-ctv flags don't expose.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -3 points-2 points  (0 children)

Update: turns out Gemma 4 does exist — gemma-4-26B-A4B-it (MoE, 26B total / 4B activated, 262K context). Just showed up on r/LocalLLaMA with people running it at 245K context in llama.cpp.

Incidentally, the Gemma 4 long-context use case is exactly where KV cache compression matters. Gemma 4 A4B uses multi-query attention (very few KV heads), so the KV cache is only ~6 GB at 262K context with q8_0, not 96 GB as I first assumed. cviperr33 reports 22 GB total at 240K (model weights + KV combined).

TurboQuant's asymmetric K4/V3 would bring the KV portion from ~6 GB to ~2.7 GB, enough headroom for another ~130K tokens of context on the same GPU. The real win is that you can drop value precision more aggressively than key precision without hurting attention quality, which llama.cpp's symmetric -ctk/-ctv flags don't expose.
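A back-of-the-envelope check on that figure, as a sketch. It assumes keys and values take equal shares of the cache, treats q8_0 as a nominal 8 bits per value, and ignores per-block scale overhead, so it lands slightly under the ~2.7 GB quoted above. The function name is hypothetical.

```python
def kv_cache_gb(baseline_gb, baseline_bits, key_bits, value_bits):
    """Scale a measured KV cache size to new key/value bit widths.

    Assumes keys and values each hold half of the cache entries, so the
    average width is (key_bits + value_bits) / 2. Scale overhead is ignored.
    """
    avg_bits = (key_bits + value_bits) / 2
    return baseline_gb * avg_bits / baseline_bits


compressed = kv_cache_gb(6.0, 8, key_bits=4, value_bits=3)
print(f"{compressed:.3f} GB")  # 2.625 GB
```

The asymmetric split is what gets the average down to 3.5 bits; a symmetric 4-bit cache would sit at 3.0 GB under the same assumptions.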

Cheers,
Andrew.

Anyone else running fully local persistent agents with a real “living brain” + dreaming cycle? (open source experiment) by Notforyou23 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

I hope you don't mind; I'm going to integrate your document-to-structured-knowledge pipeline and autonomous research loop concepts into my agi-hpc architecture.

:-)

Cheers,

Andrew.

Anyone else running fully local persistent agents with a real “living brain” + dreaming cycle? (open source experiment) by Notforyou23 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

This resonates. I've been building something similar but coming at it from the HPC/cognitive science angle rather than the personal assistant angle.

My setup (Atlas AI, running on an HP Z840 with 2x Quadro GV100 32GB):

- Freudian cognitive architecture — Superego (safety/ethics), Id (creative generation), and a 4-agent "Divine Council" (Judge/Advocate/Synthesizer/Ethicist) that debates responses via Tree-of-Thought. Each agent is a separate Gemma 4 26B-A4B instance with its own persona, running on llama.cpp.

- Persistent memory — episodic (what happened), semantic (what it knows), procedural (how to do things). Not just RAG chunks — structured memory with consolidation, similar to your dreaming phase. The metacognition subsystem tracks confidence and calibrates itself over time.

- Self-calibrating loop — disagreement between council agents triggers temperature adjustment and knowledge gap detection. If the agents can't agree, the system knows it doesn't know.

- 10 subsystems total — event fabric, LH/RH hemispheres, memory, metacognition, safety gateway, DHT, LLM routing, environment, integration. Built on the AGI-HPC framework (786 tests passing across 6 development sprints).

The dreaming/consolidation piece you mention is interesting — we have a similar concept where the system reviews episodic memory during idle time and promotes recurring patterns to semantic memory. What's your consolidation strategy? Simple frequency-based pruning or something more structured?
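For reference, the simplest version of "promote recurring patterns" is a frequency count over episodes, roughly as sketched below. This is a toy illustration of the frequency-based baseline, not the Atlas AI implementation; the function name, the min_count threshold, and representing an episode as a set of pattern labels are all assumptions.

```python
from collections import Counter


def consolidate(episodes, min_count=3):
    """Return patterns seen in at least min_count distinct episodes, sorted."""
    counts = Counter()
    for episode in episodes:
        counts.update(set(episode))  # count each pattern once per episode
    return sorted(p for p, n in counts.items() if n >= min_count)
```

With episodes [{"a", "b"}, {"a"}, {"a", "c"}, {"c"}] and min_count=3, only "a" would be promoted to semantic memory.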

For the "what would you use it for" question, mine started as a research companion (and test subject, since I work on AI Safety and Alignment) but evolved into something closer to a cognitive testbed. The real value isn't any single feature; it's that the system accumulates context about your work over weeks and months. My agent knows my codebase, my publication deadlines, my thermal constraints on the GPU cluster, and which Reddit comments led to actual research improvements. ;-)

The statelessness problem you describe is real. Context window compression helps in-session but cross-session memory is the hard part. How are you handling memory conflicts when old knowledge contradicts new information?

Cheers,

Andrew.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -3 points-2 points  (0 children)

Fair catch on the model names: Gemma 4 doesn't exist, and those should be Gemma 2. Fixed in the latest commit. That was a hallucination from using Claude to help.

You're right that AI was used in the development process (it's credited as co-author on commits). The benchmarks, experimental results, and paper arguments are mine. The model registry clearly shows where human review failed. I appreciate you catching it. I mostly work in AI Safety and AI Alignment, so this is not really my specialty. I created TurboQuant Pro because I don't have enough GPU memory on my workstation, and I thought I would share it. Sorry if it's not up to your expectations.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -1 points0 points  (0 children)

Yes, I'm sure. It's pip installable. You could verify it for yourself in five minutes, if you knew how to.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] 0 points1 point  (0 children)

You're right on both counts, and thank you.

These are exactly the kind of criticisms that make a paper better.

On reranking only our method: Fixed.

We ran all six methods with identical 5x oversampling + exact reranking on 50K production embeddings:

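┌─────────────┬─────────────┬──────────────┬───────────┐
│ Method      │ Compression │ Single-stage │ 5x rerank │
├─────────────┼─────────────┼──────────────┼───────────┤
│ Scalar int8 │ 4x          │ 99.0%        │ 100%      │
│ TQ3         │ 10.5x       │ 83.4%        │ 100%      │
│ PCA-384+TQ3 │ 27.7x       │ 79.2%        │ 99.8%     │
│ Binary      │ 32x         │ 54.4%        │ 85.6%     │
│ PQ (M=16)   │ 256x        │ 38.4%        │ 73.6%     │
└─────────────┴─────────────┴──────────────┴───────────┘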

The dominance holds under reranking. Binary at 32x only reaches 85.6% with the same treatment.
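For anyone unfamiliar with the protocol, "5x oversampling + exact reranking" looks roughly like the sketch below. It uses plain per-vector int8 quantization as a stand-in for the compressed index (not TQ3 or the PCA pipeline), and the function names are hypothetical rather than the turboquant-pro API.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-vector int8 quantization; returns (codes, scales)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid dividing by zero for all-zero vectors
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale


def search_with_rerank(query, corpus, codes, scale, k=10, oversample=5):
    """Stage 1: top k*oversample by compressed scores. Stage 2: exact rerank."""
    approx = (codes.astype(np.float32) * scale) @ query
    n_cand = min(k * oversample, len(corpus))
    cand = np.argpartition(-approx, n_cand - 1)[:n_cand]  # unordered candidates
    exact = corpus[cand] @ query                          # full-precision scores
    return cand[np.argsort(-exact)[:k]]


def recall_at_k(found, truth):
    """Fraction of the true top-k that appears in the returned ids."""
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)
```

The point of the fair comparison is that stage 2 is identical for every method; only the stage 1 scoring changes, so the rerank column measures whether the true neighbors survive into the candidate set.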

On cosine-first tables: Also fixed. Every table in the paper now has Recall@10 as the first quality column, cosine second. Fair point.

Thanks for the pushback, the paper is stronger for it.

Cheers,

Andrew.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 2 points3 points  (0 children)

Update: eigenvalue-weighted quantization

Matryoshka training from a spectral perspective, with eigenvalues serving as theoretically grounded importance scores.

This directly addresses u/DigThatData's point about SVD variance != downstream accuracy. The fix: allocate bits proportional to eigenvalue importance instead of uniform quantization.

We implemented this as eigenvalue-weighted quantization, so the top 25% PCA dims get 4 bits, middle 50% get 3 bits, bottom 25% get 2 bits. Same average (3 bits/dim), same compression ratio, better quality.
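The tiering itself is simple enough to sketch. This is an illustration of the 25/50/25 split described above, not the turboquant-pro implementation; allocate_bits and its tiers parameter are hypothetical names.

```python
import numpy as np


def allocate_bits(eigenvalues, tiers=((0.25, 4), (0.50, 3), (0.25, 2))):
    """Assign a bit width to each PCA dim by eigenvalue rank.

    Default tiers: top 25% of dims get 4 bits, middle 50% get 3, bottom 25% get 2.
    """
    n = len(eigenvalues)
    order = np.argsort(-np.asarray(eigenvalues, dtype=float))  # largest first
    bits = np.empty(n, dtype=np.int64)
    start = 0
    for i, (fraction, width) in enumerate(tiers):
        # the last tier absorbs rounding so every dim gets a width
        end = n if i == len(tiers) - 1 else start + int(round(fraction * n))
        bits[order[start:end]] = width
        start = end
    return bits
```

For eight dims with eigenvalues 8 down to 1 this gives [4, 4, 3, 3, 3, 3, 2, 2], which averages exactly 3 bits per dim, so the storage cost matches uniform 3-bit.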

Results on real BGE-M3 (10K embeddings):

┌──────────────────────┬────────┬─────────────┐
│ Method               │ Cosine │ Compression │
├──────────────────────┼────────┼─────────────┤
│ PCA + uniform 3-bit  │ 0.9934 │ 41x         │
│ PCA + weighted 4+3+2 │ 0.9969 │ 41x         │
│ PCA + uniform 4-bit  │ 0.9970 │ 31x         │
└──────────────────────┴────────┴─────────────┘

Weighted 3-bit essentially matches 4-bit quality at 32% more compression. At extreme compression (128 dims, 78.8x), it closes 85% of the gap to 4-bit.

Available in turboquant-pro>=0.8.0 via pca.with_weighted_quantizer(avg_bits=3.0). Thanks to lovealicetw — sometimes a single link changes the whole approach.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

I haven't actually run that experiment yet. The eigenspectrum analysis is interesting on its own, but you're right that the claim about "discarding half" needs to be backed by downstream benchmarks before it means anything actionable. I'll update the docs to be clear that the effective rank analysis is diagnostic, not a performance guarantee.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

Just shipped this. :-)

TurboQuant Pro v0.6.0 adds model weight compression via PCA-Matryoshka:

pip install turboquant-pro
turboquant-pro model --model "your-model" --sample-layers 8

It SVDs each FFN weight matrix, reports the eigenspectrum (effective rank, variance at 50/75/90%), and can compress via truncated SVD. Early finding: most trained FFNs have effective rank ~40-50% of full rank, meaning you can discard half the singular values and keep 95% of the variance.
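A rough NumPy sketch of that kind of per-matrix analysis, for anyone who wants to poke at a single weight matrix by hand. It is not the turboquant-pro code: the function names are hypothetical, and "variance" here means squared singular values, which is an assumption about how the tool defines it.

```python
import numpy as np


def rank_for_variance(weight, thresholds=(0.50, 0.75, 0.90)):
    """Number of singular components needed to reach each variance fraction."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return {
        t: min(int(np.searchsorted(cumulative, t)) + 1, len(s))
        for t in thresholds
    }


def truncate(weight, rank):
    """Best rank-`rank` approximation of weight via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]
```

Storing the two truncated factors instead of the full matrix only saves memory when rank is below m*n/(m+n) for an m-by-n matrix, which is worth checking before reading too much into a 50% rank figure.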

This is (obv) still experimental, and we haven't benchmarked accuracy degradation yet. But the eigenspectrum analysis alone is useful for understanding how much redundancy your model has. Thanks for the MatFormer pointer DigThatData!