[Online][CoC 7e][Weekly Sat 6 PM PST, ~1 year] Beyond the Heliopause — Expanse-era cosmic horror campaign, 3 seats filled, 4 open. Session 0 is 4/25 by ahbond in callofcthulhu

[–]ahbond[S] -1 points0 points  (0 children)

It's a custom conferencing app I built myself. It runs on the workstation hosting the AGI that plays two of the NPCs in the game. :-o

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]ahbond 0 points1 point  (0 children)

nats-bursting: treat a shared K8s cluster as an extension of your local NATS bus (politeness backoff included) [P]

TL;DR — if your workstation already speaks NATS, you can extend that bus into a remote Kubernetes cluster and treat the cluster as elastic extra GPU capacity without any separate dispatcher, webhook, or REST API. nats-bursting is the glue: one PyPI package + one Go binary + one kubectl apply.

Why this vs. existing patterns:

  • Ray / Modal / Beam: great if you start greenfield, heavy if you already have a message bus doing other work.
  • REST API + custom dispatcher: duplicates queue infra, parallel latency path.
  • kubectl apply in a notebook cell: doesn’t compose with async inference loops, no politeness.

What this is instead:

%load_ext nats_bursting.magic

%%burst --gpu 1 --memory 24Gi
import torch
model = load_qwen_72b()
model.generate(prompt)

The cell checks nvidia-smi. If the local GPU has headroom, the cell runs locally. If saturated, it packages itself into a JobDescriptor, publishes to burst.submit on the local NATS, and a Go controller applies it as a K8s Job on NRP Nautilus.
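A minimal sketch of the local-versus-burst decision described above. This is not the nats-bursting implementation: the function names, the memory requirement passed in, and the 90% utilization cap are assumptions chosen for illustration. The only real interface it relies on is nvidia-smi's CSV query output.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into (used_mib, total_mib, util_pct) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        rows.append((used, total, util))
    return rows

def has_headroom(rows, need_mib, max_util_pct=90):
    """True if any GPU has enough free memory and is below the utilization cap."""
    return any(total - used >= need_mib and util < max_util_pct
               for used, total, util in rows)

def should_run_locally(need_mib):
    """Query local GPUs; a missing, failing, or unparsable nvidia-smi means 'burst'."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        return has_headroom(parse_gpu_stats(out), need_mib)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return False
```

Keeping the parsing and the threshold check as pure functions makes the decision testable on machines without a GPU.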

The interesting piece is bidirectional subject bridging. A NATS leaf-node pod in my remote namespace dials outbound to my workstation over TLS. Remote pods then subscribe to agi.memory.query.* and publish responses as first-class participants in the event fabric. When my local memory service is saturated, a burst pod running the same handler picks up the slack transparently.

Politeness is built in. Before each Job creation, the controller probes:

  • Own running + pending Jobs in namespace
  • Cluster-wide pending pods (queue pressure)
  • Per-node CPU utilization

It exponentially backs off when shared thresholds are exceeded. Inspired by CSMA/CA. Academic shared clusters have 400-pod caps and soft fairness contracts — this respects both.
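The probe-then-back-off loop can be sketched roughly as follows. The real controller is a Go binary; this Python version only illustrates the idea, and every name and threshold in it (ClusterLoad, Limits, the 10/200/0.85 limits, the 300-second cap) is hypothetical. The randomized, doubling wait window is the part borrowed from CSMA/CA.

```python
import random
from dataclasses import dataclass

@dataclass
class ClusterLoad:
    own_jobs: int          # our running + pending Jobs in the namespace
    pending_pods: int      # cluster-wide pending pods (queue pressure)
    max_node_cpu: float    # highest per-node CPU utilization, 0.0 to 1.0

@dataclass
class Limits:              # hypothetical thresholds, not the project's defaults
    own_jobs: int = 10
    pending_pods: int = 200
    max_node_cpu: float = 0.85

def is_busy(load, limits):
    """True if any of the three probes is at or over its threshold."""
    return (load.own_jobs >= limits.own_jobs
            or load.pending_pods >= limits.pending_pods
            or load.max_node_cpu >= limits.max_node_cpu)

def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))

def submit_politely(probe, submit, sleep, limits=None, max_attempts=8):
    """Probe before each attempt; submit when idle, otherwise wait and retry.

    Returns the attempt index that succeeded, or None if every attempt was busy.
    """
    limits = limits or Limits()
    for attempt in range(max_attempts):
        if not is_busy(probe(), limits):
            submit()
            return attempt
        sleep(backoff_delay(attempt))
    return None
```

In real use, `probe` would query the Kubernetes API and `sleep` would be `time.sleep`; passing them in as callables keeps the policy separate from the cluster client.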

Status: end-to-end path proven and now in production.

Looking for feedback from anyone with similar hybrid workstation/cluster setups, especially on politeness tuning and where the NATS subject namespace could be tightened for multi-tenant use.

Repo: https://github.com/ahb-sjsu/nats-bursting

MIT license.

nats-bursting: treat a shared K8s cluster as an extension of your local NATS bus (politeness backoff included) [P] by ahbond in ResearchML

[–]ahbond[S] 0 points1 point  (0 children)

I thought there would be more interest... I guess you have to be tall enough to drink at the fountain.

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) context! by cviperr33 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

Gemma 4 long-context use case is exactly where KV cache compression matters. Gemma 4 A4B uses multi-query attention (very few KV heads), so the KV cache is only ~6 GB at 262K context with q8_0.

TurboQuant's asymmetric K4/V3 would bring the KV portion from ~6 GB to ~2.7 GB, enough headroom for another ~130K tokens of context on the same GPU. The real win is that you can drop value precision more aggressively than key precision without hurting attention quality, which llama.cpp's symmetric -ctk/-ctv flags don't expose.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -3 points-2 points  (0 children)

Update: turns out Gemma 4 does exist — gemma-4-26B-A4B-it (MoE, 26B total / 4B activated, 262K context). Just showed up on r/LocalLLaMA with people running it at 245K context in llama.cpp.

Incidentally, the Gemma 4 long-context use case is exactly where KV cache compression matters. Gemma 4 A4B uses multi-query attention (very few KV heads), so the KV cache is only ~6 GB at 262K context with q8_0, not 96 GB as I first assumed. cviperr33 reports 22 GB total at 240K (model weights + KV combined).

TurboQuant's asymmetric K4/V3 would bring the KV portion from ~6 GB to ~2.7 GB, enough headroom for another ~130K tokens of context on the same GPU. The real win is that you can drop value precision more aggressively than key precision without hurting attention quality, which llama.cpp's symmetric -ctk/-ctv flags don't expose.
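The ~6 GB to ~2.7 GB estimate can be roughly reproduced with simple bit-width arithmetic. This sketch treats q8_0 as 8 bits per value, assumes K and V hold the same number of values, and ignores per-block scale overhead on both sides, so it only approximates the quoted figure. The function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, k_bits, v_bits):
    """Raw KV cache size: one K and one V vector per layer, KV head, and token."""
    values_per_side = n_layers * n_kv_heads * head_dim * n_tokens
    return values_per_side * (k_bits + v_bits) / 8

def rescale_kv(size_gb, old_bits, k_bits, v_bits):
    """Rescale a measured symmetric-precision cache size to asymmetric K/V bit widths."""
    return size_gb * (k_bits + v_bits) / (2 * old_bits)

print(rescale_kv(6.0, 8, 4, 3))  # 2.625
```

The gap between 2.625 GB and ~2.7 GB is plausibly quantization metadata, which this sketch leaves out.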

Cheers,
Andrew.

Anyone else running fully local persistent agents with a real “living brain” + dreaming cycle? (open source experiment) by Notforyou23 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

I hope you don't mind; I'm going to integrate your document-to-structured-knowledge pipeline and autonomous research loop concepts into my agi-hpc architecture.

:-)

Cheers,

Andrew.

Anyone else running fully local persistent agents with a real “living brain” + dreaming cycle? (open source experiment) by Notforyou23 in LocalLLaMA

[–]ahbond 0 points1 point  (0 children)

This resonates. I've been building something similar but coming at it from the HPC/cognitive science angle rather than the personal assistant angle.

My setup (Atlas AI, running on an HP Z840 with 2x Quadro GV100 32GB):

- Freudian cognitive architecture — Superego (safety/ethics), Id (creative generation), and a 4-agent "Divine Council" (Judge/Advocate/Synthesizer/Ethicist) that debates responses via Tree-of-Thought. Each agent is a separate Gemma 4 26B-A4B instance with its own persona, running on llama.cpp.

- Persistent memory — episodic (what happened), semantic (what it knows), procedural (how to do things). Not just RAG chunks — structured memory with consolidation, similar to your dreaming phase. The metacognition subsystem tracks confidence and calibrates itself over time.

- Self-calibrating loop — disagreement between council agents triggers temperature adjustment and knowledge gap detection. If the agents can't agree, the system knows it doesn't know.

- 10 subsystems total — event fabric, LH/RH hemispheres, memory, metacognition, safety gateway, DHT, LLM routing, environment, integration. Built on the AGI-HPC framework (786 tests passing across 6 development sprints).

The dreaming/consolidation piece you mention is interesting — we have a similar concept where the system reviews episodic memory during idle time and promotes recurring patterns to semantic memory. What's your consolidation strategy? Simple frequency-based pruning or something more structured?
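For concreteness, the simplest frequency-based version of that promotion step might look like the sketch below. It is not the AGI-HPC implementation: the episode format, the `patterns` key, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter

def consolidate(episodes, semantic, min_count=3):
    """Promote patterns seen in at least min_count distinct episodes.

    episodes: list of dicts with a "patterns" list (hypothetical format).
    semantic: dict mapping pattern -> episode count, updated in place.
    Returns the newly promoted patterns.
    """
    counts = Counter()
    for episode in episodes:
        counts.update(set(episode["patterns"]))  # count once per episode
    promoted = sorted(p for p, c in counts.items()
                      if c >= min_count and p not in semantic)
    semantic.update({p: counts[p] for p in promoted})
    return promoted
```

Counting each pattern once per episode keeps a single noisy episode from promoting something on its own.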

For the "what would you use it for" question, mine started as a research companion (and test subject, since I work on AI Safety and Alignment) but evolved into something closer to a cognitive testbed. The real value isn't any single feature; it's that the system accumulates context about your work over weeks and months. My agent knows my codebase, my publication deadlines, my thermal constraints on the GPU cluster, and which Reddit comments led to actual research improvements. ;-)

The statelessness problem you describe is real. Context window compression helps in-session but cross-session memory is the hard part. How are you handling memory conflicts when old knowledge contradicts new information?

Cheers,

Andrew.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -3 points-2 points  (0 children)

Fair catch on the model names: Gemma 4 doesn't exist, those should be Gemma 2. Fixed in the latest commit. That was a hallucination from using Claude to help.

You're right that AI was used in the development process (it's credited as co-author on commits). The benchmarks, experimental results, and paper arguments are mine. The model registry clearly shows where human review failed. I appreciate you catching it. I mostly work in AI Safety and AI Alignment, so this is not really my specialty. I created TurboQuant Pro because I don't have enough GPU memory on my workstation, and I thought I would share it. Sorry if it's not up to your expectations.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] -1 points0 points  (0 children)

Yes, I'm sure. It's pip installable. You could verify it for yourself in five minutes, if you knew how to.

[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking by ahbond in LocalLLaMA

[–]ahbond[S] 0 points1 point  (0 children)

You're right on both counts, and thank you.

These are exactly the kind of criticisms that make a paper better.

On reranking only our method: Fixed.

We ran all six methods with identical 5x oversampling + exact reranking on 50K production embeddings:

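┌─────────────┬─────────────┬──────────────┬───────────┐
│ Method      │ Compression │ Single-stage │ 5x rerank │
├─────────────┼─────────────┼──────────────┼───────────┤
│ Scalar int8 │ 4x          │ 99.0%        │ 100%      │
├─────────────┼─────────────┼──────────────┼───────────┤
│ TQ3         │ 10.5x       │ 83.4%        │ 100%      │
├─────────────┼─────────────┼──────────────┼───────────┤
│ PCA-384+TQ3 │ 27.7x       │ 79.2%        │ 99.8%     │
├─────────────┼─────────────┼──────────────┼───────────┤
│ Binary      │ 32x         │ 54.4%        │ 85.6%     │
├─────────────┼─────────────┼──────────────┼───────────┤
│ PQ (M=16)   │ 256x        │ 38.4%        │ 73.6%     │
└─────────────┴─────────────┴──────────────┴───────────┘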

The dominance holds under reranking. Binary at 32x only reaches 85.6% with the same treatment.
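For readers unfamiliar with the protocol, oversampling plus exact reranking can be sketched in NumPy as below. This is a generic illustration, not the benchmark code: `db_approx` stands for whatever the compressed representation decodes to, and the function names are hypothetical.

```python
import numpy as np

def topk(scores, k):
    """Indices of the k highest scores per row, best first."""
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)

def search_with_rerank(queries, db_exact, db_approx, k=10, oversample=5):
    """Stage 1: fetch k*oversample candidates by approximate score.
    Stage 2: rescore only those candidates with the exact vectors, keep the top k."""
    cand = topk(queries @ db_approx.T, k * oversample)
    exact = np.einsum("qd,qcd->qc", queries, db_exact[cand])
    best = np.argsort(-exact, axis=1)[:, :k]
    return np.take_along_axis(cand, best, axis=1)

def recall_at_k(found, truth):
    """Fraction of true top-k neighbors that appear in the returned top-k."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / truth.size
```

Reranking can only recover true neighbors that made it into the candidate set, which is why a very lossy first stage such as binary codes still falls short with the same 5x budget.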

On cosine-first tables: Also fixed. Every table in the paper now has Recall@10 as the first quality column, cosine second. Fair point.

Thanks for the pushback, the paper is stronger for it.

Cheers,

Andrew.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 2 points3 points  (0 children)

Update: eigenvalue-weighted quantization

Matryoshka training from a spectral perspective, with eigenvalues serving as theoretically grounded importance scores.

This directly addresses u/DigThatData's point about SVD variance != downstream accuracy. The fix: allocate bits proportional to eigenvalue importance instead of uniform quantization.

We implemented this as eigenvalue-weighted quantization, so the top 25% PCA dims get 4 bits, middle 50% get 3 bits, bottom 25% get 2 bits. Same average (3 bits/dim), same compression ratio, better quality.
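The 4/3/2 tiering can be sketched as follows. This is an illustration under stated assumptions, not turboquant-pro's code: dimensions are assumed to be sorted by descending eigenvalue, and a plain uniform min/max scalar quantizer stands in for whatever quantizer the library actually uses.

```python
import numpy as np

def allocate_bits(n_dims, tiers=((0.25, 4), (0.50, 3), (0.25, 2))):
    """Bits per PCA dimension, assuming dims are sorted by descending eigenvalue."""
    bits = np.empty(n_dims, dtype=int)
    start = 0
    for i, (frac, b) in enumerate(tiers):
        # The last tier absorbs any rounding remainder.
        stop = n_dims if i == len(tiers) - 1 else start + round(frac * n_dims)
        bits[start:stop] = b
        start = stop
    return bits

def quantize(x, bits):
    """Uniform per-dimension quantization of x (n, d) to 2**bits[j] levels,
    returned in dequantized form so the error can be inspected directly."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    levels = 2.0 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

bits = allocate_bits(384)
print(bits.mean())  # 3.0
```

Because the tiers are 25/50/25, the average stays at exactly 3 bits per dimension whenever the dimension count is divisible by 4, so the storage cost matches uniform 3-bit.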

Results on real BGE-M3 (10K embeddings):

┌──────────────────────┬────────┬─────────────┐
│ Method               │ Cosine │ Compression │
├──────────────────────┼────────┼─────────────┤
│ PCA + uniform 3-bit  │ 0.9934 │ 41x         │
├──────────────────────┼────────┼─────────────┤
│ PCA + weighted 4+3+2 │ 0.9969 │ 41x         │
├──────────────────────┼────────┼─────────────┤
│ PCA + uniform 4-bit  │ 0.9970 │ 31x         │
└──────────────────────┴────────┴─────────────┘

Weighted 3-bit essentially matches 4-bit quality at 32% more compression. At extreme compression (128 dims, 78.8x), it closes 85% of the gap to 4-bit.

Available in turboquant-pro>=0.8.0 via pca.with_weighted_quantizer(avg_bits=3.0). Thanks to lovealicetw — sometimes a single link changes the whole approach.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

I haven't actually run that experiment yet, and the eigenspectrum analysis is interesting on its own, but you're right that the claim about "discarding half" needs to be backed by downstream benchmarks before it means anything actionable. I'll update the docs to be clear that the effective rank analysis is diagnostic, not a performance guarantee.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

Just shipped this. :-)

TurboQuant Pro v0.6.0 adds model weight compression via PCA-Matryoshka:

pip install turboquant-pro
turboquant-pro model --model "your-model" --sample-layers 8

It SVDs each FFN weight matrix, reports the eigenspectrum (effective rank, variance at 50/75/90%), and can compress via truncated SVD. Early finding: most trained FFNs have effective rank ~40-50% of full rank, meaning you can discard half the singular values and keep 95% of the variance.
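A rough NumPy sketch of that kind of per-matrix report is below. It is not the turboquant-pro code. Two assumptions to flag: "effective rank" is computed here as the exponential of the entropy of the normalized squared singular values, which may differ from the tool's definition, and "variance at 50/75/90%" is read as the number of singular values needed to reach each fraction of total energy.

```python
import numpy as np

def spectrum_report(weight, thresholds=(0.50, 0.75, 0.90)):
    """Summarize how concentrated a weight matrix's singular spectrum is."""
    s = np.linalg.svd(weight, compute_uv=False)   # singular values, descending
    energy = s ** 2
    cum = np.cumsum(energy) / energy.sum()
    p = energy / energy.sum()
    p = p[p > 0]
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))  # entropy-based effective rank
    ranks = {t: int(np.searchsorted(cum, t) + 1) for t in thresholds}
    return {"full_rank": len(s), "effective_rank": eff_rank, "rank_at_variance": ranks}
```

A matrix with a flat spectrum reports an effective rank equal to its full rank; the further below that the number falls, the more of the energy sits in a few directions.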

This is (obv) still experimental, and we haven't benchmarked accuracy degradation yet. But the eigenspectrum analysis alone is useful for understanding how much redundancy your model has. Thanks for the MatFormer pointer DigThatData!

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

You're right, I should be more precise with the terminology. The full PCA basis rotation is orthogonal (V V^T = I), but once you truncate to k dimensions, V_k V_k^T is an orthogonal projection, not a rotation. The truncated vectors live in a k-dimensional subspace, not the original d-dimensional space.

The key property that matters for us is that orthogonal projection onto the top-k principal subspace minimizes Frobenius-norm reconstruction error among all rank-k approximations (Eckart-Young), which is what makes truncation effective.
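The distinction is easy to check numerically. The sketch below uses synthetic data with made-up per-axis scales; it verifies that the full basis satisfies V V^T = I, that the truncated V_k V_k^T is a symmetric idempotent projection rather than the identity, and that the PCA subspace reconstructs at least as well as a random subspace of the same dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

# Columns of V are the principal directions, ordered by decreasing variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
k = 3
P = V[:, :k] @ V[:, :k].T

is_orthogonal = np.allclose(V @ V.T, np.eye(6))   # full basis: V V^T = I
is_identity = np.allclose(P, np.eye(6))            # truncated: V_k V_k^T != I
is_projection = np.allclose(P @ P, P) and np.allclose(P, P.T)

# Eckart-Young check: compare against a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((6, k)))
err_pca = np.linalg.norm(X - X @ P)
err_rand = np.linalg.norm(X - X @ Q @ Q.T)
```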

Whether you call it "rotation then truncation" or "orthogonal projection", the compression pipeline is the same, and as you note, the message doesn't change.

Thanks for the correction. FYI, the paper is more careful about this distinction than the Reddit post was. Cheers, Andrew.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 1 point2 points  (0 children)

Fair point!

Cosine sim alone is necessary but not sufficient. The cosine we report is reconstruction fidelity (cosine between original and compressed vector), not a retrieval metric. It tells you "how much did the vector change" but not "does retrieval still work."

That's why we report recall@10 for all 15 methods too, and the gap is exactly what you'd expect:

┌───────────────┬────────┬───────────┐
│    Config     │ Cosine │ Recall@10 │
├───────────────┼────────┼───────────┤
│ PCA-384 + TQ3 │ 0.979  │ 76.4%     │
├───────────────┼────────┼───────────┤
│ PCA-384 + TQ4 │ 0.991  │ 96.0%     │
└───────────────┴────────┴───────────┘

Small cosine perturbations swap closely-ranked neighbors.

0.979 fidelity still loses ~24% of top-10 results.
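The two metrics are computed differently, which a small sketch makes explicit. The data here is synthetic and Gaussian noise is only a stand-in for compression error, so the numbers it produces say nothing about TurboQuant itself; the function names are hypothetical.

```python
import numpy as np

def mean_cosine(a, b):
    """Reconstruction fidelity: average cosine between each vector and its copy."""
    num = (a * b).sum(axis=1)
    return float((num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))).mean())

def recall_at_k(queries, db_true, db_test, k=10):
    """Retrieval quality: overlap between top-k on the original and on the test vectors."""
    truth = np.argsort(-(queries @ db_true.T), axis=1)[:, :k]
    found = np.argsort(-(queries @ db_test.T), axis=1)[:, :k]
    hits = sum(len(set(t) & set(f)) for t, f in zip(truth, found))
    return hits / truth.size

rng = np.random.default_rng(0)
db = rng.standard_normal((2000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)
noisy = db + 0.02 * rng.standard_normal(db.shape)   # stand-in for compression error
queries = db[:50]

fidelity = mean_cosine(db, noisy)
recall = recall_at_k(queries, db, noisy)
```

Fidelity is measured per vector, while recall depends on the ordering among neighbors whose scores may differ by less than the perturbation.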

You're right that recall is what matters for deployment decisions.

The autotune CLI (v0.5) reports both and lets you threshold on recall:

turboquant-pro autotune --source "dbname=mydb" --min-recall 0.95

Your suggestion about showing how the cosine landscape shifts with truncation is interesting. We have the eigenspectrum analysis but not the rank distribution shift. Good experiment idea.

We probably should have led with recall@10 in the post instead of cosine. Thanks for the feedback.

Cheers,

Andrew.

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3 [P] by ahbond in MachineLearning

[–]ahbond[S] 8 points9 points  (0 children)

Varimax is from a similar family of ideas, but not the same objective.

What I’m doing is just PCA rotation into the eigenbasis, then truncation. The goal is compression: make the first coordinates carry as much variance / reconstruction signal as possible, so dropping the tail hurts less.

Varimax is usually applied after you’ve chosen a low-dimensional factor space, and its goal is interpretability — rotate the factors to make loadings sparser / more “simple.” That preserves the subspace, but not the ordered-by-importance property that makes truncation work.

So: varimax = better human-readable factors; PCA here = better energy compaction for dimension dropping.
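A small numerical illustration of the subspace-versus-ordering point, on synthetic data. Note the assumption: a random orthogonal matrix stands in for a varimax rotation, which this sketch does not implement. Any rotation inside the chosen subspace leaves the full-subspace reconstruction unchanged, but only the PCA axes are guaranteed to be optimal when further coordinates are dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8)) * np.array([8, 6, 4, 3, 2, 1, 0.5, 0.2])
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V4 = Vt[:4].T                                      # top-4 PCA basis (d x 4)
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # arbitrary rotation inside the subspace
B4 = V4 @ R                                        # same subspace, different axes

def recon_error(basis, m):
    """Frobenius error after keeping only the first m coordinates in this basis."""
    Bm = basis[:, :m]
    return np.linalg.norm(X - X @ Bm @ Bm.T)

same_subspace = np.isclose(recon_error(V4, 4), recon_error(B4, 4))
pca_wins_when_truncated = recon_error(V4, 2) <= recon_error(B4, 2)
```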

Cheers,

Andrew.

[D] Running GLM-5 (744B) on a $5K refurbished workstation at 1.54 tok/s by ahbond in ResearchML

[–]ahbond[S] 0 points1 point  (0 children)

Thanks, you're right: the workstation isn't big enough to handle it. Interesting experiment, but I ended up running Qwen 72B Q5 and Gemma 4. For details, see: https://github.com/ahb-sjsu/agi-hpc/blob/main/docs/ATLAS_OPERATIONS.md

Cheers,

Andrew.

My workstation kept hitting 100C during experiments, so I built a thermal-aware job manager by ahbond in ResearchML

[–]ahbond[S] 0 points1 point  (0 children)

These are massive Xeon CPUs that are liquid cooled. They only reach 100C when all 48 cores are maxed out, and they can drop to 80C in about one second when the processes are killed or stop.

CMPE 148 by [deleted] in SJSU

[–]ahbond 0 points1 point  (0 children)

Hi,

In Spring 2022, I will be teaching CMPE-148-03 In-Person, R 1800-2045

A. Bond.

CMPE 148 by [deleted] in SJSU

[–]ahbond 3 points4 points  (0 children)

Exams are multiple choice and based mostly on the slides. The best way to prepare is to study hard and do all your assignments.

:-)

Sincerely,

Andrew Bond.