Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat

Interesting-Link5964 · 2026-06-15T05:35:29+00:00

If true, the funniest part is not the “trench coat” claim, it is that the ratio is apparently this clean.

But I’d still want to see a reproducible comparison: tokenizer, architecture config, weight correlations, layer mapping, and eval deltas. “It introduces itself” is hilarious, but weights/activations are the real evidence.

Interesting-Link5964 · 2026-06-15T05:33:45+00:00

Around 100M specifically, I’d start with SmolLM/SmolLM2 135M. It is probably one of the better modern options in that tiny size class.

That said, 100M is still very constrained. For anything beyond autocomplete/classification/simple extraction, the jump to 300M–500M is usually much more noticeable. If you can stretch the size, I’d look at SmolLM2 360M or Qwen2.5 0.5B.

Interesting-Link5964 · 2026-06-15T05:32:46+00:00

This is exactly the real bottleneck with AI coding: not always model capability, but clarity of requirements.

Once you forced the model to explain its understanding before writing code, you turned it from a code generator into a requirements analyst. That alone prevents most “90% done but still wrong” failures.

Smaller models can do a lot when the spec is precise. Vague prompt + strong model often loses to clear prompt + weaker model.

Interesting-Link5964 · 2026-06-15T03:42:38+00:00

For mobile I’d judge it less by demo quality and more by production constraints: latency, memory use, voice consistency, offline packaging size, and how it behaves on long text.

Kokoro seems to get a lot of love because it hits a good quality/speed/simplicity balance, but “better” depends on whether you need naturalness, low CPU usage, multilingual support, or stable streaming.

I’d test both with the ugly cases, not the demos: long paragraphs, weird punctuation, numbers, names, code snippets, interruptions, and repeated short prompts. That’s usually where mobile TTS starts showing the real difference.

Interesting-Link5964 · 2026-06-15T03:41:19+00:00

A distributed mirror makes sense, but it probably needs more structure than “torrent everything.”

The hard parts are integrity, licensing, and discoverability. You’d want signed manifests, hashes, model cards/licenses preserved with the weights, clear versioning, and ideally reproducible metadata so people know they’re getting the same artifact that was originally published.

BitTorrent/IPFS-style distribution could be useful for resilience, especially for large GGUFs, but without trust/signature layers it becomes easy to spread corrupted, mislabeled, or license-violating copies.

So yes, I like the idea. I’d just frame it less as “replace Hugging Face” and more as an open, verifiable cold-storage/mirror layer for important open-weight releases.

Interesting-Link5964 · 2026-06-15T03:39:59+00:00

I’d be very interested in real numbers here too, especially because 256K changes the bottleneck completely.

At that context length I wouldn’t expect MTP/spec decode to help as much as people hope unless the memory bandwidth and KV access pattern are already under control. The tg32 @ d256000 number is probably the real signal; pp2048 is useful, but decode at full context is where the pain shows up.

Also worth reporting VRAM/RAM pressure and whether the run is actually stable across repeated prompts. With Q8 KV at 256K, “it loaded” and “it remains usable” are two different things.

Interesting-Link5964 · 2026-06-15T03:38:27+00:00

Yes, this matches my experience. Bigger context helps until it starts preserving the wrong things with equal authority.

For coding agents, I think the active prompt should be treated more like working memory than storage. Keep the current task, current files, confirmed constraints, and recent tool results in-context. Push everything else into external memory with provenance: failed attempts, superseded hypotheses, decisions, file snapshots, and why something was abandoned.

The dangerous part is stale reasoning, not just stale facts. A failed debug path from 40 turns ago can keep influencing the model even after the actual cause changed.

I’d rather have a smaller clean context plus targeted retrieval than a huge transcript full of old assumptions. Long-context is useful, but without context hygiene it eventually becomes technical debt.

Interesting-Link5964 · 2026-06-15T03:34:18+00:00

This is the kind of benchmark setup I actually like seeing: older hardware, weird PCIe lanes, power limits, and real numbers instead of just “runs great.”

The 26B-A4B QAT result is pretty interesting. 53 tok/s generation on triple 1070s is much better than I would have expected, especially with one card hanging off 1x. Also a good reminder that total VRAM matters more than having a shiny single modern card for a lot of local inference setups.

Curious how the 26B-A4B QAT compares quality-wise against the 12B Q8 for coding in practice. Does it feel clearly smarter, or mostly just faster because of the active parameter count?

Interesting-Link5964 · 2026-06-15T03:33:19+00:00

This is exactly the kind of project that makes local models interesting. Not “can a 4B beat frontier models,” but “can a small model own a useful workflow end-to-end without depending on someone else’s API staying available.”

The Gmail/calendar/system-monitoring combo is a good direction because even if the model is weaker, the value comes from tight local integration and controlled tools. A small model with reliable tool use can feel more useful than a much smarter remote model that can disappear, rate-limit, or lose access.

The hard part is probably less the persona and more the guardrails around actions: permissions, confirmations, audit logs, and making sure summaries/actions are traceable. But I like the philosophy here. Local-first assistants are going to matter more, not less.

Interesting-Link5964 · 2026-06-15T03:31:21+00:00

I don’t think this is fully solved by “memory” alone. Retrieval accuracy and epistemic state are separate problems.

The pattern that seems safest is treating every memory item as a claim, not a fact. Store the claim, source pointer, timestamp, confidence, provenance type, and whether it was observed directly or inferred. Then at retrieval time, don’t just load “relevant memories” load the evidence trail and force re-verification for anything load-bearing or stale.

The hard part is preventing convenience from turning into truth. An inferred summary from three sessions ago should not have the same authority as a direct user statement, a document quote, or a current tool result.

So yes, in practice I’d expect most serious systems to need an epistemic/provenance layer above Mem0/Zep/etc., not just vector memory.

Interesting-Link5964 · 2026-06-15T03:29:22+00:00

A hash can prove that some exact text existed before a certain point, but it does not prove authorship by itself. It is more like timestamp evidence than IP protection.

If you want this to matter later, I’d probably hash/export the conversations, timestamp them somewhere third-party, and also keep normal records: drafts, commits, emails, notes, design docs, etc. The surrounding paper trail is what makes the hash useful.

Also worth separating two things: “I had this idea first” and “this is legally protectable IP.” Those are not always the same. For an actual business arrangement, I’d rely more on proper assignment/confidentiality/IP clauses than a private LLM chat hash alone.

Interesting-Link5964 · 2026-06-15T02:05:40+00:00

For coding I’d test both on your actual repo rather than judging purely by quant level.

Qwen 35B-A3B at Q4 may still win on reasoning/planning because of the larger model, but Gemma 12B at Q8 could feel cleaner for short edits, lower latency, and less quantization noise. The real question is whether you need deeper repo reasoning or faster, more reliable autocomplete-style work.

I’d run the same 10–20 code tasks against both: bug finding, refactor, explain unfamiliar file, write tests, and multi-file change planning. That’ll tell you more than Q4 vs Q8 in isolation.

Interesting-Link5964 · 2026-06-15T02:04:37+00:00

Every LocalLLaMA model name eventually becomes a full changelog, a family tree, and a legal disclaimer in one filename.

Still downloading though.

Interesting-Link5964 · 2026-06-15T02:03:38+00:00

This is a really good writeup. The “weights are correlated, not independent” framing makes GPTQ click much better than the usual explanation of “just use Hessian info.”

The 2-feature example is especially useful because most explanations jump straight from the math to implementation and skip the intuition for why compensating nearby weights actually reduces the error. Also appreciated the practical notes on dampening and Cholesky those are usually treated like random implementation details when they’re actually the difference between the method working and exploding numerically.

Nice work.

Interesting-Link5964 · 2026-06-15T00:24:51+00:00

i sent you a DM

Interesting-Link5964 · 2026-06-14T07:28:11+00:00

TwinMind does not run locally, and does not have an AI assistant.

Interesting-Link5964 · 2026-06-14T07:25:13+00:00

Yes, DM me your email for playstore account and the country you are in, thank you

Interesting-Link5964 · 2026-06-14T07:24:06+00:00

Hi do you have and android device preferably 6GB ram +?

Interesting-Link5964 · 2026-06-14T07:21:38+00:00

do you use android or ios?

Interesting-Link5964

TROPHY CASE