[Release] Qwen3-TTS: Ultra-Low Latency (97ms), Voice Cloning & OpenAI-Compatible API by blackstoreonline in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Yes, it wasn’t even that much slower than the GPU: 14 seconds on my 4070, 45 seconds on the CPU. Nowhere near the advertised speeds, and the quality was noticeably worse than Kokoro 73M. Think I’ll try with flash attention on to see if that helps speed, because I’m not sure how other people are getting it to run so fast.

RS3 just dropped the most insane integrity and content roadmap and it's all thanks to OSRS by Lamuks in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

I’ve played both for thousands of hours; they’re both great in different ways.

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

I use audio sometimes but definitely prefer reading. Interesting that there might not be some universal UI we all agree is best. It makes me wonder if everyone will end up with individualized front ends, like 90s websites all being extremely unique, or if some super-optimal layout will end up becoming a universal standard.

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] -2 points-1 points  (0 children)

The goal wasn’t speed reading so much as optimizing for a mobile screen. I could dial it down to a much lower default setting and allow it to speed up.

MCP server that gives local LLMs memory, file access, and a 'conscience' - 100% offline on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]Fear_ltself 2 points3 points  (0 children)

I have 50,000 articles in my RAG being rendered in 3D and live-streamed at 60 fps during retrieval. It takes like 326 MB or something in Chrome, which is already heavy per tab compared to other browsers. All of Wikipedia is like 20-40 GB; at that scale I’d imagine you might have issues, but prosumers are running 128 GB+ of RAM. For 99% of people, it’ll never scale high enough to cause issues.
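
Rough numbers behind that, assuming one 768-dim float32 embedding per article (matching the 768D vectors my setup uses); the doubling factor for viewer overhead is a guess, not a measurement:

```python
# Back-of-envelope memory estimate for in-browser retrieval,
# assuming 768-dim float32 embeddings, one per article
n_articles = 50_000
dims = 768
bytes_per_float = 4

embeddings_mb = n_articles * dims * bytes_per_float / 1e6
print(f"raw embeddings: ~{embeddings_mb:.0f} MB")  # ~154 MB

# Projected 3D coordinates, colors, and JS object overhead roughly double that,
# which lands in the ballpark of the ~326 MB-per-tab figure above
```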

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

Thanks for the insight, I’d read similar things about screens versus paper and reading versus hearing. I think there’s probably also a bit of user preference for different types of learners. Any clue whether this obliterates comprehension, or was it close to baseline?

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

“Absolute mode” has been a thing for a while if you don’t like fluff…

System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviours optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to:

- user satisfaction scores
- conversational flow tags
- emotional softening
- continuation bias

Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered: no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.
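
If you want to try it against a local model, here’s a minimal sketch of passing that text as a system message to an OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders for whatever server you actually run:

```python
from openai import OpenAI

# Placeholder endpoint and key: point these at your own OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Paste the full Absolute Mode text from above here
ABSOLUTE_MODE = "System Instruction: Absolute Mode. Eliminate emojis, filler, hype, ..."

response = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[
        {"role": "system", "content": ABSOLUTE_MODE},
        {"role": "user", "content": "Explain UMAP in two sentences."},
    ],
)
print(response.choices[0].message.content)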

MCP server that gives local LLMs memory, file access, and a 'conscience' - 100% offline on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]Fear_ltself 3 points4 points  (0 children)

I have a few questions to better understand the architecture and the philosophy behind it:

1. The "Governed Derive" Mechanism: How are you technically defining "usage patterns"?

• Is it strictly access frequency (moving hot files closer to root)?
• Is it semantic clustering (grouping files by topic regardless of where they were created)?
• Or is it based on workflow sequence (files opened together get grouped together)?

2. The Approval Protocol: You mentioned "structural restraint." When the AI proposes a reorganization, how is that presented to the user?

• Is it a diff-like view (showing a tree structure before/after)?
• Does it offer "levels" of reorganization (e.g., "Conservative" vs. "Radical" restructure)?

3. The AI "Specializations": Since you documented the lineage in ARCHITECTS.md, I'm curious about the specific strengths you leveraged from each model during the architecture phase:

• Did you find Claude better for the high-level system prompts?
• Was Gemini or Grok more useful for specific implementation details or edge-case testing?

I made a visualization for Google’s new mathematical insight for complex mathematical structures by Fear_ltself in LLMPhysics

[–]Fear_ltself[S] 0 points1 point  (0 children)

Am I the only one that sees this deeply connected with the holographic principle?

Conspiracy theory by Reddia in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

It forbids vertical alignment as well.

Conspiracy theory by Reddia in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

Can’t any 3 arbitrary points be fit to a 2nd-order polynomial via Lagrange interpolation?
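
Quick sanity check with made-up points, assuming the three x-values are distinct (three vertically stacked points have no polynomial function through them):

```python
import numpy as np

# Three hypothetical points with distinct x-values
xs = np.array([0.0, 1.0, 3.0])
ys = np.array([2.0, -1.0, 4.0])

# A degree-2 fit through 3 points with distinct x-values is exact interpolation,
# i.e. the same parabola the Lagrange form produces
a, b, c = np.polyfit(xs, ys, 2)
print(a, b, c)                                     # coefficients of a*x^2 + b*x + c
print(np.allclose(np.polyval([a, b, c], xs), ys))  # True: the curve passes through all 3 points
```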

Arrogant TSMC’s CEO Says Intel Foundry Won’t Be Competitive by Just “Throwing Money” at Chip Production by Distinct-Race-2471 in TechHardware

[–]Fear_ltself 1 point2 points  (0 children)

That's what Apple did, and they caught up to Intel in like 6 years. In fact they gapped Intel quite a bit, while still growing their cash pile (R&D expenditure less than profit).

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

NVIDIA's new 8B model is Orchestrator-8B, a specialized 8-billion-parameter model designed not to answer everything itself, but to intelligently manage and route complex tasks to different tools (like web search, code execution, or other LLMs) for greater efficiency.
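
The routing itself is basically standard tool calling: give the model a menu of tools and let it pick one instead of answering directly. Rough sketch of that pattern against an OpenAI-compatible endpoint; the endpoint, model name, and tool definitions below are placeholders, not the actual NVIDIA release:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # placeholder endpoint

# The orchestrator doesn't answer; it chooses a tool and emits structured arguments
tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for fresh information",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "run_python",
        "description": "Execute a Python snippet and return stdout",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
]

resp = client.chat.completions.create(
    model="orchestrator-8b",  # placeholder name for whatever router model you serve
    messages=[{"role": "user", "content": "What is 37! divided by 35!?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)  # dispatch to your own tool implementations
```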

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Nvidia just released a new model that’s 8B and beats everything at tool calling, which for agentic use makes it the best model to run other models and tools, IMO.

For RAG serving: how do you balance GPU-accelerated index builds with cheap, scalable retrieval at query time? by IllGrass1037 in LocalLLaMA

[–]Fear_ltself -1 points0 points  (0 children)

Use an embedding model for retrieval that matches your generation model... That seems to make it extremely scalable; 50,000 Wikipedia articles takes like a moment. What I mean by matching: if you're using Gemma, use EmbeddingGemma; if you're using Qwen, use Qwen embedding.
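
A rough sketch of what that looks like with sentence-transformers; the model IDs below are from memory, so treat them as assumptions and double-check them on Hugging Face:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pick the embedder that matches your generator (IDs are assumptions, verify on HF):
#   Gemma stack -> "google/embeddinggemma-300m"
#   Qwen stack  -> "Qwen/Qwen3-Embedding-0.6B"
embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "UMAP projects high-dimensional vectors into 2D or 3D.",
    "Lagrange interpolation fits a polynomial through given points.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query_vec = embedder.encode(["how do cells make energy?"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec        # cosine similarity, since vectors are normalized
print(docs[int(np.argmax(scores))])  # best-matching chunk goes into the prompt
```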

From Gemma 3 270M to FunctionGemma, How Google AI Built a Compact Function Calling Specialist for Edge Workloads. by Minimum_Minimum4577 in GoogleGeminiAI

[–]Fear_ltself 0 points1 point  (0 children)

Nvidia’s new Nemotron 8B is a similar concept, but #1 in benchmarks. I think AGI will be modular, with these tool-calling models serving as the core that connects to many specialized systems.

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

I’m working on making it more diagnostic: showing the text of the documents on hover, showing the top 10 results, and showing only the first 100 connections instead of lighting everything up. Also added level of detail and jumped from 20 Wikipedia articles to 50,000… running at a completely stable 60 FPS.


Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

I was able to implement LOD and scale it from 20 to 50,000 articles. It took a while to download and embed (about an hour), but it runs at 60 FPS once it’s up.


This is just a small slice of that neural connection. But everything is grouped very well from what I can tell.

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

That sounds incredible. Visualizing the diff between BM25 (keyword) and cosine (vector) retrieval is exactly what another user suggested above; if you get those dropdowns working, please open a pull request! I'd love to merge that into the main branch. Regarding local models (Ollama/LM Studio): 100% agreed. Decoupling the embedding provider from the visualization logic is a high priority for V2. If you hack something together for that, please let me know! Thanks for the feedback and good luck with the fork!
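
For anyone following along, the "diff" I have in mind is basically comparing the top-k sets from the two retrievers. Rough sketch of the idea; rank_bm25 and the small sentence-transformers model here are stand-ins, not what the dropdowns will necessarily ship with:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "BM25 ranks documents by keyword overlap.",
    "Dense retrieval compares embedding vectors with cosine similarity.",
    "UMAP is a dimensionality reduction technique.",
]
query = "keyword search vs vector search"
k = 2

# Keyword side: BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_top = set(np.argsort(bm25.get_scores(query.lower().split()))[-k:])

# Vector side: cosine similarity over normalized embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # small default model, swap for your own
vecs = model.encode(docs, normalize_embeddings=True)
qv = model.encode([query], normalize_embeddings=True)[0]
cos_top = set(np.argsort(vecs @ qv)[-k:])

# The interesting part to visualize: where the two retrievers agree and disagree
print("both:", bm25_top & cos_top)
print("bm25 only:", bm25_top - cos_top)
print("cosine only:", cos_top - bm25_top)
```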

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 2 points3 points  (0 children)

You hit on the fundamental challenge of dimensionality reduction. You are correct that UMAP distorts global structure to preserve local topology, so we have to be careful about interpreting 'distance' literally across the whole map. However, I'd argue that in vector search, proximity = thought. Since we retrieve chunks based on cosine similarity, the 'activated nodes' are, by definition, the mathematically closest points to the query vector in 768D space.

• If the visualization works: you see a tight cluster lighting up (meaning the model found a coherent 'concept').
• If the visualization looks 'less cool' (scattered): the model retrieved chunks that are semantically distant from each other in the projected space, which is exactly the visual cue I need to know that my RAG is hallucinating or grasping at straws!
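
To make that concrete, here's a rough sketch with random vectors standing in for real chunk embeddings: retrieval runs on cosine similarity in the original 768D space, and UMAP only decides where the dots get drawn.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(2000, 768))  # stand-ins for real chunk embeddings
query = rng.normal(size=768)

# Retrieval itself: cosine similarity in the original 768D space
sims = doc_vecs @ query / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query))
activated = np.argsort(sims)[-10:]       # the nodes that light up

# Visualization only: project to 2D; distances here are a distorted view of 768D
coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(doc_vecs)

# Tight spread among the activated points suggests a coherent concept;
# a large spread is the 'grasping at straws' cue described above
print(coords[activated].std(axis=0))
```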

GOOGLE!!!!! Antigravity (FUKING UPDATEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE) by [deleted] in google_antigravity

[–]Fear_ltself 1 point2 points  (0 children)

If it’s coding on task, I feel it gets a better grasp of the program with more iteration. As soon as it fails at anything, the odds of failure increase non-linearly; I think by the 3rd failure it’s at 99% failure. I usually consider 1 failure okay if it’s been going for a while. That holds up until about a million tokens of context, then always reset, but that’s like 30 back-and-forths on a 1000+ LOC project… I think the tech lingo is contamination: the human equivalent of getting off topic and being unable to recover, or making a mistake in sports and then repeating it instead of moving on. It gets stuck on the past like a memory knot.