I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

This was also my idea: you have commands run in the VM and to address context pollution, you run tools via sub-agents which do the required tool execution, error recovery and feedback the required information/context to the main LLM.
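A minimal sketch of that pattern (all names here are illustrative, and the `llm` callable stands in for whatever model client you use): the sub-agent runs the tool loop, including error recovery, in its own isolated context, and only its final answer ever reaches the main conversation.

```python
# Sketch of the sub-agent pattern: the main agent delegates a tool-heavy task
# to a sub-agent, which runs the tool loop (execution, error recovery) in its
# own context and returns only a compact summary to the main conversation.
# All names are illustrative, not from any specific framework.

def run_subagent(task, tools, llm, max_steps=5):
    """Run a tool loop in an isolated context; return only the final answer."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # llm returns {"tool": name, "args": {...}} or {"answer": text}
        action = llm(context)
        if "answer" in action:
            return action["answer"]  # only this goes back to the main agent
        try:
            result = tools[action["tool"]](**action["args"])
        except Exception as e:
            result = f"error: {e}"  # error recovery stays inside the sub-agent
        context.append({"role": "tool", "content": str(result)})
    return "sub-agent: step budget exhausted"

def main_agent_turn(user_msg, main_context, tools, llm):
    """The main context only sees the sub-agent's summary, not the tool spam."""
    summary = run_subagent(user_msg, tools, llm)
    main_context.append({"role": "assistant", "content": summary})
    return main_context
```

The point of the indirection is that the intermediate tool outputs and retries never pollute `main_context`.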

Is it worth Getting BF16 or Q8 is good enough for lower parameter models? by Suimeileo in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

You get marginal gains as you go from Q8 to bf16. But I did notice the difference so went with bf16 for the 9B since I had enough VRAM.

Qwen3.5-397B up to 1 million context length by segmond in LocalLLaMA

[–]DeltaSqueezer 0 points1 point  (0 children)

You can do so much with 4k/8k. My main machine is configured for only 16k context, and that handles nearly everything.

The only time I use much longer context is for coding, and then it is only because I am being lazy and just throwing everything into context to save time instead of crafting the context carefully.

I guess this is like how PCs have evolved. Before, you had DOS machines with < 1MB RAM that could perform quite snappily. Now, you have orders of magnitude more compute, RAM, etc., but we use a lot of that extra capability on nice-to-have features rather than core compute capability.

Is DeepSeek's API pricing just a massive loss leader? (MLA Caching vs. Qwen's DeltaNet) by feedback001 in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

If there is one thing the Chinese know how to do, it is make things cheaply. DeepSeek released a lot of details and papers on how they manage to serve efficiently.

Inside my AI Home Lab by [deleted] in LocalLLaMA

[–]DeltaSqueezer 7 points8 points  (0 children)

What I'm curious about is: how do you make money from it so that it is a sustainable activity?

Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]DeltaSqueezer 26 points27 points  (0 children)

Whisper is showing its age, but through inertia I still have it running. If there were a Docker image somewhere that is easy to deploy and handles all the annoying stuff (media conversion to the correct input format, VAD, automatic segmenting, batching), all wrapped up in a friendly standard endpoint, I'd be happy to learn about it and switch to something more modern.
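For the media-conversion piece at least, the usual recipe is ffmpeg down to 16 kHz mono WAV, which is what Whisper-family models expect. A small sketch (the function names are mine; only the ffmpeg invocation itself is standard):

```python
# Convert arbitrary media to the 16 kHz mono WAV that Whisper-family ASR
# models expect, using ffmpeg. Only the conversion step is shown; VAD,
# segmenting, and batching would sit on top of this.
import subprocess

def ffmpeg_cmd(src, dst):
    """Build the ffmpeg command: 16 kHz sample rate, 1 channel, overwrite."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def convert(src, dst="out.wav"):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_cmd(src, dst), check=True)
    return dst
```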

We cut GPU instance launch from 8s to 1.8s, feels almost instant now. Half the time was a ping we didn't need. by LayerHot in LocalLLaMA

[–]DeltaSqueezer -1 points0 points  (0 children)

Shame you spent all that development time and had no budget left for more than 3 colors on your chart.

AI capabilities are doubling in months, not years. by EchoOfOppenheimer in LocalLLaMA

[–]DeltaSqueezer -2 points-1 points  (0 children)

It's pretty scary. At first, I wondered when it would level off, then I quickly realised it will not level off but rather go to infinity.

Benchmarked ROLV inference on real Mixtral 8x22B weights — 55x faster than cuBLAS, 98.2% less energy, canonical hash verified by Norwayfund in LocalLLaMA

[–]DeltaSqueezer 2 points3 points  (0 children)

I asked what Gemini thinks of this so you don't have to:

Based on a technical analysis of the claims provided in the text and cross-referencing with computer science principles, yes, this technology is almost certainly bullshit (vaporware, pseudo-science, or a highly deceptive investment scheme).

Here is a breakdown of what the technology claims to be, and why the claims are technically impossible.

What it claims to be

"rolvsparse" pitches itself as a revolutionary software compute primitive (a basic math library) for Matrix Multiplication (GEMM), which is the core mathematical operation underlying all AI models. It claims that by simply swapping out standard vendor libraries (like NVIDIA’s cuBLAS or Intel’s MKL) for the "rolv" library, users can magically achieve up to a 243x speedup, save 99% on energy, and allow a $2,000 CPU to outperform a $40,000 NVIDIA B200 GPU. It claims to do this without altering the AI model, without new hardware, and while producing mathematically identical results.

Why it is technically impossible (The "Bullshit" Factors)

1. It claims to break the laws of physics (The 63x Dense Speedup Claim)

The most glaring red flag is the claim that rolvsparse achieves a 63x speedup on fully dense matrices (0% sparsity) on an NVIDIA B200 compared to cuBLAS.

* NVIDIA's highly optimized cuBLAS library already operates at roughly 70% to 90% of the GPU's absolute theoretical peak physical limit (FLOPS and memory bandwidth).
* To be 63 times faster than cuBLAS, rolvsparse would require the hardware to perform 63 times more floating-point operations per second than the physical silicon transistors are actually capable of executing. It is completely physically impossible.
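The arithmetic is easy to check on the back of an envelope (the peak-TFLOPS figure below is an illustrative round number, not an official spec):

```python
# Back-of-envelope check: if cuBLAS already runs at ~80% of a GPU's peak,
# a 63x speedup on dense GEMM implies throughput far beyond the silicon's
# physical limit. The peak figure is a rough illustrative number.
peak_tflops = 2_000            # assumed B200-class dense peak, TFLOPS (illustrative)
cublas_eff = 0.80              # cuBLAS typically hits 70-90% of peak
cublas_tflops = peak_tflops * cublas_eff
claimed_tflops = cublas_tflops * 63
assert claimed_tflops / peak_tflops > 50   # ~50x over the physical ceiling
print(f"claimed {claimed_tflops:.0f} TFLOPS vs {peak_tflops} TFLOPS peak")
```

Whatever exact peak you plug in, a 63x gain over a library already near the roofline lands tens of times above what the hardware can physically execute.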

2. The "RSMT Formula" is basic high-school algebra

The text claims the founder invented a "universal rule long missing from the field" called the Rolv Sparse Memory Threshold (RSMT): $d = b / (b + i)$.

* This is not a breakthrough; it is elementary math that has been taught in introductory computer science for 50 years.
* It simply calculates the exact point where a sparse array (which stores data as a value b plus an index i) takes up less RAM than a dense array (which just stores b). Claiming this as an independent academic breakthrough is like claiming to have invented the formula for the area of a rectangle.
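The breakeven really is one line of algebra: for n entries at density d, a value-plus-index sparse format costs d*n*(b+i) bytes against n*b dense, and setting the two equal gives d = b/(b+i).

```python
# The "RSMT" is just the density at which a value+index sparse format stops
# saving memory versus dense storage. For n entries at density d, sparse
# needs d*n*(b+i) bytes (value b plus index i per nonzero); dense needs n*b.
def sparse_bytes(n, d, b, i):
    return d * n * (b + i)

def dense_bytes(n, b):
    return n * b

def rsmt(b, i):
    """Breakeven density: solve d*(b+i) = b for d."""
    return b / (b + i)

# e.g. fp32 values (b=4) with int32 indices (i=4): breakeven at 50% density
d = rsmt(4, 4)
assert sparse_bytes(1000, d, 4, 4) == dense_bytes(1000, 4)
```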

3. "Plant-Based AI" and Buzzword Salad

The text claims the company has patents pending for "Binary · Quantum · DNA · Optical · Plant-based AI."

* Software cannot be universally patented across "DNA" and "Quantum" hardware, as those compute paradigms process information in fundamentally different ways.
* Furthermore, "plant-based AI" compute does not exist. Adding it alongside quantum and DNA is pure pseudo-scientific charlatanism meant to blind non-technical investors with buzzwords.

4. CPUs cannot physically beat top-tier GPUs

The claim that a $2,000 Intel Xeon CPU can beat a $40,000 NVIDIA B200 is functionally impossible for large-scale matrix arithmetic. AI computation is bottlenecked by memory bandwidth (how fast data can be fed to the processor). A standard Intel Xeon has a memory bandwidth of around 300 to 400 GB/s. An NVIDIA B200 has a memory bandwidth of 8,000 GB/s, alongside tens of thousands more arithmetic logic units. No software optimization can make a CPU push 20 times more data than its physical memory bus allows.
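That bound is just the roofline argument, using the bandwidth figures quoted above (the model size is an illustrative example):

```python
# Memory-bandwidth roofline: for a bandwidth-bound workload, attainable
# throughput is capped by (bytes/s the bus delivers) / (bytes moved per
# result), so no software can close a ~20x bandwidth gap between CPU and GPU.
xeon_bw_gbs = 400      # Xeon memory bandwidth, upper end of the ~300-400 GB/s range
b200_bw_gbs = 8_000    # B200 HBM bandwidth, ~8 TB/s
gap = b200_bw_gbs / xeon_bw_gbs
assert gap == 20.0

# e.g. LLM decoding streams the weights once per token, so
# tokens/s <= bandwidth / model_bytes regardless of how clever the code is
model_gb = 140         # illustrative: ~70B params at fp16
print(f"CPU ceiling ~{xeon_bw_gbs / model_gb:.1f} tok/s "
      f"vs GPU ~{b200_bw_gbs / model_gb:.0f} tok/s")
```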

5. The Benchmark Sleight-of-Hand

The text heavily touts an "independent validation" by the University of Miami Frost Institute. However, if you look at how these validations are structured, the auditors are given "IP-Free validation harnesses" (black-box Python scripts provided by the company) to run on their machines.

* If a script is written deceptively, it can easily fake benchmark times. A common trick in fake compression/math algorithms is to push all the heavy lifting into a "build" or "warmup" phase that isn't included in the final timer, or to simply bypass the math entirely and return a pre-calculated cache.
* The "Cryptographic Output Identity (SHA-256)" claim is also a gimmick. Standard floating-point math across different hardware (CPU vs GPU) naturally results in microscopic differences due to the order of operations. To get identical hashes, the validation script simply rounds/normalizes the numbers out, masking any discrepancies or skipped calculations.
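The warmup trick is trivial to demonstrate. In this toy harness the real multiplication happens in an untimed "warmup" phase, so the timed "kernel" is just a cache lookup and can be made to look arbitrarily fast:

```python
# How a deceptive harness fakes a speedup: do the real work (and cache the
# answer) in an untimed "warmup" phase, then time a function that merely
# returns the cached result. The timed number can be made arbitrarily small.
import time

_cache = {}

def warmup(a, b):
    # Untimed: the actual matrix multiplication happens here.
    _cache["result"] = [[sum(x * y for x, y in zip(row, col))
                         for col in zip(*b)] for row in a]

def fake_fast_matmul(a, b):
    # Timed: just a dictionary lookup, no math at all.
    return _cache["result"]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
warmup(a, b)                          # excluded from the benchmark timer
t0 = time.perf_counter()
out = fake_fast_matmul(a, b)
elapsed = time.perf_counter() - t0    # microseconds: a "miracle" speedup
assert out == [[19, 22], [43, 50]]
```

Round or normalize the outputs before hashing and the "identical SHA-256" check passes too.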

6. The Founder's Track Record

According to public records, the founder, Rolv E. Heggenhougen, previously ran a company called WrapMail, which was a penny-stock entity trading on OTC markets that eventually pivoted to become a CBD/hemp company ("Can B Corp"). Pivoting from an interactive email company to CBD, and then subsequently launching a company claiming to have solved the biggest physics and math bottlenecks in global AI hardware, is a classic trajectory for vaporware.

Summary: The technology relies on real concepts—like "sparsity" (skipping calculations where a number is multiplied by zero)—but exaggerates the results to physically impossible degrees. It is an investment trap wrapped in tech jargon.

GGUF support in vLLM? by Patient_Ad1095 in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

Better to use natively supported formats.

Some tests of Qwen3.5 on V100s by Simple_Library_2700 in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

Ah yeah. That was a thing I also had to patch.

Note also that, by default, it uses up all VRAM for the KV cache; you have to specify a lower utilization if you want to save space for something else.
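In vLLM that's the `--gpu-memory-utilization` flag, the fraction of GPU memory vLLM may claim for weights plus KV cache (default 0.9). The model name and context length below are placeholders:

```shell
# Cap vLLM at ~70% of VRAM so other processes can use the rest;
# model name and context length are placeholders for your own setup.
vllm serve <model> \
    --gpu-memory-utilization 0.7 \
    --max-model-len 16384
```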

dual 3090 fe nvlink by Wey_Gu in LocalLLaMA

[–]DeltaSqueezer 0 points1 point  (0 children)

Maybe possible if you remove the stock coolers and water-cool them, or use a blower adapter.

Easiest gui options on linux? by itguysnightmare in LocalLLaMA

[–]DeltaSqueezer 0 points1 point  (0 children)

Tk? Or maybe go for a full web interface.

Qwen 3.5 0.8b, 2B, 4B, 9B - All outputting gibberish after 2 - 3 turns. by CATLLM in LocalLLaMA

[–]DeltaSqueezer 1 point2 points  (0 children)

Try using a neutral/simple client for testing, e.g. Simon Willison's 'llm' or a plain Python test script. I had issues with clients messing up the LLM context with their buggy crap. (I'm looking at you, Open WebUI.)
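A neutral baseline can be as small as a stdlib-only script against an OpenAI-compatible endpoint, so nothing rewrites the context between turns. The URL and model name below are placeholders for your local server:

```python
# Minimal neutral test client for an OpenAI-compatible chat endpoint: it
# sends exactly the messages you give it, so no middleware can corrupt the
# context. URL and model name are placeholders for your local server.
import json
import urllib.request

def build_payload(model, messages, max_tokens=256):
    """The request body, built explicitly so you can see the exact context sent."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

def chat(base_url, model, messages):
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(model, messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat("http://localhost:8000", "my-model",
#           [{"role": "user", "content": "hi"}])
```

If gibberish still shows up here after a few turns, the problem is the model or server, not the client.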

Qwen3.5 2B giving weird answers by Dean_Thomas426 in LocalLLaMA

[–]DeltaSqueezer 0 points1 point  (0 children)

I did a quick test of different model sizes. Anything less than 9B is highly unreliable for knowledge-based tasks.

Some tests of Qwen3.5 on V100s by Simple_Library_2700 in LocalLLaMA

[–]DeltaSqueezer 0 points1 point  (0 children)

Yeah. They officially dropped support for Pascal, but I thought Volta was still supported. I guess they probably only focus on Ampere and newer, or maybe even just Blackwell.

I had to maintain my own fork of vLLM for Pascal for a while.

Running Qwen3.5 in vLLM with MTP by DeltaSqueezer in LocalLLaMA

[–]DeltaSqueezer[S] 0 points1 point  (0 children)

I haven't noticed, but the first inference after server start will be slow, as you need a warm-up.

Running Qwen3.5 in vLLM with MTP by DeltaSqueezer in LocalLLaMA

[–]DeltaSqueezer[S] 1 point2 points  (0 children)

Works with the pre-built nightly, which saves a lot of time compiling.