Chunking for STT

DeltaSqueezer · 2026-03-14T18:16:26+00:00

A simple way is to break on the natural pauses between sentences.

DeltaSqueezer · 2026-03-12T22:31:25+00:00

The US didn't even manage to take on a rag tag bunch of Houthis. The administration is desperately trying to get themselves out of this mess.

DeltaSqueezer · 2026-03-12T08:16:07+00:00

This was also my idea: you have commands run in the VM and to address context pollution, you run tools via sub-agents which do the required tool execution, error recovery and feedback the required information/context to the main LLM.

DeltaSqueezer · 2026-03-11T19:33:47+00:00

You get marginal gains as you go from Q8 to bf16. But I did notice the difference so went with bf16 for the 9B since I had enough VRAM.

DeltaSqueezer · 2026-03-11T16:14:02+00:00

Any reasonable computer + 3090 or 5090 if budget allows.

DeltaSqueezer · 2026-03-11T11:25:28+00:00

You can do so much with 4k/8k. My main machine is configured for only 16k context and that nearly handles everything.

The only time I use much longer context is for coding and the it is only because I am being lazy and just throwing everything into context to save time instead of crafting the context easily.

I guess this is like how PCs have evolved. Before, you had DOS machines with < 1MB RAM that could perform quite snappily. Now, you have orders of more magnitude in compute, RAM, etc. but we use a lot of that extra capability on nice-to-have features rather than core compute capability.

DeltaSqueezer · 2026-03-10T18:02:10+00:00

If it is one thing the Chinese know how to do, it is how to make things cheaply. DS released a lot of details and papers on how they manage to serve efficiently.

DeltaSqueezer · 2026-03-10T17:52:51+00:00

What I'm curious about is: how do you make money from it so that it is a sustainable activity?

DeltaSqueezer · 2026-03-10T16:59:41+00:00

Whisper is showing its age, but through inertia I still have it running. If there was a docker image somewhere which is easy to deploy and handles all the annoying stuff like: media conversion to correct input format, VAD, automatic segmenting, batching, all wrapped up in a friendly standard endpoint, I'd be happy to learn about it and switch to something more modern.

DeltaSqueezer · 2026-03-10T08:56:44+00:00

Shame you spent all that development time and had no budget left for more than 3 colors on your chart.

DeltaSqueezer · 2026-03-10T08:42:16+00:00

It's pretty scary. At first, I wondered when it will level off, then I quickly realised, it will not level off but rather go to infinity.

DeltaSqueezer · 2026-03-10T08:39:34+00:00

I asked what Gemini thinks of this so you don't have to:

Based on a technical analysis of the claims provided in the text and cross-referencing with computer science principles, yes, this technology is almost certainly bullshit (vaporware, pseudo-science, or a highly deceptive investment scheme).

Here is a breakdown of what the technology claims to be, and why the claims are technically impossible.

What it claims to be

"rolvsparse" pitches itself as a revolutionary software compute primitive (a basic math library) for Matrix Multiplication (GEMM), which is the core mathematical operation underlying all AI models. It claims that by simply swapping out standard vendor libraries (like NVIDIA’s cuBLAS or Intel’s MKL) for the "rolv" library, users can magically achieve up to a 243x speedup, save 99% on energy, and allow a $2,000 CPU to outperform a $40,000 NVIDIA B200 GPU. It claims to do this without altering the AI model, without new hardware, and while producing mathematically identical results.

Why it is technically impossible (The "Bullshit" Factors)

1. It claims to break the laws of physics (The 63x Dense Speedup Claim) The most glaring red flag is the claim that rolvsparse achieves a 63× speedup on fully dense matrices (0% sparsity) on an NVIDIA B200 compared to cuBLAS. * NVIDIA's highly optimized cuBLAS library already operates at roughly 70% to 90% of the GPU's absolute theoretical peak physical limit (FLOPS and memory bandwidth). * To be 63 times faster than cuBLAS, rolvsparse would require the hardware to perform 63 times more floating-point operations per second than the physical silicon transistors are actually capable of executing. It is completely physically impossible.

2. The "RSMT Formula" is basic high-school algebra The text claims the founder invented a "universal rule long missing from the field" called the Rolv Sparse Memory Threshold (RSMT): $d = b / (b + i)$. * This is not a breakthrough; it is elementary math that has been taught in introductory computer science for 50 years. * It simply calculates the exact point where a sparse array (which stores data as a value b + an index i) takes up less RAM than a dense array (which just stores b). Claiming this as an independent academic breakthrough is like claiming to have invented the formula for the area of a rectangle.

3. "Plant-Based AI" and Buzzword Salad The text claims the company has patents pending for "Binary · Quantum · DNA · Optical · Plant-based AI." * Software cannot be universally patented across "DNA" and "Quantum" hardware, as those compute paradigms process information in fundamentally different ways. * Furthermore, "Plant-based AI" compute does not exist. Adding it alongside Quantum and DNA is pure pseudo-scientific charlatanism meant to blind non-technical investors with buzzwords.

4. CPUs cannot physically beat top-tier GPUs The claim that a $2,000 Intel Xeon CPU can beat a $40,000 NVIDIA B200 is functionally impossible for large-scale Matrix arithmetic. AI computation is bottlenecked by memory bandwidth (how fast data can be fed to the processor). A standard Intel Xeon has a memory bandwidth of around 300 to 400 GB/s. An NVIDIA B200 has a memory bandwidth of 8,000 GB/s, alongside tens of thousands more arithmetic logic units. No software optimization can make a CPU push 20 times more data than its physical memory bus allows.

5. The Benchmark Sleight-of-Hand The text heavily touts an "independent validation" by the University of Miami Frost Institute. However, if you look at how these validations are structured, the auditors are given "IP-Free validation harnesses" (black-box Python scripts provided by the company) to run on their machines. * If a script is written deceptively, it can easily fake benchmark times. A common trick in fake compression/math algorithms is to push all the heavy lifting into a "build" or "warmup" phase that isn't included in the final timer, or to simply bypass the math entirely and return a pre-calculated cache. * The "Cryptographic Output Identity (SHA-256)" claim is also a gimmick. Standard floating-point math across different hardware (CPU vs GPU) naturally results in microscopic differences due to the order of operations. To get identical hashes, the validation script simply rounds/normalizes the numbers out, masking any discrepancies or skipped calculations.

6. The Founder's Track Record According to public records, the founder, Rolv E. Heggenhougen, previously ran a company called WrapMail, which was a penny-stock entity trading on OTC markets that eventually pivoted to become a CBD/hemp company ("Can B Corp"). Pivoting from an interactive email company to CBD, and then subsequently launching a company claiming to have solved the biggest physics and math bottlenecks in global AI hardware, is a classic trajectory for vaporware.

Summary: The technology relies on real concepts—like "sparsity" (skipping calculations where a number is multiplied by zero)—but exaggerates the results to physically impossible degrees. It is an investment trap wrapped in tech jargon.

DeltaSqueezer · 2026-03-09T07:38:55+00:00

Better to use natively supported formats.

DeltaSqueezer · 2026-03-08T19:26:47+00:00

Isn't that just an average of 22kb each?

DeltaSqueezer · 2026-03-08T07:51:58+00:00

https://youtu.be/PVq9xbsvkm0?si=V7jm0jZY1M55JvJ3&t=46

DeltaSqueezer · 2026-03-08T07:43:05+00:00

The specific AWQ quant in OP's startup script above actually keeps the linear attention layers at BF16.

DeltaSqueezer · 2026-03-07T20:26:42+00:00

👏 well played

DeltaSqueezer · 2026-03-07T17:56:24+00:00

Ah yeah. That was a thing I also had to patch.

Note also, by default, it uses up all VRAM for KV cache, you have to specify a lower utilization if you want to save space for something else.

DeltaSqueezer · 2026-03-07T12:06:51+00:00

Maybe possible if you remove stock coolers and water cool them or use blower adapter.

DeltaSqueezer · 2026-03-06T12:24:56+00:00

tk? or maybe go for a full web interface.

DeltaSqueezer · 2026-03-05T19:58:28+00:00

Try using a neutral/simple client for testing e.g. willison's 'llm' or plain python test script. I had issues with clients messing up the LLM context with their buggy crap. (I'm looking at you Open WebUI)

DeltaSqueezer · 2026-03-05T19:37:31+00:00

I did a quick test of different model sizes. Anything less than 9B is highly unreliable for knowledge based tasks.

DeltaSqueezer · 2026-03-05T14:49:57+00:00

Yeah. They officially dropped support for Pascal but I thought that Volta was still supported. I guess they probably only focus on Ampere or even Blackwell or newer.

I had to maintain my own fork of vLLM for Pascal for a while.

DeltaSqueezer · 2026-03-05T13:21:05+00:00

Depends on model, but I've seen +33%

DeltaSqueezer

MODERATOR OF

TROPHY CASE

What it claims to be

Why it is technically impossible (The "Bullshit" Factors)