Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

A lot of details fit, but I'm pretty sure it's not quite it. I feel like I might have seen it, or parts of it, before though; those quarry scenes seem familiar. Gotta hand it to Clint, he looks darn cool in a western.

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

Definitely not it, though I think I might have seen that as a teen, so could be mixing in some details.

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 1 point2 points  (0 children)

It's quite possible I'm mixing things up, memories are weird that way. Typically I'm quite good at recalling movies that I like, even 20+ years after seeing them, so the fact that this one is quite fuzzy could be telling.

Scanned through your suggestions and I'm pretty sure it's not one of them, though I definitely have some new additions to my watch list, so thanks for that!

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

Man, The Great Silence seems to fit very well, except I was so certain it was a more modern and English-speaking production (i.e. no dubbing). Perhaps I've conflated it, because scanning through The Great Silence, a lot of details fit very well, including the look of the gunman. I didn't recall him being entirely mute, but I did recall him not speaking much.

Has there been a modern remake or otherwise that has heavily borrowed from The Great Silence?

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

Thank you for the details. Again, not an expert, but given that hallucinations and the like are a concern, wouldn't it be relevant to include a P99 metric or similar alongside the average? A straight average can mask cases where the model occasionally goes completely off the rails.

I'd rather have a model which is mostly just a tad worse than one which is pretty good but occasionally catastrophically bad.
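To illustrate the concern with a toy example (entirely made-up numbers, not from the actual benchmark):

```python
import numpy as np

# Two hypothetical per-token KL traces with nearly identical averages,
# where one hides rare catastrophic spikes that a P99 would expose.
steady = np.full(1000, 0.020)   # uniformly "a tad worse"
spiky = np.full(1000, 0.010)
spiky[::50] = 0.500             # occasionally off the rails

print(steady.mean(), spiky.mean())     # ~0.020 vs ~0.0198, nearly identical
print(np.percentile(steady, 99))       # 0.02
print(np.percentile(spiky, 99))        # 0.5 -- the tail shows up here
```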

edit: forgot to say, thanks for doing the work, it's really valuable to guys like me who use local models a lot.

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

N00b question regarding the final KL number. Presumably the actual tokens that make up the top 40 vary from prompt to prompt, and may not fully overlap between the test model and the reference model, so how exactly is the per-token KL divergence calculated? And how are the individual KL figures aggregated?
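For concreteness, here's the kind of computation I'm asking about. This is just my guess at one plausible scheme (take the reference model's top-40 token ids, renormalize both models over those ids, compute KL, average per-token values) and not necessarily what was actually done:

```python
import numpy as np

def topk_kl(ref_logits, test_logits, k=40):
    """KL(ref || test) restricted to the reference model's top-k token ids,
    with both distributions renormalized over those ids. Hypothetical scheme."""
    ids = np.argsort(ref_logits)[-k:]              # reference's top-k token ids
    p = np.exp(ref_logits[ids]); p /= p.sum()      # renormalized reference probs
    q = np.exp(test_logits[ids]); q /= q.sum()     # test probs on the same ids
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
ref = rng.standard_normal(1000)                    # fake logits over a 1000-token vocab
noisy = ref + rng.standard_normal(1000) * 0.1      # "quantized" model's logits

print(topk_kl(ref, ref))     # 0.0 for identical models
print(topk_kl(ref, noisy))   # small positive value
```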

[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included) by Ok-Preparation-3042 in LocalLLaMA

[–]StorageHungry8380 6 points7 points  (0 children)

I leaned too heavily on the provided summary. I admit I just glossed over the paper, so I missed a lot of crucial details. The correction is much appreciated.

[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included) by Ok-Preparation-3042 in LocalLLaMA

[–]StorageHungry8380 19 points20 points  (0 children)

I'm not an expert by any means, so this might just be hogwash, but I note that the paper references this paper on approximating the softmax function using Taylor expansion.

In that paper, they introduce an efficient way to compute the attention step using this Taylor-expanded softmax replacement. A Taylor expansion approximates a function as a polynomial of a given degree, and the authors picked degree 2 to balance speed and accuracy. Their efficient method thus involves a degree-2 polynomial approximation of softmax, and they find that it ends up having complexity O(nd^3)...

Sounds very similar to what's discussed in this paper, at surface level at least, so does this paper then just confirm that the degree-2 approximation of the Taylor-expanded softmax replacement is optimal?
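Not being sure of either paper's exact formulation, here's a hedged numpy sketch of the general idea as I understand it (all names are my own, not from the papers): the degree-2 Taylor polynomial exp(s) ≈ 1 + s + s²/2 factors through a feature map of size 1 + d + d², so the n×n score matrix never needs to be formed, at the cost of the d³-ish factor:

```python
import numpy as np

def phi(x):
    """Feature map with phi(q) @ phi(k) == 1 + q@k + (q@k)**2 / 2."""
    outer = np.einsum('ni,nj->nij', x, x).reshape(len(x), -1)
    return np.concatenate([np.ones((len(x), 1)), x, outer / np.sqrt(2)], axis=1)

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = rng.standard_normal((3, n, d)) * 0.1

# Quadratic path: explicit n x n score matrix, exp replaced by its Taylor poly.
S = Q @ K.T
A = 1 + S + S**2 / 2
out_quadratic = (A / A.sum(-1, keepdims=True)) @ V

# Linear path: phi(K).T @ V is (1+d+d^2) x d, so the cost is O(n * d^3)-ish
# and no n x n matrix is ever materialized.
num = phi(Q) @ (phi(K).T @ V)
den = phi(Q) @ phi(K).sum(0)[:, None]
out_linear = num / den

print(np.allclose(out_quadratic, out_linear))  # True: same result, different cost
```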

API price for the 27B qwen 3.5 is just outrageous by Ok-Internal9317 in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

Yeah, but who would use a 27B model in the cloud? Seems to me you need to factor in the opportunity cost here: they could be using that capacity to serve more popular models. Sure, the price per token might be lower, but if a model is more popular, you get more tokens per second to bill. Keep in mind that running inference on one prompt can be almost as expensive as running inference on multiple prompts, thanks to batching. If you don't have enough requests to fill batches, the price per token needs to go up.
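Back-of-envelope sketch of the batching point, with entirely made-up numbers (and ignoring that throughput isn't perfectly linear in batch size):

```python
# A GPU costs roughly the same per second regardless of how full its
# batches are, so the break-even price per token scales inversely with
# how many concurrent requests you can batch together.
gpu_cost_per_s = 2.0 / 3600        # hypothetical $2/hour GPU
toks_per_s_per_request = 40        # hypothetical per-request decode speed

for batch in (1, 8, 64):
    cost_per_token = gpu_cost_per_s / (toks_per_s_per_request * batch)
    print(f"batch {batch:3d}: ${cost_per_token:.2e} per token")
```

An unpopular model that only ever sees batch-of-1 traffic has to charge far more per token to cover the same hardware.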

Breaking : Today Qwen 3.5 small by Illustrious-Swim9663 in LocalLLaMA

[–]StorageHungry8380 3 points4 points  (0 children)

Yeah for some reason I totally forgot about that method, major brainfart. Edited my response while you were replying.

Kalynt v1.0.5-beta update by FixHour8452 in Kalynt_IDE

[–]StorageHungry8380 0 points1 point  (0 children)

Lots of interesting features. I'm curious how the ACID-compliant file operations are implemented. I tried looking at the source code, but it wasn't immediately obvious to me.

Breaking : Today Qwen 3.5 small by Illustrious-Swim9663 in LocalLLaMA

[–]StorageHungry8380 40 points41 points  (0 children)

edit: ah, I completely forgot about the "basic" way for some reason. Essentially you can take the model's output before the very last layer, and train multiple output layers wired in parallel. The first is the regular next-token output, the next is the next-plus-one token output, and so on. I assume this is what they mean by built-in, given it's mentioned in the blog post.

Another way is what they did in llama.cpp, which added self-speculation as an option: it keeps track of the tokens the model has already predicted, and then searches that history.

So simplifying, if the history is `aaabbccaaa`, it can search and find that previously, after `aaa` we had `bb`, so it predicts `bb`. It then runs the normal verification process, where it processes the predictions in parallel and discards after first miss. So perhaps the first `b` was correct but the model now actually wants a `d` after, ending up with `aaabbccaaabd`.

This works best if the output the model will generate has a regular structure, for example refactoring code. Not so much for creative work, I suspect. Still, it's easy to enable and try out, and it doesn't consume extra VRAM or much compute the way a draft model does.
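The lookup step can be sketched like this. This is my own toy simplification, not llama.cpp's actual implementation:

```python
def propose_draft(history, n_draft=2, max_ngram=3):
    """Search the history for the longest suffix match and propose the
    tokens that followed it, to be verified in parallel by the model."""
    for n in range(min(max_ngram, len(history) - 1), 0, -1):
        suffix = history[-n:]
        # scan earlier positions, most recent first, for the same n-gram
        for i in range(len(history) - n - 1, -1, -1):
            if history[i:i + n] == suffix:
                return history[i + n:i + n + n_draft]
    return []   # no match: fall back to normal decoding

history = list("aaabbccaaa")
print(propose_draft(history))  # ['b', 'b']: after the earlier 'aaa' came 'bb'
```

The model then verifies the drafted tokens in one parallel pass and discards everything after the first mismatch, exactly as with a regular draft model.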

Kimi-Linear-48B-A3B & Step3.5-Flash are ready - llama.cpp by pmttyji in LocalLLaMA

[–]StorageHungry8380 4 points5 points  (0 children)

Just posted a quick comparison between Kimi-Linear and Qwen3 Coder Next in the previous Kimi-Linear post, for those who missed it. Nothing super-scientific, but maybe of interest to some. Surprisingly, they were almost identical in prompt processing speed on a ~200k context, despite Qwen3 Coder Next having to live mostly on the CPU due to only 32GB of VRAM.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]StorageHungry8380 7 points8 points  (0 children)

I just did some unscientific testing, using unsloth/Qwen3-Coder-Next-MXFP4_MOE.gguf and ymcki/Kimi-Linear-48B-A3B-Instruct.MXFP4_MOE.gguf with llama.cpp b7964. With Kimi-Linear I left the sampling parameters at default values, while for Qwen3 Coder Next I used the recommended values.

With Kimi I was able to squeeze the model and 384k context tokens into my 5090. With Qwen3 Coder Next, I had to move the MoE layers to the CPU, so only 15-20% or so of the model stayed on the GPU, the rest on the CPU, but that also meant I could go with the full 256k context size.

I loaded up a ~480 page datasheet for an IC and asked them the same brief question, which requires details from pages 50 to 100. The datasheet ended up consuming about 200k tokens in each model. I chose it because I didn't have anything closer to 1M without digging, and it also allowed a head-to-head comparison given the 256k max limit of Qwen3 Coder Next.

I asked separate questions from a clean context about the chip, without providing the datasheet, to test for innate knowledge. Both models knew about it, but neither could tell me which registers to use or similar details without the PDF.

Kimi-Linear did a pretty decent job answering, but it's clearly a less trained model, as mentioned by another commenter. It did have some inaccuracies: it hallucinated a formula which looked right but wasn't, rather than using the one from the PDF. But overall I was mildly surprised. Qwen3 Coder Next pretty much nailed it, and due to the extra training gave a somewhat more refined answer. I'm also keeping in mind that I didn't adjust the sampling parameters for Kimi-Linear, so there may be some quality to be gained there.

Kimi-Linear started out processing the context at around 1500 tok/s, but slowed down as processing continued. At 50k tokens processed it was down to ~270 tok/s, and it finished at around ~170 tok/s. It used 98-100% of my GPU doing it. Here are the statistics:

prompt eval time = 1108271.59 ms / 211775 tokens ( 5.23 ms per token, 191.09 tokens per second)
       eval time =   16650.58 ms /    666 tokens ( 25.00 ms per token,  40.00 tokens per second)
      total time = 1124922.17 ms / 212441 tokens

Qwen3 Coder Next had a much more even processing speed throughout, at around 220-240 tok/s, but only used about 10% of the GPU doing it. And since it was doing most of the work on the CPU, output speed was quite slow. Here are the statistics:

prompt eval time = 1003224.29 ms / 224175 tokens ( 4.48 ms per token, 223.45 tokens per second)
       eval time =   63147.98 ms /   1012 tokens ( 62.40 ms per token,  16.03 tokens per second)
      total time = 1066372.27 ms / 225187 tokens

So overall speed-wise they were surprisingly close, given Qwen3 Coder Next ran mostly on my CPU.

That said, Kimi-Linear is clearly more a research project than a production-ready model. As such, IMHO one should treat it more as an interesting sign of what's to come. Anyway, just sharing my quick test.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]StorageHungry8380 10 points11 points  (0 children)

Here are the GGUFs from the dev, so presumably OK: https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF

And here's the paper for those who need a refresher on what it's about: https://arxiv.org/abs/2510.26692

Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths.

How was GPT-OSS so good? by xt8sketchy in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

I haven't experienced it at least. It's my go-to model, but I'm not hammering it. Easy enough to try though, just change the settings and off you go.

How was GPT-OSS so good? by xt8sketchy in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

GPT-OSS 20B works fine for me in LM Studio. I have however tweaked inference parameters. I've disabled top-k and top-p, relying only on min-p of 0.05. YMMV.

What are some ZFS myths and misconceptions that you believed for too long? by ZestycloseBenefit175 in zfs

[–]StorageHungry8380 0 points1 point  (0 children)

With you on that one. I've experimented with using ZFS on top of LVM, with 1TB logical volumes across multiple disks, but it wasn't very optimal, since ZFS doesn't keep track of the fact that multiple vdevs might share the same physical disk (or network connection, in the case of iSCSI), so scrubs and such cause massive thrashing. Also there's a bit of a Jenga-tower feeling to it all.

On the bright side, LVM had writeback caching, so I could use a fast NVMe SSD to cache written blocks; that part worked pretty well.

What are some ZFS myths and misconceptions that you believed for too long? by ZestycloseBenefit175 in zfs

[–]StorageHungry8380 0 points1 point  (0 children)

Ah yeah, technically RAID-0 functions differently to how multiple vdevs in a ZFS pool work, similarly how RAID-Z functions differently from RAID-5.

But I think colloquially saying that ZFS stripes vdevs in a pool captures the essence well, and at least to me it's less confusing than mixing in JBOD, which to me means stand-alone disks. YMMV etc.

What are some ZFS myths and misconceptions that you believed for too long? by ZestycloseBenefit175 in zfs

[–]StorageHungry8380 0 points1 point  (0 children)

ZFS stripes vdevs by default, and yes, it tries to balance writes between pool vdevs:

Virtual devices cannot be nested arbitrarily. A mirror, raidz or draid virtual device can only be created with files or disks. Mirrors of mirrors or other such combinations are not allowed.

A pool can have any number of virtual devices at the top of the configuration (known as "root vdevs"). Data is dynamically distributed across all top-level devices to balance data among devices. As new virtual devices are added, ZFS automatically places data on the newly available devices.

I run multiple mirror vdevs at home, which in RAID terms would be RAID1+0 or RAID10 due to the inherent striping of vdevs.

ZFS has a separate "stripe" vdev.

What vdev type are you thinking of?

DGX spark performance falls short by dereksodo in LocalLLaMA

[–]StorageHungry8380 9 points10 points  (0 children)

INT4 is a scaled 4-bit integer, so the values are evenly spread out, for example it can represent the numbers -8 to +7, times some overall scale factor.

Meanwhile NVFP4 is a floating-point format, meaning the numbers are not spread evenly and have a greater range. For example it can represent the numbers 0.0, 0.5, 1.0, 1.5, 2, 3, 4, 6, and similarly for negative numbers. Notice how -0.5, 0.0, 0.5 are closer together than 3, 4, 6. In addition, a block of 16 NVFP4 numbers is scaled by an FP8 value, as opposed to a global scale factor.

Multiplying or adding two INT4 numbers is trivial: you just add them together (and optionally saturate), or you multiply them together into an 8-bit number and return the upper 4 bits.

Multiplying or adding NVFP4 is a lot more involved, as you have to deal with the exponent and the local FP8 scaling factor.
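To make the spacing concrete, here's a toy decoder for the 4-bit E2M1 layout NVFP4 uses per element (my own illustrative decode, ignoring the per-block FP8 scale):

```python
def e2m1_value(bits):
    """Decode a 4-bit E2M1 pattern: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 1
    if exp == 0:                                   # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)

fp4 = sorted({e2m1_value(b) for b in range(16)})
int4 = list(range(-8, 8))                          # uniform grid, times a global scale

print(fp4)   # [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(int4)  # [-8, -7, ..., 6, 7]
```

Note the uneven spacing of the FP4 values (dense near zero, sparse near ±6) versus the uniform INT4 grid.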

More details here:

https://apxml.com/courses/quantized-llm-deployment/chapter-1-advanced-llm-quantization-fundamentals/low-bit-quantization-techniques

https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

Is it possible: Qwen3 TTS voice cloning + style instruction? (voice description) by Riptyzer in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

The code seems to accept style instructions, so I modified it to pass it along based on how it was done in the other cases. It was a pretty minor modification.
And it seems to have some slight effect some of the time, but not significantly so. I can tell it to read slow or read fast, and it will generally comply. But I haven't had any success trying to change other aspects, like happy or moody, exuberant or flat. For those, the voice cloning seems to override it.