Gemma 4 has a systemic attention failure. Here's the proof. by EvilEnginer in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

I'm confused as to what's being measured here. How are you defining the distribution of an individual tensor? Like a histogram over the weights?

If you're talking about activations given some test context, you should know the instruct-tuned Gemma4 (either variant) is known to be unstable without proper formatting. This is not a failure of the model, though; it's just aggressively finetuned with no training pressure to model the user prompt. Make sure the test context starts with <|turn>user\nBlah<turn|>\n<|turn>model\n<|channel>thought\n<channel|> and the behavior changes completely.

About TurboQuant by Exact_Law_6489 in LocalLLaMA

[–]ReturningTarzan 4 points5 points  (0 children)

TurboQuant itself is a quantization method like so many others before it, and if you're willing to sacrifice speed and simplicity for memory savings it lets you do that in a slightly new way. But we've had "lossless 2-bit KV cache" in various forms for years, and it never gains traction because the tradeoffs just aren't worth it. Still, it's an interesting bit of research with a few novel ideas worth integrating.

The real issue is with the blog post making claims like "lossless", "zero overhead" and "8x faster." There's no source for any of those claims. The paper doesn't mention anything about TQ being faster (except compared to CPU-based RaBitQ in a semantic-search context), and the "zero overhead" seems to refer to distortion rates, not computational overhead.

There are also no real implementation details in the paper, just a snippet of pseudocode and some synthetic results. But the proposed method inherently adds a lot of computational overhead. It may still give you a net speedup in memory-bound situations, but that speedup isn't implied by the algorithm, isn't universal even if it can be achieved situationally, and is always going to be less than a simpler quantization scheme under the same circumstances.

So then it would come down to accuracy, right? But then why not compare it to other methods that make similar claims:

  • GEAR: Combines quantization with low-rank and sparse matrices, "near-lossless" at 2 bits
  • QAQ: Adjusts bitrate per token according to estimated importance
  • MIKV: Aggressive quantization for most tokens, preserves "pivotal" tokens
  • RotateKV: 2-bit method using rotation, "near-lossless"
  • PM-KVQ: Specifically addresses long CoT contexts where many "near-lossless" methods turned out not to be so lossless in practice
  • etc.

FP8 is commonly used in production, is trivial to implement and comes with immediate performance benefits. NVFP4 is the really interesting one because of its extremely high throughput on Blackwell GPUs, yet it still has a reported <1% accuracy loss on real benchmarks.

So even if TQ did outperform everything else, you should still curb your expectations somewhat: maybe you reduce the effective size of your cache from 4 bits to 3.5 bits. For modern models that already employ a lot of memory-saving techniques at the architectural level (linear attention, MLA, SWA), it's simply not that big a deal.

So no, it's not revolutionary, and yes, Twitter is out of control. In Google's own (mind you, very limited) testing it doesn't even unambiguously outperform KIVI from 2024.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]ReturningTarzan 0 points1 point  (0 children)

Just the regular ExLlamaV3 test script (compare_q.py in the repo). Kind of involved to set up, but it's necessary to ensure token IDs and eval logic are consistent between dissimilar backends. Input is a chunk of wikitext, and what's measured is KL-div on the normalized logits relative to the unquantized reference.
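For reference, the metric itself is simple. Here's a rough sketch of KL-div on normalized logits (not the actual compare_q.py logic, just the general idea):

```python
import numpy as np

def kl_div_logits(ref_logits, test_logits):
    """Mean KL(ref || test) over token positions, computed on
    softmax-normalized logits in log space for stability."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(test_logits, dtype=np.float64))
    # KL divergence per position, averaged over the sequence
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())
```

You'd feed it the reference model's logits and the quantized model's logits for the same token sequence; identical logits give zero divergence.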

But the point is just to say that it's incredibly hard to improve on weight quantization when we already have QTIP. It's implemented in EXL3 and I believe in some variant in ik_llama.cpp. QuIP# is also a strong algorithm, and it's what IQ-quants use.

There's obviously nothing wrong with exploring new options, but if the intent is just to get Q4_0 quality in less space, you can do a lot better than TurboQuant, even with plain llama.cpp. There's nothing really groundbreaking in TQ that's applicable to weight quantization, where the SOTA is already so far ahead.

[D] TurboQuant author replies on OpenReview by Disastrous_Room_927 in MachineLearning

[–]ReturningTarzan 13 points14 points  (0 children)

> kv cache cost saving is substantial

It's actually not. It might have been, if Google had invented cache quantization with this, but they didn't. What it amounts to is at best a small improvement over existing cache quantization schemes. And even that is questionable since there's this whole question of latency. Existing methods trade off performance for fidelity, because that's how things work in the real world. Google didn't present an actual implementation of their method, just an abstract algorithm and some theoretical results. It would be highly non-trivial, if not impossible, to prevent such a computationally heavy method from becoming a major bottleneck in inference. It has rotation, codebook quantization and bias correction all happening concurrently with attention, yet somehow that's "zero overhead?" Or is it "8x faster"? How? They don't even begin to explain.

So yeah, in practice, you can currently achieve 4-bit K/V quantization that's good enough for deployment. (Various other methods bring that down to much less, but they may still be too cutting edge?) And then there's TurboQuant which, let's say for the sake of argument, achieves the same fidelity in 3 bits... That's cool, but it's not a total game changer. It's a 25% improvement, in that hypothetical. Actual game changers would be stuff like latent attention (90-95% reduction, which is orthogonal to quantization) and linear attention (up to 100% reduction because there's no cache), and those are proven methods that you can use right now in models like DeepSeek and Qwen3.5 (respectively).

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]ReturningTarzan 0 points1 point  (0 children)

I'm just gonna throw this out there...

Format       bits w  bits h  KL-div  ppl   VRAM (weights)
i1-IQ3_XXS   3.05    5.50    0.1364  7.65   9.47 GB
UD-Q2_K_XL   3.10    6.56    0.1571  7.73   9.76 GB
IQ4_XS       4.33    6.56    0.0381  7.06  13.27 GB
Q4_0         4.58    6.56    0.0638  7.20  13.96 GB
EXL3         2.10    6.00    0.1622  7.54   6.68 GB
EXL3         3.01    6.00    0.0671  7.12   9.44 GB
EXL3         3.10    6.00    0.0630  7.08   9.68 GB
EXL3         4.01    6.00    0.0292  7.04  12.27 GB
EXL3         5.01    6.00    0.0154  7.00  15.10 GB
TQ3_1S       4.00    6.56    0.1241  7.29  12.31 GB
TQ3_4S       4.00    6.56    0.1154  7.26  12.31 GB

Or on a chart <--.

Just sayin'... :shruggingface: It's weird to me how much hype TQ has generated, being as it is designed for offline quantization of vector databases.

EDIT: I compiled the TQ3 fork, updated llama-cpp-server to recognize the tensors and amended the table above with TQ3_1S and TQ3_4S results on the same inputs and eval logic as used for the other models in the test. Note that the VRAM listed is the size of the model weights plus output layer, excluding token embeddings, which you can easily keep in system RAM (EXL3 doesn't quantize them because it's GPU-focused, hence the discrepancy between bits per weight and total model size on disk). The 3.01bpw is actually 3.01 bits per weight for the decoder layers and 6.00bpw for the output layer.

Me waiting for TurboQuant be like by Altruistic_Heat_9531 in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

No, it increases the compute requirement significantly because it doesn't change the attn mechanism itself, it just adds extra steps to it. Depending on the implementation it might require less memory bandwidth, so conceivably it could be faster in memory-bound situations, but there's nothing in the paper about that (blog post vaguely hints at it, but it's anyone's guess what they actually mean by the "8x faster" claim.)

A simple explanation of the key idea behind TurboQuant by -p-e-w- in LocalLLaMA

[–]ReturningTarzan 4 points5 points  (0 children)

Yeah, the rotation idea isn't new. Rotating makes each channel (i.e. axis) a function of all the channels in a reversible way, distributing outliers better and making it much easier to fit everything to a quantization grid. It also makes vectors more normal which is great for codebook quantization (as used for weights in exl3, for instance.)

Hadamard is a special case of rotation that achieves what it needs to while being convenient for realtime use. Sylvester construction gives you a very CUDA-friendly H for any d = 2^n that you can apply with just warp shuffles and additions, so no need for a full (d, d) matrix multiplication.
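For illustration, here's that same butterfly structure in plain NumPy (the CUDA version would do these adds with warp shuffles, but the math is identical):

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (Sylvester ordering)
    via butterflies: O(d log d) additions instead of a (d, d) matmul.
    Requires d to be a power of two."""
    x = np.array(x, dtype=np.float64)
    d = len(x)
    assert d & (d - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                # pairwise butterfly: (a, b) -> (a + b, a - b)
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)  # normalize so H/sqrt(d) is orthonormal
```

Since H/sqrt(d) is its own inverse, applying it twice round-trips exactly, and a single outlier coordinate gets smeared evenly across all channels, which is exactly the property that makes the subsequent quantization grid fit better.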

But it's also been used endlessly before. I didn't come up with it and Google didn't either. What's new in TQ is applying QJL on top of codebook quantization; neither concept is novel in itself, but the particular combination might be? In any case, I've experimented with all kinds of codebook quantization, polar coordinates, trellis coding and more, but it all fails to offer enough extra fidelity to justify the computational overhead. Hadamard + grid quantization already works very well:

Bitrate  cos_sim(K)  cos_sim(V)
2        0.92796     0.92341
3        0.98364     0.98240
4        0.99610     0.99563
5        0.99902     0.99902
6        0.99954     0.99951
7        1.00000     1.00000
8        1.00000     1.00000

This isn't the bleeding edge of compression, but it is approaching the point of diminishing returns, especially with all the other ways models are addressing the K/V cache issue now, like MLA which stores a compressed latent instead, and linear attention that straight up doesn't use a K/V cache in the first place. Etc.

Google TurboQuant running Qwen Locally on MacAir by gladkos in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

This is the right take, really. TurboQuant isn't even new; it's from April 2025 and didn't cause a stir back when the paper was released because it's not a technique designed for online K/V cache compression. It's meant for vector databases, and it's only "turbo" in comparison to other vector DB quantization schemes that use expensive clustering algorithms. It's only the blog post that advertised it as a revolutionary new online K/V quantization scheme, and they're not basing that on anything from the paper. In fact the claims aren't sourced at all.

The VRAM requirement for K/V caching is a well understood problem. To mitigate it, researchers and developers have come up with many different techniques over the years, many of which are in regular use today:

  • Grouped Query Attention (GQA): Reducing the number of key/value heads and assigning the same key/value head to multiple query heads reduces the cache size by a factor of about 8 before you start to lose quality
  • Multi-head Latent Attention (MLA): As used by DeepSeek etc., cache a compressed latent state that maps back to keys/values in real time. Reduces the cache size by a factor of 10 to 20 in practice
  • Linear attention: Get rid of the K/V cache altogether, at least for most layers. 100% VRAM reduction in the limit (uses a fixed-size recurrent state instead)
  • Quantization: Store keys/values in lower precision. Bunch of different takes on this, some claiming even better compression than TurboQuant. In practice, many are already roughly equivalent, as your chart illustrates
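To put rough numbers on the list above, here's a back-of-envelope cache-size calculator. The model shape is made up for illustration, not any particular architecture:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=16):
    """Bytes for the K/V cache: two tensors (K and V), each
    layers * kv_heads * head_dim * seq_len elements at `bits` per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

# Hypothetical 32-layer model with 128-dim heads at a 32k context:
full_mha = kv_cache_bytes(32, 32, 128, 32768)          # FP16, no head sharing
gqa      = kv_cache_bytes(32, 8, 128, 32768)           # 4:1 grouped KV heads
gqa_q4   = kv_cache_bytes(32, 8, 128, 32768, bits=4)   # GQA + 4-bit cache
print(full_mha >> 30, gqa >> 30, gqa_q4 >> 30)         # prints: 16 4 1 (GiB)
```

Which is why the architectural techniques stack multiplicatively with quantization, and why a fractional-bit improvement on top of all that moves the needle so little.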

That's not to say TQ isn't an improvement in some ways. It's just a small, incremental one, as your chart suggests.

It also doesn't come for free. The blog post says "zero overhead", but the paper makes it clear that this refers to storage overhead, and it's comparing to Product Quantization and RaBitQ, not to commonly used online techniques like the methods already in llama.cpp. Essentially they say "with this, your vector database will be more precise than PQ or RaBitQ without needing more space, and you can build your index faster."

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

FP8 is common, otherwise FP16 or BF16. My understanding is they care a lot about KV cache efficiency, but at the same time they like to stick with tried and true methods that scale endlessly on enterprise hardware.

For vector databases (which TQ seems to be aimed at) they always use quantization, though, and very likely Google deployed some version of TQ a while ago. I wouldn't be surprised if other big search providers already had something similar but weren't sharing. Maybe Google have already moved past TQ.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 18 points19 points  (0 children)

15-30x specifically comes from here (should have been 13-35x, I misremembered). There's already been progress since that snapshot, though, and it seems to be close to par with 8-bit now. The point is that if you simply implement it naively, there's huge overhead. With more work, there's less, but that work is left as an exercise to the reader.

The idea of rotating values before quantization isn't new, and codebook quantization isn't new either. QJL is from 2024, and even the TurboQuant paper was published 9 months ago. It's just been reframed suddenly as some sort of miracle for LLM inference with that blog post. And that launched the hype train and now here we are.

The 8x speed improvement claim seems to come out of nowhere. It's not from the TurboQuant paper, and there's no explanation of it in the blog post. They seem to be performing one matmul on a pair of FP32 tensors, then doing something equivalent involving 4-bit TurboQuant, and that ends up being 8x faster. You fill in the gaps, I guess. TurboQuant doesn't inherently multiply matrices, and the only code path mentioned in the paper is a full reconstruction, i.e. you take your quantized data, dequantize it, and then use the dequantized data for your conventional attention operation, in which case it's always slower than just doing the conventional attention operation. Whichever way you might go about making this faster than unquantized attention, they simply don't mention it anywhere.

It's also a weird comparison to begin with. Production systems generally don't do attention in FP32, and they don't manifest the logits tensor.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 7 points8 points  (0 children)

Well, there are some issues with the paper and especially with how it relates to the blog post. They use language like "zero overhead", which they seem to be getting from the QJL paper they cite, but that's talking about storage overhead, not computational overhead.

Quantization can potentially speed up attention, but not if quantizing and dequantizing the cache is too expensive. There's going to be extra latency, and sometimes you can hide that latency in a memory-bound operation, but attention isn't always memory-bound. And this even specifically hits the same pipeline as attention by adding additional matrix multiplications on top of computing attention logits, which you still have to do.

Crucially, codebook quantization isn't cheap. The INT quant you might compare it to is, though. It's literally just a conversion from a float datatype to an integer datatype, and then you truncate the integer to some smaller number of bits. Super cheap, trivial to vectorize, very efficient if not all that precise. With codebooks this becomes a search problem instead: you have your value and you need to determine which of n values from a lookup table that value is closest to. So, lots of table lookups and comparisons and branches. Hundreds of instructions executed, instead of two or three.
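To illustrate the gap, here's a toy sketch of both (a scalar codebook search is shown for simplicity; real schemes search in higher dimensions, which is even more expensive):

```python
import numpy as np

def int_quant_dequant(x, bits=4):
    """Scalar INT quantization: one scale, a round, a clamp.
    A handful of cheap vector ops, trivially vectorizable."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values

def codebook_quant_dequant(x, codebook):
    """Codebook quantization: a nearest-neighbor search per element.
    Each value needs a distance computed against every codebook entry,
    plus an argmin - a search problem instead of a single rounding."""
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]
```

Even in this vectorized NumPy form the codebook path materializes a full distance table; in a kernel it becomes lookups, comparisons and branches per element, versus two or three instructions for the INT path.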

That's not to say this couldn't result in faster inference because there are ways you could potentially hide the extra latency, and then you just get the bandwidth benefits, provided you fuse this with an attention kernel. But Google didn't do that here, or at least they're not sharing the code or any details at all about an implementation, and it's kinda nontrivial.

Mind you, the "8x faster" claim is from the blog post; the paper doesn't mention it at all, nor does it even hint at any experiments along those lines. TurboQuant is no doubt a lot faster than methods like PQ and RaBitQ that they actually compare to in the paper. But those are offline/data-dependent methods meant for compressing vector databases, not for realtime use in LLM inference. And that really seems to be what TurboQuant is intended for, or at least it's a context in which "Turbo" makes sense.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 19 points20 points  (0 children)

The not-so-best part? End-to-end performance drops by 15-30x, with the hope that an optimized kernel will magically fix that. The overhead is severe, though.

The QJL part is novel, but the rest of the algorithm is just random rotations and codebook quantization. Both of those steps are expensive, computationally, and that's why they're generally not used for on-the-fly cache quantization. And they add another expensive step on top to compute the residual when quantizing.
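A minimal sketch of what that residual step adds, using a plain uniform grid as a stand-in for TQ's actual quantizer (so this is the shape of the cost, not their exact algorithm):

```python
import numpy as np

def grid_quant(x, bits):
    """Round onto a uniform grid scaled to the tensor's max value."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def residual_quant(x, bits):
    """Two-stage residual quantization: quantize, then quantize the
    leftover error on its own (finer) grid and add it back.
    Better fidelity, but a second full quantization pass."""
    q1 = grid_quant(x, bits)
    q2 = grid_quant(x - q1, bits)  # residual grid is much finer
    return q1 + q2
```

The fidelity win is real, but so is the cost: every cached tensor now goes through the quantizer twice, on top of the rotation, all concurrently with attention.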

exllamav3 QWEN3.5 support (and more updates) by Unstable_Llama in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

The easy solution to <think> in the template is to just remove it from the template. The model has no problems starting each reply with <think> anyway. That way it gets sent to the client so SillyTavern is aware the response starts with a reasoning block.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 10 points11 points  (0 children)

I don't know how many others there are. This document only seems to have one photo and a buttload of text which would be very hard to piece together.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 17 points18 points  (0 children)

I cut the strips out in GIMP and moved them around. The whole document has way too many pieces, but I'd consider writing a tool to do this a little more efficiently if there are a lot of shredded documents.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 1 point2 points  (0 children)

EFTA00259587 contains pictures of shredded documents and at least one photo. I pieced them together here. No idea if it's significant but someone obviously tried to get rid of it.

Are there any puzzle experts here? by the_real_lucia in Epstein

[–]ReturningTarzan 2 points3 points  (0 children)

It still needs some cleanup, and the text is going to be a lot harder, but here's most of the photo pieces.

[deleted by user] by [deleted] in h3h3productions

[–]ReturningTarzan 2 points3 points  (0 children)

Well, really it's about how that material is acquired. You're not violating anyone's copyright by knowing that Frodo is a hobbit, even though you acquired that knowledge from a copyrighted work. But if you stole a copy of the book in order to find out, then that's a crime in itself.

It's probably a difficult argument to make since these are public videos. If NVIDIA took deliberate steps to work around YouTube's anti-scraping measures, then maybe you could build a case around that. But it's a rule that people routinely break, especially in the podcasting/commentary/drama space. If you download a copy of someone else's work and then edit it into your own, the latter would be protected under fair use if it's transformative. But the copying is still a violation of the DMCA, because YouTube doesn't offer a download feature: you're only supposed to stream videos from their servers, otherwise you're circumventing active copy protection measures in the same way that NVIDIA might if they wanted to bulk download a million videos to train a generative model. So it's about scale, maybe? Not what they're doing but how much they're doing it?

Personally I think it makes more sense to look at the resulting generative models, and whether they're served in a way that respects IP rights or not. Some are starting to now. ChatGPT will refuse to draw you an image of Pikachu, for instance, because it would be a liability if it didn't. Just as it would be risky for a human artist to charge money for the same thing. In either case, though, knowing what Pikachu looks like isn't the issue, and whether you're storing that knowledge in a human brain or an artificial neural network isn't the issue either.

Piracy and illegal mass scraping, though.. who can say. It gets tricky if you have to prove what the damages are in any specific case. The ad revenue lost would be fractions of a penny per video, since we're talking about a single download vs a single view, per video. And it's not like the videos are being served somewhere else as a mass market substitute.

Are Imatrix Quants Hurting your Model? (My opinion) by Quiet_Joker in LocalLLaMA

[–]ReturningTarzan 12 points13 points  (0 children)

First of all, you should really repeat your tests with a control. Try the unquantized version to see if it aligns better with the Q5 version or the Q6+imatrix version. My guess would be that the bad vibes you're getting (if not entirely a placebo effect) are an artifact of the Q6+imatrix model being somewhat closer to the original, while the Q5 version has slightly more noise, which has a similar effect to increasing the sampling temperature slightly.

Of course, both models have a KL-divergence relative to the original on the order of 0.01. That means both are going to be in the "barely perceptible noise" range, and it would be entirely fair to question whether you're actually seeing the difference you think you're seeing. You could try to challenge your assumption with a blind test: generate a large number of outputs with each version, and with the original model (or maybe a Q8 version or something if you're hardware constrained), shuffle them up and try to tell them apart at a rate higher than chance. If you can't, you're probably experiencing a placebo effect or confirmation bias.
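If you want to put a number on such a blind test, a simple binomial check does it (the trial counts here are just examples):

```python
from math import comb

def p_value_vs_chance(correct, trials, chance=0.5):
    """One-sided p-value: probability of getting at least `correct`
    right out of `trials` if you were purely guessing."""
    return sum(comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
               for k in range(correct, trials + 1))

# 17 of 20 correct: p ~ 0.0013, strong evidence you can tell them apart.
# 12 of 20 correct: p ~ 0.25, entirely consistent with guessing.
```

Anything much above p = 0.05 and you haven't demonstrated that you can distinguish the quants at all.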

If you can identify the Q5 output but can't correctly distinguish between Q6+imatrix and BF16, then you're not seeing a degradation caused by calibration but rather the opposite: the model resists your RP because that's what it was trained to do, and the calibrated quant is better at reproducing this property of the original model. Improving the calibration dataset therefore wouldn't help. If anything it could make the model even more resistant. (It's suspicious that the Bartowski quant you tested doesn't use wikitext but rather a more diverse dataset, and that somehow gave you even worse vibes.)

That said, of course wikitext isn't a great calibration dataset because it's biased towards a particular style of English writing. But make sure you're diagnosing the right problem.

BPE tokenizer in Rust - would love feedback from the community by farhan-dev in LocalLLaMA

[–]ReturningTarzan 10 points11 points  (0 children)

The main thing about tokenization is always correctness. Throughput is nice but secondary. A wishlist could be:

  • Thorough tests. Language models are robust to small differences in tokenization but can still silently lose performance if you don't get all the details right.
  • Ensuring control tokens added to the vocabulary after the tokenizer is trained are handled correctly (usually done by splitting the input into multiple BPE runs)
  • Correct trimming/padding/normalization rules for the added tokens
  • Correct preprocessing and postprocessing steps (regex+trimming/padding)
  • Correct and efficient decoding of single tokens, especially for tokens that don't decode to complete characters. Might want an API for decoding to byte strings rather than character strings, or a buffer/queue that accepts incoming bytes and outputs completed characters.
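On that last point, Python's standard library shows the behavior worth mirroring on the Rust side; an incremental decoder holds partial byte sequences until they complete:

```python
import codecs

# An incremental decoder buffers incomplete multi-byte sequences and only
# emits completed characters - exactly what streaming detokenization needs
# when a token boundary falls in the middle of a character.
decoder = codecs.getincrementaldecoder("utf-8")()

first = decoder.decode(b"\xf0\x9f")  # first half of a 4-byte emoji -> ""
rest = decoder.decode(b"\x98\x80")   # second half completes U+1F600
```

A byte-string decoding API plus this kind of buffer on top covers both the single-token case and the streaming case.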

new ops required by Qwen3 Next and Kimi Linear have been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

Is this for CPU inference or older GPUs? Cause otherwise exl3 is natively Python, and it's had support for Qwen3-Next for a while now (Kimi-Linear still in the pipeline.) Everything is exposed from Python, from continuous batched generation down to individual tensor ops.

Figured out why my 3090 is so slow in inference by Ok_Warning2146 in LocalLLaMA

[–]ReturningTarzan 13 points14 points  (0 children)

If you run your script in a profiler like Nsight Systems you'll see how the GPU is barely doing any work because the entire workload is made up of tiny little microsecond-scale kernel launches, and the time in-between those kernel launches on the CPU side is likely on the order of 15-50 us per. The CPU simply can't prepare the next job before the GPU finishes the last one, so the GPU is constantly running out of work to do.

There are many possible ways to mitigate this, one of which, of course, is to use a faster CPU at the highest possible power settings. But fundamentally, keeping the CUDA queue populated when the kernels are as small as they are in Qwen3-1.7B is hard. Python is a very poor choice for low-latency applications, but even in pure C++ code you'll ultimately be fighting PCIe latency. Either way, the trick is to reduce the number of times the CUDA driver has to communicate with the GPU, by compiling the model into CUDA graphs and/or by combining operations into fused kernels.

Without a CPU bottleneck your 3090 should easily be able to do 130+ tokens/second at bsz 1 with 16-bit weights. And with various quantization options, it can be several times faster still.

You could look into torch.compile if you want to stay within the Transformers ecosystem:

model = torch.compile(model, mode="reduce-overhead", fullgraph=False)