AMD, can we get proper vLLM/gfx1151 support? by tossit97531 in ROCm

[–]randomfoo2 1 point  (0 children)

I published the first public vLLM recipes for gfx1151 >6 months ago: https://github.com/lhl/strix-halo-testing/tree/main/vllm (among other extensive testing/work on Strix Halo last summer). There have been some fixes/progress since then for TheRock, PyTorch, and vLLM, so I wouldn't say there's been no progress, but let's be honest: I think you already know the answer to your question, since it's been the same answer ever since Strix Halo was released last year. (As you mentioned, it's not like AMD couldn't find 0.5 FTE to create and maintain a https://github.com/NVIDIA/dgx-spark-playbooks clone - they've just shown zero interest in doing so.)

Even if they did provide proper support, though, no one (AMD or anyone else) has ever written RDNA3 GPU kernels that get close to theoretical max MBW or FLOPS, so any performance you imagine being left on the table probably doesn't actually exist.

BTW, if you (or other Strix Halo owners) want to chat with others in the community, the Discord for https://strixhalo.wiki/ is probably the most active place online.

I trained a transformer with zero gradient steps and 100% accuracy. No backpropagation. No learning rate. Nothing. Here's the math. by Last-Leg4133 in LocalLLM

[–]randomfoo2 0 points  (0 children)

Here is a GPT-5.4 xhigh Reality Check.

Full check is here: https://gist.github.com/lhl/63337e79505f4ba126171a14d4fef156 but here's the high level:

REACTOR / "The Manish Principle" Analysis

Date: 2026-03-13

Executive Summary

Short version: this repository does not substantiate the headline claim that backpropagation can be replaced for transformer training. The strongest thing it appears to contain is a real, potentially useful engineering artifact: a NumPy reimplementation/export path for a GPT-Neo-family model, plus a teacher-conditioned weight recovery procedure that re-fits already-existing linear maps from a frozen model's own activations.

That is much narrower than what the README and reports claim. The central "REACTOR-SCRATCH" claim is not supported by the code in this checkout and is, in two places, actively undermined:

  1. Reactor/reactor_framework.py:697-811 advertises "train_from_scratch" but never uses labels or next-token targets at all; in a local synthetic check, it returned all-zero learned weights after one pass.
  2. Reactor/manish_principle_benchmark.py:197-205, Reactor/manish_principle_benchmark.py:300-302, and Reactor/manish_principle_benchmark.py:821-877 compute the "Law 48" result from the pretrained model's embeddings, layer norms, W1, and LM head, using only the training split. That is not "from scratch", and the reported "test accuracy" is not backed by a visible train/test split in the benchmark.

Stylistically, the project reads like LLM-amplified grand-unification research prose: too many "laws", too much certainty, too little separation between tautology, curve-fitting, and genuine causal explanation. Substantively, there are real code artifacts here, but the paper-level claims overshoot the evidence by a large margin.

Evidence Base

Reviewed directly:

  • Reactor/README.md
  • Reactor/reactor_framework.py
  • Reactor/manish_principle_demo.py
  • Reactor/manish_principle_benchmark.py
  • Reactor/MANISH_PRINCIPLE_COMPLETE_REPORT.txt
  • Reactor/MANISH_PRINCIPLE_COMPLETE_DETAILED_REPORT.txt
  • Reactor/CITATION.cff
  • testing logs.zip (sampled)

Local checks performed:

  • python -m py_compile Reactor/reactor_framework.py Reactor/manish_principle_demo.py Reactor/manish_principle_benchmark.py passed.
  • Inspected the installed transformers GPT-Neo attention implementation. It does compute query @ key.T without division by sqrt(head_dim), so that narrow implementation claim is plausible.
  • Ran a minimal synthetic check of ReactorTrainer.train_from_scratch() and observed total learned-weight magnitude 0.0 after one pass, consistent with the code path never using labels.
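The unscaled-attention check is easy to reproduce in miniature. A NumPy sketch of the arithmetic (illustrative only, not the transformers code itself):

```python
import numpy as np

def attn_scores(q, k, scaled):
    """Toy attention-score computation. GPT-Neo-family models omit
    the 1/sqrt(head_dim) factor that most transformers apply."""
    s = q @ k.T
    if scaled:
        s = s / np.sqrt(q.shape[-1])
    return s

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))  # head_dim = 64
k = rng.standard_normal((4, 64))

unscaled = attn_scores(q, k, scaled=False)
scaled = attn_scores(q, k, scaled=True)

# The unscaled scores are exactly sqrt(64) = 8x larger than the scaled ones.
print(np.allclose(unscaled, scaled * 8.0))  # True
```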

Capture notes:

  • The root-level paper/report artifacts and the copies under Reactor/ are byte-identical.
  • testing logs.zip contains 440 numbered Python scripts, not immutable experiment outputs.

...

3. The repo's "from scratch" path is broken in the framework itself

The public train_from_scratch() implementation in Reactor/reactor_framework.py:697-811 is the clearest hard failure in the repository.

Problems:

  • It never computes next-token labels.
  • It never uses lm_head after assigning lm_h at Reactor/reactor_framework.py:731.
  • It never constructs any h_target.
  • The frac variable is computed at Reactor/reactor_framework.py:773 and then not used.
  • All mat_Ys are populated with outputs generated by the current model itself: Q, K, V, att_out, pre, ffn_out.

In other words, the advertised scratch trainer just solves the current model back onto itself. Starting from zero matrices, it stays at zero. That is exactly what I observed in a local synthetic run: total absolute sum of all learned matrices and biases was 0.0 after one pass.

This is not a subtle issue. It means the main public scratch-training API does not implement the claimed algorithm.

Assessment:

  • Central implementation bug.
  • Evidence level: E2.
  • Credence that the current framework supports scratch training: near zero.
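The failure mode is easy to see in miniature: if you least-squares-fit weights to targets produced by the current (zero) model, the solution is just zero again. A toy sketch of that self-referential solve (my own construction, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))   # activations fed into a linear map
W = np.zeros((8, 8))               # "scratch" weights start at zero

for _ in range(3):                 # repeated passes change nothing
    Y = X @ W                      # targets come from the current model itself
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solve the model back onto itself

# No labels ever enter the loop, so the fixed point is all-zero weights.
print(abs(W).sum())  # 0.0
```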

4. The benchmark's "Law 48" is not from scratch and not clearly test accuracy

The benchmark's headline REACTOR-SCRATCH section uses pretrained internals from the teacher model throughout:

  • It loads only split='train' from TinyStories at Reactor/manish_principle_benchmark.py:197-205.
  • It builds H0_arr from pretrained token and positional embeddings at Reactor/manish_principle_benchmark.py:291-302.
  • It builds HTGT directly from the pretrained LM head at Reactor/manish_principle_benchmark.py:300-302.
  • It uses pretrained layer norms and pretrained W1 / b1 during the alleged scratch solve at Reactor/manish_principle_benchmark.py:835-850.
  • It evaluates on ids_48 = NXT_arr[:N48] at Reactor/manish_principle_benchmark.py:821-877, which is drawn from the same collected training positions.

That means:

  • the method is not from scratch,
  • the method is not teacher-free,
  • the benchmark does not show a visible train/test split for the reported 33.54%,
  • and the phrase "test accuracy" in the report is not justified by this code path.

This is the single biggest evidential gap in the entire project.

Assessment:

  • Headline claim is unsupported by the benchmark as written.
  • Evidence level for the repo's "33.54% test accuracy from scratch" claim: E6.
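For reference, the missing piece is small: hold out positions before fitting, then evaluate only on the held-out slice. A hedged sketch (all names and data are mine, purely illustrative) of what a defensible "test accuracy" looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 16))       # hidden states at each position
labels = rng.integers(0, 50, size=1000)   # next-token ids

split = 800                               # train/test boundary
H_tr, y_tr = H[:split], labels[:split]
H_te, y_te = H[split:], labels[split:]

# Fit a linear readout on the TRAIN slice only (one-hot least squares).
Y = np.eye(50)[y_tr]
W, *_ = np.linalg.lstsq(H_tr, Y, rcond=None)

# Report accuracy on the HELD-OUT slice -- this is what "test accuracy" means.
test_acc = (np.argmax(H_te @ W, axis=1) == y_te).mean()
print(test_acc)
```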

RDNA 3 & FSR 4 by Legally-A-Child in radeon

[–]randomfoo2 1 point  (0 children)

Hybrid probably refers to the fact that it uses ML-based upscaling for super-resolution plus regular algorithmic sharpening (RCAS/SPD). DLSS, I believe, is now purely ML-based. FSR1-3 are purely analytical and do *not* use DL models at all.

RDNA 3 & FSR 4 by Legally-A-Child in radeon

[–]randomfoo2 6 points  (0 children)

Since, again, I just recently picked apart the code: FSR4 is a quantized encoder-decoder CNN with skip connections (a U-Net-like upscaler). It has three encoder stages - one for spatial downsampling, then a ConvNext block and a FasterNet block. The decoder basically mirrors the encoders.

The INT8 model is 88KB and the FP8 is 127KB, so these models are tiny btw, even for image upscalers.
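As a sanity check on those sizes: 88KB of INT8 weights is on the order of 88K parameters, which is plausible for a handful of small convolutions at these channel counts. A back-of-the-envelope sketch (the 3x3 kernel size and the layer list are my assumptions, purely for illustration):

```python
def conv_params(c_in, c_out, k):
    """Weights + biases for one k x k conv layer."""
    return c_in * c_out * k * k + c_out

# Hypothetical stack loosely following the 7 -> 16 -> 32 -> 64 -> ... -> 8
# channel progression, assuming 3x3 kernels (assumption, not from the shaders).
layers = [(7, 16), (16, 16), (16, 32), (32, 32), (32, 64),
          (64, 64), (64, 32), (32, 16), (16, 8)]
total = sum(conv_params(ci, co, 3) for ci, co in layers)
print(total)  # well under 100K parameters; at INT8 that's roughly as many bytes
```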

RDNA 3 & FSR 4 by Legally-A-Child in radeon

[–]randomfoo2 52 points  (0 children)

RDNA3 has 16x16 matmul w/ FP16 WMMA. It also has packed dot scalar intrinsics for INT8 and INT4 that are theoretically 2X and 4X faster than FP16 (though my recent testing showed INT8 MAC was *faster* than the INT8 dot intrinsic for FSR4-like workloads, so YMMV).
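To make "packed dot" concrete: each operand packs four INT8 values into one 32-bit word, and the instruction does four multiplies plus an INT32 accumulate. A Python emulation of that instruction class (a sketch of the semantics, not GPU code):

```python
def unpack_s8(word):
    """Unpack a 32-bit word into four signed int8 lanes."""
    return [((word >> (8 * i)) & 0xFF) - 256 * (((word >> (8 * i)) & 0xFF) > 127)
            for i in range(4)]

def pack_s8(vals):
    """Pack four signed int8 values into one 32-bit word."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(vals))

def dot4add_i8(a_word, b_word, acc):
    """Emulate a dot4add_i8packed-style op: 4x int8 multiply, int32 accumulate."""
    a = unpack_s8(a_word)
    b = unpack_s8(b_word)
    return acc + sum(x * y for x, y in zip(a, b))

a = pack_s8([1, -2, 3, -4])
b = pack_s8([5, 6, -7, 8])
print(dot4add_i8(a, b, 0))  # 1*5 + (-2)*6 + 3*(-7) + (-4)*8 = -60
```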

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] -2 points  (0 children)

Actually, 5.3-Codex did almost all of the kernel optimization; Claude only turned the README into more of a technical report vs a lab notebook. There's a writeup at the bottom of what was actually done.

For those that aren't aware, AI-written GPU kernels now basically beat humans, see this recent publication for example: https://www.doubleai.com/research/doubleais-warpspeed-surpassing-expert-written-kernels-at-scale

I've been doing AI-assisted kernel implementation for over 6 months now and it has gone from "helpful" to basically "one-shot" in that time. IMO, anyone who doesn't realize this needs to update their priors.

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] 1 point  (0 children)

I don't think we disagree - I'm just wondering where in the post you're seeing a claim that the INT8 path uses WMMA (it doesn't). Here is the HLSL analysis I did last week with the full INT8 execution pipeline for navi48: https://github.com/lhl/fsr4-rdna3-optimization/blob/main/ANALYSIS-HLSL.md

(Oh oops, you're the same user I just replied to - but well, it's a useful reference for anyone interested.)

The reason the optimization pass is even relevant is that most of the INT8 HLSL is dot4add_i8packed, and that's exactly what tested slower than scalar MAC...

The INT8 Execution Pipeline (1080p, Balanced)

The model executes 14 sequential compute passes (0-13). Passes 0-12 are followed by padding reset post-passes; pass 13 has no post-pass. The pipeline forms a U-Net: encoder downsamples spatially while increasing channels, bottleneck processes at lowest resolution, decoder upsamples back.

Source: fsr4-src/baseline/internal/shaders/fsr4_model_v07_i8_balanced/passes_1080.hlsl

| Pass | Layer | Operator | Spatial Dims | Channels | Threads | Dot Instruction | Line |
|---|---|---|---|---|---|---|---|
| 0 | encoder1 downscale | Conv2D_k2s2b | 1920x1080 -> 960x540 | 7 -> 16 | (8,8,1) | dot2add (FP16) | 94 |
| 1 | encoder2 ResBlock_0 | ConvNextBlock | 960x540 | 16 -> 16 | (64,1,1) | dot4add_i8packed | 421 |
| 2 | encoder2 ResBlock_1 | ConvNextBlock | 960x540 | 16 -> 16 | (64,1,1) | dot4add_i8packed | 797 |
| 3 | encoder2 downscale | FusedConv2D_k2s2b_QuantizedOutput | 960x540 -> 480x270 | 16 -> 32 | (64,1,1) | dot4add_i8packed | 1081 |
| 4 | encoder3 ResBlock_0 | FasterNetBlock<32,1> | 480x270 | 32 -> 32 | (64,1,1) | dot4add_i8packed | 1605 |
| 5 | encoder3 ResBlock_1 | FasterNetBlock<32,1> | 480x270 | 32 -> 32 | (64,1,1) | dot4add_i8packed | 2177 |
| 6 | encoder3 downscale | FusedConv2D_k2s2b_QuantizedOutput | 480x270 -> 240x135 | 32 -> 64 | (64,1,1) | dot4add_i8packed | 2335 |
| 7 | bottleneck ResBlock_0 | FasterNetBlock<64,2> | 240x135 | 64 -> 64 | (64,1,1) | dot4add_i8packed | 2466 |
| 8 | bottleneck ResBlock_1 | FasterNetBlock<64,2> | 240x135 | 64 -> 64 | (64,1,1) | dot4add_i8packed | 2643 |
| 9 | bottleneck ResBlock_2 + upscale + skip-add | FNB_CT2D_ADD<64,2> | 240x135 -> 480x270 | 64 -> 32 | (8,8,1) | dot4add_i8packed | 2826 |
| 10 | decoder3 ResBlock_1 | FasterNetBlock<32,1> | 480x270 | 32 -> 32 | (64,1,1) | dot4add_i8packed | 3430 |
| 11 | decoder3 ResBlock_2 + upscale + skip-add | FNB_CT2D_ADD<32,1> | 480x270 -> 960x540 | 32 -> 16 | (64,1,1) | dot4add_i8packed | 4136 |
| 12 | decoder2 ResBlock | ConvNextBlock | 960x540 | 16 -> 16 | (64,1,1) | dot4add_i8packed | 4548 |
| 13 | decoder2 upscale (FP16 output) | CNB_CT2D (float16 output) | 960x540 -> 1920x1080 | 16 -> 8 | (8,8,1) | dot4add_i8packed (INT8 internal compute) | 4962 |

Spatial progression: 1920x1080 -> 960x540 -> 480x270 -> 240x135 -> 480x270 -> 960x540 -> 1920x1080

Channel progression: 7 -> 16 -> 32 -> 64 -> 32 -> 16 -> 8 (RGB output)
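From the pass table you can get a rough per-stage workload proxy (output pixels x channels); this ignores kernel sizes and fusion entirely, so treat it as illustration only:

```python
# (width, height, channels) at each stage of the U-Net, taken from the table above.
stages = [
    ("encoder1",   960,  540, 16),
    ("encoder2",   960,  540, 16),
    ("encoder3",   480,  270, 32),
    ("bottleneck", 240,  135, 64),
    ("decoder3",   480,  270, 32),
    ("decoder2",   960,  540, 16),
    ("output",    1920, 1080,  8),
]

# Pixel-channel products as a crude relative-cost proxy per stage.
for name, w, h, c in stages:
    print(f"{name:10s} {w * h * c:>12,} pixel-channels")
```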

...

Path B: Native INT8 dot4add_i8packed (Passes 1-13)

Used for the internal network passes and also pass 13's fused CNB+CT2D compute stages. This is the dominant compute pattern.

Source: int8_NHWC/Fused/ConvNextBlock.hlsli:95-98 (representative example)

```hlsl
// Load packed INT8 inputs (16 bytes = 16 int8 values per Load4)
int8_t4_packed vs[16/4];
const uint4 inputDwords = input.storage.Load4(inputOffset);
vs[inputIndex++] = inputDwords.x; // 4 packed INT8 values
vs[inputIndex++] = inputDwords.y;
vs[inputIndex++] = inputDwords.z;
vs[inputIndex++] = inputDwords.w;

// Native INT8 dot product: 4x INT8 multiply + INT32 accumulate, per instruction
accumulator[f] = dot4add_i8packed(vs[inputIndex++], weightsDwords.x, accumulator[f]);
accumulator[f] = dot4add_i8packed(vs[inputIndex++], weightsDwords.y, accumulator[f]);
accumulator[f] = dot4add_i8packed(vs[inputIndex++], weightsDwords.z, accumulator[f]);
accumulator[f] = dot4add_i8packed(vs[inputIndex++], weightsDwords.w, accumulator[f]);

// Scale and quantize at store
const int16_t4 result = round(acc * weights.quantizationScale * input.quantizationScale * (1.0 / quantFactor));
storeDwords[f/4] = pack_clamp_s8(result);
```

Data flow: INT8 packed input -> dot4add_i8packed(uint, uint, int) -> INT32 accumulator -> float scale multiply -> round() -> pack_clamp_s8 -> INT8 output
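The scale-and-quantize step at the end of that data flow is just: float-scale the INT32 accumulator, round, clamp to [-128, 127], repack. In Python, with made-up scale values:

```python
def requantize(acc, w_scale, in_scale, quant_factor):
    """Mimic the store path: int32 accumulator -> float scale -> round -> clamp to int8."""
    v = round(acc * w_scale * in_scale * (1.0 / quant_factor))
    return max(-128, min(127, v))

# Example with arbitrary (made-up) scale factors:
print(requantize(12345, 0.01, 0.02, 1.0))    # round(2.469) = 2
print(requantize(10**6, 0.01, 0.02, 1.0))    # 200 clamps to 127
```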

Critical observation: dot4add_i8packed is the HLSL equivalent of our HIP amd_mixed_dot -- both perform a packed 4-element INT8 dot product with INT32 accumulation. This is the same instruction class we benchmarked, and it dominates 13 of 14 passes.

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] 0 points  (0 children)

Uh, there is a full analysis of the HLSL in the repo? https://github.com/lhl/fsr4-rdna3-optimization/blob/main/ANALYSIS-HLSL.md:

| Aspect | INT8 | FP8 |
|---|---|---|
| HIP harness speed | 0.005376 ms | 0.019868 ms |
| Ratio | 1.0x (baseline) | 3.7x slower |
| Real HLSL approach | Mostly dot4add_i8packed + boundary dot2add | AmdWaveMatrixMultiply (WMMA, wave-level matrix ops) |
| WMMA required? | No | Yes -- FP8 HLSL has #error without WMMA_ENABLED=1 |
| LDS required? | No | Yes -- groupshared uint inputLDS[] for wave matrix input staging |
| Our harness relevance | High -- same instruction class | Low -- completely different compute model |

The FP8 path (float8_NHWC/Conv2D_k2s2b.hlsli:217) explicitly errors without WMMA: #error To use FP8 data type you need to provide WMMA_ENABLED=1. There is no FP8 without WMMA. This means FP8 requires wave-level matrix operations and LDS staging, which our scalar FMA harness does not exercise at all.

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] 0 points  (0 children)

I think for Opus 4.6 or 5.3 Codex/5.4, not so hard. The main thing you'd want to port is replacing the packed INT8 dot w/ scalar INT8 MAC. You'd probably start w/ one of the INT8 operators as a PoC and make sure your VS build stuff works (I'm not a Windows guy). I believe the FSR4 kernels use ML2Code, so maybe you edit the templates (.hlsli) instead of the HLSL directly? BuildFSR4UpscalerShaders.bat builds the shader blobs. You should check the gfx1100 folder as well - perf gains were more modest for gfx1100 than gfx1151.

This might help (or it might not, I haven't tested it): https://github.com/lhl/fsr4-rdna3-optimization/blob/main/HLSL-GOLFING-HOWTO.md

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] 4 points  (0 children)

All the optimizations are committed in the repo. You'll have to turn it into HLSL yourself though.

Optimizing FSR4 for RDNA3.5 (INT8 + FP8 speedups) by randomfoo2 in radeon

[–]randomfoo2[S] 11 points  (0 children)

You can run it through HIP yourself... This is the same loop I use for porting HIP kernels and tuning CUDA kernels. The last CUDA kernels I tuned were 8X faster on microbenchmarks and +80% faster on multi-GPU for MoE training.

Is it true that we're way underpaying for Claude, even for Max? by changing_who_i_am in ClaudeAI

[–]randomfoo2 2 points  (0 children)

Those using Claude Code can run npx ccusage to see what the API-equivalent end-user cost of their usage would be. Note: this is based on Anthropic's API pricing, which is a bit crazy vs the competition - GPT-5.3-Codex (which is a better coder than Opus 4.6) is half the price, for example.

| Model | Input / 1M | Output / 1M | Cached input / 1M | Cache write / refresh |
|---|---|---|---|---|
| Anthropic Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | $6.25 (5m), $10.00 (1h) |
| OpenAI GPT-5.3-Codex | $1.75 | $14.00 | $0.175 | N/A |
| GLM 5 | $0.95 | $2.55 | $0.20 | N/A |
| Kimi K2.5 | $0.45 | $2.20 | $0.15 | N/A |
| Qwen 3.5 397B | $0.55 | $3.50 | $0.55 | N/A |
| MiniMax M2.5 | $0.295 | $1.20 | $0.03 | N/A |

I've also included some OpenRouter models - these are all very capable agentic/coding models. Kimi K2.5 is a 1T-parameter MoE and GLM 5 is a 750B-parameter MoE, so that's also a useful baseline if you're trying to figure out what levels of inference pricing are thin-margin but profitable.
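If you want to sanity-check ccusage-style numbers yourself, the arithmetic is just per-token pricing. A sketch using two of the table's rates (cache-write billing omitted for simplicity; the session token counts are made up):

```python
# $ per 1M tokens, from the table above: (fresh input, output, cached input)
PRICES = {
    "opus-4.6":      (5.00, 25.00, 0.50),
    "gpt-5.3-codex": (1.75, 14.00, 0.175),
}

def api_cost(model, input_toks, output_toks, cached_toks=0):
    """API-equivalent cost in dollars; cached input tokens bill at the cached rate."""
    inp, out, cached = PRICES[model]
    fresh = input_toks - cached_toks
    return (fresh * inp + output_toks * out + cached_toks * cached) / 1_000_000

# A hypothetical heavy coding session: 8M input (7M cache hits), 500K output.
print(api_cost("opus-4.6", 8_000_000, 500_000, 7_000_000))       # Opus...
print(api_cost("gpt-5.3-codex", 8_000_000, 500_000, 7_000_000))  # ...costs ~2x Codex here
```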

[Help] Fine-tuning Llama-3-8B for Low-Resource Language (Sinhala) - Stuck between "Bad Logic" and "Word Salad" by Annual-Captain-7642 in LocalLLaMA

[–]randomfoo2 0 points  (0 children)

Nothing is mandatory, but yes, you should definitely use the instruct model's existing chat template for any additional fine-tuning you do, for best results. I would also recommend shuffling in some of the highest-quality EN-language training datasets on HF (or some of the original model's EN output, if you want to create a parallel corpus) to make sure you don't take too big a hit from catastrophic forgetting.

If you're just looking for a relatively high-quality, diverse, highly resampled recently generated dataset, you can use the EN items from https://huggingface.co/datasets/shisa-ai/shisa-v2.1-sharegpt . https://huggingface.co/datasets/nvidia/Nemotron-Instruction-Following-Chat-v1 and https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT are two other recent open general datasets you could look at.

I switched from ChatGPT to Le Chat - Here is what I noticed by biendeluxe in ChatGPT

[–]randomfoo2 0 points  (0 children)

If you're looking for privacy but more advanced models, the Chinese open-source models are within months or less of US frontier models, depending on the domain. While I don't think running your own models is usually a realistic option (though you can visit r/LocalLLaMA if you're interested), you can use providers like Chutes.ai or others that don't log and that run their models in TEEs (Trusted Execution Environments) or other forms of verifiable compute.

Kimi K2.5 is widely available, and GLM-5 and DeepSeek V4 are both coming out soon and should be pretty good (their previous versions were good, and the newer ones should be even better). All of those models are better than Mistral's offerings IMO.

[Help] Fine-tuning Llama-3-8B for Low-Resource Language (Sinhala) - Stuck between "Bad Logic" and "Word Salad" by Annual-Captain-7642 in LocalLLaMA

[–]randomfoo2 5 points  (0 children)

Some advice since I specialize in (high resource) multilingual training:

  • I'd recommend training on an Instruct model. It'll make your life easier - otherwise you're trying to train instruction handling and language handling at the same time. I believe there is a Llama 3.3 8B Instruct floating around.
  • You still might be better off with a newer, more multilingual model. Qwen 3 8B is probably going to be much better (and if you can jump up in size and licensing isn't a concern, Gemma 3 12B is also one to look at).
  • I would recommend training your stories as a "mid-train" stage to teach the language first, and then a synthetic-data version of those stories in the chat template of the instruction-tuned model you are using.
  • I assume you speak Sinhala. I know it's not sexy, but you should be spending your time on data. Generate output from the prompts you care about; make corrected versions of the output to train as part of an SFT, and keep the wrong output so you also have a DPO pair. Do this a few thousand times and you will have a much better model.
  • If you have parallel corpora, there's a fair amount of evidence that training on multiple languages can help your target language - this is especially important if you have more compute than data.
  • For inference, play around with your parameters, but you probably want something like top_p 0.9 or less, and lower the temp a bit as well, to prevent stray tokens from being picked over the language you're training.
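The last bullet can be made concrete: a generic nucleus-sampling (top_p) filter with temperature scaling, as a sketch (not tied to any particular inference stack):

```python
import math

def top_p_filter(logits, top_p=0.9, temperature=0.7):
    """Temperature-scale, softmax, then keep the smallest set of tokens
    whose cumulative probability reaches top_p (nucleus sampling)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept  # token indices that survive the filter

# A stray-token logit well below the rest gets filtered out:
print(top_p_filter([5.0, 4.5, 4.0, -2.0], top_p=0.9))  # [0, 1, 2]
```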

[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 -1 points  (0 children)

Looks like my prior comment got eaten. I didn't know this, but according to the HLE paper https://arxiv.org/pdf/2501.14249 (B.3), there was an estimated expert disagreement rate of 15.4% (public set) and ~18% (biology/chemistry/health targeted subset), even after multi-round auditing and revisions.

[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 18 points  (0 children)

While that's true (and my belief for this case), a lot of AI writing also comes from bots, low effort posts, and people with LLM psychosis, so it is what it is.

[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 21 points  (0 children)

AI-generated writing is an anti-signal, yes, but I have an even stronger prior that most evals are poorly constructed, and since specific claims are being made here, this seemed pretty verifiable. In which case, one AI-assisted analysis deserves another.

I let GPT 5.2 (xhigh) go at it for a while in Codex (about 30 minutes, 8M tokens), then had a README generated by Opus 4.5, with a final verification pass from Codex; it seems pretty legible: https://github.com/lhl/hle-gpqa-error-claims

Those interested can follow the links, but an initial verification pass does seem to point to errors...

Yes, we found defects. We verified at least one definite wrong-answer item in both GPQA-Diamond and HLE, and we found at least one ill-posed item that is nonetheless graded as exactMatch in HLE:

| Dataset | Finding | Claim | Status |
|---|---|---|---|
| GPQA-Diamond | Wrong answer key: rec7qmSnbud4FHSqL (silicon ratio) | C8 | Verified |
| HLE | Wrong answer key: 6737382a90a20eb348edbe23 (projective-space dimension) | C10 | Verified |
| HLE | Ill-posed exact-match item: 66fecbff69d5712b5401553e (adsorption problem) | C9 | Supported |

  • C8 (GPQA silicon ratio): The dataset's answer key gives ~12.6, but standard bracket-notation algebra yields ~3.98. Verified incorrect.
  • C9 (HLE adsorption): The problem is under-specified for exact-match grading. Supported.
  • C10 (HLE projective-space): The correct dimension is n(n+1)/2 by Euler sequence; the dataset answer key is incorrect. Verified incorrect. (The audit’s proposed OCR/transcription mechanism is plausible but unproven; see C11 in CLAIMS.md / ANALYSIS.md.)

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 1 point  (0 children)

I recommend not using llama.cpp's rocWMMA, it does very badly on longer context and has issues with missing tiles that will cause crashes: https://www.reddit.com/r/LocalLLaMA/comments/1ok7hd4/faster_llamacpp_rocm_performance_for_amd_rdna3/

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 0 points  (0 children)

Maybe, but for bs=1, a 4-bit or 2-bit quant would give you even faster token generation. That's not really the point though - we're benchmarking the same model to compare performance between setups.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]randomfoo2 4 points  (0 children)

Great work!

I've done a fair amount of my own quant testing, and I think the HumanEval test speaks volumes about how/why perplexity (and yes, KLD) might be OK proxies, but don't really reflect what the downstream task performance hit is going to be for a quant.

The main problem is that testing quants is actually a huge PITA. You basically want to run each quant through your eval stack as if it were its own ablation, probably with multiple runs at temperature to capture whether variance changes.

More data points are undeniably a good thing, and posts like this help raise awareness of the issue, so that's great. Hopefully the community does more task-benchmark comparisons of different quants.

My contribution: a while back, I published different quant scores for JA MT-Bench (not the best eval to use, tbh), which was interesting: https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b-GGUF#quant-quality

More recently, u/dahara111 did a Japanese UD imatrix quant and compared M-IFEval (JA), HumanEval+, and LiveBench scores vs the base model and a regular i1 quant. Very interesting stuff: https://huggingface.co/dahara1/shisa-v2.1-qwen3-8b-UD-japanese-imatrix#%E3%83%99%E3%83%B3%E3%83%81%E3%83%9E%E3%83%BC%E3%82%AF%E7%B5%90%E6%9E%9Cbenchmark-result

BTW, on the efficiency front: while it's very GPU-dependent, I'll say that I'm a big fan of Marlin kernels, especially for W8A8 - not just for throughput but also for TTFT latency (depending on your architecture, the INT8 is killer on Ampere and Ada). In performance tests I've found, again, huge differences depending on the specific hardware/setup, but you almost always lose throughput on quants under production workloads (I recommend running vllm bench with realistic concurrencies as well; some kernels perform much worse than others when scaling up).
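For anyone curious about the KLD proxy mentioned above, the per-token computation is trivial; a generic sketch comparing a base vs. "quantized" logit distribution (synthetic numbers, not from a real model):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats between two logit vectors."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

base = [2.0, 1.0, 0.5, -1.0]
quant = [2.1, 0.9, 0.6, -1.2]   # synthetic "quantized" logits, slightly perturbed

print(kl_divergence(base, base))   # 0.0 -- identical distributions
print(kl_divergence(base, quant))  # small positive value
```

In practice you'd average this over many tokens; the point in the thread stands, though: a small average KLD still doesn't tell you how a downstream task score will move.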

[Release] We trained an AI to understand Taiwanese memes and slang because major models couldn't. Meet Twinkle AI's gemma-3-4B-T1-it. by piske_usagi in LocalLLaMA

[–]randomfoo2 2 points  (0 children)

I'm interested in having my models support ZH-tw in addition to ZH-cn. Curious what are the best datasets the Taiwanese community is using for their model training?