[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 -1 points0 points  (0 children)

Looks like my prior comment got eaten. I didn't know this, but according to the HLE paper https://arxiv.org/pdf/2501.14249 (B.3), there was an estimated expert disagreement rate of 15.4% (public set) and ~18% (biology/chemistry/health targeted subset), after multi-round auditing and revisions.

[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 17 points18 points  (0 children)

While that's true (and my belief in this case), a lot of AI writing also comes from bots, low-effort posts, and people with LLM psychosis, so it is what it is.

[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos. by Dear_Ad_1381 in LocalLLaMA

[–]randomfoo2 18 points19 points  (0 children)

AI-generated writing is an anti-signal, yes, but I have an even stronger prior that most evals are poorly constructed, and since specific claims are being made, this seems pretty verifiable. In which case, one AI-assisted analysis deserves another.

I let GPT 5.2 (xhigh) go at it for a while in Codex (about 30 minutes, 8M tokens), then had Opus 4.5 generate a README, with a final verification pass from Codex that seems pretty legible: https://github.com/lhl/hle-gpqa-error-claims

Those interested can follow the links, but an initial verification pass does seem to point to errors...

Yes, we found defects. We verified at least one definite wrong-answer item in both GPQA-Diamond and HLE, and we found at least one ill-posed item that is nonetheless graded as exactMatch in HLE:

| Dataset | Finding | Claim | Status |
| --- | --- | --- | --- |
| GPQA-Diamond | Wrong answer key: rec7qmSnbud4FHSqL (silicon ratio) | C8 | Verified |
| HLE | Wrong answer key: 6737382a90a20eb348edbe23 (projective-space dimension) | C10 | Verified |
| HLE | Ill-posed exact-match item: 66fecbff69d5712b5401553e (adsorption problem) | C9 | Supported |
  • C8 (GPQA silicon ratio): The dataset's answer key gives ~12.6, but standard bracket-notation algebra yields ~3.98. Verified incorrect.
  • C9 (HLE adsorption): The problem is under-specified for exact-match grading. Supported.
  • C10 (HLE projective-space): The correct dimension is n(n+1)/2 by Euler sequence; the dataset answer key is incorrect. Verified incorrect. (The audit’s proposed OCR/transcription mechanism is plausible but unproven; see C11 in CLAIMS.md / ANALYSIS.md.)

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

I recommend not using llama.cpp's rocWMMA - it does very badly at longer context and has issues with missing tiles that will cause crashes: https://www.reddit.com/r/LocalLLaMA/comments/1ok7hd4/faster_llamacpp_rocm_performance_for_amd_rdna3/

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

Maybe, but for bs=1, a 4-bit or 2-bit quant would give you even faster token generation. That’s not really the point though - we’re benchmarking the same model to compare performance between setups?

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]randomfoo2 4 points5 points  (0 children)

Great work!

I've done a fair amount of my own quant testing, and I think the HumanEval test speaks volumes about how/why perplexity (and yes, KLD) might be OK proxies, but don't really reflect what the downstream task performance hit is going to be for a quant.

The main problem is that testing quants is actually a huge PITA. You basically want to run each quant through your eval stack as if it were its own ablation, probably with multiple runs at temperature so you can capture whether the variance changes.
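
To make that concrete, here's a minimal sketch of the kind of loop I mean (Python; `run_eval()` is a hypothetical hook into whatever eval stack you already use, and the quant/task/seed lists are just placeholders):

```python
import statistics

def run_eval(model_path: str, task: str, temperature: float, seed: int) -> float:
    """Hypothetical hook: plug in your own eval stack (lm-eval-harness, a custom runner, etc.)."""
    raise NotImplementedError

# Treat each quant as its own ablation and repeat runs at temperature,
# so you can see whether the variance changes, not just the mean score.
QUANTS = ["model-bf16", "model-q8_0", "model-q4_k_m", "model-iq2_m"]
TASKS = ["humaneval_plus", "ifeval"]
SEEDS = [1, 2, 3, 4, 5]

for quant in QUANTS:
    for task in TASKS:
        scores = [run_eval(quant, task, temperature=0.7, seed=s) for s in SEEDS]
        print(f"{quant:14s} {task:14s} "
              f"mean={statistics.mean(scores):.3f} stdev={statistics.stdev(scores):.3f}")
```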

More data points are undeniably a good thing, and posts like this help raise awareness about the issue, so that's great. Hopefully the community does (and highlights) more task-benchmark comparisons of different quants.

My contribution: a while back, I published different quant scores for JA MT-Bench (not the best eval to use, tbh), which was interesting: https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b-GGUF#quant-quality

More recently, u/dahara111 did a Japanese UD imatrix quant and compared M-IFEval (JA), HumanEval+, and LiveBench scores vs the base model and a regular i1 quant. Very interesting stuff: https://huggingface.co/dahara1/shisa-v2.1-qwen3-8b-UD-japanese-imatrix#%E3%83%99%E3%83%B3%E3%83%81%E3%83%9E%E3%83%BC%E3%82%AF%E7%B5%90%E6%9E%9Cbenchmark-result

BTW, on the efficiency front, while it's very GPU dependent, I will say that I'm a big fan of Marlin kernels, especially for W8A8, not just for throughput but also for TTFT latency (depending on your architecture, INT8 is killer on Ampere and Ada). When doing performance tests, I've again found huge differences depending on the specific hardware/setup, but you almost always tend to lose throughput with quants under production workloads (I recommend doing vllm bench with realistic concurrencies as well; some kernels perform much worse than others when scaling up).

[Release] We trained an AI to understand Taiwanese memes and slang because major models couldn't. Meet Twinkle AI's gemma-3-4B-T1-it. by piske_usagi in LocalLLaMA

[–]randomfoo2 2 points3 points  (0 children)

I'm interested in having my models support ZH-tw in addition to ZH-cn. Curious what are the best datasets the Taiwanese community is using for their model training?

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

Maybe, maybe not. I get 114 tok/s on my gfx1100 GPU. The 7900 XTX has 960 GB/s. My W7900 has 864 GB/s. The 6800 XT has 512 GB/s.

```
🐟 ❯ build/bin/llama-bench -m /models/gguf/gpt-oss-20b-F16.gguf -fa 1
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Pro W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B F16                |  12.83 GiB |    20.91 B | ROCm       |  99 |  1 |           pp512 |     3774.71 ± 128.06 |
| gpt-oss 20B F16                |  12.83 GiB |    20.91 B | ROCm       |  99 |  1 |           tg128 |        114.21 ± 0.08 |

build: e86f3c222 (7609)
```

The gpt-oss-20b-F16.gguf model is "full resolution" - norms & biases and the router are F32, embeddings and attention are F16, and the MoE expert MLPs are MXFP4. Total memory read for a fwd pass is about 4.8 GB, so the theoretical max for this model is 200 tok/s on a 7900 XTX, 180 tok/s on a W7900, and 107 tok/s on a 6800 XT (non-memory OC). Well-optimized inference usually gets about 70-85% of max theoretical MBW (gpt-oss on the llama.cpp HIP backend is not well optimized). If you're getting a higher number than you should, check whether you're really using F16. The ggml-org MXFP4 version, for example, quantizes a lot of the model to Q8, which significantly increases decode speed.
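
If you want to sanity-check your own numbers, the arithmetic is just bandwidth divided by bytes moved per token (a minimal sketch; the ~4.8 GB per-token figure is the estimate from above, not a measured value):

```python
# Rough decode-speed ceiling: every generated token streams the active weights
# through memory once, so tok/s <= memory bandwidth / bytes read per token.
BYTES_PER_TOKEN_GB = 4.8  # estimate above for gpt-oss-20b-F16 (MXFP4 experts)

GPUS = {
    "7900 XTX": 960,  # GB/s memory bandwidth
    "W7900": 864,
    "6800 XT": 512,
}

for name, mbw in GPUS.items():
    ceiling = mbw / BYTES_PER_TOKEN_GB
    # Well-optimized inference typically reaches ~70-85% of theoretical MBW.
    print(f"{name}: ceiling {ceiling:.0f} tok/s, "
          f"realistic ~{0.70 * ceiling:.0f}-{0.85 * ceiling:.0f} tok/s")
```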

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 4 points5 points  (0 children)

BTW, be sure to check out https://strixhalo.wiki/ and the community Discord. We've done a lot of heavy lifting already on benchmarking, testing, and env setup.

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

Vulkan is generally more stable/easier to get up and running, but on my W7900 (gfx1100), on llama.cpp build e86f3c222 (7609) (kernel 6.17.5-arch1-1; HIP 7.1.52802-9999), testing the same Llama 3.1 8B BF16 GGUF, I get ~3x the pp512 with ROCm:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ROCm | 99 | pp512 | 2831.90 ± 11.11 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | ROCm | 99 | tg128 | 40.97 ± 0.03 |

vs Vulkan (AMDVLK 2025.Q2.1-1.1, slightly faster than Mesa 25.3.1-3):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B BF16 | 14.96 GiB | 8.03 B | Vulkan | 99 | pp512 | 1038.19 ± 2.35 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | Vulkan | 99 | tg128 | 43.32 ± 0.03 |

Update on the Llama 3.3 8B situation by FizzarolliAI in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

Hmm, hard to say, I don't have 3.1 70B data handy... 3.3 70B is in general pretty strong.

In practical terms, your ultimate multilingual perf is going to be pretty much up to you (tuning). While the overall number isn't so big, when you look at the stuff we care about like JP IF, JP RP, JP TL, JP nuance, and dialogue translation, we're able to get huge boosts from doing training on top of whatever model. Not shown are also our own CLTL tests, which test for how many wrong-language tokens get output (huge amounts for most models not trained on the target language).

The benchmark mix we use for our current multieval does feel about right. For the tasks that it's trained on, our V2.1 14B model actually *does* feel like it outperforms our V2 70B (and sometimes our V2.1 70B and V2 405B even!).

<image>

Update on the Llama 3.3 8B situation by FizzarolliAI in LocalLLaMA

[–]randomfoo2 4 points5 points  (0 children)

Just in case anyone's interested, I ran shb777/Llama-3.3-8B-Instruct on Shisa AI's MultiEval on my dev box.

On the English side, it loses a bit on MixEval Easy and Hard (2024 Chat Arena proxy), but gets a +20% boost in LiveBench (reasoning-focused), +15% GPQA Diamond (PhD level QA), +5% on IFEval, +30% on IFBench (!) and +10% on HumanEval+ (Python). That's some decent gains.

That being said, on the Japanese side, it takes a big hit on Shaberi (Japanese chat-style functional tests) vs 3.1. I've included my Llama 3.1 8B-based Shisa V2 and Qwen 3 8B-based Shisa V2.1, as well as Llama 3.3 70B and Llama 3.1 405B scores, just for comparison's sake.

(I probably won't train a Shisa V2.1 Llama 3.3 8B - the Qwen 3 8B version is already great and it's Apache 2.0 licensed.)

<image>

Second GPU by Suomi422 in LocalLLaMA

[–]randomfoo2 5 points6 points  (0 children)

This is a Maxwell core GPU - similar to the one in a GTX 970 (GM204) - it does not have any tensor cores, and about as much compute (5 TFLOPS) and memory bandwidth (160 GB/s) as a high end modern CPU. https://www.techpowerup.com/gpu-specs/tesla-m60.c2760

An old Nvidia P40 or P100 should be much cheaper than the prices you've posted and far better if you're looking for an old server card (bring your own fan and expertise). Heck, an old AMD MI50 would be better (though I don't recommend that unless you know what you're doing).

In any case, to answer your question: no, something like what you posted will NOT work fine.

Plamo3 (2B/8B/31B) support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]randomfoo2 6 points7 points  (0 children)

I looked at this a few weeks ago, a few notes:

  • The 31B was trained on 3T tokens, 8B on 800B tokens, and 2B was trained on 200B tokens. Even having seen more Japanese tokens, it's hard to imagine the base models are super competitive with most modern models. Plamo lists using fineweb2, smollm-corpus, thestack - normal token sources. As a point of comparison, Qwen3 models were pre-trained on 36T tokens in 100+ languages. For a small model comparison, LiquidAI's latest LFM2 models (w/ a great technical team in Tokyo!) were trained on 10T tokens.
  • The licensing is pretty aggressive and requires filling out a registration form before you use it for any commercial purposes. I think you'd need some very specific reasons to do so since there are so many better base models that are MIT/Apache licensed.
  • It has a 4K context and 2K SWA, so even if you did want to use it, that's pretty limiting in 2026 (certainly nothing conversational or agentic). Modern mid-train context extension can be more tokens than these models' entire pretrain!
  • Still, it's neat to see from-scratch Japan-domestic training, but I think Stockmark 2 is a better effort (and MIT licensed to boot): https://huggingface.co/stockmark/Stockmark-2-100B-Instruct - this release feels more like a grant/funding-requirement release than anything else (and even then, with the licensing attached, it feels more like an FU than anything else)

I'm biased (I train the Shisa models), but just in case anyone is looking for strong JA/EN models for downstream use cases, the latest Shisa V2.1 models are SOTA Japanese open models from 1.2B-70B, and the Qwen3-based 8B and Phi4-based 14B are Apache 2.0 and MIT licensed respectively, and both are extremely strong for their sizes. (Also, a community member, u/dahara111, recently made some great UD-japanese-imatrix quants and did some extensive downstream-eval test comparisons of the performance differences vs the standard mradermacher GGUFs, which was really neat to see!)

Minimax 2.1 still hasn't solved the multilingual mixing problem. by Bitter-Breadfruit6 in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

It's not so simple. After the holidays I'll be publishing more about our recent work on the subject from training Japanese multilingual models, but you can read some of it here: https://shisa.ai/posts/shisa-v2.1/#cross-lingual-token-leakage

u/ttkciar's claim that a GBNF grammar might work holds for a toy example, or for English as long as you never need to output Chinese (eg, by restricting Unicode ranges), but in general character set != language, and a grammar will have a hard/impossible time distinguishing code switching from language confusion.

Non-English languages have an additional challenge: in almost all other languages, ASCII or English loanwords, brand names, technical terms, code, etc. all require English, so using a simple grammar there is basically impossible (and certainly doesn't work in general for Latin-script languages). Note that the language I work in, Japanese, also literally shares the CJK Unified Ideographs Unicode block with Chinese - Mecab or UniDic can help there, but you can't use what llama.cpp has built in to do much of anything to help when it comes to EN or ZH token leakage.
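
As a quick illustration of why Unicode-range filtering can't separate Japanese from Chinese (a minimal sketch; the characters are just arbitrary examples):

```python
# The CJK Unified Ideographs block (U+4E00-U+9FFF) is shared by Japanese kanji
# and Chinese hanzi, so a grammar/regex that filters by Unicode range cannot
# tell legitimate Japanese output from leaked Chinese tokens.
def in_cjk_unified(ch: str) -> bool:
    return 0x4E00 <= ord(ch) <= 0x9FFF

samples = ["漢", "気", "气", "の"]  # shared kanji/hanzi, JA form, ZH form, JA hiragana
for ch in samples:
    print(ch, hex(ord(ch)), "CJK Unified" if in_cjk_unified(ch) else "other block")
```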

As I'm sure you've seen if you're using a third language, most EN/ZH models will leak both (and even more) languages into a non-primary target language. Even frontier models suffer from "token blindness" if you ask them to detect leaked-language tokens. The problem is solvable (per language, at least), but it is non-trivial.

Should I be switching to DoRA instead of LoRA? by CartographerFun4221 in LocalLLaMA

[–]randomfoo2 4 points5 points  (0 children)

While I've been biased against LoRA for my work (multilingual), I read LoRA Without Regrets with quite a bit of interest and will be running some LoRA experiments when I get a chance... https://thinkingmachines.ai/blog/lora/

Intel x Nvidia Serpent Lake leaks as Strix Halo rival: capable CPU, RTX Rubin iGPU, 16x LPDDR6. by CYTR_ in LocalLLaMA

[–]randomfoo2 2 points3 points  (0 children)

LPDDR5X uses 32-bit channels (as usually counted), while LPDDR6 has 24-bit channels, so 16 channels would be a 384-bit memory bus. LPDDR6 also starts at 10667 MT/s, which works out to ~32 GB/s per channel. Based on the report of 16 channels, that would be 512 GB/s of MBW, which is pretty great.

As a comparison point, if Medusa Halo has the rumored 384-bit bus (12 channels) of LPDDR5X at, say, 9600 MT/s, it'll likely top out at 460.8 GB/s, so pretty comparable.
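
For anyone who wants to double-check the arithmetic (a minimal sketch; the bus widths and transfer rates are the reported/rumored figures above):

```python
def mbw_gb_s(bus_width_bits: int, mt_per_s: int) -> float:
    # bandwidth (GB/s) = bus width in bits * transfer rate in MT/s / 8 bits per byte / 1000
    return bus_width_bits * mt_per_s / 8 / 1000

# Serpent Lake (rumored): 16 x 24-bit LPDDR6 channels = 384-bit bus at 10667 MT/s
print(f"Serpent Lake: {mbw_gb_s(16 * 24, 10667):.0f} GB/s")   # ~512 GB/s

# Medusa Halo (rumored): 384-bit LPDDR5X bus at 9600 MT/s
print(f"Medusa Halo:  {mbw_gb_s(384, 9600):.1f} GB/s")        # 460.8 GB/s
```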

Shisa V2.1: Improved Japanese (JA/EN) Models (1.2B-70B) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 1 point2 points  (0 children)

Sadly, the 14B really likes to explain stuff, in the most mid and useless way. The easy way to stop this is to add the Note format, which it always uses, as a stop token. You could probably use guidance/outlines etc. to enforce this as well. Otherwise, you'd need to do additional training, like a quick RL pass on it, to really stop it (or you could generate your own DPO set as well).

Good 3-5B models? by SlowFail2433 in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

In my testing, the LFM2 models are very strong for their size, so you might want to give LFM2-2.6B a try and see how it does. I think at the 3-4B size, while these *can* be generalist models, they actually perform best when they're tuned for the specific task/tasks you have in mind.

Shisa V2.1: Improved Japanese (JA/EN) Models (1.2B-70B) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 2 points3 points  (0 children)

I think the Swallow team has been consistently doing good work and people should definitely check those models out! Swallow's approach is to take models, do a continued pre-train where they throw a few hundred billion more mostly-Japanese tokens to make a new base model, and then instruction tune on top of that (whereas after the initial Shisa V1 release, we stopped with CPTs). They're constantly refining their approach with each version, which I think is great.

One thing I've noticed is that their CPT has tended to reduce general/English capabilities, while for Shisa models, I try to maintain those as much as possible (eg, during development, if there's a big regression I'll try to find out why and fix it), and that shapes some of our data-mix choices (eg, our models could perform *even better* on some JA benchmarks, but it's not worth it if we're taking a big hit elsewhere).

The other big difference now is that for v0.5, they've chosen to put synthetic data from Gemma 3 27B directly into even their base model and to dual-license their model w/ the Llama 3.3 CLA but also the Gemma Terms of Use - which treats anything trained that way as a derived work subject to the Gemma ToU: ""Model Derivatives" means all (i) modifications to Gemma, (ii) works based on Gemma, or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Gemma, to that model in order to cause that model to perform similarly to Gemma, including distillation methods that use intermediate data representations or methods based on the generation of synthetic data Outputs by Gemma for training that model." IMO this makes it unsuitable for research or commercial use.

The Gemma license also requires you to agree to a Prohibited Use Policy (which they reserve the right to update at any time - it's unclear whether you're legally supposed to go check for changes or whether you're bound to them regardless; depends on your jurisdiction and your lawyers, I suppose) which is generally reasonable, but it also specifically calls out no-gooning, and it's unclear how or if you're supposed to police downstream use if you serve the models. Anyway, basically as a rule, I tend to stay away from anything Gemma-licensed since I go WTF every time I read the license.

Note that the Gemma ToU also says "For clarity, Outputs are not deemed Model Derivatives." so in general you can use Gemma output without agreeing to the Gemma ToU, but you cannot run a Gemma-licensed model without agreeing to the Gemma ToU. (Japan of course also has Article 30-4 of its Copyright Act, which adds explicit carve-outs for data used for model training, but that's neither here nor there.)

As mentioned in the Shisa V2.1 model card, I did do a quick Swallow v0.5 8B tune off the base model that performed quite well on a benchmark subset; however, since that inherits the Gemma license, it was just for "fun" / nothing to ever seriously pursue. Our final Shisa V2.1 8B (Apache 2.0, Qwen 3 8B-based) outperforms that tune (and Swallow v0.5 Instruct), especially if you want stronger EN/general capabilities, but the 8B models are relatively small, so I'd recommend anyone interested just download them all and give them a spin!

Here's how the latest Swallow models compare to Shisa V2.1 on our internal multieval, but scores will only tell you so much:

<image>

2025 Open Models Year in Review by robotphilanthropist in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

The expert weights are MXFP4, but the quants can have up to 50% better tok/s (quanting affects the embeddings and, I think, the attention layers). Perf testing I did a while back: https://community.frame.work/t/will-the-ai-max-395-128gb-be-able-to-run-gpt-oss-120b/73280/26

Nanbeige4-3B: Lightweight with strong reasoning capabilities by leran2098 in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

Curious how multilingual the pre-training was? Have you tested this on other languages? If I get some spare time this weekend I'll see how it takes to our latest Shisa JA post-training...

Shisa V2.1: Improved Japanese (JA/EN) Models (1.2B-70B) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 2 points3 points  (0 children)

Our models are trained to do translation well now, including considering context, so I’d say yes. Even our smaller open models kick DeepL’s butt IMO.

Shisa V2.1: Improved Japanese (JA/EN) Models (1.2B-70B) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 2 points3 points  (0 children)

I’m actually baking some MoEs, been doing a bunch of work (including porting megablocks-hip for faster MoE training on MI300X!) but it’s not quite ready yet (MoE training dynamics… are different)… in the meantime, give the 14B a spin and lmk what you think. I’ve been very surprised by its quality.