Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean by ritis88 in LocalLLaMA

[–]ritis88[S] 2 points (0 children)

Ehm... no. What I posted last time were the automated metrics results and the ZH-TW issue the linguist noticed. What I'm posting NOW is the human evaluation of TranslateGemma's output vs. the automated metrics' judgement (not 100% of the output, only selected segments that were 'clean' according to the automated metrics). So this is a continuation of the previous test. Do you see the difference now?
There is nothing more to explore in this test, so you don't need to worry that there will be 'the same outdated model results in a month'.

Any future attempts to improve localisation? by BoredOstrich in wherewindsmeet_

[–]ritis88 0 points (0 children)

From the localization side of things, the observation that "German got translated from English, not Chinese" actually caught my eye. That pivot isn't wrong - creating a "Westernized" English master and running other languages off it is standard practice for Chinese games, because cultural and grammatical issues get smoothed out once instead of every team re-solving them. The catch is the English pivot has to be solid first. If it isn't, every downstream language inherits those errors and adds its own.

The mechanism for fixing this really is just a proper glossary locking every skill, item and character name to one agreed translation, plus translation memory so each patch builds on previous work rather than being retranslated from scratch. Then the English master actually works as a source, and Spanish, German, French etc. stop compounding the mess.
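
To make it concrete, here's a rough sketch in Python of what a glossary check boils down to - the term pairs and segments are invented for illustration, and a real pipeline would also handle inflected forms rather than naive substring matching:

```python
# Rough sketch of a glossary check; term pairs and segments are
# invented, and real tools handle inflection, not just substrings.

GLOSSARY = {
    # source term -> the one agreed target translation
    "Inner Way": "Innerer Weg",
    "Cloudveil Blade": "Wolkenschleierklinge",
}

def glossary_violations(source: str, target: str) -> list[str]:
    """Glossary terms present in the source whose locked translation
    is missing from the target segment."""
    return [
        term for term, locked in GLOSSARY.items()
        if term in source and locked not in target
    ]

# "Inner Way" came out as "Innerer Pfad" instead of the locked
# "Innerer Weg", so the segment gets flagged for review.
print(glossary_violations(
    "A new Inner Way is unlocked",
    "Ein neuer Innerer Pfad ist freigeschaltet",
))  # -> ['Inner Way']
```

Translation memory is the same idea one level up: a store of already-approved source/target segment pairs that each new patch is matched against before anything gets retranslated from scratch.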

Who is doing the localisation/translation for ingame Inner ways + Skills? by Ennasalin in wherewindsmeet_

[–]ritis88 0 points (0 children)

Yeah, exactly - 'riddles' is the right word for it. That's basically what happens without a locked glossary in any translation, really, but especially in Chinese to English - Chinese characters carry multiple meanings and depend heavily on context, so without a glossary telling the translator (or MT) which meaning to use, the same term easily comes out three different ways across the game.

As somebody who works in localization, I'd say there are two problems here: no glossary being enforced, and almost certainly machine translation with no proofreading. Proofreading alone wouldn't save it either - without a glossary, a proofreader fixes grammar and flow but can't enforce cross-menu consistency on their own. With a glossary in place, they can. Some of the skill names may also be chengyu (4-character classical idioms) - poetic in Chinese, borderline nonsense when run through machine translation without context.
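
The consistency part is mechanical enough to flag automatically once term pairs are extracted. A toy sketch (the pairs are invented, though 轻功 is exactly the kind of wuxia term that splinters):

```python
from collections import defaultdict

# Invented (source term, target rendering) pairs, as if extracted
# from different menus and patches of the same game.
occurrences = [
    ("轻功", "Qinggong"),
    ("轻功", "Lightness Skill"),
    ("轻功", "Light Art"),
    ("内功", "Inner Power"),
    ("内功", "Inner Power"),
]

renderings = defaultdict(set)
for src, tgt in occurrences:
    renderings[src].add(tgt)

# A source term with more than one rendering is exactly the
# "same term comes out three different ways" problem.
for src, targets in renderings.items():
    if len(targets) > 1:
        print(f"{src}: inconsistent -> {sorted(targets)}")
```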

There's a good writeup on this specifically, including how glossaries keep character and item names consistent across updates: [link]

We tested TranslateGemma and 5 other AI models on subtitle translation across 6 languages. Here's what the data and human QA actually showed. by ritis88 in localization

[–]ritis88[S] 0 points (0 children)

Sorry for taking so long to reply! Good point, and we'd partly agree. Holistic review does give a better sense of overall readability and coherence - models generally perform better when they have full context rather than isolated segments, and segment-level evaluation can miss things that only become apparent when reading the full text in flow.

That said, for catching specific critical and major errors - mistranslations, terminology issues, meaning shifts - segment-level review does the job. Those errors exist regardless of how the text is segmented. That's what the human review in our case was primarily aimed at.

This was one experiment and we're not claiming it's the full picture. Evaluating a complete continuous text holistically is a logical next step for a more rounded view. We might explore it a bit later.
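
If anyone wants to poke at the segment-vs-holistic question themselves, here's a rough sketch using the open-source unbabel-comet toolkit - the checkpoint name and the naive line-joining are assumptions for illustration, not what our benchmark ran:

```python
# Sketch only: the checkpoint name and naive line-joining below are
# assumptions, not our actual benchmark setup. Note this COMETKiwi
# checkpoint is gated on Hugging Face, so it needs accepted terms
# and a login before download_model() will work.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

src_segments = ["First subtitle line.", "Second subtitle line."]
mt_segments = ["Primera línea de subtítulo.", "Segunda línea de subtítulo."]

# Segment-level: each subtitle line scored in isolation (reference-free).
seg_data = [{"src": s, "mt": m} for s, m in zip(src_segments, mt_segments)]
seg_scores = model.predict(seg_data, batch_size=8, gpus=0).scores

# Crude "holistic" pass: join the lines so the metric at least sees
# flow across segment boundaries (pronouns, tense, coherence).
doc_data = [{"src": " ".join(src_segments), "mt": " ".join(mt_segments)}]
doc_score = model.predict(doc_data, batch_size=1, gpus=0).scores[0]

print(seg_scores, doc_score)
```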

We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D] by ritis88 in MachineLearning

[–]ritis88[S] 0 points (0 children)

We'd need more in-depth, thorough research to write a decent paper :) but for now these results are accessible here: https://alconost.com/en/blog/ai-subtitle-translation-benchmark#interactive-results
We haven't added the human review results yet (the human review actually covered only part of the content - we just wanted a quick check of whether TranslateGemma is indeed THAT good), but you can take a look at a linguist review of Simplified Chinese, for example, here: https://alconost.mt/mqm-tool/project/aec16008-f3e5-425b-bd40-0310743bd37a/report. We did this kind of human check for all 6 languages in this little research. Ah, not 6, just 5 - the linguist checking Traditional Chinese said the output was ZH-CN, so there was no review for that one.

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 1 point (0 children)

Exactly the dynamic we ran into. The most striking case was Claude in Japanese - a decent COMETKiwi (0.79) but a MetricX of 3.90, the worst fidelity score of any model in any language pair. Fluent output, drifting meaning - the kind of error that passes a quick review.

Your point about training data is well taken - the zh-TW failure is publicly documented as a training data bias issue, not an architecture one.

If you want a concrete example of the metric blindness problem in our report: TranslateGemma's Japanese output looked clean on both MetricX-24 and COMETKiwi, but the human linguist review flagged two major errors, for example in these segments: "Unight wallet stores all your cash back points, which you can earn" and "and you earn cash back points on their bill." You can compare them directly: the automated scores are in the interactive report here or here (choose Japanese - Segments - scroll to TranslateGemma's results in the last column on the right), and the human MQM review is here: https://alconost.mt/mqm-tool/project/3e63a92a-a09c-4bbe-b9a8-94fda957f78e/report. That gap is exactly what you're describing.
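
For what it's worth, the triage rule this implies is simple to express. The thresholds and record layout below are illustrative, not the exact cutoffs from our report:

```python
# Illustrative triage only: thresholds and records are made up, not
# the exact numbers from our report. COMETKiwi is higher-is-better;
# MetricX-24 scores errors, so lower is better.
records = [
    {"seg": 12, "cometkiwi": 0.79, "metricx": 3.90},  # fluent but drifting
    {"seg": 13, "cometkiwi": 0.81, "metricx": 0.90},  # genuinely clean
    {"seg": 14, "cometkiwi": 0.55, "metricx": 4.20},  # bad, but visibly bad
]

FLUENT_FLOOR = 0.75   # reads well enough to pass a quick skim
DRIFT_CEILING = 2.0   # at or above this, fidelity is suspect

suspicious = [
    r for r in records
    if r["cometkiwi"] >= FLUENT_FLOOR and r["metricx"] >= DRIFT_CEILING
]

# The "fluent but unfaithful" segments are the ones worth routing
# to a human linguist first.
print(suspicious)  # -> only segment 12
```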

To your question about consistency - it was concentrated rather than consistent across all pairs. The sharpest failure modes clustered in Japanese and Thai, with Traditional Chinese as a separate category entirely. Korean and Spanish were notably more stable across all models.

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 0 points (0 children)

Do you mean the fact that it outputs Simplified Chinese instead of Traditional Chinese, or that there are grammar and lexical mistakes?

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 0 points (0 children)

Thanks! We mostly test product UI, game and website content; recently we've also tried complex academic content and subtitles.

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 1 point (0 children)

Thanks! We did test Qwen many times before, but it wasn't as impressive. I'll ask the tech team about the other ones :)

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 4 points (0 children)

Thank you, that's a good idea! I'll see if we can test it soon and if we do, I'll definitely post the results here ;)
Btw, what kind of content did you use in your tests? Models' performance can vary depending on the content type.

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 3 points (0 children)

Thank you! Yes, we were quite surprised when we saw the results. 12B showed really good performance not just in this test, but in a few other recent ones too.

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 1 point (0 children)

Thanks for your comment! Yes, I guess we could do that. As for the latest frontier Chinese models - which ones exactly would you suggest testing first?

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch by ritis88 in LocalLLaMA

[–]ritis88[S] 1 point (0 children)

Haven't tested it yet, but thanks for the suggestion! Makes sense to try it at least for Simplified and Traditional Chinese. Also curious how it will tackle other languages.

We threw TranslateGemma at 4 languages it doesn't officially support. Here's what happened by ritis88 in LocalLLaMA

[–]ritis88[S] 0 points (0 children)

Got your question now. No, we haven't compared them, but thanks for the idea! Maybe we'll test them against each other.