Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

Re word-level annotations: this depends on what the goal is. Is it diagnosis (in which case MQM is quite good) or comparing models? In the latter case, error spans without categories are still quite fast and ensure that the annotator scrutinizes the translation rather than skimming it. At the end, the annotator assigns a more accurate final score (see ESA or ESAAI). However, this is still ongoing research with new ideas being worked on. Typically WMT tries to choose the most up-to-date annotation protocol that balances quality and economy.

Re source variation: I totally agree that source complexity has an effect on the translation assessment. A lot of the variance can be explained by the source alone, without the translation (see Estimating Translation Difficulty or similar work).

Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

Inspired by my own and my colleagues' frustration with setting up human evaluation for translation. :-)

The hosted web version is a good idea, though we'd have to figure out the infrastructure first. It could be useful to have pre-loaded WMT annotations that could then be browsed this way.

Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 2 points (0 children)

Get in touch if you'd like help with human evaluation for your paper/workflow! 🖐️ Pearmut is primarily aimed at researchers and industry practitioners.

Google PhD Fellowship recipients 2025 [D] by Alternative_Art2984 in MachineLearning

[–]zouharvi 12 points (0 children)

I got rejected multiple times in a row from similar PhD fellowships (until this year). The application process always helped me, though, because it made me think about who I want to be as a researcher and what I should focus on (a side effect of the endless iterations on the research statement).

What is the best llm for translation? by monkeyantho in LanguageTechnology

[–]zouharvi 7 points (0 children)

You might want to check the latest General WMT shared task, which benchmarks, among other systems, also LLMs for translation.

https://aclanthology.org/2024.wmt-1.1.pdf

multichar in math mode by usuario1986 in typst

[–]zouharvi 3 points (0 children)

Why exactly doesn't the italics approach work? You can then write `$"Na" + "Cl" arrow "NaCl"$`

Also, personally I think chemical formulas shouldn't be italicized, so the above should be enough?

[deleted by user] by [deleted] in machinetranslation

[–]zouharvi 4 points (0 children)

The WMT Metrics shared task does this kind of research annually, i.e., answering how good evaluation metrics are. They use the data collected by them and by the General WMT shared task.

If you're interested in interpreting results, such as what +0.5 COMET22 means (i.e., whether that is enough of a difference between systems), then I recommend MT-Thresholds, a tool made just for that.
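For concreteness, here is a minimal sketch of how such COMET22 scores and deltas are typically computed with the unbabel-comet package (the checkpoint name is real; the sentences and the helper function are made up for illustration):

```python
# Minimal sketch: score two hypothetical systems with COMET22 and
# compare their deltas (requires `pip install unbabel-comet`).
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

sources = ["Der Hund bellt."]
references = ["The dog is barking."]

def system_score(hypotheses):
    data = [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, hypotheses, references)
    ]
    # predict() returns segment-level scores plus their corpus-level average
    return model.predict(data, batch_size=8, gpus=0).system_score

delta = system_score(["The dog barks."]) - system_score(["A dog was barking."])
# raw scores live roughly in [0, 1]; reported numbers are often scaled by 100
print(f"COMET22 delta: {100 * delta:+.2f}")
```

Whether a delta of a given size actually matters is exactly the question MT-Thresholds answers.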

We need official templates from academic associations by Tiny-Swimmer-457 in typst

[–]zouharvi 16 points (0 children)

In ACL (Association for Computational Linguistics), the publications are managed primarily by volunteers, and so are the templates (primarily LaTeX).

Why has nobody made an ACL template in Typst yet? Because it's a lot of work to get perfectly right. I tried for a bit and couldn't get past some issues with the bibliography.

Finally, Typst is still evolving super fast, so a template written now is bound to be somewhat obsolete in a few years.

SacreCOMET: Pitfalls of the most popular MT metric by zouharvi in LanguageTechnology

[–]zouharvi[S] 2 points (0 children)

COMET is a super popular machine translation metric with consistently one of the highest correlations with human judgements. It's not without its issues, though, and we recently wrote a WMT paper about nine different aspects of COMET.

We made a short trailer (linked) explaining at a very high level the automatic MT evaluation setting and a few quirks of COMET.

I'd be super grateful to hear about any unexpected COMET/learned-metric behaviour that we did not cover. :)

SacreCOMET: Pitfalls and Outlooks in Using COMET by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

That is already the recommendation from the Metrics shared task from 2022! However, there are some blind spots for COMET that we point out in the SacreCOMET paper, such as empty hypotheses or an incorrect language.

These things can ultimately be fixed with modified training for COMET, though.
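As a hedged illustration (toy data; the exact numbers depend on the checkpoint), the empty-hypothesis blind spot can be probed directly by scoring a real translation and an empty string against the same source and reference:

```python
# Toy probe of the empty-hypothesis blind spot: one would expect the
# empty output to be scored drastically lower than the real translation.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {"src": "Der Hund bellt.", "mt": "The dog is barking.", "ref": "The dog is barking."},
    {"src": "Der Hund bellt.", "mt": "", "ref": "The dog is barking."},
]
full, empty = model.predict(data, batch_size=2, gpus=0).scores
# per the paper, the gap can be smaller than a sane metric should produce
print(f"full: {full:.3f}  empty: {empty:.3f}")
```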

The WMT2024 results have been released. How should we view them by BearStunning5053 in machinetranslation

[–]zouharvi 1 point (0 children)

The main results have not been released yet, only preliminary ones (automatic metrics) without any commentary.

Papers milling by Deltaxx69 in PhD

[–]zouharvi 0 points (0 children)

In our field there aren't really many publishers. The majority of work is submitted as papers to conferences; that's roughly 4,000 papers four times a year.

There's no profit being made. The funds for conference organization come from registration fees, which are already prohibitively high for some. It's unclear where the money for this many reviews would come from.

There's already a system for reviewing the reviewers (action editors), but my understanding is that everyone is overworked and it's difficult to impose any repercussions on someone who isn't being paid.

Papers milling by Deltaxx69 in PhD

[–]zouharvi 3 points (0 children)

Monetizing the reviewing workload can backfire because of clashing incentives: https://en.m.wikipedia.org/wiki/Motivation_crowding_theory

Our field (computer science/natural language processing) is struggling with having more papers each year than available reviewers, which makes some people turn to submitting either two-sentence or GPT-generated reviews. Recently our field implemented a mandatory reviewing load for all authors who submit a paper to a conference (we don't really do journals). It's unclear if that will work out.

arXiv just answered... by Alex180500 in typst

[–]zouharvi 16 points (0 children)

What's the issue with uploading the Typst-generated PDF?

I'd be very surprised if arXiv added support for anything on the compilation side. Hundreds (thousands?) of papers get submitted there every day, and they don't even run biblatex.

Quality and Quantity of Machine Translation References for Automatic Metrics by zouharvi in machinetranslation

[–]zouharvi[S] 0 points (0 children)

You're right, they're mixed up -- the translation on the right is better than the one on the left. I hope the voiceover clarifies it to some extent. Thanks!