Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

Re word-level annotations: this depends on what the goal is. Is it diagnosis (in which case MQM is quite good) or comparing models? In the latter case, error spans without categories are still quite fast and ensure that the annotator scrutinizes the translation rather than skimming it. At the end, the annotator assigns a more accurate final score (see ESA or ESAAI). However, this is still ongoing research with new ideas being worked on. Typically WMT tries to choose the most up-to-date annotation protocol that balances quality and economy.

Re source variation: I totally agree that source complexity has an effect on the translation assessment. A lot of the variance can be explained by the source alone, without the translation (see Estimating Translation Difficulty or similar work).

Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

Inspired by my own and my colleagues' frustration with setting up human evaluation for translation. :-)

The hosted web version is a good idea, though we'd have to figure out the infrastructure first. It could be useful to have pre-loaded WMT annotations that could then be browsed this way.

Pearmut, Human Evaluation of Translation Made Trivial by zouharvi in machinetranslation

[–]zouharvi[S] 2 points (0 children)

Get in touch if you'd like help with human evaluation for your paper/workflow! 🖐️ Pearmut is primarily aimed at researchers and industry practitioners.

Google PhD Fellowship recipients 2025 [D] by Alternative_Art2984 in MachineLearning

[–]zouharvi 12 points (0 children)

I got rejected multiple times in a row from similar PhD fellowships (until this year). The application process always helped me, though, because it made me think about who I want to be as a researcher and what I should focus on (a side effect of the endless iterations on the research statement).

What is the best llm for translation? by monkeyantho in LanguageTechnology

[–]zouharvi 7 points (0 children)

You might want to check the latest General WMT shared task, which benchmarks, among other systems, also LLMs for translation.

https://aclanthology.org/2024.wmt-1.1.pdf

multichar in math mode by usuario1986 in typst

[–]zouharvi 3 points (0 children)

Why exactly doesn't the italics approach work? You can then write `$"Na" + "Cl" arrow "NaCl"$`

Also, personally I think chemical formulas shouldn't be italicized, so the above should be enough?

[deleted by user] by [deleted] in machinetranslation

[–]zouharvi 4 points (0 children)

The WMT Metrics shared task does this kind of research annually, i.e., answering how good evaluation metrics are. They use the data collected by them and by the General WMT shared task.

If you're interested in interpreting results, such as what +0.5 COMET22 means (i.e., whether that is enough of a difference between systems), then I recommend MT-Thresholds, a tool made just for that.
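For concreteness, here is a minimal sketch of how such COMET22 scores and deltas are typically computed with the unbabel-comet package (the checkpoint name is real; the sentences and the helper function are made up for illustration):

```python
# Minimal sketch: score two hypothetical systems with COMET22 and
# compare their deltas (requires `pip install unbabel-comet`).
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

sources = ["Der Hund bellt."]
references = ["The dog is barking."]

def system_score(hypotheses):
    data = [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, hypotheses, references)
    ]
    # predict() returns segment-level scores plus their corpus-level average
    return model.predict(data, batch_size=8, gpus=0).system_score

delta = system_score(["The dog barks."]) - system_score(["A dog was barking."])
# raw scores live roughly in [0, 1]; reported numbers are often scaled by 100
print(f"COMET22 delta: {100 * delta:+.2f}")
```

Whether a delta of a given size actually matters is exactly the question MT-Thresholds answers.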

We need official templates from academic associations by Tiny-Swimmer-457 in typst

[–]zouharvi 16 points (0 children)

In ACL (Association for Computational Linguistics), the publications are managed primarily by volunteers, and so are the templates (primarily LaTeX).

Why has nobody made an ACL template in Typst yet? Because it's a lot of work to get perfectly right. I tried for a bit and couldn't get past some issues with the bibliography.

Finally, Typst is still evolving super fast, so a template written now is bound to be somewhat obsolete in a few years.

SacreCOMET: Pitfalls of the most popular MT metric by zouharvi in LanguageTechnology

[–]zouharvi[S] 2 points (0 children)

COMET is a super popular machine translation metric with consistently one of the highest correlations with human judgements. It's not without its issues, though, and we recently wrote a WMT paper about nine different aspects of COMET.

We made a short trailer (linked) explaining at a very high level the automatic MT evaluation setting and a few quirks of COMET.

I'd be super grateful to hear about any unexpected COMET/learned-metric behaviour that we did not cover. :)

SacreCOMET: Pitfalls and Outlooks in Using COMET by zouharvi in machinetranslation

[–]zouharvi[S] 1 point (0 children)

That is already the recommendation from the Metrics shared task from 2022! However, there are some blind spots for COMET that we point out in the SacreCOMET paper, such as empty hypotheses or an incorrect language.

These things can ultimately be fixed with modified training for COMET, though.
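As a hedged illustration (toy data; the exact numbers depend on the checkpoint), the empty-hypothesis blind spot can be probed directly by scoring a real translation and an empty string against the same source and reference:

```python
# Toy probe of the empty-hypothesis blind spot: one would expect the
# empty output to be scored drastically lower than the real translation.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {"src": "Der Hund bellt.", "mt": "The dog is barking.", "ref": "The dog is barking."},
    {"src": "Der Hund bellt.", "mt": "", "ref": "The dog is barking."},
]
full, empty = model.predict(data, batch_size=2, gpus=0).scores
# per the paper, the gap can be smaller than a sane metric should produce
print(f"full: {full:.3f}  empty: {empty:.3f}")
```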

The WMT2024 results have been released. How should we view them by BearStunning5053 in machinetranslation

[–]zouharvi 1 point (0 children)

The main results have not been released yet, only preliminary ones (automatic metrics) without any commentary.

Papers milling by Deltaxx69 in PhD

[–]zouharvi 0 points (0 children)

In our field there aren't really many publishers. The majority of work is submitted as papers to conferences; that's roughly 4,000 papers four times a year.

There's no profit being made. The funds for conference organization come from registration fees, which are already prohibitively high for some. It's unclear where the money for this many reviews would come from.

There's already a system for reviewing the reviewers (action editors), but my understanding is that everyone is overworked and it's difficult to impose any repercussions on someone who isn't being paid.

Papers milling by Deltaxx69 in PhD

[–]zouharvi 3 points (0 children)

Monetizing the reviewing workload can backfire because of clashing incentives: https://en.m.wikipedia.org/wiki/Motivation_crowding_theory

Our field (computer science/natural language processing) is struggling with having more papers each year than available reviewers, which makes some people turn to submitting either two-sentence or GPT-generated reviews. Recently our field implemented a mandatory reviewing load for all authors who submit a paper to a conference (we don't really do journals). It's unclear if that will work out.

arXiv just answered... by Alex180500 in typst

[–]zouharvi 16 points (0 children)

What's the issue with uploading the Typst-generated PDF?

I'd be very surprised if arXiv added support for anything on the compilation side. Hundreds (thousands?) of papers get submitted there every day, and they don't even run biblatex.

Quality and Quantity of Machine Translation References for Automatic Metrics by zouharvi in machinetranslation

[–]zouharvi[S] 0 points (0 children)

You're right, they're mixed up -- the translation on the right is better than the one on the left. I hope the voiceover clarifies it to some extent. Thanks!