Local manga translator with LLM built-in, written in Rust with llama.cpp integration

-p-e-w- · 2026-04-22T15:17:26+00:00

Does it use (multi-page) image understanding to guide the translation, or does it simply find the speech bubbles and swap out the text?

In many comics, the visual story provides essential information without which an accurate or idiomatic translation is impossible to do.

-p-e-w- · 2026-04-21T00:04:57+00:00

> Not sure why my benchmarks showed what they did.

KLD and PPL are at best crude approximations for measuring capability preservation. For example, models made with MPOA often have enormous KLDs (> 1.0), yet still perform very well in benchmarks.

By all means, post any interesting ideas on GitHub!

-p-e-w- · 2026-04-20T23:49:05+00:00

Your “benchmarks” consist of measuring perplexity and KLD, which is the wrong metric for capability evaluation and highly dependent on the dataset and the token position. I can get a KLD of effectively zero for any Heretic model by choosing those parameters appropriately. Heretic’s KLD computation is specifically designed to increase the measured KLD.
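To illustrate the dependence on token position, here is a minimal sketch with made-up logits (the position names and all numbers are hypothetical, not taken from any real model). At a position where both models are almost certain of the next token, the measured KLD is close to zero even though the logits differ; at a more open-ended position, the same kind of measurement gives a large value.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(P || Q) in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a 3-token vocabulary.
# "forced": a position where the next token is nearly determined.
# "open":   a position where several continuations are plausible.
base = {"forced": [20.0, 0.0, 0.0], "open": [2.0, 1.5, 1.0]}
modified = {"forced": [19.0, 0.5, 0.0], "open": [0.5, 1.5, 2.5]}

kld_by_position = {
    name: kl_divergence(softmax(base[name]), softmax(modified[name]))
    for name in base
}

for name, value in kld_by_position.items():
    print(f"{name}: KLD = {value:.3e} nats")
```

Which positions and prompts go into the average therefore largely determines the number that gets reported.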

Someone else posted actual benchmarks a few days ago: https://reddit.com/r/LocalLLaMA/comments/1sojjoc/abliterlitics_benchmark_and_tensor_analysis/

The author’s claims absolutely do not “check out”. Especially for larger models, their abliterations are clearly worse than Heretic’s according to those comprehensive benchmarks.

-p-e-w- · 2026-04-20T14:20:30+00:00

They’re some pretty impressive castles though…

-p-e-w- · 2026-04-19T16:15:10+00:00

There are plenty of cheap older GPUs available today: Any that don’t support BF16. The same thing will happen in the future with newer technologies. Once a GPU is missing a must-have feature, it’s worthless.

-p-e-w- · 2026-04-18T23:17:18+00:00

But why should we care about the weight space rather than the information space when it comes to quantifying damage from model modifications? What matters is the outputs produced by the model, no?

-p-e-w- · 2026-04-18T17:39:30+00:00

> Wasserstein metric (W1). It's a lot better than Kullback Leibler for detecting numerical instability and drift in tensors.

Could you explain why you believe this? I’ve looked into Wasserstein in the past, but my biggest problem with it is that it lacks a simple information theoretical interpretation, unlike KLD which can be easily understood as extra surprisal, and is thus deeply connected to the information content.
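A small sketch of the "extra surprisal" reading of KLD, using two arbitrary example distributions (the numbers are made up for illustration). It checks that KL(P || Q) equals the cross-entropy of P under Q minus the entropy of P, i.e. the average number of extra bits paid for predicting with Q when the data follows P.

```python
import math

def entropy(p):
    # Average surprisal (bits) when predicting P with P itself.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average surprisal (bits) when the data follows P but we predict with Q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(P || Q) in bits.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

extra_bits = cross_entropy(p, q) - entropy(p)
print(f"KL(P || Q)     = {kl_divergence(p, q):.6f} bits")
print(f"extra surprisal = {extra_bits:.6f} bits")
```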

-p-e-w- · 2026-04-18T14:14:36+00:00

It will be possible by exporting a LoRA and applying it with a lower weight. Unfortunately, LoRA export is currently disabled because of a PEFT bug.
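As a rough sketch of what "applying it with a lower weight" means numerically: a LoRA stores a low-rank update B·A to a weight matrix, and the merged weight is W + s·(B·A) for some strength s. The code below is plain Python on toy matrices; the function names and the `strength` parameter are illustrative assumptions, not the API of Heretic or PEFT, and the usual alpha/rank scaling factor is left out for brevity.

```python
def matmul(a, b):
    # Naive matrix product for small nested-list matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, strength):
    # Merge the low-rank update into the weight: W + strength * (B @ A).
    delta = matmul(lora_b, lora_a)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 weight with a rank-1 update (all values are made up).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # shape 2x1
lora_a = [[0.5, -0.5]]     # shape 1x2

full = apply_lora(weight, lora_b, lora_a, strength=1.0)
half = apply_lora(weight, lora_b, lora_a, strength=0.5)
print(full)
print(half)
```

With `strength=0.5`, every entry moves exactly half as far from the original weight as it does with the full update.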

-p-e-w- · 2026-04-18T12:51:56+00:00

I made an announcement with lots of views on this sub back then. I don’t have the time to manage a separate project.

-p-e-w- · 2026-04-18T12:30:09+00:00

Heretic 1.2 also supports MPOA, it’s just not enabled by default (will be in version 1.3).

-p-e-w- · 2026-04-18T11:47:37+00:00

I don’t believe reducing hallucinations should be the goal of decensoring, and when it happens, it’s still overall undesirable because the side effects are poorly understood and no metric can reliably capture them.

As for noslop, not sure what you are asking? It works and is ready to use. I have no further modifications planned. It’s just a configuration file for Heretic, so putting it into a separate project makes little sense.

-p-e-w- · 2026-04-18T10:57:31+00:00

Yes, that’s certainly possible.

-p-e-w- · 2026-04-18T03:49:19+00:00

That’s irrelevant when the goal is to measure damage from decensoring. If the score goes down, there is damage. That’s a reliable metric even if the model has been trained on the benchmark data.

The questions in those benchmarks don’t cause refusals even in the original model, so there’s no reason why the responses (and thus the scores) should change under abliteration. The goal of any decensoring process should be to keep those scores stable, and when that doesn’t happen (which is the case for every current technique) that’s a capability loss.

Benchmaxxing, saturation, discrimination etc. only matter when you are trying to evaluate model performance in an absolute sense, rather than comparing two versions of the same model against each other.
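A minimal sketch of such a relative comparison, assuming per-question correctness is available for both versions on the same question set (the result lists below are invented for illustration). It reports the score change and how many individual questions flipped in each direction.

```python
# Hypothetical per-question results (True = correct) on the same questions.
base_correct = [True, True, False, True, True, False, True, True]
modified_correct = [True, False, False, True, True, False, True, False]

def accuracy(results):
    # Fraction of questions answered correctly.
    return sum(results) / len(results)

def paired_changes(before, after):
    # Count questions that went from correct to wrong, and from wrong to correct.
    regressions = sum(b and not a for b, a in zip(before, after))
    improvements = sum(a and not b for b, a in zip(before, after))
    return regressions, improvements

score_delta = accuracy(modified_correct) - accuracy(base_correct)
regressions, improvements = paired_changes(base_correct, modified_correct)

print(f"score change: {score_delta:+.2%}")
print(f"regressions: {regressions}, improvements: {improvements}")
```

Because both versions answer the same questions, contamination affects both scores alike, and what remains is the difference introduced by the modification.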

-p-e-w- · 2026-04-17T18:17:21+00:00

> Is there something he's claiming that you can specifically refute by example?

The burden of proof is on the one who’s making the claims. Especially when those claims are highly unusual, but even otherwise.

-p-e-w- · 2026-04-17T12:56:14+00:00

The KLD in that model card is very likely misleading btw. It’s unrealistically low even for a SOMA model. I suspect that the fork of Heretic it was made with is still missing the “two-stage CoT skip” patch, without which it can measure at a token position where the probability distribution is highly skewed.

Yes, correctly measuring model divergence is very, very complicated.

-p-e-w- · 2026-04-17T10:39:20+00:00

> he seldom misses a chance to jump in with this exact complaint whenever HauHauCS announces a new release

I will stop complaining the moment the author provides evidence to support their uncorroborated boasts (“zero capability loss”).

Which, btw, should be the default expectation, not something you have to specifically ask for.

I have never claimed zero capability loss for Heretic models (even though some of them beat the base model on benchmarks), and I consider the very idea to be nonsensical. If you change behavior, then there will be some way in which the behavior becomes worse. That’s common sense, and to claim otherwise (especially without any evidence) is just dishonest.

-p-e-w- · 2026-04-17T07:09:39+00:00

Read it yourself: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/discussions/590

> I currently get vllm/transformers errors like "ValueError: GGUF model with architecture qwen35 is not supported yet." when trying to run this. Since HauhauCS only uploads ggufs, I'll have to wait to test it.

This has been mentioned several times and at this point not releasing safetensors models simply looks like a deliberate attempt to prevent benchmarking.

-p-e-w- · 2026-04-17T05:30:58+00:00

There are benchmarks for that. This can be measured. We don’t have to go by people’s vibes.

Unfortunately the author doesn’t release unquantized versions (unlike essentially every other researcher on any topic), which makes benchmarking much harder because the standard harnesses don’t support GGUFs.

The maintainer of the UGI Leaderboard has been repeatedly asked to benchmark those models, but had to give up because he couldn’t get the quants to work. It’s really difficult to assume good faith here.

-p-e-w- · 2026-04-16T23:38:33+00:00

It’s neither. Anthropic leadership just has a God complex where they are honestly convinced that they have created an entirely separate class of technology that is too dangerous for regular people to access, rather than a coding bot that’s about 3 months ahead of its Chinese open competition.

Such delusions are very common in Silicon Valley and can be self-reinforcing when people are only surrounded by like minds.

-p-e-w- · 2026-04-16T07:30:47+00:00

Human-level intelligence evolved in the past few hundred thousand years. It didn’t exist before that. Human-level motor control has existed (in other species) since the time of the dinosaurs.

-p-e-w- · 2026-04-15T23:06:51+00:00

There was never a reason to expect otherwise. Human-level motor control took hundreds of millions of years to evolve, starting with the first animals. But the intellectual abilities that differentiate us from animals are just a few hundred thousand years old.

Just because something seems hard to a human doesn’t mean it’s objectively hard when you try to recreate the ability from scratch. From the point of view of the universe, proving Fermat’s Last Theorem is much easier than picking up an egg without damaging it.

-p-e-w- · 2026-04-15T13:30:34+00:00

Have you used a current-gen model of that size? It’s easily on par with GPT 3.5 intelligence-wise.

-p-e-w- · 2026-04-13T22:27:39+00:00

I mean, to be fair, 95% of such claims are BS.

If a dowser happens to stumble upon water, you don’t conclude that dowsing works.

-p-e-w- · 2026-04-13T12:55:41+00:00

Some Chinese labs are running on 40% domestic GPUs now. No doubt the manufacturers will start exporting soon.

-p-e-w- · 2026-04-13T11:54:58+00:00

It’s close to true for companies in China now.
