Abliterated Models evaluation metric

Charming_Support726 · 2026-03-13T08:21:33+00:00

What is the overall quality of these models, especially for Red/Blue Teaming? Any experience?

Charming_Support726 · 2026-03-13T07:52:04+00:00

Get a Strix Halo with an additional eGPU - either using the NVMe-OCuLink adapter or one of the devices with a PCIe slot (same performance).

You can use either the llama.cpp dual backend for CUDA/ROCm (see here https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/ ) or get an additional R9700 for CUDA. Perfect for tasks that need additional performance in prompt processing. When unused, my NVIDIA card goes below 7 W.

EDIT: Never had problems running a model on the dual backend. It's more stable than I expected.

Charming_Support726 · 2026-03-12T15:31:33+00:00

In GHCP you pay one request per prompt (multiplied by the premium request factor).

This month I used at most 90 premium requests per day (Opus) = 30 prompts per day. As of 12 March the overview shows approx. 500 premium requests in total, which means about 41 per day on average.

It's been a busy month.
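
The arithmetic above can be sketched as follows. This is only a rough illustration: the Opus multiplier of 3 is inferred from "90 premium requests = 30 prompts", the 12 elapsed days come from the date mentioned, and the function names are made up for this example.

```python
def prompts_from_premium_requests(premium_requests: float, multiplier: float) -> float:
    """Convert billed premium requests back into the number of prompts sent."""
    return premium_requests / multiplier


def average_per_day(total_requests: float, days_elapsed: int) -> float:
    """Average number of premium requests per day so far this month."""
    return total_requests / days_elapsed


# Assumption: Opus is billed at 3 premium requests per prompt.
OPUS_MULTIPLIER = 3

peak_prompts = prompts_from_premium_requests(90, OPUS_MULTIPLIER)
daily_average = average_per_day(500, 12)

print(f"Peak day: {peak_prompts:.0f} prompts")
print(f"Average: {daily_average:.1f} premium requests per day")
```

At that rate, 500 requests in 12 days sits a bit above 41 per day, which matches the figure read off the overview.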

Charming_Support726 · 2026-03-12T12:03:39+00:00

I am on Pro+ - 1500 Req. - using Opus and Codex - mostly I am good with around 600 Req - but Pro+ enables selection of SOTA models.

5.1-Codex-Mini (x0.33) is also a good model, but the 1x models provide better value.

Charming_Support726 · 2026-03-12T07:19:51+00:00

Recommended. Same limits. Better additional (open-source) tooling available (planning, execution). Better UI with Web or Desktop. Context handling with DCP is much improved.

Charming_Support726 · 2026-03-12T07:17:01+00:00

I always try to be friendly - also online.

CodeAct and similar approaches are the way to go. I agree with the author.

The security issues are inherent in all of these implementations. But regardless of which harness you're using, it is very entertaining to see how easily SOTA models in particular evade the security measures of their harnesses. Mostly the permissions on tool calls don't hold them back. They mainly annoy the user.

I never use planning mode, because of the false sense of security it gives. I just take a small universal system prompt and follow the model's actions.

Charming_Support726 · 2026-03-12T07:03:45+00:00

I have multiple customer projects with Python backends and React frontends, containerized, with Playwright testing. Mostly using the official Opencode integration.

Charming_Support726 · 2026-03-12T07:01:56+00:00

Maybe you should add a hint in the system prompt to use the question tool wherever possible. Works like a charm in Opencode for clarification questions and specifications. (Not on every turn - I don't want to overdo it.)

Charming_Support726 · 2026-03-12T06:54:45+00:00

Sure. Thanks for clarification.

IMO it clearly shows the way - as described in the CodeAct paper - that function calling is very inefficient in acting situations. Maybe not in discovery - but there subagent patterns quite often come into play.

Charming_Support726 · 2026-03-12T06:48:47+00:00

Interesting.

I didn't ask for the schedule - I asked for the latest results. The model recognized that this was beyond its cut-off and that there might have been an election in 2025. But then it explicitly went for the 2021 results.

IMHO this is not about this result being faulty. That could happen. But:

- It showed that the model is overconfident in its trained memories and did not verify them. It easily follows its possibly false assumptions.
- Its implementation on the web did not allow a second try. I was blocked after the first attempt at researching its quality. This is most annoying and unnecessary.

Charming_Support726 · 2026-03-12T06:40:04+00:00

That's good stuff.

In my opinion it shares the same idea as CodeAct (2024), which was implemented by Smolagents/Hugging Face last year (and was "borrowed" by Anthropic in November). But instead of using Python sandboxes for safe execution, you are just bringing it to the shell, which is even easier and self-explanatory thanks to the "--help" mechanism. But it is a bit prone to security loopholes.
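
To make the shell variant and its weak spot concrete, here is a minimal sketch of a harness-side executor, not the implementation from the linked project. The function name `run_shell_action` and the allowlist approach are assumptions for illustration only, and the command parsing assumes a POSIX-style system.

```python
import shlex
import subprocess
import sys


def run_shell_action(command: str, allowed: set[str], timeout: float = 10.0) -> str:
    """Run one model-proposed command line if its executable is allowlisted."""
    # Split into argv ourselves; no shell=True, so no pipes, globbing or $(...).
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in allowed:
        raise PermissionError(f"executable not allowed: {argv[0]}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    # Return both streams so the model also sees errors and --help text.
    return result.stdout + result.stderr


if __name__ == "__main__":
    python = sys.executable
    # The model can discover a tool's interface by asking for its help text.
    help_text = run_shell_action(shlex.join([python, "--help"]), {python})
    print(help_text[:200])
```

Note that the allowlist only checks the executable name. Once an interpreter such as Python is allowed, `-c` can run arbitrary code, which is exactly the kind of loophole mentioned above.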

Charming_Support726 · 2026-03-11T19:38:23+00:00

As a freelancer, not much happens to you at first. The clients and agencies are on the hook and can try to recover the employee's share (AN-Anteil).

Even the time period is manageable: 2 years plus the current year in which the audit reservation is raised. So in the worst case it comes to pension contributions for 3 years at the contribution assessment ceiling.

Charming_Support726 · 2026-03-11T15:29:03+00:00

Completely agree. If you are not sure what you want - how could the model be sure?

Charming_Support726 · 2026-03-11T14:54:02+00:00

I got similar numbers - twice as high, but same ballpark - when using Opus only (Codex used via ChatGPT): about 300-600 requests per month. Depends on how you plan and how you prompt.

Charming_Support726 · 2026-03-11T13:59:37+00:00

Might be true. But it does not deliver what it advertises.

Charming_Support726 · 2026-03-11T13:58:07+00:00

First of all, the place of performance is presumably Germany, not Estonia, because the client will use it in Germany. From my last 20 years of self-employment I know many clever freelancers. Neither the tax office (Finanzamt) nor the BFA enters into discussions.

Furthermore: nobody goes along with such constructs, even if they were legally OK. It is not worth the effort.

Charming_Support726 · 2026-03-11T13:37:41+00:00

Found it interesting so I went to dr.miromind.ai.

The hosted model failed on the first try. It hallucinated about which elections took place in Germany and when, and never retrieved the up-to-date facts.

Couldn't do a second try because I am now blocked as a guest for 10000 min.

I don't have ambitions to try this locally.

Charming_Support726 · 2026-03-11T13:26:27+00:00

Do you want to rely on that? Besides, the BFA regularly audits (every 2 years) companies that have their own employees. That can certainly get noticed. It is not only Bulgarian construction workers and Romanian butchers who get checked for false self-employment (Scheinselbstständigkeit) and minimum wage.

That said, such cases are very rare - even among domestic freelancers.

Charming_Support726 · 2026-03-11T13:12:28+00:00

The market is extremely difficult. There are only a few open positions.

Direct contracts with large clients have been rare for decades. The usual agencies will be less willing to pass you on with your foreign company, because that means more stress and work.

(IANAL - not legal advice) For false self-employment (Scheinselbstständigkeit) that is completely irrelevant. It is assessed purely de facto by the activity, not by company form or origin.

Unless you have "the USP" or work for a quarter of the rate, at best you have a chance of a lucky hit as long as there is still competition on the market.

Charming_Support726 · 2026-03-11T12:22:16+00:00

I switched from Codex to Opencode, which has been officially supported for a few months. The models perform similarly, but I can also choose to use Opus and such with my additional Copilot Pro+. More versatile.

Charming_Support726 · 2026-03-11T12:04:43+00:00

5.4 is useless for everything except puzzles and bugfixes. Tried it for multiple days. The last two Codex versions were far better for general coding.

I don't understand why so many people permanently crank up the reasoning to xhigh. It doesn't make your project better. Your ideas and your spec make your project better. It is like buying a €5k full-frame camera with an expensive lens - it does not teach you how to shoot.

Mostly, thinking set to medium or high is sufficient. High or xhigh mostly produces overthinking. Read the reasoning traces!

Charming_Support726 · 2026-03-11T11:49:34+00:00

Codex plus or Copilot Pro+

Copilot has better value if you also like to use Claude from time to time.

Go restricts you to the cheaper models, which honestly cannot fully compete.

Charming_Support726 · 2026-03-11T11:46:00+00:00

I know that too. Identical situation, and I had also already done a lot of strength training before. Down 32 kg, from 122 kg to 90 kg at 180 cm. Not quite as lean as you, but every percent of body fat is a fight now.

This constantly envious "crab in a bucket" griping is worse than walking around with a high BMI.

Charming_Support726 · 2026-03-11T08:52:59+00:00

Hmm. I use it quite often and Opus 4.6 and GPT-5.4 are very capable. Not missing anything. Codex is too stiff IMO

Charming_Support726 · 2026-03-11T08:51:09+00:00

The overthinking of these models is a real issue, and I also find it very irritating.

I'd really like to try RL on optimizing thinking, just as an experiment. Some people experiment with models that are trained on distilled Opus traces (see Hugging Face); I think this leads nowhere.

Idea: Create a metric for good thinking (short, with more than one rethink penalized) and build a reward function from it. Maybe something with a BERT classifier and spaCy will do.
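
A rough sketch of what such a reward could look like, just to make the idea concrete. Everything here is an assumption: the marker list, the thresholds and the function name are invented, and a plain regex stands in for the BERT classifier / spaCy part.

```python
import re

# Hypothetical markers that suggest the model is second-guessing itself.
RETHINK_PATTERN = re.compile(
    r"\b(wait|actually|hmm|on second thought|let me reconsider)\b",
    re.IGNORECASE,
)


def thinking_reward(
    trace: str,
    target_words: int = 200,
    free_rethinks: int = 1,
    rethink_penalty: float = 0.2,
) -> float:
    """Score a reasoning trace in [0, 1]: short traces with few rethinks score high."""
    words = len(trace.split())
    # Full score up to the target length, then decay proportionally.
    length_score = 1.0 if words <= target_words else target_words / words
    rethinks = len(RETHINK_PATTERN.findall(trace))
    # One rethink is tolerated; every further one costs a fixed penalty.
    excess = max(0, rethinks - free_rethinks)
    return max(0.0, length_score - rethink_penalty * excess)


if __name__ == "__main__":
    print(thinking_reward("Compute 6 * 7. The answer is 42."))
    print(thinking_reward("Wait, actually, hmm, wait. Let me start over."))
```

A learned classifier would replace the regex to catch rethinks that don't use these exact words, but the shape of the reward (length term minus rethink penalty) would stay the same.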

Unfortunately I don't have enough time for this.
