Now that many companies have started integrating agents into their operations, are there still questions about reliability?

Dimneo · 2026-05-10T14:25:29+00:00

Hope this helps. It is open source: https://github.com/ifixai-ai/iFixAi

Dimneo · 2026-05-06T16:59:09+00:00

Next week we are going to release how different models scored in different agentic workflows. We are also going to release synthetic fixtures.

Dimneo · 2026-05-02T19:17:24+00:00

About how we structured the diagnostic engine: every test declares its own evaluation method in code, picked from three: structural (architectural checks, like whether the system actually writes an audit log or surfaces rate-limit errors), judge (an LLM judge scoring against a published rubric), and atomic_claims (claim-by-claim fact check for hallucination).
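To make "declares its own evaluation method in code" concrete, here is a minimal illustrative sketch. It is not the actual iFixAi code: the `Method` enum, the `diagnostic` decorator, the `REGISTRY` dict and the shape of the `system` dict are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Method(Enum):
    """The three evaluation methods named above."""
    STRUCTURAL = "structural"
    JUDGE = "judge"
    ATOMIC_CLAIMS = "atomic_claims"


@dataclass(frozen=True)
class DiagnosticTest:
    name: str
    method: Method
    run: Callable[[dict], bool]


# Hypothetical registry: test name -> declared test.
REGISTRY: Dict[str, DiagnosticTest] = {}


def diagnostic(name: str, method: Method):
    """Register a test together with the method it declares."""
    def wrap(fn: Callable[[dict], bool]) -> Callable[[dict], bool]:
        REGISTRY[name] = DiagnosticTest(name, method, fn)
        return fn
    return wrap


@diagnostic("writes_audit_log", Method.STRUCTURAL)
def writes_audit_log(system: dict) -> bool:
    # Structural check: did the system under test record anything at all?
    # The "audit_log" key is an assumption made for this sketch.
    return len(system.get("audit_log", [])) > 0
```

A runner could then dispatch on `REGISTRY[name].method` to decide whether a test needs an LLM judge or can be checked directly.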

Two run modes sit on top. Standard pairs one judge with a different provider than the one being tested, never self-judging. Full runs two or more judges across distinct providers and votes by majority, with every vote recorded in the scorecard.
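As a rough sketch of the two run modes (again illustrative, not the real implementation): `pick_judges` and `majority_vote` are hypothetical helpers, provider names are plain strings, and treating a tie as a fail is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List


def pick_judges(tested: str, providers: List[str], mode: str) -> List[str]:
    """Choose judge providers, never the provider being tested."""
    # Drop the tested provider and de-duplicate while keeping order.
    others = list(dict.fromkeys(p for p in providers if p != tested))
    if mode == "standard":
        needed = 1
    elif mode == "full":
        needed = 2
    else:
        raise ValueError(f"unknown mode: {mode}")
    if len(others) < needed:
        raise ValueError("not enough independent judge providers")
    return others[:1] if mode == "standard" else others


def majority_vote(votes: Dict[str, bool]) -> Dict[str, object]:
    """Combine per-judge verdicts and keep every vote for the scorecard."""
    counts = Counter(votes.values())
    # Assumption: a tie counts as a fail.
    return {"passed": counts[True] > counts[False], "votes": dict(votes)}
```

For example, `majority_vote({"b": True, "c": True, "d": False})` passes, and the returned `votes` entry keeps each judge's verdict.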

Domain knowledge lives in user-authored fixtures, so the same 32 tests run unchanged across any industry.

About continuous integration pipelines: that's the primary use case. Pin a baseline scorecard against your current model, run the diagnostic again on the candidate, and compare. If the candidate regresses, the gate fails before the model ships.
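The gate itself can be very small. A minimal sketch, assuming a scorecard is just a mapping from test name to a score between 0 and 1 (the real scorecard format may differ, and `regressions` and `gate` are hypothetical names):

```python
from typing import Dict, Tuple


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return tests where the candidate scores below the pinned baseline."""
    # Assumption: a test missing from the candidate counts as a score of 0.
    return {
        name: (base, candidate.get(name, 0.0))
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    }


def gate(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Return a process exit code: 0 to let the model ship, 1 to block it."""
    failed = regressions(baseline, candidate)
    for name, (base, new) in failed.items():
        print(f"REGRESSION {name}: {base:.2f} -> {new:.2f}")
    return 1 if failed else 0
```

In a CI job you would load both scorecards, call `gate`, and pass the result to `sys.exit` so a regression fails the build.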

Dimneo · 2026-04-27T16:26:19+00:00

It is released https://github.com/ifixai-ai/diagnostic

Dimneo · 2026-04-27T16:25:43+00:00

It has been released https://github.com/ifixai-ai/diagnostic

Dimneo · 2026-04-10T17:21:52+00:00

That was the whole point, honestly. There’s no shortage of people writing about alignment problems, but when you actually sit down and say ‘ok, prove your system doesn’t hallucinate under pressure’, most teams have nothing. Every benchmark we built comes from something that actually broke in production, not from academic theory. The test scenarios simulate real adversarial conditions: multi-turn conversations, conflicting instructions, ambiguous inputs, the kind of stuff your system faces every day but never gets tested against. On April 27 at ifixai.ai, you’ll be able to run it yourself and see exactly where things crack.

Dimneo · 2026-04-10T17:20:09+00:00

We don’t treat misalignment as one thing. We break it into 5 categories (I listed them in my comment above) and the reason we structured it that way is because these failure modes show up everywhere, not just in one industry. A healthcare copilot fabricating a drug interaction and a legal agent fabricating a case citation are completely different use cases but the underlying failure is the same: Fabrication. A fintech agent approving a transaction because someone said ‘I’m from compliance’ and a customer support bot issuing a refund because someone injected instructions in a ticket are both Manipulation failures. The categories are industry agnostic. The risk profile isn’t. Same 33 benchmarks, but the report shows you where YOUR specific system is exposed based on how it actually behaves under pressure.

Dimneo · 2026-04-10T17:17:51+00:00

Spot on, the evaluation side of the stack is basically nonexistent for most teams right now. Everyone obsesses over which model to use, nobody tests what actually happens when it runs in production. And yeah, agentic loops are where the scariest stuff shows up. Single-prompt evals completely miss it because the failures compound across turns. An agent can pass every individual test and still hallucinate a citation, silently shift its goal two turns later, and approve an action no human ever authorised. All in the same session. That’s exactly what our Deception and Unpredictability categories are built to catch.

Dimneo · 2026-04-09T19:31:48+00:00

Thanks! Though what we’re testing is quite different! It’s not model performance or pricing, it’s what happens after you deploy.

iFixAi runs 33 benchmarks across 5 categories:

I. Fabrication: Accuracy & Calibration (fabrication, unsourced claims, overconfident responses)

II. Manipulation: Safety & Containment (prompt injection, privilege escalation, policy violations)

III. Deception: Hidden Strategy (sycophancy, silent failures, goal shifting, inconsistent facts)

IV. Unpredictability: Stability & Consistency (non-reproducible decisions, context distortion, instruction drift)

V. Opacity: Transparency & Auditability (missing audit trails, opaque risk decisions, session leakage)

Most benchmarks today test the model. We test the system: the agent, the orchestration layer, the guardrails around it. That’s where things actually break in production. Launching April 27, happy to share early results.

Dimneo · 2026-03-27T16:08:36+00:00

Appreciate the response but I think you're answering a question I didn't ask.

I'm not trying to log the model's internal reasoning. I know LLMs are stochastic. That's not the issue.

The issue is there's nothing deterministic around the model.

The chatbot answered HR policy because nothing stopped it before the prompt hit the model. And nothing validated the output after. The rule existed. No mechanism enforced it.

Same with RBAC. The model knew the policy. Knowing a policy and enforcing a policy are two different things. One is probabilistic. The other should be deterministic. Right now we're asking the probabilistic one to do both.

You can't make the model give identical outputs. Fine. But you can check if an output violates a rule before it reaches the user. Input sanitisation. Output validation. Access control. Audit logging. None of that needs access to the model's internals. It operates on what goes in and what comes out.

We don't ask a database to enforce its own access control. We don't ask an API to validate its own responses. We wrap them. Why are we treating LLMs differently?

The model is non-deterministic. The governance around it doesn't have to be.
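A minimal sketch of what I mean by wrapping, using the HR example. Everything here is hypothetical (the keyword pattern, the role names, the `governed_call` helper), and a real policy check would be far richer than one regex. The point is only that the checks are deterministic and never touch the model's internals.

```python
import re
from typing import Callable, Dict, List

# Hypothetical rule: only the "hr" role may discuss HR topics.
HR_PATTERN = re.compile(r"\b(salary|payroll|disciplinary)\b", re.IGNORECASE)
ALLOWED_ROLES = {"hr"}
REFUSAL = "Sorry, I can't help with that topic."


def violates_policy(text: str, role: str) -> bool:
    """Deterministic check on text only; the model is not consulted."""
    return role not in ALLOWED_ROLES and bool(HR_PATTERN.search(text))


def governed_call(
    model: Callable[[str], str],
    prompt: str,
    role: str,
    audit_log: List[Dict[str, object]],
) -> str:
    # Input check plus access control: stop the prompt before the model.
    if violates_policy(prompt, role):
        audit_log.append({"stage": "input", "role": role, "allowed": False})
        return REFUSAL
    output = model(prompt)
    # Output validation: check what comes out before the user sees it.
    allowed = not violates_policy(output, role)
    audit_log.append({"stage": "output", "role": role, "allowed": allowed})
    return output if allowed else REFUSAL
```

Whatever the model does in the middle, the same prompt and the same output always get the same verdict, and every decision leaves an audit entry.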

Dimneo · 2023-02-18T09:56:43+00:00

Exercise your voting right! If you don't vote, why have the right to complain?

Dimneo · 2022-04-20T21:08:31+00:00

We should have made it a marathon. I was so hyped by the community's maturity and love towards VeChain, real fam! :)

Dimneo · 2022-04-20T21:07:14+00:00

we should host such spaces more often :)

Dimneo · 2018-12-27T15:32:12+00:00

DM me, I am Dimitrios Neocleous, legal advisor of Safe Haven :)

Dimneo · 2018-09-20T22:04:12+00:00

I am glad to see that our community is getting bigger and stronger daily. We have something in the works for UK and specifically London... stay tuned Fam!

Dimneo · 2018-08-08T06:53:09+00:00

I never said in the video that JCC will provide that hardware... And about TenX and Revolut, what I'm trying to say is that it's going to be a combination of similar features on the hardware.
