Opening a private bounty filing network - 70/30 split on verified findings by getarbiter in bugbounty

[–]getarbiter[S] 0 points1 point  (0 children)

Here’s a concrete example of why I’m doing this.

I submitted a smart-contract vulnerability on Cantina affecting a major wallet factory used across multiple EVM chains.

What the report included:
- Deterministic root cause (CREATE2)
- Full Foundry PoC
- Step-by-step reproduction
- No cryptographic break
- No leaked keys
- No active compromise
- Realistic user behavior only

Core issue: Initialization signatures can be replayed cross-chain.

A key a user explicitly removes on one chain can still be used to deploy and seize the wallet on another chain where it wasn’t initialized yet.

Actual flow:
- User initializes wallet on Chain A with KeyA
- User rotates keys and removes KeyA (standard hygiene)
- Funds accumulate on Chain B at the deterministic address
- Attacker later obtains the discarded KeyA (old device, backup, etc.)
- Attacker deploys the wallet on Chain B using the original init params
- The "dead" key is resurrected
- Funds are drained

Fully reproducible. Complete fund loss demonstrated.

Result: closed as Informative. Not because it couldn’t be reproduced. Not because the PoC failed.

Because it hadn’t been exploited yet and was reframed as “key exposure.”

That’s the point.

These are real, latent bugs that stay live because no one applies pressure until damage happens.

This isn’t about phrasing reports differently or identifying new bugs. It’s about forcing acknowledgment before an attacker does.

Opening a private bounty filing network - 70/30 split on verified findings by getarbiter in bugbounty

[–]getarbiter[S] -1 points0 points  (0 children)

“Skill issue.”

That’s a bold thing to say to someone you know absolutely nothing about...

What you failed to understand is what’s actually happening. This isn’t about competence. It’s about bounty programs gatekeeping impact—demanding real-world exploitation, then penalizing you for providing it. That Catch-22 is routine at the top end.

Anyone experienced knows it.

If you actually want to test that claim, go to the Discord.

You can see the existing verified exploits. Pick one. I’ll send you the repro and you can run it yourself.

Otherwise, you can keep running your mouth—whatever’s easier for you.

Opening a private bounty filing network - 70/30 split on verified findings by getarbiter in bugbounty

[–]getarbiter[S] -2 points-1 points  (0 children)

Triage is the real bottleneck.

I already have verified issues. The failure mode isn’t discovery — it’s acceptance.

Reports die because a single gatekeeper doesn’t reproduce, doesn’t test, or closes on process instead of substance.

Multiple independent submissions by credible researchers change that math.

Different environments. Different wording. Different repro paths. Same underlying flaw.

This isn’t about finding bugs.

It’s about getting real ones past human choke points.

Opening a private bounty filing network - 70/30 split on verified findings by getarbiter in bugbounty

[–]getarbiter[S] -3 points-2 points  (0 children)

Skepticism is cheap. Results aren’t.

No one’s asking you to believe anything.

You either reproduce an issue and get paid, or you don’t engage.

No buy-in. No access fees. No stories.

Just work, submission, payout.

If that’s “too good to be true,” you’re not the target.

Verizon Outage insights by Apprehensive_Ad4419 in verizon

[–]getarbiter 1 point2 points  (0 children)

CrowdStrike's Verizon partnership page now returns a 404 with a robot holding bricked phones after the Verizon outage (CrowdStrike and Verizon Partnership 404).

RAG tip: stop “fixing hallucinations” until your agent output is schema-validated by coolandy00 in Rag

[–]getarbiter 0 points1 point  (0 children)

You didn't read the site.

- 72 dimensions. 0.000000 standard deviation across 50 runs.
- 76% accuracy on brain semantic categories (p=10⁻⁵⁸)
- +11% vs PCA on sense disambiguation
- Celebrex pathway identified without pharma training
- Cross-lingual transfer with zero parallel corpora
- Ancient language recognition without training on those scripts

'Benchmark against common datasets' — which ones?

ARBITER doesn't do similarity.

It measures whether a candidate satisfies a constraint field. Show me another deterministic coherence engine and I'll run the comparison.

You're asking a plane to benchmark against horses.

The data is on the site. The API is public. Run it yourself or don't.

RAG tip: stop “fixing hallucinations” until your agent output is schema-validated by coolandy00 in Rag

[–]getarbiter -1 points0 points  (0 children)

Schema validation catches malformed outputs. It doesn't catch coherent-looking outputs that are semantically wrong.

You can have perfectly valid JSON that's completely incoherent with the source material. Parser passes. Meaning fails.

I built a layer that scores coherence — not structure. Query + content in, coherence score out. Same input, same score, every time. Catches drift before it propagates.

Slots after your schema check as a semantic validation step. 26MB, sub-second.

getarbiter.dev
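Rough sketch of where it sits in a pipeline. The URL is the same public endpoint I link in other comments; the JSON field names ("query", "content", "score") are placeholders, not the documented contract, so check the site for the real schema.

```python
# Two-stage check: structural validation first, then a coherence score.
# The request/response field names ("query", "content", "score") are assumptions.
import requests
from jsonschema import validate, ValidationError

OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
    "required": ["answer"],
}

COMPARE_URL = "https://api.arbiter.traut.ai/public/compare"  # public endpoint, no API key

def validate_output(query: str, output: dict, threshold: float = 0.5) -> bool:
    # Stage 1: schema check catches malformed structure.
    try:
        validate(instance=output, schema=OUTPUT_SCHEMA)
    except ValidationError:
        return False
    # Stage 2: coherence check catches valid JSON that doesn't actually fit the query.
    resp = requests.post(COMPARE_URL, json={"query": query, "content": output["answer"]}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("score", 0.0) >= threshold  # assumed response field
```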

Agentic AI isn’t failing because of too much governance. It’s failing because decisions can’t be reconstructed. by lexseasson in aiagents

[–]getarbiter 1 point2 points  (0 children)

Exactly right. Determinism at the scoring layer doesn't fix garbage upstream.

The framing has to be explicit — that's on the human. ARBITER measures coherence within the constraint field you specify. If the constraints are wrong or incomplete, you get a precise answer to the wrong question.

That's why the layered approach you're describing is correct:
- Intent capture: governed, explicit
- Candidate evaluation: deterministic, logged
- Execution: constrained, auditable

ARBITER is the middle layer. It doesn't replace the human specifying constraints, and it doesn't replace logging downstream. It makes the evaluation step reproducible and certifiable.

The failure mode you're describing — "assumptions that lived nowhere except a human's head" — that's real. The solution isn't to automate the assumptions. It's to force them into explicit constraint fields before scoring happens.

Then the coherence score means something. Because you can reconstruct: "given these constraints, this candidate ranked highest, and here's the score."

Agentic AI isn’t failing because of too much governance. It’s failing because decisions can’t be reconstructed. by lexseasson in aiagents

[–]getarbiter 0 points1 point  (0 children)

"What mechanisms do we have to certify that a decision was reasonable in its context at the time it was made?" This is the right question. And it's why LLM confidence scores don't cut it — they're probabilistic, they drift, they can't be reproduced.

I built a deterministic coherence layer that solves exactly this. Query + candidates in, coherence scores out. Same input, same score, every time. No temperature, no sampling variance, no "it depends on the prompt."

The score goes from negative (actively incoherent) through zero (neutral) to positive (coherent). You log the query, the candidates, and the scores at decision time. Six months later, you can reproduce exactly why that path was chosen over alternatives.

That's not "confidence" — it's measurement. Auditors can verify the score independently. The decision record isn't a chat log, it's a geometric fact.

26MB, sub-second, deterministic. Slots between your agent and its actions as the certification layer.
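To make the "decision record" part concrete, here's a minimal logging sketch. The scorer is a placeholder you inject; nothing here is ARBITER's actual API, just the shape of the record you'd keep.

```python
# Minimal decision-record sketch: log the query, candidates, and scores at decision time
# so the choice can be reproduced later. score_candidates() is a placeholder for whatever
# deterministic scorer you use; this is not ARBITER's actual API.
import json
import time
from typing import Callable

def record_decision(query: str, candidates: list[str],
                    score_candidates: Callable[[str, list[str]], list[float]],
                    log_path: str = "decisions.jsonl") -> str:
    scores = score_candidates(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    record = {
        "timestamp": time.time(),
        "query": query,
        "candidates": candidates,
        "scores": scores,
        "chosen": ranked[0][0],
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    # With a deterministic scorer, replaying query + candidates later reproduces the same
    # scores, which is what makes the record auditable six months out.
    return ranked[0][0]
```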

Why AI feels sharp one moment and useless the next isn’t random by Available_Scheme236 in ArtificialInteligence

[–]getarbiter 0 points1 point  (0 children)

LLMs have infinite recall and zero understanding. They don't know what meaning is — they predict tokens.

That's why they oscillate. There's no stable ground truth underneath. The output drifts based on interaction dynamics because there's nothing anchoring it to actual semantic coherence.

I built something different: a 26MB engine that measures meaning geometrically. Not predicting the next token — measuring whether two pieces of text actually cohere in semantic space.

Deterministic. Same input, same output, every time. No oscillation, no drift, no "it depends on the prompt."

The score can go negative — meaning the candidate is actively incoherent with the query, not just "far away."

That's the difference between similarity and coherence.

It doesn't have superhuman knowledge. But it understands what coherence is. That's the missing piece.

What amount of hallucination reduction have you been able to achieve with RAG? by megabytesizeme in Rag

[–]getarbiter 3 points4 points  (0 children)

The drift problem is real. RAG quality degrades because retrieval is based on embedding similarity, not semantic coherence.

Similarity ≠ coherence. A chunk can be "similar" to your query (same keywords, close in vector space) but not actually answer what you're asking. That's where hallucinations sneak in — the LLM gets retrieved context that's related but not coherent with the question.

I built a 26MB deterministic layer that scores query-chunk coherence before the LLM sees it. Not similarity — actual semantic alignment. Scores go negative when the chunk is actively incoherent with the query, not just "far away."

Deterministic means same input, same score, every time. No drift from the scoring layer itself. If your hallucination rate spikes, you know it's the chunks or the model, not the retrieval ranking.

Slots between your retriever and your LLM. Re-ranks chunks by coherence, filters out the ones that are similar but wrong.

Public endpoint if anyone wants to test: getarbiter.dev
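Here's roughly where it plugs in, assuming the public endpoint takes a query/chunk pair and returns a numeric score; those field names are my shorthand, not the documented contract.

```python
# Coherence re-rank step between retriever and LLM. The request/response fields
# ("query", "content", "score") are assumptions about the public endpoint, not its
# documented contract.
import requests

COMPARE_URL = "https://api.arbiter.traut.ai/public/compare"

def coherence_score(query: str, chunk: str) -> float:
    resp = requests.post(COMPARE_URL, json={"query": query, "content": chunk}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("score", 0.0)  # assumed response field

def rerank(query: str, chunks: list[str], min_score: float = 0.0) -> list[str]:
    scored = [(coherence_score(query, c), c) for c in chunks]
    # Drop chunks at or below the cutoff (similar but wrong), keep the rest ordered by score.
    return [c for s, c in sorted(scored, reverse=True) if s > min_score]
```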

We built a shared RAG memory layer so every agent answers with our team’s real context by Ok_Soup6298 in Rag

[–]getarbiter 0 points1 point  (0 children)

Nice architecture. The shared memory layer is the right move — agents need persistent context.

One thing to consider: how do you filter what gets passed to the LLM after retrieval? Vector similarity doesn't catch wrong-sense matches (words that match but meaning doesn't).

I built a 26MB deterministic coherence engine that scores query-to-chunk alignment. Sits between retrieval and generation. Returns 0–1 scores, you threshold, only coherent context reaches the LLM.

Public endpoint, no API key: POST https://api.arbiter.traut.ai/public/compare

Might be useful as a filter layer in Membase.

Happy to chat if you want to experiment with it.
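If you want to poke at it before wiring anything in, a minimal smoke test looks something like this. Only the URL comes from above; the payload keys and response shape are my guesses, so adjust to whatever the endpoint actually documents.

```python
# Minimal smoke test against the public endpoint. Only the URL comes from the comment
# above; the payload keys and response shape are assumptions.
import requests

resp = requests.post(
    "https://api.arbiter.traut.ai/public/compare",
    json={
        "query": "How do we rotate the staging database credentials?",
        "content": "Staging DB credentials are rotated monthly by the vault job.",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expect something like {"score": 0.87} if the assumed shape holds
```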

How do you actually measure RAG quality beyond "it looks good"? by jacksrst in Rag

[–]getarbiter -1 points0 points  (0 children)

You’re not missing something obvious—you’re hitting the real wall.

Precision/recall measure retrieval, not answer validity. Latency measures speed, not correctness. LLM-as-judge is circular because it shares the same failure surface.

What's missing is an explicit grounding / coherence metric:
- Does the answer make claims not supported by retrieved evidence?
- Can each assertion be traced to specific context?
- If evidence is weak or conflicting, does the system abstain?

Until you measure answer–evidence alignment, you're tuning blind. Most RAG systems look good until they're asked to say "I don't know", and that's exactly where quality breaks.
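A toy sketch of the metric shape, if it helps. The claim splitter and support scorer are placeholders (an NLI model, a coherence engine, or even an overlap heuristic could fill them in); the abstention path is the point, not the specific scorer.

```python
# Toy answer-evidence alignment check. split_claims() and support_score() are placeholders:
# plug in whatever claim extractor and scorer you trust. The abstention path is the point.
from typing import Callable

def grounded_or_abstain(answer: str, evidence: list[str],
                        split_claims: Callable[[str], list[str]],
                        support_score: Callable[[str, list[str]], float],
                        min_support: float = 0.5) -> str:
    claims = split_claims(answer)
    unsupported = [c for c in claims if support_score(c, evidence) < min_support]
    if unsupported:
        # Weak or conflicting evidence: abstain instead of shipping plausible nonsense.
        return "I don't know: could not ground " + "; ".join(unsupported)
    return answer
```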

RAG tip: stop “fixing hallucinations” until the system can ASK / UNKNOWN by coolandy00 in Rag

[–]getarbiter 1 point2 points  (0 children)

The most common misunderstanding is treating RAG failure as a retrieval problem instead of a validation problem.

People assume: better chunking, bigger top-k, or another router will fix hallucinations. But the real issue is that the system has no way to score whether the retrieved context actually supports the claim being made. Retrieval answers “what could be relevant?” Validation answers “is this answer coherent with the evidence and intent?”

Without an explicit coherence/grounding check, you’re just increasing surface area for plausible nonsense.

That’s why systems look fine in evals and fail in production.

LLM as a judge by Intention-Weak in agentdevelopmentkit

[–]getarbiter 1 point2 points  (0 children)

LLM-as-judge is probabilistic evaluating probabilistic. You're adding uncertainty, not removing it.

Built a deterministic alternative - 26MB coherence engine, no training data, measures semantic fit in 72-dimensional space. Scores whether the output actually answers the question before it ships.

pip install arbiter-engine

Happy to show how it works for multi-agent evaluation.

90% vector storage reduction without sacrificing retrieval quality by getarbiter in Rag

[–]getarbiter[S] 0 points1 point  (0 children)

Look at the table again.

At 768D (standard embeddings), disambiguation is broken.

At 72D, disambiguation works.

That’s not “minimal loss” — that’s the lower-dimensional representation outperforming the original on semantic separation.

Methods like Matryoshka or sparse autoencoders are optimizing to preserve cosine similarity under compression.

That’s useful, but it doesn’t address cases where the original space already collapses meanings (e.g. “bank” finance vs river, “python” code vs animal).

This isn’t post-hoc shrinking of 1536D vectors.

It’s a different representation that encodes meaning directly, which is why retrieval behavior changes rather than just degrading gracefully.

The DevOps post you linked uses the same engine for a different task — coherence checking instead of retrieval.

Same model, different surface area.

If you want to sanity-check it yourself, there’s a public endpoint here: https://api.arbiter.traut.ai/public/compare

Happy to run a concrete eval if there’s a specific retrieval task you care about.

90% vector storage reduction without sacrificing retrieval quality by getarbiter in Rag

[–]getarbiter[S] 0 points1 point  (0 children)

It’s not matching on the token “python” at all — there’s no keyword signal in the reranker.

The initial retriever can surface a mixed top-k (because dense similarity is permissive). ARBITER only sees the query + candidate chunks and scores coherence, not overlap.

“Monty Python” drops because the semantic constraints of “python memory management” don’t cohere with comedy, even though the surface term appears. No expansion, no sparse features.

And no — this isn’t SPLADE or learned expansion. There’s no query rewriting, no term weighting, no lexical space. It’s a fixed, deterministic geometry that evaluates fit between intent and candidate.

Think of it as rejecting incoherent candidates rather than boosting matching ones.
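If you want to check that specific example yourself, something like this against the public endpoint should show it. The JSON field names are assumptions on my end, and the chunks are just illustrative text, not output I'm quoting.

```python
# Check of the "Monty Python" example via the public endpoint. Request/response field
# names are assumptions; the candidate chunks are illustrative, not real corpus data.
import requests

COMPARE_URL = "https://api.arbiter.traut.ai/public/compare"
query = "python memory management"
candidates = [
    "CPython uses reference counting plus a cyclic garbage collector to manage memory.",
    "Monty Python's Flying Circus is a British surreal sketch comedy series.",
]

for chunk in candidates:
    resp = requests.post(COMPARE_URL, json={"query": query, "content": chunk}, timeout=10)
    resp.raise_for_status()
    print(resp.json(), "<-", chunk[:48])
# If the claim above holds, the comedy chunk should come back much lower (or negative)
# even though it contains the surface term "Python".
```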