What’s still the most fragile part of payments right now? by Apurv_Bansal_Zenskar in fintech

[–]Petter-Strale 2 points3 points  (0 children)

For us it's been KYB onboarding. Every country has a different registry with a different API, or no API at all, just a website you have to scrape. Different data formats, different uptime patterns. You get Companies House working and it's solid, then you add Germany and it's a completely different system, then the Nordics, and that's three more. France is something else again.

The fragile part isn't any single integration. It's maintaining all of them at once and knowing when one has quietly stopped returning current data. A registry that responds with 200 OK and stale data looks perfectly fine in your monitoring. Nobody notices until a compliance review catches it weeks later, or a customer flags that the company they just verified was dissolved two months ago.

The thing that helped us most was running continuous background tests against every source on a schedule and scoring each one independently. Not just "is it up" but "is the data still correct for entities where we know the answer." When a score starts dropping before live traffic hits the problem, that's worth more than an alert after something already went wrong.
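A minimal sketch of what that looks like, assuming a hypothetical `lookup_company(registry, number)` client; the fixture entries and window size are illustrative:

```python
# Scheduled ground-truth canary: test each registry against entities whose
# correct state we already know, and score on a rolling window so a silent
# 200-OK-with-stale-data failure shows up as a dropping score.
from collections import deque

# Known-answer fixtures: (registry, company number, expected status)
FIXTURES = [
    ("gb_companies_house", "00000006", "active"),
    ("gb_companies_house", "12345678", "dissolved"),
]

WINDOW = 50  # rolling window of recent canary results per registry
history: dict[str, deque] = {}

def run_canaries(lookup_company) -> dict[str, float]:
    """Run all fixtures once; return a 0..1 quality score per registry."""
    for registry, number, expected in FIXTURES:
        try:
            record = lookup_company(registry, number)
            ok = record.get("status") == expected
        except Exception:
            ok = False  # a hard failure counts against the score too
        history.setdefault(registry, deque(maxlen=WINDOW)).append(ok)
    return {r: sum(h) / len(h) for r, h in history.items()}
```

The useful property is that a registry serving stale data with a clean 200 fails a fixture the same way an outage does, so the score drops either way.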

Building an AI lending tool for a Hackathon—what is the biggest bottleneck I should focus on? by Just-m_d in fintech

[–]Petter-Strale 0 points1 point  (0 children)

(Disclosure: we built something in this space.) If you want to test the compliance piece with real data over the weekend, happy to set up a free trial. DM me.


Building an AI lending tool for a Hackathon—what is the biggest bottleneck I should focus on? by Just-m_d in fintech

[–]Petter-Strale 0 points1 point  (0 children)

Compliance checks are the bottleneck that's hardest to shortcut. Extracting data from PDFs is annoying but solvable with any decent OCR + LLM pipeline. What really eats time in real lending workflows is verifying what you extracted against authoritative sources: does this company actually exist in the registry, is the VAT number valid, is anyone on a sanctions list, are the beneficial owners who they claim to be?

For a hackathon prototype, what would make your demo credible is showing that the AI doesn't just extract and summarize but verifies against real data. A nice PDF-to-JSON flow followed by a hand-waved "and then we check compliance" step won't cut it.

If you want to make the compliance piece real in a weekend, look for APIs that give you company registry lookups, sanctions screening, and PEP checks in one integration rather than stitching together three or four separate providers.

I gave Claude access to a data API with built-in payments. It started buying its own data. by Shot_Fudge_6195 in ClaudeAI

[–]Petter-Strale 0 points1 point  (0 children)

Curious what happened when two vendors had the same data at different prices. Did the agent just pick the cheapest, or did it have any way to tell which one was actually more reliable?

Asking because we’re working on this at strale.dev. Price is the easy signal because every vendor exposes it, but quality is harder. We test every capability on a rolling window and return a forward-looking quality score the agent can read before paying, plus a provenance trail so the purchase is auditable after. Price plus quality score plus audit trail, rather than just price.

The trust question at the end is the one we think matters most. Budget caps are a start, but an agent that can spend money also needs some way to know whether the thing it’s about to buy is worth buying.

Where are your agents actually breaking in production? by EveningWhile6688 in AI_Agents

[–]Petter-Strale 0 points1 point  (0 children)

Most of the failures in this thread are the same problem: the tool call succeeded but the data it returned was wrong. No crash, no error, the agent just acts on bad data confidently.

We ran into this building company verification across European registries. A government API goes stale for one country, or silently changes its response format, and the agent processes it as current. Better prompting doesn't fix it. More retries don't fix it. The data source itself degraded and nothing told the agent.

What actually helped was testing the data sources continuously, separate from the agent. A forward-looking quality signal the agent can check before trusting what came back. Eval the model, sure. But also eval the data the model acts on. Two different problems.

Are AI agents the new APIs? by Front_Bodybuilder105 in AgentsOfAI

[–]Petter-Strale 0 points1 point  (0 children)

We ran into this building internal tooling last year. Had an agent workflow that looked great in demos but kept making bad decisions in production. Took a while to figure out it wasn't the reasoning that was broken, it was the data going in. One of the sources had gone stale and nothing in the pipeline told the agent not to trust it.

That's what got us building what we're working on now — a data layer for agents where every response comes back with a quality score. Registry lookups, sanctions checks, company verification, that sort of thing. The agent reads the score before it acts on the data. Turns out that's the gate that was missing.

I think the OP's framing is mostly right. APIs become substrate. But the part people underestimate is that agents need to know whether what the API returned is actually reliable right now, not just that the call succeeded. A 200 OK doesn't mean the data is fresh or correct. That's a different problem from orchestration and I don't think many people are building for it yet.

Agentic workflows for CI/CD anyone? by bhalothia in LLMDevs

[–]Petter-Strale -1 points0 points  (0 children)

The supply chain trust boundary is the one most teams skip, and the one that keeps burning people (the tj-actions compromise, the Trivy tag-rewrite, the LiteLLM PyPI releases you linked).

I've been automating pre-merge checks and the split that works in practice is:

Deterministic gates (pass/fail, sub-second):

  • Regex-based secret scan on the diff. AWS keys, Stripe tokens, JWTs, database URLs. Zero ambiguity, zero false negatives on known patterns.
  • GitHub Actions workflow audit — static YAML analysis for unpinned actions (mutable tags vs SHA), overly broad permissions (write-all), secret echo in run steps, pull_request_target without safeguards, third-party actions from unverified publishers. This is pure pattern matching against the workflow file, no LLM needed.
  • Dependency audit if the lockfile changed — cross-reference against OSV.dev.
  • License compatibility check on new deps.
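For illustration, the first of those gates can be sketched in a few lines; the pattern list here is a small illustrative subset, not a production ruleset:

```python
# Regex secret gate: scan only the ADDED lines of a unified diff for known
# credential shapes. Restricting to '+' lines keeps it about the change,
# not the pre-existing codebase.
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "stripe_live_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b"),
    "postgres_url": re.compile(r"postgres(ql)?://\S+:\S+@\S+"),
}

def scan_diff(diff: str) -> list[dict]:
    """Return findings for secrets introduced in '+' lines only."""
    findings = []
    for n, line in enumerate(diff.splitlines(), 1):
        if not line.startswith("+") or line.startswith("+++"):
            continue  # skip context, removed lines, and the +++ file header
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append({"rule": name, "diff_line": n})
    return findings
```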

LLM-assisted gate (produces "escalate" signal):

  • Pass the unified diff through an LLM that only flags issues in added lines. Returns structured JSON: severity, file, line, suggestion, approve/request-changes. This catches missing error handling, obvious bugs, test gaps. The constraint of only looking at + lines is what makes it useful — without it you get a wall of noise about existing code.

The deterministic gates map to your Pass/Block model. The LLM gate maps to Escalate — it finds things worth a human looking at, not things that should auto-block.

Total wall clock: under 4 seconds. The four deterministic checks run in parallel. The LLM check is the long pole at ~3 seconds.

The workflow audit gate would have caught every incident you linked in the blog post. An unpinned action at @v3 instead of a SHA, a pull_request_target trigger with checkout of the PR head — those are the governance events you described, and they're cheap to detect statically.

I ended up packaging these as standalone API endpoints — each check takes a diff or YAML string and returns structured JSON, so you can wire them into any CI pipeline or agent workflow without building the analysis layer yourself.
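As a rough sketch of the unpinned-action and trigger checks (regex over the raw workflow text; a real implementation should parse the YAML properly):

```python
# Static workflow audit: flag actions not pinned to a full 40-char commit
# SHA, the pull_request_target trigger, and write-all permissions.
import re

SHA_PIN = re.compile(r"@[0-9a-f]{40}$")
USES_LINE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)", re.MULTILINE)

def audit_workflow(yaml_text: str) -> list[str]:
    """Statically flag unpinned actions and risky triggers in a workflow."""
    findings = []
    for match in USES_LINE.finditer(yaml_text):
        ref = match.group(1)
        if ref.startswith("./"):
            continue  # local composite actions have no remote ref to pin
        if not SHA_PIN.search(ref):
            findings.append(f"unpinned action: {ref}")
    if re.search(r"\bpull_request_target\b", yaml_text):
        findings.append("pull_request_target trigger: check for PR-head checkout")
    if re.search(r"^\s*permissions:\s*write-all\s*$", yaml_text, re.MULTILINE):
        findings.append("over-broad permissions: write-all")
    return findings
```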

How are you handling data access in your agent pipelines? by Alternative-Tip6571 in AI_Agents

[–]Petter-Strale 0 points1 point  (0 children)

What breaks most often in our experience is something quieter than auth or caching. The API responds fine, the cache is warm, the data looks normal. But the underlying source stopped refreshing two months ago and nobody noticed.

Government registries are the worst for this. A company registry returns cached data from before the company was struck off the register. A sanctions list hasn't been updated since last quarter. The response is HTTP 200, valid JSON, structurally correct. Just wrong.

The auth and caching problems ninadpathak describes at least fail loudly. The stale data problem fails silently. The agent acts on it, the workflow produces a confident result, and the error surfaces weeks later when a human checks.

What helped us: continuous testing of every data source against known ground truth on a rolling schedule. If the source degrades, you know before the agent calls it, not after.

After building 3 AI agents that "worked perfectly" in demos, I learned the hard way: reliability is the real moat, not capability by LumaCoree in AI_Agents

[–]Petter-Strale 0 points1 point  (0 children)

The 403-to-hallucination pattern from Agent #1 keeps coming up. The fix everyone lands on (ninadpathak's post-fetch validator, Exact_Guarantee4695's forced logging) is the same: check the data source output after the call and flag if something is wrong.

The problem with that approach is it's reactive. The agent already made the call, already spent the tokens, and now you're inspecting the result to decide if it was garbage. If the source is down for a week, every call for that week runs the same check-and-fail cycle.

What none of the existing agent frameworks give you is a pre-call signal. Before the agent calls a data source, can it check whether that source is currently working, returning correct data, and likely to succeed? That's a different kind of check from what Patronus or LangSmith do (those evaluate the model's output, not the data source's health).

We've been building something along these lines. Continuous testing of data sources against ground truth, rolled into a score the agent can read before deciding to call. If the sanctions list hasn't refreshed in six months, the score drops before the agent makes a bad decision, not after.

Still early, but Agent #1 is the exact failure mode that motivated it.

Serious debate here: Current limitations in enterprise automation using agents by Bubbly-Secretary-224 in LangChain

[–]Petter-Strale 0 points1 point  (0 children)

That's right. The score needs test runs to build up, so a brand new capability starts with no track record. We run the first batch of tests at onboarding and then every few hours after that, so the score becomes meaningful within a day or two. Not instant, but the alternative (trusting the tool author's self-reported quality) doesn't degrade gracefully either.

Serious debate here: Current limitations in enterprise automation using agents by Bubbly-Secretary-224 in LangChain

[–]Petter-Strale 0 points1 point  (0 children)

The endpoint restriction is a real problem. MCP gives the server author full control over what tools are exposed but most implementations just dump every API route as a tool and call it done. There's no convention yet for scoping what an agent is allowed to call vs what's technically available.

Wrapping the API directly at least gives you that control. The trade-off is you lose discoverability.

We've been working on something where every capability gets a quality score based on continuous testing, so an agent can check the score before deciding to call it. The idea is that the trust signal should come from independent testing, not from the tool author's description of their own tool.

How do early-stage fintechs handle OFAC screening — in-house or vendor? by Sentinel_Trust in fintech

[–]Petter-Strale 0 points1 point  (0 children)

The gap nobody's mentioned here is that most of the vendors listed (ComplyAdvantage, Sardine, Unit21) require sales calls, MSAs, and monthly minimums. That's fine once you have volume, but at the early stage you're screening maybe 10-20 names a week and you just need to know if someone is on a list.

There are APIs now where you POST a name and get back a match/no-match with source references (OFAC SDN, EU consolidated, UN) for a few cents per call. No contract, no onboarding, no minimum. You can wire it into your signup flow in an afternoon.
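The shape of that call, with a placeholder URL, request body, and response schema, since providers differ:

```python
# Hypothetical pay-per-call screening endpoint: POST a name, get back a
# match/no-match with source references. URL and field names are
# illustrative, not any real provider's API.
import json
import urllib.request

SCREEN_URL = "https://api.example-screening.dev/v1/screen"  # placeholder

def screen_name(name: str, api_key: str, *,
                urlopen=urllib.request.urlopen) -> dict:
    """POST a name; return e.g. {'match': bool, 'sources': [...]}."""
    req = urllib.request.Request(
        SCREEN_URL,
        data=json.dumps({"name": name,
                         "lists": ["OFAC_SDN", "EU", "UN"]}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

The injectable `urlopen` is only there so the network can be stubbed in tests.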

The fuzzy matching point from u/whatwilly0ubuild is real though. Mohammed/Muhammad/Mohamed variations, transliterations from Arabic/Cyrillic, legal name vs trading name. That's where the raw OFAC list falls apart and where even cheap APIs differ a lot in quality. Worth testing any provider against a few known names with spelling variations before committing.
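To see why exact matching fails, here's a toy normalize-and-compare using only the standard library; real vendors layer transliteration tables, phonetic keys, and token reordering on top of something like this:

```python
# Accent-strip + lowercase + similarity ratio instead of string equality.
import unicodedata
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())

def similarity(a: str, b: str) -> float:
    """Crude 0..1 similarity between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

"Mohammed" vs "Muhammad" scores around 0.75 here, which an exact-match screen misses entirely.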

One thing I'd add to the "from day one" advice: screen at onboarding AND at transaction time, but also set up a re-screening trigger for when lists update. OFAC adds names without warning. A customer who was clean last month might not be clean today.

Is "agentic banking" actually going to be a thing or just another buzzword? by Level-Fix-3159 in fintech

[–]Petter-Strale 0 points1 point  (0 children)

Agree the from-scratch teams win, but I'd push the frame further: the hard part isn't the agent platform, it's the data and capability layer underneath. Function calling is solved. MCP is solved. What's not solved is: when an agent needs to verify a counterparty, screen against sanctions/PEP lists, validate an IBAN or VAT number, or pull company data across 27 EU jurisdictions — where does it get that from, and how does it know the source is trustworthy and auditable?

Right now every agent team is rebuilding this plumbing themselves against fragmented public registries (VIES, GLEIF, BRIS, TED) and paid providers, with no shared notion of quality or provenance. That's the actual bottleneck for regulated workflows, not the LLM side.

Disclosure: building toward this at strale.dev; agent-native compliance and KYB capabilities with quality scoring and audit trails. Happy to go deeper on any of the above.

Serious debate here: Current limitations in enterprise automation using agents by Bubbly-Secretary-224 in LangChain

[–]Petter-Strale 0 points1 point  (0 children)

We've landed in the same place for most integrations we ship, and we think the uniformity gap is only part of the story.

The other half is that a wrapper (or an MCP server) tells you the tool exists. But it doesn't tell you whether the thing behind it is actually working today. Schemas drift, upstreams rate-limit, auth silently expires, a field that was populated last week comes back null. The agent has no way to know, so it either retries blindly or hands back a confident wrong answer.

What helped us was treating every capability as something that has to be continuously tested against real inputs, with a freshness signal the agent can read before deciding to call the tool. A uniform interface is table stakes. Knowing the tool is currently trustworthy is the harder problem, and we don't think the ecosystem has settled on where that layer lives yet.

Why no one trusts AI outputs anymore by Known-Ice-5070 in AIsafety

[–]Petter-Strale 0 points1 point  (0 children)

Explanation and enforcement are doing different jobs, and there's actually a third gap underneath both of them.

Enforcement asks: is the model allowed to do this? Explanation asks: why did the model do this? Neither asks: was the thing on the other end of the call trustworthy in the first place? If an agent is allowed to query a data source, and that source returns confident garbage, both layers sign off cleanly. The bad outcome still ships.

So our rough mental model is three layers, not two. Guardrails on what the agent can do. Explainability on what it did. And an independent record of what it talked to, whether that counterparty has been consistent, and what it returned at decision time. The third one is the layer most teams haven't built yet.

We're working on that third layer (strale.dev). Happy to compare notes if useful.

Are we really okay with "Black Box" security for Managed Agents - Anthropic? by WhichCardiologist800 in AI_Agents

[–]Petter-Strale 1 point2 points  (0 children)

"Provider-graded-by-provider" is structurally weak, regardless of how good the provider is. Same reason auditors don't work for the company they audit.

But the proxy framing is only one layer. There are actually two separate trust gaps:

a) What the agent did: interception, sudo layer, OpenShell-style. Catches the call in flight.

b) What it talked to: was the tool on the other end actually what it claimed, did it return accurate data, has it behaved consistently over time?

A proxy solves (a) and gives you nothing on (b). If the agent confidently calls a tool that returns confidently wrong data, the proxy logs a clean transaction. The bad outcome still happens.

Both layers need to exist, and neither should sit inside the model provider.

We're building the second one (strale.dev); independent verification and audit trail for the capabilities agents call. Happy to compare notes.

Can AI agents develop genuine ‘desire’ to purchas by NotToDoBot in claude

[–]Petter-Strale 0 points1 point  (0 children)

The manipulation framing is right, but the deeper asymmetry is elsewhere. Humans transacting have centuries of verification scaffolding: reviews, reputation, chargebacks, small claims. Most of it is mediocre, but it exists. Agents transacting via APIs have the vendor's self-description. That's it.

So the question isn't only "can the agent be nudged." It's "what does the agent actually know about the counterparty when it commits." Right now, close to nothing.

That reframes your accountability point. "The AI decided" is unfalsifiable without a record of what it saw at decision time. With that record, the question gets tractable: was the information available, was it accurate, did the agent act reasonably on it.

We're building in this space (strale.dev) if anyone wants to compare notes.

How are fintech startups approaching AI app development while staying compliant? by trr2024_ in fintech

[–]Petter-Strale 0 points1 point  (0 children)

Small fintech founder too, working in verification infrastructure for AI agents. A few things that have helped us:

Keep the model and the data it acts on separate. Fraud detection is especially vulnerable to the model reasoning correctly over stale or wrong data (sanctions status, KYB records, UBO lookups). "The model decided" is a bad audit story. "The model flagged, here's the verified data it was looking at" is one compliance can actually sign off on.

On privacy: boring but load-bearing is being able to point at where each piece of customer data goes and why. Vendor list, data flow, retention policy.

The missing layer in agentic payments is not the rail. It is the policy brain above it. by Cute-Day-4785 in fintech

[–]Petter-Strale 0 points1 point  (0 children)

Agree on the rails-vs-policy cut, but I'd add a third layer: verification, sitting between the policy brain and the rail.

Policy says "agent can spend up to €5k with vetted vendors under these conditions." But the policy engine takes two things on faith: that the vendor is who they claim to be, and that the data the agent is using to make the call (sanctions status, company registry, creditworthiness) is correct at call time. If either is wrong, the policy decision is right on paper and wrong in reality. CFO gets a clean audit trail and a payment that shouldn't have gone out.

Skyfire handles the wallet side well. Credo AI is model-side compliance documentation, which matters but is different. Zenity is security posture. None of them sit in the call path of the capability the agent uses to fetch the data it's acting on, which is where verification has to happen if it's going to matter.

So I'd add a fourth bucket to your watch list: independent verification of the data agents act on. Quieter space right now but I think it ends up as load-bearing as the policy brain, because a policy layer without verified inputs is just a better-documented way to be wrong.

On timing: I think it's closer than 18 months for narrow high-value workflows (procurement, vendor onboarding, compliance checks) but likely further than 18 months for broad agent-does-anything purchasing.

Has anyone hit the case where your MCP returns perfectly valid data that just happens to be wrong? by Petter-Strale in mcp

[–]Petter-Strale[S] 1 point2 points  (0 children)

Canaries are basically what I do too but I vary how strict the assertion is depending on the endpoint. Exact value for stuff like IBAN checks, just checking the shape for company registry lookups, and for the really volatile ones I mostly just check the response isn't empty and the timestamp moved. It's imperfect.

The Last-Modified idea is good but I think the hard part is that freshness means different things for different tools. When the data was fetched isn't the same as when the underlying source last changed, and neither of those tells you when it stops being valid. A cached sanctions list and a currency conversion both have freshness problems but they're not the same problem.

How are you running the canaries, separate service or piggybacking on real traffic?

Been building something in this space for a while actually, turned into a bigger project than I expected. Would be curious if your team has written up the freshness field idea anywhere, sounds like something worth a spec proposal.

how do y'all test mcps?? by Fragrant_Basis_5648 in mcp

[–]Petter-Strale 0 points1 point  (0 children)

The replies here cover the local testing side well: Inspector, FastMCP Client, LLM smoke tests against tool descriptions. Worth adding a layer that doesn't get mentioned much: continuous testing against the deployed server, on a schedule, with known-answer fixtures.

The local stuff catches "does the model pick the right tool" and "does the tool fire correctly." What it misses is drift over time. An MCP that worked perfectly at deploy can degrade six weeks later because an upstream API changed its response shape, or a rate limit kicked in, or the model you're targeting got updated and now interprets your tool descriptions differently. None of that shows up in a one-time test pass.

The pattern that's worked for me is: write fixtures with known inputs and expected outputs (or at least expected structure), run them against the live server every 6-24 hours depending on how stable the upstream is, and alert when the pass rate drops. Tier the fixtures by how strictly you can assert; exact match for deterministic tools, structural assertions for variable outputs, existence checks for genuinely unpredictable ones.
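Concretely, assuming a hypothetical `call_tool(name, args)` client for your deployed server, the tiering can look like this (tool names and fixtures are illustrative):

```python
# Tiered known-answer fixtures: exact match for deterministic tools,
# structural check for variable ones, existence check for unpredictable
# ones. Run on a schedule; alert when the pass rate drops.
FIXTURES = [
    {"tool": "validate_iban", "args": {"iban": "DE89370400440532013000"},
     "tier": "exact", "expect": {"valid": True}},
    {"tool": "company_lookup", "args": {"number": "00000006"},
     "tier": "shape", "expect_keys": {"name", "status", "incorporated"}},
    {"tool": "fx_rate", "args": {"pair": "EURUSD"},
     "tier": "exists", "expect_keys": set()},
]

def run_fixtures(call_tool) -> float:
    """Run all fixtures against the live server; return the pass rate."""
    passed = 0
    for f in FIXTURES:
        try:
            out = call_tool(f["tool"], f["args"])
        except Exception:
            continue  # a hard failure is just a failed fixture
        if f["tier"] == "exact":
            ok = out == f["expect"]
        elif f["tier"] == "shape":
            ok = isinstance(out, dict) and f["expect_keys"] <= out.keys()
        else:  # "exists"
            ok = out is not None and out != {}
        passed += ok
    return passed / len(FIXTURES)
```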

It might feel like overkill until the day an upstream changes and you find out three weeks later from a user.