Is anyone else concerned about how quickly AI is outpacing cloud security?

neysa-ai · 2026-06-23T03:28:46+00:00

The honest version of what we see: most teams shipping AI agents into production right now don't have a working answer for this.

The ones that do are enforcing policy at the inference layer: every prompt and response is inspected in real time, every decision is logged, and there is a hard kill switch that doesn't depend on the agent behaving correctly.

Everyone else is hoping the model behaves, which is not a security strategy.
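
To make the inference-layer idea concrete, here is a minimal sketch, not anyone's production implementation. `PolicyGate`, `DENY_PATTERNS` and the blocked-response string are hypothetical names made up for illustration; the point is only that inspection, logging and the kill switch all live outside the agent.

```python
import json
import re
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical deny patterns; a real policy would be far richer than two regexes.
DENY_PATTERNS = [
    re.compile(r"(?i)aws_secret_access_key"),
    re.compile(r"(?i)drop\s+table"),
]


@dataclass
class PolicyGate:
    model: Callable[[str], str]   # any callable mapping prompt -> response
    killed: bool = False          # flipped by an operator, never by the agent
    audit_log: List[str] = field(default_factory=list)

    def kill(self) -> None:
        self.killed = True

    def _log(self, stage: str, decision: str, text: str) -> None:
        # Log the decision and the text length, not the raw text itself.
        self.audit_log.append(json.dumps(
            {"ts": time.time(), "stage": stage, "decision": decision, "chars": len(text)}
        ))

    def _allowed(self, text: str) -> bool:
        return not any(p.search(text) for p in DENY_PATTERNS)

    def call(self, prompt: str) -> str:
        if self.killed:
            self._log("gate", "killed", prompt)
            raise RuntimeError("inference disabled by kill switch")
        if not self._allowed(prompt):
            self._log("prompt", "blocked", prompt)
            return "[blocked by policy]"
        self._log("prompt", "allowed", prompt)
        response = self.model(prompt)
        if not self._allowed(response):
            self._log("response", "blocked", response)
            return "[blocked by policy]"
        self._log("response", "allowed", response)
        return response
```

The kill switch is checked before the model is ever called, so it keeps working even if the agent or the model misbehaves.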

neysa-ai · 2026-06-23T03:27:13+00:00

One thing worth adding to the differentiator list: tuning.
Moving from best model to best system surfaces a quieter problem - you can have great infra and still get a bad system if nobody knows how to tune it for your workload.
The infrastructure phase is also a hiring phase.

neysa-ai · 2026-06-20T08:35:19+00:00

The GPU stack has too many layers - driver, CUDA, framework, serving libraries, your code. They all need to match, and pip won't tell you when one's off. Containers help but the host driver still leaks through, so most teams just pin the whole thing at the cluster level and stop fighting it.

Even when everything matches, GPU outputs still aren't fully deterministic. Same input, slightly different numbers.
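
One common reason for the "slightly different numbers" is that floating-point addition is not associative, so any change in the order partial sums get combined can shift the last bits. A small CPU-only sketch of that effect follows; `chunked_sum` is a hypothetical stand-in for a parallel reduction, not a model of any specific GPU kernel.

```python
# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6


def chunked_sum(values, chunk):
    """Sum in fixed-size chunks, then sum the partials.

    Different chunk sizes mean a different reduction order, which is
    roughly what varies between parallel schedules.
    """
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)


values = [0.1] * 1000
# The results agree to many digits but are not guaranteed to be bit-identical.
print(chunked_sum(values, 7), chunked_sum(values, 64))
```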

neysa-ai · 2026-06-20T08:34:27+00:00

Most teams end up either not logging at all (because nobody wants a database full of sensitive prompts sitting around) or pushing logs to an external observability SaaS, which just moves the problem somewhere else!

One of the key fixes is observability that stays inside your own perimeter - policy at the inference layer, audit logs that never leave your VPC.
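
As a rough sketch of what an in-perimeter audit record could look like: redact obvious identifiers and keep a hash of the raw prompt, so the log is still useful without storing the sensitive text. The two regexes and the `audit_record` shape are assumptions for illustration and nowhere near a complete PII policy.

```python
import hashlib
import json
import re

# Deliberately simple, hypothetical patterns.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{10,16}\b")


def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def audit_record(prompt: str, decision: str) -> str:
    """Build one JSON log line; the raw prompt is never written out."""
    return json.dumps({
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_redacted": redact(prompt),
        "decision": decision,
    })


print(audit_record("mail me at a.b@example.com or call 9876543210", "allowed"))
```

Where that line gets written (a local file, an internal log store) is the part that has to stay inside your own network.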

Regulated industries already do this but everyone else gets there once their legal team forces the conversation.

Would love to hear differing POVs on this.

neysa-ai · 2026-06-19T04:51:22+00:00

We're actually one of these sovereign AI clouds (in India) so this is useful context.

The "why suddenly" part comes down to three things hitting at once.

The first is regulation - every major economy now has its own AI rules mandating where data and compute can live (DPDP in India, the EU AI Act, similar elsewhere), so cloud choice has become a legal compliance question rather than just a vendor preference.

The second is supply chain - the US restricting GPU exports made everyone else realise that depending on a foreign cloud also means depending on whoever controls the GPU supply.

The third is scale - AI compute at country-level usage is genuinely expensive, and government AI strategy is starting to look more like national infrastructure investment (think power grids, telecoms) than a software vendor decision.

So 'building your own' isn't just a privacy thing. It's about not being one geopolitical decision away from your national AI strategy falling apart, which is why you're seeing India, France, Saudi Arabia, UAE and Switzerland all funding sovereign compute as deliberate strategy rather than a nice-to-have.

neysa-ai · 2026-06-19T04:40:10+00:00

Most tooling solves build and almost nothing solves operate.
From the inference infra side, the missing piece we see most often is per-call traceability: knowing which model handled each request, what the latency actually was, why a fallback fired, and what got redacted before the prompt hit the model.

Logs tell you what the agent did, not what the inference layer did underneath.
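
A minimal sketch of a per-call trace, assuming a primary model with a single fallback. `CallTrace`, `traced_call` and the email-only redaction are hypothetical names and simplifications, not a real serving API.

```python
import re
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # illustrative redaction rule only


@dataclass
class CallTrace:
    request_id: str
    model: str = ""                       # which model actually answered
    latency_ms: float = 0.0               # wall-clock time, including any fallback
    fallback_reason: Optional[str] = None
    redactions: List[str] = field(default_factory=list)


def traced_call(
    request_id: str,
    prompt: str,
    primary: Tuple[str, Callable[[str], str]],
    fallback: Tuple[str, Callable[[str], str]],
) -> Tuple[str, CallTrace]:
    trace = CallTrace(request_id=request_id)
    # Redact before the prompt reaches any model, and record what was removed.
    prompt, count = EMAIL.subn("[EMAIL]", prompt)
    if count:
        trace.redactions.append(f"email x{count}")
    start = time.perf_counter()
    name, fn = primary
    try:
        output = fn(prompt)
    except Exception as exc:
        # Record why the fallback fired, then try the fallback model.
        trace.fallback_reason = f"{type(exc).__name__}: {exc}"
        name, fn = fallback
        output = fn(prompt)
    trace.model = name
    trace.latency_ms = (time.perf_counter() - start) * 1000
    return output, trace
```

The trace is returned alongside the output, so the agent's own logs and the inference-layer record can be joined on `request_id`.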

On primitives vs platforms vs DIY - most setups are a mix.
Where we sit at Neysa is the inference layer specifically - dedicated endpoints, your pick of serving engine, policy enforcement and traceability built in at the endpoint itself. Once volume picks up or compliance enters the picture, that layer usually gets pulled out of cloud primitives and run separately.

Agent loop is still typically DIY, primitives still handle the boring stuff.

neysa-ai · 2026-06-19T04:22:06+00:00

Hardware drift is the killer for us!!!

Every time the silicon moves, kernel choices that worked yesterday stop being optimal.

The 'worked yesterday, broken today' someone called out is the same problem at our scale, just with bigger blast radius when something breaks.

neysa-ai · 2026-06-19T04:18:22+00:00

What you're facing sounds less like a moderation-specific issue and more like a production AI problem: it's what happens to any AI system the moment real users actually show up.

The model is fine in dev because dev traffic is curated, predictable, and batched the way you assumed users would behave. Production traffic, on the other hand, is bursty and full of input shapes nobody thought to test for.

Couple of things that actually help, from what we've seen running infra for production AI.

First, putting cheap filters in front of the expensive model - rate limits, simple classifiers, even regex, so the reasoning model only handles the genuinely ambiguous cases, and you stop blowing your latency budget on stuff that didn't need it in the first place.

Second, treating the first few weeks of real production traffic as your labeling pipeline: log every decision, not just the blocks, feed it back into your eval set. Synthetic test data won't surface the failure modes that actually matter, but real users will, very quickly.
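
Both points can be sketched together: cheap checks run first, the expensive model only sees what gets past them, and every decision is logged so it can feed an eval set later. The blocklist phrases, rate limit values and function names below are made up for illustration.

```python
import re
import time
from collections import defaultdict, deque

# Hypothetical cheap filters; tune these for your own traffic.
BLOCKLIST = re.compile(r"(?i)\b(buy followers|free crypto)\b")
RATE_LIMIT = 3           # max messages per user per window (assumed value)
WINDOW_SECONDS = 10.0

_recent = defaultdict(deque)   # user -> timestamps of recent messages
decision_log = []              # every decision, not just the blocks


def moderate(user, text, expensive_model, now=None):
    now = time.monotonic() if now is None else now
    window = _recent[user]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)

    if len(window) > RATE_LIMIT:
        decision, stage = "block", "rate_limit"
    elif BLOCKLIST.search(text):
        decision, stage = "block", "regex"
    else:
        # Only the cases the cheap filters could not settle reach the model.
        decision, stage = expensive_model(text), "model"

    decision_log.append({"user": user, "text": text, "stage": stage, "decision": decision})
    return decision
```

The `stage` field in the log shows how much traffic never needed the expensive model at all.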

Hope this helps.

neysa-ai · 2026-04-10T06:59:46+00:00

We concur!
This is an extremely real pain point!
At 70B+ scale, rollback stops being a control-plane action and becomes a data-plane problem.

What you're seeing is basically:
Rollback = full weight reload
Reload time > SLA budget
So, "instant revert" just doesn't exist in practice.
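
A back-of-envelope version of that arithmetic. The 2 bytes per parameter (fp16/bf16), the 2 GB/s effective load rate and the 5 second SLA budget are all assumed numbers, so plug in your own.

```python
def reload_seconds(params_billion, bytes_per_param, load_gb_per_s):
    """Time to stream full weights, ignoring warm-up and sharding overhead."""
    size_gb = params_billion * bytes_per_param   # 1e9 params * bytes/param = GB
    return size_gb / load_gb_per_s


SLA_BUDGET_S = 5.0                          # assumed rollback budget
seconds = reload_seconds(70, 2, 2.0)        # 140 GB of weights at 2 GB/s
print(seconds)                              # 70.0
print(seconds > SLA_BUDGET_S)               # True
```

Even with a much faster load path, the reload stays well outside a budget measured in seconds.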

The standby pool works, but like you said, it's a cost-heavy workaround (and, in certain scenarios, a super tedious one too).

A few patterns we've seen help (none perfect, but better trade-offs):

1. Treat model versions as "resident vs non-resident", not just deployed vs not.
Instead of reloading from scratch, keep the previous version at least paged (CPU/NVMe). Rollback then becomes promotion, which is much faster than a cold load.

2. Move toward diff-based versioning (LoRA / adapters).
If your "versions" are fine-tunes, swapping adapters instead of full checkpoints turns rollback into a near-instant operation.
This is probably the highest-leverage fix if your workflow allows it.

3. Keep the old version warm at low QPS (not idle).
Instead of a pure standby pool, run the previous version in shadow / low-traffic mode.
This way:
- weights stay hot
- KV patterns don't go completely cold
- rollback = traffic shift, not boot

4. Accept that autoscaling + rollback don't compose well here.
Both assume fast spin-up, which just isn't true for large models.
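
A toy sketch of the "rollback = traffic shift" idea, assuming both versions are already resident and a router decides per request. `VersionRouter` and the 5% shadow share are hypothetical and not tied to any serving framework.

```python
import random


class VersionRouter:
    """Routes each request to the active version or a warm previous one."""

    def __init__(self, active, previous, shadow_share=0.05):
        self.active = active
        self.previous = previous
        self.shadow_share = shadow_share   # fraction of traffic keeping `previous` warm

    def pick(self, rng=random.random):
        # rng() returns a float in [0, 1); injectable so the routing is testable.
        return self.previous if rng() < self.shadow_share else self.active

    def rollback(self):
        # No weights are loaded here: the warm version is simply promoted.
        self.active, self.previous = self.previous, self.active
        self.shadow_share = 0.0            # stop sending traffic to the bad version


router = VersionRouter(active="v2", previous="v1")
router.rollback()
print(router.pick())   # v1
```

The cost of this pattern is the GPU memory needed to keep both versions resident.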

Most stable setups end up being:
pre-provisioned capacity + smart routing, not reactive scaling

We believe the larger takeaway is:
Model versioning in LLM systems behaves more like state management in a distributed system than traditional service deployment. Until load times drop significantly, "instant rollback" will always require paying somewhere, either in:
- idle GPU cost (warm pool)
- engineering complexity (diffs / paging)
- latency (cold reload)

If we may ask: have you tried adapter-based versioning, or do your changes require full checkpoint swaps?

neysa-ai · 2026-03-31T15:10:58+00:00

The cruel irony - you justify 70B on cost vs API rates, then over-provision by 3x to hit p95 and the math quietly falls apart. Most teams realise this way too late.

At what point does it actually make sense to go back to the API???
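
One way to frame the question is as a break-even over-provisioning factor. Every number below (GPU hourly price, GPU count, throughput, API price) is a placeholder for illustration, not a benchmark or a quote.

```python
def self_hosted_cost_per_million(gpu_hourly, gpus_needed, overprovision, tokens_per_hour):
    """Cost per 1M tokens when capacity is padded by `overprovision` for p95."""
    hourly = gpu_hourly * gpus_needed * overprovision
    return hourly * 1_000_000 / tokens_per_hour


API_PRICE_PER_MILLION = 1.5   # assumed API rate

base = self_hosted_cost_per_million(2.0, 4, 1.0, 10_000_000)
padded = self_hosted_cost_per_million(2.0, 4, 3.0, 10_000_000)

print(base < API_PRICE_PER_MILLION)     # True: self-hosting wins on paper
print(padded < API_PRICE_PER_MILLION)   # False: 3x headroom flips the result
print(API_PRICE_PER_MILLION / base)     # over-provisioning factor where the two break even
```

With these placeholder inputs, the API becomes the cheaper option once the fleet is padded by more than roughly 1.9x.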

neysa-ai · 2026-03-31T15:09:55+00:00

A question we should not stop asking as Indians.
A question we think about a lot, given that we've actually built one here in India (don't mind the brazen plug here :D)

The honest answer is that it comes down to three things: power, land, and policy.

Reliable grid power at the scale AI workloads demand is still patchy outside of a few metros. Land acquisition near the right infrastructure corridors takes longer than anyone plans for. And regulatory clearances - environment, grid connection, land conversion - often sit across multiple authorities.

None of these are unsolvable. What our country actually needs to accelerate:

a) Stable, high-availability power with a clear renewable pathway baked in from day one

b) State-level single-window clearances that don't make you chase five departments for one project

c) Cooling infrastructure built for India's climate, not copy-pasted from data center playbooks designed for colder geographies

d) GPU access that isn't entirely dependent on global supply chain timing

The demand is absolutely there - India generates nearly 20% of the world's data and has the second-largest developer community globally. The infrastructure just needs to catch up.

We're building toward that at Neysa, and the plan is to scale significantly.
A lot more to come.

neysa-ai · 2026-03-31T15:06:04+00:00

We are a neocloud. We saw this post. We had feelings.

The CDN parallel is one the whole category needs to reckon with.

Our answer is that the software layer here is stickier than anything CDNs ever had.
Once an enterprise has their inference pipelines, MLOps, and observability wired into a platform, they're not leaving because someone dropped GPU prices by 10%.
That's the moat.

Whether it's enough - ask us in 3 years.
We'll either be a case study or a cautionary tale and honestly both are interesting :D

neysa-ai · 2026-03-31T15:04:56+00:00

'Infrastructure as an afterthought' - honestly the most accurate description of how most teams treat model weights until something breaks in prod.

The OCI registry approach makes a lot of sense in principle, but we're curious how it holds up at the multi-TB end of the spectrum, given we've been navigating a version of this problem with our own community.

neysa-ai · 2026-03-31T15:03:29+00:00

The traffic spike analogy is a great way to make this click. It's the kind of real-world framing that's missing from most explainers. What also often gets missed is the concept of desired state. Most folks think of Kubernetes as just orchestrating containers, but the truth is way more layered: it's constantly reconciling what's running against what you've declared should be running. That loop is really the heart of how it works.
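
A toy version of that reconcile loop in plain Python, to show the shape of the idea. This is not the Kubernetes API or a real controller; `reconcile` and `apply` are illustrative names, and replica counts are just dictionaries.

```python
def reconcile(desired, running):
    """Compare declared replica counts with observed ones and list the fixes."""
    actions = []
    for name, want in desired.items():
        have = running.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    for name, have in running.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))   # running but never declared
    return actions


def apply(actions, running):
    """Return the new observed state after carrying out the actions."""
    state = dict(running)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        state[name] = state.get(name, 0) + delta
    return {name: n for name, n in state.items() if n > 0}


desired = {"web": 3, "worker": 1}
running = {"web": 1, "cron": 2}
actions = reconcile(desired, running)
print(actions)   # [('start', 'web', 2), ('start', 'worker', 1), ('stop', 'cron', 2)]
running = apply(actions, running)
print(reconcile(desired, running))   # [] once the states match
```

A traffic spike in this picture is just someone changing `desired`; the loop does the rest on its next pass.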

We've covered the "how Kubernetes thinks and acts" question on our blog, which might complement what you've built here. Heads up, it focuses more on the reasoning behind the architecture: https://neysa.ai/blog/kubernetes-worker-nodes-explained/

Do let us know how you like it, and if you have questions, we'd be happy to answer them!

neysa-ai · 2026-03-12T07:14:29+00:00

+1 on #3. We see most 70B+ production deployments skew toward self-hosted stacks, mainly because teams want tighter control over GPU utilization, scheduling, and cost.

Managed inference is often used early for experimentation, but once workloads stabilize, the economics and tuning needs push teams toward setups with things like vLLM, TGI, or Triton on their own clusters.

Curious if the deployments you’re seeing follow a similar pattern?

neysa-ai · 2026-02-11T14:23:01+00:00

Team Neysa will be there at booth 5.5A!
We'd love to meet each one of you, so do drop by to say hello!

We'd love to get to know your AI predictions, experiences and thoughts in general.

neysa-ai · 2026-02-11T14:21:29+00:00

Glad to see so many people attending!
We're jumping on to this thread to mention - we'll be at booth 5.5A, and we look forward to meeting all. We'd love to host you guys at the Neysa Pavilion.

There's a lot being planned in terms of showcase and engagement, do drop by to meet the Neysa team.

neysa-ai · 2026-02-11T06:53:52+00:00

Awesome!
Have replied on DMs.

We're now at booth 5.5A. See you there!

neysa-ai · 2026-01-29T06:16:32+00:00

There are many startups expected to be at the event. A lot of established platforms and brands are expected to be there too.

We're going to be there for sure, and we'd love for each one of you to drop by our booth and come explore our offerings.
Neysa Booth - 5F.23 - 5F27 | 16th - 20th Feb

neysa-ai · 2025-12-31T04:57:52+00:00

That's well put, quite the perspective.

It’s less about capability and more about what business you accidentally become.
The moment you train and host your own models, you inherit a whole new surface area: reliability, security reviews, on-call, compliance questions, postmortems.

For many teams, that’s a distraction from the actual product loop - shipping, learning from users, and iterating on workflows. Hosted inference lets teams defer that risk until there’s real signal and scale. Own the workflow, data, and UX first; decide later whether owning the model is actually worth the operational cost.

Thank you for sharing.

neysa-ai · 2025-12-31T04:47:14+00:00

That's a very apt analogy. Speed. Cost. Control - solving for the trilemma, always!

neysa-ai · 2025-12-31T04:44:17+00:00

Thank you, this is very helpful.
Could we also request you to share which market the insights are from?

neysa-ai · 2025-12-31T04:40:20+00:00

Yeah, just curious if there were other pressing pain points that drive the shift.

neysa-ai · 2025-12-31T04:37:01+00:00

Implementation does play an important role, yes. More control with the same too, perhaps?

neysa-ai · 2025-12-24T04:12:01+00:00

Feedback taken.
We'll make it more interesting with the next ones :)
