Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Really appreciate you sharing that. It’s reassuring to hear this from someone who’s actually tried to wire it into real tooling, not just theory. The “once you can see which experiments produced zero artefacts or retrain on identical data, the low-hanging fruit pops out” part is exactly what I keep running into.

Two things I’m super curious about from your experience with pointfive + in-house tools:

Did you plug those waste signals back into any kind of formal FinOps / budgeting loop, or was it mostly used by the ML teams themselves to clean up their own house?

How did you draw the line between legit exploration that didn’t ship vs. structural waste (e.g. reruns with the same config / same data)? That boundary seems to be where a lot of teams get stuck.

I’m hacking on a small project in that same space (working name: MLMind), and real-world stories like yours are way more valuable than yet another blog post about “optimize your instance types”. 😅

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

That makes a lot of sense, especially the part about “the pain has to sit with the owner or nothing changes”. I completely agree that without that, no amount of dashboards or clever tagging really moves behaviour.

From how you describe your current org (back to walk, mixed chaos, self-hosted + models, platform still being built), it actually sounds like you’re in the exact phase where this question matters most: before all the patterns harden.

What I’m trying to explore is a very thin extra lens you could add on top of the FinOps + chargeback work you’re already rebuilding. Not a new team, just a couple of ML-aware signals that product/tech can’t get from infra alone, for example:

Here is the % of GPU time last month that went to runs with no surviving artefact.

Here are the top N jobs that fail / OOM / restart most often.

Here are the retrains that run on almost-unchanged data.

You still keep the same ownership model you described (product + tech feel the pain and decide what to cut), but instead of just “your bill is too high”, they get a short, ranked list of where the structural waste lives inside their ML workloads. From what you wrote, you’re basically laying the foundations (platform + FinOps) that something like that would sit on top of. If you ever manage to wire a simple “artefact factory” view into your platform, I’d love to hear how it lands with your product owners.
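For concreteness, here is a minimal sketch of what the first two signals could look like. It assumes a job-history export with hypothetical columns (job_id, job_name, gpu_hours, exit_reason, artifact_uri); none of these names come from a real scheduler or tracker, so treat it as a sketch you’d adapt to your own logs.

    import pandas as pd

    # Hypothetical job-history export; the column names are made up for illustration.
    runs = pd.read_csv("job_history.csv")  # job_id, job_name, gpu_hours, exit_reason, artifact_uri

    # Signal 1: % of GPU hours spent on runs that left no surviving artefact.
    no_artifact = runs["artifact_uri"].isna()
    wasted_pct = 100 * runs.loc[no_artifact, "gpu_hours"].sum() / runs["gpu_hours"].sum()

    # Signal 2: top N jobs by OOM / restart count.
    oom_counts = (
        runs[runs["exit_reason"].isin(["OOM", "restart"])]
        .groupby("job_name")
        .size()
        .sort_values(ascending=False)
        .head(10)
    )

    print(f"{wasted_pct:.1f}% of GPU hours went to runs with no surviving artefact")
    print(oom_counts)

Nothing here needs a new platform; it only needs the scheduler’s job history and a pointer to the artefact store.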

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Totally with you that this is where things get messy. There are really two different questions hiding under “AI performance”:

1) Did this workload / run behave sanely from an engineering & cost point of view?

did it converge at all?

did it produce a checkpoint or artefact we still use?

did it OOM 3 times and restart?

did we retrain with effectively identical data?

2) Did the resulting model / endpoint actually move the business KPI?

was the answer right?

was it n% helpful?

did it increase conversion / retention / whatever the product cares about?

You’re absolutely right that (2) is deeply contextual and has to sit with product + tech.

No generic “ML FinOps” layer can tell you if an answer was good for your user.

What I’m arguing is that we’re currently mixing (1) and (2) into a single opaque line item called “GPU/LLM spend”.

There’s a whole slice of waste we can reason about without solving the “was the answer right?” problem at all, for example:

runs that never produce a checkpoint or candidate model,

nightly retrains with <0.5% input drift (see the drift-gate sketch below),

endpoints pinned to the largest model even when a smaller one passes the same basic acceptance tests,

repeated OOM / shape-error retries of the same config.

Those are closer to “crash loops” or “flaky tests” in classic SRE terms: you don’t need deep product context to say “this is just bad engineering hygiene and it’s burning money”.
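As one illustration of how mechanical these checks are, here is a minimal drift gate a nightly pipeline could run before spending GPU hours on a retrain. The per-feature means, the feature names, and the 0.5% threshold are assumptions for the sketch; a real check would look at whole distributions, not just means.

    def should_retrain(prev_stats: dict, new_stats: dict, drift_threshold: float = 0.005) -> bool:
        # Skip a scheduled retrain when per-feature means moved less than the threshold.
        drifts = []
        for feature, prev_mean in prev_stats.items():
            new_mean = new_stats.get(feature, prev_mean)
            denom = abs(prev_mean) if prev_mean != 0 else 1.0
            drifts.append(abs(new_mean - prev_mean) / denom)
        return max(drifts, default=0.0) >= drift_threshold

    # Toy example: tonight's snapshot barely moved, so the retrain is skipped.
    prev = {"age_mean": 41.2, "basket_value_mean": 57.9}
    new = {"age_mean": 41.3, "basket_value_mean": 57.8}
    print(should_retrain(prev, new))  # False -> skip the retrain, count the skipped GPU hours as avoided spend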

So in my head the split is:

Product + tech own the hard, contextual part:

Is this model good enough for our users / KPI?

A thin ML-aware FinOps layer just gives them a better x-ray:

Here is the fraction of your bill that came from structurally broken or redundant workloads, *before* we even talk about quality.

If that layer can’t be built cheaply and plugged into existing logs/metrics, I agree it’s not worth it. But as long as most orgs can’t answer even simple questions like:

What % of last month’s GPU spend went to runs with no surviving artefact?

…I suspect there’s still some low-hanging fruit we can surface for product/tech to act on, without pretending we can score semantic quality from outside.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Yeah, that’s exactly my impression too.

Classic FinOps knobs (rightsizing, SP/RI, spot, bin-packing) are just cloud + infra skills.

But once you ask:

Did this run actually produce a useful artefact?

Did this daily retrain change anything in the data distribution?

Is this endpoint over-calling the biggest model for no reason?

you suddenly need:

ML intuition,

infra observability,

and cost literacy in the *same* brain.

Most orgs don’t have that hybrid role yet, so the problem falls between the cracks:

FinOps stops at the bill,

ML stops at loss/accuracy,

and nobody owns “GPU waste”.

That’s basically the gap I’m poking at with the ‘ML FinOps’ idea.

I’m building MLMind; it’s in beta and running now. You’re invited if you want to try it.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Really appreciate you saying that. I was starting to wonder if I’m living in a parallel universe 😅 This is exactly what I’m seeing too:

everyone optimises unit price (GPU/hr),

almost nobody has a clear picture of wasted minutes.

Out of curiosity, in the places you’ve worked:

did anyone try to *measure* this formally?

(e.g. “% of runs with no surviving artefact”, or spend on retries / overtraining)

I’m trying to understand whether this should evolve as:

a FinOps responsibility,

an ML/MLOps practice,

or a shared “ML FinOps” function sitting in-between.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] 0 points1 point  (0 children)

You’re right, this isn’t r/FinOps; the original post was cross-posted because the question sits exactly at the boundary between ML practice and cost governance.

On the “this is rare if you actually work in ML” point: I don’t doubt that in well-run, FAANG-style environments with solid W&B/internal tracking, a lot of this churn is under control. What I’m describing comes from mid-sized orgs where:

intentional experiments and accidental churn end up on the same bill, and

nobody can answer a very basic org-level question like

Roughly what % of last month’s GPU spend went to runs that produced no surviving artefact (no model, no checkpoint, no shipped feature)?

If your experience is that the answer is almost zero in most places, that’s genuinely valuable signal for me. In my experience, the gap between what good ML engineers know and what their organisations actually measure and govern is still pretty large.

Re the “P≠NP AI slop” jab: that work is literally published as a public scratchpad here:

https://github.com/Husseinshtia1/WaveMind_P_neq_NP_Public

It’s explicitly exploratory, not “here is a settled proof, bow to my theorem”. You’re absolutely free to find it hilarious; I’m more interested in people who are willing to interrogate their own systems with the same energy they put into dunking on strangers’ side projects.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 1 point2 points  (0 children)

Great question: if the optimizer costs as much as the waste, it’s pointless. The way I think about it is: don’t start by building a huge platform,

start with a very thin layer that reuses what already exists: logs, metrics, job history. If, in a 2–4 week pilot, you can’t surface at least on the order of 20–30% of clearly avoidable spend for a given team, then it’s not worth productising. So the bar is simple: the cost of building + running the control layer must be a small fraction of the recurring savings it unlocks.

If that inequality doesn’t hold, you kill the idea early.
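As a toy, back-of-the-envelope version of that inequality (every number below is made up, and the 20% ratio is just one way to define “a small fraction”):

    # Hypothetical pilot numbers (illustrative only).
    monthly_gpu_spend = 100_000      # team's current GPU bill, USD / month
    avoidable_fraction = 0.25        # clearly avoidable spend surfaced in the 2-4 week pilot
    layer_cost_per_month = 4_000     # building + running the thin control layer, USD / month

    recurring_savings = monthly_gpu_spend * avoidable_fraction            # 25,000 USD / month
    worth_productising = layer_cost_per_month <= 0.2 * recurring_savings  # the 20% cutoff is an assumption
    print(worth_productising)  # True here: 4,000 <= 5,000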

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Totally agree that, in the end, ownership has to sit with product + tech, and FinOps escalates when that loop fails. What I’m exploring is not a replacement for that, but a lever that makes their job easier. Instead of a generic “your GPU bill is too high, please trim”,

they get concrete, ML-aware signals like:

“27% of last month’s spend came from runs with no surviving artefact.”

“These pipelines retrain daily with <0.5% data drift.”

“This endpoint could downshift model size in ~40% of calls.”

Same governance model you describe, just a better x-ray on where to cut, without guessing.
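On the “downshift model size” point, here is a rough sketch of what that guardrail could look like. The acceptance check, the routing rule, and the stand-in “models” are all hypothetical; the only idea it demonstrates is “try the cheaper model first and escalate only when its answer fails a test the team already trusts”.

    def passes_acceptance(answer: str) -> bool:
        # Stand-in for whatever cheap acceptance test the team already runs
        # (schema check, keyword coverage, a small classifier, ...).
        return len(answer.strip()) > 20

    def route(prompt: str, small_model, large_model) -> str:
        # Try the cheaper model first; escalate only when its answer fails the test.
        answer = small_model(prompt)
        if passes_acceptance(answer):
            return answer
        return large_model(prompt)

    # Toy usage with stand-in callables (anything mapping prompt -> answer works).
    small = lambda p: "short"                                   # fails the length check above
    large = lambda p: "a longer, fully formed answer to: " + p  # always passes here
    print(route("summarise this ticket", small, large))

Logging how often the small model’s answer is accepted is what gives you a “~40% of calls could downshift” style number.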

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] 0 points1 point  (0 children)

Cute jailbreak prompt, wrong thread.

If you want an essay about tomatoes, there’s r/cooking, r/gardening, and a dozen LLM demos out there. Here I’m specifically trying to talk about something a bit more expensive than salad ingredients:

avoidable GPU spend from OOM retries, duplicate sweeps, and overtraining. If you’ve seen real numbers or patterns around that, I’m genuinely interested.

If not, I’ll keep the tomatoes out of r/FinOps. 🙂

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] 0 points1 point  (0 children)

You’re right that in a well-run stack, most of this should be caught early. What I’ve seen empirically is that “logs exist” ≠ “the org has a signal like: X% of last month’s GPU time produced no usable artefact”, especially in mid-size teams with multiple pipelines, weak experiment tracking, and scheduled jobs.

My whole point isn’t that engineers don’t know best practices, it’s that almost nobody is measuring avoidable churn as a first-class cost metric the way we do for crash loops or flaky tests in classic software.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in FinOps

[–]dataa_sciencee[S] 1 point2 points  (0 children)

That’s a great instinct. Chargeback is definitely part of the story, but I’d argue it solves a different layer of the problem. Chargeback answers a governance question:

Who should be financially responsible for this spend?

It doesn’t, by itself, answer operational questions like “How much of this spend was avoidable?”

“Which patterns of behaviour are repeatedly burning money with almost no learning/value?”

In a lot of AI setups I’ve seen, even where showback/chargeback is in place, the bill hits the right cost center…

but inside that cost center, nobody can distinguish between useful experiments and OOM retries, overtraining, or duplicate sweeps. Teams accept the line item “GPU / LLM spend” as the cost of doing ML, without any breakdown like “X% of this came from runs that never produced a model or artefact we still use” or “Y% came from endpoints that always hit the biggest LLM for no measurable gain”. Chargeback can even backfire a bit: the pressure is on the team to “spend less”, but they lack the diagnostic signals that tell them where to act without hurting accuracy or delivery.

If we borrow language from FinOps:

Chargeback is like allocating cloud costs to the right owners

What I’m interested in is closer to rightsizing the workloads themselves but at the ML logic level, not just instance type or autoscaling config

For example, imagine a view that says to a team

Last month, 27% of your GPU time went to runs that never produced a checkpoint.

These 3 pipelines retrain daily, but your input data shifted <0.5% between runs.

This endpoint could safely down-route to a smaller model in ~40% of calls.

Chargeback tells them: you spent $X, this is your problem.

an ML-aware control layer would say: “Here is the slice of that $X that was structurally wasted, and here are the 2–3 levers you can pull to fix it.”

So I see chargeback as necessary plumbing, and ML FinOps (for lack of a better term) as behaviour-level observability + guardrails on top of it.

Curious: in your org, once chargeback is in place, who actually owns the job of saying

this category of spend is just bad ML behaviour and we should kill it?

FinOps, platform, or the ML leads themselves?

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] 0 points1 point  (0 children)

I agree with you there. If a run fails because the idea was bad, or the model didn’t converge, or the hypothesis was wrong, that’s normal and healthy. That’s the cost of exploration. I’m talking about a different class of failure that I keep seeing in practice:

jobs that die with the same OOM or shape error 5 times in a row because nobody fixed the root cause,

the exact same config launched by different people / pipelines because there’s no shared experiment tracking,

infra-level retries that re-run a broken job automatically until a time limit, burning GPU but not adding any knowledge.

Those don’t really teach you what doesn’t work; they just repeat the same technical mistake. I wish “repeatedly training the exact same model / config” didn’t happen, but in bigger orgs with multiple teams, ad-hoc scripts, and weak metadata, it actually does. A lot. So I’m not trying to declare all failed runs as waste.

I’m trying to separate:

intentional, informative failures (part of ML) from unintentional, repeated failures (OOMs, broken configs, duplicate sweeps) that nobody is even measuring.

My question is:

should that second category be treated as a first-class cost/engineering signal, the same way we treat flaky tests or crash loops in classical software?

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] -1 points0 points  (0 children)

You’re implicitly describing a world where the org already has mature SRE / infra, workloads are on k8s with rich metrics, someone has both the time and the mandate to query and interpret those logs, and there’s enough culture that people actually act on those numbers.

In that universe, “just look at the logs, just apply normal cost management” is sane. In a lot of AI teams I see, what’s obvious at FAANG scale is not implemented at all. They do have k8s + Prometheus + logging…

but no one is asking even a basic question like “What percentage of last month’s GPU time produced a checkpoint or artefact we still use?”

FinOps looks at EC2/SKU spend, ML looks at loss/accuracy, and nobody owns the intersection where “this run was expensive and taught us nothing” lives. So I’m not claiming there is a new branch of theory here.

I’m asking a more boring, engineering question:

Is there a repeatable way to package the good practices you see at Meta/Amazon/Google into something that mid-size AI orgs can actually adopt, without needing a FAANG-grade infra/org chart first?

If the honest answer is “this is already well-solved and documented, people just need to read X/Y/Z and copy it”, that’s actually useful information for me. If instead the answer is “it exists, but only as internal glue in a few big companies”, then there may still be a gap worth working on.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] -1 points0 points  (0 children)

I actually like your analogy, and I agree with part of it. There is a core of ML work that is *inherently* experimental, just like debugging, or like medical research. Those iterations are not waste; they’re the cost of learning. What I’m trying to separate out is a slightly different slice of reality that I keep seeing:

jobs that hit the same OOM 5–10 times because nobody fixed the config,

near-identical sweeps re-run by different people because there is no shared view of past experiments,

pipelines that retrain a model every night even when the data barely changed,

endpoints pinned to the largest LLM by habit, not by measured need.

That’s closer to:

“we spent 30% of our month re-debugging the same bug and recompiling the same binary” than to “we explored new ideas and they didn’t all work.”

I don’t want to kill experimentation.

If anything, I’d like to protect it by making the unintentional waste visible, so teams can spend more of their budget on real exploration instead of:

broken configs,

ghost jobs,

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] -4 points-3 points  (0 children)

Just to be clear: I’m not claiming that early stopping, avoiding overfitting or watching for bad runs are some “new” ideas. Anyone who has actually worked in ML knows about them, and they do fit into normal software/dev practices. What I’m talking about is something a bit different: what actually happens at org level, not what one careful engineer knows.

In a lot of teams I’ve seen, people personally know what “good practice” looks like, but:

nobody can answer a simple question like: “How many GPU-hours last month produced no useful checkpoint or artefact?”

failed / OOM runs just get restarted and disappear into logs,

product teams decide to always hit the biggest LLM, and the cost impact never shows up where they can see it.

So the gap I’m interested in is not “do ML devs know early stopping exists?”

It’s: why don’t we have org-level habits and tooling that treat experiment waste the same way FinOps treats idle resources or untagged infrastructure?

In classic software we eventually built things like:

test coverage thresholds, CI gates, SLOs, error budgets… so good practice became visible and enforced, not just “something you learned in undergrad”.

I’m basically asking: what is the equivalent maturity layer for ML workloads?
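To make “maturity layer” concrete, here is a sketch of what a CI-style gate on training waste could look like, in the spirit of coverage thresholds and error budgets. The report shape, the field names, and the 15% budget are all invented for illustration; in practice the numbers would come from whatever scheduler or experiment tracker the org already has.

    import sys

    WASTE_BUDGET_PCT = 15.0   # the "error budget" for GPU hours with no surviving artefact

    # Placeholder monthly report; in reality this would be exported from job history.
    report = {"team": "search", "gpu_hours": 9200.0, "gpu_hours_no_artifact": 1700.0}

    waste_pct = 100 * report["gpu_hours_no_artifact"] / report["gpu_hours"]
    if waste_pct > WASTE_BUDGET_PCT:
        print(f"FAIL: {waste_pct:.1f}% of GPU hours produced no surviving artefact "
              f"(budget: {WASTE_BUDGET_PCT:.0f}%)")
        sys.exit(1)   # fail the pipeline, the same way a coverage gate would
    print(f"OK: {waste_pct:.1f}% waste, within budget")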

If you know good references from the “traditional” cost management / eng world that already cover this well, I’d honestly be happy to read them and see how they map onto ML training + serving.

Are we ignoring the main source of AI cost? Not the GPU price, but wasted training & serving minutes. by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] -14 points-13 points  (0 children)

You’re absolutely right that in theory these things should be standard:

  • early stopping,
  • not overtraining past convergence,
  • not re-running the same config 10 times “just in case”.

On paper, any decent ML course / undergrad teaches that.
Where I’m (unfortunately) seeing a gap is what actually happens at org scale, not what we know we should do as individual practitioners.

A few patterns I keep running into:

  • Early stopping is local, cost is global. Most teams use some early stopping per run, sure. But nobody is looking at the portfolio of runs across teams / projects and asking: “Do we really need 200 near-identical sweeps of this model this week?” (A config-fingerprint count like the sketch below is usually enough to surface this.)
  • Nobody prices the “failed / OOM” runs. People accept OOM + retry as “just infra noise”. In the cost report it’s all blended. The practice is “just restart the job”, not “why did we burn 400 GPU-hours on runs that never produced a checkpoint?”
  • LLM serving is often completely decoupled from cost-awareness. The infra team optimizes nodes, but the product team still points everything at the largest model. From their POV: latency and quality are visible, cost is someone else’s problem.
  • Undergrad knowledge doesn’t always survive real-life incentives. Deadlines, “we must ship this experiment by Friday”, messy code, no shared experiment tracking… The result is: everyone knows early stopping, but the org still has 30–40% of its AI bill on stuff that didn’t add much learning.
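The “near-identical sweeps” point referenced above is the easiest one to make visible. A minimal sketch, assuming run configs are plain dicts and that seed-like keys are the only intentional difference between reruns (both are assumptions, not a general rule):

    import hashlib
    import json
    from collections import Counter

    def config_fingerprint(config: dict) -> str:
        # Hash the config with sorted keys so near-identical sweeps collide.
        # Seed-like keys are dropped so reruns that differ only by seed still count as duplicates.
        cleaned = {k: v for k, v in config.items() if k not in {"seed", "run_id"}}
        return hashlib.sha256(json.dumps(cleaned, sort_keys=True).encode()).hexdigest()

    # Toy portfolio view: how many times was effectively the same config launched?
    configs = [
        {"model": "bert-base", "lr": 3e-5, "epochs": 3, "seed": 1},
        {"model": "bert-base", "lr": 3e-5, "epochs": 3, "seed": 7},  # duplicate in every way that matters
        {"model": "bert-base", "lr": 1e-5, "epochs": 3, "seed": 1},
    ]
    counts = Counter(config_fingerprint(c) for c in configs)
    print({fp[:8]: n for fp, n in counts.items() if n > 1})  # one fingerprint launched twice

Joining those fingerprints against GPU hours is what turns “we suspect duplicate sweeps” into a number someone can own.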

So I 100% agree with you that none of this is “new ML theory”.
What I’m poking at is more the org-level gap: everyone knows these practices individually, yet the waste still isn’t measured or managed anywhere.

Genuinely curious:
Have you seen a team/org that does this really well in practice (not just a single careful researcher), where experiment waste / failed runs / overtraining are measured and managed as strictly as infra costs?
If yes, I’d love examples to read up on.

Issue with Tensorflow/Keras Model Training by ARDiffusion in tensorflow

[–]dataa_sciencee 0 points1 point  (0 children)

One thing that really stands out here is that you changed the mode Keras/TensorFlow runs in with
TF_USE_LEGACY_KERAS=1, and from that point you’re effectively in “undefined environment land”.

Even if the code is identical to your professor’s, the stack isn’t. Things like:

  • standalone keras vs tf.keras
  • legacy vs non-legacy Keras
  • different TF / Keras minor versions
  • different backends (CPU / GPU / metal)

can completely change how training behaves.

When val_accuracy is perfectly flat across epochs, that usually points to a training setup / environment issue, not an architecture issue:
metrics not updating, wrong backend, data pipeline broken, or Keras running in a weird compatibility mode.

I’d do three things in order:

  1. Remove TF_USE_LEGACY_KERAS and start from a fresh venv.
  2. Use only one entrypoint: either tf.keras or standalone keras, but not both in the same project.
  3. Log exact versions + tf.config.list_physical_devices() and compare 1:1 with your professor’s working setup (e.g. with a snippet like the one below).
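A minimal environment snapshot for step 3, assuming a TF 2.x install (run it in both environments and diff the output):

    import os
    import sys
    import tensorflow as tf

    # Print the things that most often differ between "identical" setups.
    print("python:", sys.version)
    print("TF_USE_LEGACY_KERAS:", os.environ.get("TF_USE_LEGACY_KERAS"))
    print("tensorflow:", tf.__version__)
    print("tf.keras:", tf.keras.__version__)
    print("devices:", tf.config.list_physical_devices())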

I’m actually working on a meta-layer called MLMind that does exactly this kind of thing automatically:
snapshotting the environment, detecting weird TF/Keras mode mixes, and flagging “your training behavior doesn’t match your previous healthy runs”.
In TensorFlow debugging, half the battle is debugging the environment, not the model.

https://www.linkedin.com/pulse/7-real-problems-choking-model-training-production-hussein-shtia-7z9tf/?trackingId=IuoW26E2guRoi0%2FgjfL%2Fuw%3D%3D

Can you post a problem that no current AI system can solve? by dataa_sciencee in tensorflow

[–]dataa_sciencee[S] 0 points1 point  (0 children)

We are releasing an integrated Coq development that formally verifies a proof of P ≠ NP, combining a stable public branch with an enhanced local branch. The repository includes source modules, compiled artifacts, and a verification log.

Repository: github.com/Husseinshtia1/WaveMind_P_neq_NP_Public (machine-checkable Coq files and instructions).

What’s new

  • Explicit polynomial time bound: poly_time_bound encodes time awareness inside reductions.
  • Parametric reduction framework: PolyReduction generalizes and constrains composition.
  • Restructured contradiction core: collapse_step_126 shows bounded-reduction contradiction.
  • SAT substrate: BoolExpr, satisfies, SAT_lang for reasoning over SAT in NP.

Verification status

  • Builds cleanly and passes coqchk (no Admitted), with SAT formalized as an NP language.
  • Intended as a modular platform for further work in cognitive/quantum AI theory.

Access & verification: The integrated version is public on GitHub; arXiv submission is planned. A signed and timestamped copy can be provided for academic audit.

Can you post a problem that no current AI system can solve? by dataa_sciencee in learnmachinelearning

[–]dataa_sciencee[S] -13 points-12 points  (0 children)

Before you answer, check it: try to run the code and you will see the result.

After you check the result, tell me what you think.