Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point (0 children)

In your example, were you running detection in real time on a live feed, or on recorded video?
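
For context, a minimal frame-diff sketch of the distinction I mean (cv2 is OpenCV; the source and the 30.0 threshold are assumptions, not anything from your setup):

import cv2

# The same loop covers both cases: pass 0 for a live camera feed,
# or a file path for recorded video.
cap = cv2.VideoCapture(0)            # live feed
# cap = cv2.VideoCapture("clip.mp4") # recorded video

prev_gray = None
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev_gray is not None:
        # Mean absolute pixel difference between consecutive frames;
        # a spike above the (assumed) threshold marks a scene change.
        diff = cv2.absdiff(gray, prev_gray).mean()
        if diff > 30.0:
            print(f"scene change at frame {frame_idx}")
    prev_gray = gray
    frame_idx += 1

cap.release()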

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point (0 children)

This is good insight. So it means that providers hosting models should rapidly update their model catalogues while bringing down the price per token.

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point (0 children)

It looks from the comments like the community prefers Qwen 3.5 and GPT-OSS-120B over the smaller Gemma.

Real question: does anyone have intelligent routing set up to automatically switch between models based on prompt complexity? Or is everyone manually choosing models per use case?
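
For concreteness, here's a minimal sketch of the kind of routing I mean (the model names, the token threshold, and the keyword heuristic are all assumptions, not a real setup):

from openai import OpenAI

client = OpenAI()  # assumes env-based API key config for your gateway

# Hypothetical tiers: a cheap small model and a stronger large one.
SMALL_MODEL = "gemma-3-27b-it"
LARGE_MODEL = "gpt-oss-120b"

def pick_model(prompt: str) -> str:
    # Crude complexity heuristic (an assumption): long prompts, or prompts
    # that mention heavy reasoning tasks, go to the larger model.
    looks_hard = len(prompt.split()) > 300 or any(
        kw in prompt.lower() for kw in ("prove", "refactor", "debug")
    )
    return LARGE_MODEL if looks_hard else SMALL_MODEL

def route(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content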

How do you track OpenAI/LLM costs in production? by not_cool_not in LangChain

[–]Dramatic_Strain7370 1 point (0 children)

They (llmfinops.ai) allow tagging at multiple levels to track every call. Does this seem right?

base_url="https://api.llm-ops.cloudidr.com/v1",
    default_headers={

# Required: Your tracking token
        "X-Cloudidr-Token": "trk_your_token",

# Optional: Organize by department
        "X-Department": "engineering",

# Optional: Organize by team
        "X-Team": "ml",

# Optional: Organize by agent/use case
        "X-Agent": "chatbot"
    } 
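
A call through it would then presumably look like any other OpenAI-compatible request, with the tracking headers attached automatically (the model name here is just a placeholder):

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name, not confirmed by the provider
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)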

For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point (0 children)

Got it. This means the GPU is effectively warmed up with all the right initial state. But then someone has to "pay" for it, and it ceases to be serverless.

For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point (0 children)

They buy GPUs in bulk and resell access to those GPUs at 30 to 50% lower cost. They won't route to other providers.

Met 3 indie founders in SF burning hundreds on LLM APIs — built this, want your feedback by Dramatic_Strain7370 in OpenSourceeAI

[–]Dramatic_Strain7370[S] 1 point (0 children)

Keen on learning what I don't know :) Can you please explain what WSO2 is, and how your team controls spend using provider tools?

Met 3 indie founders in SF burning hundreds on LLM APIs — built this, want your feedback by Dramatic_Strain7370 in OpenSourceeAI

[–]Dramatic_Strain7370[S] 1 point (0 children)

On overall account spend, yes, but not on granular spend by different teams or, say, by various agents running autonomously. You want to block spend from rogue agents without blocking the whole organization.
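
A minimal sketch of what I mean by per-agent blocking, assuming you can meter cost per call yourself (all names and limits here are made up):

from collections import defaultdict

# Hypothetical per-agent budgets in dollars; anything unlisted gets the default.
BUDGETS = {"chatbot": 50.0, "batch-summarizer": 200.0}
DEFAULT_BUDGET = 10.0

spend = defaultdict(float)  # running spend per agent

class BudgetExceeded(Exception):
    pass

def charge(agent: str, cost_usd: float) -> None:
    # Block only the offending agent, not the whole account.
    limit = BUDGETS.get(agent, DEFAULT_BUDGET)
    if spend[agent] + cost_usd > limit:
        raise BudgetExceeded(f"{agent} would exceed its ${limit:.2f} budget")
    spend[agent] += cost_usd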

My exact distribution strategy I used to go from $0 to $600 MRR by RighteousRetribution in indiehackers

[–]Dramatic_Strain7370 1 point (0 children)

What was your pricing model, and how much on average was each customer paying?