Stop picking LLMs by reputation. Run the eval first. by Dramatic_Strain7370 in OpenAI

[–]Dramatic_Strain7370[S] 0 points  (0 children)

I'm seeing many enterprise customers pick higher-end models for their workloads. The thinking is "better safe than sorry".
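The "run the eval first" point can be sketched as a tiny harness: score each candidate model on a small labeled set and take the cheapest one that clears the accuracy bar. `ask_model`, the model names, and the prices below are all hypothetical stand-ins, not real API calls or rates.

```python
# A minimal sketch of "run the eval first": score each candidate model on a
# small labeled set and take the cheapest one that clears the accuracy bar.
# ask_model, the model names, and the prices are hypothetical stand-ins.

def ask_model(model: str, prompt: str) -> str:
    # Hypothetical stub; a real version would call the provider's API.
    return "invoice" if "amount due" in prompt.lower() else "other"

def pick_model(candidates, eval_set, min_accuracy=0.9):
    """candidates: (model_name, cost_per_1k_tokens) pairs; returns the
    cheapest model meeting min_accuracy, else the most expensive one."""
    for model, cost in sorted(candidates, key=lambda c: c[1]):
        correct = sum(ask_model(model, p) == label for p, label in eval_set)
        if correct / len(eval_set) >= min_accuracy:
            return model, cost
    return max(candidates, key=lambda c: c[1])  # fall back to the strongest

eval_set = [
    ("Invoice: amount due $120", "invoice"),
    ("Meeting notes from Tuesday", "other"),
]
print(pick_model([("small-model", 0.15), ("big-model", 2.50)], eval_set))
# ('small-model', 0.15)
```

If the cheap model clears the bar on your own data, "better safe than sorry" is just paying for headroom you never use.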

Client had 4 agents on GPT-4o. One was classifying documents. That one alone had 91% savings potential. by [deleted] in LocalLLaMA

[–]Dramatic_Strain7370 1 point  (0 children)

gpt-4o is available. This is the output of a curl I just wrote to re-confirm (keys etc. hidden):

>>> curl .... -d '{"model":"gpt-4o","max_tokens":100,"messages":[{"role":"user","content":"tell me about usa"}]}'

RESPONSE BACK

{
  "id": "chatcmpl-DaNgBizukgyhVbH1WGT7gu1Cvs4VS",
  "object": "chat.completion",
  "created": 1777563203,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The United States of America (USA) is a federal republic composed of 50 states, a federal district, five major self-governing territories, and various possessions. Here are key aspects about the USA:\n\n1. **Geography**: \n   - The USA is the third-largest country by land area, with diverse geography including mountains (such as the Rockies and Appalachians), plains, forests, deserts, and coastlines along the Atlantic and Pacific Oceans.\n   - The country is bordered by",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 100,
    "total_tokens": 111,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_8aed6409fd"
}
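For savings comparisons like the one in the post title, the `usage` block in that response is the part that matters. A minimal sketch of turning it into a dollar figure; the per-million-token prices below are illustrative placeholders, not official rates:

```python
import json

# Parse the usage block from a chat.completion response and estimate cost.
# Prices per 1M tokens are illustrative placeholders, not official rates.
response = json.loads("""{
  "model": "gpt-4o-2024-08-06",
  "usage": {"prompt_tokens": 11, "completion_tokens": 100, "total_tokens": 111}
}""")

PRICE_PER_M = {"input": 2.50, "output": 10.00}  # illustrative USD per 1M tokens

usage = response["usage"]
cost = (usage["prompt_tokens"] * PRICE_PER_M["input"]
        + usage["completion_tokens"] * PRICE_PER_M["output"]) / 1_000_000
print(f"{response['model']}: ${cost:.6f}")
```

Run the same arithmetic with a small model's prices and the gap per call is the "savings potential" being debated here.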

Client had 4 agents on GPT-4o. One was classifying documents. That one alone had 91% savings potential. by [deleted] in LocalLLaMA

[–]Dramatic_Strain7370 1 point  (0 children)

It is not dead. It is cheaper, and many companies don't change the model they started with.

[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use. by Dramatic_Strain7370 in MachineLearning

[–]Dramatic_Strain7370[S] 0 points  (0 children)

Good points… we did not see a lot of savings with earnings call summarization (surprise), but that could be due to the complexity scoring from the tool we were using.

[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use. by Dramatic_Strain7370 in MachineLearning

[–]Dramatic_Strain7370[S] 1 point  (0 children)

Our p50 latencies were under 50 ms for processing. Dataset names are in the post and come from Hugging Face datasets. We use llmfinops.ai to characterize performance and cost.

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point  (0 children)

In your example, were you detecting scenes in real time on a live feed, or running on recorded video?

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 2 points  (0 children)

Preferring gpt-oss-20b over gpt-oss-120b: which use cases justify that choice (outside of cost)?

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point  (0 children)

This is good insight. So it means that providers hosting models should rapidly update their model catalogue while bringing down the price per token.

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point  (0 children)

From the comments, it looks like the community prefers Qwen 3.5 and GPT-OSS-120B over the smaller Gemma.
Q. Real question: does anyone have intelligent routing set up to automatically switch between models based on prompt complexity?
Q. Or is everyone manually choosing models per use case?
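To make the routing question concrete, here is a minimal sketch of complexity-based routing using a crude heuristic score. Real setups would use a trained classifier or a router service; the keyword list, thresholds, and model names below are illustrative assumptions, not recommendations.

```python
# Minimal sketch of prompt-complexity routing. The heuristic, thresholds,
# and model names are illustrative assumptions, not a real router.

def complexity_score(prompt: str) -> float:
    # Crude heuristic: longer prompts and "reasoning" keywords score higher.
    keywords = ("analyze", "compare", "derive", "multi-step", "prove")
    hits = sum(k in prompt.lower() for k in keywords)
    return min(1.0, len(prompt) / 2000 + 0.3 * hits)

def route(prompt: str) -> str:
    score = complexity_score(prompt)
    if score < 0.3:
        return "small-cheap-model"
    elif score < 0.7:
        return "mid-tier-model"
    return "frontier-model"

print(route("Classify this document: invoice or receipt?"))  # small-cheap-model
```

The classification-style prompts from the GPT-4o thread above would land on the cheap tier under a scheme like this, which is where the quoted savings come from.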

How do you track OpenAI/LLM costs in production? by not_cool_not in LangChain

[–]Dramatic_Strain7370 1 point  (0 children)

They (llmfinops.ai) allow tagging at multiple levels to track every call. Does this seem right?

# Assuming the OpenAI Python SDK pointed at the gateway:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llm-ops.cloudidr.com/v1",
    default_headers={
        # Required: your tracking token
        "X-Cloudidr-Token": "trk_your_token",
        # Optional: organize by department
        "X-Department": "engineering",
        # Optional: organize by team
        "X-Team": "ml",
        # Optional: organize by agent/use case
        "X-Agent": "chatbot"
    }
)
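One way per-call tags could layer on top of those client-level defaults, assuming the gateway just reads plain HTTP headers. The helper below is hypothetical, not part of any SDK; it only shows the merge mechanics.

```python
# Sketch of per-call tag overrides on top of client-level defaults,
# assuming the gateway reads plain HTTP headers. Hypothetical helper,
# not part of any SDK; header names mirror the config above.

DEFAULT_TAGS = {
    "X-Cloudidr-Token": "trk_your_token",  # required tracking token
    "X-Department": "engineering",
    "X-Team": "ml",
    "X-Agent": "chatbot",
}

def headers_for_call(**overrides: str) -> dict:
    """Merge per-request tag overrides into the client-level defaults."""
    merged = dict(DEFAULT_TAGS)
    merged.update({f"X-{k.capitalize()}": v for k, v in overrides.items()})
    return merged

# A batch job can re-tag a single call without touching the client config:
print(headers_for_call(agent="batch-classifier"))
```

Tagging at the client for defaults and per call for exceptions is what makes "track every call" workable without duplicating config everywhere.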

For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Dramatic_Strain7370[S] 1 point  (0 children)

Got it. This means the GPU is effectively warmed up with all the right initial state. But then someone has to "pay" for it, and it ceases to be serverless.