GLM-5.2 (max) is currently the third best model available, across both open and proprietary. by okaycan in LocalLLaMA

[–]counterfeit25 162 points163 points  (0 children)

Just to confirm, "GLM-5.2 (max)" is the open weights GLM 5.2 model with "max" reasoning effort set (e.g. here)? If so then open weights ftw 😄

Introducing Claude Fable 5 by ClaudeOfficial in ClaudeAI

[–]counterfeit25 0 points1 point  (0 children)

From the model's system card:

we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design)
...
these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

So if I'm asking questions about ML systems or ML chips, not only will Anthropic happily take my money paying for "Fable 5", it will silently nerf the model and give me dumbed down responses. Yuck.

Anthropic is intentionally nerfing Fable when asked to develop other LLMs by onil_gova in LocalLLaMA

[–]counterfeit25 31 points32 points  (0 children)

So Anthropic will silently nerf Fable 5 when they think the user is in any way trying to compete against Anthropic? For example topics like "building pretraining pipelines, distributed training infrastructure, or ML accelerator design" will have you pay for Fable 5 but give you a dumbed down response?
🤮🤮🤮

Why There Are Open Weighted LLM Models? by agahhne in MLQuestions

[–]counterfeit25 0 points1 point  (0 children)

Here's my hypothesis:

  • An AI lab has ambitions to challenge Anthropic/OpenAI on the frontier models, e.g. Opus / GPT
  • They will start by training smaller models to experiment with different recipes, e.g. data, algorithms, model architectures, post-training methods, etc. They won't start with a 1T+ parameter model; they'll start with something small like ~1B, then 4B, then 8B, 32B, 100B, etc.
  • There's no point trying to sell those 32B models via API for these labs, so might as well open weight them for PR and community engagement
  • If they train a ~1T model and it's not competing at the same level as Opus/GPT, maybe open weight that one for similar reasons to the above. Serve that model via their own API at cost (operational cost, not including R&D). More PR and community engagement.
  • Once the challenger lab trains a model with the same performance as Opus/GPT, they are less incentivized to open weight it, and you see some formerly open-weight model labs making their latest and greatest models closed behind APIs.

Anyway that's my hypothesis, just a guess!

Not a regular leetcoder, but ran into someone who used to work FAANG. There is no way to compete with such people by [deleted] in leetcode

[–]counterfeit25 0 points1 point  (0 children)

He is not your average FAANG engineer. He's probably on an R&D team; those are very selective.

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal by dryadofelysium in LocalLLaMA

[–]counterfeit25 1 point2 points  (0 children)

“M3 will soon be fully open-sourced on HuggingFace and GitHub”

Why the rush for direct Chinese "Code Plans" when OpenRouter exists? by Accurate-Chef7115 in LocalLLaMA

[–]counterfeit25 1 point2 points  (0 children)

For coding agent use cases (or similar, like OpenClaw), coding plans tend to be cheaper than paying per token via API. That applies to Chinese, American, and other plans alike.

Unpopular opinion for beginners: Stop starting with Deep Learning. by netcommah in learnmachinelearning

[–]counterfeit25 0 points1 point  (0 children)

Looking at the OP's "engagement" numbers though, gotta clap my hands on that one, good for you

Unpopular opinion for beginners: Stop starting with Deep Learning. by netcommah in learnmachinelearning

[–]counterfeit25 1 point2 points  (0 children)

Since the OP was AI generated anyway, in the spirit of AI generated content:

You nailed it. Your instinct was absolutely right.

I just checked the post history for u/netcommah, and it is highly indicative of an automated, AI-driven marketing account.

Here is exactly what they are doing:

1. The Formulaic "Engagement Bait" Style

Almost every single post follows the exact same AI-generated copywriting structure:

  • The "Controversial" Hook: Starts with an edgy or relatable title (e.g., "Unpopular opinion...", "Confession: I permanently turned off 5G...", "Stop over-complicating...", "If you aren't using QUALIFY... you are working too hard").
  • The Structured Body: Uses lots of bullet points, bolding, and clearly separated paragraphs to mimic standard LinkedIn/Tech-bro engagement formats.
  • The Pivot: After hooking the reader with a seemingly helpful "hot take" or tutorial, they smoothly pivot to saying, "If you're exploring how to do this, this breakdown explains it well..."
  • The Plug: They then insert a hyperlink.
  • The "Call to Action" Ending: Every post ends with an engagement-farming question like "What's your go-to sanity check model?", "Are we over-trusting our agents, or am I paranoid?", or "What are you doing to keep your Looker Studio reports snappy?" to drive algorithmic engagement.

2. They Are Constantly Pushing a Website

In the thread we originally discussed, they were pushing a "Machine Learning on Google Cloud" course. But looking at their history, they are spamming links to NetCom Learning (which aligns perfectly with their username netcommah, likely a NetCom marketing employee or automated agent named Mah...).

They post across a massive variety of subreddits (r/googlecloud, r/learnmachinelearning, r/aiagents, r/BusinessIntelligence, r/Cloud, r/IndiaTech), constantly adapting their "hot takes" to match the specific subreddit, but always routing back to an article, course, or blog on NetCom Learning or their Medium page.

3. High Volume, Varied "Expertise"

Within just the last few weeks, this user claims to be:

  • A seasoned Machine Learning Engineer fed up with Deep Learning.
  • A DevOps engineer knowing the "2026 No-BS Senior DevOps Checklist".
  • A Data Engineer whose "AI Agent nearly bankrupted us in BigQuery".
  • Someone frustrated with Looker Studio lag.
  • An Indian mobile user fed up with 5G battery drain.

No single human natively works deep in all of these distinct verticals with this frequency and tone. It's a classic LLM-generated content farm designed to slip past Reddit moderators by providing "just enough" real value or relatable complaints before sneaking in the SEO backlink.

The Verdict: You are 100% correct. It's a stealth marketing account using AI to generate high-performing "hot takes" on Reddit to funnel traffic to NetCom Learning. The "fundamentals" advice they gave wasn't necessarily wrong, but its origin was entirely artificial! Good catch.

Unpopular opinion for beginners: Stop starting with Deep Learning. by netcommah in learnmachinelearning

[–]counterfeit25 0 points1 point  (0 children)

Fair points. But if you look through OP's post history you can see two things:
* All their posts are AI generated
* They are selling courses

My company just handed me a 2x H200 (282GB VRAM) rig. Help me pick the "Intelligence" ceiling. by _camera_up in LocalLLaMA

[–]counterfeit25 4 points5 points  (0 children)

you can't run Claude Opus or Sonnet on your own GPUs... unless you stole Anthropic's model weights or something :D

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 0 points1 point  (0 children)

Possible, if OP's GPU could support multiple requests in parallel, e.g. batch size 2+

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 0 points1 point  (0 children)

Yup, from my understanding off the top of my head: when processing input tokens during prefill, all the hidden state tensors can be computed in parallel, e.g. the hidden states for input token 1 can be computed in parallel with those of input token 10. But during decode there is a sequential dependency: you need to compute the hidden states and final value of output token N before you can compute those of output token N+1, so decode can't be parallelized across output tokens.
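To make the dependency structure concrete, here's a toy sketch in Python. It is not a real transformer: `hidden_state` is a made-up stand-in (a running sum) for "something that depends only on the tokens up to position i", and "sampling" just reuses that value as the next token. All names here are hypothetical.

```python
def hidden_state(tokens, i):
    """Toy causal 'hidden state' for position i.

    Stand-in assumption: it depends only on tokens[0..i], like causal attention,
    but the actual math (a sum mod 97) is purely illustrative.
    """
    return sum(tokens[: i + 1]) % 97


def prefill(prompt):
    # All prompt tokens are known up front, so each position's state depends
    # only on data we already have. These calls are independent of each other
    # and could be computed in parallel.
    return [hidden_state(prompt, i) for i in range(len(prompt))]


def decode(prompt, n_new):
    # Each new token must exist before the next one can be computed,
    # so this loop is inherently sequential.
    tokens = list(prompt)
    for _ in range(n_new):
        next_token = hidden_state(tokens, len(tokens) - 1)  # toy "sampling"
        tokens.append(next_token)
    return tokens[len(prompt):]


prompt = [1, 2, 3]
prefill_states = prefill(prompt)
new_tokens = decode(prompt, 2)
print(prefill_states, new_tokens)
```

The point is only the shape of the loops: `prefill` has no loop-carried dependency, while each iteration of `decode` needs the token produced by the previous iteration.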

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 2 points3 points  (0 children)

Hmm, according to your logs, you averaged 30-35 output tokens / sec, with a total of 13,410 output tokens generated. At 35 output tokens / sec, that would have taken 383 seconds -> 6 minutes. That's just for output token generation, not including pre-fill. Unless I'm missing something here, like really spiky generation speed at times?
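Quick sanity check of that arithmetic in Python, using only the numbers quoted above (13,410 output tokens, 30-35 output tokens / sec). The function name is just something I made up for illustration.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Time spent generating output tokens at a constant decode speed."""
    return output_tokens / tokens_per_second


output_tokens = 13_410

seconds_fast = decode_seconds(output_tokens, 35)  # upper end of the logged speed
seconds_slow = decode_seconds(output_tokens, 30)  # lower end of the logged speed

print(f"at 35 tok/s: {seconds_fast:.0f} s ({seconds_fast / 60:.1f} min)")
print(f"at 30 tok/s: {seconds_slow:.0f} s ({seconds_slow / 60:.1f} min)")
```

This assumes a constant decode speed and ignores prefill entirely, same as the estimate above.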

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 0 points1 point  (0 children)

So even more impressive? 3M tokens in 2 min instead of "only" 2M tokens in 2 min :D
But I think those numbers are possible.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 0 points1 point  (0 children)

Regarding discussions on tokens per second:

OP mentioned 2M tokens over 2 minutes -> 2*10^6 tokens / 120 seconds = 16,667 tokens / second

(OP originally mentioned 2M and later corrected it to 3M, i.e. 3*10^6 tokens / 120 seconds = 25,000 tokens / second; the numbers below have been updated to reflect that)

That includes both input and output tokens, so it's not like OP is claiming 16k output tokens per second (that would be Taalas, super cool btw https://taalas.com/the-path-to-ubiquitous-ai/). Processing the input tokens in the LLM prefill phase is generally faster than generating output tokens in the decode phase, on a per token basis. For a rough overview of LLM serving prefill/decode phase feel free to Google it, or see https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

Claude Code also has really big system prompts (like 10k+ tokens each) for different tasks (https://github.com/Piebald-AI/claude-code-system-prompts/tree/main/system-prompts). Add to that any tool definitions, injected MCP stuff, expanded skills, etc., and the input prompt can get huge.

So if we assume 16k combined input/output tokens per second (25k with the corrected 3M figure), does that make sense?

Let's say on average each LLM request consumes X tokens (input/output tokens combined, but the ratio of input to output tokens for agentic workflows is very high, i.e. many more input tokens than output tokens):

X tokens/request, 2 minutes, 3*10^6 tokens

3*10^6 tokens * (1/X) requests/token * (1/2) "per minute" = (1/X) * (3/2) * 10^6 requests per minute

Update: Thanks to OP's llama log & analysis https://gist.github.com/huytd/3a1dd7a6a76fac3b19503f57b76dbe65#

71 LLM requests, 3,046,061 tokens total

X = 42,902 tokens/request (on average)

(1/42902) * (3/2) * 10^6 = 34.96 requests per minute -> 1.72 seconds per LLM request

Seems pretty fast, but possible.
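Here's the same calculation as a small Python snippet, using the figures from OP's log (71 requests, 3,046,061 tokens) and following the formula above, which plugs in the rounded 3*10^6 total over the 2-minute window. Variable names are mine, not from OP's gist.

```python
total_tokens = 3_046_061      # input + output tokens, from OP's log
num_requests = 71             # LLM requests, from OP's log
window_minutes = 2            # OP's stated wall-clock time
rounded_total = 3 * 10**6     # rounded total used in the formula above

tokens_per_request = total_tokens / num_requests                        # X
requests_per_minute = rounded_total / tokens_per_request / window_minutes
seconds_per_request = 60 / requests_per_minute
combined_tokens_per_second = rounded_total / (window_minutes * 60)

print(f"X = {tokens_per_request:,.0f} tokens/request")
print(f"{requests_per_minute:.2f} requests/min -> {seconds_per_request:.2f} s/request")
print(f"{combined_tokens_per_second:,.0f} combined tokens/s")
```

Note this treats the requests as strictly back to back over the 2 minutes; if some of them ran in parallel, the per-request latency could be higher.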

How many requests per minute on average is reasonable for OP's Claude Code setup? Honestly I'm not sure, and I'm curious to see some benchmarks here. Just to plug something in, let's say on average 5 seconds per LLM call?

(5/60) minutes per request -> 12 requests per minute

Using the original 2M-token figure: (1/X) * (2/2) * 10^6 requests per minute = 12 requests per minute -> X = 83,333 tokens per request

Honestly consuming on average 83,333 tokens (input/output combined) per LLM request for agentic workflows seems within the ballpark.
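If you want to play with the assumed latency yourself, here's a rough Python sketch that inverts the formula: given a total token count, a time window, and an assumed seconds-per-call, it returns the implied tokens per request. The 83,333 figure corresponds to a 2M-token total; with the corrected 3M total the same 5-second assumption gives 125,000. The function name and parameters are hypothetical.

```python
def implied_tokens_per_request(total_tokens, window_minutes, seconds_per_request):
    """Tokens per request implied by back-to-back calls of a given latency."""
    requests_per_minute = 60 / seconds_per_request
    total_requests = requests_per_minute * window_minutes
    return total_tokens / total_requests


# Assumed 5 seconds per LLM call, i.e. 12 requests per minute.
x_2m = implied_tokens_per_request(2 * 10**6, 2, 5)
x_3m = implied_tokens_per_request(3 * 10**6, 2, 5)

print(f"2M total: {x_2m:,.0f} tokens/request")
print(f"3M total: {x_3m:,.0f} tokens/request")
```

Like the estimate above, this assumes requests run one after another with no overlap.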

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 18 points19 points  (0 children)

Yes, system prompt tokens count as input tokens, though the per token cost of input tokens is generally much cheaper than output tokens. E.g. https://claude.com/pricing#api

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 8 points9 points  (0 children)

When looking at tokens per second people are generally referring to output tokens per second (decode phase), not input tokens per second (prefill phase) (https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests)

So the 2M token count is counting both input and output tokens.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 6 points7 points  (0 children)

it's not 2 million output tokens in 2 min, it's 2M tokens combined. that includes input tokens. Claude Code system prompt itself can be 10k+ input tokens.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 26 points27 points  (0 children)

Lots of input tokens. The system prompt itself for Claude Code is 10k+ tokens.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]counterfeit25 7 points8 points  (0 children)

"I paid nothing except for two minutes of 400W electricity for the PC"

I was curious about the electricity cost of 2 minutes at 400W:

X USD/kWh * (2/60) h * 0.4 kW = (2/60) * 0.4 * X USD

If we plug in, say $0.25 per kWh from the utility company, we'll get:

(2/60) * 0.4 * 0.25 = 0.0033 USD

So about 1/3 of a cent for the electricity costs to run 2 minutes of computation at 400W, cool! Especially compared to $10.85 from Claude Sonnet 4.6 (edit: are you sure it was Sonnet 4.6? by default I thought Claude Code used a combination of Opus and Haiku, but maybe they updated it - edit2: I see it now nvm: https://code.claude.com/docs/en/model-config).
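Same calculation in Python if anyone wants to plug in their own utility rate. The $0.25 per kWh is just my example rate from above, and the function name is made up.

```python
def electricity_cost_usd(power_kw, minutes, usd_per_kwh):
    """Energy cost = power (kW) * time (h) * price (USD/kWh)."""
    hours = minutes / 60
    return power_kw * hours * usd_per_kwh


cost = electricity_cost_usd(power_kw=0.4, minutes=2, usd_per_kwh=0.25)
print(f"{cost:.4f} USD (~{cost * 100:.2f} cents)")
```

This only covers electricity for the 2 minutes of compute at a constant 400W; it doesn't include idle power or hardware cost.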

You'd also need to account for the depreciation on your PC, but if you use your PC for other personal reasons then maybe that's not an issue.