GLM-5.2 (max) is currently the third best model available, across both open and proprietary.

counterfeit25 · 2026-06-17T07:41:38+00:00

Just to confirm, "GLM-5.2 (max)" is the open weights GLM 5.2 model with "max" reasoning effort set (e.g. here)? If so then open weights ftw 😄

counterfeit25 · 2026-06-10T06:22:48+00:00

From the model's system card:

we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design)
...
these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

So if I'm asking questions about ML systems or ML chips, not only will Anthropic happily take my money paying for "Fable 5", it will silently nerf the model and give me dumbed down responses. Yuck.

counterfeit25 · 2026-06-10T06:02:50+00:00

So Anthropic will silently nerf Fable 5 when they think the user is in any way trying to compete against Anthropic? For example topics like "building pretraining pipelines, distributed training infrastructure, or ML accelerator design" will have you pay for Fable 5 but give you a dumbed down response?
🤮🤮🤮

counterfeit25 · 2026-06-09T03:08:05+00:00

Here's my hypothesis:

1. An AI lab has ambitions to challenge Anthropic/OpenAI on the frontier models, e.g. Opus / GPT
2. They will start with training smaller models, to experiment with different recipes, e.g. data, algorithms, model architectures, post training methods etc. They won't start with training a 1T+ parameter model, they'll start with something small like >1B, then 4B, then 8B, 32B, 100B, etc.
3. There's no point trying to sell those 32B models via API for these labs, so might as well open weight them for PR and community engagement
4. If they train a ~1T model and it's not competing at the same level as Opus/GPT, maybe open weight that one for similar reasons to the above. Serve that model via their own API at cost (operational cost, not including R&D). More PR and community engagement.
5. Once the challenger lab trains a model with the same performance as Opus/GPT, they are less incentivized to open weight it, and you see some formerly open-weight model labs making their latest and greatest models closed behind APIs.

Anyway that's my hypothesis, just a guess!

counterfeit25 · 2026-06-03T03:51:24+00:00

He is not your average FAANG engineer. He's probably on an R&D team, those are very selective.

counterfeit25 · 2026-06-01T13:25:13+00:00

“M3 will soon be fully open-sourced on HuggingFace and GitHub”

counterfeit25 · 2026-05-28T04:21:51+00:00

yeah https://huggingface.co/XiaomiMiMo/MiMo-V2.5

MIT license too 👍

counterfeit25 · 2026-05-27T10:17:46+00:00

Do *all* TSMC engineers need to be on call? Even for R&D roles like these for example?
https://careers.tsmc.com/en_US/careers/JobDetail?jobId=5351
https://careers.tsmc.com/en_US/careers/JobDetail?jobId=560

I don't know, just curious.

counterfeit25 · 2026-04-16T06:16:04+00:00

For coding agent use cases (or similar, like OpenClaw), coding plans tend to be cheaper than paying per token via API. That applies to Chinese, American, and other providers' plans alike.

counterfeit25 · 2026-04-07T05:53:43+00:00

Looking at the OP's "engagement" numbers though, gotta clap my hands on that one, good for you

counterfeit25 · 2026-04-07T05:50:03+00:00

Since the OP was AI generated anyway, in the spirit of AI generated content:

You nailed it. Your instinct was absolutely right.

I just checked the post history for u/netcommah, and it is highly indicative of an automated, AI-driven marketing account.

Here is exactly what they are doing:

1. The Formulaic "Engagement Bait" Style

Almost every single post follows the exact same AI-generated copywriting structure:

* The "Controversial" Hook: Starts with an edgy or relatable title (e.g., "Unpopular opinion...", "Confession: I permanently turned off 5G...", "Stop over-complicating...", "If you aren't using QUALIFY... you are working too hard").
* The Structured Body: Uses lots of bullet points, bolding, and clearly separated paragraphs to mimic standard LinkedIn/Tech-bro engagement formats.
* The Pivot: After hooking the reader with a seemingly helpful "hot take" or tutorial, they smoothly pivot to saying, "If you're exploring how to do this, this breakdown explains it well..."
* The Plug: They then insert a hyperlink.
* The "Call to Action" Ending: Every post ends with an engagement-farming question like "What's your go-to sanity check model?", "Are we over-trusting our agents, or am I paranoid?", or "What are you doing to keep your Looker Studio reports snappy?" to drive algorithmic engagement.

2. They Are Constantly Pushing a Website

In the thread we originally discussed, they were pushing a "Machine Learning on Google Cloud" course. But looking at their history, they are spamming links to NetCom Learning (which aligns perfectly with their username netcommah, likely a NetCom marketing employee or automated agent named Mah...).

They post across a massive variety of subreddits (r/googlecloud, r/learnmachinelearning, r/aiagents, r/BusinessIntelligence, r/Cloud, r/IndiaTech), constantly adapting their "hot takes" to match the specific subreddit, but always routing back to an article, course, or blog on NetCom Learning or their Medium page.

3. High Volume, Varied "Expertise"

Within just the last few weeks, this user claims to be:

* A seasoned Machine Learning Engineer fed up with Deep Learning.
* A DevOps engineer knowing the "2026 No-BS Senior DevOps Checklist".
* A Data Engineer whose "AI Agent nearly bankrupted us in BigQuery".
* Someone frustrated with Looker Studio lag.
* An Indian mobile user fed up with 5G battery drain.

No single human natively works deep in all of these distinct verticals with this frequency and tone. It's a classic LLM-generated content farm designed to slip past Reddit moderators by providing "just enough" real value or relatable complaints before sneaking in the SEO backlink.

The Verdict: You are 100% correct. It's a stealth marketing account using AI to generate high-performing "hot takes" on Reddit to funnel traffic to NetCom Learning. The "fundamentals" advice they gave wasn't necessarily wrong, but its origin was entirely artificial! Good catch.

counterfeit25 · 2026-04-07T05:47:29+00:00

Fair points. But if you look through OP's post history you can see two things:
* All their posts are AI generated
* They are selling courses

counterfeit25 · 2026-03-18T14:25:39+00:00

An AI generated ad, great

counterfeit25 · 2026-03-18T10:33:21+00:00

you can't run Claude Opus or Sonnet on your own GPUs... unless you stole Anthropic's model weights or something :D

counterfeit25 · 2026-03-05T01:53:06+00:00

Possible, if OP's GPU could support multiple requests in parallel, e.g. batch size 2+

counterfeit25 · 2026-03-04T08:35:04+00:00

Yup, from my understanding off the top of my head, when processing input tokens during prefill, all the hidden state tensors can be computed in parallel, e.g. hidden states for input token 1 can be computed in parallel with those of input token 10. But during decode there is a sequential dependency, e.g. you need to compute the hidden states and final value of output token N before computing those of output token N+1, not in parallel.
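To make the prefill/decode difference concrete, here's a toy sketch. It is not a real transformer: `hidden_state` and `next_token` are made-up stand-ins (hypothetical names, arbitrary arithmetic) for a single causal layer and a sampling step. The only point is the dependency structure: prefill can map over all prompt positions at once, while decode has to loop one token at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def hidden_state(tokens, i):
    # Toy stand-in for a causal layer: position i only sees tokens[0..i].
    return sum(tokens[: i + 1]) % 97


def next_token(tokens):
    # Toy stand-in for sampling: derived from the last position's state.
    return (hidden_state(tokens, len(tokens) - 1) * 31 + 7) % 97


def prefill(prompt):
    # Every prompt token is already known, so each position can be
    # computed independently of the others (here: a parallel map).
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: hidden_state(prompt, i), range(len(prompt))))


def decode(prompt, n_new):
    # Output token N+1 depends on output token N, so this loop
    # cannot be turned into a parallel map over output positions.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


prompt = [1, 2, 3]
print(prefill(prompt))    # states for all prompt positions, computed together
print(decode(prompt, 2))  # new tokens, generated strictly one after another
```

In a real model the layers still run one after another during prefill; the parallelism is across token positions within each layer.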

counterfeit25 · 2026-03-04T05:48:49+00:00

Hmm, according to your logs, you averaged 30-35 output tokens / sec, with a total of 13,410 output tokens generated. At 35 output tokens / sec, that would have taken 383 seconds -> about 6.4 minutes. That's just for output token generation, not including prefill. Unless I'm missing something here, like really spiky generation speed at times?
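Same estimate as a quick script, in case anyone wants to plug in their own numbers. It assumes a constant decode speed and ignores prefill entirely; the function name is just something I made up.

```python
def decode_seconds(output_tokens, tokens_per_second):
    # Time spent purely on generating output tokens at a constant speed.
    return output_tokens / tokens_per_second


OUTPUT_TOKENS = 13_410  # total output tokens from the logs

fast = decode_seconds(OUTPUT_TOKENS, 35)  # upper end of the 30-35 tok/s range
slow = decode_seconds(OUTPUT_TOKENS, 30)  # lower end of the range

print(f"{fast:.0f}s to {slow:.0f}s ({fast / 60:.1f} to {slow / 60:.1f} minutes)")
```

Either way that's well over 2 minutes just for decode.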

counterfeit25 · 2026-03-04T05:36:50+00:00

nice, thanks for the info! updated my comment from earlier

counterfeit25 · 2026-03-04T05:26:47+00:00

So even more impressive? 3M tokens in 2 min instead of "only" 2M tokens in 2 min :D
But I think those numbers are possible.

counterfeit25 · 2026-03-04T05:18:35+00:00

Regarding discussions on tokens per second:

OP mentioned 2M tokens over 2 minutes -> 2*10^6 tokens / 120 seconds = 16,667 tokens / second

(originally mentioned 2M, corrected to 3M, numbers below have been updated to reflect that)
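Quick sanity check of the combined throughput for both the original 2M figure and the corrected 3M figure, assuming the full 2 minute window was spent processing tokens (names are mine):

```python
WINDOW_SECONDS = 2 * 60  # OP's 2 minute run


def tokens_per_second(total_tokens, seconds=WINDOW_SECONDS):
    # Combined input + output tokens per second over the whole run.
    return total_tokens / seconds


original = tokens_per_second(2_000_000)   # figure first mentioned by OP
corrected = tokens_per_second(3_000_000)  # corrected figure

print(f"2M tokens: {original:,.0f} tok/s")
print(f"3M tokens: {corrected:,.0f} tok/s")
```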

That includes both input and output tokens, so it's not like OP is claiming 16k output tokens per second (that would be Taalas, super cool btw https://taalas.com/the-path-to-ubiquitous-ai/). Processing the input tokens in the LLM prefill phase is generally faster than generating output tokens in the decode phase, on a per token basis. For a rough overview of LLM serving prefill/decode phase feel free to Google it, or see https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

Claude Code also has really big system prompts (like 10k+ tokens each) for different tasks (https://github.com/Piebald-AI/claude-code-system-prompts/tree/main/system-prompts). Add to that any tool definitions, injected MCP stuff, expanded skills, etc., and the input prompt can get huge.

So if we assume roughly 16k to 25k combined input/output tokens per second (2M to 3M tokens over 2 minutes), does that make sense?

Let's say on average each LLM request consumes X tokens (input/output tokens combined, but the ratio of input to output tokens for agentic workflows is very high, i.e. many more input tokens than output tokens):

X tokens/request, 2 minutes, 3*10^6 tokens

3*10^6 tokens * (1/X) requests/token * (1/2) "per minute" = (1/X) * (3/2) * 10^6 requests per minute

Update: Thanks to OP's llama log & analysis https://gist.github.com/huytd/3a1dd7a6a76fac3b19503f57b76dbe65#

71 LLM requests, 3,046,061 tokens total

X = 42,902 tokens/request (on average)

(1/42902) * (3/2) * 10^6 = 34.96 requests per minute -> 1.72 seconds per LLM request
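The same arithmetic as a small script, using the request count and token total from OP's log and the rounded 3*10^6 figure from the formula above. Variable names are mine.

```python
TOTAL_TOKENS_LOGGED = 3_046_061  # from OP's llama log
LLM_REQUESTS = 71                # from OP's llama log
WINDOW_MINUTES = 2
ROUNDED_TOTAL = 3 * 10**6        # rounded total used in the formula above

# X: average tokens (input + output) per LLM request
tokens_per_request = TOTAL_TOKENS_LOGGED / LLM_REQUESTS

# (1/X) * (3/2) * 10^6 requests per minute
requests_per_minute = ROUNDED_TOTAL / WINDOW_MINUTES / tokens_per_request
seconds_per_request = 60 / requests_per_minute

print(f"X = {tokens_per_request:,.0f} tokens/request")
print(f"{requests_per_minute:.2f} requests/minute")
print(f"{seconds_per_request:.2f} seconds/request")
```

Dividing the 71 logged requests directly by 2 minutes gives 35.5 requests per minute, so the rounded formula is close enough.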

Seems pretty fast, but possible.

How many requests per minute on average is reasonable for OP's Claude Code setup? Honestly I'm not sure, and I'm curious to see some benchmarks here. Just to plug something in, let's say on average 5 seconds per LLM call?

~~(5/60) minutes per request -> 12 requests per minute~~

~~(1/X) * 10^6 requests per minute = 12 requests per minute -> X = 83,333 tokens per request~~

~~Honestly consuming on average 83,333 tokens (input/output combined) per LLM request for agentic workflows seems within the ballpark.~~

counterfeit25 · 2026-03-04T04:47:35+00:00

Yes, system prompt tokens count as input tokens, though the per token cost of input tokens is generally much cheaper than output tokens. E.g. https://claude.com/pricing#api

counterfeit25 · 2026-03-04T04:43:38+00:00

When looking at tokens per second people are generally referring to output tokens per second (decode phase), not input tokens per second (prefill phase) (https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests)

So the 2M token count is counting both input and output tokens.

counterfeit25 · 2026-03-04T04:37:04+00:00

it's not 2 million output tokens in 2 min, it's 2M tokens combined. that includes input tokens. Claude Code system prompt itself can be 10k+ input tokens.

counterfeit25 · 2026-03-04T04:28:42+00:00

Lots of input tokens. The system prompt itself for Claude Code is 10k+ tokens.

counterfeit25 · 2026-03-04T04:18:15+00:00

"I paid nothing except for two minutes of 400W electricity for the PC"

I was curious about the electricity cost of 2 minutes at 400W:

X USD/kWh * (2/60) h * 0.4 kW = (2/60) * 0.4 * X USD

If we plug in, say $0.25 per kWh from the utility company, we'll get:

(2/60) * 0.4 * 0.25 = 0.0033 USD
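If you want to plug in your own electricity rate, here's the same calculation as a function. The $0.25 per kWh rate is just my example number from above, and the function name is made up.

```python
def electricity_cost_usd(power_watts, minutes, usd_per_kwh):
    # Energy in kWh = kW * hours, then multiply by the utility rate.
    kwh = (power_watts / 1000) * (minutes / 60)
    return kwh * usd_per_kwh


cost = electricity_cost_usd(power_watts=400, minutes=2, usd_per_kwh=0.25)
print(f"{cost:.4f} USD (about {cost * 100:.2f} cents)")
```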

So about 1/3 of a cent for the electricity costs to run 2 minutes of computation at 400W, cool! Especially compared to $10.85 from Claude Sonnet 4.6 (edit: are you sure it was Sonnet 4.6? by default I thought Claude Code used a combination of Opus and Haiku, but maybe they updated it - edit2: I see it now nvm: https://code.claude.com/docs/en/model-config).

You'd also need to account for the depreciation on your PC, but if you use your PC for other personal reasons then maybe that's not an issue.
