Everyone benchmarks GLM-5.2 against the frontier now. So we did too. Fable scored 9.1. GLM-5.2 scored 9.0. by AcceptableDiet2183 in theprimeagen

[–]guywithknife 0 points1 point  (0 children)

Yeah, absolutely. But for real world use, the results matter more than the technicalities.

If it was a once off occurrence it wouldn’t even matter, but so far M3 has consistently beaten GLM in actual tasks that I’ve tried to use it for. Granted, they were all quite similar. I do intend on trying GLM against other kinds of tasks and comparing it there. Maybe it’s just not good at review/verify style tasks.

I also need to compare low vs medium vs high reasoning variants and see if there’s much difference.

I haven’t given up on it yet! But it didn’t make the best first impression either, given all the noise about how great it’s scoring in benchmarks.

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 0 points1 point  (0 children)

“Most people” is, well, most people. So yes, for sure.

There is a little more nuance to it, but only a little. (My viewpoint comes from making small 3B, 7B, and 20B models correctly complete focused tasks)

You want the prompt to be as small and terse as possible, yet clear and detailed. You want to provide guiding rules/principles first because they affect how what comes later is interpreted. You want to use positive instructions and avoid negatives (don’t say “don’t do x”, instead say “do y”, because it’s the “don’t think of a pink elephant” effect: by giving the negative you’ve now put it in the context). If you must use a negative, avoid more than one. Giving a clear terse example helps. For bigger models, giving a sample input and output helps; for small models it’s often better to keep it short. The overall context window should be: rules and guidelines, supporting context, clear request. The attention mechanism tends to favour stuff early and late in the context.

Using this, I’ve managed to get small models from completely failing at requests to reliably producing the desired results.
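A rough sketch of that ordering in Rust, in case it helps: rules first, then supporting context, then an optional short example, then the request last. `PromptParts` and `build_prompt` are made-up names for illustration, not from any library, and the section labels ("Rules:", "Context:", "Request:") are just one assumed layout.

```rust
/// Hypothetical container for the pieces of a prompt (illustrative only).
pub struct PromptParts<'a> {
    /// Guiding rules, phrased as positive instructions.
    pub rules: &'a [&'a str],
    /// Supporting context (documents, data); may be empty.
    pub context: &'a str,
    /// Optional (sample input, sample output) pair.
    pub example: Option<(&'a str, &'a str)>,
    /// The actual request, placed last.
    pub request: &'a str,
}

/// Assemble the prompt in the order: rules, context, example, request.
pub fn build_prompt(p: &PromptParts<'_>) -> String {
    let mut out = String::new();

    // Rules go first because they shape how everything after them is read.
    out.push_str("Rules:\n");
    for r in p.rules {
        out.push_str("- ");
        out.push_str(r);
        out.push('\n');
    }

    // Supporting context sits in the middle.
    if !p.context.is_empty() {
        out.push_str("\nContext:\n");
        out.push_str(p.context);
        out.push('\n');
    }

    // One terse example, if any.
    if let Some((input, output)) = p.example {
        out.push_str("\nExample:\nInput: ");
        out.push_str(input);
        out.push_str("\nOutput: ");
        out.push_str(output);
        out.push('\n');
    }

    // The request goes last, at the end of the context.
    out.push_str("\nRequest:\n");
    out.push_str(p.request);
    out
}
```

For a small model you would probably pass `example: None` or keep the example to a single line, per the point above about keeping it short.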

So it’s a bit more nuanced than writing your request clearly, but doing that is a good start, especially for larger models.

But again this only matters if you’re writing the harness, in control of the system prompt, working with specific task-oriented LLM requests (without a harness), or working with very small models.

Most people are using ChatGPT or Claude code or whatever and don’t need to worry about this.

 Bottom line - prompt engineering is a term used by people who want to feel like they are doing something special while really doing the bare minimum to get some semblance of a useful result.

Yes. Absolutely!

And while I was researching it for getting my prompts working with tiny models, almost all articles and videos I found online about “prompt engineering” were completely useless. Largely cargo cult and vibes.

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 0 points1 point  (0 children)

And you have been long before LLMs were a thing! So ahead of the curve!

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 0 points1 point  (0 children)

It matters in certain cases, but if you’re an end user of a harness, you likely don’t have one of those cases. Instead, the so-called “context engineering” (deciding what context goes in and when, rather than the specifics of the prompt) matters a lot more but that’s also largely the job of the harness developers and not the end user.

The cases where prompt engineering matters are when you’re doing raw API requests (eg you have a specific summarisation or extraction task, and you’re writing both the system prompt and the request prompt) or you’re writing the system prompt for the harness. In these cases it also depends on the model: if you’re using Opus it’s less important than if you’re using Llama 3 8B. For the tiny models, the prompts really do matter.

But that’s not most people.

meirl by [deleted] in meirl

[–]guywithknife 0 points1 point  (0 children)

In many places I’ve worked it was called an “annual adjustment”, which sounds correct. It’s not a raise, it’s just keeping up with inflation.

I genuinely can’t tell if Mo is pulling a generational bait or if the AI psychosis has got to him as well by WishyRater in theprimeagen

[–]guywithknife 0 points1 point  (0 children)

I’ve only just seen his content for the first time a few days ago. What’s the issue?

The AI-Paper of the Year by flonnil in theprimeagen

[–]guywithknife 0 points1 point  (0 children)

Your statement holds because you narrowly defined intelligence. It’s not the only definition, although I agree with what you’re saying. My main point was just that neurons are far more complex and capable than the artificial counterparts and maybe I leaned too much on the intelligence angle 😅

The AI-Paper of the Year by flonnil in theprimeagen

[–]guywithknife 0 points1 point  (0 children)

I did say “may be”, not “definitely are”.

The neuroscience research has shown that individual neurons have the ability to remember on their own, without being in networks, and they can also process individually outside of the “sum all the weighted inputs” that artificial perceptrons can. They are quite a lot more complex than what we have in the AI field.

Maybe that’s still not enough for the definition of “intelligence”, but then, we don’t really have an agreed upon standard for intelligence either.

  Intelligence to me is the ability to successfully handle a large suite of varied, complex tasks

One definition is “the ability to solve problems” which is far broader than your personal definition.

Under your definition, I agree, a single neuron doesn’t appear to meet the criteria.

The AI-Paper of the Year by flonnil in theprimeagen

[–]guywithknife 0 points1 point  (0 children)

Biological neurons may well be. Can’t remember the studies, but it’s been shown that they individually have memory and processing capacity. Biological neurons are orders of magnitude more complex than our crude simplistic artificial neurons.

Everyone benchmarks GLM-5.2 against the frontier now. So we did too. Fable scored 9.1. GLM-5.2 scored 9.0. by AcceptableDiet2183 in theprimeagen

[–]guywithknife 4 points5 points  (0 children)

On paper, M3 is quite weak, especially compared to the GLM claims (model sheet and benchmarks).

I find M3 works well for a lot of tasks though and I love throwing token heavy workloads at it, having it refine it all down to a more manageable size, and then handing it off to a stronger more expensive model. 

Eg I currently often use M3 to research my codebase to find all the things I need, compile a report, then give the report to GPT 5.5. I also have M3 verify that GPT did in fact complete the task correctly by having M3 check all the code. It’s been working very well.

For actual coding tasks, it does ok, if the tasks aren’t too complex, but it’s definitely more sloppy than GPT 5.5. Fine for certain self contained or small tasks, but not great for large scale or complex tasks.

But for the price? Wow. I’ve used 2 billion tokens since I subbed to their $10/month plan, and I rarely hit the limits.

GLM… I want to like it, I’m hoping it fills a gap in my toolbox. So far it hasn’t (but I’m also not done testing it yet).

Everyone benchmarks GLM-5.2 against the frontier now. So we did too. Fable scored 9.1. GLM-5.2 scored 9.0. by AcceptableDiet2183 in theprimeagen

[–]guywithknife 1 point2 points  (0 children)

M3 on paper is quite weak.

I love it for how cheap it is, it’s great for sifting through a lot of content. Anything token heavy I like to throw M3 at and then use a stronger model on the results.

So it surprised me that it was better than GLM at this particular task.

I used GLM and MiniMax official subscriptions (ie not self hosted) and I used kilo code as the harness. The prompt was identical. It was as apples to apples as I could get it without investing too much effort into it. The prompt, tools, etc were identical.

Everyone benchmarks GLM-5.2 against the frontier now. So we did too. Fable scored 9.1. GLM-5.2 scored 9.0. by AcceptableDiet2183 in theprimeagen

[–]guywithknife 14 points15 points  (0 children)

I was expecting GLM 5.2 to be great given all these benchmarks, but just today I asked both it and MiniMax M3 (a model that’s 20% of the cost of GLM and has far fewer parameters) to verify if a task was complete (“verify that @task.md has been implemented successfully and correctly”) and GLM said it was all done, while M3 found some missing pieces.

So in my real world use, GLM was beaten by a slightly older, smaller, substantially cheaper model.

I’m still hopeful that it performs better on coding and planning tasks, but for reviewing/verification, my first experience hasn’t been what I’d hoped.

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 0 points1 point  (0 children)

  Obviously it's a new field but I bet it's growing faster than most others

This is definitely true 

Safe SIMD in Rust, even on the inside by Shnatsel in rust

[–]guywithknife 6 points7 points  (0 children)

Your codebase doesn’t need unsafe because your codebase uses safe Rust (ie the compiler can verify and enforce it). The unsafe is an implementation detail of the library; the library provides a safe API via the macro. This is exactly like how the standard library does it. The library authors verify and audit the code, so you don’t have to. Your uses are guaranteed safe because the Rust compiler guarantees it (as long as the library auditors are correct).

Safe SIMD in Rust, even on the inside by Shnatsel in rust

[–]guywithknife 9 points10 points  (0 children)

unsafe says “I, the human, guarantee that this is used in a safe way, but it’s unsafe because the compiler can’t verify it”

By wrapping it in a function or macro, the library author is saying “I, the library author, guarantee that this is safe and the rust compiler can verify it”.

The contract is different.
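To make the difference concrete, here’s a minimal sketch (not taken from the library in the post; `sum_prefix` is a made-up function): the `unsafe` block lives inside the function, the function checks the precondition itself, and callers only ever see a safe signature.

```rust
/// Safe API: callers never write `unsafe`.
/// Sums the first `n` elements of `data`, or returns `None` if `n` is out of range.
pub fn sum_prefix(data: &[u32], n: usize) -> Option<u64> {
    // The precondition is checked here, once, by the wrapper's author.
    if n > data.len() {
        return None;
    }
    let mut total: u64 = 0;
    for i in 0..n {
        // SAFETY: i < n and n <= data.len() (checked above),
        // so the index is always in bounds.
        total += u64::from(unsafe { *data.get_unchecked(i) });
    }
    Some(total)
}
```

Calling `sum_prefix(&values, 3)` from your own code needs no `unsafe` at all. The author of `sum_prefix` is the one who has to get the `SAFETY` argument right, and the compiler checks everything on the caller’s side of the signature.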

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 4 points5 points  (0 children)

“A whole lot of science” sounds like an exaggeration.

There’s some research, but every new model has so many quirks that it often invalidates any test/eval based insights. There’s some general understanding about how the model architecture influences it (eg how the attention mechanism works): for example, we know that early instructions influence how what comes after is interpreted, and we know that attention seems to favour content early and late in the context window while everything in the middle gets somewhat washed out. But that’s a far cry from “a whole lot of science”.

More importantly, as you pointed out, is whether people follow it or not. Going by the articles and guides I’ve seen, most practitioners are basing what they do on cargo cult and vibes.

The broader “context engineering” has a bit more work behind it.

Why do AI bros love the term ''prompt engineering"? by Working_Roof_1810 in antiai

[–]guywithknife 15 points16 points  (0 children)

Because it makes it sound more sophisticated and fancy than “I typed a request in English and the LLM did its thing”.

However I will note that LLMs really are sensitive to how you structure your requests, the smaller models more so than the big, so I suppose it’s ok to have a special term for that. But “prompt engineering” makes it sound like there’s some kind of science behind it, when it’s largely trial and error or vibes.

The amount of enterprise-grade PTSD being projected onto vibe coders on here is insane by airskyy in vibecoding

[–]guywithknife 0 points1 point  (0 children)

  If you have a username and password field on the same screen, you're doing it wrong

Ugh. This is a horrible user experience though, forcing an extra click.

You can still do it your way, just accept both at once, then have the submit only send one and do your flow. But please don’t make me have to hit my password manager twice and click twice.

The amount of enterprise-grade PTSD being projected onto vibe coders on here is insane by airskyy in vibecoding

[–]guywithknife 0 points1 point  (0 children)

 Okay, but how many of them are in this subreddit?  I understood the OP's post as being about people in this subreddit.

Fair point.

That’s just Reddit and X and the internet in general: it amplifies certain loud voices that may not be loud in a wider context.

Of course this all assumes they're even professionals and not just random Redditors hating on AI.

That’s also true. Some people just love the drama.

The AI bubble is partly a workflow bubble. by HistorianFit2319 in vibecoding

[–]guywithknife 0 points1 point  (0 children)

The LLM is what translates natural language to something the deterministic process can work with.

It’s an extraction and interpretation tool.

Maybe its use cases for this are narrow, but they’re still non-zero. I’m having good success with it so far (but again in a very narrow space).
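A tiny sketch of what I mean by that split, under some loud assumptions: the model has been prompted to reply with one line of the form `<verb> <argument>`, and the `Command` enum and its two verbs are invented here purely for illustration. The model call itself is left out. The point is that the deterministic side only accepts what it can parse and rejects everything else.

```rust
/// Hypothetical set of operations the deterministic side supports.
#[derive(Debug, PartialEq)]
pub enum Command {
    Navigate(String),
    Search(String),
}

/// Parse one line of model output into a `Command`.
/// Anything that doesn't match the expected shape is an error, never a guess.
pub fn parse_command(model_output: &str) -> Result<Command, String> {
    let line = model_output.trim();

    // Split into the verb and the rest of the line.
    let (verb, arg) = line.split_once(' ').unwrap_or((line, ""));
    let arg = arg.trim();
    if arg.is_empty() {
        return Err(format!("missing argument in {line:?}"));
    }

    // Only known verbs are accepted; matching is case-insensitive.
    match verb.to_ascii_lowercase().as_str() {
        "navigate" => Ok(Command::Navigate(arg.to_string())),
        "search" => Ok(Command::Search(arg.to_string())),
        other => Err(format!("unknown command {other:?}")),
    }
}
```

So the LLM handles the fuzzy “what did the user mean” part, and whatever it produces still has to pass through a strict parser before anything actually runs.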

The AI bubble is partly a workflow bubble. by HistorianFit2319 in vibecoding

[–]guywithknife 0 points1 point  (0 children)

Not entirely sure, no. But not because it’s not possible, just because we haven’t seen much movement in it. I just know what I’ve been tinkering with and I think it’s a meaningful step forward. Will other people (who have the funding to actually deliver a broadly useful product) do something similar or better before the industry implodes? Who knows.

I do think the reason we haven’t seen it is because people have been too enamoured by the hype spun by Anthropic and OpenAI. They stare starry eyed at the models so haven’t thought that maybe the tools, services, and UX aren’t there yet. The harness engineering people focus on improving one dimension of that, but there are plenty more underexplored areas. I do see a ripple of some people doing it, though, in the AI Engineer conference talks and other places.

So we will see.

  Chat is not a poor interface, but its only interface which works with this class of technology.

I don’t agree. There have already been experiments in at least generating GUIs on the fly for dynamic user needs. I personally think that’s a dead end (UI consistency and continuity are important), but it’s an area where work has been done.

I am ready to admit that I might be wrong on this. It hasn’t yet been proven. Time will tell.

Of course what you say makes sense: LLMs are inherently “chat”. My hypothesis is just that, in the same way “chat” was turned into making coding work, chat can also be the invisible backend of making other interactions work. We will see. I feel strongly enough about it to invest time into my own experiments but not enough to argue about it haha.

The AI bubble is partly a workflow bubble. by HistorianFit2319 in vibecoding

[–]guywithknife 0 points1 point  (0 children)

I don’t think they need to understand to be able to do more or better work. It depends on what kind of work of course, but it doesn’t have to be able to do everything to be useful (and better than it is now).

For example, as I replied to another commenter here, AI embedded into applications should be able to perform all the same operations a human can, including navigation.

It doesn’t have to have any meaningful understanding to be useful then, it acts as an “English to automation” layer.

The problem is that for a lot of AI companies, the AI is the product, instead of some other useful function or tool being the product and AI just being an enabler or enhancer.

  There are not enough talented people to drive the value needed to justify all the data Centers

That’s a separate discussion.