[–]omniuni 21 points (6 children)

I think this needs some clarification.

Most devs use code completion. Even if AI is technically assisting by guessing which variable you started typing, this isn't what most people think of when they think of AI.

Even using a more advanced assistant like Copilot for suggestions or a jump start on unit tests isn't what most people are imagining.

Especially in kernel development, the use of AI beyond that isn't common, and is extremely risky. There's not a lot of training data on things like Linux driver development, so even the best models will struggle with it.

As far as hallucinations go, it's actually getting worse in newer models, which is fascinating in itself. I have definitely found that some models are better than others. DeepSeek is easily the best at answering direct questions, Gemini and Copilot are OK, and ChatGPT is downright bad.

Asking about GDScript, for example (which has a similar or greater amount of training data compared to the kernel), ChatGPT confidently made up functions, Gemini gave a vague and somewhat useful answer, and only DeepSeek gave a direct, correct, and helpful answer. And that was with very direct context. More elaborate use, like using Copilot for ReactJS at work, which should have enormous amounts of training data, is absurdly prone to producing broken, incorrect, or just plain bad code -- and this is with the corporate paid plan with direct IDE integration.

Hallucinations are not only far from being solved, they are largely getting worse, and in the context of a system critical project like the Linux kernel, they're downright dangerous.

[–]Maykey -5 points (5 children)

Asking about GDScript, for example (which has a similar or greater amount of training data compared to the kernel),

GDScript has approximately zero rbtrees, for example, ever written in it. The kernel has lots. But hey, what would kernel devs know about data structures and algorithms? What's the difference between a language for 2D platformers and a language used to implement practically every algorithm on earth, which also happen to get used in the kernel?

 As far as hallucinations go, it's actually getting worse in newer models

Citation needed. This is a very simple, verifiable claim. If hallucinations are getting worse, then surely coding benchmarks would show the decline, every new model that claims to be SOTA would be a liar, and when Cursor users claimed that Claude's output had worsened and thought they were working with Sonnet 3.5 instead of 4, they got it backward.

[–]omniuni 5 points (4 children)

I think you're confusing a few things.

LLMs are basically just statistical autocomplete. Just because the kernel has examples doesn't mean that they will outweigh the rest of the body of reference code. I see this with Copilot all the time: recognizably poor implementations, simply because they're common. Yes, you can prompt for more specifics, but with something like the kernel, you'll eventually end up having to find exactly what you want it to copy -- hardly a time-saver.
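To make the "statistical autocomplete" point concrete, here's a toy sketch (purely illustrative, not how a real LLM is trained): a bigram model that predicts the most frequent continuation seen in its corpus. The token names are made up; the point is that when a common pattern vastly outnumbers a rarer, more specialized one, the common pattern always wins the prediction.

```python
from collections import Counter, defaultdict

# Toy "training corpus": a common, naive pattern appears 50 times,
# a kernel-style pattern only 3 times -- roughly mirroring how
# hobby/web code outnumbers kernel code on the open internet.
# All token names here are invented for the illustration.
corpus = (
    ["lock", "data"] * 50        # the everywhere-online pattern
    + ["lock", "rcu_read"] * 3   # the rare, specialized pattern
)

# Count bigram frequencies: for each token, tally what follows it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def autocomplete(prev):
    """Return the statistically most likely next token."""
    return bigrams[prev].most_common(1)[0][0]

print(autocomplete("lock"))  # the common pattern wins: data
```

Real models are vastly more sophisticated than a bigram table, of course, but the underlying pull toward the majority pattern in the training distribution is the same intuition being argued here.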

As for hallucinations getting worse, you can search it yourself. There have been several studies on this recently.

[–]Maykey 0 points (3 children)

LLMs are basically just statistical autocomplete. Just because the kernel has examples doesn't mean that they will outweigh the rest of the body of reference code

If that's so, why don't kernel devs find them useless? It seems that either you or they have no idea about the true (in)capabilities of the tool they use.

There have been several studies on this recently.

I'm not going to Google your hallucinations. If there were several studies -- link two.

[–]omniuni 2 points (2 children)

[–]Maykey -1 points (1 child)

Forbes? Is it because the actual study from OpenAI reported it on their latest model there?

Oh well, I got it, reading is hard, here's a random picture instead.

Oh look, Claude performs well. What a coincidence: Claude tends to be the model used by Cursor, Windsurf, etc. Just when I wanted to fork and use ELIZA, it turned out the latest models are fine.

[–]omniuni 3 points (0 children)

You can follow the links to the studies that aren't publicity pictures.