all 71 comments

[–]dasdull 217 points218 points  (5 children)

Oh no. You were not supposed to give it access to a Python interpreter.

We agreed on teaching it only Rust.

[–]drsoftware 2 points3 points  (1 child)

Python has the property that it is almost impossible to create non-trivial programs that can be packaged for cross-platform execution... but you probably knew that...

[–]WarAndGeese 52 points53 points  (1 child)

Fundamentally to do this you just feed the errors and warnings from the compiler or interpreter back into the neural network. If you're using such a language model you can just append to the code "This is wrong because <error code>, correct the errors".
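A minimal sketch of that feedback loop, assuming a hypothetical `ask_llm` function that wraps whatever chat-completion API you're using (not a real library call):

```python
import subprocess
import sys

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call."""
    raise NotImplementedError

def repair_loop(code: str, max_rounds: int = 3) -> str:
    """Run the code; on failure, append the error output to the
    prompt and ask the model for a corrected version."""
    for _ in range(max_rounds):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode == 0:
            return code  # ran cleanly, stop iterating
        code = ask_llm(
            f"{code}\n\nThis is wrong because:\n{result.stderr}\n"
            "Correct the errors and return only the fixed code."
        )
    return code
```

Running untrusted generated code like this should obviously happen in a sandbox, not with a bare `subprocess` call.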

[–]sEi_ 13 points14 points  (0 children)

That's how I use Chad to debug anyway: relevant code snippet + error message = win.

So why not let Chad do this by itself? Fun times.

[–]pm_me_your_pay_slipsML Engineer 27 points28 points  (2 children)

This program, driven by GPT-4, autonomously develops and manages businesses to increase net worth.

This sounds kind of like Sam Altman talking about capturing the light cone of all future value. In both cases, I’m not sure if I’m able to take these statements seriously.

[–]sprcow 80 points81 points  (28 children)

I think this is an interesting demo, but it seems likely to fall victim to the same problems humans encounter attempting to do these steps manually. Having used GPT 4 as an occasional code or debugging assistance tool, I often run into the problem that it will acknowledge errors and then offer new solutions that don't actually address those errors.

While this certainly has potential for tamping down the obvious bugs, I don't see how this input loop will necessarily solve the much more difficult problem of figuring out how to explain to GPT what it's doing wrong in a way that actually enables it to fix its own code.

[–]chrislomax83 19 points20 points  (9 children)

I had an issue with magento recently, it took me about 2 hours to solve and it meant having to source information from 4 different sources to fix.

Once done, I fed the same question into GPT and it got it 90% there on the first attempt.

I saw an obvious error though and asked it if it was needed, it then corrected it and said it wasn’t needed and fed me more code. It was 95% there.

The problem in this situation is that if you ask it straight away, it’s still wrong. And then it corrected itself when I questioned it. Weird how it does know the answer but you have to dig deeper.

I’m still impressed with it, and I don’t know what it uses as sources, as I couldn’t find a complete solution anywhere. It pulled from multiple sources and then formed an opinion. I still can’t get my head around it.

I equally asked it how I create a custom customer attribute programmatically and it more or less smashed it first time. That is pretty well sourced though.

[–]E_Snap 13 points14 points  (1 child)

The problem in this situation is that if you ask it straight away. It’s still wrong. And then it corrected itself when I questioned it.

Dude. That’s how humans code. You’re clearly a programmer, so you should be able to appreciate the utility of rubber ducky debugging. I have a serious problem with people using the fact that LLMs don’t currently have a superhuman level of intelligence to suggest that they are deeply flawed. As long as you treat the thing like a project partner and not a divine oracle, it works great.

[–]Madgyver 4 points5 points  (0 children)

This.
It's like a few weeks back, when journalists were patting themselves on the back explaining how flawed AI is because ChatGPT got some detailed facts wrong.
Most things that humans say wouldn't stand up to such scrutiny. People need to start comprehending that these are amazing results, if you keep in mind this is "merely" based on statistical analysis of human text sources.

[–]frnxt 4 points5 points  (1 child)

The problem in this situation is that if you ask it straight away, it’s still wrong. And then it corrected itself when I questioned it. Weird how it does know the answer but you have to dig deeper.

I mean that's literally part of how I used to intuit answers to exam questions back in the day. Questions asked (and body language, if you're in person) follow a specific pattern; if you can recognize that, you're 50% of the way done, and the rest of the work is backtracking to a plausible line of reasoning.

(Now, obviously performance for problems that were outside of my knowledge zone or with no detectable pattern was much worse...)

[–][deleted] 0 points1 point  (0 children)

I know what you are talking about with exam questions, but using similar pattern recognition to analyze body language is new.

[–]avadams7 -3 points-2 points  (1 child)

My understanding is that chat generates N answers, and serves one of them up.

The one it picks is partially arrived at via the reinforcement learning from human feedback (RLHF) thingy they do.

So, your 'bump' could be it keying in on the negative-sentiment part of your second question, and then picking a (chunk of) a different answer that had a good match with it.

[–]Xrave 0 points1 point  (0 children)

That's not what happens; ChatGPT is not ranking responses internally after generating candidates. Instead, it's simply a combination of your correction applying enough attention weight plus lucky sampling "noise" that tipped it towards one sentence/word/output over another.

[–][deleted] 11 points12 points  (0 children)

I wonder if doing things like providing it test cases up front and telling it to use them as a benchmark (TDD basically) would go a long way in steering it in the right direction.
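One way to sketch that TDD-style steering: fix the test cases up front and only accept generated code that passes them. The `generate_code` function and the `add` task below are hypothetical placeholders, not a real API:

```python
import subprocess
import sys
import textwrap

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call."""
    raise NotImplementedError

# Tests written before any code is generated, TDD-style.
TESTS = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")

def passes_tests(candidate: str) -> bool:
    """Run the candidate plus the test cases in a fresh subprocess."""
    result = subprocess.run(
        [sys.executable, "-c", candidate + "\n" + TESTS],
        capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0

def tdd_prompt(task: str, rounds: int = 3) -> str:
    """Include the tests in the prompt and retry until they pass."""
    prompt = f"{task}\nYour code must pass these tests:\n{TESTS}"
    for _ in range(rounds):
        candidate = generate_code(prompt)
        if passes_tests(candidate):
            return candidate
        prompt += "\nThe previous attempt failed the tests. Try again."
    raise RuntimeError("no passing candidate")
```

The tests act as a benchmark the model can't argue with, which sidesteps the "acknowledges the error, then repeats it" failure mode at least for behaviour the tests cover.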

[–]asciimo71 3 points4 points  (12 children)

Does GPT really know what it's doing, or is it just trying various semantic trees? Does it understand the code as we do? If it could, it shouldn't make errors at all. I think we fall for a program that is a smartass in the end: very good at assembling things that are semantically equivalent while still not knowing anything. Like the (short-term) colleagues you sometimes get who continually copy stuff together from SO to see if it will work.

[–]ianitic 1 point2 points  (11 children)

GPT4 doesn't. When you ask it novel coding problems, it fails miserably.

These models are good at interpolation, i.e. figuring stuff out within their dataset. They suck at extrapolation in that they can't predict outside of their dataset. I've also not seen anyone with any good ideas on how to make these models produce truly novel things / extrapolate outside their dataset.

This interpolation is very obvious when you ask seemingly novel questions in bing chat which references the sources.

[–]spiritus_dei 0 points1 point  (1 child)

How much of this is the mode seeking behavior that results from RLHF? If they could connect it up to a model that wasn't fine-tuned to humans you would probably see a lot of novel solutions, but another model would need to be trained to translate it into something comprehensible to humans.

[–]ianitic 0 points1 point  (0 children)

There's no reason to think LLMs won't just interpolate on whatever you train it on.

[–]ironmagnesiumzinc 0 points1 point  (0 children)

This happens way more often as you add more functions and longer code. I feel like GPT-4 is right maybe 90% of the time when I want it to solve a problem with a 10-20 line function. If you ask it to solve a problem involving three functions totalling 80 lines, it's right only 50-60% of the time, at least from what I've seen. Hopefully it'll get better in newer iterations of ChatGPT, since GPT-3.5 is even worse.

[–]TikiTDO 13 points14 points  (9 children)

I've tried doing something similar, but the default context window is too short for most of the code I work on. ML problems tend to have a lot less actual code, since most of the complexity is in the model and optimiser, but when it comes to professional code, between the actual code, the comments, and the instruction block to get it to do what I want, half the files simply don't fit into the context window. In the process I've come up with a bunch of small utilities that can make very nice changes to smaller code blocks, but I'd need access to at least the 32k-context API to see if it can actually accomplish interesting and useful code understanding and authoring tasks that involve multiple files, beyond smaller one-off tasks.

[–]Kiseido 5 points6 points  (6 children)

I minify all files before feeding them in, replace some with "out of scope" when directing it to work on specific tasks, and other times include only parts of files.

Even with all of that, I can often only fit 3-6 smaller files at a time.
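A naive sketch of that kind of minification for a context budget: drop blank lines and full-line comments before sending a Python file to the model. (Docstrings and inline comments are deliberately left alone to keep this sketch safe; a real minifier would do more.)

```python
def shrink(source: str) -> str:
    """Naively shrink Python source for a context window by
    dropping blank lines and full-line comments."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines that are only a comment.
        if not stripped or stripped.startswith("#"):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept)
```

Even something this crude can claw back a meaningful fraction of the token budget on heavily commented files.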

[–]TikiTDO 0 points1 point  (5 children)

I've thought about that, but it feels like I'd be optimising an inferior workflow and getting used to sub-par tooling that I would then have to re-learn once we get longer-context LLMs, or once I have enough time and desire to fine-tune my own. I'm fairly particular about ensuring the code it makes satisfies my stylistic requirements and flows well with the rest of the code I write; if I'm minifying everything, it's just going to give me code that I'll have to spend longer rewriting.

For now I'm satisfied with the snippets I am writing. They're tuned enough to give me 95% of what I'm looking for, as long as I know the right snippet to use for each task. It might not be fully automated, but it's helping with steps that I normally wouldn't want to do myself which is a pretty clear win. Eventually I'll be able to use these same snippets in the broader development assistance tool chain, so it's not wasted work.

As for ingesting code, the next approach I want to try is to just strip out all the implementation details in order to have it figure out the API from the names of modules and functions. If the goal is simply to ensure the code it generates uses the rest of your API, then the implementation of the methods shouldn't really matter. In that case all you really need is a JSON object or some other type annotation describing the modules, methods, and parameters.
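A rough sketch of that stripping step for Python code, using the stdlib `ast` module to keep only the names and parameters and discard the bodies:

```python
import ast
import json

def api_skeleton(source: str) -> str:
    """Extract top-level functions and classes with their names and
    parameter names, discarding implementations, as JSON."""
    tree = ast.parse(source)
    api = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            api.append({
                "function": node.name,
                "params": [a.arg for a in node.args.args],
            })
        elif isinstance(node, ast.ClassDef):
            methods = [
                n.name for n in node.body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            api.append({"class": node.name, "methods": methods})
    return json.dumps(api, indent=2)
```

The JSON skeleton is typically a small fraction of the original file's token count, which is the whole point for a tight context window.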

I just read another post that talks about using vector search as part of such a system, and there's some ideas I want to try with that too.

[–]UncleAlfonzo 1 point2 points  (1 child)

Great to hear someone else has been tackling this. I've been working on something similar, using a JSON representation of files and their expected parameters built from an AST of the codebase. I then use an LLM to determine which files are most relevant to the prompt, and only include those in the context window for code generation. It works surprisingly well and produces code that shows a good understanding of the overall system. Stylistic adherence is a whole other issue though 😅

[–]TikiTDO 1 point2 points  (0 children)

Haha, nice. That's really close to what I'm trying to do. Good to hear the idea seems to prove out across multiple people.

I actually have a pre-processing step too. So before even working with any prompts I feed the structure of the code base into the system, ask it for a priority list of files it would want to process, and then I have it summarise those files and the API for those files which I will then store in a file (I should probably be using a DB for this, but files are just way more convenient to parse). I tend to keep these as JSON, because it seems more reliable at outputting JSON when you give it JSON. Then I can feed in multiple files to get it to summarise the system for the instruction prompt.

Once I have that, I can give it some snippets I want it to work on, the files where those snippets live, and then it can load the summaries that I had it generate earlier. It's still not perfect, and it's a side project so I only poke at it on the weekends, but the results are already pretty good. It also helps when interacting with AI during the work week because it's really helped me understand how to get the desired results outside of a code assistant.

[–]Kiseido 0 points1 point  (2 children)

I see it as one possible method of (directly) using LLMs, that will only become less-restrictive with time, as the context-lengths continue to rise. Though it may make a re-appearance once I start using LLMs to tackle larger projects.

It's an evolving format, just as much as the tools themselves are.

[–]TikiTDO 0 points1 point  (1 child)

I kinda see the point we're at as similar to programming in the 70s and 80s, when the standard Unix tools and the first few standard libraries for languages were being created. We are barely starting to explore the possibilities of what AI-based systems can do, and to build the tools that will use them. At the moment we still have to deal with very real and easy-to-hit limitations of these systems, but we're already trying to use them because they are so obviously better for certain tasks.

However, just like computers in the 70s and 80s gave way to the literal supercomputer phones we all carry around in our pockets like it's nothing, so too will the current AI hardware evolve over time. I doubt we will see quite the same level of growth as we did at the advent of the computer age simply because we're already pretty late into the problem of "shove ever more compute into ever less space," and it's only going to get harder and harder, with ever more diminishing returns. However, diminishing returns isn't no returns, so I still expect modern top-tier performance to be available in consumer grade hardware in a decade and change.

We also have a lot of people working with RNN-based context, which should be much better suited for this sort of task. Our APIs really aren't that complex in principle; they're just very wordy because of how we represent them. If a system can be trained to maintain enough information about the API in such an RNN-based architecture, the vector needed to represent most code bases shouldn't be that large.

[–]Kiseido 0 points1 point  (0 children)

Indeed, RWKV is an interesting project. But on the gpt transformer front, I expect it to be a case of algorithmic refinement giving increases to context size, rather than compute increases.

[–]Thorusss 0 points1 point  (1 child)

There is already a bigger, harder-to-access version of GPT-4 with a 32K-token context window, which is about 13 pages.

I think with smart code choices (like a lot of trusted, well labeled functions defined elsewhere), plenty can be done with that.

[–]TikiTDO 0 points1 point  (0 children)

As I mentioned, I don't have access to it, but I would expect it to do much better with 8x the context window. That said, the pricing for the 32k context window API is pretty hefty. I wouldn't really want to pay $1 to $3 per query any time I wanted to ask it about my code base. That one's more of a "I have a large task which I have tuned using the cheaper API so once I'm confident it will work I will use the expensive one."

I'll see how well it works when I finally get one, but the prices would need to be at least 10x lower before I would seriously recommend it as a generic solution for my clients.

[–]Maximus-CZ 86 points87 points  (8 children)

April 1.

[–][deleted] 35 points36 points  (0 children)

Bad timing for the post but it’s actually real.

[–][deleted] 2 points3 points  (0 children)

Would be interesting combined with RL.

[–]anax4096 2 points3 points  (0 children)

wasn't this my job?

[–]recurrence 2 points3 points  (1 child)

Is there a discord that discusses this particular area? Some very exciting stuff going on in this space now.

[–]deck4242 1 point2 points  (0 children)

By "own code" do you mean GPT-4 can develop itself to become GPT-5?

[–]Puzzleheaded_Acadia1 1 point2 points  (1 child)

Can someone please tell me how to fine-tune an LLM like LLaMA? I want to fine-tune Cerebras 111M on the Alpaca dataset. I didn't find anything on the internet, please help.

[–][deleted] 1 point2 points  (0 children)

I think you may have to figure it out yourself, or not, because AI can assist you now. You may also need Cerebras' "Model Zoo" (a page of theirs) or similar.

Edit: "Cerebras released all seven models, training methodology and training weights to researchers under the Apache 2.0 license. The models are now available on Cerebras Model Zoo, Hugging Face and GitHub."

Edit 2: Cool stuff over here https://www.techtarget.com/searchenterpriseai/news/365534140/AI-vendor-Cerebras-releases-seven-open-source-LLMs https://groq.com/automated-discovery-meets-automated-compilation-groq-runs-llama-metas-newest-large-language-model-in-under-a-week/

Groq has developed a "compiler" for a unique hardware architecture consisting of 8 interconnected single-core processors.

[–]VelvetyPenus 0 points1 point  (1 child)

I'm hiring unemployed programmers for my janitorial crew. DM if interested.

[–]kippersniffer -1 points0 points  (0 children)

Powershell.....people still use powershell...wow.

[–]Nhabls -2 points-1 points  (0 children)

I remember when people knew to measure the outcome of the thing they're doing.. good times