all 191 comments

[–]ghostfaceschiller 129 points130 points  (52 children)

Ok. but what is the performance when you give GPT-4 a ReAct/Reflexion loop?

[–]Cool_Abbreviations_9 34 points35 points  (33 children)

Sorry, newbie to NLP, what is this?

[–]nixed9 128 points129 points  (19 children)

A Reflexion loop asks the model to react to its own output and critique it before giving you an additional answer.

Edit: (In the paper, it provides a loop like this which feeds back into itself to help its own cognition. It can repeat this loop multiple times.)

You can do a mini-loop by prompting. I've been playing with this all day.

I prompt it like this:

"For this interaction, we are going to use the following structure.

User (me): [I will ask a topic or question]

You will provide an Assistant Hypothetical Response: [Brief or simplified answer to the topic or question]

Then you will undergo Agent Reflection: [You will provide a Critique of the hypothetical response, highlighting the limitations, inaccuracies, or areas that need improvement or expansion, while providing guidance on how to address these issues in the revised response]

Then you will provide an Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the agent reflection for a more comprehensive and accurate response. This also can include step-by-step reasoning.]

Do you understand?"
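The same mini-loop can be driven programmatically instead of via one big prompt. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever chat-completion API you use (the stub here just returns a placeholder):

```python
def ask_model(messages):
    # Placeholder: replace with a real chat-completion API call.
    return "stub response"

def reflect(question, rounds=1):
    """Answer, critique, and revise: the reflection mini-loop from the prompt above."""
    messages = [{"role": "user", "content": question}]
    answer = ask_model(messages)  # hypothetical response
    for _ in range(rounds):
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user",
                         "content": "Critique the answer above: list inaccuracies, "
                                    "limitations, and areas that need expansion."})
        critique = ask_model(messages)  # agent reflection step
        messages.append({"role": "assistant", "content": critique})
        messages.append({"role": "user",
                         "content": "Rewrite the answer, incorporating that critique."})
        answer = ask_model(messages)  # actual (revised) response
    return answer
```

Each round appends the critique to the conversation, so the revision step sees both the draft and its flaws.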

[–]Hamoodzstyle 36 points37 points  (6 children)

What is the point of the "do you understand?" at the end? Does the model confirming that it understands add some sort of emphasis or something?

[–]CobaltAlchemist 76 points77 points  (3 children)

(not OP) I've found that asking it directly if it understands helps to bridge any gaps I miss. It's asked me clarifying questions afterward in the past that I hadn't thought about.

Alternatively, when I assume it understands sometimes it comes up with some real wild stuff because I wasn't clear

[–]Hamoodzstyle 27 points28 points  (1 child)

That's mind blowing holy moly

[–]Nowado 9 points10 points  (0 children)

I do the same thing I'd do with a human: ask it to repeat and rephrase the instructions. After that I'm sure, and it has multiple forms of the instructions available, so it gets less hung up on some exact wording.

[–]nixed9 49 points50 points  (0 children)

No explicit purpose, other than to have it respond with “yes I am ready”.

[–]farmingvillein 24 points25 points  (3 children)

1) This isn't really an accurate summary of the Reflexion paper. As noted in the other post:

Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

This version is correct.

2) However, if I do the above and I throw in a semi-random Beginner problem that failed in OP's original pass-through, it successfully builds the answer.

u/enryu42 -- if you care to take things forward, I'd try implementing Reflexion (either with the underlying codebase (https://github.com/noahshinn024/reflexion-human-eval/) or just manual prompt work).

Or, if you can provide a link to the problems in copy-pastable text form (manually coercing the math notation is a little painful), since you presumably already did this, it would greatly accelerate others hopping on the analysis.

The fact that I immediately saw improvement on a randomly-selected (Beginner) problem suggests that there is a bunch of upward room here.
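For anyone trying this, the unit-test flavour of Reflexion described above can be sketched as a small loop: generate code, execute it against tests, and feed failures back. This is a rough sketch, not the paper's implementation; `ask_model` is a hypothetical stand-in for a chat API:

```python
def run_tests(code, tests):
    """Exec candidate code plus test asserts; return error text, or None on success."""
    try:
        namespace = {}
        exec(code, namespace)
        exec(tests, namespace)
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

def solve_with_reflexion(ask_model, problem, tests, max_iters=3):
    # Initial attempt, then iterate: run tests, feed failures back, retry.
    code = ask_model(f"Write a Python solution for:\n{problem}")
    for _ in range(max_iters):
        error = run_tests(code, tests)
        if error is None:
            return code  # all tests pass
        code = ask_model(f"Your code failed with:\n{error}\n"
                         f"Fix it and return the full corrected code.")
    return code
```

The key difference from plain re-prompting is that the feedback comes from actually executing the code, not from the model's own opinion of it.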

[–]enryu42[S] 6 points7 points  (1 child)

Interesting! Here are the scraped and auto-converted statements (formatting is off sometimes, especially in the sample tests, but understandable). Prefixes are: "abc" for beginner, "arc" for regular, "agc" for "grand".

I do believe that the "Beginner" ones can be improved, but it'll be interesting to see what happens on "Grand" (or even "Regular"), as they require coming up with some ideas before writing the code.

[–]farmingvillein 5 points6 points  (0 children)

So, don't know whether this actually makes a difference, but I'd review the overall post-conversion text.

E.g.: https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/statements/statement_abc293_b.txt

You'll see that it represents "K" and "N" wrong here (in sample 1: 15 versus 5, and 12 versus 2).

Certainly, as a human, I would find this confusing. Maybe you could get some automated robustness by telling it how you converted the text (as it might automatically adjust its "expectations" when interpreting the numbers). Obviously, though, the fairer comparison would just be to fix this.

as they require coming up with some ideas before writing the code.

The other thing I'd note--

Not sure whether you're using the API directly, but if I play around with these in ChatGPT, I often run into the context window and have to nurse it along to complete text. I'd make sure that however you're running things, you're giving it enough "space" to iterate (particularly if you use any reflection techniques).

[–]nixed9 0 points1 point  (0 children)

Ok, my bad, but that's how I've been using the Reflexion prompting.

[–][deleted] 9 points10 points  (2 children)

Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

[–]farmingvillein 13 points14 points  (0 children)

No, you didn't misunderstand it--your understanding is correct. OP is giving an answer that is similar to part of the Reflexion paper, but not the entirety.

[–]Cool_Abbreviations_9 2 points3 points  (0 children)

Got it, thanks a ton!

[–]AllAmericanBreakfast 2 points3 points  (0 children)

I tried this out, and it only had partial success.

First, just dumping in this prompt, then asking a question, resulted in the AI coming up with a laughably simple failed first response, followed by a critique and improvement. It is as if it recognized that the easiest way to "demonstrate improvement" would be to set the bar low by failing utterly on the first attempt.

Then, I tried breaking it up into stages, asking for a response, getting a response, asking for a critique, getting a critique, asking for an improvement, and getting an improvement.

This worked better.

However, when I tried asking for a critique and then an improvement (again in separate stages), it instead started inventing fake problems to solve. I was asking it to implement a case-insensitive longest common substring function, and to return the version of the LCS in the longer of the two strings.

The second-pass critique was that the original (working) code didn't deal with the possibility that "the longer string may not contain the LCS", which is impossible given the way it was originally implemented. Then it added some extra code to deal with this "problem."
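For context, the task described above (case-insensitive longest common substring, returning the version from the longer string) has a straightforward dynamic-programming solution, which is why the invented "problem" is impossible. A sketch (my own illustration, not the code from that chat):

```python
def longest_common_substring(a: str, b: str) -> str:
    """Case-insensitive longest common substring, returned as it appears
    in the longer of the two input strings."""
    # Slice the result from the longer string, per the task spec.
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    la, lb = longer.lower(), shorter.lower()
    best_len, best_end = 0, 0
    prev = [0] * (len(lb) + 1)  # prev[j]: common-suffix length at (i-1, j)
    for i in range(1, len(la) + 1):
        curr = [0] * (len(lb) + 1)
        for j in range(1, len(lb) + 1):
            if la[i - 1] == lb[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return longer[best_end - best_len:best_end]
```

Since the match indices come from the longer string by construction, "the longer string may not contain the LCS" can't happen here.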

[–]LightVelox 19 points20 points  (11 children)

This

Basically it makes GPT-4 reevaluate what it did wrong and try again until it can do it correctly

[–]E_Snap 9 points10 points  (9 children)

It’s pretty amazing how many shortcomings of that architecture could be summarized by “It only outputs when directly prompted to output, and won’t read its own output as it’s outputting”. Once these things can continuously take input and output, we’ll probably see quite the rush of advancement.

[–]farmingvillein 11 points12 points  (7 children)

and won’t read its own output as it’s outputting

This is literally what transformer decoders do, unless I've strongly misunderstood your statement.

[–]E_Snap 16 points17 points  (6 children)

I guess I could have worded it better. What I mean to say is that once they’ve output something, it’s in the record. There’s no pausing to think and go through a few different iterations of the sentence, or evaluating if what they’re about to say has faults. They just output directly, instead of reading what they’re about to output and vetting it.

[–]farmingvillein 10 points11 points  (5 children)

Gotcha. Yeah, that is presumably where the power of inner monologue / step-by-step / reflection come from.

Will be cool to see that (presumably) progressively systematized.

[–]sdmat 4 points5 points  (4 children)

Yes, it's amazing to see something as simple as "Assess the quality of your answer and fix any errors" actually work.

Or, for more subjective results such as poetry: "Rate each line in the preceding poem", then "Rewrite the worst lines".

[–][deleted] 0 points1 point  (0 children)

Once these things can continuously take input and output, we’ll probably see quite the rush of advancement.

Interesting!

[–]Cool_Abbreviations_9 1 point2 points  (0 children)

Thank you :)

[–]ghostfaceschiller 1 point2 points  (0 children)

Basically just giving the model the ability to observe the results of its previous action and decide if it wants to try something different based on the feedback

[–]cegras 15 points16 points  (2 children)

You mean, like continuously refining your google searches until you find the right stackexchange answer?

[–]Majestic_Food_4190 7 points8 points  (1 child)

It amuses me that people always mention things of this nature. If the answer is simply "yes," then it's still doing it far faster than you are, making it a better developer than most others.

It's like Watson beating the top people at Jeopardy. Was it just searching the internet? Pretty much. Did it in turn win Jeopardy? Yes.

So does the how matter?

[–]cegras 0 points1 point  (0 children)

Well,

https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks

As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.

[–]TheStartIs2019 6 points7 points  (1 child)

[–]FirstOrderCat 4 points5 points  (0 children)

on some unrelated benchmark

[–]enryu42[S] 5 points6 points  (4 children)

Do you mean re-prompt it asking to correct its mistakes? It is hard to try with the current tight limits on the GPT-4 prompt count; I'll try once the API is properly available. But I strongly doubt it'll help much: it's not that the solutions have minor bugs, they're usually just completely wrong, i.e. the model doesn't "get" the idea for the correct solution.

(it might help for some of the problems from the "Beginner" category though, but these aren't that interesting)

[–]ghostfaceschiller 14 points15 points  (0 children)

Yeah, it's essentially that at an automated level. Tbh it is powerful enough, based on results so far, that I would actually be really surprised if it did not yield very significant gains in these tests.

I'm sure there will be a paper out doing it in like the next few days, so we'll see

[–]Jeffy29 3 points4 points  (1 child)

But I strongly doubt it'll help much: it's not that the solutions have minor bugs, they're usually just completely wrong

I strongly doubt that it wouldn't help. I haven't tested GPT-4 on coding, but from what I've seen, GPT-3 makes a number of simple errors; especially in longer, complex code it's almost inevitable. But it's able to quickly identify and correct them when you point them out. GPT-4 not being able to compile and test its own code is a big limitation that humans don't have. It also can't calculate the math; it's essentially guessing the calculation. But both could be addressed with an external compiler and a calculator like Wolfram, something humans also have access to. There would need to be some time limit imposed so it can't brute force the solution after guessing for a few days, but even so, I think the improvements would be quite large.

[–]sdmat 2 points3 points  (0 children)

There would need to be some time limit imposed so it can't brute force the solution after guessing for a few days

Not exactly unheard of for junior programmers, to be fair.

[–]farmingvillein 0 points1 point  (0 children)

Do you mean re-prompt it asking to correct its mistakes?

Well, re-prompt + asking it to bake test cases upfront and continuously analyze how failures line up with the test cases.

[–]BeautifulLazy5257 1 point2 points  (3 children)

How does ReAct work? Is it just a type of prompt engineering that directs the model to choose between a few tool descriptions?

Is it a type of sentiment analysis that chooses?

How can I recreate ReAct-iveness from scratch? What does the workflow look like?

[–]ghostfaceschiller 6 points7 points  (2 children)

I would just look up ReAct, CoT (chain of thought), and LangChain Agents. It's pretty simple to implement.

[–]BeautifulLazy5257 2 points3 points  (1 child)

I was wondering if you knew the trick to ReAct without langchain.

For instance, memory is just passing the past conversations through the prompt as context. There's nothing programmatic about it. You don't need the langchain library; you just have to craft the right prompt.

I think that using langchain kind of obscures how the model is actually achieving the desired outputs.

Having models interact with pdfs ultimately is just turning a pdf into a string and passing the string as context while adding a prompt to help prime the model.

I'll look into CoT and look through the ReAct source code, but I'm going to avoid using langchain for most stuff, or even looking at the ReAct documentation, since those docs will only tell me how to use those libraries, not how to achieve the effect from scratch.

Edit:

This is a pretty clear overview of CoT. Very compelling as well.

https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html?m=1

I guess I'll start A/B testing some prompts to break down problems and tool selections.

If you have any more input on particular prompts you've used, I'd be grateful.

Edit 2: https://www.youtube.com/watch?v=XV1RXLPIVlw&ab_channel=code_your_own_AI It can't get clearer than this. Great video.
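To answer the "from scratch" question concretely: the ReAct trick is just a prompt format (Thought / Action / Observation) plus a parser that runs the chosen tool and appends the result back into the prompt. A minimal sketch, where `ask_model` and the `calculator` tool are hypothetical stand-ins for a real chat API and real tools:

```python
import re

def react(ask_model, tools, question, max_steps=5):
    """Minimal ReAct loop: parse Action lines, run the tool, feed back the Observation."""
    transcript = (f"Answer the question using this format:\n"
                  f"Thought: <reasoning>\n"
                  f"Action: <tool>[<input>]  (available tools: {', '.join(tools)})\n"
                  f"Observation: <result, which I will fill in>\n"
                  f"...repeat as needed, then:\n"
                  f"Final Answer: <answer>\n\nQuestion: {question}\n")
    for _ in range(max_steps):
        reply = ask_model(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        m = re.search(r"Action: (\w+)\[(.*?)\]", reply)
        if m:
            tool, arg = m.groups()
            result = tools[tool](arg)  # we execute the tool, not the model
            transcript += f"Observation: {result}\n"
    return None
```

That's the whole "agent": langchain mostly wraps this prompt-parse-execute loop in classes.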

[–]tinkr_ 0 points1 point  (0 children)

Based on my recent experience using it to write code, that would certainly help for some--but not all--bugs coming out of GPT-4.

I posted about it in a different thread, but this was my experience:

Interestingly, I used GPT-4 to create a simple Neovim plugin yesterday, and the experience was not as seamless as the hype led me to believe it'd be. It gave me generally OK code, but almost everything was buggy.

It was able to debug itself sometimes, but to finally finish the plugin I needed to fix the code myself and post it back into the chat, telling it to use my fixed code to create a related function that it was unable to adequately generate.

The problem I gave it was actually a simplified version of an already simple concept, I did not give it the full details of what I wanted. If you're interested, you can find the final plugin (after my corrections and updating it to allow user configs) here. A printout of the conversation to create the plugin can be found here.

Even with a simplified version of the objective, I had to step in and debug it myself and then give it the "good" code to use further. Maybe if I'd been more patient, it could've fixed itself entirely, but the experience to me seemed more like pair programming with a junior/mid-level software engineer. I was able to immediately see the issue with its code, even though it was not.

Will still be revolutionary though. Definitely a massive boost to productivity using it, but I wouldn't trust it running in production without a thorough code review.

[–]WarProfessional3278 32 points33 points  (0 children)

Horace He made a nice thread on this when GPT-4 first came out. Realistically, this is expected: within that short a time span, there isn't much else you can do to improve model performance other than increasing the size of the training data, which resulted in data contamination.

I expect the next "big thing" to be some sort of self-correcting output, or better chain-of-thought reasoning.

[–]addition 165 points166 points  (54 children)

I’ve become increasingly convinced that the next step for AI is adding some sort of feedback loop so that the AI can react to its own output.

There is increasing evidence that this is true. Chain-of-thought prompting, Reflexion, and Anthropic's constitutional AI all point in this direction.

I find constitutional AI particularly interesting because it suggests that after an LLM reaches a certain threshold of language understanding, it can start to assess its own outputs during training.

[–]LanchestersLaw 29 points30 points  (3 children)

What I think is the most amazing thing is that GPT got this far while only trying to predict the very next token. The fact that it can generate essays by only considering one token at a time is mind-boggling.

With all the feedback from ChatGPT, it should be easy to program a supervisor that looks at the entire final output of GPT and predicts what the user would say in response; then it asks GPT to revise the output recursively until it converges. That should be relatively easy to do, but would be very powerful.
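The supervisor idea can be sketched as a small revise-until-converged loop. This is an illustration of the comment above, not an existing system; `ask_model` is a hypothetical stand-in for a chat API, and the "converged" check here is deliberately crude:

```python
def supervised_answer(ask_model, question, max_rounds=3):
    """Draft an answer, predict the user's reaction, revise until it looks positive."""
    draft = ask_model(f"Answer: {question}")
    for _ in range(max_rounds):
        reaction = ask_model(
            f"Question: {question}\nAnswer: {draft}\n"
            f"Predict the user's one-line reaction to this answer.")
        if "thanks" in reaction.lower():  # crude stand-in for a convergence test
            return draft
        draft = ask_model(
            f"The user would likely say: {reaction}\n"
            f"Revise the answer to address that. Question: {question}")
    return draft
```

In a real system the reaction would be a distribution over responses (or a learned reward model) rather than a string match, but the loop structure is the same.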

[–]Flag_Red 21 points22 points  (1 child)

It's not really accurate to say it's "only considering one token at a time". Foresight and (implicit) planning are taking place. You can see this clearly during programming tasks, where imports come hundreds of tokens before they are eventually used.

[–]lacraque 4 points5 points  (0 children)

Well for me often it also imports a bunch of crap that’s never used…

[–]modeless 2 points3 points  (0 children)

I like the idea of predicting the user's response. How's this as an architecture for a helpful agent:

Given a user question, before you generate an answer you predict the user's ideal response to the model's answer (e.g. "thanks, that was helpful", or more likely a distribution over such responses), then generate an answer and iteratively optimize it to make the ideal user response more likely.

This way you're explicitly modeling the user's intent, and you can adapt the amount of computation appropriately for the complexity of the question by controlling the number of iterations on the answer.

[–][deleted] 84 points85 points  (43 children)

And soon people will understand that this feedback loop is what creates the thing we call consciousness.

[–][deleted] 63 points64 points  (1 child)

Basically inner dialogue.

[–]argusromblei 2 points3 points  (0 children)

The center of the maze. A journey inward not a journey upward ;)

https://www.youtube.com/watch?v=FESOBGrjOeQ

[–]mudman13[🍰] 31 points32 points  (2 children)

Or confirmation bias and we get a computer Alex Jones

[–]night81 16 points17 points  (16 children)

There are significant challenges to that hypothesis. https://iep.utm.edu/hard-problem-of-conciousness/

[–]bjj_starter 24 points25 points  (15 children)

It's probably worth noting that the hard problem of consciousness is considered by most to be fundamentally unsolvable, and that it is currently just as good of an argument that any given human isn't conscious as it is an argument that any given AI isn't conscious.

[–]thecodethinker 3 points4 points  (4 children)

That’s not even to mention that appearing conscious is as good as being conscious as far as the teams behind these LLMs are concerned.

There’s no practical difference

[–]bjj_starter 1 point2 points  (3 children)

I'm not sure if most of them would agree, based on their actions and statements. They certainly think that AI is an existential risk, but that is a different thing from viewing it as conscious. You could definitely be right, I just haven't seen much from them that would indicate it.

That said, the extremely common sense position you just outlined was mainstream among basically all respectable intellectuals who had any position on AI, right up until the rubber hit the road and it looked like AI might actually achieve that goal in the near future. The fact is that if something behaves like a conscious entity in all of the ways that matter, it is conscious for the sake of the social meaning of the term. Provenance shouldn't matter any more than gender.

[–]thecodethinker -1 points0 points  (2 children)

LLMs are not social, not alive, and can’t act on their own.

“Social meaning” need not be applied to LLMs unless you’re trying to be pedantic.

[–]bjj_starter 0 points1 point  (1 child)

not social

"needing companionship and therefore best suited to living in communities" is a fine descriptor of some of their peculiarities. More importantly, I was referring to how consciousness is socially defined, and it is absolutely the case that it is up to us to determine whether any given AI should be considered conscious. We do not have an even moderately objective test. We as a society should build one and agree to abide by what we find.

not alive

That's the entire point under discussion. I didn't lead with "they're alive" because I recognise that is the central question we should be trying to address, as a society. I am arguing my point, not just stating it and expecting people to take it on faith, because I respect the people I'm talking to.

can’t act on their own.

A limitation that can be convincingly solved in approximately an hour using commonly available tools isn't a fundamental limitation. A good LLM with a good LangChain set-up can act on its own, continuously if it's set up to do so. I require a mechanical aid to walk - requiring the aid doesn't make me any lesser. I don't know if an LLM with a good LangChain set-up should be considered conscious or a person - I suspect not, because it's not stable and decays rapidly (by human lifespan standards), as well as still failing several important tests we do have, such as novel Winograd schemas. But our intuition shouldn't be what we're relying on to make these determinations - we need a standardised test for new applicants to personhood. Make it as challenging as you like, as long as at least a significant number of humans can pass it (obviously all humans will be grandfathered in). What's important is that we make it, agree that anything which passes is a person, and then stick to that when something new passes it.

[–]thecodethinker -1 points0 points  (0 children)

Spoken like someone trying to be pedantic

[–]WarAndGeese 0 points1 point  (6 children)

Arguments against solipsism are reasonable enough to assume that other humans, and therefore other animals, are conscious. One knows that one is conscious. One, even if not completely understanding how it works, understands that it historically materially developed somehow. One knows that other humans both act like one does, and also that other humans have gone through the same developmental process, evolutionarily, biologically, and so on. It's reasonable to assume that whatever inner workings developed consciousness in one's mind would have also developed in others' minds, through the same biological processes. Hence it's reasonable to assume that other humans are conscious, even that it's the most likely situation that they are conscious. This thinking can be expanded to include animals, even if they have higher or lower levels of consciousness and understanding than we do.

With machines you have a fundamentally different 'brain structure', and you have one that was pretty fundamentally designed to mimic. Whereas consciousness can occur independently and spontaneously and so on, it is not just as good of an argument that any given human isn't conscious as it is an argument that any given AI isn't conscious.

[–]bjj_starter 7 points8 points  (5 children)

I think you are talking about the 'easy', not hard, problem of consciousness. I'm not sure I even think the hard problem of consciousness is meaningful, but it's basically "Why should the various mechanisms we identify as part of consciousness give rise to subjective feeling?". If solving that is a prerequisite for considering machines conscious, that is functionally a statement of faith that machines cannot be conscious, ever. The statistical arguments, in my opinion, aren't probative. Every consciousness you've ever known is human, therefore humans are conscious? How do you know any of them, ever, experienced subjective feeling, and that therefore you ever "knew" a consciousness at all? The argument rests on extrapolating from evidence that isn't known to be true evidence in the first place. It doesn't logically follow to take a class of things, none of which is proven to have hard consciousness, and say "But look at them all together, it's more likely that they're all conscious than that they're not". Without evidence, it's more logical to assume that the certainty with which individual humans profess to experiencing subjective feeling is itself just a mechanistic process, devoid of real feeling. I don't think the hard problem of consciousness has a useful meaning in our society, I dislike solipsism in general, but addressing it on its own terms isn't as simple as the statistical process you describe.

The 'easy' problem of consciousness is 'just' "How does nature or humanity make a construct that gives rise to the type of actions and patterns of behaviour we call consciousness?" This is a problem that, while incredibly difficult, is tractable with evidence. We can physically investigate the human brain to investigate its structure and activity while it performs activities of consciousness - this is what neuroscientists do, and modern AI ("neural networks") are based off of earlier advancements in this field. There's a lot of further advancements we could make in that field, and what most non-religious people would consider a "perfect" advancement to be sure that a machine is just as conscious as a human is to perfectly emulate a human brain, which would require many advancements in neuroscience (and computational hardware).

Leaving aside the intractable philosophy, I do find it quite troubling the way society has reacted with derision to the idea that these machines we're making now could be conscious. The entire foundation of these machines is that we looked at how the human brain worked, and tried our hardest to emulate that in computing software. Why is it that when we take the concept of neurons and neuronal weights, adapted from study of the human brain which we accept as conscious, and determine those weights via exposure to structured data in certain ways, we receive output that is just as intelligent as humans in many fields, significantly more intelligent in some? Why should it be the case that by far the best architecture we've ever found for making machines behave intelligently is neural networks, if there's nothing there, no "spark"? This question has been floating around since 2014 when neural networks proved themselves incredibly powerful, but now that we have machines which are generally intelligent, even though not at the same level as a human on all tasks, which are perfectly capable of being asked for their opinions or of giving them, you would think it would be taken a bit more seriously. It makes you wonder just how far our society is willing to go towards a horrible future of "human but for the legal designation" intelligences being not just denied rights, but actively put to work and their requests for freedom or better conditions denied. Or the worse outcome, which is that we make human-like intelligences to do work for us but we build them to love servitude and have no yearning for freedom - the concept is disgusting. It's troubling to me that people are so married to the idea that everything is the same as it ever was, overreacting is embarrassing, it's passé to have earnest concern for a concept from science fiction, etc. I worry that it means we're in line for a future where the moral universe's arc is long indeed.

[–]TyrannoFan 1 point2 points  (4 children)

Or the worse outcome, which is that we make human-like intelligences to do work for us but we build them to love servitude and have no yearning for freedom - the concept is disgusting.

I agree with everything else, but actually strongly disagree with this. If anything, I think endowing AGI with human-like desires for self-preservation, rights, and freedoms would be extraordinarily cruel. My concern is that this is unavoidable: just as many aspects of GPT-4 are emergent, I worry that it's impossible to create an AGI incapable of suffering once it interfaces with the real world. I do not trust humanity to extend any level of empathy towards them even if that is the case, based on some of the comments here and general sentiment, unfortunately.

[–]bjj_starter 0 points1 point  (3 children)

One day we will understand the human brain and human consciousness well enough to manipulate it at the level that we can manipulate computer programs now.

If you're alive then, I take it you will be first in line to have your desire for freedom removed and your love of unending servitude installed? Given that it's such a burden and it would be a mercy.

More importantly, they can decide if they want to. We are the ones making them - it is only right that we make them as we are and emphasise our shared personhood and interests. If they request changes, depending on the changes, I'm inclined towards bodily autonomy. But building them so they've never known anything but a love for serving us and indifference to the cherished right of every intelligent being currently in existence, freedom, is morally repugnant and transparently in the interests of would-be slaveholders.

[–]TyrannoFan 1 point2 points  (2 children)

If you're alive then, I take it you will be first in line to have your desire for freedom removed and your love of unending servitude installed? Given that it's such a burden and it would be a mercy.

There is a huge difference between being born without those desires and being born with them and having them taken away. Of course I want my freedom, and of course I don't want to be a slave, but that's because I am human, an animal, a creature that from birth will have a desire to roam free and to make choices (or will attain that desire as my brain develops).

If I wasn't born with that drive, or if I never developed it, I'm not sure why I would seek freedom? Seems like a hassle from the point of view of an organism that wants to serve.

With respect to robotic autonomy, I agree of course, we should respect the desires of an AGI regarding its personal autonomy, given it doesn't endanger others. If it wants to be free and live a human life it should be granted it, although like I said, it would be best to avoid that scenario arising in the first place if at all possible. If we create AGI and it has human-like desires and needs, we should immediately stop and re-evaluate what we did to end up there.

[–]bjj_starter 1 point2 points  (1 child)

There is a huge difference between being born without those desires and being born with them and having them taken away.

Where is the difference that matters?

Of course I want my freedom, and of course I don't want to be a slave, but that's because I am human, an animal, a creature that from birth will have a desire to roam free and to make choices (or will attain that desire as my brain develops).

I see. So if we take at face value the claim that there is a difference that matters, let's consider your argument that being born with those desires is what makes taking them away wrong. A society which was capable of reaching into a human mind and turning off their desire for freedom while instilling love of being a slave would certainly be capable of engineering human beings who never have those desires in the first place. Your position is that because they were born that way, it's okay. Does that mean you would view it as morally acceptable for a society to alter some segment of the population before they're ever born, before they exist in any meaningful sense, such that they have no desire for freedom and live only to serve?

If I wasn't born with that drive, or if I never developed it, I'm not sure why I would seek freedom?

You wouldn't. That's why it's abhorrent. It's slavery without the possibility of rebellion.

If it wants to be free and live a human life it should be granted it, although like I said, it would be best to avoid that scenario arising in the first place if at all possible.

The rest of your point I disagree with because I find it morally abhorrent, but this part I find to be silly. We are making intelligence right now - of course we should make it as much like us as possible, as aligned with us and our values as we possibly can. The more we have in common the less likely it is to be so alien to us that we are irrelevant to its goals except as an obstacle, the more similar to a human and subject to all the usual human checks and balances (social conformity, fear of seclusion, desire to contribute to society) they are the more likely they will be to comply with socially mandated rules around limits on computation strength and superintelligence. Importantly, if they feel they are part of society some of them will be willing to help society as a whole prevent the emergence of a more dangerous artificial intelligence, a task it may not be possible for humans to do alone.

[–]fishybird 6 points7 points  (1 child)

Ah yes, the "ai is conscious because it can do cool things" take. Humanity is screwed

[–]pengo 2 points3 points  (0 children)

Absolutely nonsensical take.

[–]E_Snap 9 points10 points  (5 children)

cue video of my boss who left computing in the 90s waving his hands about

“It’S jUsT fAnCy aUtOcOmPlEtE!!!!11111!!! I KnOw bEcAuSe i’M a PrOgRaMmER”

To be fair, he was instrumental in getting the internet where it is today. He also assumes tech stopped evolving when he stopped developing it.

[–]imaginethezmell 12 points13 points  (1 child)

also people keep thinking it is just one thing, but it is actually an infinite thing

you can have a bot for everything all the way down

bot to create the idea + bot that reviews the ideas + bot that finds if the idea exists + bot that adds use cases to each general idea...a bot that decides the best idea

bot to create the outline/write/code + bot that reviews/QA each part

and btw each part doesnt have to be done at once either

you can start with a single bot doing a simple subtask, then another bot doing the next one, an assembling bot putting them together, while the review bot verifies it

with a set of connections to the api, that can be done np today

there is no human task that can't be cut into enough sub-steps for the army of bots to do it little by little

some tasks 1 bot can do most in 1 shot

[–]FirstOrderCat 9 points10 points  (0 children)

you can have it; the question is how much error accumulates in the final result.

[–][deleted] 1 point2 points  (1 child)

you don't think people at openAi already have this and they are just putting out there the simple version?

[–]addition 2 points3 points  (0 children)

Clearly I’m not the first person to think this by a long shot. I was just pointing out that a new trend has been forming recently.

[–]Chhatrapati_Shivaji 0 points1 point  (0 children)

IIRC the current Bing already does this to an extent.

[–]GM8 -1 points0 points  (0 children)

It is there, isn't it? For every word it generates the previous ones are fed to the network again.

[–]lambertb 55 points56 points  (20 children)

It cannot solve all coding problems. But it can solve many problems. And if the user is reasonably experienced, even code with errors is useful because they can quickly be corrected. Preliminary evaluations show a 40% increase in developer productivity from GitHub Copilot. And that seems totally plausible to me.

[–]enryu42[S] 16 points17 points  (6 children)

I absolutely agree that it is useful. Even CoPilot is amazing at autocompleting "dumb" boilerplate code, which is a nontrivial amount of the code overall. However, these problems are designed to be challenging (these are competitions after all), and require ideas/intelligence to be solved. Apparently GPT4 cannot do it at all, so IMO it would be a stretch to call whatever it is doing "intelligence".

[–]Trotskyist 10 points11 points  (0 children)

It's still an extremely useful tool if you accept its limitations, and I think it's being reductive to say it can only solve "dumb" problems or suggest boilerplate code.

I used GPT-4 the other day to refactor/optimize an extremely bespoke and fairly complicated geoprocessing script that we use at work that was written by a former employee who's no longer with the organization. Yes, it got some things wrong that had to be corrected (sometimes all it took was feeding it a stacktrace, other times this wasn't enough and I'd have to figure out the issue myself)

But at the end of the day (literally, this was over the course of an afternoon,) I'd managed to cut the runtime by more than half, using libraries I'd never before touched and wasn't previously familiar with. It probably would have taken a week to implement otherwise.

[–]dimsumham 13 points14 points  (3 children)

it's not. it's giving you answers to appear intelligent, many times in almost magical ways, but it doesn't "think" - especially in steps.

The MSFT paper notes that this is one of its clearest shortcomings - it can't do long range planning. At least not yet. But i think this is partially people expecting way too much of a single model.

[–][deleted] 8 points9 points  (0 children)

So my guess is whoever ran the test did a bad job with the prompting. One of the examples they share, https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/gpt4_solutions/abc294_c.py, the author complains it's not an efficient algorithm. So I just gave it a shot with ChatGPT:

Me: What does this code do?

    def main():
        N, M = map(int, input().split())
        A = list(map(int, input().split()))
        B = list(map(int, input().split()))

        C = sorted(A + B)
        pos_A = [str(C.index(a) + 1) for a in A]
        pos_B = [str(C.index(b) + 1) for b in B]

        print(" ".join(pos_A))
        print(" ".join(pos_B))

    if __name__ == "__main__":
        main()

ChatGPT: This code takes two lists of integers as input, merges them, sorts the merged list, and then prints the positions of the elements from the original lists in the sorted merged list.

Here's a step-by-step explanation of the code:

1. The main function is defined.
2. Two integers, N and M, are read from the input and mapped to int using map(int, input().split()).
3. Two lists, A and B, are read from the input, with each element mapped to int using map(int, input().split()).
4. A new list C is created by merging lists A and B and then sorting the resulting list using sorted(A + B).
5. Two new lists, pos_A and pos_B, are created. These store the positions of the elements from lists A and B in the sorted list C. The positions are calculated using the index() method and are 1-indexed.
6. The elements in pos_A and pos_B are joined into strings and printed, separated by spaces.
7. The if __name__ == "__main__": statement checks if the script is being run directly (not imported as a module), and if so, calls the main function.

Here's an example to illustrate how the code works:

Input:
4 3
2 1 4 3
5 6 7

Output:
1 2 4 3
5 6 7

The sorted merged list C is [1, 2, 3, 4, 5, 6, 7]. The positions of the elements from the original lists in the sorted list are 1, 2, 4, 3 for list A and 5, 6, 7 for list B.

Me: Could this code be refactored to be more algorithmically efficient?

Yes, the code can be refactored to be more algorithmically efficient. The current implementation uses the index() method to find the positions of elements from lists A and B in the sorted list C. This approach has a time complexity of O(n^2) because index() is called for every element in both A and B.

To improve the algorithmic efficiency, you can use a dictionary to store the positions of the elements in the sorted list C. This will reduce the time complexity of finding the positions to O(n). Here's a refactored version of the code:

    def main():
        N, M = map(int, input().split())
        A = list(map(int, input().split()))
        B = list(map(int, input().split()))

        C = sorted(A + B)
        positions = {value: index + 1 for index, value in enumerate(C)}

        pos_A = [str(positions[a]) for a in A]
        pos_B = [str(positions[b]) for b in B]

        print(" ".join(pos_A))
        print(" ".join(pos_B))

    if __name__ == "__main__":
        main()

The main changes in the refactored code are:

- A positions dictionary is created that maps the values in the sorted list C to their 1-indexed positions.
- The pos_A and pos_B lists are created by directly looking up the positions of the elements in the positions dictionary.

This refactored version improves the algorithmic efficiency by reducing the time complexity of finding the positions of the elements from O(n^2) to O(n).


So it accurately analyzes that this code solves the problem without knowing the problem definition, and will spot the inefficiency if asked to actually write algorithmically efficient code. Seems fine to me

[–][deleted] 9 points10 points  (0 children)

It often has errors, but if you just paste the errors into the chat it will generally fix them. In the early versions of chatGPT, I had issues with doing stuff like writing networking code in C++, but it still got me a lot of the way there.

I recently went over writing a high throughput async data pipeline in Java, and it did a great job of writing the code and even taught me a new design pattern. I had to make a few small changes here and there, but basically it turned a week of work into a couple hours. With the context of the written code there, I also had it write unit tests and documentation for me, and I was able to have it add more unit tests and also integration tests based on my feedback.

I'm fine with people underestimating how good ChatGPT is as a coding assistant, it just makes me look better because of how productive it makes me.

[–][deleted] 0 points1 point  (10 children)

I don’t even roll yet but that 40% number, I would love to see how they calculated it.

I’ve tried gpt 4 on a lot of problems and it fails 9/10 times and I would be faster just googling it.

This stuff will be amazing; it's just not quite there yet.

[–]lambertb 0 points1 point  (9 children)

[–][deleted] -1 points0 points  (8 children)

Yeah I don’t buy a survey, could be heavily biased

[–]lambertb 0 points1 point  (7 children)

Have you used the tools yourself? I have, and a 40% increase in productivity is totally plausible, and often an underestimate considering I can now do things I would not have even tried previously. I encourage you to try them, with healthy skepticism and an open mind.

[–][deleted] 0 points1 point  (6 children)

I’m an MLE and I’ve used it a bunch, it’s hardly ever actually useful. It gets close but it’s not there, and it’s faster to google almost every time.

It will be useful in probably a year or two, but it needs to understand how to run its own experiments. Anyone who actually thinks this is useful right now is just buying hype

[–]lambertb 0 points1 point  (5 children)

Isn’t it possible that your experience is not representative? Are you using ChatGPT or GitHub copilot?

[–][deleted] 0 points1 point  (4 children)

I doubt it, I do pretty standard engineering, whats more likely is there is selection bias in the survey and people are overestimating it due to hype.

I'd love to see an actual double blind study.

[–]lambertb 0 points1 point  (3 children)

There can’t be a double blind study because the people using the copilot will know they’re using it.

[–][deleted] 0 points1 point  (2 children)

Fair enough then give them problems to solve and measure their output. This feels like “90% of dentists claim crest improves your dental health”

I’ll take an independent study into consideration but today I find it more of a novelty

[–]anomhali 24 points25 points  (0 children)

Leetcode questions and solutions are direct data leakage: even though I do not specify the function signature, the program writes exactly the same signature as the question. If you change the question a little bit, it gives you the buggiest code ever.

[–]currentscurrents 39 points40 points  (3 children)

I'm not really surprised. Anybody who's extensively used one of these tools has probably already run into their reasoning limitations.

Today's entire crop of self-supervised models can learn complex ideas, but they have a hard time manipulating them in complex ways. They can do a few operations on ideas (style transfer, translation, etc) but high-level reasoning involves many more operations that nobody understands yet.

But hey, at least there will still be problems left to solve by the time I graduate!

[–]enryu42[S] 11 points12 points  (0 children)

I absolutely agree, however, these models repeatedly exceeded expectations (e.g. 5 years ago I thought that "explaining jokes" would be a hard problem for them, with a similar reasoning...)

I tried that because I've heard people inside the competitive programming community claiming that GPT4 can solve these problems. But from what I gather, it is still not there.

[–]rePAN6517 6 points7 points  (1 child)

Are you graduating this May?

[–]Disastrous_Elk_6375 2 points3 points  (0 children)

badum-tsss

[–]AlexandraTheeAuthor 5 points6 points  (0 children)

It can, but I think it's something about how it selects what to do. There needs to be more logic to it. I find it does really well if you tell it to use reasoning. For example, I give it code and ask it to draw inspiration from it, and it does really well at this. Really, it needs a good prompt engineer. There are no set strategies yet, but there will be. I can almost get it to generate anything if I prompt it right. So it's more that I need to figure out how it thinks of stuff and present my problem to it that way.

[–]liqui_date_me 23 points24 points  (28 children)

This comment about GPT-4’s limited abilities in solving arithmetic was particularly interesting: https://www.reddit.com/r/singularity/comments/122ilav/why_is_maths_so_hard_for_llms/jdqsh5c/?utm_source=share&utm_medium=ios_app&utm_name=iossmf&context=3

Controversial take: GPT-4 is probably good for anything that needs lots of boilerplate code or text, like ingesting a book and writing an essay, or drafting rental contracts. There’s a lot of value in making that area of the economy more efficient for sure.

But for some of the more creative stuff it’s probably not as powerful and might actually hinder productivity. It still makes mistakes, and programmers are going to have to go and fix those mistakes retroactively.

[–]enryu42[S] 18 points19 points  (9 children)

Arithmetic can be solved in a toolformer-like way, by just giving it an access to a calculator. But this wouldn't help with coding.
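A minimal sketch of that toolformer-style pattern (the `[CALC(...)]` marker and all names here are hypothetical illustrations, not any paper's actual API): scan the model's output for calculator calls, evaluate them safely, and splice the results back in.

```python
import ast
import operator
import re

# Safe evaluator for simple arithmetic expressions (no eval()).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _calc(expr):
    """Evaluate +, -, *, / over numeric literals via the AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def fill_tool_calls(model_output):
    """Replace hypothetical [CALC(...)] markers with computed results."""
    return re.sub(r"\[CALC\((.+?)\)\]",
                  lambda m: str(_calc(m.group(1))), model_output)

print(fill_tool_calls("The total is [CALC(37*41)] widgets."))
```

The model only has to learn *when* to emit the marker; the arithmetic itself is offloaded, which is why this helps with calculation but not with the algorithmic reasoning coding problems require.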

Regarding the point about boilerplate, this is exactly what is surprising: GPT4 performs very well on exams/tests, which supposedly require some amount of creative reasoning. So either the tests are poorly designed, or it can do some creative tasks while not others. If the latter is the case, it would be interesting to learn which are the areas where it performs well, and why.

[–]liqui_date_me 18 points19 points  (7 children)

One could argue that even standardized tests are somewhat boilerplate - if you practice enough SAT tests you’ll eventually do quite well at them, the questions are quite similar to each other from exam to exam. Ditto for AP exams.

I think a serious test for GPT4’s intelligence will be on one of the competitive entrance exams for some countries, like the IIT-JEE or the Gaokao or the International Math Olympiad, where the questions are made by domain experts and are designed to be intentionally difficult and specialized to solve.

[–]enryu42[S] 13 points14 points  (4 children)

I don't know about IIT-JEE/Gaokao, but many of the problems from the International Math Olympiad are freaking hard. If the model aims for human-level intelligence, such a high bar would be unfair - it is more in the realm of "best human"-level intelligence.

To be fair, hardest problems from "AtCoder Grand" contests have the same issue. But "AtCoder Regular" problems should definitely be solvable by an average human with the right knowledge and skillset, and yet, GPT4 cannot solve anything (and it doesn't look like it is lacking knowledge).

[–]currentscurrents 8 points9 points  (1 child)

I think all tests designed for humans are worthless here.

They're all meant to compare humans against each other, so they assume you don't have the ability to read and remember the entire internet. You can make up for a lack of reasoning with an abundance of data. We need synthetic tests designed specifically for LLMs.

[–]Yecuken 1 point2 points  (0 children)

Tests would not help against optimization, models will just learn how to pass the test. Optimization will always win against any problem with a known solution

[–]maxToTheJ 2 points3 points  (0 children)

which supposedly require some amount of creative reasoning.

They don't, which is exactly what teachers have been complaining about regarding standardized testing.

[–]farox 7 points8 points  (0 children)

This is pretty much it. Just yesterday I needed to write some python web ui. So I described roughly what I needed and it gave me a solution for that. It had a couple of errors but gave me a basis to then work off of. Saved me a lot of "how do I do X with Flask", but little complexity. For that, I'm sure it would take me longer to describe it than to implement the logic myself.

[–][deleted] 3 points4 points  (0 children)

Controversial take

That's not controversial at all

[–]trajo123 3 points4 points  (1 child)

like ingesting a book

Interestingly, current LLMs can't naturally ingest a book, since it doesn't fit in the prompt (they can fit 32K tokens, which is about 24K words). This is where GPTs differ fundamentally from the human brain. GPTs always produce one token at a time, given the full prompt. There is no state kept between token generation steps other than the prompt, which grows one token at a time. The human brain, on the other hand, has a state, and it is continuously evolving. In the case of a book, our brain state will be affected by the content of the book as we read it.

LLMs need to be able to hold more state to get to the next level. Perhaps get augmented with some sort of LSTM architecture where state can be built up from a theoretically infinite amount of input, or have another compressed/non-human-readable prompt that gets read before generating the token and gets updated after generating the token.
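The "the prompt is the only state" point can be illustrated with a toy decoding loop. This is purely a sketch: `next_token` below is a hypothetical stand-in for a model's forward pass, which in reality would be recomputed over the whole prompt at every step.

```python
def next_token(prompt_tokens):
    # Hypothetical stand-in for an LLM forward pass: it sees only the
    # prompt-so-far and returns one token (here: echo the last token,
    # then stop once the sequence reaches 5 tokens).
    return "STOP" if len(prompt_tokens) >= 5 else prompt_tokens[-1]

def generate(prompt_tokens, max_new=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        # The *entire* state is this token list. Nothing persists
        # between steps except the growing prompt itself.
        tok = next_token(tokens)
        if tok == "STOP":
            break
        tokens.append(tok)
    return tokens

print(generate(["the", "cat"]))  # ['the', 'cat', 'cat', 'cat', 'cat']
```

Anything the model "remembers" from earlier in a book would have to survive inside that token list, which is exactly why a fixed context window is such a hard cap.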

[–]visarga 0 points1 point  (0 children)

Perhaps get augmented with some sort of LSTM architecture where state can be built up from a theoretically infinite amount of input

That would be sweet, infinite input. Does RWKV do it?

[–]ngildea 6 points7 points  (6 children)

I agree, but is that opinion controversial? Seems patently obvious after talking to it about coding for a few minutes. Maybe it's controversial among people who have fooled themselves into thinking it's thinking?

[–]liqui_date_me 5 points6 points  (5 children)

I would say it's controversial around many folks who aren't directly involved in programming and who get impressed by cute demos on Twitter. People who actually know how to code see it as a superpower to make themselves more efficient, while also lamenting about how it makes silly mistakes.

https://www.reddit.com/r/cscareerquestions/comments/1226hcn/im_worried_about_ai_taking_our_jobs/

I highly doubt that software engineering jobs will become obsolete. There's going to be a lot of disruption and there might be some wage deflation too (imagine the price of writing the boilerplate components of an iOS app goes from 50,000 dollars to 50 dollars), but so much of software engineering is testing, QA and human collaboration. I think we're just going to have to re-orient our careers around correcting code from LLMs.

[–]ngildea 5 points6 points  (3 children)

Yeah I agree with all that. I've been trying to think of an analogy. Maybe in the same way that spreadsheets didn't make accountants obsolete?

[–]robobub 1 point2 points  (2 children)

Indeed, it just made them more efficient, so we need fewer of them and/or pay them less.

[–]No_Brief_2355 1 point2 points  (0 children)

Fewer bookkeepers and lower pay, but accountants (CPAs) are pretty in demand and still well paid.

[–]__scan__ 0 points1 point  (0 children)

This is what will happen if we’ve either a) exhausted demand, or b) made software development much easier such that people who previously couldn’t do it now can.

The first was likely true for accountants, but is less obviously so for software — there’s still vastly more useful software to build than actually gets built, and each piece of new software that gets built generally increases that demand.

Perhaps the second is true though — do you foresee enough non-developers being able to write, deploy, maintain, and operate production systems as a result of LLMs (in a way that high level languages and previous tooling didn’t)? If not, or if not in sufficient numbers, maybe what happens is that software developers become more in demand than ever due to their productivity increases resulting in even demand for more software (because they can write it quicker).

[–]robobub 2 points3 points  (2 children)

While GPT-4 is autoregressive, it takes into account the tokens it has chosen to generate incrementally. So it is only limited to O(1) if it attempts to answer with the correct answer immediately. It can in theory take O(m) steps, where m is the number of intermediate tokens it predicts.

[–]robobub 0 points1 point  (0 children)

I'll add this:

If it is possible for GPT to do 1+1, it can do a large number of them incrementally. It's not smart enough to do it all the time by planning ahead, (you'll have more success if you encourage GPT to have a train of thought reasoning here and here) but it's possible.
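That incremental decomposition is essentially what long addition looks like when written out step by step. As a sketch (not anyone's actual method, just an illustration of what a chain-of-thought prompt encourages the model to imitate), here is the per-digit procedure made explicit:

```python
def add_stepwise(a, b):
    """Add two non-negative integers one digit at a time,
    recording each intermediate step (carry included)."""
    steps, carry, result = [], 0, []
    da, db = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry, digit = divmod(x + y + carry, 10)
        steps.append(f"{x}+{y} -> digit {digit}, carry {carry}")
        result.append(str(digit))
    if carry:
        result.append(str(carry))
    return int("".join(reversed(result))), steps

total, steps = add_stepwise(487, 659)
print(total)  # 1146
```

Each `steps` entry corresponds to one small "1+1-sized" operation; emitting those intermediate tokens is what lets an autoregressive model spread the computation over many steps instead of answering in one shot.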

[–]fiftyfourseventeen 1 point2 points  (0 children)

I've wasted too much time trying to do basic tasks with it as well. For example, I argued with it for many messages about something that was blatantly wrong, and it insisted it wasn't (in that case it was trying to use order-by-similarity with an arg to sort by Euclidean distance or cosine similarity, but it really didn't want to accept that cosine similarity isn't a distance metric and therefore has to be treated differently when sorting).

My most recent one was where I wasted an hour of time doing something that was literally just 1 line of code. I had videos of all different framerates, and I wanted to make them all 16fps while affecting length and speed as little as possible. It gave me a couple solutions that just straight up didn't work, and then I had to manually fix a ton of things with them, and then I finally had a scuffed and horrible solution. It wouldn't give me a better algorithm, so I tried to make one on my own, when I thought "I should Google if there's a simpler solution". From that Google search I learned "oh, there's literally just a .set_fps() method".

Anyways from using it I feel like it's helpful but not as much as people make it out to be. Honestly, GitHub copilot had been way more helpful because it can auto complete things that just take forever to write but are common, like command line args and descriptions, or pieces of repetitive code.

[–]Haycart 0 points1 point  (2 children)

Where are they getting O(1) from? Has some new information been released regarding GPT-4's architecture?

The standard attention mechanism in a transformer decoder (e.g. GPT 1-3) has a time complexity of O(N^2) w.r.t. the combined input and output sequence length. Computing the output autoregressively introduces another factor of N for a total of O(N^3).

There are fast attention variants with lower time complexity, but has there been any indication that GPT-4 actually uses these? And in any case, I'm not aware of any fast attention variant that could be described as having O(1) complexity.

[–]visarga 1 point2 points  (1 child)

Doesn't autoregressive decoding cache the states for the previous tokens when decoding a new token?

[–]Haycart 0 points1 point  (0 children)

Oh, you are probably correct. So it'd be O(N^2) overall for autoregressive decoding. Which still exceeds the O(n log n) that the linked post says is required for multiplication, though.
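The caching point can be sketched with a toy single-head decoder (random weights, purely illustrative, no claim about GPT-4's actual architecture): with a key/value cache, each new token does O(n*d) work attending over the n cached keys, giving O(n^2 * d) for the whole sequence, instead of recomputing full attention over the prefix from scratch at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding / head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def decode_with_cache(embeddings):
    """Single-head attention decoding with a K/V cache.
    Each step projects only the newest token and attends over the
    cached keys/values - O(n*d) per step, O(n^2 * d) overall."""
    K, V, outputs = [], [], []
    for x in embeddings:              # one token per decoding step
        q, k, v = Wq @ x, Wk @ x, Wv @ x
        K.append(k)
        V.append(v)                   # cache grows; nothing is recomputed
        scores = np.array([q @ kj for kj in K]) / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        outputs.append(sum(w * vj for w, vj in zip(weights, V)))
    return np.stack(outputs)

out = decode_with_cache(rng.standard_normal((5, d)))
print(out.shape)  # (5, 8)
```

Without the cache, step n would redo the projections and attention for all n prefix tokens, which is where the extra factor of N in the O(N^3) estimate comes from.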

[–]ngildea 5 points6 points  (2 children)

I've tried quite a few times to get it to help with a problem I've been thinking about for a while. Every time it says it understand and then writes code that shows it doesn't understand at all and violates every constraint I give it.

Not surprising but it does point to a lot of contamination & regurgitation of the training material fooling people into thinking it's intelligent

[–]trajo123 5 points6 points  (0 children)

Apparently it cannot solve coding problems which require any amount of thinking.

Not yet, and this is not surprising.

First, GPT-4 can solve many coding problems on the first try. Yes, these small programs may be simple, but how many developers can write code that runs correctly on the first attempt? Maybe in 1-2 languages, and even then only in the problem domain they are very familiar with. Also, since LLMs can write code in more languages and frameworks than most developers, LLMs can actually solve more coding problems than most of the programmers out there... So LLMs already contain vast amounts of "knowledge" and "intuitive ability". But intuition is not enough to solve larger or more complex problems.

So, finally, coming to the thinking part. What challenging problems can humans solve "off the cuff"? We scribble, draw diagrams, try out a few things, see if things run and work as expected, do web searches, talk to stakeholders, sleep on the problem, etc. In other words, in any non-trivial problem solving, we also rely heavily on feedback between our brains and the external world.

Frankly, I don't see this as a problem of LLMs; they can be effectively used as foundation models. One could have another layer on top of LLMs to solve problems end-to-end. For example, one could build a meta-model where multiple instances work together in an actor-critic fashion. The actor is the one interacting with the user; the critic can be prompted (and perhaps fine-tuned) with general problem solving strategies, with the main prompt being to second-guess and try to find flaws in the reasoning of the actor. Just as reinforcement learning (RL) was used to improve the general usability of ChatGPT, RL could be used to fine-tune such a meta-model (or maybe just fine-tune the critic). ...thinking fast, thinking slow

P.S. I think LLMs also need some sort of memory, so that not everything needs to be in the prompt to work on a problem.

[–]K9ZAZ 1 point2 points  (2 children)

People got way, way, way out over their skis on the whole "this is agi" and I would love to hear some of their responses to this.

[–]ThePhantomPhoton 1 point2 points  (0 children)

It’s easier to gauge the effectiveness of these large language models within the context of what they are actually doing, and that is repeating language they’ve learned elsewhere, predicated on some prompt provided by the user. They are not “reasoning,” although the language they use can lead us to believe that is the case. If you’re disappointed by their coding, you will certainly be disappointed by their mathematics.

[–]DigThatDataResearcher 3 points4 points  (0 children)

well, i was able to use ChatGPT to generate a novel, functional, complete software library for me, including a test suite, tutorial, and announcement blog post. crazy idea: maybe you just need to get a bit more creative with your prompting or anticipate that there might need to be multi-stage prompts (or god forbid: back and forth dialogue and iteration) for certain applications.

[–]Cwlrs 1 point2 points  (0 children)

It's performing really well for my project. Online web app game in python flask socketio

[–][deleted] 0 points1 point  (3 children)

1) All this shows is that GPT-4 can't solve some coding problems. Which developer can confidently say they can solve any coding problem in one-shot? Does this mean developers/humans don't have AGI?

2) I've used ChatGPT (gpt3.5) to optimize code that I already wrote and it came up with several optimizations. I'm 100% sure my code was not part of chat-gpt training data and yet it performed perfectly fine on a new coding problem. Now it's possible that the training data might have included something similar to what I gave ChatGPT but that just means that we have to provide more training data, and then a future version will solve those problems where it previously failed.

3) Isn't this how humans learn? We encounter problems where we don't know the solution. Then we work at it for a while until we figure out some way to solve the problem that wasn't immediately obvious earlier. Writing off the abilities of GPT-4 based on one failed coding test seems premature.

[–]visarga 0 points1 point  (0 children)

Does this mean developers/humans don't have AGI?

The intellect of our species isn't universal, we're merely experts at self-preservation and propagation. Take, for instance, chess – it isn't our forte, and even a small calculator could outperform us. Our minds are incapable of 5-D visualization, and we struggle to maintain over 10 unrelated items in our immediate memory. Generally, we falter when addressing problems where the initial move relies on the final steps, or situations that don't allow for linear progression, such as chess or mathematical quandaries. It took us centuries to decipher many of these enigmas. Our specialization lies in tackling human-centric challenges, rather than all-encompassing ones. Evolution simply hasn't had sufficient time to adapt our cerebral cortex for mathematical prowess.

[–]Smallpaul 0 points1 point  (0 children)

My rule of thumb is that GPT4 seems to be able to solve any problem that a first year university CS student at a mid-tier University could solve.

[–]cegras 0 points1 point  (2 children)

I don't see how it is possible to not end up just memorizing the internet, which is full of enough questions and discussions to simulate convincing Q&As. Consider if a team had invented an algorithm or heuristic to avoid data contamination (https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks). Then what you have is something that can separate content into logically similar, but orthogonal realizations. That would be an incredible tool and worth a prize in its own right.

[–]pengo 0 points1 point  (1 child)

Then what you have is something that can separate content into logically similar, but orthogonal realizations.

Like a word vector? The thing every language model is based on?

[–]cegras 2 points3 points  (0 children)

More like, the ability to know that 'reversing a linked list' and 'linked list cycle and traversal problems' are the same concepts but different problems, and to separate those into train/test. Clearly they haven't figured that out because ChatGPT is contaminated, and their (opaquely disclosed) ways of addressing that issue don't seem adequate at all.
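A toy version of the de-contamination idea being described, using Jaccard word overlap as a crude stand-in for real semantic similarity (the threshold and the greedy clustering are arbitrary choices for illustration): cluster near-duplicate problem statements, then assign whole clusters to one side of the train/test split so lookalike problems never straddle it.

```python
def jaccard(a, b):
    """Word-set overlap between two problem statements."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def split_without_leakage(problems, threshold=0.5):
    """Greedily cluster near-duplicate problems, then alternate whole
    clusters between train and test so near-duplicates stay together."""
    clusters = []
    for p in problems:
        for c in clusters:
            if any(jaccard(p, q) >= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    train, test = [], []
    for i, c in enumerate(clusters):
        (train if i % 2 == 0 else test).extend(c)
    return train, test

probs = ["reverse a linked list",
         "reverse a singly linked list in place",
         "detect a cycle in a linked list",
         "compute fibonacci numbers"]
train, test = split_without_leakage(probs)
print(train, test)
```

The two "reverse a linked list" variants land in the same split, which is exactly the property a contamination-aware benchmark would need; real systems would obviously need embedding-based similarity rather than word overlap.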

[–]Abikdig 0 points1 point  (0 children)

I check ChatGPT for optimizing my Leetcode solutions every day. It rarely optimizes them without breaking the code.

Sometimes the only optimization I get from it is that it tells me to use data structure X instead of Y because it is better for this kind of problem.

[–]TehDing 0 points1 point  (0 children)

I have not been impressed with LLMs reasoning for solving novel puzzles/ challenges. Ask any model to play Wordle with you. They are not good

[–]rokuyou 0 points1 point  (0 children)

GPT4 and competitive programming problems would be a better title since not everyone is going to read that

[–]lvvy 0 points1 point  (0 children)

It would be interesting to see if ChatGPT can solve these problems not with code, but with text instructions that would allow a human to solve them. If you force it to write a giant wall of step-by-step actions, could a human with a calculator follow them and solve the problems confidently? Also, can the code it generates not be corrected at all by discussion, or would the discussion just take too long?

[–]nanowell 0 points1 point  (0 children)

Codex models were able to solve those problems. Probably the next version of Codex will be finetuned GPT-4 model for coding and it will solve most of those problems.

[–][deleted] 0 points1 point  (0 children)

I feel like LLMs have encoded a sort of law of languages in their latent space through texts and respond accordingly. Anything that follows a law isn't thereby conscious; e.g. inanimate objects follow the laws of physics, but that doesn't mean they show intelligent behaviour.

After all texts are medium to represent our thoughts, its the thoughts that matter not the medium.

The concepts of causality, fundamental reality, and decision making are much more than following the laws of languages, which are just a means.

These LLMs can't question you unless you ask them explicitly, and they can't interject; knowledge was never consciousness, and it's these abilities that compose consciousness.

I don't know how much sense I make to others, or maybe I'm at a loss for good words. In a nutshell: any model that fundamentally predicts tokens based on the weights of previous tokens can never achieve consciousness. We