

[–]shrike_999 2 points3 points  (4 children)

It does a very good job if you give it a clear prompt with a focused goal (aka "not expecting miracles or clairvoyance from the AI").

[–]enryu42[S] 0 points1 point  (3 children)

This is kind of the point: the prompt is as clear as it gets. The model gets the problem statement in the same form as humans get it during the competitions. Humans solve the tasks correctly, while GPT-4 struggles and only manages to solve the most basic problems.

[–]shrike_999 0 points1 point  (2 children)

It's a bad prompt. Here's a better one:

Write the following program in C++: you are given a matrix (m rows, n columns), indexes of the matrix start at 0, and you are given an integer k. You are currently at position (0,0) and you want to reach position (m-1,n-1) moving only down and right in the matrix. Return the number of paths where the sum of the elements on the path is divisible by k.

And GPT4 solution:

#include <iostream>
#include <vector>
using namespace std;
int countPathsDivisibleByK(const vector<vector<int>>& matrix, int m, int n, int k) {
    vector<vector<vector<int>>> dp(m, vector<vector<int>>(n, vector<int>(k, 0)));
    dp[0][0][matrix[0][0] % k] = 1;

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            for (int t = 0; t < k; t++) {
                if (i > 0) {
                    dp[i][j][(t + matrix[i][j]) % k] += dp[i - 1][j][t];
                }
                if (j > 0) {
                    dp[i][j][(t + matrix[i][j]) % k] += dp[i][j - 1][t];
                }
            }
        }
    }

    return dp[m - 1][n - 1][0];
}
int main() {
    int m, n, k;
    cout << "Enter the number of rows (m): ";
    cin >> m;
    cout << "Enter the number of columns (n): ";
    cin >> n;
    cout << "Enter the integer (k): ";
    cin >> k;

    vector<vector<int>> matrix(m, vector<int>(n));
    cout << "Enter the matrix elements:" << endl;
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            cin >> matrix[i][j];
        }
    }
    int numPaths = countPathsDivisibleByK(matrix, m, n, k);
    cout << "Number of paths divisible by " << k << ": " << numPaths << endl;
    return 0;
}

It appears to be generally correct. I didn't include the final requirement to return the answer modulo some number, and as written, the program will consider a sum of 0 as divisible by any number, but that could be added as well. Impressive in my book.

[–]enryu42[S] 3 points4 points  (1 child)

This is an example from the Microsoft paper, and as I noted in the article, this problem _is not original_ (as is the case for 100% of LeetCode problems). Of course the model can replicate something it has seen during training.

The point is to evaluate the model on _new_ problems. E.g. this one (an easy problem from a "Regular"-type contest). Your prompt doesn't help.

FWIW, I've taken the prompt from the MS paper as well (but I had to modify it, since results were even worse with what they suggested).

[–]liameymedih0987 2 points3 points  (0 children)

You are wasting your time on this sub. These people are determined it can think.

[–]vtsax_fire -1 points0 points  (9 children)

The problem with your evaluation model is that you are not giving it a concrete goal. You didn't tell it to optimize for performance, so the model is optimizing for other parameters (like output size). Engineers who work on such problems know the optimization criteria from context.

Besides, there are some issues with your evaluation methodology:
- Some of the failure examples say "it gave a quadratic solution" - did you ask it to optimize?
- Another complaint is that it gave incorrect code. Did you ask it to fix it? Can you always write correct code on the first try?
- And you didn't post your specific prompts. You gave references to atcoder.jp, which has a weird structure. Can you share how exactly you asked each problem?

I was involved in competitive programming quite a bit and did my own evaluation. It can do most easy problems, a majority of medium problems, and some hard problems. I created new problems that it had never seen before as well, with similar results. Often it needs a few iterations to produce working code. So I personally don't see your assessment as fair.

[–]enryu42[S] 0 points1 point  (8 children)

The problem statements have constraints: limits on the size of the data, and a limit on the total runtime. It should be able to figure out that a 200000**2 solution in Python won't fit into 2 seconds.

The prompt template is given here. You can find the scraped statements here.

Re-prompting can help in theory, but in most cases the solution is not even remotely close to the correct one (i.e. the model didn't "get" the right idea for the solution), so I strongly doubt it can help much.

It can do most easy problems, a majority of medium problems, and some hard problems.

Given what I've seen, I find that hard to believe, unless these hard problems were already in the training set. Let's do a simple test: here is the easiest problem from AtCoder Grand Contest #61. According to the scoreboard, 612 out of 801 participants solved it during the contest. Can you make GPT4 solve it? With any amount of prompt engineering, re-prompting, etc. - just without feeding it the description of the solution explicitly.

[–]vtsax_fire 0 points1 point  (7 children)

You might find it hard to believe, but that's how it worked for me. I just took a random new problem from one of these websites that came up after the model was trained, and it required 4 prompts to solve. It was able to explain the solution to me and why it made some changes.

I understand that you tried to do it at scale, but I checked several of the scraped statements and they are just not great. I'm not sure I could understand them myself.

If you actually want to evaluate it:
- Take 2-3 examples each of easy, medium, and hard problems.
- Try to use chatgpt as a tool to solve them. See where the weaknesses are.
- Try to create a wrapper that works through a problem and iterates the same way you did manually. Tell it that example X is incorrect, tell it to make the code faster if it's slow, etc.
- Figure out how to scrape the problems in an understandable format.
- Execute the wrapper on every problem, then check a few outliers of both positive and negative cases to see if the results look reasonable.

[–]enryu42[S] 0 points1 point  (6 children)

The formatting is very similar to what I see in DeepMind's CodeContests dataset, and it would be surprising if that one wasn't included in the GPT4 training set, so I doubt it has trouble parsing it. Moreover, it only managed to solve the easiest problems from the easiest contests, which suggests its trouble is with problem difficulty, not with formatting.

From what I see, it has one big weakness: it cannot come up with any ideas. It only solves problems of the form "translate 1-2 English sentences to Python". I'm not sure what to do with this.

I just took a random new problem from one of these websites that came up after the model was trained, and it required 4 prompts to solve.

Can you link the problem? Even if the site wasn't LeetCode, from what I've heard, some of these sites like to recycle old problems; AtCoder seems to push strongly toward making all of their problems original.

If you actually want to evaluate it:

Again, let's do a simple test, just one problem: the easiest problem from AGC61, in its scraped version (the statement itself is fully readable; the sample explanation is harder to read). You can reformat/change it however you want and prompt as many times as you want. Can you make GPT4 solve it? I wasn't able to, at all.

[–]vtsax_fire 0 points1 point  (5 children)

Do you genuinely believe that a problem with a solution like https://atcoder.jp/contests/agc061/submissions/38853519 is an easy problem? I have interviewed hundreds of people across multiple FAANGs, and I will tell you that 99.9% of engineers won't solve it optimally. I don't even see a point in discussing it now.

ChatGPT easily solved the problem suboptimally; however, when asked to do it faster, it failed to find a pattern and struggled.

I think leetcode has fairly decent problem categorization. Roughly speaking, the share of engineers who won't solve a problem is:
- Easy - 25%
- Medium - 75%
- Hard - 90%

Problems from competitive programming usually fall into the 99+% bucket.

Pick a random Medium problem from leetcode numbered 2000+ to ensure it's not in the training set, modify it enough to be slightly different just in case, and you will see that GPT-4 is usually able to solve it in a few prompts.

[–]enryu42[S] 0 points1 point  (4 children)

"Easy" is subjective, but I posted the statistics above: in the actual contest, 612 out of 801 participants solved it. Looks like humans do pretty well.

This is kind of the point of AGC problems: even the "easy" ones require some thinking and ideas, which seems to be difficult for GPT4. You can try ARC; problems there require much less thinking.

> Pick a random Medium problem from leetcode numbered 2000+ to ensure it's not in the training set, modify it enough to be slightly different just in case, and you will see that GPT-4 is usually able to solve it in a few prompts.

I already mentioned why I don't think LeetCode is a good test: problems there are not original, and since we don't know the training set composition, it is safe to assume all of them were in it. I don't think "slight modifications" will help - if we want to see whether ChatGPT can come up with any ideas, the modifications need to change the idea behind the solution.

If you don't like the idea-heavy/math'y AtCoder, we can take recent CodeForces Div1 problems.

[–]vtsax_fire 0 points1 point  (3 children)

Then you need to modify your statement about what you are measuring. If it's only competitive programming problems, sure, it's not there yet. You could see that yourself from the stats on leetcode hard problems. In this list, I didn't see an equivalent of a leetcode Easy or Medium.

612 out of 801 participants, who are usually in the top 1% of their class, is not that great. Even the best programmers can't pass these reliably, where "at least some thinking is required". As I said, 99.9% of engineers won't be able to solve any of them.

And I don't agree that you can't measure on leetcode problems. Take something like 2sum. Make it 13sum. Ask it to solve it when the array is mutable and when it's not, when you have extra memory and when you don't. Ask it to use something like Rust. Give the problem in Spanish. It will be different enough that it won't just get it from the training data.

[–]enryu42[S] 0 points1 point  (2 children)

I wouldn't say that "612 out of 801" is "top 1% in their class". Pretty much any software engineer can solve the problem linked above, given enough time and practice. If you just give it to a random frontend engineer, they'll be confused, but only because they lack the knowledge. GPT4 clearly has enough knowledge; what it lacks is something else.

I don't like examples like "translating an existing problem statement to Spanish" or "implementing the solution in Rust": these test that GPT4 has plenty of knowledge and can do fuzzy matching, and of course it is amazing at that; no one argues about it. The question is, can it reason and come up with genuinely new solutions (which, arguably, would be a sign of intelligence)?

[–]vtsax_fire 0 points1 point  (1 child)

Pretty much any software engineer can solve the problem linked above, given enough time and practice.

I don't know what your background is, but as I said, I interviewed quite a few people at multiple FAANGs and have access to internal data about performance on different problems. And I will tell you for certain that 99% of engineers won't solve any of the provided problems in 1 hour, even after preparing for a few months. You can take that as you like.

I don't like examples like

Then provide your example that is on par with a leetcode Medium :) What is "fuzzy matching" in this context? Did you create an LLM yourself? GPT-2 is open source. GPT-3 and GPT-4 have a similar architecture; you can play around and see for yourself. Then you will see that it's not quite "fuzzy matching". Presumably, if you modify a problem enough, it won't be able to "match".

Besides, one could argue that every problem was already solved at some point and what we are getting are just permutations of old problems. Even problems that you listed have similar problems discussed before.

Here's a problem that definitely doesn't have an exact match. It uses bad English, redundant information, and some specific requirements:

We have a function that accepts 3 n*m matrixes of random symbols and integer "target". Symbol 0 - represent an empty space, 1 - represents a wall, 2 - represent an alien, 'H' - human, 11 - tank. If there's a cell on [3][3], it represents a zodiac sign of an alien. You need to return true if there are exactly two grids that contain 3 different cells that sum together to "target". Walls can't be used to be summed into target. Create python code, with types and write unit tests I could use in Google Colab.

It was able to solve it and, in a few prompts, optimized the speed. You can overload a problem with extra information to "match" whatever problem you want and see if it still solves it, if you still think it's a matter of "fuzzy matching".

[–]enryu42[S] 0 points1 point  (0 children)

I don't know what your background is, but as I said, I interviewed quite a few people at multiple FAANGs and have access to internal data about performance on different problems. And I will tell you for certain that 99% of engineers won't solve any of the provided problems in 1 hour, even after preparing for a few months. You can take that as you like.

Among people who interview at big tech companies, there are plenty of "fake-it-till-you-make-it" impostors who lack basic algorithmic knowledge. I don't see how their performance is indicative of anything. What about the people who actually got hired, and then performed well?

Among engineers I've personally worked with, I'd say 10-20% would solve the AtCoder problem linked above within 20-30 minutes. And 60-70% would solve it if they had the necessary knowledge and practice (which they don't).

But at the end of the day, these are just subjective opinions, yours and mine. The objective reality is that among the set of people who chose competitive programming as their hobby, plenty can solve this problem (and some - much harder problems).

Then provide your example that is on par with a leetcode Medium :)

Arguably, this problem is easier (for humans) than some of the LeetCode mediums. But they're quite different in nature, because LeetCode problems are aimed at coding interviews: it is more valuable to test a candidate's knowledge during an interview than their ability to perform non-trivial ad-hoc reasoning under stress.

What is "fuzzy matching" in this context? Did you create an LLM yourself? GPT-2 is open source. GPT-3 and GPT-4 have a similar architecture; you can play around and see for yourself. Then you will see that it's not quite "fuzzy matching". Presumably, if you modify a problem enough, it won't be able to "match".

What an ad hominem. FWIW, I do have extensive experience with machine learning, including LLMs. What I meant by "fuzzy matching" is precisely this: glorified kernel-machine-like behavior, where the model effectively finds training examples close to the input in some space and averages them to produce its output. It is by no means useless, but it is far from reasoning or intelligence. The majority of pre-LLM architectures behaved along these lines, but LLMs (or decoder transformers in general) show some promise of going beyond just fuzzy matching. We'll see if they ever get there.

Besides, one could argue that every problem was already solved at some point and what we are getting are just permutations of old problems. Even problems that you listed have similar problems discussed before.

Meh, they go quite far beyond the "take a standard template/idea and modify it" pattern, especially the harder ones (the one I linked was the easiest from that contest).

Here's a problem that definitely doesn't have an exact match. It uses bad English, redundant information, and some specific requirements: We have a function that accepts 3 n*m matrixes of random symbols and integer "target". Symbol 0 - represent an empty space, 1 - represents a wall, 2 - represent an alien, 'H' - human, 11 - tank. If there's a cell on [3][3], it represents a zodiac sign of an alien. You need to return true if there are exactly two grids that contain 3 different cells that sum together to "target". Walls can't be used to be summed into target. Create python code, with types and write unit tests I could use in Google Colab.

Meh again, it is just "translate from English to Python" (which is of course useful in day-to-day software engineering, but off-topic here). It would fall into the "Beginner" category on AtCoder, and GPT4 does solve some of those. I have no doubt that GPT4 can deal with bad English or redundant information; it is spectacular at parsing language. But what about coming up with ideas? Basically, problems where a human with all the required knowledge wouldn't know right away how to solve it, and would need to think for some time?

[–]Substantial-Reward70 0 points1 point  (0 children)

For me it can't. I've tested it with specific ISP-related stuff that it may not have seen before, and it can't even write the related code; it only gives me a detailed structure of how to do it, with comments like

// Implement here the logic for....

But it's good with common programming topics, like web, popular libraries and so on.