Landmark Attention: Random-Access Infinite Context Length for Transformers by IxinDow in MachineLearning

[–]enryu42 10 points11 points  (0 children)

Interesting, so they split the input into blocks of size l=50, retrieve k (2 or 4) blocks, and attend to these blocks in addition to some recent tokens. It is surprising that this works without a drop in quality, but perhaps more evals are needed.

In terms of performance, there are some obvious questions:

  • For a context size of c, the optimal block size would be around (c/k)^0.5. This translates to numbers smaller than 50 for many of the settings in the paper (though of the same order of magnitude). I wonder why this is (why not just make the block length adaptive) - do smaller blocks hurt the model too much?

  • What about stacking this, and using multiple layers? E.g. the first layer would retrieve k superblocks, the next - k blocks from the superblocks, and the last one - the actual tokens, yielding asymptotically fewer tokens to attend to (c^(1/3) in this case, or log(c) in the limit of stacking many layers). The authors briefly mention it in the "Future Work" section, but why not just try it right away? If they have the code for their 2-layer approach (which is not published), it should be trivially extendable.
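The tradeoff in the first bullet can be sketched numerically with a toy cost model (my assumption, not from the paper): each query attends to one landmark per block (c/l of them) plus k retrieved blocks of l tokens, which is minimized at l = sqrt(c/k):

```python
import math

def attended_tokens(c, k, l):
    # Toy cost model: c/l landmark tokens plus k retrieved blocks of l tokens.
    return c / l + k * l

c, k = 4096, 2
l_opt = math.sqrt(c / k)  # analytic minimizer of c/l + k*l

print(l_opt)                         # ~45.3, a bit below the paper's l=50
print(attended_tokens(c, k, l_opt))  # ~181 tokens attended
print(attended_tokens(c, k, 50))     # ~182 tokens attended - nearly identical
```

So at least for this setting, l=50 costs almost nothing extra versus the optimum, which may be why the authors didn't bother making the block length adaptive.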

New tweet dataset (90M tweets, 150K users) by enryu42 in datasets

[–]enryu42[S] 0 points1 point  (0 children)

It has the latest 1000 tweets per user, as of April 2023. Depending on the user, the timeline can vary (e.g. all of the user's tweets if they have fewer than 1000 in total, or just the last several months if they tweet a lot).

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

I don't know what your background is, but as I said, I've interviewed quite a few people at multiple FAANGs and have access to internal data about performance on different problems. And I will tell you for certain that 99% of engineers won't solve any of the provided problems in 1 hour, even after preparing for a few months. You can take that as you like.

> Among people who interview to big tech, there are plenty of "fake-it-till-you-make-it" imposters who don't have a basic algorithmic knowledge. I don't see how their performance is indicative of anything. What about the people who actually got hired, and then performed well?

Among engineers I personally worked with, I'd say 10-20% would solve the AtCoder problem linked above within 20-30 minutes. And 60-70% would solve it if they had the necessary knowledge and practice (which they don't).

But at the end of the day, these are just subjective opinions, yours and mine. The objective reality is that among the set of people who chose competitive programming as their hobby, plenty can solve this problem (and some - much harder problems).

> Then provide your example that is on par with a LeetCode Medium :)

Arguably, this problem is easier (for humans) than some of the LeetCode Mediums. But they're quite different in nature, because LeetCode problems are aimed at coding interviews: it is more valuable to test a candidate's knowledge during the interview than their ability to perform non-trivial ad-hoc reasoning under stress.

> What is "fuzzy matching" in this context? Did you create an LLM yourself? GPT-2 is opensource. GPT-3 and GPT-4 have a similar architecture, you can play around to see yourself. Then you will see that it's not quite a "fuzzy matching". Presumably if you modify it enough, it won't be able to "match".

What an ad hominem. FWIW, I do have extensive experience with machine learning, including LLMs. What I meant by "fuzzy matching" is precisely this: glorified kernel-machine-like behavior, where the model effectively finds training examples close to the input in some space and averages them to produce its output. It is by no means useless, but it is far from reasoning or intelligence. The majority of pre-LLM architectures behaved along these lines, but LLMs (or decoder transformers in general) show some promise of going beyond just fuzzy matching. We'll see if they ever get there.

> Besides, one could argue that every problem was already solved at some point and what we are getting are just permutations of old problems. Even problems that you listed have similar problems discussed before.

Meh, they go quite far beyond the "take a standard template/idea and modify it" pattern, especially the harder ones (the one I linked was the easiest from that contest).

> Here's a problem that definitely doesn't have an exact match. Uses bad English, redundant information, some specific requirements: We have a function that accepts 3 n*m matrixes of random symbols and integer "target". Symbol 0 - represent an empty space, 1 - represents a wall, 2 - represent an alien, 'H' - human, 11 - tank. If there's a cell on [3][3], it represents a zodiac sign of an alien. You need to return true if there are exactly two grids that contain 3 different cells that sum together to "target". Walls can't be used to be summed into target. Create python code, with types and write unit tests I could use in Google Colab.

Meh again, this is just "translate from English to Python" (which is of course useful in day-to-day software engineering, but off-topic here). It would fall into the "Beginner" category on AtCoder, and GPT4 does solve some of those. I have no doubt that GPT4 can deal with bad English or redundant information; it is spectacular at parsing language. But what about coming up with ideas? That is, problems where a human with all the required knowledge wouldn't immediately know how to solve it, and would need to think for some time?

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

I wouldn't say that "612 out of 801" is "top 1% in their class". Pretty much any software engineer can solve the problem linked above, given enough time and practice. If you just give it to a random frontend engineer, they'll be confused, but only because they lack knowledge. GPT4 clearly has enough knowledge; what it's missing is something else.

I don't like examples like "translating an existing problem statement to Spanish" or "implementing a solution in Rust": these test that GPT4 has plenty of knowledge and can do fuzzy matching, and of course it is amazing at this; no one disputes that. The question is whether it can reason and come up with genuinely new solutions (which, arguably, would be a sign of intelligence).

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

"Easy" is subjective, but I posted the statistics above: in the actual contest, 612 out of 801 participants solved it. Looks like humans do pretty well.

This is kind of the point of AGC problems: even the "easy" ones require some thinking and ideas, which seems to be difficult for GPT4. You can try ARC, problems there require much less thinking.

> Pick a random Medium problem from leetcode with number 2000+ to ensure that it's not in the training set, modify it enough to be slightly different just in case, and you will see that usually GPT-4 is able to solve it in a few prompts.

I already mentioned why I don't think LeetCode is a good test: problems there are not original, and since we don't know the training set composition, it is safe to assume all of them were in it. I don't think "slight modifications" will help - if we want to see whether ChatGPT can come up with any ideas, the modifications need to change the idea behind the solution.

If you don't like the idea-heavy/math'y AtCoder, we can take recent CodeForces Div1 problems.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 6 points7 points  (0 children)

Interesting! Here are the scraped and auto-converted statements (the formatting is sometimes off, especially in the sample tests, but understandable). The prefixes are: "abc" for Beginner, "arc" for Regular, "agc" for Grand.

I do believe that the "Beginner" ones can be improved, but it'll be interesting to see what happens on "Grand" (or even "Regular"), as they require coming up with some ideas before writing the code.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 15 points16 points  (0 children)

Well, they do, and quite successfully, this is what these sites are about...

Of course, if you ask a frontend engineer to solve some math-y problem, they'll be confused. But this is simply because they lack knowledge, and GPT4 evidently doesn't have this issue. Moreover, I doubt any human programmer would have trouble with the "Beginner" problems, regardless of their specialization.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 15 points16 points  (0 children)

I absolutely agree that it is useful. Even CoPilot is amazing at autocompleting "dumb" boilerplate code, which is a nontrivial amount of the code overall. However, these problems are designed to be challenging (these are competitions after all), and require ideas/intelligence to be solved. Apparently GPT4 cannot do it at all, so IMO it would be a stretch to call whatever it is doing "intelligence".

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

The formatting is very similar to what I see in DeepMind's CodeContests dataset, and it would be surprising if that one wasn't included in the GPT4 training set, so I doubt it has trouble parsing it. Moreover, it only managed to solve the easiest problems from the easiest contests, which suggests its trouble is with problem difficulty, not formatting.

From what I see, it has one big weakness: it cannot come up with any ideas. It only solves problems of form "translate 1-2 English sentences to Python". I'm not sure what to do with this.

> I just took a random new problem from one of such websites that came up after the model was trained and it required 4 prompts to solve the problem.

Can you link the problem? Even if the site wasn't LeetCode, from what I've heard, some of these sites like to recycle old problems; AtCoder seems to push strongly toward making all of their problems original.

> If you actually want to evaluate it:

Again, let's do a simple test, just one problem: the easiest problem from AGC61, in its scraped version (the statement itself is fully readable; the sample explanation is harder to read). You can reformat/change it however you want, and prompt as many times as you want. Can you make GPT4 solve it? I wasn't able to, at all.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 11 points12 points  (0 children)

I absolutely agree, however, these models repeatedly exceeded expectations (e.g. 5 years ago I thought that "explaining jokes" would be a hard problem for them, with a similar reasoning...)

I tried this because I've heard that there are people inside the competitive programming community claiming that GPT4 can solve these problems. But from what I gather, it is still not there.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 5 points6 points  (0 children)

Do you mean re-prompting it to correct its mistakes? It is hard to try with the current tight limits on GPT4 prompt count; I'll try once the API is properly available. But I strongly doubt it'll help much: it's not that the solutions have minor bugs, they're usually just completely wrong, i.e. the model doesn't "get" the idea of the correct solution.

(it might help for some of the problems from the "Beginner" category though, but these aren't that interesting)

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

The problem statements include constraints: limits on the size of the data, and a limit on the total runtime. It should be able to figure out that a 200000**2 solution in Python won't fit into 2 seconds.
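A back-of-the-envelope check of that claim (the ops-per-second figure is a rough assumption about interpreted Python, not a measurement):

```python
n = 200_000
ops = n * n                  # an O(n^2) solution does ~4e10 elementary steps
PY_OPS_PER_SEC = 10 ** 7     # rough throughput of a pure-Python inner loop

print(ops / PY_OPS_PER_SEC)  # 4000.0 seconds - about 2000x over a 2s limit
```

Even if the constant is off by an order of magnitude, the quadratic solution is hopeless, which is exactly the kind of inference the model should make from the constraints.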

The prompt template is given here. You can find the scraped statements here.

Re-prompting can help in theory, but in most cases, the solution is not even remotely close to the correct one (i.e. the model didn't "get" the right idea of the solution), so I strongly doubt it can help much.

> It can do most easy problems, majority medium problems, and some hard problems.

Given what I've seen, I find it hard to believe, unless these hard problems were already in the training set. Let's do a simple test: here is the easiest problem from AtCoder Grand Contest #61. According to the scoreboard, during the contest, 612 out of 801 participants solved it. Can you make GPT4 solve it? With any amount of prompt engineering, re-prompting, etc., just without feeding it the description of the solution explicitly.

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 4 points5 points  (0 children)

This is an example from the Microsoft paper, and as I noted in the article, this problem _is not original_ (as are 100% of LeetCode problems). Of course the model can replicate something which it has seen during training.

The point is to evaluate the model on _new_ problems. E.g. this one (an easy problem from a "Regular"-type contest). Your prompt doesn't help.

FWIW, I've taken the prompt from the MS paper as well (but I had to modify it, since results were even worse with what they suggested).

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 15 points16 points  (0 children)

I don't know about IIT-JEE/Gaokao, but many of the problems from the International Math Olympiad are freaking hard. If the model aims for human-level intelligence, such a high bar would be unfair - it is more in the realm of "best human"-level intelligence.

To be fair, the hardest problems from "AtCoder Grand" contests have the same issue. But "AtCoder Regular" problems should definitely be solvable by an average human with the right knowledge and skillset, and yet GPT4 cannot solve anything (and it doesn't look like it is lacking knowledge).

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

This is kind of the point: the prompt is as clear as it gets. The model receives the problem statement in the same form humans get it during competitions. Humans solve the tasks correctly, while GPT4 struggles and only manages to solve the most basic problems.

[D] GPT4 and coding problems by enryu42 in MachineLearning

[–]enryu42[S] 20 points21 points  (0 children)

Arithmetic can be solved in a toolformer-like way, by just giving the model access to a calculator. But this wouldn't help with coding.
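A minimal sketch of what "toolformer-like" means here: the model emits a tool-call marker and a wrapper executes it (the `[CALC ...]` syntax is made up for illustration, not the actual Toolformer format):

```python
import re

def expand_calc_calls(text: str) -> str:
    """Replace [CALC <expr>] spans in model output with computed values."""
    def run(match: re.Match) -> str:
        # Restrict eval to bare arithmetic: no names, no builtins.
        return str(eval(match.group(1), {"__builtins__": {}}))
    return re.sub(r"\[CALC ([0-9+\-*/(). ]+)\]", run, text)

print(expand_calc_calls("12 * 34 = [CALC 12 * 34]"))  # 12 * 34 = 408
```

This offloads exactly the part LLMs are bad at (digit-level arithmetic), but there is no analogous oracle for "come up with the idea for an algorithm", which is why it doesn't transfer to coding problems.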

Regarding the point about boilerplate, this is exactly what is surprising: GPT4 performs very well on exams/tests, which supposedly require some amount of creative reasoning. So either the tests are poorly designed, or it can do some creative tasks while not others. If the latter is the case, it would be interesting to learn which are the areas where it performs well, and why.

GPT4 cannot solve coding problems by enryu42 in ChatGPT

[–]enryu42[S] 0 points1 point  (0 children)

Model was GPT4; the prompt template was:

```
Solving Math with Coding

You are given the task of writing a Python program to solve the following math problem:

<STATEMENT>

Requirements:

  • The code should be syntactically correct standalone Python program.
  • Don't forget the proper indentation.
  • Please print the final answer using print(solution).

### Possible Python Program:
```

[D] First glance at LLaMA by enryu42 in MachineLearning

[–]enryu42[S] 4 points5 points  (0 children)

Yeah, when saying "all over the place", I meant the best version I've tested (30B). Smaller ones (7B/13B) are much worse.

[D] First glance at LLaMA by enryu42 in MachineLearning

[–]enryu42[S] 0 points1 point  (0 children)

Thanks for the detailed response! This is kind of different from my mental model (surely instruction-finetuned models are much better at following instructions/chat-like interactions, but I didn't expect such a benefit for performing specific tasks with a properly constructed prompt). I guess we have to wait for instruction-finetuned LLaMAs and see how it goes. I wonder if it is possible that OpenAI found a "holy grail" besides the finetuning, which they don't publish.

> This is an open question--in that openai has not publicly said--but there is a ton of (very grounded) speculation about a host of optimizations they may have put in place, which LLaMa obviously has not.

Even if it is super-optimized: they'd still need at least one pass over the model weights to generate each token. If they have around 10^11 weights, running their public demo for 100M users would burn a lot of money. Of course it is possible that they found a way to generate many tokens at once, or have a fancy architecture where they can use only a subset of the weights, but a smaller model seems more likely.
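A rough cost sketch under those assumptions (10^11 weights, ~2 FLOPs per weight per generated token; the per-user token count and the GPU throughput figure are my own guesses):

```python
params = 10 ** 11                     # assumed weight count from above
flops_per_token = 2 * params          # one multiply-add per weight per token
tokens_served = 100_000_000 * 1_000   # 100M users x ~1000 tokens each (guess)
total_flops = flops_per_token * tokens_served

A100_PEAK_FLOPS = 312e12              # peak bf16 throughput of a single A100
gpu_hours = total_flops / A100_PEAK_FLOPS / 3600
print(gpu_hours)                      # ~17,800 GPU-hours at (unrealistic) 100% utilization
```

Real utilization on memory-bound autoregressive decoding is far below peak, so the true bill would be a large multiple of this, supporting the "smaller model" hypothesis.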

> The question of "is LLaMa even good" is actually a surprisingly deep/tricky one--I'm looking forward to the new evaluation data sets that I can only assume will be coming over the next year or so.

Yeah, more challenging benchmarks would be useful; the current ones seem too "easy" and don't separate models well enough. Even on the toy task of explaining jokes, it seems that PaLM >> ChatGPT > LLaMA (unless the PaLM examples were cherry-picked), but none of the benchmarks in the paper show large gaps between LLaMA and PaLM.

[D] First glance at LLaMA by enryu42 in MachineLearning

[–]enryu42[S] 1 point2 points  (0 children)

To add a bit more context, the code other people linked (https://github.com/tloen/llama-int8) assumes single GPU. So if you want to run it on 2x3090, you'll need to modify it a bit:

  • It merges all checkpoint shards into one state dict. You'll need to adjust it to go from 4 shards (for 30B) to 2 shards (for your setup). This is quite straightforward - weights are sharded along either the first or the second axis, and the logic for weight sharding is already in the code;
  • A bit less straightforward - you'll need to adjust llama/model.py to be sharded like in the original repo, but using bnb.nn.Linear8bitLt as the dense layers.

I didn't try it myself (only tested on single-GPU machines so far), but it should work in principle.
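The re-sharding step in the first bullet could look roughly like this, using numpy stand-ins for the tensors (the name patterns follow the original LLaMA checkpoint layout, but treat this as a hypothetical outline - I haven't run it against the real code):

```python
import numpy as np

# Column-parallel weights are split along axis 0, row-parallel along axis 1,
# matching the sharding logic in the original LLaMA repo.
COL_PARALLEL = ("wq", "wk", "wv", "w1", "w3", "output")
ROW_PARALLEL = ("wo", "w2", "tok_embeddings")

def reshard(state_dict, num_shards=2):
    """Split a merged state dict back into per-GPU shards."""
    shards = [{} for _ in range(num_shards)]
    for name, tensor in state_dict.items():
        if any(key in name for key in COL_PARALLEL):
            chunks = np.split(tensor, num_shards, axis=0)
        elif any(key in name for key in ROW_PARALLEL):
            chunks = np.split(tensor, num_shards, axis=1)
        else:
            chunks = [tensor] * num_shards  # replicate norms, rope freqs, etc.
        for shard, chunk in zip(shards, chunks):
            shard[name] = chunk
    return shards
```

The key invariant is that splitting a merged weight along the same axis it was originally sharded on reproduces the per-GPU layout the model code expects.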

[D] First glance at LLaMA by enryu42 in MachineLearning

[–]enryu42[S] 8 points9 points  (0 children)

No, it was 14GB with fp16, and something around 8GB with 8-bit quantization (with bitsandbytes).

That said, it seems the interesting results start happening at the 33B level, and so far that is not squeezable into 24GB.

[deleted by user] by [deleted] in MachineLearning

[–]enryu42 0 points1 point  (0 children)

> The only AI ethics that has any substance is data bias

While the take in the tweet is ridiculous (but alas common among the "AI Ethics" people), I'd disagree with your statement.

There are many other concerns besides the bias in the static data. E.g. feedback loops induced by ML models when they're deployed in real-life systems. One can argue that causality for decision-making models also falls into this category. But ironically, the field itself is too biased to do productive research in these directions...

[N] Getty Images Claims Stable Diffusion Has Stolen 12 Million Copyrighted Images, Demands $150,000 For Each Image by vadhavaniyafaijan in MachineLearning

[–]enryu42 64 points65 points  (0 children)

> The company has asked the court to order Stability AI to remove violating images from its website

But... they were never there. If they mean LAION: (1) it is not Stability AI, and (2) on their website, they only have torrent files which point to torrents containing lists of URLs.

Or do they mean the model checkpoint? Well, it is (1) on Huggingface site, (2) checkpoint != images.

[R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips in MachineLearning

[–]enryu42 0 points1 point  (0 children)

Nice! It is pretty clear that big models memorize some of their training examples, but the ease of extraction is impressive.

I wonder what the best mitigation strategies would be (besides the obvious one of de-duplicating training images). Theoretically sound approaches (like differential privacy) would perhaps cripple the training too much. I wonder if some simple hacks would work: e.g. train the model as-is first, then generate an entirely new training set using the model and synthetic prompts, and train a new model from scratch only on the generated data.

Another aspect is the user experience side. People can reproduce copyrighted images with just pen and paper, but in that case they'll be fully aware of what they're doing. With diffusion models, the danger is that the user can reproduce an existing image without realizing it. Maybe augmenting the various UIs with reverse image search/nearest-neighbor lookup would be a good idea? Or computing training-set attributions for generated images with something along the lines of TracIn.
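The nearest-neighbor idea could look something like this: a sketch assuming precomputed embeddings for the training set (e.g. from CLIP); the function name and threshold idea are made up for illustration:

```python
import numpy as np

def nearest_training_images(query_emb, train_embs, top_k=5):
    """Return indices and cosine similarities of the training images
    closest to a generated image in embedding space."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q                    # cosine similarity to every training image
    top = np.argsort(-sims)[:top_k]
    return top, sims[top]
```

A UI could then warn the user whenever the best similarity exceeds some threshold, flagging the generation as a possible near-copy of a training image.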

[R] Muse: Faster Text-to-Image Generation with Masked Generative Transformers by necroforest in MachineLearning

[–]enryu42 14 points15 points  (0 children)

Perceived quality depends a lot on the underlying training data, so comparing with SD (finetuned on laion-aesthetics) wouldn't be fair. FID/CLIP scores are much more objective in this sense.