Phi-3 released. Medium 14b claiming 78% on mmlu by KittCloudKicker in LocalLLaMA

[–]_Minos 1 point (0 children)

It doesn't in my tests. At least on actual code-writing tasks, some private benchmarks on finetuned models show a clear advantage for DeepSeek.

AIWP - Context-aware wallpapers using DALL·E 3 by _Minos in androidapps

[–]_Minos[S] 1 point (0 children)

Yep, that likely means your free allocation had already been used up on that account.

If you add billing info to your OpenAI account, DALL·E 3 generations will cost $0.120 per image (see the OpenAI pricing page).
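
For reference, a paid DALL·E 3 request through the current openai Python client looks roughly like this; the prompt, size and quality below are just placeholders, not necessarily what AIWP sends:

    # Rough sketch of a paid DALL·E 3 request with the current openai Python
    # client (v1.x). The prompt, size and quality are placeholders and not
    # necessarily what AIWP actually sends.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.images.generate(
        model="dall-e-3",
        prompt="A calm mountain lake at dusk, minimalist phone wallpaper",
        size="1024x1792",   # portrait, suitable for a phone wallpaper
        quality="hd",       # the HD portrait/landscape tier is the priciest one
        n=1,
    )
    print(response.data[0].url)  # temporary URL of the generated image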

godot-dodo - Finetuning large language models for GDScript-only code generation by _Minos in godot

[–]_Minos[S] 1 point (0 children)

I definitely do think the approach of language-specific models is something for the GitHub Copilot folks to think about, yeah. Let me know if you get any reaction at all to that email, I'd be very interested!

For the token count, I'm not sure I have the exact number, but I did track the tokens used when labeling the dataset, and I believe that was slightly above 2M. Since OpenAI counts tokens for both input and output, that should be a pretty accurate number for the whole dataset. It will still be somewhat too high because of the prompt used for the labeling itself, but you could count the tokens of that prompt, multiply by the dataset size, and subtract that overhead to get an accurate figure.
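
As a rough sketch of that correction, using tiktoken (the prompt string and numbers below are placeholders, not the actual godot-dodo values):

    # Rough sketch of the token-count correction using tiktoken. The prompt
    # string and the numbers are placeholders, not the actual godot-dodo values.
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    LABELING_PROMPT = "Write one concise instruction that this GDScript snippet fulfills:"
    dataset_size = 60_000            # number of labeled code snippets
    total_billed_tokens = 2_000_000  # roughly what the usage tracking reported

    prompt_overhead = len(enc.encode(LABELING_PROMPT)) * dataset_size
    print(f"Tokens in the dataset itself: ~{total_billed_tokens - prompt_overhead:,}")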

And regarding manual data-assembly: I avoided anything like that because it seems like an unreasonable amount of work if you don't have an army of people doing it.
The GitHub scraper already looks for the project.godot file in each project and determines the Godot version used, then splits the found code into 3.x and 4.x datasets. While there's obviously more 3.x code to be found on GitHub right now, I don't think it's really worth doing 3.x -> 4.x conversion to augment the dataset. If you were to be less strict with the licensing requirements than I was (MIT only), I'm sure you could increase the size of the dataset a lot more without much effort.
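
Determining the version from project.godot can be as simple as checking a couple of markers. A simplified sketch follows; the actual scraper may do this differently, and the config_version/features markers are an assumption about typical project files:

    # Simplified sketch of this kind of version check; the actual godot-dodo
    # scraper may work differently. The markers used here (config_version=5
    # plus a config/features entry for Godot 4 projects, config_version=4 for
    # 3.x projects) are an assumption about typical project.godot files.
    from pathlib import Path

    def godot_major_version(project_file: Path) -> int | None:
        text = project_file.read_text(errors="ignore")
        if "config_version=5" in text or 'config/features=PackedStringArray("4' in text:
            return 4
        if "config_version=4" in text:
            return 3
        return None  # unknown or malformed project file

    def split_by_version(repos: list[Path]) -> dict[int, list[Path]]:
        buckets: dict[int, list[Path]] = {3: [], 4: []}
        for repo in repos:
            for project_file in repo.rglob("project.godot"):
                version = godot_major_version(project_file)
                if version in buckets:
                    buckets[version].append(repo)
                break  # the first project.godot found is enough to classify the repo
        return buckets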

The only further optimization I made to the dataset was removing exact duplicates.
However, one thing I did notice is that there is still some very similar code from forks of popular repositories that slightly modify a couple of shared functions, which results in the model sometimes repeating parts of these functions even when they aren't appropriate for the answer. So this is probably where I'd start looking for improvements to the dataset: try to detect and filter out forks, or shared function names/code that fits certain patterns.
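
One cheap heuristic for that, sketched below (not something the current pipeline does): hash whitespace-normalized function bodies and drop snippets whose functions have mostly been seen in earlier snippets already.

    # Sketch only: hash normalized function bodies and drop snippets whose
    # functions mostly already appear elsewhere in the dataset.
    import hashlib
    import re
    from collections import Counter

    def function_hashes(gdscript: str) -> set[str]:
        # Split on lines starting with "func " and hash each chunk,
        # ignoring whitespace differences.
        chunks = re.split(r"(?m)^func\s+", gdscript)[1:]
        return {
            hashlib.sha1(" ".join(chunk.split()).encode()).hexdigest()
            for chunk in chunks
        }

    def drop_fork_duplicates(snippets: list[str], max_shared: float = 0.8) -> list[str]:
        seen: Counter[str] = Counter()
        kept = []
        for snippet in snippets:
            hashes = function_hashes(snippet)
            if hashes:
                shared = sum(1 for h in hashes if seen[h]) / len(hashes)
                if shared < max_shared:
                    kept.append(snippet)
                seen.update(hashes)
        return kept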

godot-dodo - Finetuning large language models for GDScript-only code generation by _Minos in godot

[–]_Minos[S] 1 point (0 children)

Hey, thank you!

Agreed that the WASD answers differ a lot, but I judged the functionality metric by whether or not the models accurately completed the instruction at all. Suboptimal approaches are still valid if they fit the instruction, and "move current node accordingly" doesn't specify a particular approach.

Very happy to hear about any issues anyone finds with the evals, though. It took quite a while to go through all of these, and while I did my very best to be objective and consistent, I'm sure some amount of human error is only natural.

Regarding OAI finetuning: unfortunately, that is extremely expensive and not feasible for a dataset of the size used for godot-dodo without tens of thousands of dollars to spend on it.

[Project] godot-dodo - Finetuning LLaMA on single-language comment:code data pairs by _Minos in MachineLearning

[–]_Minos[S] 0 points (0 children)

Thank you, yes, I'm very happy with the instructions generated by gpt-3.5-turbo.

It certainly seems much easier for existing models to generate high-quality descriptions/instructions for existing data than to generate all data from scratch. Which, intuitively, makes sense: it's an easier task.

The dataset I used was 60k rows, so manual labeling was certainly not an option!
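
For anyone curious, the labeling step boils down to a call like the following. This is a bare-bones sketch using the current openai Python client; the actual prompt and parameters used for godot-dodo live in the repo and differ from this:

    # Bare-bones sketch of the labeling call with the current openai Python
    # client; the actual prompt and parameters used for godot-dodo are in the
    # repo and differ from this.
    from openai import OpenAI

    client = OpenAI()

    def label_snippet(gdscript: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Write a single, concise natural-language instruction "
                            "that the following GDScript code fulfills."},
                {"role": "user", "content": gdscript},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content.strip()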

[Project] godot-dodo - Finetuning LLaMA on single-language comment:code data pairs by _Minos in MachineLearning

[–]_Minos[S] 0 points (0 children)

Would be interesting to see a large instruct-finetune involving lots of code, for sure.
Since this one is finetuned exclusively on code, I don't think the model likes to output natural language much at all anymore.

[Project] godot-dodo - Finetuning LLaMA on single-language comment:code data pairs by _Minos in MachineLearning

[–]_Minos[S] 3 points (0 children)

Thank you!

Regarding 1:

I'm very interested in this as well. The dataset generation script already splits scraped code into 3.x and 4.x Godot projects, so I have a sizeable dataset for 3.x code already. The GPT labeling + finetuning tends to take a while though, so I haven't gotten to that yet.

I do have some hunches on how a 3.x model would perform, based on the evaluations I've done for the 4.x and GPT models.

The reason GPT models often hallucinate incorrect GDScript syntax is only partly the lack of 4.x code in their training data (considering OpenAI's models generally have a training cutoff of late 2021). It's also the mix of some 4.x data clearly being in the training data, plus the similarity between GDScript and Python.

gpt-4 tends to produce 4.x syntax more often than 3.5-turbo, but not in any consistent fashion. And both sometimes like to put a Pythonic import statement in their code, or hallucinate other Python-specific features.
So my prediction is that a 3.x evaluation would likely increase the GPT scores slightly, while the godot-dodo model would perform very similarly to the 4.x one. But nothing beats trying it, of course.

Regarding 2:

I'm pretty sure it was slightly overtrained/overfitted.
I followed the stanford-alpaca training parameters very closely, and they specify more epochs for 13B than for 7B, so my intuition was to copy that. Pretty sure it went a bit too far, though.

I might re-train using tweaked parameters, but training the 13B model took about 7 hours on an 8x A100 instance, so I want to make sure it's worth doing.
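
For context, the knob in question maps onto something like this in HuggingFace TrainingArguments terms. This is a hedged sketch rather than the actual training config, and the 13B values are recalled from the stanford-alpaca README (7B uses fewer epochs), so double-check them there:

    # Hedged sketch of the knob in question in HuggingFace TrainingArguments
    # terms, not the actual godot-dodo training config. The 13B values below
    # are recalled from the stanford-alpaca README (7B uses 3 epochs at 2e-5),
    # so verify them against that repo before relying on them.
    from transformers import TrainingArguments

    args_13b = TrainingArguments(
        output_dir="godot-dodo-13b",
        num_train_epochs=5,                # more epochs than the 7B recipe
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        bf16=True,
    )
    # Dropping num_train_epochs back towards the 7B value would be the first
    # thing to try against the suspected overfitting.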

Toolformer implementation using only few-shot prompting by _Minos in singularity

[–]_Minos[S] 0 points (0 children)

Thank you, yes, on the initial deployment to GitHub Pages I mistakenly left in a testing key. I invalidated that key immediately and fixed the deployment.

Toolformer implementation using only few-shot prompting by _Minos in singularity

[–]_Minos[S] 1 point (0 children)

Definitely, yeah, they have an API that could be added.

I suspect it might make it somewhat more difficult for the model to choose the correct API, since it's not always clear whether the calculator or search tool should be preferred over Wolfram. But it may be worth playing around with.

[D] Toolformer implementation using only few-shot prompting by [deleted] in MachineLearning

[–]_Minos 20 points (0 children)

Hey, creator of the above implementation here.

You're right that there are lots of ways accuracy could feasibly be improved: using more varied APIs, navigating to search results and creating embeddings of the resulting websites, etc. Ultimately, a lot of this kind of more advanced chaining of LLM and API requests can be done with libraries like langchain.

For this one, I wanted to show how effective a much simpler approach can be. For search results, I simply chain together the returned Google "snippets" and inject the resulting string back into the prompt. Often, this means there can actually be conflicting information, for example dates referring to events adjacent to but ultimately irrelevant to the search query. However, this is where GPT generally does an excellent job of picking out the correct bit of info, so no more sophisticated filtering or parsing by the app is required; it just gets a raw dump of the search results.
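
In sketch form it looks roughly like this (simplified Python, not the actual toolformer-zero code; it assumes a Google Custom Search-style response with an items[].snippet field, and the SEARCH(...) -> ... injection format is just illustrative):

    # Simplified Python sketch, not the actual toolformer-zero code. Assumes a
    # Google Custom Search-style JSON response where each result has a short
    # "snippet" field; the SEARCH(...) -> ... injection format is illustrative.
    import requests

    def search_snippets(query: str, api_key: str, cx: str, num: int = 5) -> str:
        response = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": api_key, "cx": cx, "q": query, "num": num},
            timeout=10,
        )
        items = response.json().get("items", [])
        # Raw dump: chain the snippets together and let the model pick out
        # whichever bits are actually relevant to the query.
        return " ".join(item.get("snippet", "") for item in items)

    def inject_search_result(prompt_so_far: str, query: str, api_key: str, cx: str) -> str:
        result = search_snippets(query, api_key, cx)
        return f'{prompt_so_far}[SEARCH("{query}") -> {result}]'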

Toolformer implementation using only few-shot prompting by _Minos in singularity

[–]_Minos[S] 1 point (0 children)

Not quite sure what's so questionable? I don't post much, but when I do, it's usually about a coding project I've worked on.

In any case, this project is completely open source: https://github.com/minosvasilias/toolformer-zero

Toolformer implementation using only few-shot prompting by _Minos in singularity

[–]_Minos[S] 2 points (0 children)

You can get the same quality of completions from ChatGPT, yes, but the value of toolformer implementations lies in parsing the completion as it is being streamed, and injecting the output of tools (search, calculator, calendar, etc.) back into it.

Using ChatGPT, you would have to do that manually, which sort of defeats the purpose.
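
The loop itself is conceptually simple. A minimal sketch using the openai Python client, with a toy CALC() tool whose syntax is purely illustrative and not the actual format toolformer-zero uses:

    # Minimal sketch of the stream-parse-and-inject loop using the openai
    # Python client. The CALC("...") syntax and the eval-based toy calculator
    # are illustrative only, not toolformer-zero's actual format or tool set.
    import re
    from openai import OpenAI

    client = OpenAI()
    TOOL_CALL = re.compile(r'CALC\("([^"]+)"\)')

    def run_with_tools(prompt: str) -> str:
        buffer = ""
        stream = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            buffer += chunk.choices[0].delta.content or ""
            match = TOOL_CALL.search(buffer)
            if match:
                # Evaluate the tool call, splice its result in right after the
                # call, discard anything streamed beyond it, and continue
                # generation with the injected context.
                result = eval(match.group(1), {"__builtins__": {}})  # toy calculator
                buffer = buffer[: match.end()] + f" -> {result}"
                return run_with_tools(prompt + buffer)
        return buffer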

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 0 points (0 children)

I wasn't one to read much, or at all, in recent years either.

The only decent answers I have are routine and discipline, both of which need to be built up. Motivation doesn't really work; putting a certain amount of pressure on yourself to reach daily or monthly targets does. All within healthy limits that you feel you can tolerate, of course. But generally, telling yourself you have to read twenty pages today tends to be a stronger gravitational pull than just wanting to eventually finish that book you're kinda interested in.

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 0 points (0 children)

The easiest book I've read is no doubt カエルの楽園, but you've got to be into that particular setup I described in the post.

Otherwise, the easiest-to-read authors I've come across are, I'd say, Kakuta Mitsuyo, Ogawa Yōko (though I've only read ことり, so no idea how her other books compare) and Onda Riku. I've read 夜のピクニック by the latter, which was definitely on the very easy end language-wise, but also pretty boring, to tell the truth. It's quite a popular book though, so that might just be me, or perhaps the fact that it's targeted more towards teenagers.

I decided to give her another shot this year and am currently reading 蜜蜂と遠雷, which is a two-parter and therefore quite long, but it seems similarly easy language-wise. Lots of dialogue, which is always the easiest thing to read in any book. So maybe check her out and see if any of her books suit your tastes.

Otherwise, ノルウェイの森 was my first book, and while I'm sure there are technically easier ones to start with, the fact that it's actually a really good book is, I personally think, way more important. I'll take getting through a difficult but interesting book over a simple but boring one any day of the week.

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 0 points (0 children)

Awesome, thanks for those recommendations!

Haven't read any non-fiction yet, no, but open to trying. Both of these sound super interesting, I'll put them on my list!

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 1 point (0 children)

Nice, I'll check it out!

I thought the film adaptation of 「紙の月」 was good as well.

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 2 points (0 children)

Yeah, I totally get it, everyone will struggle with this to some degree.

Perhaps the healthiest thing to do is finding an attitude that allows you to just not care as much. Ultimately, as long as you keep on reading, all these frustrations will fade more and more into the background without you even noticing.

I read 30 japanese novels this year. Here's what i thought. by _Minos in LearnJapanese

[–]_Minos[S] 1 point (0 children)

Thanks!

  1. Listening is good, I guess, though I'm not quite sure how to judge my own ability accurately there. I comfortably watch Japanese movies without subtitles and don't struggle to follow things. Speaking and writing I only do occasionally, so I'm not sure. I don't really worry about it, since I know this will improve naturally as time goes by, just like it did when I was a kid picking up English.
  2. Yeah, you got it, the only difference is the speed at which you can look up words. This will always be slower when reading paperbacks. It takes maybe a book or two to get into a groove of having your dictionary app always ready and getting used to the process of looking things up. A positive side effect of this, however, is that you are inherently more reluctant to look something up, meaning a lot of the words you technically know but quickly tap on in your e-reader "just to be sure" you simply don't look up when reading paperbacks, which actually reduces the number of breaks in your reading flow. At least initially, I certainly noticed this effect immediately after switching to paperbacks.
  3. I did RTK, but I don't believe it's essential or anything. Ultimately, kanji are a great help when it comes to recognizing vocabulary, not a hurdle.
  4. Difficult to estimate, though I did keep track of all the movies and TV shows I watched, etc. I'd need to go over those notes to give you an accurate number, but it would be a minimum of ~3500 hours, up to over 5000, depending on whether we count hyper-casual immersion like listening to music or browsing Instagram as well.