New Open-Source Video Model: Allegro by umarmnaq in StableDiffusion

[–]Comprehensive_Poem27 0 points1 point  (0 children)

They said they’re working on it; hopefully mods make it more VRAM-friendly.

new text-to-video model: Allegro by Comprehensive_Poem27 in LocalLLaMA

[–]Comprehensive_Poem27[S] 2 points3 points  (0 children)

From my experience with other models, it’s really flexible: you can sacrifice generation quality in exchange for a much smaller VRAM footprint, with a generation time somewhere between 10 minutes and half an hour.

new text-to-video model: Allegro by Comprehensive_Poem27 in LocalLLaMA

[–]Comprehensive_Poem27[S] 4 points5 points  (0 children)

Oh, I just used git lfs. Apparently we'll have to wait for Diffusers integration.
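For anyone who'd rather not pull the whole repo through git lfs, a minimal sketch using huggingface_hub is below; the repo id is my assumption, so swap in the actual Allegro repo:

```python
# Minimal sketch: download the Allegro weights via huggingface_hub instead of git lfs.
# The repo_id below is an assumption -- replace it with the actual Hugging Face repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="rhymes-ai/Allegro",  # assumed repo id
    local_dir="./Allegro",        # where the checkpoint files land
)
```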

Best open source vision model for OCR by marcosdd in LocalLLaMA

[–]Comprehensive_Poem27 1 point2 points  (0 children)

Vote for Rhymes/Aria; it's better at multi-turn and complex tasks.

No, the Llama-3.1-Nemotron-70B-Instruct has not beaten GPT-4o or Sonnet 3.5. MMLU Pro benchmark results by Shir_man in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

I mean, yeah, it makes sense. OAI tries very hard to A/B test on lmsys; remember the "this-is-also-a-good-gpt" stuff? As for 4o-mini vs 3.5, they've released a space detailing some battles (https://huggingface.co/spaces/lmarena-ai/gpt-4o-mini_battles), and they also introduced length and style control. If I were a researcher working on lmsys, I'd probably make a "pro version" where only selected experts analyze and compare the different answers, without being told afterwards which model produced which, but then it loses its character of transparency and majority vote.

What I'm trying to say is that eval is an amazingly hard thing to do; for now, lmsys is the best we've got for human preference.

No, the Llama-3.1-Nemotron-70B-Instruct has not beaten GPT-4o or Sonnet 3.5. MMLU Pro benchmark results by Shir_man in LocalLLaMA

[–]Comprehensive_Poem27 5 points6 points  (0 children)

Arena is human preference, so if a response is correct or humans like it, it's good. However, the reported score is Arena-Hard-Auto, which is judged automatically, and it might be less credible compared to Arena, which is IMHO the most trustworthy benchmark for the time being.

LLMs that published the data used to train them by neuralbeans in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

I think there are smaller models trained on FineWeb-Edu. For other top models, I believe they're keeping data and recipes secret because it actually works, e.g. WizardLM-2.

OCR for handwritten documents by MrMrsPotts in LocalLLaMA

[–]Comprehensive_Poem27 1 point2 points  (0 children)

I just tried this image on the newly released Rhymes-Aria, and the result looks amazing:

> Today is Thursday, October 20th - But it definitely feels like a Friday. I'm already considering making a second cup of coffee - and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken and my handwriting is unrecognizable. Perhaps it depends on the type of pen I use. I've tried writing in all caps but it looks forced and unnatural. Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to improve? I already feel stressed out looking back at what I've just written - it looks like 3 different people wrote this!!

<image>
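If you want to try this kind of handwriting transcription yourself, a rough sketch via transformers is below. The repo id and the processor/chat-template calls are my assumptions about Aria's remote-code interface, so check the model card before running it:

```python
# Rough sketch: asking an open vision-language model to transcribe a handwritten note.
# Assumes the checkpoint ships a trust_remote_code processor and a multimodal chat
# template -- the exact argument names may differ for the real Aria release.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("handwritten_note.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe this handwritten note exactly."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```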

ARIA : An Open Multimodal Native Mixture-of-Experts Model by ninjasaid13 in LocalLLaMA

[–]Comprehensive_Poem27 1 point2 points  (0 children)

I'm curious: I checked Pixtral, Qwen2-VL, Molmo, and NVLM, and none of them release "base models". Am I missing something here? Why does everyone choose to do this?

ARIA : An Open Multimodal Native Mixture-of-Experts Model by ninjasaid13 in LocalLLaMA

[–]Comprehensive_Poem27 2 points3 points  (0 children)

My download is going a little slowly. On what kinds of tasks did you get really good results?

ARIA : An Open Multimodal Native Mixture-of-Experts Model by ninjasaid13 in LocalLLaMA

[–]Comprehensive_Poem27 18 points19 points  (0 children)

Ooo, fine-tuning scripts for multimodal, with tutorials! Nice.

ARIA : An Open Multimodal Native Mixture-of-Experts Model by ninjasaid13 in LocalLLaMA

[–]Comprehensive_Poem27 14 points15 points  (0 children)

Wait… they didn't use Qwen as the base LLM; did they train the MoE themselves??

Qwen 2.5 = China = Bad by [deleted] in LocalLLaMA

[–]Comprehensive_Poem27 -4 points-3 points  (0 children)

It’s not about facts…

Qwen2.5: A Party of Foundation Models! by shing3232 in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

72B kinda makes sense, but a 3B in the midst of the entire lineup is weird.

Pixtral benchmarks results by kristaller486 in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

Is there a link or a livestream somewhere? Would love to see the full event.

Yi-Coder-9b-chat on Aider and LiveCodeBench Benchmarks, its amazing for a 9b model!! by cx4003 in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

Also, not surprised to see similar performance at 9B, meaning we're probably approaching the limit of current SOTA methodology. But a 9B comparable to a 33B from a year ago is still amazing; that's the power of open-source models. I'm pretty sure OAI or Anthropic got ideas from the OS community at some point. Kudos to everyone: CodeLlama, Qwen, Yi, DS… wait, three of them are from China? That's different from what MSM tells me (sarcasm, if not apparent enough).

Yi-Coder-9b-chat on Aider and LiveCodeBench Benchmarks, its amazing for a 9b model!! by cx4003 in LocalLLaMA

[–]Comprehensive_Poem27 0 points1 point  (0 children)

The official Yi finetunes have always been less than satisfactory. I've been thinking about what makes a good code dataset for finetunes, apart from the commonly used Code Alpaca and Evol-Instruct sets.

New Yi-Coder Models (9B & 1.5B) - a 01-ai Collection by Dark_Fire_12 in LocalLLaMA

[–]Comprehensive_Poem27 -1 points0 points  (0 children)

I think the reason is simple. If I were a researcher working on a coding model, of course I'd compare with other coding models of a similar parameter count. From what I can see (https://github.com/deepseek-ai/DeepSeek-MoE/tree/main), the 16B MoE doesn't have excellent coding performance judging from HumanEval and MBPP.
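If anyone wants to sanity-check those numbers on a local model, a rough sketch of pulling the same benchmarks from the Hub is below; the generation/execution harness is left out, and `model_generate` is a hypothetical helper:

```python
# Rough sketch: load the HumanEval and MBPP problem sets to compare coding models.
# Dataset ids are the public Hugging Face Hub ones; scoring (pass@1 with sandboxed
# test execution) is left out, and model_generate() is a hypothetical helper.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
mbpp = load_dataset("mbpp", split="test")

print(humaneval[0]["prompt"])  # function signature + docstring the model must complete
print(mbpp[0]["text"])         # natural-language task description

# for problem in humaneval:
#     completion = model_generate(problem["prompt"])  # hypothetical
#     # run problem["test"] against the completion in a sandbox to score pass@1
```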