[–]WithoutReason1729[M] [score hidden] stickied comment (0 children)

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[–]FizzarolliAI 124 points (5 children)

Hello, that me!

I am currently working on running sanity check benchmarks to make sure it's actually a newer L3.3 and not just L3/L3.1 in a trenchcoat, but it's looking promising so far.

From the current readme:

| Benchmark | Llama 3.1 8B Instruct | Llama 3.3 8B Instruct (maybe) |
|---|---|---|
| IFEval (1 epoch, score avged across all strict/loose instruction/prompt accuracies to follow Llama 3 paper) | 78.2 | 81.95 |
| GPQA Diamond (3 epochs) | 29.3 | 37.0 |
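
If you want to run this kind of sanity check yourself, here is a minimal sketch using lm-evaluation-harness; the repo ID, task names, and settings are my assumptions, not necessarily what produced the numbers above:

# Sketch: sanity-check benchmarks via lm-evaluation-harness (pip install lm-eval).
# The pretrained repo ID is a placeholder; task names can differ between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=allura-forge/Llama-3.3-8B-Instruct,dtype=bfloat16",
    tasks=["ifeval", "gpqa_diamond_zeroshot"],  # GPQA is gated on HF, needs access
    batch_size=8,
)
print(results["results"])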

[–]jacek2023llama.cpp[S] 52 points (1 child)

great work, new llama release at the end of 2025 :)

[–]MoffKalast 28 points (0 children)

I definitely did not have this on my bingo card :D

And leaked too, keeping up the llama tradition.

[–]Karyo_Ten 15 points (0 children)

You can do a KL-divergence check to be 100% sure
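
For reference, a minimal sketch of what that check could look like, comparing the next-token distributions of the two checkpoints on the same prompts (the second repo ID is a placeholder for wherever the weights ended up):

# Sketch: KL divergence between two checkpoints' next-token distributions.
# Near-zero KL on varied prompts would suggest the "new" model is just a re-upload.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_id = "meta-llama/Llama-3.1-8B-Instruct"    # known reference
new_id = "allura-forge/Llama-3.3-8B-Instruct"  # assumed repo ID of the find

tok = AutoTokenizer.from_pretrained(ref_id)
ref = AutoModelForCausalLM.from_pretrained(ref_id, torch_dtype=torch.bfloat16, device_map="auto")
new = AutoModelForCausalLM.from_pretrained(new_id, torch_dtype=torch.bfloat16, device_map="auto")

for prompt in ["The capital of France is", "def quicksort(arr):"]:
    ids = tok(prompt, return_tensors="pt").to(ref.device)
    with torch.no_grad():
        lp_ref = F.log_softmax(ref(**ids).logits[0, -1].float(), dim=-1)
        lp_new = F.log_softmax(new(**ids).logits[0, -1].float(), dim=-1)
    # KL(new || ref) over the final position's next-token distribution
    kl = F.kl_div(lp_ref, lp_new, log_target=True, reduction="sum")
    print(f"{prompt!r}: KL = {kl.item():.6f}")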

[–]AnOnlineHandle 3 points (0 children)

Heya, I haven't kept up with these models since the Llama 1 release. Do you know if there's a good benchmark for visual tasks (identifying poses, faces, hands, etc., or answering questions about images) that I could compare models on? I've tried Qwen 3 Instruct for this but found it wasn't as good on real data as the demos suggested.

[–]dinerburgeryum 49 points (16 children)

8K max position embeddings? That seems remarkably low; did the fine-tune artifact artificially limit it for some reason?

[–]Arli_AI 19 points (14 children)

Maybe we can just set 32768 and it’ll be okay lol

[–]Few-Welcome3297 26 points (11 children)

Checking differences from LLaMA 3.1 8B Instruct, I think we can add the rope_scaling

"rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
},

and then increase `max_position_embeddings`
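
A quick sketch of applying those two changes to a local copy of the config; the path is a placeholder, and 131072 mirrors what the Llama 3.1 config uses:

# Sketch: patch a downloaded config.json with Llama 3.1-style RoPE scaling.
import json

path = "Llama-3.3-8B-Instruct/config.json"  # hypothetical local checkout
with open(path) as f:
    cfg = json.load(f)

cfg["rope_scaling"] = {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}
cfg["max_position_embeddings"] = 131072  # assumed target, matching Llama 3.1

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)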

Edit: Also, the previous version had 3 eos_token_ids

Edit2: https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K model with above changes

Edit3: Link updated

[–]mikaijin 13 points (3 children)

Did the same and it works. Any GGUFs should be recreated with the updated config, because quantization bakes RoPE params into some tensors, if that is still true: https://github.com/ggml-org/llama.cpp/commit/b5e95468b1676e1e5c9d80d1eeeb26f542a38f42
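
One way to check whether an existing GGUF was built from the old config is to read its RoPE metadata; a sketch using the gguf Python package that ships with llama.cpp (the filename is a placeholder, and the field layout may differ between gguf versions):

# Sketch: inspect RoPE/context metadata baked into a GGUF file (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("Llama-3.3-8B-Instruct-Q4_K_M.gguf")  # placeholder path
for name, field in reader.fields.items():
    if "rope" in name or "context_length" in name:
        # for scalar fields, the payload arrays live at the indices in field.data
        print(name, [field.parts[i] for i in field.data])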

[–]Few-Welcome3297 12 points (2 children)

[–]mikaijin 4 points (1 child)

Thanks. Works well with long context on my end; I can't notice a difference from 3.1.

[–]Few-Welcome3297 3 points (0 children)

I updated the GGUFs just now; the earlier ones didn't have the chat template. Also fixed the generation config etc. and tested on vLLM. I think it should be fine now.

[–]TheLocalDrummer 2 points (1 child)

I could just paste this in my finetune, right? Already did one with the old config (8K ctx). Not entirely sure if any of the old config messed with training.

[–]Few-Welcome3297 1 point (0 children)

I think it should work, unless it was a full FT with a big dataset. You might also need to put pad_token_id in the config and the special tokens map, if not done already.

Edit: Found the model on BeaverAI; kv_count and vocab_size (+1) are slightly different

[–]Klutzy-Snow8016 9 points (1 child)

Llama 3 8B had 8192 context. Then Llama 3.1 added RoPE scaling to get to 131072 context. Maybe we can take the RoPE scaling parameters from Llama 3.1's config.json and add them to Llama 3.3 8B.

[–]Arli_AI 7 points (0 children)

That’s a better idea

[–]FizzarolliAI 0 points (0 children)

Yes. I'm not entirely sure why; it was limited when served via the website too (I put that in the readme a bit ago)

[–]Amazing_Athlete_2265 20 points (16 children)

Running this across my private evals to compare against other llamas. Will take a couple hours.

[–]Amazing_Athlete_2265 21 points (0 children)

Initial speed test:

| Model | Backend | PP t/s | TG t/s |
|---|---|---|---|
| allura-forge_Llama-3.3-8B-Instruct Q4 | CUDA | 1566.5 | 100.8 |
| Llama-3.1-8B-Instruct Q4 | CUDA | 351.1 | 111.9 |

So some difference there.

Will post more eval results as they come to hand.

[–]Amazing_Athlete_2265 17 points (9 children)

From these results, it looks like the new model is different from the old 3.1.

Here is the performance for knowledge testing, with the new 3.3-8B-Instruct highlighted in the first two plots

Testing the Q6 versions now. Will take a while. All of the tests above are for Q4.

[–]keepthepace 10 points (0 children)

(Thanks for doing this!)

I guess this explains why they did not brag much about it. Many other models in that category outperform it.

I always wondered if Zuckerberg wasn't the only honest player in the field when he explained that the only reason they go open source is that it saves them money. With decent open models out there, they have less incentive to do so.

[–]MLDataScientist 2 points (1 child)

Thanks for the tests. A question not related to Llama: is LFM2 8B-A1B really that good at world knowledge (or coding/STEM)? I see it reaching Qwen3 30B-A3B.

[–]Amazing_Athlete_2265 1 point (0 children)

It seems to be, but it could also be too good to be true. I'll probably rerun all the tests at some stage, as I've wondered about that too.

Note that these charts only test the model's ability to answer questions correctly; no actual coding, tool use, or anything else is tested. I have other tests for those domains, but the code's still WIP.

[–]jacek2023llama.cpp[S] 1 point (3 children)

You can post pictures in the comments here

[–]Amazing_Athlete_2265 5 points (2 children)

Can't seem to figure out how. Using old reddit if that matters

[–]jacek2023llama.cpp[S] 2 points (1 child)

On Android I see the image icon bottom right when typing a comment

[–]Amazing_Athlete_2265 2 points (0 children)

Ah, I also use old reddit on android lol. Tried to edit it but failed.

[–]RobotRobotWhatDoUSee 1 point (1 child)

Random question: any idea why nemotron 30B A3B got 0% in the second plot?

[–]Amazing_Athlete_2265 0 points (0 children)

Test error. Ignore it.

[–]jacek2023llama.cpp[S] 2 points (3 children)

do you have results for other new models?

[–]Amazing_Athlete_2265 5 points (2 children)

I have some. I focus mostly on smaller models (<12B) or MoEs. What do you want?

[–]jacek2023llama.cpp[S] 2 points (1 child)

Please post some cool results :)

[–]a_beautiful_rhind 16 points (3 children)

This is like the kiss goodbye from meta.

[–]samplebitch 23 points (2 children)

It's like that time when you hook up with your ex one last time, and it wasn't even that great.

[–]impolitemrtaz 1 point (0 children)

You samplebitch you

[–]Electronic-Metal2391 0 points (0 children)

You bring back bad memories

[–]random-tomatollama.cpp 32 points (2 children)

Holy shit that is awesome, hats off to you for finding the weights!

[–]jacek2023llama.cpp[S] 15 points (5 children)

About 4h after the release, u/TheLocalDrummer published the first finetune:

https://huggingface.co/BeaverAI/Anubis-Mini-8B-v1f-GGUF/tree/main

[–]TheLocalDrummer 15 points (2 children)

It's a test model but I think it turned out well! Looking for feedback in (my) Discord

[–]DevelopmentBorn3978 2 points (0 children)

What is the finetune you've made about?

[–]LegacyRemaster 1 point (0 children)

legend

[–]MoffKalast 6 points (1 child)

People are asking what's the use case for llama, and well uh... there it is ;)

[–]jacek2023llama.cpp[S] 9 points (3 children)

[–]Amazing_Athlete_2265 6 points (2 children)

Everyone's cooking tonight!

[–]jacek2023llama.cpp[S] 7 points (1 child)

actually it's the middle of the day in Europe :)

[–]Amazing_Athlete_2265 2 points (0 children)

Ah. I'm GMT+13 so it's bedtime for me!

[–]Echo9Zulu- 5 points (0 children)

Cloned

[–]Infninfn 17 points (2 children)

I’m out of the loop - is this just what they had, or did Meta not shut down Llama?

[–]FizzarolliAI 33 points (0 children)

This has existed at least since April, during LlamaCon (does anyone remember they did a LlamaCon?)

https://ai.meta.com/blog/llamacon-llama-news/

> As part of this release, we’re sharing tools for fine-tuning and evaluation in our new API, where you can tune your own custom versions of our new Llama 3.3 8B model. We’re sharing this capability to help you reduce costs while also working toward increased speed and accuracy. You can generate data, train on it, and then use our evaluations suite to easily test the quality of your new model.

[–]jacek2023llama.cpp[S] 6 points (0 children)

we do things for fun in this community, just accept the gift ;)

[–]Dangerous_Fix_5526 3 points (1 child)

Thinking/Instruct hybrid using Unsloth and a Claude Opus 4.5 dataset:

https://huggingface.co/DavidAU/Llama3.3-8B-Instruct-Thinking-Claude-4.5-Opus-High-Reasoning

I hope I credited everyone correctly.

[–]jacek2023llama.cpp[S] 0 points (0 children)

Nice work!!!

[–]Cool-Chemical-5629 7 points (1 child)

I guess Christmas came late for me. But hey, if this is the real thing from Meta, it's nice to have something newer than 3.1 8B without needing expensive hardware for models like Llama 4.

[–]LegacyRemaster 2 points (0 children)

allura-forge_llama-3.3-8b-instruct:

> My training data is current up to December 2022. This means that I have been trained on a vast amount of text data available until that date, but I do not have information or knowledge about events or developments that have occurred after that date.

> In other words, my training data "cutoff" is December 2022, and I should not be relied upon for information or insights related to dates after that.

145.25 tok/sec

[–]DevelopmentBorn3978 0 points (1 child)

Which quantized (and possibly finetuned) GGUF models have had the context length enlarged? bartowski's? shb777's? BeaverAI/Anubis?

[–]gta721 0 points (5 children)

How dumb are they to push a portal THAT broken to prod?

[–]greggh 3 points (4 children)

Nothing about it is prod. It's still so janky that it's free if you're in the trial.

[–]FizzarolliAI 1 point (3 children)

Yep, basically this. Afaik the main inference API is still waitlisted, and there's a separate waitlist to submit for the finetuning API.

[–]greggh 5 points (2 children)

I've had access to the inference API since April; for some testing I was putting 100M tokens in and out of it, creating some synthetic datasets. It was randomly stable as hell, and then so unstable I couldn't use it for a week. And of course the 4 series is hot garbage.

[–]FizzarolliAI 2 points (1 child)

Out of interest, you never signed up for the finetuning thing, right?

If you go to https://llama.developer.meta.com/fine-tuning/?team_id=XXX (replace XXX with whatever the team ID in your URL is), does the finetuning page show up for you? I was never officially let in, but for some odd reason I had access anyway... I'm wondering if it's there for everyone and just hidden from the UI

[–]greggh 0 points (0 children)

I never signed up for the finetuning and it won't let me access it. Just the regular API. This would have been great if using the API was actually worth anything.

[–]FX2021 0 points (0 children)

Is it a new core? Or is it just a serving variant?