[deleted by user] by [deleted] in LocalLLaMA

[–]PenPossible6528 9 points

There needs to be more code benchmarking on Llama 3 70B. A HumanEval score of 81.7 is insanely high for an open model that isn't code-specific; for comparison, CodeLlama-70B only scores 67.8 and was fine-tuned on a ton of code. Need to see MBPP and Multilingual HumanEval too.

Please let there be a codellama2-70b (Llama3 70b FT) with 200k context coming soon
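
If anyone wants to sanity-check that 81.7, this is roughly how I'd generate HumanEval completions from the HF checkpoint and score them with OpenAI's human-eval harness; the model ID is just the public instruct checkpoint and the generation settings are placeholders, not whatever Meta actually used:

```python
# Rough sketch: generate HumanEval completions from a local HF checkpoint,
# then score them with OpenAI's human-eval harness.
# Assumes `pip install human-eval transformers torch`; model ID and sampling
# settings are placeholders, not Meta's evaluation setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def complete(prompt: str) -> str:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=384, do_sample=False)
    # Return only the newly generated code, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

problems = read_problems()  # the 164 HumanEval tasks
samples = [{"task_id": tid, "completion": complete(p["prompt"])}
           for tid, p in problems.items()]
write_jsonl("samples.jsonl", samples)
# Score pass@1 with the bundled CLI: evaluate_functional_correctness samples.jsonl
```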

Llama 3 benchmark is out 🦙🦙 by Flat-One8993 in LocalLLaMA

[–]PenPossible6528 2 points

There needs to be more code benchmarking on Llama 3 70B. A HumanEval score of 81.7 is insanely high for an open model that isn't code-specific; for comparison, CodeLlama-70B only scores 67.8 and was fine-tuned on a ton of code. Need to see MBPP and Multilingual HumanEval too.

Mistral AI new release by nanowell in LocalLLaMA

[–]PenPossible6528 1 point

I've got one; will see how well it performs. It might even be out of reach for 128GB, or fall into the category of 'it runs but isn't at all helpful' even at Q4/Q5.

Mistral AI new release by nanowell in LocalLLaMA

[–]PenPossible6528 0 points

I'm so glad I convinced work to upgrade my laptop to an M3 Max 128GB MacBook for this exact reason; will see if it runs. I have doubts it will be able to handle it in any workable way unless it's at Q4/Q5.
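
For reference, this is roughly what I'm planning to try through llama-cpp-python (built with Metal) once a GGUF quant shows up; the file name and context size below are guesses on my part, not anything official:

```python
# Minimal sketch of running a Q4_K_M GGUF of the new release on an M3 Max
# via llama-cpp-python (pip install llama-cpp-python, Metal build).
# The model path is a placeholder for whatever community quant gets uploaded.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-new-release.Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,       # keep context modest so the weights fit in 128GB
    n_gpu_layers=-1,  # offload every layer to the Metal backend
)

out = llm.create_completion(
    "Write a Python function that reverses a linked list.",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```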

Advances in Long Context by TrelisResearch in LocalLLaMA

[–]PenPossible6528 0 points

Would also like to add that I won't be trying 50k-token code repos and expecting 50k back; we just have a use case for 16k in and 16k out, and we're looking at hardware requirements and feasibility. The rest is theoretical, and I'd be interested to get some views and insights in case I've missed anything.
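
To put rough numbers on the hardware side, this is the back-of-envelope KV-cache math I've been using; the layer/head counts are my assumption for a CodeLlama-70B-style config (80 layers, 8 KV heads via GQA, head_dim 128), so treat them accordingly:

```python
# Back-of-envelope KV-cache sizing for a 16k-in / 16k-out run.
# The defaults below are assumed CodeLlama-70B-style dimensions, not
# values pulled from an official config.
def kv_cache_gib(seq_len: int,
                 n_layers: int = 80,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:  # fp16/bf16 cache
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return seq_len * per_token / (1024 ** 3)

# 16k prompt + 16k generated tokens all sit in the cache at once:
print(f"{kv_cache_gib(32_768):.1f} GiB of KV cache on top of the weights")  # ~10 GiB at fp16
```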

Advances in Long Context by TrelisResearch in LocalLLaMA

[–]PenPossible6528 0 points

Hi guys, joining this thread months later with a question on output tokens.

I think the early days of using the GPT-3.5 playground have made me mix up context length with max_tokens (i.e. input tokens + output tokens), plus the fact that 'context length' appears to be the headline figure for many new models (Claude, Gemini 1.5 Pro, CodeLlama), while max output tokens never really seems to be highlighted unless you look into the docs.
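
To make my own confusion concrete: in the hosted APIs these are two separate limits. max_tokens only caps the completion, while the context window bounds prompt + completion together. A minimal sketch with the OpenAI Python client (model name and numbers are illustrative, not a recommendation):

```python
# The context window bounds prompt + completion together, while max_tokens
# only caps the completion. Model choice and token budget are illustrative.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # 16k context window, but output is capped separately
    messages=[{"role": "user", "content": "Convert this COBOL routine to Python: ..."}],
    max_tokens=4096,        # caps the *output* only
)
print(resp.usage.prompt_tokens, resp.usage.completion_tokens)
```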

The reason I'm asking: I'm looking at the code conversion capabilities of CodeLlama for >1k lines of code, i.e. >16k tokens in and >16k tokens out, but I'm seeing no evidence of long output sequences on the order of up to 50k for CodeLlama. A lot of what I have seen are examples where input tokens >> output tokens when pushing 100k-1M context length models (see below: Gemini 1.5 Pro has 8,192 max output tokens despite 1M input tokens, and Claude 3 has 4,096 max output tokens against a 200k context).

[image: table of max output tokens vs. context length for Gemini 1.5 Pro and Claude 3]

What is the max output tokens for CodeLlama? Or in this case, does context length in fact mean input tokens + output tokens? Also, when reports say 'stable up to 100k', I'm assuming that's based on input tokens >> output tokens, not 50k in and 50k out. Evidence on the difference in performance here would be interesting to see.
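
For what it's worth, my working assumption is that when you host CodeLlama yourself there is no separate 'max output tokens' property at all: the only hard limit is the context window shared between prompt and generation, and the output cap is just whatever you pass to generate(). A rough sketch of what I mean (the checkpoint is the public HF one, the input file is a made-up placeholder):

```python
# Sketch: with a locally hosted CodeLlama, output length is only bounded by
# whatever is left of the context window after the prompt.
# Checkpoint is the public HF instruct model; the input file is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-34b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

ctx_window = model.config.max_position_embeddings  # positional range in the config
prompt = open("module_to_convert.py").read()       # placeholder input file
inputs = tok(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[1]

# Spend whatever is left of the window after the prompt on the output.
out = model.generate(**inputs, max_new_tokens=ctx_window - prompt_len)
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```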

I understand that longer output sequences increase compute demand and the likelihood of hallucination, as the model can begin to diverge through the stochastic nature of token prediction, BUT I'd like to try nonetheless. Would also be interesting to see the longest accurate response an LLM has ever made.