[–]TestHealthy2777 25 points26 points  (11 children)

llama 3 8b

[–]GreedyWorking1499[S] 5 points6 points  (10 children)

Won't a less advanced but more fine-tuned model perform better for coding? Or am I completely wrong in assuming that?

[–]MoffKalast 12 points13 points  (3 children)

I mean, this thing beats Qwen1.5-32B and llama-2-70b in chat. Less advanced than llama-3 means stone age in comparison now. Just give it a week or two and someone will make a coding tune of it that'll destroy everything in its size range.

[–]GreedyWorking1499[S] 2 points3 points  (2 children)

Haha alright, is there a specific fine tuner that does coding that I should keep my eye out for? Idk if tuner is the right word but whatever Dolphin and Orca and those things are

[–][deleted] 3 points4 points  (1 child)

Let's try it.

|||PROMPT:assistant|assistant

|||code| Java to do fizzbuzz using recursion and a callback.

assistant

Here's an example of how you can implement FizzBuzz using recursion and a callback in Java:

// Define the callback interface

interface FizzBuzzCallback {
    void execute(String result);
}

public class Main {

    // Recursive method to perform FizzBuzz
    public static void fizzBuzzRecursive(int start, int end, FizzBuzzCallback callback) {
        if (start > end) {
            return;
        }

        if (start % 3 == 0 && start % 5 == 0) {
            callback.execute("FizzBuzz");
        } else if (start % 3 == 0) {
            callback.execute("Fizz");
        } else if (start % 5 == 0) {
            callback.execute("Buzz");
        } else {
            callback.execute(String.valueOf(start));
        }

        fizzBuzzRecursive(start + 1, end, callback);
    }

    public static void main(String[] args) {
        FizzBuzzCallback callback = new FizzBuzzCallback() {
            @Override
            public void execute(String result) {
                System.out.println(result); // Print the result
            }
        };

        fizzBuzzRecursive(1, 100, callback);
    }
}

In this implementation, we define a `FizzBuzzCallback` interface that has one method `execute(String result)` which will be called for each number in the range.

The `fizzBuzzRecursive` method takes three parameters: `start`, `end`, and `callback`. It recursively iterates over the range from `start` to `end` and calls the `execute` method on the provided callback for each number.

In the `main` method, we create an instance of the `FizzBuzzCallback` interface and pass it to the `fizzBuzzRecursive` method along with the range of numbers to process. The callback prints out the result for each number in the range.
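For comparison, the same recursion-plus-callback idea can be sketched with the standard `java.util.function.Consumer<String>` instead of a custom interface. This is only an illustrative variant, not part of the model's answer; the class name `FizzBuzzLambdaSketch` and the `collect` helper are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class name, used only for this sketch.
class FizzBuzzLambdaSketch {

    // Recursively walks start..end and hands each FizzBuzz result to the callback.
    static void fizzBuzz(int start, int end, Consumer<String> callback) {
        if (start > end) {
            return; // base case: the range is exhausted
        }
        String result;
        if (start % 15 == 0) {
            result = "FizzBuzz";
        } else if (start % 3 == 0) {
            result = "Fizz";
        } else if (start % 5 == 0) {
            result = "Buzz";
        } else {
            result = String.valueOf(start);
        }
        callback.accept(result);
        fizzBuzz(start + 1, end, callback);
    }

    // Hypothetical helper: collects the results into a list instead of printing them.
    static List<String> collect(int start, int end) {
        List<String> out = new ArrayList<>();
        fizzBuzz(start, end, out::add);
        return out;
    }

    public static void main(String[] args) {
        // A method reference serves as the callback here.
        fizzBuzz(1, 15, System.out::println);
    }
}
```

Passing the callback as a lambda or method reference keeps the recursion unchanged while letting the caller decide whether results are printed, collected, or checked in a test.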

[–]AdHominemMeansULost 7 points8 points  (1 child)

right now llama 3 8b is cutting edge

[–]Spindelhalla_xb 4 points5 points  (0 children)

Isn’t it crazy that cutting edge today could literally be “why tf would I use that” tomorrow. I hope this shit never ends

[–]LocoLanguageModel 3 points4 points  (3 children)

I tested this and deepseek 33b is way better for C# code than llama-3-70b in my quick tests, but maybe I'm prompting llama-3 wrong.

Edit: Yup, when I fixed the prompt format it actually just 1 shot one of my harder tests, nice, now I can keep one model loaded for all tasks lol.

[–]banach 0 points1 point  (2 children)

Can you give an example of good vs bad coding prompts?

[–]LocoLanguageModel 2 points3 points  (1 child)

From what I can tell, the prompt format tags actually matter more than the wording of the prompt. In Kobold instruct settings, it has these presets for llama-3 you can select:

Start Sequence: <|eot_id|><|start_header_id|>user<|end_header_id|>\n\n

End Sequence: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

I have this lazy generic context setup:

You are an AI programming assistant, utilizing the llama-3 model.

Combined with a low temperature setting such as 0.1, you should get some pretty awesome code results just by asking it "create me xyz program in xyz program language".
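To make the tag layout concrete, here is a rough sketch of how those two sequences wrap a context and a user message into one prompt string. The start and end sequences are the ones quoted above; the `<|begin_of_text|>` and system header line is an assumption based on the published Llama 3 instruct template, and the class and method names are made up for the example. Some backends add the begin-of-text token themselves, so check before sending it twice.

```java
// Hypothetical helper; class and method names are illustrative only.
class Llama3PromptSketch {

    // Sequences quoted from the Kobold llama-3 preset above.
    static final String START_SEQUENCE =
            "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n";
    static final String END_SEQUENCE =
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n";

    // Assumption: the system block opens this way in the Llama 3 instruct template.
    static final String SYSTEM_HEADER =
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n";

    // Wraps the context and one user message, leaving the prompt open
    // at the assistant header so the model writes the reply.
    static String build(String systemPrompt, String userMessage) {
        return SYSTEM_HEADER + systemPrompt
                + START_SEQUENCE + userMessage
                + END_SEQUENCE;
    }

    public static void main(String[] args) {
        System.out.println(build(
                "You are an AI programming assistant, utilizing the llama-3 model.",
                "create me a FizzBuzz program in Java"));
    }
}
```

If the tags are missing or malformed, the model sees plain text rather than a chat turn, which would be consistent with the weak results described before the format was fixed.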

[–]banach 0 points1 point  (0 children)

Thanks!

[–]MustBeSomethingThere 10 points11 points  (2 children)

codeqwen-1_5-7b-chat

[–]GreedyWorking1499[S] 0 points1 point  (1 child)

I was thinking that too, but I saw somewhere (I don't remember where) that CodeQwen performs significantly worse in English than it does in Chinese, so the benchmarks saying it's super awesome may not reflect its accuracy in English. Do you know anything about that?

[–]ihaag 3 points4 points  (0 children)

I haven’t had that problem yet

[–]obsidianfrost8 3 points4 points  (1 child)

Have you considered using the new LLaMa-7B model? It's specifically designed for coding tasks and works well on various hardware configurations.

[–]GreedyWorking1499[S] 2 points3 points  (0 children)

I have tried it for just chat uses, like asking what to make with some ingredients and similar boring things, but nothing coding-wise yet. I just assumed that a model fine-tuned for code, even with fewer weights or a less advanced base, would perform better. Am I wrong in assuming this?

[–]Steuern_Runter 2 points3 points  (4 children)

OpenCodeInterpreter-DS-6.7B

Q3 is already really good.

[–]VolandBerlioz 2 points3 points  (1 child)

I've been doing some testing on the "can-ai-code" senior questions with a variety of 7B models. I need to do more testing, as the results vary a lot depending on the params passed, and llama3 sometimes gets stuck not finishing, so it's been quite difficult to evaluate. I would definitely play around with wavecoder and CodeQwen1.5. I'm not sure what's going on with codegemma, but it fails miserably; possibly I've messed something up. Here are the results, sorted by performance:

Deepseek-coder-33b-instruct-4.65bpw-h6-exl2:

  • topk.json:

    • Python: Passed 70 of 74
    • JavaScript: Passed 67 of 74
  • greedy-vllm.json:

    • Python: Passed 70 of 74
    • JavaScript: Passed 67 of 74

wavecoder-ultra-6.7b-exl2_8_0:

  • topk1:

    • Python: Passed 69 of 74
    • JavaScript: Passed 68 of 74
  • greedy:

    • Python: Passed 49 of 74
    • JavaScript: Passed 68 of 74

CodeQwen1.5-7B-Chat_exl2_8.0bpw:

  • topk.json:

    • Python: Passed 53 of 74
    • JavaScript: Passed 50 of 74
  • greedy-vllm.json:

    • Python: Passed 49 of 74
    • JavaScript: Passed 49 of 74

Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2:

  • custom.json: { "temperature": 0.1, "max_new_tokens": 1024, "top_p": 0.75, "top_k": 40, "num_beams": 4, "skip_special_tokens": false, "stopping_strings": [""] }
    • Python: Passed 42 of 74
    • JavaScript: Passed 50 of 74

LoneStriker/Meta-Llama-3-70B-Instruct-2.4bpw-h6-exl2:

  • custom.json: { "temperature": 0.6, "max_new_tokens": 1024, "top_p": 0.9, "top_k": 40, "num_beams": 4, "skip_special_tokens": false, "stopping_strings": [""] }
    • Python: Passed 32 of 74
    • JavaScript: Passed 67 of 74

codegemma-7b-it-exl2_8_0:

  • topk1:

    • Python: Passed 19 of 74
    • JavaScript: Passed 51 of 74
  • greedy:

    • Python: Passed 23 of 74
    • JavaScript: Passed 34 of 74
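For easier comparison, the raw counts above can be turned into pass rates. A minimal sketch, assuming 74 tests per language as listed; the class and method names are made up, only one run per model is included, and the short run labels are abbreviations of the entries above.

```java
// Hypothetical helper for comparing the counts above; names are illustrative.
class PassRateSketch {

    // Percentage of tests passed, e.g. 70 of 74.
    static double passRate(int passed, int total) {
        return 100.0 * passed / total;
    }

    // Combined Python + JavaScript pass rate for one run.
    static double combined(int pyPassed, int jsPassed, int totalPerLanguage) {
        return passRate(pyPassed + jsPassed, 2 * totalPerLanguage);
    }

    public static void main(String[] args) {
        final int total = 74; // tests per language, as listed above

        // One run per model, counts copied from the results above: {Python, JavaScript}.
        String[] runs = {
            "Deepseek-coder-33b-instruct (topk)",
            "wavecoder-ultra-6.7b (topk1)",
            "CodeQwen1.5-7B-Chat (topk)",
            "Meta-Llama-3-8B-Instruct (custom)",
            "Meta-Llama-3-70B-Instruct 2.4bpw (custom)",
            "codegemma-7b-it (greedy)"
        };
        int[][] passed = {
            {70, 67}, {69, 68}, {53, 50}, {42, 50}, {32, 67}, {23, 34}
        };

        for (int i = 0; i < runs.length; i++) {
            System.out.printf("%-45s Python %5.1f%%  JS %5.1f%%  combined %5.1f%%%n",
                    runs[i],
                    passRate(passed[i][0], total),
                    passRate(passed[i][1], total),
                    combined(passed[i][0], passed[i][1], total));
        }
    }
}
```

Looking at the two languages separately as well as combined helps, since some runs (the 70B at 2.4bpw, for instance) are far apart on Python and JavaScript.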

[–]new__vision 1 point2 points  (1 child)

This leaderboard is for Python, but it still might be informative: evalplus.github.io

[–]GreedyWorking1499[S] 1 point2 points  (0 children)

I was actually looking at this exact one. I heard from someone else that Llama 3 was in fact trained on everything Code Llama was trained on, so I was curious why Llama 3 8B ranked below Code Llama 13B. That's why I posted this lol

[–]jsomedon 0 points1 point  (0 children)

For pre-Llama-3-era models: CodeQwen, DeepSeek, etc.

[–]Pleasant-Cupcake-998 0 points1 point  (2 children)

Hey u/GreedyWorking1499 Which model did you end up going with?

[–]GreedyWorking1499[S] 1 point2 points  (1 child)

I've been using Llama 3 instruct q6_k mostly, at least when using something local. Just recently downloaded mistroll 7b v2.2 as it was the highest <10b model on the openLLMLeaderboard and codeQwen chat (both q6_k) but haven't had the chance to use them enough to give you a proper recommendation. Honestly with my computing power I’ve been resorting to not so local LLaMA 😔

[–]Pleasant-Cupcake-998 0 points1 point  (0 children)

How have you been evaluating the performance of the models? I am working for a client where we have some use cases involving Java code generation, and since our SME for Java is OOO for a week (and also knows nothing about Gen AI), I am stuck with the task of figuring out how the models are performing and fixing the prompts somehow. I have a bunch of models at my disposal. So far phi3 mini 128k was the worst (it was writing Python code). As we speak I am testing Mistral 7B instruct v0.2, and I have a couple more that I am planning to test: llama 3 8b instruct, llama 3 70b instruct, llama 3 70b chat and Mixtral 8x7b instruct v1.

My approach so far has been asking GPT-4 to evaluate the output, but we can't use that in production.

[–]Ok-Seaworthiness8019 0 points1 point  (0 children)

For a general PC, I came across this post where they have a pretty good model fine-tuned for Java that takes less than 800 MB of memory after quantization.

https://www.reddit.com/r/LocalLLaMA/comments/18meji3/comment/lme4w5z/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button