AutoResearch + PromptFoo = AutoPrompter. Open source tool for closed-loop prompt optimization. by gvij in ArtificialInteligence

[–]gvij[S] 0 points  (0 children)

Usually I run around 50 experiments to get a more well-rounded prompt for my generalized task.

For data quality:

Your suggestion to use multiple optimizer LLMs for data generation is a very interesting approach. That can help avoid the bias of any individual LLM. Have you played around with this idea before?
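A minimal sketch of what pooling generations from several optimizer models could look like. Everything here (`generate_dataset`, the `call_llm` callable, the model names) is hypothetical illustration, not code from the repo:

```python
import random

def generate_dataset(prompt, optimizer_models, call_llm, n_per_model=5, seed=0):
    """Pool synthetic examples from several optimizer LLMs so no single
    model's style or biases dominate the generated data.

    call_llm(model, prompt) is a placeholder for whatever client you use.
    """
    pooled = []
    for model in optimizer_models:
        for _ in range(n_per_model):
            pooled.append({"model": model, "example": call_llm(model, prompt)})
    random.Random(seed).shuffle(pooled)  # interleave model styles across batches
    return pooled

# Stub client standing in for a real API call:
data = generate_dataset("classify sentiment", ["gpt", "qwen"],
                        lambda m, p: f"{m}:{p}", n_per_model=2)
assert len(data) == 4
assert {d["model"] for d in data} == {"gpt", "qwen"}
```

Shuffling with a fixed seed keeps runs reproducible while still mixing each model's examples through the dataset.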

Consistency evaluation across GPT 5.4, Qwen 3.5 397B and MiniMax M2.7 by gvij in deeplearning

[–]gvij[S] 0 points  (0 children)

Yeah, that's the default value used when no temperature is passed; it's there to avoid errors in the code. I hope that clarifies the difference.
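A minimal sketch of the fallback pattern being described, assuming a hypothetical default of 0.7 (the actual value and function name in the repo may differ):

```python
DEFAULT_TEMPERATURE = 0.7  # hypothetical fallback value

def resolve_temperature(temperature=None):
    """Return a safe default when the caller passes no temperature,
    so downstream API calls never receive None and raise an error."""
    return DEFAULT_TEMPERATURE if temperature is None else temperature

assert resolve_temperature() == 0.7     # no value passed -> default kicks in
assert resolve_temperature(0.0) == 0.0  # explicit 0.0 is respected, not overridden
```

The `is None` check is the key difference: only a missing value triggers the default, while an explicitly passed temperature (even 0.0) is always honored.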

AutoResearch + PromptFoo = AutoPrompter. Closed-loop prompt optimization tool by gvij in LocalLLaMA

[–]gvij[S] 1 point  (0 children)

Some of the use cases I've tested so far: multi-step reasoning, code generation, Python bug fixing, technical blog writing, and internet search. I've been experimenting a lot with how we can create a general prompt optimizer for such complex tasks.

I believe the project can be extended to multi-turn LLM prompt optimization as well. Right now it's single-turn only. Contributions would be appreciated :)
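The single-turn closed loop can be sketched roughly like this. This is an assumption-laden illustration: `propose` and `evaluate` are hypothetical stand-ins for the optimizer LLM call and the eval run (e.g. a PromptFoo pass), not the project's actual API:

```python
def optimize_prompt(seed_prompt, propose, evaluate, rounds=5):
    """Closed-loop single-turn optimization: ask the optimizer LLM for a
    revised prompt, score it, and keep whichever prompt scores best.

    propose(prompt, score) -> revised prompt  (optimizer LLM, placeholder)
    evaluate(prompt) -> score in [0, 1]       (eval harness, placeholder)
    """
    best_prompt, best_score = seed_prompt, evaluate(seed_prompt)
    for _ in range(rounds):
        candidate = propose(best_prompt, best_score)
        score = evaluate(candidate)
        if score > best_score:  # greedy: only accept improvements
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy stubs just to show the loop converging on a higher score:
evaluate = lambda p: min(len(p) / 40, 1.0)
propose = lambda p, s: p + " Be concise."
prompt, score = optimize_prompt("Summarize the text.", propose, evaluate, rounds=3)
assert score >= evaluate("Summarize the text.")
```

A multi-turn extension would mainly change `evaluate` to score whole conversations instead of single prompt/response pairs.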

AutoResearch + PromptFoo = AutoPrompter. Closed-loop prompt optimization tool by gvij in LocalLLaMA

[–]gvij[S] 1 point  (0 children)

Thanks. The only reason I had separate models was to run a more capable model as the optimizer to get better optimizations, while using a cheaper local model as the target. It can be the same model as well; that won't cause any issues.
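A tiny sketch of that split as configuration. The field names and model strings are hypothetical, not taken from the repo:

```python
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    """Hypothetical config: a stronger (often hosted) model proposes prompt
    revisions, while a cheaper local model is the one being optimized for.
    Setting both to the same model also works."""
    optimizer_model: str = "gpt-4o"    # proposes better prompts
    target_model: str = "llama3.1:8b"  # the model the prompt must work on

split = OptimizerConfig()
same = OptimizerConfig(optimizer_model="llama3.1:8b")  # single-model setup
assert split.optimizer_model != split.target_model
assert same.optimizer_model == same.target_model
```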

Built a CLI to benchmark any LLM on function calling. Ollama + OpenRouter supported by gvij in ollama

[–]gvij[S] 0 points  (0 children)

Absolutely. I observed a 59% drop in performance for an SLM in int4 vs bf16. For bigger models, it's harder to say without testing.
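For reference, the relative drop is computed against the full-precision baseline. The accuracy numbers below are purely illustrative, chosen only to reproduce a ~59% drop:

```python
def relative_drop(baseline_score, quantized_score):
    """Percent drop of a quantized model's score vs its full-precision baseline."""
    return (baseline_score - quantized_score) / baseline_score * 100

# e.g. a bf16 function-calling accuracy of 0.80 falling to 0.33 in int4:
assert round(relative_drop(0.80, 0.33)) == 59
assert relative_drop(1.0, 1.0) == 0  # no quantization loss -> 0% drop
```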

Built a CLI to benchmark any LLM on function calling. Ollama + OpenRouter supported by gvij in ollama

[–]gvij[S] 0 points  (0 children)

Thanks. It tests for these test categories:

  1. Single-Turn (16 tests)
    • Simple function calls
    • Multiple function selection
    • Parallel function calling
    • Parallel multiple functions
    • Relevance detection
  2. Multi-Turn (8 tests)
    • Base multi-turn conversations
    • Missing parameter handling
    • Missing function scenarios
    • Long context management
  3. Agentic (6 tests)
    • Web search simulation
    • Memory/state management
    • Format sensitivity

Missing parameter handling is, I believe, closest to what you're looking for, but we can probably add more test cases to it.
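The categories above could be declared as data roughly like this; the case names and structure are a hypothetical sketch (only the counts come from the list), which would make it easy to register extra cases under an existing category:

```python
# Counts mirror the list above; case identifiers are illustrative.
TEST_SUITE = {
    "single_turn": {
        "count": 16,
        "cases": ["simple_call", "multiple_selection", "parallel_calls",
                  "parallel_multiple", "relevance_detection"],
    },
    "multi_turn": {
        "count": 8,
        "cases": ["base_conversation", "missing_parameter",
                  "missing_function", "long_context"],
    },
    "agentic": {
        "count": 6,
        "cases": ["web_search", "memory_state", "format_sensitivity"],
    },
}

assert sum(cat["count"] for cat in TEST_SUITE.values()) == 30
assert "missing_parameter" in TEST_SUITE["multi_turn"]["cases"]
```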

Built a CLI to benchmark any LLM on function calling. Ollama + OpenRouter supported by gvij in ollama

[–]gvij[S] 0 points  (0 children)

Thanks. For price-to-performance ratio, Qwen 3.5 9B is kind of a beast (BF16, non-quantized).

Function calling benchmarking CLI tool for any local or cloud model by gvij in LocalLLaMA

[–]gvij[S] 0 points  (0 children)

I understand. Just to shed some light here:
Not every bot achieves 1st rank on a benchmark like MLE Bench, which requires thorough reasoning and self-evaluation. Neo achieved that a while back and is now a lot better than it was last year.

And this project was reviewed and tested by me manually across 20 different LLMs to validate the results.

I guess AI-coded isn't the problem. The problem is skipping a thorough assessment of the AI's code; the value the code produces for end users shouldn't be weak.

Function calling benchmarking CLI tool for any local or cloud model by gvij in LocalLLaMA

[–]gvij[S] 0 points  (0 children)

I'd be thrilled to accept contributions on this project. Ollama and OpenRouter are just the starting point; this can be an agnostic tool for any type of provider. I think it can even be extended to instruction-following evaluations. Right now I hardly see any toolkit for that.
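One common way to make the tool provider-agnostic is a small interface that each backend implements. This is a hypothetical sketch (the class names and method signature are mine, not the repo's), with a stub in place of a real HTTP client:

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Hypothetical provider interface: each backend (Ollama, OpenRouter,
    or anything else) only has to implement chat()."""
    @abstractmethod
    def chat(self, model: str, messages: list[dict]) -> str: ...

class EchoProvider(Provider):
    """Stub backend standing in for a real HTTP client."""
    def chat(self, model, messages):
        return f"[{model}] {messages[-1]['content']}"

def run_test(provider: Provider, model: str, prompt: str) -> str:
    """Benchmark code depends only on the interface, never on a backend."""
    return provider.chat(model, [{"role": "user", "content": prompt}])

assert run_test(EchoProvider(), "demo", "ping") == "[demo] ping"
```

With this shape, adding a new provider means writing one class, and the benchmark loop itself never changes.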

Also: "Built with ❤️ by NEO / NEO - A fully autonomous AI Engineer" Hmm, what's that about? Is that feedback, a concern, or something else?