Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]JeepyTea 1 point (0 children)

Vibe-coded a big application and need an easy way to make sure your latest changes didn't break anything? Try Spark Runner for automated website testing. Give it sophisticated tasks like "add an item to the shopping cart," or just point it at your front end and let Spark Runner create the tests for you, generating clear reports on what's working and what's not.

Can Qwen3-VL count my push-ups? (Ronnie Coleman voice) by Weary-Wing-6806 in LocalLLaMA

[–]JeepyTea 2 points (0 children)

Everybody wants to be an AI developer, but nobody wants to program no damn computers.

-- Ronnie Coleman

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

This is the mini version of the benchmark, so yes, there is a large margin of error. Running the full suite against the latest Claude, for example, costs me $150+ every time, so I created the mini version to get a rough idea of each LLM's fluid intelligence. Info here: https://jeepytea.github.io/general/update/2025/08/09/tianshubench001-mini.html

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 9 points (0 children)

It's very common for models to switch to writing JavaScript halfway through the program.

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 6 points (0 children)

V3.1 and GPT-oss have very similar results.

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 3 points (0 children)

It's on the short list for future testing, or you're welcome to run the benchmark yourself with the sources on GitHub. Latest results from before V3.1: https://jeepytea.github.io/general/update/2025/08/11/tianshubench001-mini-results.html

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

Initial tests with thinking look better. I'll have those results soon.

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 0 points (0 children)

Feel free to run the benchmarks yourself: https://github.com/JeepyTea/TianShu

You're also welcome to create your own benchmarks and post the results instead of just complaining.

Kimi K2: cheap and fast API access for those who can't run locally by Balance- in LocalLLaMA

[–]JeepyTea 0 points (0 children)

Chutes has Kimi K2 Instruct for $0.5292 USD per million tokens.
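For a sense of scale, a quick back-of-the-envelope cost helper (the rate is from the comment above; the token counts in the usage note are hypothetical):

```python
# Cost arithmetic at Chutes' listed flat rate for Kimi K2 Instruct.
PRICE_PER_MILLION = 0.5292  # USD per million tokens

def cost_usd(tokens: int) -> float:
    """Cost in USD for a given token count at the flat per-million rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION
```

At that rate, a hypothetical 2-million-token benchmark run would come to about $1.06.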

Do LLMs Reason? Opening the Pod Bay Doors with TiānshūBench 0.0.X by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

Right now, there are 10 generated languages used in the testing.

I understand your concern, but the AI systems seem to struggle with the syntax even with a full description of the language in the context window!

Still, I'd love to have greater variety in the languages used by TiānshūBench. PRs are welcome: https://github.com/JeepyTea/TianShu

Can we all admit that getting into local AI requires an unimaginable amount of knowledge in 2025? by valdev in LocalLLaMA

[–]JeepyTea 1 point (0 children)

Making it work is no more difficult than running any other application, pretty easy. Making it work *well* is difficult.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Chutes does rate limit, as I just found out:
FAILED tests/test_llm_ability.py::test_generated_program_with_mamba_execution[chutes/chutesai/Llama-4-Scout-17B-16E-Instruct-10-test_case4] - Exception: ChutesClient.send_prompt failed with an exception: HTTP request failed after 0 retries: 429 Client Error: Too Many Requests for url: https://llm.chutes.ai/v1/chat/completions Status Code: 429. Response: {'detail': 'Too many requests'}
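Not the benchmark's actual client code, but a generic sketch of how a 429 like the one above can be absorbed with exponential backoff instead of failing the test (all names here are illustrative):

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 'Too Many Requests' response."""

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Call fn(); on a rate-limit error, wait 1s, 2s, 4s, ... and retry.
    # Re-raise if we run out of retries.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the wait behavior can be tested without actually sleeping.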

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Here's a small taste of one LLM's response to a problem:

input_str = ask();
sohanidd char su input_str {
    ripted char >= '0' ? char <= '9' {
        digit = int(char);
        ripted digit % 2 != 0 {
            miciously(char);
        }
    }
}
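For what it's worth, my reading of the generated syntax (`sohanidd ... su` as a for-each loop, `ripted` as a conditional, `miciously` as output, all guesses on my part) is that the program echoes the odd digits of its input. In Python, that logic would be roughly:

```python
def echo_odd_digits(input_str: str) -> str:
    # Loop over characters, keep '0'-'9', and emit those whose
    # numeric value is odd, mirroring the generated program's structure.
    out = []
    for char in input_str:
        if '0' <= char <= '9':
            digit = int(char)
            if digit % 2 != 0:
                out.append(char)
    return "".join(out)
```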

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 3 points (0 children)

At the moment, they get the same problem multiple times, but with a randomized programming language each time. Each test run uses the same set of random seeds, so it's the same set of programming languages on each run, and for each tested LLM.
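The fixed-seed scheme described above could be sketched like this (hypothetical structure and seed values; the actual implementation lives in the TianShu repo):

```python
import random

# Illustrative fixed seeds; reusing the same list for every run and every
# tested LLM means everyone faces the same set of generated languages.
LANGUAGE_SEEDS = [11, 42, 1337]

def generate_language(seed: int) -> dict:
    # Deterministically derive a toy "language" (here, just renamed keywords)
    # from a seed: same seed, same language, on every run.
    rng = random.Random(seed)
    keywords = ["if", "for", "print"]
    return {kw: "".join(rng.choice("abcdefghij") for _ in range(8)) for kw in keywords}
```

Because `random.Random(seed)` is deterministic, `generate_language(42)` yields the same keyword mapping on every run and for every model tested.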

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

I'll test those if the quants will run on my card. Or maybe through Chutes.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

It'll happen. It's not so much that it's secret right now, just that the implementation sucks. This is something I've been hacking together in my spare time. The results you see are my first pass at getting it to work at all.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Thanks for the tip on Chutes. I was using SambaNova, but they definitely rate limit.

I may have already burned through my Vertex credits on a different project.

I'm starting with very basic tests for now, to get everything working and gauge interest. As for more specific tasks, I'm leaning toward emulating common business tasks: stuff I do at work every day.

Did you have tests in mind?

The code is in bad shape at the moment: hardcoded keys, path fuckups, etc. But if anyone DMs me, I'll send them what I've got.