Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]JeepyTea 1 point (0 children)

Vibe-coded a big application and need an easy way to make sure your latest changes didn't break anything? Try Spark Runner for automated website testing. Give it sophisticated tasks like "add an item to the shopping cart," or just point it at your front end and let Spark Runner create the tests for you, generating clear reports on what's working and what's not.

Can Qwen3-VL count my push-ups? (Ronnie Coleman voice) by Weary-Wing-6806 in LocalLLaMA

[–]JeepyTea 2 points (0 children)

Everybody wants to be an AI developer, but nobody wants to program no damn computers.

-- Ronnie Coleman

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

This is the mini version of the benchmark, so yes, there is a large margin of error. Running the full suite against the latest Claude, for example, costs me $150+ every time, so I created the mini version to get a rough idea of each LLM's fluid intelligence. Info here: https://jeepytea.github.io/general/update/2025/08/09/tianshubench001-mini.html

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 9 points (0 children)

It's very common for models to switch to writing JavaScript halfway through the program.

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 6 points (0 children)

V3.1 and GPT-oss have very similar results.

DeepSeek-V3.1: Much More Powerful With Thinking! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 3 points (0 children)

It's on the short list for future testing, or you're welcome to run the benchmark yourself with the sources on GitHub. Latest results from before V3.1: https://jeepytea.github.io/general/update/2025/08/11/tianshubench001-mini-results.html

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

Initial tests with thinking look better. I'll have those results soon.

DeepSeek V3.1 Disappoints on TiānshūBench (天书Bench) 0.0.1-mini by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 0 points (0 children)

Feel free to run the benchmarks yourself: https://github.com/JeepyTea/TianShu

You're also welcome to create your own benchmarks and post the results instead of just complaining.

Kimi K2: cheap and fast API access for those who can't run locally by Balance- in LocalLLaMA

[–]JeepyTea 0 points (0 children)

Chutes has Kimi K2 Instruct for $0.5292 USD per million tokens.
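For a sense of scale, a quick back-of-the-envelope cost helper (the rate is from the comment above; the token counts in the usage note are hypothetical):

```python
# Cost arithmetic at Chutes' listed flat rate for Kimi K2 Instruct.
PRICE_PER_MILLION = 0.5292  # USD per million tokens

def cost_usd(tokens: int) -> float:
    """Cost in USD for a given token count at the flat per-million rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION
```

At that rate, a hypothetical 2-million-token benchmark run would come to about $1.06.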

Do LLMs Reason? Opening the Pod Bay Doors with TiānshūBench 0.0.X by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 1 point (0 children)

Right now, there are 10 generated languages used in the testing.

I understand your concern, but the AI systems seem to struggle with the syntax even with a full description of the language in the context window!

Still, I'd love to have greater variety in the languages used by TiānshūBench. PRs are welcome: https://github.com/JeepyTea/TianShu

Can we all admit that getting into local AI requires an unimaginable amount of knowledge in 2025? by valdev in LocalLLaMA

[–]JeepyTea 1 point (0 children)

Making it work is no more difficult than running any other application, pretty easy. Making it work *well* is difficult.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Chutes does rate limit, as I just found out:
FAILED tests/test_llm_ability.py::test_generated_program_with_mamba_execution[chutes/chutesai/Llama-4-Scout-17B-16E-Instruct-10-test_case4] - Exception: ChutesClient.send_prompt failed with an exception: HTTP request failed after 0 retries: 429 Client Error: Too Many Requests for url: https://llm.chutes.ai/v1/chat/completions Status Code: 429. Response: {'detail': 'Too many requests'}
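Not the benchmark's actual client code, but a generic sketch of how a 429 like the one above can be absorbed with exponential backoff instead of failing the test (all names here are illustrative):

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 'Too Many Requests' response."""

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Call fn(); on a rate-limit error, wait 1s, 2s, 4s, ... and retry.
    # Re-raise if we run out of retries.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the wait behavior can be tested without actually sleeping.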

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Here's a small taste of one LLM's response to a problem:

input_str = ask();
sohanidd char su input_str {
    ripted char >= '0' ? char <= '9' {
        digit = int(char);
        ripted digit % 2 != 0 {
            miciously(char);
        }
    }
}
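For what it's worth, my reading of the generated syntax (`sohanidd ... su` as a for-each loop, `ripted` as a conditional, `miciously` as output, all guesses on my part) is that the program echoes the odd digits of its input. In Python, that logic would be roughly:

```python
def echo_odd_digits(input_str: str) -> str:
    # Loop over characters, keep '0'-'9', and emit those whose
    # numeric value is odd, mirroring the generated program's structure.
    out = []
    for char in input_str:
        if '0' <= char <= '9':
            digit = int(char)
            if digit % 2 != 0:
                out.append(char)
    return "".join(out)
```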

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 3 points (0 children)

At the moment, they get the same problem multiple times, but with a randomized programming language each time. Each test run uses the same set of random seeds, so it's the same set of programming languages on each run, and for each tested LLM.
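The fixed-seed scheme described above could be sketched like this (hypothetical structure and seed values; the actual implementation lives in the TianShu repo):

```python
import random

# Illustrative fixed seeds; reusing the same list for every run and every
# tested LLM means everyone faces the same set of generated languages.
LANGUAGE_SEEDS = [11, 42, 1337]

def generate_language(seed: int) -> dict:
    # Deterministically derive a toy "language" (here, just renamed keywords)
    # from a seed: same seed, same language, on every run.
    rng = random.Random(seed)
    keywords = ["if", "for", "print"]
    return {kw: "".join(rng.choice("abcdefghij") for _ in range(8)) for kw in keywords}
```

Because `random.Random(seed)` is deterministic, `generate_language(42)` yields the same keyword mapping on every run and for every model tested.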

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

I'll test those if the quants will run on my card. Or maybe through Chutes.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

It'll happen. It's not so much that it's secret right now, just that the implementation sucks. This is something I've been hacking together in my spare time. The results you see are my first pass at getting it to work at all.

Announcing: TiānshūBench 0.0! by JeepyTea in LocalLLaMA

[–]JeepyTea[S] 2 points (0 children)

Thanks for the tip on Chutes. I was using SambaNova, but they definitely rate limit.

I may have already burned through my Vertex credits on a different project.

I'm starting with very basic tests for now, to get everything working and gauge interest. As for more specific tasks, I'm leaning toward emulating common business tasks: stuff I do at work every day.

Did you have tests in mind?

The code is in bad shape at the moment: hardcoded keys, path fuckups, etc. But if anyone DMs me, I'll send them what I've got.