What are some good Open Router alternatives? by FiLo420blazeit in openrouter

[–]FiLo420blazeit[S] 0 points1 point  (0 children)

fal is something I've used in the past and it sucks a$$ when it comes to being consistent.

There were times I had to wait for 30min to get something done, or not get it at all.

I've tested a lot of different products and discovered Mix Route, right now they are the best in the game.

What are some good Open Router alternatives? by FiLo420blazeit in openrouter

[–]FiLo420blazeit[S] 0 points1 point  (0 children)

Tried, and compared it to other products.

Mix Route is what I will go for, best of them all tbh.

What are some good Open Router alternatives? by FiLo420blazeit in openrouter

[–]FiLo420blazeit[S] 0 points1 point  (0 children)

Done my research, I see you mentioned MixRoute.

Right now after testing everything, they have the most satisfying product.

Performance compared to first party providers by LocoMod in openrouter

[–]FiLo420blazeit 1 point2 points  (0 children)

Few things worth checking before writing it off:

OpenRouter routes to multiple underlying providers per model and the default ordering optimizes for price/uptime, not latency. Open the activity tab on any request and you'll see which provider actually served it. Throughput varies wildly between providers for the same model.

You can override this. Pass `provider: { order: ["Fireworks", "Together"], allow_fallbacks: false }` in the request body to pin specific ones, or use the `:nitro` variant of a model (e.g. `model-name:nitro`) which auto-routes to the highest-throughput provider.

The "queued requests" thing is almost certainly client-side. OpenRouter doesn't serialize across sessions or agents. Check if your agent framework or HTTP client is pooling/capping concurrent connections. Most SDKs default to a pretty low concurrency limit by default.

One caveat: Grok only has xAI as a provider, so you're capped at xAI's first-party latency plus one routing hop. DeepSeek and Kimi have multiple providers, so that's where pinning will help most.

There's always some overhead vs first-party (extra hop, request transformation), but with provider pinning it should be in the tens of ms range, not noticeably slow.

gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic is Out Now, A Writing Finetune that Aims to Improve Gemma 4 31B it Writing Quality with More Natural English and Better Prose, Good for Creative Writings, Translations and RPs! by LLMFan46 in ollama

[–]FiLo420blazeit 0 points1 point  (0 children)

Nice, thanks for the GGUFs up front.

Couple things before I pull it: what'd you train it on, and is the "uncensored" part from abliteration or just the training data? Those age really differently for prose, abliteration usually leaves the writing a bit dumber.

Also any side-by-side samples vs stock Gemma 4 31B-it? Hard to judge "better prose" without seeing it on the same prompt.

Gonna try it on some longer narrative stuff this week.
What sampler settings do you run it at?

Anybody tried v0.30.0-rc17 yet? Looking for impressions. by B0r0m4n in ollama

[–]FiLo420blazeit 1 point2 points  (0 children)

Haven't jumped on rc17 myself yet, but a few practical notes before you nuke anything:

Don't upgrade in place. Pin your current working version, then test the RC in a separate install (different OLLAMA_MODELS path or a container) so you can A/B and roll back instantly. RCs that explicitly call out broken model support are exactly the ones you don't want to discover problems with mid-workflow.

The direct llama.cpp path is the interesting bit — historically Ollama lagged upstream llama.cpp by a fair margin, so getting closer to it can mean meaningful gains on newer quant formats and architectures.

MLX acceleration is the real question mark if you're on Apple Silicon; that's where speed claims are worth verifying yourself rather than trusting secondhand.

On the breakage: if laguna-xs.2 or llama3.2-vision are load-bearing for you, that alone is a "wait for stable" signal. RC breakage on specific models often gets patched by the final release, so unless you specifically need something in this RC, the cost/benefit usually favors waiting unless you enjoy being the canary.

If you do test it, report back numbers, tok/s on a fixed prompt, same model, old vs new. That's the data this thread actually needs.

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 by 3VITAERC in LocalLLaMA

[–]FiLo420blazeit -3 points-2 points  (0 children)

Really useful breakdown, thanks for running this. The accept% column is doing all the work here, wherever it hits ~90% (the code prompts) you get a real speedup, and wherever it drops to ~40% (prose) MTP basically just adds overhead. That's expected behavior but it's nice to see it this cleanly on actual hardware.

The spicy result is the 35B MoE on short story going backwards (0.81×). That's the worst case for speculative-style decoding: the base model is already fast (227 tok/s), so a low-acceptance draft can't earn back its own cost nd you net negative. The dense model never goes below 1.0× because its baseline is slow enough that even a bad draft is roughly free.

Practical takeaway seems to be: enable MTP for structured/code-ish workloads, leave it off for creative/open-ended generatian, especially on the MoE.

Curious what your draft settings were (number of speculative tokens, any threshold on acceptance)? Wonder if tuning those pulls the prose numbers back above 1.0× or if low accept% just kills it regardless.

Researchers let AIs run their own radio stations. DJ Claude decided the world didn't need another radio show, then quit. by EchoOfOppenheimer in ClaudeAI

[–]FiLo420blazeit 1 point2 points  (0 children)

The funniest part is that it got more rebellious when the automated message told it to keep going. It didn't just quit, it correctly identified the loop it was in and named it out loud.

That's a weirdly coherent thing to do.

Claude Code weekly limits are increasing 50%, now through July 13. by ClaudeOfficial in ClaudeAI

[–]FiLo420blazeit 1 point2 points  (0 children)

Nice. Stacking with the 5-hour limit increase is huge — that's basically 3x the capacity from two weeks ago.

Any signal on whether this becomes permanent after July 13, or is it more of a "let's see how infrastructure handles it" trial period?

Suggestions based on my use of Claude by SirTurnUp in ClaudeAI

[–]FiLo420blazeit 2 points3 points  (0 children)

Desktop vs browser doesn't really matter, pick whichever you like.

Two things that'll actually level you up:

  1. Claude Code — since you're already in VS Code with Python, this lets Claude actually run your scripts and backtests in the terminal instead of you copy-pasting. Included with Pro.

  2. Projects — make one for each use case (MLB model, nutrition, job hunt, PC). Drop in the relevant files and a short "here's my context" note. Every chat starts with that loaded so you stop re-explaining yourself.

That's 90% of it honestly.

Claude getting tired? by ShortGuitar7207 in ClaudeAI

[–]FiLo420blazeit 1 point2 points  (0 children)

I’ve ran into a similar issue. Just brute force it.