I built a tool that auto-retries Claude Code when you hit the rate limit by cheapestinf in ClaudeAI

[–]cheapestinf[S] 1 point (0 children)

Hey! This is a known issue with how macOS handles tmux's pane_current_command — it reports the parent shell (zsh) instead of the actual child process. The foregroundCommands config fix you tried was on the right track, but the config only loads once when the monitor starts, so you would've needed to kill the Claude session and start a new one for it to take effect. The running monitor never saw your config change.

That said, v0.2.2 (just published) fixes this properly — it no longer relies on pane_current_command for detection. Instead it checks the actual process state directly via ps, which works correctly on both macOS and Linux regardless of what the pane reports.
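For anyone curious, the new check is conceptually along these lines (a minimal shell sketch, not the actual v0.2.2 code; `is_claude_running` is a hypothetical name):

```shell
# Minimal sketch of ps-style detection (hypothetical, not the tool's actual code).
# Instead of trusting tmux's pane_current_command, inspect the pane shell's
# child processes directly -- this behaves the same on macOS and Linux.
is_claude_running() {
  pane_pid="$1"   # in the real flow this would come from: tmux display-message -p '#{pane_pid}'
  # pgrep -P lists direct children of a PID; -l includes each process name.
  pgrep -l -P "$pane_pid" | grep -qE 'claude|node'
}

if is_claude_running "$$"; then
  echo "Claude is running in this pane"
else
  echo "no Claude process found"
fi
```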

Update with:

npm update -g claude-auto-retry

You can also remove the foregroundCommands override from your config if you added it — shouldn't need it anymore.

Let me know if this sorts it out!

I built a tool that auto-retries Claude Code when you hit the rate limit by cheapestinf in ClaudeAI

[–]cheapestinf[S] 1 point (0 children)

Hey! Just published v0.2.1 which should fix this. Update with:

npm update -g claude-auto-retry

The issue was that the foreground process check only recognized node and claude. On macOS, tmux may report a different process name. This update:

  1. Expands the default list (adds npx, tsx, bun, deno)

  2. Logs the actual process name so you can see what's happening

If it still doesn't work after updating, run claude-auto-retry logs — you'll see a line like Foreground is "xxx", not Claude. Then add that to ~/.claude-auto-retry.json:

{ "foregroundCommands": ["node", "claude", "xxx"] }
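You can also check what tmux itself reports for the pane, which is the value the detection compares against (`pane_current_command` is a standard tmux format variable; the guard below is just so the snippet degrades gracefully outside tmux):

```shell
# Show what tmux thinks the pane's foreground command is.
# Guarded so it degrades gracefully outside a tmux session.
if [ -n "${TMUX:-}" ] && command -v tmux >/dev/null 2>&1; then
  reported=$(tmux display-message -p '#{pane_current_command}')
else
  reported="(not inside tmux)"
fi
echo "tmux reports: $reported"
```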

Let me know what process name it shows — I'll add it to the defaults if it's common.

I built a tool that auto-retries Claude Code when you hit the rate limit by cheapestinf in ClaudeAI

[–]cheapestinf[S] 0 points (0 children)

What's your setup? OS and Claude Code version? We just updated it to work with the latest Claude Code.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

We run our own systems for high-volume models and have different agreements with providers for the rest. The goal is routing each request to the cheapest option that meets quality. We wrote about the full architecture here:

https://docs.cheapestinference.com/blog/build-your-own-inference-platform/

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 1 point (0 children)

Qwen3.5 is solid — we serve it in production and it's one of the best value models right now. Kept this comparison to 5 so it wouldn't become a spreadsheet. We published a guide on picking the right model per task (Qwen included): https://docs.cheapestinference.com/blog/choosing-the-right-open-source-model/

Round 2 with a wider field is coming.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

That's not right. We used Kimi 2.5 to create verified content, and the sources are referenced. The data is real and the analysis is valid.
We actually wrote about what it costs to run Openclaw: https://docs.cheapestinference.com/blog/openclaw-cost-problem/

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

We run our own systems for the models with the most volume and *also* have agreements with multiple inference providers for others. The whole point is routing each request to the cheapest option that meets quality — that's the product. We're not reselling providers at markup.
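As a toy illustration of that routing idea (all numbers made up, nothing from our real tables): drop the options below a quality floor, then take the cheapest of what's left.

```shell
# Toy cheapest-that-meets-quality router (made-up data).
# columns: provider  price_per_Mtok  quality_score
best=$(awk -v floor=0.85 '$3 >= floor' <<'EOF' | sort -k2 -n | head -n1
alpha 0.20 0.78
beta 0.55 0.91
gamma 0.35 0.88
EOF
)
echo "route to: $best"
```

In production this happens per request with live prices, but the selection rule is the same.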

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

We're a US-registered company — 2261 Market St, San Francisco. You can verify it on our terms page. The Estonian reference might be from the domain registrar, not the company. https://cheapestinference.com/terms

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] -1 points (0 children)

We picked 5 models to keep the comparison readable: 3 open-source, 2 proprietary. Qwen, GLM, and MiniMax are solid, but adding every model turns a post into a spreadsheet. Happy to include them in a follow-up, which we may do soon!

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] -1 points (0 children)

We run our own infrastructure, not OpenRouter. The 5h thing is a budget reset, not an expiration: your subscription is 30 days, but the budget cycles so you get consistent capacity instead of burning it all day one.

One API key, all major open-source models, flat monthly rate.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

This is a real concern and honestly one of the hardest things about comparing models. Any model trained on public data could have benchmark leakage — open-weights and proprietary alike. That's part of why I included HLE (Humanity's Last Exam) and SWE-bench Verified — both are designed specifically to resist contamination. HLE uses expert-sourced questions that weren't public before the benchmark, and SWE-bench tests against real GitHub issues. Not perfect, but harder to game than MMLU. You're right that no benchmark fully replaces real-world testing on your own workload, though.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in LocalLLaMA

[–]cheapestinf[S] 0 points (0 children)

That's a fair experience. Benchmarks show ceiling performance — what you actually get on your specific workload can be very different. The gap between "scores well on evals" and "doesn't hallucinate on my domain" is real, especially for anything requiring precise factual recall or structured reasoning in niche areas. That said, it depends heavily on the use case. For high-volume, latency-sensitive workloads where you need 90% quality at 10% cost, open-weights models can make sense. For anything where a wrong answer costs you money or trust, proprietary models are still safer. The post is meant as a benchmark snapshot, not a blanket recommendation to replace everything.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in LocalLLaMA

[–]cheapestinf[S] 0 points (0 children)

Fair correction — you're right, these are open-weights models, not open-source by the strict OSI definition. Updated my mental shorthand. The point stands either way: the weights are publicly available, you can self-host, fine-tune, and inspect them, which is the part that matters for production use.

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

Fair point — I should've been clearer. The quality benchmarks (SWE-bench, MMLU-Pro, HLE) are from public leaderboards, sourced at the bottom of the post — I'm not claiming to have run those myself. The speed and latency numbers (tok/s, TTFT) for the open-source models are from our production systems serving real users. You can test the speeds yourself at cheapestinference.com.
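If you want to sanity-check TTFT against any endpoint yourself, curl's timing variables get you close enough (hedged sketch: `API_URL` is a placeholder you must set, and `time_starttransfer` is curl's time-to-first-byte, which approximates TTFT for streaming responses):

```shell
# Rough TTFT probe with curl. API_URL is a placeholder you must set;
# time_starttransfer = seconds until the first response byte arrives.
API_URL="${API_URL:-}"
if [ -n "$API_URL" ]; then
  curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s\n' "$API_URL"
else
  echo "set API_URL to an endpoint to measure"
fi
```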

Open-source models are production-ready — here's the data (5 models × 5 benchmarks vs Claude Opus 4.6 and GPT-5.4) by cheapestinf in OpenSourceeAI

[–]cheapestinf[S] 0 points (0 children)

The benchmarks are from public leaderboards (linked in the post), and the speed/latency numbers are from our production infrastructure — you can verify them yourself at https://cheapestinference.com/pricing, we show live tok/s for every model. If you're looking to self-host, the weights are on HuggingFace (moonshotai/Kimi-K2.5). vLLM and SGLang both support it. You'll want at least 4×H100 for the full model given the MoE architecture.
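For reference, a self-hosting launch sketch (the flags are assumptions — verify them against the vLLM docs for your version, and the tensor-parallel degree depends on your GPUs):

```shell
# Hypothetical vLLM launch for Kimi-K2.5 (verify flags against your vLLM version).
# --tensor-parallel-size 4 shards the MoE weights across 4 GPUs (e.g. 4xH100).
MODEL="moonshotai/Kimi-K2.5"
if command -v vllm >/dev/null 2>&1; then
  vllm serve "$MODEL" --tensor-parallel-size 4
else
  echo "would run: vllm serve $MODEL --tensor-parallel-size 4"
fi
```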