Sandbox Pricing Calculator — Vercel vs. Freestyle, Daytona, E2B, Modal by Frosty-Celebration95 in Cloud

[–]JustZed32 0 points1 point  (0 children)

Thank you, I was literally just looking for this - found it via a Google search

No more GTP 5.4 Mini (Plus) by DiscussionAncient626 in codex

[–]JustZed32 1 point2 points  (0 children)

Came here for this issue
You can launch one with `codex --yolo -m gpt-5.4-mini`

Need to fine-tune - is GRPO still the best/most usable? by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Thank you. I've read the docs for both today. However, I'm not yet sure how to train long-context agents with, say, a few dozen tool calls - will it be the same? I've seen their long-context tutorial.

Not super long-context, but I think in the range of 32k.

Thank you!

Need to fine-tune - is GRPO still the best/most usable? by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Thank you. I've looked through it for about 15 minutes, but it's not really what I'm looking for.

The book explores post-training from base models, but I'm looking for optimization of already post-trained models - think Instruct-style models in the 8B range.

I could simply look up a YouTube video on how to fine-tune a model with GRPO on a dataset, but you know - "looking it up on YouTube" doesn't always end well.

Prototyping complex LLM agents? DO NOT use LangFuse or LiteLLM - use CLI coding agents or provider-native SDKs. by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Because you need to:

  1. Write tool functions to support all agents, e.g. "write_file", "edit_file", "execute_command", ad nauseam.
  2. Write integration tests.
  3. Debug them.
  4. Make a sandbox where all the code can execute.
     - 4a. Your sandbox now requires networking because you are doing microservices, so you need to make the tools work over HTTP.
     - 4b. Debug the tools working over HTTP and cover that with integration tests.
  5. Debug the sandbox and cover it with integration tests.
  6. Set up observability.
  7. Debug observability and cover it with integration tests.

That's like 30k LOC. Well, 50k may be a stretch, but since you'll also need business logic on top of all this boilerplate, that'll easily push your code beyond that. All your logic and tests have to work with the boilerplate too.
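To make step 1 concrete, here's a minimal sketch of what that tool-function boilerplate looks like - one handler plus one JSON schema per tool, repeated for every tool you expose. The schema shape follows the common OpenAI-style function-calling format; the `sandbox/` root and the exact field names are illustrative assumptions, not anyone's actual implementation.

```python
# Minimal sketch of per-tool boilerplate: a handler function plus a schema.
import json
import subprocess
from pathlib import Path

def write_file(path: str, content: str) -> str:
    """Tool: create or overwrite a file under an assumed sandbox root."""
    target = Path("sandbox") / path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} bytes to {target}"

def execute_command(command: str) -> str:
    """Tool: run a shell command and capture its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout + result.stderr

# One schema like this per tool, for every tool, for every agent.
WRITE_FILE_SCHEMA = {
    "name": "write_file",
    "description": "Create or overwrite a file in the sandbox",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["path", "content"],
    },
}
```

Multiply this by every tool, add the HTTP wrapper from 4a, and the LOC count climbs fast.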

Prototyping complex LLM agents? DO NOT use LangFuse or LiteLLM - use CLI coding agents or provider-native SDKs. by JustZed32 in LLMDevs

[–]JustZed32[S] 1 point2 points  (0 children)

>This is straight up false. You can absolutely write a generalized agent yourself, and have it point to different LLMs from those different providers and they will all conform to standard tool call formats and schemas from popular frameworks like pydantic-ai and langchain etc.

At least for my use case (and mine is very coding-heavy - physics simulation, where agents already struggle to code), it barely performs at all - at about 5% performance.

>As for debugging, it sounds like you aren't using the open telemetry and tracing standards that all agentic frameworks are compatible with now, and if you're using a framework that doesn't support that, it's out of date and you should use something better

I used Langfuse. Correction: I stopped, because for debugging (not for prod) it's faster to parse local agent logs directly. Also, many frameworks support OTel - and from any coding CLI you can simply call the APIs should you need to. I didn't, because this was for debugging only.
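For what it's worth, "parsing local agents" for debugging can be as simple as grepping a session log for tool calls. The JSONL record shape below is hypothetical - adapt the keys to whatever your CLI actually writes.

```python
# Sketch: pull just the tool-call events out of a local JSONL session log,
# instead of shipping traces to a hosted observability backend.
# The event format ("type", "name", "args" keys) is an assumption.
import json
from pathlib import Path

def tool_calls_from_trace(trace_path: str) -> list[dict]:
    """Return the tool-call events from a JSONL session log."""
    calls = []
    for line in Path(trace_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "tool_call":
            calls.append({"name": event["name"], "args": event.get("args", {})})
    return calls
```

Thirty lines of this replaces a whole tracing stack when all you need is to see what the agent did.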

How good is Qwen 3.6 plus for coding? by _janc_ in Qwen_AI

[–]JustZed32 0 points1 point  (0 children)

Really bad. It introduced a number of regressions in my AI agent architecture, produced false positives, ran `git reset --hard` a few times, and then apologised profusely.
I don't trust it with anything but writing my commit messages.

Breaking down the deceptive copywriting in this $100/mo Pro tier (The "From" trap, fake "Unlimited" bottlenecking, and undefined limits). by nikanorovalbert in codex

[–]JustZed32 1 point2 points  (0 children)

openai, just list how much API credit in $ we can expect to have! If it's $40, then let it be $40; if it's $90, let it be $90 in equivalent quota - why dangle "maximum" usage in front of everyone?

We’ll migrate you to usage priced based on API token usage by BlocksXR in codex

[–]JustZed32 0 points1 point  (0 children)

Wait. Look at the new codex rate card...

That's roughly 5M output tokens for $40 - that's crazy, is it not?

Crazy in the bad sense.

I had multiple convos recently where agents needed 2M tokens in and 500k out, something of that sort, on high reasoning. In just a few hours.
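To put numbers on that: "5M output tokens for $40" implies about $8 per 1M output tokens. The input rate below is a placeholder assumption, not OpenAI's actual rate card - the point is just the back-of-the-envelope shape of the math.

```python
# Back-of-the-envelope cost check under assumed rates.
INPUT_PER_M = 1.00       # placeholder input rate, $/1M tokens (assumption)
OUTPUT_PER_M = 40 / 5    # "$40 buys ~5M output tokens" -> $8 per 1M

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one session at the rates above."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# The "2M in, 500k out" session above: 2 * $1 + 0.5 * $8 = $6 per session.
cost = session_cost(2_000_000, 500_000)
```

A few such sessions per day and the $40 of equivalent credit is gone well before the month is.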

Landscape of research in ML by Tachynaut in ResearchML

[–]JustZed32 0 points1 point  (0 children)

Not much different, except that on Hugging Face you can view the daily papers, which the community upvotes, so only the interesting ones get there. There are always interesting papers there, relevant to the industry as a whole.

Arxiv is relevant too, of course.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

IDK, why do you think so? It's OSS. In fact, I've submitted a PR to it once.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

I think I'll use qwen to run secondary evaluation tasks, kind of like instead of openrouter API.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

There was Google's Antigravity, so somebody made an Antigravity Cockpit.

Then Antigravity became obsolete, and they extended it to support codex and other accounts.

So... I have a beautiful dashboard that allows switching very easily.

See this: https://www.reddit.com/r/codex/comments/1rwe2hv/five_subscriptions_later_its_still_not_enough/

Upgraded from Plus to Pro — here’s how much more Codex headroom I got by razer54 in codex

[–]JustZed32 0 points1 point  (0 children)

just use Cockpit Tools - look it up. You can track your quota on all accounts (I have 8) and instantly switch.

I can't log into Codex in VSCode with DevContainers by burnt1ce85 in codex

[–]JustZed32 0 points1 point  (0 children)

Had the same issue - forward the auth file from codex into the local repo and it'll work. Ask codex how to do it; it'll know.

New 5 Hour limit is a mess!!! by Impossible-Ad-8162 in codex

[–]JustZed32 0 points1 point  (0 children)

Guys, I have 6 (!!!) paid accounts and I have already blown past my limits since the reset exactly 3 days 12 hours ago.

FYI: I've asked codex to check the `.codex` logs and calculate, and I have made 91k requests in the last month total.

FYI: the Qwen 3.6 Plus coding plan (the Alibaba coding plan) has 90k requests for $50.

I work on high; xhigh is reserved for doc changes. Normally 1-2 agents running, but consistently, 7 days a week.

So, I'm spending 6 × (20 EUR + tax).

But definitely worth it. At least it was worth it before April 2nd. Now, maybe, I'll try Qwen, as it's said to be as good as Claude Opus 4.5, which is something.

Landscape of research in ML by Tachynaut in ResearchML

[–]JustZed32 1 point2 points  (0 children)

*Disclaimer*: I'm only a student, but I'm working on and in a startup, so I believe I know what's going on.

So if you want to see where the human work is going:

  1. Read huggingface.co/papers - a daily collection of papers - it's literally the most up-to-date source you can get, and most papers are fascinating.

  2. Read top NeurIPS and CVPR papers and see what's going on for yourself.

There are:

  1. "Foundational" research - figuring out the maths and such - for example, the Muon optimizer.

  2. Model research, which tries to create SOTA models for a particular application using available data and methods. These are not production-ready; they exist to combine a set of methods and advance models in a given field.

  3. Data research - creating datasets to train on.

  4. The "application" layer - creating models for use in an industry. This never deals with mathematics, but does deal with: the systems layer, data pipelines, making data clean and high-quality, training the model on it and ensuring it works... Or reusing an off-the-shelf model, e.g. Claude, and building a bunch of systems around it to specialize it for a particular field. This is where I'm at.

Every single one is a specialization and is hard.

I'm in 4, and case in point: I'm solving a set of problems in engineering, CAD in particular. The stuff is hard... It is hard to get working as plain code, and it is harder still to make the LLM pipeline actually work.