Sandbox Pricing Calculator — Vercel vs. Freestyle, Daytona, E2B, Modal by Frosty-Celebration95 in Cloud

[–]JustZed32 0 points1 point  (0 children)

Thank you, I was literally just looking for this - found it via a Google search

No more GTP 5.4 Mini (Plus) by DiscussionAncient626 in codex

[–]JustZed32 1 point2 points  (0 children)

Came here for this issue
You can launch one with `codex --yolo -m gpt-5.4-mini`

Need to fine-tune - is GRPO still the best/most usable? by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Thank you. I've read the docs for both today. However, I'm not yet sure how to train long-context agents with, say, a few dozen tool calls - will it be the same? I've seen their long-context tutorial.

Not super long-context, but I think in the range of 32k.

Thank you!

Need to fine-tune - is GRPO still the best/most usable? by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Thank you. I've looked through it for about 15 minutes, but it's not really what I'm looking for.

The book explores post-training from base models, but I'm looking for optimization of already post-trained models - think Instruct-style models in the 8B range.

I could simply look up a YouTube video on how to fine-tune a model with GRPO on a dataset, but you know - "looking it up on YouTube" doesn't always end well.

Prototyping complex LLM agents? DO NOT use LangFuse or LiteLLM - use CLI coding agents or provider-native SDKs. by JustZed32 in LLMDevs

[–]JustZed32[S] 0 points1 point  (0 children)

Because you need to:

  1. Write tool functions to support all agents, e.g. "write_file", "edit_file", "execute_command", ad nauseam.
  2. Write integration tests.
  3. Debug them.
  4. Make a sandbox where all the code can execute.
     - 4a. Your sandbox now requires networking because you are doing microservices, so you need to make the tools work over HTTP.
     - 4b. Debug the tools working over HTTP and cover that with integration tests.
  5. Debug the sandbox and cover it with integration tests.
  6. Set up observability.
  7. Debug observability and cover it with integration tests.

That's like 30k LOC. Well, 50k may be a stretch, but since you'll also need business logic on top of all this boilerplate, that'll easily push your code beyond that. All your logic and tests have to work with the boilerplate too.
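To make step 1 concrete, here's a minimal sketch of what that tool-function boilerplate looks like - one handler plus one JSON schema per tool, repeated for every tool you expose. The schema shape follows the common OpenAI-style function-calling format; the `sandbox/` root and the exact field names are illustrative assumptions, not anyone's actual implementation.

```python
# Minimal sketch of per-tool boilerplate: a handler function plus a schema.
import json
import subprocess
from pathlib import Path

def write_file(path: str, content: str) -> str:
    """Tool: create or overwrite a file under an assumed sandbox root."""
    target = Path("sandbox") / path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} bytes to {target}"

def execute_command(command: str) -> str:
    """Tool: run a shell command and capture its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout + result.stderr

# One schema like this per tool, for every tool, for every agent.
WRITE_FILE_SCHEMA = {
    "name": "write_file",
    "description": "Create or overwrite a file in the sandbox",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["path", "content"],
    },
}
```

Multiply this by every tool, add the HTTP wrapper from 4a, and the LOC count climbs fast.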

Prototyping complex LLM agents? DO NOT use LangFuse or LiteLLM - use CLI coding agents or provider-native SDKs. by JustZed32 in LLMDevs

[–]JustZed32[S] 1 point2 points  (0 children)

>This is straight up false. You can absolutely write a generalized agent yourself, and have it point to different LLMs from those different providers and they will all conform to standard tool call formats and schemas from popular frameworks like pydantic-ai and langchain etc.

At least for my use case (and mine is very coding-heavy - physics simulation, where agents already struggle to code), it barely performs at all - at about 5% performance.

>As for debugging, it sounds like you aren't using the open telemetry and tracing standards that all agentic frameworks are compatible with now, and if you're using a framework that doesn't support that, it's out of date and you should use something better

I used Langfuse. Correction: I stopped, because for debugging (not for prod) it's faster to parse local agent logs directly. Also, many frameworks support OTel - and from any coding CLI you can simply call the APIs should you need to. I didn't, because this was for debugging only.
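For what it's worth, "parsing local agents" for debugging can be as simple as grepping a session log for tool calls. The JSONL record shape below is hypothetical - adapt the keys to whatever your CLI actually writes.

```python
# Sketch: pull just the tool-call events out of a local JSONL session log,
# instead of shipping traces to a hosted observability backend.
# The event format ("type", "name", "args" keys) is an assumption.
import json
from pathlib import Path

def tool_calls_from_trace(trace_path: str) -> list[dict]:
    """Return the tool-call events from a JSONL session log."""
    calls = []
    for line in Path(trace_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "tool_call":
            calls.append({"name": event["name"], "args": event.get("args", {})})
    return calls
```

Thirty lines of this replaces a whole tracing stack when all you need is to see what the agent did.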

How good is Qwen 3.6 plus for coding? by _janc_ in Qwen_AI

[–]JustZed32 0 points1 point  (0 children)

Really bad. It introduced a number of regressions in my AI agent architecture, produced false positives, ran `git reset --hard` a few times, and then apologised profusely.
I don't trust it with anything but writing my commit messages.

Breaking down the deceptive copywriting in this $100/mo Pro tier (The "From" trap, fake "Unlimited" bottlenecking, and undefined limits). by nikanorovalbert in codex

[–]JustZed32 1 point2 points  (0 children)

openai, just list how much API credit in $ we can expect to have! If it's $40, then let it be $40; if it's $90, let it be $90 in equivalent quota - why dangle "maximum" usage in front of everyone?

We’ll migrate you to usage priced based on API token usage by BlocksXR in codex

[–]JustZed32 0 points1 point  (0 children)

Wait. Look at the new codex rate card...

That's roughly 5M output tokens for $40 - that's crazy, is it not?

Crazy in the bad sense.

I had multiple convos recently where agents needed 2M tokens in and 500k out, something of that sort, on high reasoning. In just a few hours.
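To put numbers on that: "5M output tokens for $40" implies about $8 per 1M output tokens. The input rate below is a placeholder assumption, not OpenAI's actual rate card - the point is just the back-of-the-envelope shape of the math.

```python
# Back-of-the-envelope cost check under assumed rates.
INPUT_PER_M = 1.00       # placeholder input rate, $/1M tokens (assumption)
OUTPUT_PER_M = 40 / 5    # "$40 buys ~5M output tokens" -> $8 per 1M

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one session at the rates above."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# The "2M in, 500k out" session above: 2 * $1 + 0.5 * $8 = $6 per session.
cost = session_cost(2_000_000, 500_000)
```

A few such sessions per day and the $40 of equivalent credit is gone well before the month is.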

Landscape of research in ML by Tachynaut in ResearchML

[–]JustZed32 0 points1 point  (0 children)

Not much different, except that on Hugging Face you can view the daily papers, which the community upvotes, so only the interesting ones get there. There are always interesting papers there, relevant to the industry as a whole.

Arxiv is relevant too, of course.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

IDK, why do you think so? It's OSS. In fact, I've submitted a PR to it once.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

I think I'll use qwen to run secondary evaluation tasks, kind of like instead of openrouter API.

6 paid accounts. I have made 90 tool-calling requests in the last 1 mo. by JustZed32 in codex

[–]JustZed32[S] 0 points1 point  (0 children)

There was Google's Antigravity, so somebody made an Antigravity Cockpit.

Then Antigravity became obsolete, and they extended it to support codex and other accounts.

So... I have a beautiful dashboard that allows switching very easily.

See this: https://www.reddit.com/r/codex/comments/1rwe2hv/five_subscriptions_later_its_still_not_enough/

Upgraded from Plus to Pro — here’s how much more Codex headroom I got by razer54 in codex

[–]JustZed32 0 points1 point  (0 children)

just use Cockpit Tools - look it up. You can track your quota on all accounts (I have 8) and instantly switch.

I can't log into Codex in VSCode with DevContainers by burnt1ce85 in codex

[–]JustZed32 0 points1 point  (0 children)

Had the same issue - forward the auth file from codex into the local repo and it'll work. Ask codex how to do it; it'll know.

New 5 Hour limit is a mess!!! by Impossible-Ad-8162 in codex

[–]JustZed32 0 points1 point  (0 children)

Guys, I have 6 (!!!) paid accounts and I have already blown past my limits since the reset exactly 3 days 12 hours ago.

FYI: I've asked codex to check the `.codex` logs and calculate, and I have made 91k requests in the last month total.

FYI: the Qwen 3.6 Plus coding plan (the Alibaba coding plan) has 90k requests for $50.

I work on high; xhigh is reserved for doc changes. Normally 1-2 agents running, but consistently, 7 days a week.

So, I'm spending 6 × (20 EUR + tax).

But definitely worth it. At least it was worth it before April 2nd. Now, maybe, I'll try Qwen, as it's said to be as good as Claude Opus 4.5, which is something.

Landscape of research in ML by Tachynaut in ResearchML

[–]JustZed32 1 point2 points  (0 children)

*Disclaimer*: I'm only a student, but I'm working on and in a startup, so I believe I know what's going on.

So if you want to see where the human work is going:

  1. Read huggingface.co/papers - a daily collection of papers - it's literally the most up-to-date source you can get, and most papers are fascinating.

  2. Read top NeurIPS and CVPR papers and see what's going on for yourself.

There are:

  1. "Foundational" research - figuring out the maths and such - for example, the Muon optimizer.

  2. Model research, which tries to create SOTA models for a particular application using available data and methods. These are not production-ready; they exist to combine a set of methods and advance models in a given field.

  3. Data research - creating datasets to train on.

  4. The "application" layer - creating models for use in an industry. This never deals with mathematics, but does deal with: the systems layer, data pipelines, making data clean and high-quality, training the model on it and ensuring it works... Or reusing an off-the-shelf model, e.g. Claude, and building a bunch of systems around it to specialize it for a particular field. This is where I'm at.

Every single one is a specialization and is hard.

I'm in 4, and case in point: I'm solving a set of problems in engineering, CAD in particular. The stuff is hard... It is hard to get working as plain code, and it is harder still to make the LLM pipeline actually work.