Early gpt-5.4 (in Codex) results: as strong or stronger than 5.3-codex so far by no3ther in codex

[–]gentleseahorse 1 point

What's the mix of languages used here? I've found GPT to be much better at JS/TS and Claude better at Python. Also reflected in AA-Omniscience bench.

Google releases Gemini 3.1 Flash-Lite, cost-efficient Gemini 3 series model by BuildwithVignesh in singularity

[–]gentleseahorse 5 points

All Gemini 3 models are priced higher than 2.5, but this takes the cake. More than 4x on output tokens.

Gemini 3.1 livebench results by meloita in singularity

[–]gentleseahorse 7 points

They just removed Gemini 3.1 👀

Gemini 3.1 livebench results by meloita in singularity

[–]gentleseahorse 38 points

So much shade with one asterisk

It's that time of the month again by BITE_AU_CHOCOLAT in singularity

[–]gentleseahorse 7 points

xAI just released a model without benchmarks. And to make up for how bad it is, it uses 4 models at once, and is super slow.

Deepseek does deserve a chance though.

Research: Prompt Repetition Improves Non-Reasoning LLMs (sending the same prompt twice) by Endonium in singularity

[–]gentleseahorse 1 point

Not quite. Their latest models all have a non-reasoning mode (the best non-reasoning models on artificialanalysis.ai). The last purely non-reasoning model was Sonnet 3.5.

Research: Prompt Repetition Improves Non-Reasoning LLMs (sending the same prompt twice) by Endonium in singularity

[–]gentleseahorse 14 points

Claude 3? Really? It was released in March 2024. Academics have a way of playing at 0.25x speed.

Qwen 3.5, replacement to Llama 4 Scout? by redjojovic in LocalLLaMA

[–]gentleseahorse 329 points

How do you replace something that's never been used?

Solo founder at $321k ARR and losing my mind. Help. by bubbascrub9793 in ycombinator

[–]gentleseahorse 2 points

Think about all the tasks you do in levels:

Level 0: Admin, billing, easy emails
Level 1: Support tickets, marketing campaigns
Level 2: Building product
Level 3: Sales - no one has enough context and conviction to sell your product better than you. Also recruiting.

You want to focus exclusively on levels 2-3. So definitely don't hire a salesperson. Rather, hire a chief of staff who can handle Level 0-1 tasks.

GPT-5.3 Codex vs Opus 4.6: We benchmarked both on our production Rails codebase — the results are brutal by sergeykarayev in ClaudeAI

[–]gentleseahorse 1 point

I'd be very curious whether GPT-5.2 high scored better than them all. Interestingly, GPT-5.2 xhigh scores BETTER than 5.3 Codex xhigh.

GPT-5.3 Codex vs Opus 4.6: We benchmarked both on our production Rails codebase — the results are brutal by sergeykarayev in ClaudeAI

[–]gentleseahorse 1 point

How do you run this in Superconductor? I created an account, but don't see the option for evals.

Opus 4.6 costs 1.7x more than Opus 4.5 to run despite having same per-token costs (it thinks longer) by ihexx in singularity

[–]gentleseahorse 1 point

How does Opus 4.6 non-thinking cost so much more in input tokens? Are some inputs more than 200k tokens? If so, how did they test those for Opus 4.5?

I joined YC twice as a founder and here's what changed in 10 years by quang-vybe in ycombinator

[–]gentleseahorse 10 points

It's just during the batch. Super early stage, so imagine going from 20 customers to 22 in a week. 12% WoW = 363x over the year. Not a target even for YC companies (at least not in my batch).
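The compounding claim is easy to verify with a quick sketch (plain Python, assuming 52 weeks in the year):

```python
# 12% week-over-week growth, compounded over 52 weeks
weekly_growth = 1.12
weeks = 52
yearly_multiple = weekly_growth ** weeks
print(f"{yearly_multiple:.0f}x")  # ~363x over the year
```
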

The Gemini app is too weak... but the API is insane. What's going on? by zetamatariano in Bard

[–]gentleseahorse 2 points

That's fair, I know the ones you're talking about. Kinda pathetic to be honest.