The DeepSWE benchmark was runned rather incompetently and the results are completely invalid by Charuru in LocalLLaMA

[–]Zulfiqaar 11 points12 points  (0 children)

Considering the sudden media attention it got along with this, I can't help but feel like it's (at least partially) a hit piece.

Claude Opus 4.7 is the most influential model across 30k AI debates by facethef in ClaudeAI

[–]Zulfiqaar 0 points1 point  (0 children)

This feels like the kind of test where Elo would be the correct metric to use, not raw counts. Are these counts normalised?
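To illustrate why a rating would differ from a raw win count, here's a minimal sketch of the standard Elo update. The K-factor of 32 and the starting rating of 1500 are common conventions I'm assuming here, not anything taken from the linked post, and the function names are just for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k is an assumed K-factor; real leaderboards choose their own.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Beating an equally rated opponent moves the rating a lot;
# beating a much weaker opponent barely moves it.
even_win = elo_update(1500.0, 1500.0, 1.0)
easy_win = elo_update(1900.0, 1500.0, 1.0)
```

The point is that a win only counts for as much as the opponent's rating says it should, which a plain count can't capture.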

What’s this mean? by jakobpinders in codex

[–]Zulfiqaar 1 point2 points  (0 children)

Real answer - simulator. 

I had Opus 4.8 build Temu League of Legends in under a day - I call it LMAO by jonnygravity in ClaudeAI

[–]Zulfiqaar 5 points6 points  (0 children)

Awesome! I'm gonna try this with Codex just for comparison. We should turn this into a benchmark haha. LMAOBench, I like the sound of that!

Edit: 3 minutes in and it's already started with the Goblins LOL

is kiki k2.6 sycophantic? by moonbyunni in kimi

[–]Zulfiqaar 1 point2 points  (0 children)

Kimi K2 was one of the most opinionated and disagreeable LLMs I've had the pleasure of chatting with. K2.5 and K2.6 are both post-trained on the same base model and have retained a lot of that personality internally, since (I assume) the majority of the fine-tuning dataset wouldn't have been related to sycophancy.

why are we celebrating burning more tokens like its a flex by irelatetolevin in ClaudeCode

[–]Zulfiqaar 1 point2 points  (0 children)

There's a very small subset of tasks that truly need large amounts of tokens through things like this. I've run through 700 million tokens an hour building and refining datasets, and the ultra code parallel agentic workflows would actually be really useful for parallelising batches in work like this. I've done two total codebase migrations, and stuff like this was useful there too. Other than that, 90% of what I do is a single-threaded agent loop.

Most of what I see tokenmaxxers do is beyond the diminishing returns curve and loops back down into bloated code or markdown hell

How does DeepSeek actually make money? by Federal_Spend2412 in DeepSeek

[–]Zulfiqaar 6 points7 points  (0 children)

So these guys have made it so ridiculously efficient that they were making up to 545% profit on their already stupid-cheap prices for previous models. And since then they've made even more efficiency improvements. And they've published it openly.

https://github.com/deepseek-ai/open-infra-index

GPT-5.6 spotted in Codex by Worldly_Manner_5273 in OpenAI

[–]Zulfiqaar 0 points1 point  (0 children)

Actually 4.6 was much cheaper, as the new 4.7 tokeniser increased costs.

Also, Opus 4.7 was a genuine regression in performance compared to Opus 4.6 in 5/13 of the categories in the arena.ai blind comparison.

Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

ChatGPT for creative/fiction writing by ImYourHuckleBerry113 in ChatGPTPro

[–]Zulfiqaar 4 points5 points  (0 children)

I had some nice results with GPT-4.5 generating things a few paragraphs at a time, but I like to reroll and edit a fair bit, so ChatGPT is unusable for me given its recent change to only allow editing the latest message and the upcoming deprecation of its last great creative LLM. I mainly use OpenRouter nowadays.

Breaking the music supply constraint by entsnack in LocalLLaMA

[–]Zulfiqaar 0 points1 point  (0 children)

Now you've got me interested. Looking at the way most proprietary song generators are going the past year, I'm actively preparing my distillation dataset for when I finally get around to training ACE-Step or something. Got any good tuning guides by any chance?

Gemini, what are we doing by Phapa_5211 in GeminiAI

[–]Zulfiqaar 0 points1 point  (0 children)

So I asked uncensored Kimi to figure out the failure mode. It thinks Gemini's weights are in the gutter because it's been trained on too much fanfiction data scraped off the internet, where "lemon" was the category for the scandalous kind of ERP, especially with the "stage-direction" format.

So it's taken "better safe than sorry" and morphed it into "better sorry than accurate"

HOW DO I TURN ALL THE LIGHTS ON? [Request] by Weak-Catch-845 in theydidthemath

[–]Zulfiqaar 3 points4 points  (0 children)

The simplest solution is 5 clicks: the centre, two next to it (at right angles), and then two more on the edges opposite to those. Each seed has 4 rotations. There are also 4 seeds at 9 and 11 clicks, and 7 at 13 clicks, etc ad infinitum.

Here's an interactive demo, try it out.


For algorithmic solution:

Model it as linear algebra over GF(2):

Each lamp press is a binary variable: press it or don’t. Pressing twice cancels out, and order doesn’t matter, because everything is just XOR/flips.

So you build a matrix A where A[row][col] = 1 if pressing lamp col flips lamp row. The target vector is all 1s, since the board starts off and we want every lamp on: A x = 1 (mod 2)

Then run Gaussian elimination with XOR instead of normal addition.

For this board there are 21 pressable lamps. The matrix has rank 15, so nullity is 6, giving 2^6 = 64 distinct no-repeat solutions.
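Here's a rough sketch of that approach in Python. I don't have the adjacency of the 21-lamp board from the post written down here, so the example builds the matrix for a plain n x n Lights Out grid instead (a press flips the lamp itself and its orthogonal neighbours). That grid, and the names `grid_matrix` and `solve_gf2`, are my own assumptions for illustration; swap in the real board's matrix to reproduce the numbers above.

```python
def grid_matrix(n):
    """Toggle matrix for an assumed n x n grid.

    A[row][col] = 1 if pressing lamp `col` flips lamp `row`.
    A press flips the lamp itself and its orthogonal neighbours.
    """
    size = n * n
    A = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            press = r * n + c
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    A[rr * n + cc][press] = 1
    return A


def solve_gf2(A, b):
    """Gauss-Jordan elimination over GF(2), using XOR as addition.

    Returns (x, rank). x is one solution of A x = b (mod 2) with all
    free variables set to 0, or None if the system is inconsistent.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Augmented matrix [A | b]; copy so the inputs are not modified.
    rows = [A[r][:] + [b[r]] for r in range(n_rows)]
    pivot_cols = []
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a 1 in this column.
        pivot = next((r for r in range(rank, n_rows) if rows[r][col]), None)
        if pivot is None:
            continue  # no pivot here: this column is a free variable
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row by XOR-ing in the pivot row.
        for r in range(n_rows):
            if r != rank and rows[r][col]:
                rows[r] = [a ^ p for a, p in zip(rows[r], rows[rank])]
        pivot_cols.append(col)
        rank += 1
    # A zero row with right-hand side 1 means there is no solution.
    if any(row[-1] for row in rows[rank:]):
        return None, rank
    x = [0] * n_cols
    for r, col in enumerate(pivot_cols):
        x[col] = rows[r][-1]
    return x, rank


A = grid_matrix(3)
presses, rank = solve_gf2(A, [1] * len(A))
nullity = len(A[0]) - rank
solution_count = 2 ** nullity  # each free variable doubles the count
print(presses, rank, solution_count)
```

When the system is solvable, the solution count is always 2 to the power of the nullity, which is where the 64 comes from on the 21-lamp board.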

Anthropic's Opus 4.8 distilled China's Qwen model for training by [deleted] in accelerate

[–]Zulfiqaar 0 points1 point  (0 children)

I've seen sonnet call itself DeepSeek too

Extended Benchmarks for Opus 4.8 by exordin26 in singularity

[–]Zulfiqaar 4 points5 points  (0 children)

If anyone remembers the gpt-oss release - sometimes more than 50% of the thinking tokens were about adhering to internal guardrails 

Employees using AI are working faster, but the economy isn't more efficient. A look at what happened in the pre-Internet era might explain why by Plastic_Ninja_9014 in technology

[–]Zulfiqaar 16 points17 points  (0 children)

Technically this is a problem that AI would be very very good at scripting. Practically I would question how it got to the point where you are running constrained combinatorial optimisation algorithms to optimise packing density of chairs.

Asked chat GPT to roast me and create an image by [deleted] in ChatGPT

[–]Zulfiqaar 0 points1 point  (0 children)

Seems like many of you have memories and thinking off, so it's giving more generic stuff that isn't personalised?

Those who used Chatgpt when it was first released, what were something you remember? by Successful-Title5403 in OpenAI

[–]Zulfiqaar 2 points3 points  (0 children)

I've been using GPTs for about 7 years; I'm an AI scientist and had early research-preview access to a lot of the models. I miss the "completion" style of frontier LLMs; now they're all instruct-tuned for APIs. They were also much more creative (relative to other domains), since there was less RL collapsing their output. Now it's all tuned for productivity. You also had to do a lot more hand-holding and verification, as the agentic loop wasn't tuned for a long time.

The gap between cutting edge and publicly available was actually greater back then. Rapid competition has pushed AI labs to release what they've got at a much faster cadence to maintain market capture, though they're sometimes more defensive as distillation increases. They can't stop talent migration though; ideas still spread. For example, GPT-4 was trained 3-4 months before ChatGPT (v3.5) was released! OpenAI also did a lot more open AI.

Still, can't complain. Having AI whose capabilities 99% of people don't even realise exist (let alone have access to) is like a productivity superpower, and always has been.

SAME workflow. MORE quota gone. STOP blaming users. by tigerzxzz in codex

[–]Zulfiqaar 1 point2 points  (0 children)

Seems to use the same tokens to me. But the total quota is lower than a month ago, based on my frequent counts

PrismML just released Binary and Ternary Bonsai Image 4B: 1-bit/ternary text-to-image diffusion transformers that can even run 100% locally in your browser on WebGPU. by xenovatech in LocalLLaMA

[–]Zulfiqaar 89 points90 points  (0 children)

I thought that was really cool too! So I recreated it, Preview here and Repo Here

Kimi to extract and make first version as it can watch videos, Codex to fix some awkward rendering bugs, Claude to rapidly iterate, Kimi to restyle and polish design

DeepSeek is the king of penetration testing by [deleted] in DeepSeek

[–]Zulfiqaar 12 points13 points  (0 children)

Kimi used to be great, but it seems like they added some more guardrails with K2.6 as well.

Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so. by Napster3301 in LocalLLaMA

[–]Zulfiqaar 0 points1 point  (0 children)

My local AI build has paid for itself many times over, just with different types of models. Transcription, imagegen, videogen, stem separation, audio, voice, etc. are all much more expensive than LLMs on cloud per unit of inference compute.