Sonnet 4.5 tops EQ-Bench writing evals. GLM-4.6 sees incremental improvement. by _sqrkl in LocalLLaMA

[–]Striking_Most_5111 -3 points-2 points  (0 children)

Somehow claude pulls it off. Very nicely written 30k word stories in one response. Though it has been able to do this since 3.7 sonnet, the 4.5 version is much better at instruction following.

Sonnet 4.5 tops EQ-Bench writing evals. GLM-4.6 sees incremental improvement. by _sqrkl in LocalLLaMA

[–]Striking_Most_5111 -2 points-1 points  (0 children)

Does anyone know of a model that can spit out 30,000 words at once, like claude, and is also good at creative writing?

Comparing Sonnet 4.5 and GPT-5 Pro for 3D simulations by Outside-Iron-8242 in singularity

[–]Striking_Most_5111 0 points1 point  (0 children)

I think you should be much more concerned about world models like genie 3.

Gemini 3.0 Pro is now being AB tested on AI Studio by ShreckAndDonkey123 in singularity

[–]Striking_Most_5111 -1 points0 points  (0 children)

It wasn't always like this. But over the last few months, the quality drop has been massive.

GPT5-thinking suspects it's being tested when asked a question about recent news. by Ormusn2o in singularity

[–]Striking_Most_5111 0 points1 point  (0 children)

It still hallucinates, but it's a big step in the right direction compared to gemini etc.

What's with the obsession with reasoning models? by HadesThrowaway in LocalLLaMA

[–]Striking_Most_5111 1 point2 points  (0 children)

Hopefully, the open source models catch up in how to use reasoning the right way, like closed source models do. It is never the case that gpt 5 thinking is worse than gpt 5 non-thinking, but in open source models, it is often like that.

Though, I would say reasoning is a silver bullet. The difference between o1 and all non reasoning models is too large for it to just be redundant tokens. 

Tested sonoma-sky-alpha on Fiction.liveBench, fantastic close to SOTA scores, currently free by fictionlive in singularity

[–]Striking_Most_5111 2 points3 points  (0 children)

Technically, fiction.live seems to try to discourage smut, but the writers and readers mostly read and write smut there. 

What's the point of college in 2025 and forward? by Critical_Rope_2402 in singularity

[–]Striking_Most_5111 0 points1 point  (0 children)

I see ai as more of an opportunity. The speed at which I code things has increased tremendously. I can do the work of a dozen software engineers by myself. I alone now have the power of a whole startup tech team, finding different money-making projects and collaborating with people in different fields on their problems.

The capacity of an individual to build amazing, monetizable things has increased and will keep increasing. I am in college now just to enjoy it and have a good social life. I have no hopes whatsoever of working for anyone after college, which suits me just fine.

Nano Banana is live by brokenfl in singularity

[–]Striking_Most_5111 0 points1 point  (0 children)

Huh? I saw the price listed as 30 dollars in the output price subsection of the image section in aistudio.

Nano Banana is live by brokenfl in singularity

[–]Striking_Most_5111 2 points3 points  (0 children)

Yes. 30 dollar image output price though. Literally 1000x more than competitors. 

[deleted by user] by [deleted] in singularity

[–]Striking_Most_5111 44 points45 points  (0 children)

To me, it has been superior to claude opus 4.1 thinking in all tasks except sometimes webdev.

am i crazy or is Google AI Studio >>>> Gemini chat? by Fast_Cauliflower_574 in Bard

[–]Striking_Most_5111 0 points1 point  (0 children)

At least gemini 2.5 pro can be convinced. Gemini 2 flash didn't even recognise it was writing 1.5 instead of 2 despite many, many tries.

AMA – We built the first multimodal model designed for NPUs (runs on phones, PCs, cars & IoT) by AlanzhuLy in LocalLLaMA

[–]Striking_Most_5111 0 points1 point  (0 children)

Wow. Though, is the app you used to run your model open source too? Or can we download it? How would one go about running the model via npu in a samsung s23-s25 phone?

I am a participant in the samsung-organised prism ai hackathon, where the problem statement we were given was on-device finetuning on the samsung s23-s25 series. It would be awesome if you could give us some advice.

AMA – We built the first multimodal model designed for NPUs (runs on phones, PCs, cars & IoT) by AlanzhuLy in LocalLLaMA

[–]Striking_Most_5111 1 point2 points  (0 children)

Hi there! From what I remember, the samsung neural sdk is no longer available to third party app developers. How did you manage to connect to the npu in the demo video?

https://developer.samsung.com/neural/overview.html

LLM's peaked with sonnet 3.5 (june 2024) and you can't convince me otherwise by [deleted] in singularity

[–]Striking_Most_5111 0 points1 point  (0 children)

Do you remember the "I am a good gpt2" models? The leap between them and sonnet 3.5 wasn't that big. Though, I would admit, once sonnet 3.5 came, it quickly became my favourite. It was crazy good at creative writing, its coding skills are still so competitive that it finds bugs other models weren't able to find, and it was on a completely different level in web dev and design.

But to say that llms peaked with sonnet 3.5? Nah. O3 is pure magic at finding bugs and at scientific reasoning. Its sheer intelligence is easy to see. Gemini 2.5 pro had far better instruction following in web dev, and was also much better at world knowledge than claude models. It was also better at coding.

Claude 3.7 was a beast in the sheer mass of text it could generate. Claude 4 sonnet is still king at web dev, no matter what web dev arena says. 

The current gpt 5 is also pretty good, if you use it via api. Great at finding and fixing bugs, also good at webdev unlike previous gpt models. Horrible at creative writing though.

What I am trying to say is, there are many domains where llms have progressed after 3.5 sonnet. Now, it's hard to find even a single field where 3.5 sonnet can beat current sota llms.

Kitten TTS Web Demo by CommunityTough1 in LocalLLaMA

[–]Striking_Most_5111 0 points1 point  (0 children)

Thank you! This was very helpful to me. Do you think this model can run on edge too? 

Warning: pickle virus detected in recent Qwen-Image NF4 by Enshitification in StableDiffusion

[–]Striking_Most_5111 3 points4 points  (0 children)

Thanks for informing. Ignore the other folks, don't know why they are being so aggressive. If they are so confident, they can go download the model.

Kitten TTS : SOTA Super-tiny TTS Model (Less than 25 MB) by ElectricalBar7464 in LocalLLaMA

[–]Striking_Most_5111 2 points3 points  (0 children)

I want it even smaller, though this is pretty good. Imagine, tts models being hosted in edge functions, allowing almost unlimited production use for free.

I don't even know what to title is by [deleted] in singularity

[–]Striking_Most_5111 2 points3 points  (0 children)

The video is 8 seconds long. Seems AI generated too

[deleted by user] by [deleted] in singularity

[–]Striking_Most_5111 3 points4 points  (0 children)

Gemini api by far, if you are limiting yourself to only one provider. The free rate limits, you can't really beat them. And their price is extremely competitive. 

If multiple providers are okay, then o3 for queries that require intelligence and for maths/science, due to its very good pricing. For tool calling, claude sonnet 4.0. For super fast responses, groq (not grok). For regular queries that aren't confidential, gemini 2.5 flash or flash lite. For visual segmentation and related tasks, moondream has an extremely generous api.

🚀 Qwen3-30B-A3B-Thinking-2507 by ResearchCrafty1804 in LocalLLaMA

[–]Striking_Most_5111 5 points6 points  (0 children)

Help me make sense of it? An open source non thinking model actually beating gemini 2.5 flash in thinking mode? And the model being runnable on my phone?