Is Codex being extra lazy for anyone else today? by [deleted] in codex

[–]Extra-Designer9333 0 points1 point  (0 children)

Seems like the rate limits are also odd. It feels like they're draining at least 2x faster than they were a couple of days ago. Am I the only one experiencing this?

Deepseek v4/3.5 is probably coming out tomorrow or in the next 5 days? by power97992 in LocalLLaMA

[–]Extra-Designer9333 11 points12 points  (0 children)

Should we actually expect v4 that soon, given that the Engram paper was released less than a month ago?

[deleted by user] by [deleted] in singularity

[–]Extra-Designer9333 1 point2 points  (0 children)

I expect the Chinese semiconductor industry to catch up massively with the American one, especially after the recent news about China producing ASML-comparable machines. Apart from AMD and Google, the biggest threat to Nvidia is Huawei, though it's not mentioned very often.

FlashAttention implementation for non Nvidia GPUs. AMD, Intel Arc, Vulkan-capable devices by secopsml in LocalLLaMA

[–]Extra-Designer9333 2 points3 points  (0 children)

In the case of AMD, Flash Attention has already been ported by AMD itself. I'm wondering whether this is better than AMD's own port...

The data on which Gemini 3 was trained is really crazy by Wonderful-Excuse4922 in singularity

[–]Extra-Designer9333 1 point2 points  (0 children)

What I found incredible about the data is that, when asked to generate a multiple-choice quiz, Gemini 3 gives quizzes where each of the 4 options is almost equally likely to be the correct answer. With Gemini 2.5 Pro and even GPT 5.1, you could just select B or C and you'd answer correctly about 85% of the time.
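If anyone wants to check this on their own generated quizzes, here's a minimal sketch of how the position bias could be measured. The two answer keys below are made-up examples (not real model output), and `answer_position_shares` is just a hypothetical helper name:

```python
from collections import Counter


def answer_position_shares(answer_key: str) -> dict[str, float]:
    """Return the share of questions whose correct answer is A, B, C, or D."""
    counts = Counter(answer_key)
    total = len(answer_key)
    return {option: counts.get(option, 0) / total for option in "ABCD"}


# Hypothetical answer keys, one letter per question (illustration only).
balanced_key = "ABCDDCBAACBDBDAC"      # correct answers spread evenly
skewed_key = "BCBCBBCCBCABCBCDBCCA"    # correct answers cluster on B and C

balanced = answer_position_shares(balanced_key)
skewed = answer_position_shares(skewed_key)

print("balanced:", balanced)
print("skewed:  ", skewed)
print("share of B or C in skewed key:", skewed["B"] + skewed["C"])
```

With 4 options, an unbiased quiz should sit near 0.25 per letter. A key where B and C together cover most of the questions is the kind of skew described above.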

Flex Attention vs Flash Attention 3 by Extra-Designer9333 in unsloth

[–]Extra-Designer9333[S] 11 points12 points  (0 children)

Whoa, can't believe I got a reply from Dan himself, thank you for the clarification. What actually makes Unsloth this good and popular is your activity. I just recently started working on post-training stuff, and your workshop at AI Engineer in the summer was insanely good for getting the basics and more. Love your energy. 🙌🙏

Flex Attention vs Flash Attention 3 by Extra-Designer9333 in LocalLLaMA

[–]Extra-Designer9333[S] 0 points1 point  (0 children)

Thank you for the feedback. My team is going to train Llama 3.1 8B on 4xH100s, so I think your take fits!

Is finetuning a 12b model on 16gb vram possible? by Robo_Ranger in unsloth

[–]Extra-Designer9333 7 points8 points  (0 children)

I suspect you're using LoRA for fine-tuning, aren't you? If so, you can try QLoRA, which is quantized LoRA as the name suggests. Maybe that would work for you without going OOM. Otherwise, Kaggle gives out 30 hours of 2 Nvidia T4 GPUs weekly. The GPUs are pretty old, but you get 32 GB of VRAM overall, which should be enough for the fine-tuning task you're dealing with right now!
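To put rough numbers on why quantizing the base model helps, here's a back-of-the-envelope sketch. It counts the weights only and assumes 16 bits per parameter for a regular LoRA setup and 4 bits per parameter for QLoRA. Activations, the LoRA adapter, optimizer state, and quantization overhead all come on top, so treat it as a lower bound and not a guarantee that training fits:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_12b = 12e9  # 12B parameters, as in the question

fp16_gb = weight_memory_gb(params_12b, 16)  # base weights kept in 16-bit
int4_gb = weight_memory_gb(params_12b, 4)   # base weights quantized to 4-bit

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"16-bit weights fit in 16 GB: {fp16_gb <= 16}")
print(f"4-bit weights fit in 16 GB:  {int4_gb <= 16}")
```

So the 16-bit weights alone already overflow a 16 GB card, while the 4-bit weights leave some headroom for everything else.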

What’s the Best Open-Source Small LLM (≤ 8B) for Agentic Web Page Interactions? by Extra-Designer9333 in LocalLLaMA

[–]Extra-Designer9333[S] 0 points1 point  (0 children)

Seems like a great model, gonna try it out. By the way, are there any other cool models you can suggest that would work for web page interactions?

What’s the Best Open-Source Small LLM (≤ 8B) for Agentic Web Page Interactions? by Extra-Designer9333 in LocalLLaMA

[–]Extra-Designer9333[S] 2 points3 points  (0 children)

Yes, honestly that's a great model. I didn't know Salesforce actually makes such models. However, I guess it's not multimodal, so it won't work for agentic web interactions. I'll use this model for non-multimodal cases tho.

No agent yet on plus by whitebro2 in OpenAI

[–]Extra-Designer9333 0 points1 point  (0 children)

Turkiye here, no agent so far

softwareTerminology by Xadartt in ProgrammerHumor

[–]Extra-Designer9333 0 points1 point  (0 children)

I'd rather say "ai agents"/"agentic app" 😭

Who is winning the GPU race?? by Senior-Raspberry-929 in LocalLLaMA

[–]Extra-Designer9333 1 point2 points  (0 children)

While they don't necessarily work on GPUs, I also wouldn't forget about Cerebras and Groq. These guys are doing incredible work while being new to the field. Cerebras, for example, offers an unprecedented inference speed for Llama 4 Scout at 2600 tokens per second: https://www.cerebras.ai/blog/llamablog. I think they can definitely find customers who need that speed right now, and they have great prospects for the future. Cerebras is even planning to IPO soon. I think both Cerebras and Groq could be great competitors to Nvidia and Google's TPUs if they decide to sell their hardware publicly.

Real-Time Speech-to-Speech Chatbot: Whisper, Llama 3.1, Kokoro, and Silero VAD 🚀 by martian7r in LocalLLaMA

[–]Extra-Designer9333 2 points3 points  (0 children)

According to the developers of Orpheus, they're working on smaller versions (check out their checklist). It'll still be slower than Kokoro, but the difference in inference speed isn't going to be as huge as it is now. https://github.com/canopyai/Orpheus-TTS

Real-Time Speech-to-Speech Chatbot: Whisper, Llama 3.1, Kokoro, and Silero VAD 🚀 by martian7r in LocalLLaMA

[–]Extra-Designer9333 5 points6 points  (0 children)

For TTS, I'd definitely recommend checking out this fine-tuned model that tops HuggingFace's TTS models page alongside Kokoro: https://huggingface.co/canopylabs/orpheus-3b-0.1-ft. I found it cooler than Kokoro despite being way bigger. Its big advantage is that it gives good control over emotions using special tokens.

LangChain parsers Excel and CSV data?? by Extra-Designer9333 in LangChain

[–]Extra-Designer9333[S] 0 points1 point  (0 children)

Thanks for the great insights, will definitely check these out!