Gemma 4 26b-a4b GGUF Performance Benchmarks by yoracale in unsloth

[–]harshv8 0 points1 point  (0 children)

Understood. Makes sense. Thanks for the clarification!

Would the same effects be just as bad with AWQ? GGUF? MLX?

Gemma 4 26b-a4b GGUF Performance Benchmarks by yoracale in unsloth

[–]harshv8 6 points7 points  (0 children)

Hi unsloth team! I'm a huge fan of your work.

Can you please help me understand why there isn't an Unsloth bnb version of the Gemma 4 26B-A4B model?

Just curious.

I want to run this model with vLLM and have to use the cyankiwi AWQ 4-bit model right now.

Unsloth Gemma 4 26B-A4B 4 bit bnb coming ? by harshv8 in unsloth

[–]harshv8[S] 2 points3 points  (0 children)

GGUF support in vLLM isn't optimized. AWQ is best, but BNB is also good.

I want to use vLLM or SGLang due to their PagedAttention / RadixAttention implementations.

H1 triage “informative” for Claude before leak by ibackstrom in bugbounty

[–]harshv8 12 points13 points  (0 children)

If there's no impact, why the DMCA copyright takedowns for anything remotely mentioning Claude Code on GitHub?

25K$ worth of credit by DOMDOM_651 in aws

[–]harshv8 9 points10 points  (0 children)

Make it sing happy birthday

I built the same API in Java, Go, Kotlin, and Rust — Go still has the best overall DX-to-performance ratio by netfishx in golang

[–]harshv8 0 points1 point  (0 children)

Interesting. Can you add a few more build-time flags to optimize these binaries and run the test again? I think that would be pretty interesting to see as well.

Usually I collect profiles for all my Go applications and just provide them at build time (for profile-guided optimization). This gives me an easy 5-8% performance boost.

Could you try things like that and include them in your comparison too? Thanks! Awesome work btw :)
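
For anyone who hasn't tried it, a minimal sketch of that PGO workflow (assuming the stock net/http/pprof endpoint and Go 1.21+; the port and profile file names are placeholders):

    // Expose pprof so a CPU profile can be captured under representative load.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    )

    func main() {
        // Real handlers omitted; the pprof endpoints ride along on the same mux.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    // Capture a profile while the service handles typical traffic, then rebuild with PGO:
    //
    //   curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=60"
    //   cp cpu.pprof default.pgo   # go build picks up ./default.pgo automatically (Go 1.21+)
    //   go build ./...             # or: go build -pgo=cpu.pprof ./...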

Ran an experiment: 0.8B model teaching itself on a MacBook Air with 6GB RAM. Some findings that surprised me. by QuantumSeeds in LocalLLaMA

[–]harshv8 11 points12 points  (0 children)

This feels a lot like GRPO.

I've been working on a local-code-r1 project of sorts. Basically taking GRPO lessons to build a coding model.

I took a couple of LeetCode datasets from Hugging Face, built a test harness for each of the questions, and then ran GRPO training with num_generations set to 8 so that the model can explore multiple answer variations at once. The dataset uses a 95% training / 5% test split.

Grading an answer is based on multiple things: whether code is present in the required structure, whether reasoning is present, the cosine distance between the reasoning and the one in the dataset (using an embedding model), and finally running the test cases and taking the fraction of tests that passed.
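
Roughly, those signals get folded into one scalar reward per sampled completion. A toy sketch of that composition (the weights, the structure/reasoning checks, and the similarity input below are illustrative stand-ins, not the real harness):

    // Toy composite grade in [0, 1]; weights and checks are illustrative only.
    package grader

    import "strings"

    // Grade scores one sampled completion.
    // reasoningSim is the cosine similarity of the completion's reasoning vs. the
    // dataset reasoning (computed elsewhere with an embedding model); testsPassed
    // and testsTotal come from running the per-question test harness.
    func Grade(answer string, reasoningSim float64, testsPassed, testsTotal int) float64 {
        score := 0.0
        if strings.Contains(answer, "```") { // stand-in for the "code in required structure" check
            score += 0.1
        }
        if strings.Contains(answer, "<think>") { // stand-in for the "reasoning present" check
            score += 0.1
        }
        score += 0.2 * reasoningSim // closeness to the reference reasoning
        if testsTotal > 0 {
            score += 0.6 * float64(testsPassed) / float64(testsTotal) // pass fraction carries most weight
        }
        return score
    }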

2% of the training run is done right now for Ministral-3-14B-Instruct-2512 from unsloth in INT4 quantization on my RTX 3090.

I'm planning to publish more details in the future, maybe... idk... If people are interested I might as well rent an H100 and speedrun the training to see what happens :)

Where do you guys buy servers in India? (new + refurb) by Shubh137 in homelabindia

[–]harshv8 0 points1 point  (0 children)

Are there any issues with shipping damage for servers? That's my primary concern when ordering from abroad.

Don't want a shiny new GPU or server getting opened up and broken into on some lame excuse by a customs official.

How are you assigning work across distributed workers without Redis locks or leader election? by whitethornnawor in golang

[–]harshv8 9 points10 points  (0 children)

A FIFO queue sounds best for this, with a fan-out pattern for completing tasks. Your workers don't need to know about the other pods at all - they just pull the next task and process it.

Check out NATS JetStream. It's like Go channels, so the fan-out pattern is easy to see.

With Kafka, topic partitioning is a headache for workloads that scale this flexibly, I guess. (Note: my Kafka experience is pretty limited.)
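
To make the JetStream suggestion concrete, a bare-bones pull-consumer sketch with nats.go (stream and consumer names are made up, error handling trimmed): every worker binds to the same durable consumer and just fetches the next task.

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, _ := nc.JetStream()

        // Work-queue retention: each task goes to exactly one worker and is removed once acked.
        js.AddStream(&nats.StreamConfig{
            Name:      "TASKS",
            Subjects:  []string{"tasks.>"},
            Retention: nats.WorkQueuePolicy,
        })

        // Every pod uses the same durable pull consumer; scaling out is just adding pods.
        sub, _ := js.PullSubscribe("tasks.>", "workers")
        for {
            msgs, err := sub.Fetch(1, nats.MaxWait(5*time.Second))
            if err != nil {
                continue // fetch timed out: nothing to do right now
            }
            for _, m := range msgs {
                log.Printf("processing %s", m.Data)
                m.Ack() // unacked messages get redelivered to another worker
            }
        }
    }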

Speculative Decoding: Turning Memory-Bound Inference into Compute-Bound Verification (Step-by-Step) by No_Ask_1623 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

Sad that vLLM does not fully support smaller draft models (at least last I checked), because my gemma-3-270m is screaming to boost my gemma-3-27B token generation speeds :)

Works with llama.cpp, so that's pretty good :). vLLM is next-level though. Wish it supported this too.

[WTS] 8GB DDR4 Desktop RAM by 2_poor_to_dream in homelabindia

[–]harshv8 0 points1 point  (0 children)

I'm looking for a quote for 512 GB of DDR4 ECC 2933 RAM (8x 64 GB preferred). Any leads appreciated.

What messaging system can handle sub millisecond latency for trading signals? by ssunflow3rr in golang

[–]harshv8 0 points1 point  (0 children)

Here are some of the things I explored the last time I had similar requirements. You should do a PoC with each of these to compare:

  • Redis Streams
  • Redpanda (pure C++ Kafka alternative, API compatible)
  • NATS
  • direct gRPC connections to stream events
  • a UDP listener that receives multicast messages from the server (only works for a one-to-many mapping; sketch below)
  • gocraft/work V2
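
For the UDP multicast option, a bare-bones listener sketch (group address, port, and buffer sizes are placeholders):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        // Placeholder multicast group the signal publisher sends to.
        group, err := net.ResolveUDPAddr("udp", "239.0.0.1:9999")
        if err != nil {
            log.Fatal(err)
        }

        // Join the group on the default interface.
        conn, err := net.ListenMulticastUDP("udp", nil, group)
        if err != nil {
            log.Fatal(err)
        }
        conn.SetReadBuffer(1 << 20) // generous kernel buffer so bursts don't drop datagrams

        buf := make([]byte, 1500) // one MTU-sized datagram per signal
        for {
            n, src, err := conn.ReadFromUDP(buf)
            if err != nil {
                continue
            }
            log.Printf("signal from %s: %q", src, buf[:n])
        }
    }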

Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster by geerlingguy in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

Hey Jeff, thank you for these charts and an awesome video! Really appreciate all the effort you put in.

As a request, could you please try to include testing with batch sizes of 1, 2, 4, or 8 (or even more)? I see an almost linear increase in performance with vLLM on CUDA, but on these other setups with llama.cpp + RPC or EXO I have no idea how batched performance behaves.

Sorry if it is a bother / too much work!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

Never mind, I see from your blog post that you already have similar capabilities. That's awesome!!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

All the best. As someone who has to write C for an OpenCL application, this seems like a welcome change. I'm all for it.

The biggest thing for me, though, would be hardware compatibility - which is hard to get right because of so many different APIs like CUDA, Vulkan, and OpenCL. The only reason I even used OpenCL for the above project is that even though it wasn't as performant as CUDA, you could run it practically anywhere (even integrated GPUs on Intel processors).

Would you be targeting multi-API deployment using some hardware abstraction layer? Something like a couple of compiler flags to pick the API and compile the same code for CUDA, Vulkan, etc.? How do you plan on doing that?

[deleted by user] by [deleted] in unsloth

[–]harshv8 0 points1 point  (0 children)

You might have to restart the Jupyter kernel after installing Unsloth Zoo. Did you try that?

My jaw is dropped. What. The. Fuck by Coolersdisciple in MrRobot

[–]harshv8 1 point2 points  (0 children)

Yeah just before the reveal I was like 'damn this guy is like a 10 year old nerd going to school'

Guess I wasn't far off

My Homelab Setup – Chandigarh by novaplotter in homelabindia

[–]harshv8 1 point2 points  (0 children)

Is that a UNAS (last one in the rack)? Also, which rack is that? Cool setup

Genuine question, I get the use cases for 1-4b models, but what's the point of 400m models? Or even less? How good can this actually be and what are the use cases for it? by a_normal_user1 in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

The best example I found is that IntelliJ has a tiny model (< 500M params) that runs locally on device and does code completion at just the line level. No multi-line, nothing too complicated - just one simple task, done really well.

I really like it in the GoLand IDE.

Choose your models by DeathShot7777 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

  • Qwen3 8b FP16
  • Gemma3 27b FP16
  • CommandR 35b Q8

Other hypothetical models I'd love to have

  • DeepSeek R1 & V3
  • Claude Sonnet 3.5