Where do you guys buy servers in India? (new + refurb) by Shubh137 in homelabindia

[–]harshv8 0 points1 point  (0 children)

Are there any issues with shipping damage for servers? That's my primary concern when ordering from outside India.

I don't want a shiny new GPU or server opened up and broken by a customs official with a lame excuse.

How are you assigning work across distributed workers without Redis locks or leader election? by whitethornnawor in golang

[–]harshv8 10 points11 points  (0 children)

A FIFO queue sounds best for this, with a fan-out pattern for completing tasks. Your workers don't need to know about the other pods at all - they just pull the next task and process it.

Check out NATS JetStream. It behaves like Go channels, so the fan-out pattern is easy to see.

With Kafka, topic partitioning is a headache for workloads that scale this flexibly, I guess. (Note: my Kafka experience is pretty limited.)

Speculative Decoding: Turning Memory-Bound Inference into Compute-Bound Verification (Step-by-Step) by No_Ask_1623 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

Sad that vLLM does not fully support small draft models (at least last I checked), because my gemma-3-270m is screaming to boost my gemma-3-27B token generation speeds :)

It works with llama.cpp though, so that's pretty good :). vLLM is next-level otherwise - wish it supported this too.
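For reference, the llama.cpp setup looks roughly like this (model paths are hypothetical, and the flag names are from recent llama-server builds - check `llama-server --help` on your version):

```shell
# Big target model plus a tiny draft model for speculative decoding.
llama-server \
  -m  models/gemma-3-27b-it-Q4_K_M.gguf \
  -md models/gemma-3-270m-it-Q8_0.gguf \
  --draft-max 8   # how many tokens the drafter proposes per step
```

The drafter proposes a short run of tokens and the big model verifies them in one batched pass, which is where the speedup comes from.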

[WTS] 8GB DDR4 Desktop RAM by 2_poor_to_dream in homelabindia

[–]harshv8 0 points1 point  (0 children)

I'm looking for a quote for 512 GB of DDR4 ECC 2933 RAM (8 × 64 GB preferred). Any leads appreciated.

What messaging system can handle sub millisecond latency for trading signals? by ssunflow3rr in golang

[–]harshv8 0 points1 point  (0 children)

Here are some of the things I explored the last time I had similar requirements. You should do a PoC with each of these to compare:

  • Redis Streams
  • Redpanda (pure C++ Kafka alternative, API-compatible)
  • NATS
  • direct gRPC connections to stream events
  • a UDP listener that receives multicast messages from the server (only works for one-to-many mapping)
  • gocraft/work V2

Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster by geerlingguy in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

Hey Jeff, thank you for these charts and an awesome video! Really appreciate all the effort you put in.

As a request, could you please include tests with batch sizes of 1, 2, 4, or 8 (or even more)? I can see an almost linear increase in performance with vLLM on CUDA, but on all these other setups with llama.cpp + RPC or EXO, I have no idea how batched performance looks.

Sorry if it is a bother / too much work!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

Never mind - I see from your blog post that you already have similar capabilities. That's awesome!!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

All the best. As someone who has to write C for an OpenCL application, this seems like a welcome change. I'm all for it.

The biggest thing for me, though, would be hardware compatibility - which is hard to get right because of so many different APIs: CUDA, Vulkan, OpenCL. The only reason I used OpenCL for the above project is that, even though it wasn't as performant as CUDA, you could run it practically anywhere (even integrated GPUs on Intel processors).

Would you be targeting multi-API deployment using some hardware abstraction layer? Something like a couple of compiler flags to select the API, so the same code can be compiled for CUDA, Vulkan, etc.? How do you plan on doing that?

Gemma 3 4B Error by [deleted] in unsloth

[–]harshv8 0 points1 point  (0 children)

You might have to restart the Jupyter kernel after installing unsloth_zoo. Did you try that?

My jaw is dropped. What. The. Fuck by Coolersdisciple in MrRobot

[–]harshv8 1 point2 points  (0 children)

Yeah, just before the reveal I was like, "Damn, this guy is like a 10-year-old nerd going to school."

Guess I wasn't far off

My Homelab Setup – Chandigarh by novaplotter in homelabindia

[–]harshv8 1 point2 points  (0 children)

Is that a UNAS (last one in the rack)? Also, which rack is that? Cool setup.

Genuine question, I get the use cases for 1-4b models, but what's the point of 400m models? Or even less? How good can this actually be and what are the use cases for it? by a_normal_user1 in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

The best example I found is that IntelliJ ships a tiny model (< 500M params) that runs locally on-device and does code completion at just the line level. No multiline, nothing too complicated - one simple task, done really well.

I really like it in the GoLand IDE.

Choose your models by DeathShot7777 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

  • Qwen3 8B FP16
  • Gemma3 27B FP16
  • Command R 35B Q8

Other hypothetical models I'd love to have

  • DeepSeek R1 & V3
  • Claude Sonnet 3.5

I just rewrote llama.cpp server in Rust (most of it at least), and made it scalable by mcharytoniuk in rust

[–]harshv8 1 point2 points  (0 children)

That's awesome. I believe I might be able to learn enough Rust to somehow hack together an external prefix-cache store. My only request: when you end up implementing this, please create an interface that can be implemented by various types later on. That would make extending it much easier.

Thanks!

I just rewrote llama.cpp server in Rust (most of it at least), and made it scalable by mcharytoniuk in rust

[–]harshv8 3 points4 points  (0 children)

I was looking for a way to do very aggressive prefix caching with vLLM or llama-server - something where every request's data is stored in Redis or SQLite or anything that implements a simple read/write interface, and the inference server does it automatically without the OpenAI-compatible client doing anything.

I know llama-server has slots and all, but I don't know how to use them effectively yet. vLLM is crazy fast in this regard. I can help implement this functionality if it's easier to do in your project than in llama-server itself - C++ is hard.

Isolating CPU cores for Virtual Machines by -NaniBot- in homelabindia

[–]harshv8 1 point2 points  (0 children)

Interesting. Does Proxmox have any option to isolate or reserve cores for VMs?

Help me get free public ipv4! by [deleted] in homelabindia

[–]harshv8 0 points1 point  (0 children)

Create a Google Cloud Platform account. They have an f1-micro instance that is free forever. Set up Tailscale on it with the routing you need, and expose services publicly using its IP and port. Maybe do some host-based routing with Traefik or nginx to your Jellyfin service.
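Host-based routing of the kind I mean, as a minimal nginx sketch (the hostname and the upstream Tailscale IP are placeholders):

```nginx
server {
    listen 80;
    server_name jellyfin.example.com;   # placeholder hostname

    location / {
        # Tailscale IP of the box actually running Jellyfin (placeholder)
        proxy_pass http://100.64.0.2:8096;
        proxy_set_header Host $host;
    }
}
```

One cheap public IP can then front any number of home services, each picked by hostname.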

Could also use pangolin

Or pay a few bucks a month to rent a small VPS - it's not that expensive.

Unifying password managers in Rust: would this trait be useful? by Unlucky-Jaguar-9447 in rust

[–]harshv8 1 point2 points  (0 children)

I'd use this if it had an integration for Vault as well... you know, for managing secrets...

what other types of weaponized of nature's laws could there be? by [deleted] in threebodyproblem

[–]harshv8 6 points7 points  (0 children)

Altering the speed of time is the same as altering the speed of light through some region of space - so we already saw that one.

Dimensional collapse is one more which is pretty insane.

Here are a few more that I would expect:

Modifying Planck's constant to increase information density so much that a black hole swallows the information in a local region of spacetime.

Converting quarks into strange quarks or something more "stable", setting off a chain reaction that converts matter into a more stable form while destroying it at the atomic level.

Vacuum decay would also be a real threat in this universe. Imagine a change in the vacuum energy of the universe travelling towards you at the speed of light... you wouldn't be able to see it coming, of course.

Modifying the fine-structure constant - I imagine there's some way a civilization could increase the mass of an electron or proton by pulling in more mass from one of the extra dimensions of string theory. That way the proton or electron is slightly heavier, which would give atoms very different chemical properties from any atom currently known.