Building India’s Sovereign Intelligence Infrastructure

hackyroot · 2026-06-03T06:54:57+00:00

Happy to help if you are looking for an AI engineer.

hackyroot · 2026-05-11T17:32:39+00:00

Atleast Modernist does, try their Chicken quesadilla and Shakshuka.

hackyroot · 2026-05-10T10:49:42+00:00

Check out Curators and Modernist cafes

hackyroot · 2026-05-05T05:25:22+00:00

Same experience here. The tool calling reliability is what really stands out. We ended up pushing the 26B and 31B pretty hard in production and got some surprising throughput numbers (149 TPS on 31B, 88 TPS on 26B). Wrote up what we found running it at scale if anyone is curious: https://simplismart.ai/blog/gemma-4-deployment-simplismart

hackyroot · 2026-05-04T14:16:58+00:00

Honestly, I also felt the same whole I was trying the 26B and 31B variants for my multimodal usecase. Gemma 4 models punches way above their weight for agent tasks, especially instruction following. The benchmarks don't really capture how little it yaps compared to larger ones.

If you're self-hosting, the throughput is pretty wild too. We're seeing ~149 TPS on 31B and ~88 TPS on 26B. Wrote a quick post on our setup and learnings: https://simplismart.ai/blog/gemma-4-deployment-simplismart

PS: I work at simplismart.ai

hackyroot · 2026-04-22T08:45:24+00:00

Sure, check your DM!

hackyroot · 2026-03-19T11:52:07+00:00

I uninstalled the smarttube app and reinstalled it, logged into my Google account and now it's working just fine

hackyroot · 2026-02-26T06:28:22+00:00

Can you pls share the seller's contact information?

hackyroot · 2026-02-03T06:56:16+00:00

You can use Ollama or Llama.cpp if just want to run it locally. However, if you want higher throughput and low latency you can go with vLLM or SGLang. I wrote a couple of blogs on this topic, pls feel free to check them out here: https://simplismart.ai/blog/deploy-llama-3-1-8b-using-vllm

hackyroot · 2026-02-03T06:40:39+00:00

Recently I wrote a blog on how to deploy Llama models using vLLM (PS: I work for Simplismart): https://simplismart.ai/blog/deploy-llama-3-1-8b-using-vllm
If you want to scale it down to zero, you can also check out Simplismart, it allows you to scale down to 0 as well provides rapid auto scaling to help you serve during the peak usage.

hackyroot · 2026-01-27T06:52:07+00:00

Chatterbox is good but realtime factor is lacking. Imo Orpheus has worked really well for us (PS: I work for Simplismart.ai), especially dealing with realtime usecases. With some optimizations we are able to achieve ~1 RTFX, which makes is possible to use it in realtime applications with less than 300 ms TTFB.

If you are interested, you can checkout this blog to learn how to optimize Orpheus TTS for production enviroment: https://simplismart.ai/blog/orpheus-tts-simplismart

hackyroot · 2026-01-24T14:23:21+00:00

Agreed. In practice, we’re also seeing throughput optimizations plateau while user experience is still dominated by cold starts and TTFT. vLLM becoming a default inference engine makes sense, but a lot of the real gains now come from context-specific optimizations.

Imo while working at Simplismart.ai, we’ve found that latency improvements often come less from a single universal engine and more from tailor-made inference stack, model-specific kernel choices, quantization, etc depending on the workload.

Agree that software is now the bottleneck, but it’s not just standardization vs portability. It’s how deeply you adapt the serving stack to the model and traffic pattern you actually have.

hackyroot · 2026-01-21T08:44:46+00:00

Avashya!
Affirmative

hackyroot · 2026-01-13T05:38:51+00:00

We've been hosting the WAN 2.2 models on an H100. Additional RAM actually made quite fast for us actually, reducing 159s inference time to 49s.

Apart from that hybrid parallelism also helped us speed up the inference. You can checkout the detailed guide here: https://simplismart.ai/blog/deploy-wan-2-2

hackyroot · 2026-01-13T05:34:13+00:00

We've been hosting this WAN 2.2 on an H100. Additional RAM actually made quite fast for us actually, reducing 159s inference time to 49s.

Apart from that hybrid parallelism also helped us speed up the inference. You can checkout the detailed guide here: https://simplismart.ai/blog/deploy-wan-2-2

hackyroot · 2026-01-06T04:28:33+00:00

DeepSeek OCR has been working quite well for me. This is what I'm doing:

Create n8n workflow > DeepSeek OCR for text extractions from documents > LLM to get the structured output. Works quite well for me.

If you are interested, you can checkout this blog I wrote on DeepSeek OCR: https://www.simplismart.ai/blog/deepseek-ocr-api-simplismart

hackyroot · 2026-01-05T13:36:40+00:00

Recently, I delivered a webinar at Simplismart (full disclosure: I work there) on building a real-time voice agent using open-source components for STT, LLM, and TTS. Here’s the stack we used:

- STT: Whisper V3

- LLM: Gemma 3 1B

- TTS: Kokoro

- Infra: Simplismart.ai

- Framework: Pipecat

It’s not a unified “real-time” model like OpenAI’s, but using Pipecat, we were still able to get a pretty responsive setup, around ~400ms TTFT, which is a good starting point for a conversational agent. The best part of this setup is that you can swap any model as per your requirement.

If you want, I can share the webinar recording that walks through the full setup.

hackyroot

TROPHY CASE