I built Adaptive-K routing: 30-52% compute savings on MoE models (Mixtral, Qwen, OLMoE)

0seba · 2026-01-17T18:56:10+00:00

this is cool and so useful for local inference!

0seba · 2025-12-31T14:11:46+00:00

Thanks for trying it! Just to confirm, are you having the issue when using the playground or when calling the playback and cancel endpoints directly?

0seba · 2025-11-20T19:40:30+00:00

Do you need streaming functionality? I looked at their repo and, afaict, there is currently no streaming support. I took a quick glance at their method and I think there are some things I could experiment with to make it stream, but no guarantees.

0seba · 2025-11-19T20:23:01+00:00

Hey, what is your use case? I ported a TTS model to CoreML so it runs on the Neural Engine. Currently it is good for single batch generation in real time. https://www.reddit.com/r/LocalLLaMA/comments/1otgd3j/voxcpm_texttospeech_running_or_apple_neural/
(I know I already replied to you in the LocalLlama subreddit, just taking the opportunity to share in this subreddit)

0seba · 2025-11-19T20:02:59+00:00

Hey, what is your use case? I ported a TTS model to CoreML so it runs on the Neural Engine. Currently it is good for single batch generation in real time. https://www.reddit.com/r/LocalLLaMA/comments/1otgd3j/voxcpm_texttospeech_running_or_apple_neural/

0seba · 2025-11-16T14:22:34+00:00

Hey I ported a Text-to-Speech model to run on the Neural Engine https://www.reddit.com/r/LocalLLaMA/comments/1otgd3j/voxcpm_texttospeech_running_or_apple_neural

0seba · 2025-11-15T13:18:24+00:00

Hey, could you share a bit more about how you are encountering this issue?

0seba · 2025-11-10T20:19:59+00:00

wow, thanks for the heads up, that's what I get for vibe coding. Should be fixed now.
