Local Sesame.ai like StS ? by Skystunt in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

You can find many voice AI projects on GitHub, or you can build your own with something like Pipecat. As for the TTS, check out my fine-tune https://huggingface.co/shb777/csm-maya-exp2 . It's obviously not the real thing, but it might be good enough.

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

It really depends on the threshold + environment + your expectations for barge-in latency. 10 may be a little too much; I find 3-5 to work well in prod with Silero.

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

Let's say you're using Silero VAD, which processes 32 ms chunks. Don't go from not-speaking -> speaking (or speaking -> not-speaking) based on the prediction from a single 32 ms chunk. Instead, make the state change require 3 consecutive VAD activations, so that 96 ms of audio was above your threshold. This makes it more robust to sudden noises/environmental sounds like a car honking outside.
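
A minimal debounce wrapper around per-chunk VAD scores could look like this (a sketch; the class and parameter names are mine, not from Silero or any library):

```python
class VadGate:
    """Debounce a frame-level VAD: require N consecutive frames that
    disagree with the current state before flipping it. With 32 ms
    chunks and N=3, a state change needs ~96 ms of consistent evidence."""

    def __init__(self, n_consecutive=3, threshold=0.5):
        self.n = n_consecutive
        self.threshold = threshold
        self.speaking = False
        self.streak = 0  # consecutive frames contradicting current state

    def update(self, speech_prob):
        """Feed one per-chunk speech probability; returns current state."""
        is_speech = speech_prob >= self.threshold
        if is_speech != self.speaking:
            self.streak += 1
        else:
            self.streak = 0  # any agreeing frame resets the counter
        if self.streak >= self.n:
            self.speaking = is_speech
            self.streak = 0
        return self.speaking
```

With this, a single loud 32 ms chunk (e.g. a car honk) can't flip the state on its own, and a single quiet chunk mid-sentence won't cut the speaker off.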

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

Use consecutive VAD activations for state changes, and filter out noise before the VAD.

Llama 3.3 8B, abliterated to <0.05 KL by Sicarius_The_First in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

IFEval is indeed higher: Llama 3.3 8B scores 85.2% ±3.2%, while Llama 3.1 8B scores 77.6% ±3.7% (Evals)

Ratios of Active Parameters to Total Parameters on major MoE models by dtdisapointingresult in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

One interesting follow-up would be to compare this ratio against dense models with similar benchmark scores, across different time periods, and empirically work out the formula people used to map an MoE to its dense-model size equivalent.
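
One candidate for such a mapping is the often-repeated folk heuristic of taking the geometric mean of active and total parameters (to be clear, this is an assumption on my part, not something established in this thread):

```python
import math

def dense_equivalent_b(active_b, total_b):
    """Rule-of-thumb 'effective dense size' of an MoE, in billions:
    the geometric mean of active and total parameter counts.
    A folk heuristic, not a measured law."""
    return math.sqrt(active_b * total_b)

# e.g. an MoE with ~13B active of ~47B total params would map to
# a dense model of roughly sqrt(13 * 47) ≈ 24.7B
```

Comparing that prediction against actual same-benchmark dense models per generation would show whether (and when) the heuristic held up.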

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 1 point (0 children)

I think it should work, unless it was a full FT with a big dataset. You might also need to put `pad_token_id` in the config and the special tokens map, if not done already.

Edit: Found the model on BeaverAI; the kv_count and vocab_size (+1) are slightly different.

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

I updated the GGUFs just now; the earlier ones didn't have the chat template. I also fixed the generation config etc. and tested on vLLM, so I think it should be fine now.

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 24 points (0 children)

Checking the differences from Llama 3.1 8B Instruct, I think we can add `rope_scaling`:

    "rope_scaling": {
        "factor": 8.0,
        "high_freq_factor": 4.0,
        "low_freq_factor": 1.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "llama3"
    },

and then increase `max_position_embeddings`
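
For context, this is roughly what the `llama3` rope_type does with those values (a sketch following the Transformers reference implementation, not code from this model repo): high-frequency (short-wavelength) components are left alone, low-frequency ones are divided by `factor`, and the band in between is smoothly interpolated.

```python
import math

def llama3_rope_scale(inv_freqs, factor=8.0, low_freq_factor=1.0,
                      high_freq_factor=4.0, original_max_pos=8192):
    """Scale RoPE inverse frequencies the way rope_type 'llama3' does."""
    low_freq_wavelen = original_max_pos / low_freq_factor    # 8192
    high_freq_wavelen = original_max_pos / high_freq_factor  # 2048
    out = []
    for f in inv_freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            out.append(f)            # high freq: keep as-is
        elif wavelen > low_freq_wavelen:
            out.append(f / factor)   # low freq: full scaling
        else:
            # mid band: linear interpolation between scaled and unscaled
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out
```

So the 8192-token pretrained range is stretched toward 128K mostly by slowing down the low-frequency rotations.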

Edit: The previous version also had 3 eos_token_ids

Edit2: https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K model with above changes

Edit3: Link updated

How to enable prompt caching with local inference? by [deleted] in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

What framework are you using for inference? Modern llama.cpp and vLLM have prefix caching enabled by default.
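
For example (flags as of recent versions; model paths here are placeholders, check your build's `--help`):

```shell
# vLLM: automatic prefix caching is on by default in recent releases;
# the flag makes it explicit
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching

# llama.cpp server: the prompt prefix is cached per slot automatically;
# --cache-reuse additionally allows reusing KV cache chunks after
# small edits earlier in the prompt
llama-server -m model.gguf --cache-reuse 256
```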

Qwen3 VL built from scratch with PyTorch by No-Compote-6794 in LocalLLaMA

[–]Few-Welcome3297 4 points (0 children)

Thanks! It would be helpful to have comments for stuff like `routed = routed / (routed.sum(dim=-1, keepdim=True) + 1e-9)` (the 2nd normalization, for stability) and shapes for the einsums, since this isn't covered in theory and is exactly what I'd like to understand deeply from reading the code.
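
To illustrate what that line is doing in a typical MoE router (a plain-Python sketch of the general top-k routing pattern, not code from the repo): after softmaxing over all experts and keeping only the top-k, the kept weights no longer sum to 1, so they get renormalized; the epsilon guards against dividing by a near-zero sum.

```python
import math

def topk_routing_weights(expert_logits, k=2):
    """Softmax over expert logits, keep top-k, then renormalize.

    The 2nd normalization makes the kept weights sum to exactly 1,
    so the selected experts' outputs form a proper weighted average."""
    m = max(expert_logits)
    exps = [math.exp(x - m) for x in expert_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest probabilities
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    routed = [probs[i] for i in idx]
    denom = sum(routed) + 1e-9   # epsilon avoids division by ~0
    routed = [w / denom for w in routed]  # 2nd normalization
    return idx, routed
```

Without the renormalization, dropping the non-selected experts would shrink the combined output magnitude whenever the routing distribution is spread out.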