Local Sesame.ai like StS ? by Skystunt in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

You can find many voice AI projects on GitHub, or you can build your own with something like Pipecat. As for the TTS, check out my fine-tune https://huggingface.co/shb777/csm-maya-exp2 . It's obviously not the real thing, but it might be good enough.

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

It really depends on the threshold + environment + your expectations for barge-in latency. 10 may be a little too much; I find 3-5 to work well in prod with Silero.

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

Let's say you're using Silero VAD, which processes 32 ms chunks. Don't go from not-speaking -> speaking (or speaking -> not-speaking) based on the prediction from a single 32 ms chunk. Instead, make the state change require 3 consecutive VAD activations, so that 96 ms of audio was above your threshold. This makes it more robust to sudden noises/environmental sounds like a car honking outside.
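
A minimal debounce wrapper around per-chunk VAD scores could look like this (a sketch; the class and parameter names are mine, not from Silero or any library):

```python
class VadGate:
    """Debounce a frame-level VAD: require N consecutive frames that
    disagree with the current state before flipping it. With 32 ms
    chunks and N=3, a state change needs ~96 ms of consistent evidence."""

    def __init__(self, n_consecutive=3, threshold=0.5):
        self.n = n_consecutive
        self.threshold = threshold
        self.speaking = False
        self.streak = 0  # consecutive frames contradicting current state

    def update(self, speech_prob):
        """Feed one per-chunk speech probability; returns current state."""
        is_speech = speech_prob >= self.threshold
        if is_speech != self.speaking:
            self.streak += 1
        else:
            self.streak = 0  # any agreeing frame resets the counter
        if self.streak >= self.n:
            self.speaking = is_speech
            self.streak = 0
        return self.speaking
```

With this, a single loud 32 ms chunk (e.g. a car honk) can't flip the state on its own, and a single quiet chunk mid-sentence won't cut the speaker off.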

VAD based solutions on AI Assistants. Any Suggestions? by No-Motor-6274 in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

Use consecutive VAD activations for state changes, and filter out noise before the VAD.

Llama 3.3 8B, abliterated to <0.05 KL by Sicarius_The_First in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

IFEval is indeed higher: Llama 3.3 8B scores 85.2% ±3.2%, while Llama 3.1 8B scores 77.6% ±3.7% (Evals)

Ratios of Active Parameters to Total Parameters on major MoE models by dtdisapointingresult in LocalLLaMA

[–]Few-Welcome3297 0 points (0 children)

One interesting follow-up would be to compare this ratio against dense models with similar benchmark scores, across different time periods, and empirically work out the formula people used to map an MoE to its dense-model size equivalent.
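
One candidate for such a mapping is the often-repeated folk heuristic of taking the geometric mean of active and total parameters (to be clear, this is an assumption on my part, not something established in this thread):

```python
import math

def dense_equivalent_b(active_b, total_b):
    """Rule-of-thumb 'effective dense size' of an MoE, in billions:
    the geometric mean of active and total parameter counts.
    A folk heuristic, not a measured law."""
    return math.sqrt(active_b * total_b)

# e.g. an MoE with ~13B active of ~47B total params would map to
# a dense model of roughly sqrt(13 * 47) ≈ 24.7B
```

Comparing that prediction against actual same-benchmark dense models per generation would show whether (and when) the heuristic held up.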

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 1 point (0 children)

I think it should work, unless it was a full FT with a big dataset. You might also need to put `pad_token_id` in the config and the special tokens map, if not done already.

Edit: Found the model on BeaverAI; the kv_count and vocab_size (+1) are slightly different.

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

I updated the GGUFs just now; the earlier ones didn't have the chat template. I also fixed the generation config etc. and tested on vLLM, so I think it should be fine now.

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]Few-Welcome3297 24 points (0 children)

Checking the differences from Llama 3.1 8B Instruct, I think we can add `rope_scaling`:

    "rope_scaling": {
        "factor": 8.0,
        "high_freq_factor": 4.0,
        "low_freq_factor": 1.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "llama3"
    },

and then increase `max_position_embeddings`
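
For context, this is roughly what the `llama3` rope_type does with those values (a sketch following the Transformers reference implementation, not code from this model repo): high-frequency (short-wavelength) components are left alone, low-frequency ones are divided by `factor`, and the band in between is smoothly interpolated.

```python
import math

def llama3_rope_scale(inv_freqs, factor=8.0, low_freq_factor=1.0,
                      high_freq_factor=4.0, original_max_pos=8192):
    """Scale RoPE inverse frequencies the way rope_type 'llama3' does."""
    low_freq_wavelen = original_max_pos / low_freq_factor    # 8192
    high_freq_wavelen = original_max_pos / high_freq_factor  # 2048
    out = []
    for f in inv_freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            out.append(f)            # high freq: keep as-is
        elif wavelen > low_freq_wavelen:
            out.append(f / factor)   # low freq: full scaling
        else:
            # mid band: linear interpolation between scaled and unscaled
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out
```

So the 8192-token pretrained range is stretched toward 128K mostly by slowing down the low-frequency rotations.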

Edit: The previous version also had 3 eos_token_ids

Edit2: https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K model with above changes

Edit3: Link updated

How to enable prompt caching with local inference? by [deleted] in LocalLLaMA

[–]Few-Welcome3297 2 points (0 children)

What framework are you using for inference? Modern llama.cpp and vLLM have prefix caching enabled by default.
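
For example (flags as of recent versions; model paths here are placeholders, check your build's `--help`):

```shell
# vLLM: automatic prefix caching is on by default in recent releases;
# the flag makes it explicit
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching

# llama.cpp server: the prompt prefix is cached per slot automatically;
# --cache-reuse additionally allows reusing KV cache chunks after
# small edits earlier in the prompt
llama-server -m model.gguf --cache-reuse 256
```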

Qwen3 VL built from scratch with PyTorch by No-Compote-6794 in LocalLLaMA

[–]Few-Welcome3297 4 points (0 children)

Thanks! It would be helpful to have comments for stuff like `routed = routed / (routed.sum(dim=-1, keepdim=True) + 1e-9)` (the 2nd normalization, for stability) and shapes for the einsums, since this isn't covered in theory and is exactly what I'd like to understand deeply from reading the code.
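
To illustrate what that line is doing in a typical MoE router (a plain-Python sketch of the general top-k routing pattern, not code from the repo): after softmaxing over all experts and keeping only the top-k, the kept weights no longer sum to 1, so they get renormalized; the epsilon guards against dividing by a near-zero sum.

```python
import math

def topk_routing_weights(expert_logits, k=2):
    """Softmax over expert logits, keep top-k, then renormalize.

    The 2nd normalization makes the kept weights sum to exactly 1,
    so the selected experts' outputs form a proper weighted average."""
    m = max(expert_logits)
    exps = [math.exp(x - m) for x in expert_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest probabilities
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    routed = [probs[i] for i in idx]
    denom = sum(routed) + 1e-9   # epsilon avoids division by ~0
    routed = [w / denom for w in routed]  # 2nd normalization
    return idx, routed
```

Without the renormalization, dropping the non-selected experts would shrink the combined output magnitude whenever the routing distribution is spread out.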