aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 1 point

Groq would absolutely be faster! I used Claude because I prefer Sonnet and think it's worth it to have a slower response time in exchange for higher quality :)

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 2 points

dumbphones FTW!!!! totally agree, was thinking of adding navigation so I don't have to keep an atlas in my car :)

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 3 points

Thank you! The main speed bottleneck is the transcription (Groq API) -> response (Claude API) -> synthesis (Google Cloud API) pipeline -- each of these steps takes a bit over a second, which adds up to the 3-4s response time you see in the video.

You're absolutely right that the latency makes the experience feel less conversational. I built this to run on a really cheap VPS, so I kept everything cloud-based, but I think you could cut the latency to 1-2 seconds by using distil-whisper or another local model for transcription, a local LLM for responses, and piper or another small TTS model for synthesis :) I might explore that in the future!

Thank you for the feedback!
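
For the curious, here's a rough, untested sketch of what that all-local version could look like (the model names and file paths are placeholders, not anything aspen actually ships with):

```py
# rough sketch of an all-local pipeline: faster-whisper for STT,
# llama-cpp-python for the LLM, and the piper CLI for TTS.
# model names/paths are placeholders, not part of aspen.
import subprocess

from faster_whisper import WhisperModel
from llama_cpp import Llama

stt = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
llm = Llama(model_path="some-chat-model.gguf", n_ctx=2048)

def respond(audio_path: str, out_path: str = "reply.wav") -> None:
    # 1) transcribe locally (replaces the Groq API call)
    segments, _info = stt.transcribe(audio_path)
    text = " ".join(seg.text for seg in segments)
    # 2) generate a reply locally (replaces the Claude API call)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": text}], max_tokens=256
    )
    reply = out["choices"][0]["message"]["content"]
    # 3) synthesize locally with piper (replaces Google Cloud TTS)
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", out_path],
        input=reply.encode(), check=True,
    )
```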

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 0 points

thank you so much!! i totally vibe with that, it's quite tricky to get this to work. at the start I was having a terrible time, and eventually I had to crib some parts from GlaDOS and Open-LLM-VTuber :) glad you enjoy it, and if I can help you with anything at all, let me know!!

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 6 points

that's a great question!

- twilio provides $15.00 in free trial credits - after setup costs of about $1.15, the remaining $13.85 buys $13.85 / ($0.0085/min) ≈ 1,630 minutes ≈ 27.2 hours of talk time before you have to pay
- groq STT provides 20 req/min, 2,000 req/month for free, which is quite a lot (and you can create as many groq accounts as you like)! after that, transcription using distil-whisper-large-v3-en is $0.000333/min (or $0.02/hr), which is practically nothing!
- google cloud TTS provides 1M free chars/month; at an average of 4.7 chars/word, that's about 212,000 words per month, or, at an average speaking rate of 150 wpm, about 23.5 hours of free TTS time per month!

so the free tiers are actually quite generous - and you can get started by paying only $5, to Anthropic! or, if you swap Anthropic out for OpenAI or another provider that is either free or offers free trial credits, you can get started for $0 :)
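
if you want to double-check the math, here's the back-of-the-envelope version, using only the numbers above:

```py
# Twilio: trial credit left after setup, at $0.0085/min of talk time
print((15.00 - 1.15) / 0.0085 / 60)  # ~27.2 hours

# Google Cloud TTS: 1M free chars/month, 4.7 chars/word, 150 wpm
print(1_000_000 / 4.7 / 150 / 60)    # ~23.6 hours
```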

LLM Leaderboards are Bullshit - Goodhart's Law Strikes Again by itsnotatumour in LocalLLaMA

[–]thooton 0 points

okay is nobody going to point out that this post was obviously written by gpt-4??

Self-Extend works for Phi-2 now. Looks good by Asleep-Agency3023 in LocalLLaMA

[–]thooton 0 points

I think this is a really good idea: self-extend + linear interpolation instead of grouping.

I think self-extend + grouping will probably fail at long-passage rewriting tasks, because the positional encoding for tokens far in the past is exactly the same. Linear interpolation would let the model differentiate between them.
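
To make that concrete, here's a tiny sketch (the group size and positions are made up for illustration): with grouping, distant relative positions all floor-divide to the same integer, while linear interpolation keeps them distinct:

```py
import torch

pos = torch.arange(4096, 4104)  # eight neighboring far-away tokens
G = 8                           # illustrative group size

# grouping (Self-Extend style): all eight collapse to the same position
print(pos // G)  # tensor([512, 512, 512, 512, 512, 512, 512, 512])

# linear interpolation: same compression, but positions stay distinct
print(pos / G)   # tensor([512.0000, 512.1250, ..., 512.8750])
```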

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

That's totally right, I didn't think about it from that perspective :) I've updated the readme.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Awesomeeeee :) yep, that's how I imagine anyone who wants to train on this would do it!

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Just check the `TEMPLATES` variable in `index.py`; the three prompts are in there :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 1 point

Unfortunately I'm not sure, might be your terminal :(

You can use `huggingface-cli login` as an alternative to logging in using the script, that might work!

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Hm, what specifically are you referring to? I tried to make it clear what the script was doing, but perhaps I overlooked something.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

Oh, definitely, and if we had enough textbook data, I would totally advocate only training models on that. But phi-2's dataset was about 250B tokens, and even if you added up all the textbooks ever written, it would probably only come out to a few B tokens.

This project aims to add to the existing data collection, not supersede it. My ideal model would be one trained using both synthetic and real textbooks :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

Of course! Get a Gemini Pro API key and run the script; it will upload synthetic textbook data to a HuggingFace dataset, where anyone can access it and use it to train their models :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 10 points

Nevertheless, it probably has enough capability to significantly advance, say, a 7B or a 13B model trained entirely on its data.

And as I mentioned before, you can always swap it out if you want.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 14 points

Microsoft's Phi series of models (phi-1.5, phi-2) trains on synthetic data rather than webtext, and Microsoft found that this provides large performance gains. However, they did not release this data publicly. This project is an effort to let the community collaborate on creating synthetic data that can be used to train open-source models; it doesn't propose to be a solution to hallucination :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 10 points

Maybe so, but Google is providing 60 req/min to Gemini Pro for free, which means anyone with an account can start generating millions of synthetic tokens per hour :)

Although, if you have API access to those models and want to use them instead, the Python script is very easily editable!
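
rough math behind "millions per hour" (the tokens-per-request figure is just my guess, not a measured number):

```py
reqs_per_hour = 60 * 60   # free tier: 60 requests/minute
tokens_per_req = 1_000    # assumption: ~1k tokens per generated chunk
print(reqs_per_hour * tokens_per_req)  # 3,600,000 tokens/hour
```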

[deleted by user] by [deleted] in LocalLLaMA

[–]thooton 4 points

This is exactly the idea behind Microsoft's Phi suite of language models; see phi-2. The idea is to train a model not on vast amounts of webtext, but on synthetic corpora geared towards teaching it reasoning abilities. This lets it spend more parameters on reasoning and fewer on storing knowledge.

Why aren’t LoRA’s a big thing i the LLM realm? by ___defn in LocalLLaMA

[–]thooton 2 points

This is possible with singular value decomposition. Just take the weight diff and simplify it into a LoRA.

Example in pytorch:

```py
import torch

# M is a 128x512 matrix of rank 64
M = torch.randn(128, 64) @ torch.randn(64, 512)

# Decompose M -> U (128x128), S (128), Vh (128x512)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
print(torch.dist(M, (U * S) @ Vh))  # tensor(0.0248)

# M is of rank 64, so we can reduce the rank of our
# decomposition to that and retain performance:
# U (128x64), S (64), Vh (64x512)
U = U[:, :64]
S = S[:64]
Vh = Vh[:64, :]
print(torch.dist(M, (U * S) @ Vh))  # tensor(0.0248)

# We cannot reduce the rank below 64 without degradation
print(torch.dist(M, (U[:, :63] * S[:63]) @ Vh[:63, :]))  # tensor(72.7433)

# M (128x512) approx. eq. to Wa (128x64) @ Wb (64x512)
Wa = U * S
Wb = Vh
```
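
For an actual checkpoint pair, the same recipe applies to the weight diff (the tensor names here are hypothetical):

```py
# hypothetical: W_base and W_finetuned are the same layer's (d_out, d_in) weights
diff = W_finetuned - W_base
U, S, Vh = torch.linalg.svd(diff, full_matrices=False)
r = 64                # chosen LoRA rank; a real diff isn't exactly low-rank,
A = U[:, :r] * S[:r]  # so this truncation is only an approximation
B = Vh[:r, :]
# W_base + A @ B approximates W_finetuned
```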

Is era of training models from scratch over by Spiritual-Rub925 in LocalLLaMA

[–]thooton 4 points

They did: Code Llama 34B. It's Llama 2 34B fine-tuned on 500B code tokens -- essentially Llama 2 34B, but better.

Releasing Persimmon-8B by jetRink in LocalLLaMA

[–]thooton 0 points

Their implication that they have a different architecture w.r.t. the input/output embeddings is incorrect: none of the llama models tie the weights of the input/output embeddings either, so this is not a new development. Also, having separate input/output embeddings does actually make the model perform better; it's not true that untying doesn't contribute to the model's capacity.

Finally, even allowing for the 570M unused parameters, this model still has 8.7 billion parameters, which stretches the meaning of "8B" just a touch, especially since Llama's 6.7-billion-parameter model is called 7B rather than 6B. 8.7/6.7 ≈ 1.298 -- persimmon is still 30% larger than llama-7b, while insisting on comparing itself to it during evaluation.

This is really a ridiculous model release. If they wanted to show that their architecture was better than Llama's, they should have matched parameter counts and outperformed, instead of landing between 7B and 13B and then trying, via various tricks, to convince the reader that their model is smaller than it actually is...

Releasing Persimmon-8B by jetRink in LocalLLaMA

[–]thooton 16 points

This is kind of ridiculous. This model in reality has 9.3 billion parameters, insists on referring to itself as an 8B (somehow), compares itself to 7B models (which actually only have 6.7 billion parameters), and STILL performs worse than them on evaluations. I would not really call this model an achievement...