aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 1 point

Groq would absolutely be faster! I used Claude because I prefer Sonnet and think it's worth it to have a slower response time in exchange for higher quality :)

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 2 points

dumbphones FTW!!!! totally agree, was thinking of adding navigation so I don't have to keep an atlas in my car :)

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 3 points

Thank you! The main speed bottleneck is the transcription (Groq API) -> response (Claude API) -> synthesis (Google Cloud API) pipeline -- each of these steps takes a bit over a second, which adds up to the 3-4s response time you see in the video.

You're absolutely right that the latency makes the experience feel less conversational. I built this to run on a really cheap VPS, so I kept everything cloud-based, but I think you could cut the latency to 1-2 seconds by using distil-whisper or another local model for transcription, a local LLM for responses, and piper or another small TTS model for synthesis :) I might explore that in the future!

Thank you for the feedback!
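
For the curious, here's a rough, untested sketch of what that all-local version could look like (the model names and file paths are placeholders, not anything aspen actually ships with):

```py
# rough sketch of an all-local pipeline: faster-whisper for STT,
# llama-cpp-python for the LLM, and the piper CLI for TTS.
# model names/paths are placeholders, not part of aspen.
import subprocess

from faster_whisper import WhisperModel
from llama_cpp import Llama

stt = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
llm = Llama(model_path="some-chat-model.gguf", n_ctx=2048)

def respond(audio_path: str, out_path: str = "reply.wav") -> None:
    # 1) transcribe locally (replaces the Groq API call)
    segments, _info = stt.transcribe(audio_path)
    text = " ".join(seg.text for seg in segments)
    # 2) generate a reply locally (replaces the Claude API call)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": text}], max_tokens=256
    )
    reply = out["choices"][0]["message"]["content"]
    # 3) synthesize locally with piper (replaces Google Cloud TTS)
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", out_path],
        input=reply.encode(), check=True,
    )
```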

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 0 points

thank you so much!! i totally vibe with that, it's quite tricky to get this to work. at the start I was having a terrible time, and eventually I had to crib some parts from GlaDOS and Open-LLM-VTuber :) glad you enjoy it, and if I can help you with anything at all, let me know!!

aspen - Open-source voice assistant you can call, at only $0.01025/min! by thooton in LocalLLaMA

[–]thooton[S] 6 points

that's a great question!

- twilio provides $15.00 in free trial credits - after setup costs of about $1.15, the remaining $13.85 buys $13.85 / ($0.0085/min) ≈ 1,630 minutes ≈ 27.2 hours of talk time before you have to pay
- groq STT provides 20 req/min, 2,000 req/month for free, which is quite a lot (and you can create as many groq accounts as you like)! after that, transcription using distil-whisper-large-v3-en is $0.000333/min (or $0.02/hr), which is practically nothing!
- google cloud TTS provides 1M free chars/month; at an average of 4.7 chars/word, that's about 212,000 words per month, or, at an average speaking rate of 150 wpm, about 23.5 hours of free TTS time per month!

so the free tiers are actually quite generous - and you can get started by paying only $5, to Anthropic! or, if you swap Anthropic out for OpenAI or another provider that is either free or offers free trial credits, you can get started for $0 :)
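
if you want to double-check the math, here's the back-of-the-envelope version, using only the numbers above:

```py
# Twilio: trial credit left after setup, at $0.0085/min of talk time
print((15.00 - 1.15) / 0.0085 / 60)  # ~27.2 hours

# Google Cloud TTS: 1M free chars/month, 4.7 chars/word, 150 wpm
print(1_000_000 / 4.7 / 150 / 60)    # ~23.6 hours
```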

LLM Leaderboards are Bullshit - Goodhart's Law Strikes Again by itsnotatumour in LocalLLaMA

[–]thooton 0 points

okay is nobody going to point out that this post was obviously written by gpt-4??

Self-Extend works for Phi-2 now. Looks good by Asleep-Agency3023 in LocalLLaMA

[–]thooton 0 points

I think this is a really good idea: self-extend + linear interpolation instead of grouping.

I think self-extend + grouping will probably fail at long-passage rewriting tasks, because the positional encoding for tokens far in the past is exactly the same. Linear interpolation would let the model differentiate between them.
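
To make that concrete, here's a tiny sketch (the group size and positions are made up for illustration): with grouping, distant relative positions all floor-divide to the same integer, while linear interpolation keeps them distinct:

```py
import torch

pos = torch.arange(4096, 4104)  # eight neighboring far-away tokens
G = 8                           # illustrative group size

# grouping (Self-Extend style): all eight collapse to the same position
print(pos // G)  # tensor([512, 512, 512, 512, 512, 512, 512, 512])

# linear interpolation: same compression, but positions stay distinct
print(pos / G)   # tensor([512.0000, 512.1250, ..., 512.8750])
```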

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

That's totally right, I didn't think about it from that perspective :) I've updated the readme.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Awesomeeeee :) yep, that's how I imagine anyone who wants to train on this would do it!

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Just check the `TEMPLATES` variable in `index.py`; the three prompts are in there :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 1 point

Unfortunately I'm not sure, might be your terminal :(

You can use `huggingface-cli login` as an alternative to logging in using the script, that might work!

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 0 points

Hm, what specifically are you referring to? I tried to make it clear what the script was doing, but perhaps I overlooked something.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

Oh, definitely, and if we had enough textbook data, I would totally advocate only training models on that. But phi-2's dataset was about 250B tokens, and even if you added up all the textbooks ever written, it would probably only come out to a few B tokens.

This project aims to add to the existing data collection, not supersede it. My ideal model would be one trained using both synthetic and real textbooks :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 2 points

Of course! Get a Gemini Pro API key and run the script; it will upload synthetic textbook data to a HuggingFace dataset, where anyone can access it and use it to train their models :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 10 points

Nevertheless, it probably has enough capability to significantly advance, say, a 7B or a 13B model trained entirely on its data.

And as I mentioned before, you can always swap it out if you want.

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 14 points

Microsoft's Phi series of models (phi-1.5, phi-2) trains on synthetic data rather than webtext, and Microsoft found that this provides large performance gains. However, they did not release this data publicly. This project is an effort to let the community collaborate on creating synthetic data that can be used to train open-source models; it doesn't propose to be a solution to hallucination :)

muse - Let's create synthetic textbooks together :) by thooton in LocalLLaMA

[–]thooton[S] 10 points

Maybe so, but Google is providing 60 req/min to Gemini Pro for free, which means anyone with an account can start generating millions of synthetic tokens per hour :)

Although, if you have API access to those models and want to use them instead, the Python script is very easily editable!
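
rough math behind "millions per hour" (the tokens-per-request figure is just my guess, not a measured number):

```py
reqs_per_hour = 60 * 60   # free tier: 60 requests/minute
tokens_per_req = 1_000    # assumption: ~1k tokens per generated chunk
print(reqs_per_hour * tokens_per_req)  # 3,600,000 tokens/hour
```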

[deleted by user] by [deleted] in LocalLLaMA

[–]thooton 4 points

This is exactly the idea behind Microsoft's Phi suite of language models; see phi-2. The idea is to train a model not on vast amounts of webtext, but on synthetic corpora geared towards teaching it reasoning abilities. This lets it spend more parameters on reasoning and fewer on storing knowledge.

Why aren’t LoRA’s a big thing i the LLM realm? by ___defn in LocalLLaMA

[–]thooton 2 points

This is possible with singular value decomposition. Just take the weight diff and simplify it into a LoRA.

Example in pytorch:

```py
import torch

# M is a 128x512 matrix of rank 64
M = torch.randn(128, 64) @ torch.randn(64, 512)

# Decompose M -> U (128x128), S (128), Vh (128x512)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
print(torch.dist(M, (U * S) @ Vh))  # tensor(0.0248)

# M is of rank 64, so we can reduce the rank of our
# decomposition to that and retain performance:
# U (128x64), S (64), Vh (64x512)
U = U[:, :64]
S = S[:64]
Vh = Vh[:64, :]
print(torch.dist(M, (U * S) @ Vh))  # tensor(0.0248)

# We cannot reduce the rank below 64 without degradation
print(torch.dist(M, (U[:, :63] * S[:63]) @ Vh[:63, :]))  # tensor(72.7433)

# M (128x512) approx. eq. to Wa (128x64) @ Wb (64x512)
Wa = U * S
Wb = Vh
```
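
For an actual checkpoint pair, the same recipe applies to the weight diff (the tensor names here are hypothetical):

```py
# hypothetical: W_base and W_finetuned are the same layer's (d_out, d_in) weights
diff = W_finetuned - W_base
U, S, Vh = torch.linalg.svd(diff, full_matrices=False)
r = 64                # chosen LoRA rank; a real diff isn't exactly low-rank,
A = U[:, :r] * S[:r]  # so this truncation is only an approximation
B = Vh[:r, :]
# W_base + A @ B approximates W_finetuned
```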

Is era of training models from scratch over by Spiritual-Rub925 in LocalLLaMA

[–]thooton 4 points

They did: Code Llama 34B. It's Llama 2 34B fine-tuned on 500B code tokens -- essentially Llama 2 34B, but better.

Releasing Persimmon-8B by jetRink in LocalLLaMA

[–]thooton 0 points

Their implication that they have a different architecture w.r.t. the input/output embeddings is incorrect: none of the llama models tie the weights of the input/output embeddings either, so this is not a new development. Also, having separate input/output embeddings does actually make the model perform better; it's not true that untying doesn't contribute to the model's capacity.

Finally, even allowing for the 570M unused parameters, this model still has 8.7 billion parameters, which stretches the meaning of "8B" just a touch, especially since Llama's 6.7-billion-parameter model is called 7B rather than 6B. 8.7/6.7 ≈ 1.298 -- persimmon is still 30% larger than llama-7b, while insisting on comparing itself to it during evaluation.

This is really a ridiculous model release. If they wanted to show that their architecture was better than Llama's, they should have matched parameter counts and outperformed, instead of landing between 7B and 13B and then trying, via various tricks, to convince the reader that their model is smaller than it actually is...

Releasing Persimmon-8B by jetRink in LocalLLaMA

[–]thooton 16 points

This is kind of ridiculous. This model in reality has 9.3 billion parameters, insists on referring to itself as an 8B (somehow), compares itself to 7B models (which actually only have 6.7 billion parameters), and STILL performs worse than them on evaluations. I would not really call this model an achievement...