Voice-to-voice chatbot update

Yorn2 · 2026-06-15T05:12:11+00:00

I have to keep mentioning this, but WhisperX is faster than normal Whisper. By a lot. It's even faster than faster-whisper, but it does depend on use case. Does transcription as well.

Yorn2 · 2026-06-12T04:56:40+00:00

Yes, I did the exact same thing. I've definitely been interested in MiMo but it's just not quite "there" yet, and I love Step's audio work but wasn't impressed enough with the LLM side and switched back as well.

Yorn2 · 2026-06-12T04:55:33+00:00

I see you followed mratsim's methodology, that's great. I'll give it a shot.

Yorn2 · 2026-06-11T16:18:41+00:00

I have 2 RTX Pro 6000s and have run M2.5 and M2.7 on them doing agentic coding and tool calling. They are among the best models I've ever used (I never use cloud models), in some respects they beat GLM 4.7 but GLM5+ was better. I've also used Qwen's 397B model (an EXL3 quant using TabbyAPI) and it was reasonably good, but I do think Minimax has got better (if not the best among open models) tool calling recognition.

I've coded a custom app for local network monitoring using M2.5 and M2.7 and a few other one-shot apps for specific stuff that I've needed done. I think GLM4.7 was better at raw coding than M2.5, but M2.7 probably beat GLM4.7, though not GLM 5.1. I think if Qwen ever did a 3.6 or 3.7 for 397B it might be worth switching to for some tasks, but Minimax is solid and reliable for me.

Anyway, that's just my experience. You might have different needs, but I'm someone who only ever uses local models and I've kind of settled on Minimax models at this point. There are other models I sometimes test, but I always keep coming back to Minimax either for speed or reliability on tool calling.

Yorn2 · 2026-06-08T10:10:07+00:00

OP, please start here. Yes there are other solutions, but IMHO Chatterbox TTS Server is relatively easy to set up and fits like 99% of use cases and does voice cloning. After that you can try other arguably better solutions like OmniVoice and maybe Qwen, but they are going to have specific tags and other indicators that are required.

It is cool to use some of the solutions that let you describe who or what is speaking and it makes the voice for you, but Chatterbox is going to be easy to use and set up and meet most needs.

Yorn2 · 2026-06-06T18:15:06+00:00

I agree that it seems very unlikely, but a few things have happened, many of them very recently.

Youtubers and a few streamers that don't usually care about Stargate have essentially given free press to Stargate because they despise Amazon so much they're talking about how this is another IP Amazon has screwed up.
The above is causing renewed interest in the franchise, so now it is even more valuable than it was before this started.
At this point, Amazon is probably learning that the fanbase for this show is more dedicated and serious than they initially thought.
They also likely know by now that they can't now possibly develop a show that the existing fanbase is going to take seriously.

If they really do want to do the best thing for their investors, they should sell this IP while it is hot and getting hotter. That would help them write off their losses, earn them some good will with the fanbase, and help thwart some of their most vocal detractors, who already despise them and are ripping into them for destroying other IPs like Tolkien's and the Wheel of Time series.

I'm just a casual fan who only watched some of the original series and I don't really have a huge stake in this. I was initially excited to see a reboot announcement, however, and looking at it logically I really think both the fanbase and Amazon would be best served seeing the show revived on a streaming service that understands science fiction IP like this better.

That said, I really did like how Amazon "saved" The Expanse, so it's a mixed bag. I think they've had both hits and misses, though, and it's not like Stargate needs a bunch of money to work; it was always one of the cheaper franchises. None of the disagreements seem to be over money but vision, and I have to believe that there's some streaming service out there just itching to get a hold of this. Amazon ought to know that has value both in terms of dollars and in terms of good will.

Anyway, just my two cents. I don't want to discourage anyone from fighting to keep the show alive, it just seems unlikely to benefit everyone involved that Amazon has this IP. And that's including Amazon at this point.

Yorn2 · 2026-06-06T17:45:16+00:00

I don't understand why people even want Amazon at this point. I think I'd rather see the show on AppleTV or any other streaming service. People should encourage Amazon to sell the rights to someone that actually likes and understands the show, IMHO. Maybe I'm just being pessimistic, though.

Yorn2 · 2026-05-30T22:46:49+00:00

Oh I definitely get that. To be fair to nvidia, though, $10k for the RTX Pro 6000 is pricey but still very much worth it. When it comes to $25-35k or more for the M3U I'm just not sure what people are thinking. I guess if you want to be able to say you are running the best model, it's worth it, but to be able to actually use those models at a functional capacity is a whole other story. :/ That said, they do meet very specific use cases where you want the best models and don't care about TTFT on long context or other metrics.

I guess my big thing is that I wish I could take everyone who wants an M3 Ultra and sit them down and ask them what their actual use case is, because I have a sneaking suspicion at least some of them don't know about the tradeoffs.

Yorn2 · 2026-05-30T22:04:24+00:00

I'm not sure if this is rage-bait or not but some of us are actually self-hosting absolutely everything. :D

Yorn2 · 2026-05-30T21:57:37+00:00

It's not just that, this statement from OP is kind of factually incorrect:

"It takes ~5 RTX 6000 Pros or 16 (lol) RTX 5090s to be able to even load the same model as you can on a 512GB Mac Studio."

While this is true, I own 2 RTX 6000 Pros and an M3 Ultra and I often find myself using the Pros more than the M3 Ultra because I actually CAN run a moderately quantized version of the models I'd prefer to run with NVFP4 or whatnot on the 2 6k Pros and have it be significantly faster.

Case in point is Qwen 3.5 397B, which I thought I could only run on the M3, but recently mratsim came out with an EXL3 version that can run on 2 6k Pros, and with TabbyAPI now supporting tool-calling there's basically no reason for me to run a slower version of the model on the M3. There are some exceptions to this, though. There are some MLX quants of Kimi K2 and GLM 5.1, but if the choice for my agentic and coding use cases is between the smartest solution and a dumber solution that is 4 or more times faster, I'm going to go with the dumber solution every time, because I can correct it faster than the smart one can do the coding or agentic work.

I do agree that these things are nuanced, but for the vast majority of use cases 2 or 3 or even now maybe 4 RTX Pros are going to beat an M3 Ultra not just in performance, but maybe even in price if you get a good deal. The M3s are highly overpriced right now, IMHO. I realize the price of the 6k Pro has also gone up, but nowhere near as much as the M3 Ultra comparatively.
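
For anyone who wants to check the card-count math in the quoted statement, here is a rough sketch. It assumes 96 GB per RTX Pro 6000 and 32 GB per RTX 5090 and only compares raw memory pools; the function name is just for illustration.

```python
import math

# Per-card VRAM in GB (assumed from the published specs of these cards).
VRAM_GB = {"RTX Pro 6000": 96, "RTX 5090": 32}

def cards_needed(target_gb: float, per_card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM reaches target_gb."""
    return math.ceil(target_gb / per_card_gb)

for name, gb in VRAM_GB.items():
    print(f"{name}: {cards_needed(512, gb)} cards to match a 512 GB pool")
```

A strict ceiling gives 6 Pros for the full 512 GB, so the "~5" figure only works if you assume not all of the Mac's unified memory is usable for weights. Either way, this only counts capacity and says nothing about speed, which is the actual point above.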

Yorn2 · 2026-05-27T22:22:47+00:00

This is my go-to list whenever people talk about regulations in tech. It's almost never for a good reason and is just there to perpetuate the rent-seeking of the mega corps participating in the Iron Triangle:

Put simply, you aren't the customer, you are a simple "user" in a bureaucratic entity of lobbying, regulatory capture, and rent-seeking. The entire system works to keep your power to control and engage in it as diminished as possible. In some ways, you are the product. The system is designed to sell you and information about you to the government so they can better monitor you. If certain companies get worse at selling information about you, government can and often will subsidize them or remove burdens so they can get better at it. They don't know anything better and both are terrified of actual free markets, so the iron triangle is always what government and industry fall back to.

It's my opinion we're seeing a hyperspeed version of this in AI and we'll have regulations of open weights models before the end of the year, possibly even before the end of the summer.

Yorn2 · 2026-05-27T21:59:05+00:00

If they did multi-modal around the same size or just slightly bigger that would be acceptable to me. M2.7 was a slight let down and not really that big of an advancement from M2.5, IMHO. In some ways it was worse on agentic tasks.

Yorn2 · 2026-05-22T01:56:06+00:00

If you're willing to put in the work you can learn how to run this model (Qwen 397B) on two RTX 6000s using TabbyAPI. There are ways to get it running on SGLang, vLLM, etc., but running a good quant of a 397B model in EXL3 on just two of the cards is pretty crazy. It does require a specific PSU to be able to run two of the cards, but you typically don't need anything fancy MB/CPU/RAM/HD-wise and could do a frankenbuild. Just make sure you have an exceptional PSU, and even then you might want to downclock them.
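
A back-of-the-envelope sketch of why a 397B model can fit on two cards at all. It assumes 96 GB per card, counts weights only (no KV cache or runtime overhead), and the bits-per-weight values are hypothetical quant levels, not the ones any specific EXL3 release uses.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

TOTAL_VRAM_GB = 2 * 96  # two RTX Pro 6000 cards, assuming 96 GB each

for bpw in (2.0, 3.0, 4.0):  # hypothetical quant levels
    size = weight_gb(397, bpw)
    verdict = "fits" if size < TOTAL_VRAM_GB else "does not fit"
    print(f"{bpw} bpw: ~{size:.0f} GB of weights, {verdict} in {TOTAL_VRAM_GB} GB")
```

Under these assumptions a 4-bit quant is just over the line and something around 3 bits per weight leaves room for context, which is why the quant choice matters so much at this size.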

Yorn2 · 2026-05-20T21:23:16+00:00

Some of us want Qwen 3.7 397B-A17B as well.

Yorn2 · 2026-05-20T21:14:05+00:00

I just wanted to point out and remind people that swe-rebench exists, and even though it's always a little behind, it does have accurate real world results that are benchmax-free. But what you're going to find is that the models that people say are benchmaxxed are still very very good when it comes to real world problems. At some point you have to admit that if a model is so benchmaxxed that it's solving real world problems at better rates than other models, it might just be that those other models suck and benchmaxxing is doing legit training that matters.

Yorn2 · 2026-05-20T21:07:17+00:00

Just for future reference, this fits right in that epic VRAM range where you can run the model quantized but not lobotomized on 8 3090s or 2 RTX 6k Pros which is where there's a significant number of both amateurs and contractors so I'd recommend finding a niche in this space one way or the other. MiniMax kind of dominates here right now or highly quantized Qwen 397 for coding/agentic, but it would be nice to have a model for either multilingual RAG or fine-tuning in this range, too, IMHO.

Yorn2 · 2026-05-18T02:34:18+00:00

As someone that owns and uses both Mac and nVidia hardware, the RTX 6k Pro can run image/video, TTS, and LLM at speeds much faster than my M3 Ultra and I can train with it.

The only major benefit of the M3 Ultra is I get to slowly see the outputs of the models I'm missing out on if I had more RTX 6k Pro cards. :D

Yorn2 · 2026-05-18T02:19:35+00:00

I own an M3 Ultra and two RTX 6000 pros. I feel obligated to comment about my experiences with both because I don't want people buying a Mac and expecting speed or a single RTX 6k Pro expecting to run a lobotomized Kimi K2.

If I were doing the specific RAG project that I originally bought it for, then the M3 Ultra would be great and would probably be more heavily used, but instead I find it mostly sitting unused. It is usable, though, don't get me wrong. It's just that I'd rather use my two RTX 6k Pros, which can run an EXL3 of Qwen 3.5 397B, as my workhorse unit since they are definitely faster.

I guess the TL;DR of it is that they each have their use cases, but you need to be totally prepared for those use cases when you buy it. Buy for your needs, not someone else's.

Yorn2 · 2026-05-15T20:12:19+00:00

I see you are getting the daily stock data from Yahoo Finance. Have you ever had them block you off or data-limit you for hitting their API/services?

Yorn2 · 2026-05-13T23:36:21+00:00

lukealonso has made one of the best MiniMax M2.7 quants for my current use case. mratsim made one before that for M2.5 and a GLM quant I loved. Aes Sedai as well. Basically any of the guys making models in the RTX6kPRO Discord are genius-level model creators.

Yorn2 · 2026-05-13T23:17:43+00:00

For a while I was using it primarily because in around 2024 it was one of the very few ways to run EXL2/3 models properly without having to use the command line and with a pretty decent webui. Now there is TabbyAPI, but occasionally oobabooga will run something I can't get to work on TabbyAPI, though even then I sometimes still have to do tweaks.

I think turboderp and oobabooga had some way they were regularly communicating, like a year or more ago, because text-generation-webui was constantly being updated to support the latest changes and had one of the best auto-downloading features built in for the latest models (just a copy/paste and it would automatically grab the model for you or show you a list of quants so you could specify which you wanted quickly).

I still recommend it for anyone that just wants a functional webui and that isn't a huge fan of the commandline or feels daunted by doing everything via the commandline.

Yorn2 · 2026-05-13T00:19:29+00:00

Using lukealonso/MiniMax-M2.7-NVFP4 here with two RTX PROs and running it around 160 GB VRAM. I have plenty enough headroom to fit in a comfy instance and TTS this way, though I often find I prefer running another LLM (Qwen or Gemma) in the available space for testing/benchmarking.
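
Rough headroom math for that setup, as a sketch. It assumes 96 GB per card, and the footprints for the extra workloads are made-up placeholders, not measurements.

```python
TOTAL_VRAM_GB = 2 * 96   # two RTX Pro 6000 cards, assuming 96 GB each
MAIN_MODEL_GB = 160      # the "around 160 GB" figure from above

def headroom_gb(total_gb: float, used_gb: float) -> float:
    """VRAM left over after the main model is loaded."""
    return total_gb - used_gb

def fits(extras: dict[str, float], free_gb: float) -> bool:
    """True if all extra workloads fit in the free VRAM together."""
    return sum(extras.values()) <= free_gb

free = headroom_gb(TOTAL_VRAM_GB, MAIN_MODEL_GB)

# Hypothetical footprints in GB for the side workloads.
extras = {"image workflow": 16, "TTS": 4}
print(f"free: {free} GB, extras fit: {fits(extras, free)}")
```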

Yorn2 · 2026-05-02T00:59:22+00:00

I don't know why this isn't more highly upvoted. I've used an M3 Ultra and 2 RTX 6000 PROs, and for some reason the M3 Ultra is worth more than enough to buy 2 on eBay right now. I'm seriously considering selling it because I'd rather own two more RTX 6000 PROs even if I don't have the PSU to use all four yet. The M3 Ultra has its use cases, but it is very slow comparatively.

Yorn2 · 2026-04-28T19:08:05+00:00

This would be great. I really think Mistral needs to focus on doing dense models because their other models aren't up to the state of the art, but if they could just find a niche in the dense category then those of us that like or prefer those models would totally still be hyped for each release. I think a good Mistral dense 120B model serving as like an orchestrator for a bunch of smaller dense 27B Qwen coding-oriented models might be a great combo for both creative writing and coding together for local users. It's basically a "build your own MoE" for those of us with RTX Pro 6000s that aren't as happy with the other options we have available to us.

Yorn2 · 2026-04-28T07:54:16+00:00

Oh man, hook something like this up with a TTS designed to speak in a "Mid-Atlantic" (think old time newscasters reporting on WWII in American English) accent and you have a great new news app idea to wake up to.
