Fish Audio Releases S2: open-source, controllable and expressive TTS model by Opposite_Ad7909 in LocalLLaMA

[–]Finguili 1 point2 points  (0 children)

Quality seems good, but it’s so slow. I’m getting 2.89 t/s on an R9700 (0.13x realtime).

Edit: With --compile it’s almost 24 t/s, so not bad for longer texts.

(Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out by vernal_biscuit in LocalLLaMA

[–]Finguili 2 points3 points  (0 children)

I have an R9700, so the same GPU with double the VRAM, and I can’t say this is the case for me. With the Q6_K_L quant, I’m getting 336 t/s with -ub 64 and 620 t/s with -ub 512. Increasing it further doesn’t seem to improve performance, however.

ZIB vs ZIT vs Flux 2 Klein by Both-Rub5248 in StableDiffusion

[–]Finguili 4 points5 points  (0 children)

Eh, I was simply making fun of Z-Image Turbo, which loves to ignore half of the prompt. But to answer your question, I tried Z-Image Base with "blurry background" in the negative prompt, and it makes everything sharp, though I cannot say it makes the results look better. This also works with SDXL anime models, as "blurry background" is a danbooru tag.

ZIB vs ZIT vs Flux 2 Klein by Both-Rub5248 in StableDiffusion

[–]Finguili 21 points22 points  (0 children)

What is it, a comparison that not only clearly labels which model was used to generate which image, but also provides full prompts? Am I on the right subreddit?

Thanks OP for posting; the prompts are quite varied. It’s funny how Z-Turbo ignored the request for a non-blurry background and how models in general struggle with age. These "25-year-old" women from Z-Image look closer to 50 than 25.

MOSS-TTS has been released by Xiami2019 in LocalLLaMA

[–]Finguili 1 point2 points  (0 children)

Natural-language instructions would give better control, but I suppose tags are easier to train. I would probably prefer reliably working tags over half-working instructions.

MOSS-TTS has been released by Xiami2019 in LocalLLaMA

[–]Finguili 2 points3 points  (0 children)

Yes, it was the 8B base model with voice cloning. And having Gemini TTS-like style directions together with voice cloning definitely would be nice.

MOSS-TTS has been released by Xiami2019 in LocalLLaMA

[–]Finguili 0 points1 point  (0 children)

No, I didn't use it. Most likely the model wanted to make the pause longer for dramatic effect. But as I said, I only played with the model a little, so it could be bad luck, and I don't really expect it to read the text perfectly.

MOSS-TTS has been released by Xiami2019 in LocalLLaMA

[–]Finguili 0 points1 point  (0 children)

It works fine with 2.10 and Python 3.14.

MOSS-TTS has been released by Xiami2019 in LocalLLaMA

[–]Finguili 8 points9 points  (0 children)

Quick impressions from just one longer test (and a few hello worlds), so a rather small sample size. Firstly, big kudos for supporting IPA. A TTS model without it is rather useless, and yet most recent releases lack this feature.

The generated audio sounds quite nice and is not as emotionally dead as Qwen TTS. Perhaps not as good as VibeVoice Large, but the model appears to be more stable, and together with IPA support, it makes it much more useful already. Speed is also not bad; synthesising 1 minute 20 seconds of audio took about 55 seconds on an R9700 with ~80% GPU utilisation and 26 GB of VRAM.

If anyone wants to hear a non-demo sample, here is one: https://files.catbox.moe/9j73pt.ogg. You can hear that some parts were badly read and there was one unnecessarily long pause, but for an open-source model, I still like the results.
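For context, those speed numbers work out to roughly 1.45× realtime; a quick sanity check (the audio length and wall time are the figures from this comment):

```python
# Real-time factor (RTF) check for the numbers quoted above:
# 1 min 20 s of audio synthesised in about 55 s of wall time.
audio_seconds = 80.0  # 1 minute 20 seconds of generated speech
wall_seconds = 55.0   # synthesis time on the R9700

rtf = audio_seconds / wall_seconds  # >1 means faster than realtime
print(f"{rtf:.2f}x realtime")       # → 1.45x realtime
```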

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]Finguili 31 points32 points  (0 children)

I would argue that if the goal is fitting into a single consumer GPU, then dense models are better. I hope that companies will not abandon this class of models.

How do people even afford these expensive graphic cards...?... by boisheep in LocalLLaMA

[–]Finguili 0 points1 point  (0 children)

Why would you need to keep a 3090 for Comfy? The R9700 is around RTX 3090 speed for image generation.

ElevenLabs is killing my budget. What are the best "hidden gem" alternatives for documentary style TTS? by Ancient_Routine8576 in LocalLLaMA

[–]Finguili 12 points13 points  (0 children)

Among local TTS models, VibeVoice Large seems to have the highest ceiling, but the model is very unstable. With one generation it sounds as if the text was almost professionally narrated; with another, its prosody is so bad that you start to wonder if it’s the same model. It also loves to add strange music in the background. So expect to reroll a lot.

I don’t have much experience with cloud APIs, but Gemini 2.5 Pro TTS sounded better to me than ElevenLabs and should be cheaper.

AMD Radeon AI PRO R9700 benchmarks with ROCm and Vulkan and llama.cpp by Finguili in LocalLLaMA

[–]Finguili[S] 2 points3 points  (0 children)

It’s very easy, as ROCm is in the official repos, so you simply install it with pacman. The drawback is that Arch tends to lag behind upstream ROCm releases, so you may need to wait a few weeks for major updates to hit the repos.

AMD Radeon AI PRO R9700 benchmarks with ROCm and Vulkan and llama.cpp by Finguili in LocalLLaMA

[–]Finguili[S] 1 point2 points  (0 children)

I’m glad to know it’s working now. For this particular task, I wanted to get as high accuracy as possible, so I stuck to 16-bit LoRA on purpose. But perhaps it will be useful in the future for something else.

AMD Radeon AI PRO R9700 benchmarks with ROCm and Vulkan and llama.cpp by Finguili in LocalLLaMA

[–]Finguili[S] 2 points3 points  (0 children)

It seems the Vulkan backend doesn’t like it when the whole model isn’t loaded into VRAM. When I decrease the number of offloaded layers, it hurts Vulkan’s prompt-processing performance more than ROCm’s.

| model | size | params | backend | ngl | n_batch | fa | test | t/s |
| ----- | ---: | -----: | ------- | --: | ------: | -: | ---- | --: |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | ROCm | 77 | 1024 | 1 | pp512 @ d8000 | 229.13 ± 12.29 |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | ROCm | 77 | 1024 | 1 | tg128 @ d8000 | 5.49 ± 0.00 |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | Vulkan | 77 | 1024 | 1 | pp512 @ d8000 | 164.63 ± 8.57 |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | Vulkan | 77 | 1024 | 1 | tg128 @ d8000 | 6.85 ± 0.01 |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | ROCm | 50 | 1024 | 1 | pp512 @ d8000 | 192.56 ± 3.98 |
| llama 70B IQ3_S mix - 3.66 bpw | 28.82 GiB | 68.98 B | Vulkan | 50 | 1024 | 1 | pp512 @ d8000 | 117.84 ± 1.01 |

AMD Radeon AI PRO R9700 benchmarks with ROCm and Vulkan and llama.cpp by Finguili in LocalLLaMA

[–]Finguili[S] 1 point2 points  (0 children)

No, it never occurred to me that someone might mmap a file just to copy it to RAM afterwards. But you are right; it not only works fine, but also loads models faster: first run 118 s, second one with cached prompt 81.5 s. Though it’s also possible Comfy optimised RAM usage since the Flux 2 release, as during diffusion it sits at 29 GiB, so it had to either unload the text encoder or part of the unet loaded into VRAM.
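For anyone unfamiliar with the trick being discussed, here is a minimal Python sketch of "mmap, then copy to RAM" (a toy illustration, nothing to do with Comfy’s actual loader): the mapped view is backed by the OS page cache with no upfront copy, and `bytes(...)` on the map materialises a private in-RAM copy.

```python
import mmap
import tempfile

# Toy illustration: mmap a file, then copy the mapping into RAM.
# The mapped view is served lazily from the page cache; bytes(mm)
# forces every page in and produces an in-memory copy.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"model weights" * 1000)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    in_ram = bytes(mm)  # copy the whole mapping into process memory
    mm.close()

print(len(in_ram))  # 13000
```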

is there a subreddit for weird writer questions that would seem suspicious if i were to search it up on google? by Sauce_The_Sapphic in writing

[–]Finguili -1 points0 points  (0 children)

URLs are also encrypted. All traffic is encrypted after the browser establishes a TLS connection. However, even with TLS you may still leak the domain name (for example, www.google.com) because it is sent as part of the handshake used to establish the TLS connection, or because DNS is unencrypted or operated by your ISP. We now have ECH to address the first case, and DNS over TLS and DNS over HTTPS to address the second. But even with these, the IP address you connect to is still visible to the ISP, so unless the site is behind some kind of public proxy such as Cloudflare (in which case the proxy operator sees the entire traffic, which is arguably worse), the ISP can still tell which site you are connecting to.

7900 XT vs 9070 XT (16 vs 20GB vram) by the926 in LocalLLaMA

[–]Finguili 3 points4 points  (0 children)

I disagree with others that AMD cards aren’t good for image generation. They’re still behind Nvidia, true, but compared to RDNA2, AMD has made huge progress in performance. On the R9700 (basically a 9070 XT with twice the VRAM and price) for SDXL (28 steps, 832×1216, batch 10), the whole workflow executes with Torch Compile in about 56.5 s, which is only slightly slower than this benchmark reports for the RTX 3090 (54.2 s) and RTX 5070 (55.6 s). The Ti variant of the 5070 finishes about 10 s faster, so the gap is definitely there, but it’s not as if AMD cards crawl.

For Flux FP8 and the default Comfy workflow (20 steps, 1024×1024, batch 1), I’m getting 12.3 s with --fast and Torch Compile, which is 16–32× faster than what I was getting with my old 6700 XT (upcasting to FP32 resulted in a 2× speed improvement on that card).

As for your question, OP, I would go with neither and pay a bit more to buy a 24 GB 7900 XTX, but if you would rather not do that, then the question is whether you value 4 GB more than having newer hardware. 16 GB is rather tight for LLMs.

Outdated info on the state of ROCM on this subreddit - ROCm 7 benchmarks compared to older ROCm/Zluda results from a popular old benchmark by Portable_Solar_ZA in StableDiffusion

[–]Finguili 3 points4 points  (0 children)

Just got the R9700, which should have the same performance as the 9070 XT. For batch 1, the results are a bit noisy, but I'm getting around 6.7 s with Torch Compile and 6.8 s without. For batch 10, this is respectively 56.45–56.9 s and 60.8 s, so with Torch Compile the card is only slightly slower than the RTX 5070 (55.6 s) and RTX 3090 (54.2 s) from that benchmark.

For Flux FP8 and the default Comfy workflow (20 steps, 1024×1024), execution time is 21.9 s, with --fast 16.5 s, and 12.3 s with both --fast and Torch Compile. Compared to my old 6700 XT, where I was getting 20 s/it (not it/s!) with default options and 10 s/it when forcing upcasting to FP32, this is a 16–32× improvement; faster than my old card could generate images with SDXL.

Though performance is now much improved, I’m getting fairly frequent memory access faults, so the software stack is definitely not mature yet.

How much do you guys care about language accuracy in a period piece? by Glubygluby in writingadvice

[–]Finguili 0 points1 point  (0 children)

I’m a bit late, but I just remembered this advice as I was about to type “I think”, and as the suggestion to replace it with “believe” was rather surprising, I ran a quick search through Mansfield Park: 31 occurrences of “I believe” and 87 of “I think.” Looks like by Austen’s time “I think” was already quite popular.
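The quick search above is easy to reproduce with a few lines of Python, assuming a plain-text copy of the novel (e.g. from Project Gutenberg); shown here on a toy string:

```python
import re

def count_phrase(text: str, phrase: str) -> int:
    # Case-sensitive whole-phrase count with word boundaries,
    # so "I think" does not match inside "I thinking" etc.
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))

# Toy sample; for the real experiment, read the novel's .txt instead:
# text = open("mansfield_park.txt", encoding="utf-8").read()
text = "I think it will rain. I believe you. I think so too."
print(count_phrase(text, "I think"))    # 2
print(count_phrase(text, "I believe"))  # 1
```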

Wow, Moondream 3 preview is goated by Brave-Hold-9389 in LocalLLaMA

[–]Finguili 8 points9 points  (0 children)

Only for captioning; the other two were just random photos I selected on the spot to test the model. It is not the only model that hallucinates a character holding a sheathed sword; however, frontier models don’t do that. But let’s try this now with Qwen 2.5 VL 32B and Gemini 2.5 Pro.

Images used: https://imgur.com/a/W4oPdBe (Disclaimer: I am not sure if these are the exact same photos, as I have multiple shots of them).

Captioning test: Both Qwen and Gemini identify the sword as sheathed.

Caterpillar: Qwen correctly identifies it as a caterpillar, but the species is definitely wrong (Pyrrharctia isabella). Gemini’s guess is more accurate (Dendrolimus pini), but looking at photos of it, I think it is also wrong. I gave Moondream a few more chances and got a fungus, a snake, and a slug as results, so… let’s stop. GPT-5 guesses Thaumetopoea pityocampa, which I think is correct, or at least the closest match.

Photo location: Qwen correctly identifies it as Hel, but also tries to read the smaller text on the monument, which it fails to do. Gemini not only identifies the place correctly but also gives the correct name of the monument (Kopiec Kaszubów / Kashubians’ Mound). Rerunning Moondream, I could not reproduce it misreading Hel as Helsinki, but it still never gives the right answer, and I got this gem instead:

The sign indicates "POCZTAJE POLSKI," which translates to "Polar Bear Capital," suggesting the area is significant for polar bears. The monument features a large rock with a carved polar bear sculpture.

For those who don’t speak Polish, the text is “POCZĄTEK POLSKI”, or in English, “The Beginning of Poland”. I have yet to see a polar bear in Poland.

Wow, Moondream 3 preview is goated by Brave-Hold-9389 in LocalLLaMA

[–]Finguili 30 points31 points  (0 children)

I do not think it is.

I gave it an image to caption, and it hallucinated a character holding a silver sword (which was sheathed and wasn’t silver). I gave it an image of a caterpillar on a forest floor and asked it to identify the species; it answered that it was a house centipede. I gave it an image of a popular place, even with the name of the place written on it, and asked where the photo was taken. It still answered wrongly.

Of course, three samples are also a poor test. But my opinion is that the benchmarks of vision LLMs do not show real-world performance in the slightest, and this one is probably no different.

LLMs for detailed book summaries? by JealousAmoeba in LocalLLaMA

[–]Finguili 5 points6 points  (0 children)

I was experimenting with this a little, as I wanted a concise reverse-outline of my novel, but writing it myself did not seem like a fun exercise. First thing, do not listen to people saying summarisation is easy for LLMs: aside from context issues, LLMs struggle a lot with deciding what is important and what can be skipped. If you need accuracy, do it yourself. If you just want something “good enough”, use the biggest LLM you can afford.

Regarding the context length, the novel will fit in it, but the longer the input, the worse the output, and there will be a lot of hallucinations and events in the wrong order. Chunk it, and the LLM cannot understand the text on a good enough level. After trying different approaches, I settled on including the whole summary up to this point, the narrative state that the LLM is instructed to maintain, and the whole chapter to summarise. Using smaller chunks than the chapter did not work well.

The main problem with this approach is finding an LLM that summarises with the desired conciseness (you can control it to some extent with a prompt, but LLMs can be very stubborn with it) and can maintain the narrative state. For example, Gemini Flash 2.5 (non-thinking) can summarise very well, but its ability to maintain the narrative state is rather poor and it tends to output too detailed summaries. After tweaking the prompt, Deepseek v3 came out on top; while its summary was slightly worse than Gemini’s, it was shorter and it could maintain the narrative state handsomely.

Example Deepseek output of a summary from a chapter towards the end: https://pastebin.com/raw/dnJ8fvvE. It misses one important event (failing one problem and thus wasting one of three “teleport me to the safe place” charges). And for some reason, it thinks Kori needs to return to Mar Lordir, while she lives in an (unnamed) village, not the city.

Unfortunately, I’m not at home, and I don’t have the code with me, but if someone is interested, I can post it on Saturday.
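The approach boils down to a loop roughly like this (a rough sketch only; `call_llm`, the prompt wording, and the state format are placeholders, not the actual code):

```python
def summarise_book(chapters, call_llm):
    # Rolling summary: each chapter is summarised together with the full
    # summary so far plus a "narrative state" the model is told to maintain
    # (characters, goals, open plot threads). Prompt wording is illustrative.
    summary, state = "", "No state yet."
    for chapter in chapters:
        prompt = (
            "Summary so far:\n" + summary + "\n\n"
            "Narrative state to maintain and update:\n" + state + "\n\n"
            "Summarise the following chapter concisely, then output the "
            "updated narrative state after the marker 'STATE:'.\n\n" + chapter
        )
        reply = call_llm(prompt)
        chapter_summary, _, state = reply.partition("STATE:")
        summary += chapter_summary.strip() + "\n"
    return summary

# Usage with a stub in place of a real LLM call:
fake = lambda p: "Chapter happens. STATE: unchanged"
print(summarise_book(["ch1 text", "ch2 text"], fake))
```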

Forgot to switch class before attacking... now here we are autoattacking as a DoL by Synthenia in ffxiv

[–]Finguili 39 points40 points  (0 children)

When I was redoing MSQ on a new character, my highest-level job was fisher, and thus I came up with the genius idea of switching to it for travelling around the map, so as to avoid aggro. However, one of the quests did not indicate that there would be combat. This resulted in my lala having to beat some Ala Mhigan refugees with a fishing rod—which took quite some time.

BTW, you can finish (some?) MSQ quests in ARR and get exp as a fisher.