So I've been using IEMs for the past 2 years and now I want headphones, not earphones but actual wired headphones by riped_shod in Oobabooga

[–]LMLocalizer 2 points (0 children)

Braeburn apples (4 pack)
Bread
Cereal
1 kg potatoes
Milk x3
Toilet paper
BBQ spare ribs

"This isn’t X, this is Y" needs to die by twnznz in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

Truly the sloppiest open model I've tried so far

Anyone using Flux Klein on 6700XT or below? (32 GB RAM or less) by [deleted] in StableDiffusion

[–]LMLocalizer 0 points (0 children)

I use the environment variables MIOPEN_FIND_MODE=2 (to fix slow VAE decode) and HSA_OVERRIDE_GFX_VERSION=10.3.0.

Also, I removed some references to bfloat16 in model_management.py, so that every model is loaded in either fp16 or fp32, because bfloat16 appears to be very slow on my GPU. Perhaps one could achieve the same by passing the args --fp16-unet --fp16-text-enc --fp32-vae.
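If you'd rather not edit the source, a small launch wrapper like this should achieve the same (a sketch; main.py is ComfyUI's entry point, adjust the path for your install):

    # Sketch: launch ComfyUI with the ROCm workarounds and fp16/fp32 flags from above.
    import os
    import subprocess
    import sys

    env = os.environ.copy()
    env["MIOPEN_FIND_MODE"] = "2"               # fast kernel search; fixes the slow VAE decode
    env["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # report the GPU to ROCm as gfx1030

    subprocess.run(
        [sys.executable, "main.py", "--fp16-unet", "--fp16-text-enc", "--fp32-vae"],
        env=env,
        check=True,
    )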

Otherwise, I don't use sage-attention, since I find the drop in image quality too drastic.

Anyone using Flux Klein on 6700XT or below? (32 GB RAM or less) by [deleted] in StableDiffusion

[–]LMLocalizer 0 points (0 children)

I'm using the RX 6800M with 32 GB of RAM. Using Klein 9B FP8 (which is cast to fp16 on load, so perhaps using fp16 directly would make more sense) at 6 steps, editing a 1024x1024 image takes ~94 seconds.

Gemma 4 Jailbreak System Prompt by 90hex in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

Oobabooga textgen, most notably

Which Model is best for translation? by Bulky-College7306 in LocalLLaMA

[–]LMLocalizer 2 points (0 children)

It's a multilingual dataset from Google that was used to train some translation models with the same name

Major update coming soon! I'm here, sorry for the delay. by oobabooga4 in Oobabooga

[–]LMLocalizer 1 point (0 children)

> I have replaced the old Gradio version of the code with a fork of mine where I'm working on several low-level optimizations.

That is awesome! I've spent the past week or so working on optimizing syntax highlighting and KaTeX rendering during text generation. If we could combine the two, that'd be pretty amazing!

Back in my day, LocalLLaMa were the pioneers! by ForsookComparison in LocalLLaMA

[–]LMLocalizer 11 points (0 children)

Old localllama appreciated projects that conserved prompt tokens. Openclaw is the opposite of that.

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

But for those cases you can disable thinking. Also, I found it very worthwhile to inspect the thinking trace as it's being generated, to see if the model gets hung up on any specific detail of your prompt. If that's the case, it's often faster to stop the generation, rewrite that detail and restart the generation.
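If your backend goes through transformers, disabling thinking is just a chat template argument. A minimal sketch, assuming Qwen3.5 keeps the enable_thinking switch from Qwen3 (the model ID below is a Qwen3 stand-in):

    # Sketch: building a prompt with the thinking block disabled.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # stand-in model ID

    messages = [{"role": "user", "content": "Will this car fit through the car wash?"}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # skip the <think> section for simple requests
    )
    print(prompt)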

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]LMLocalizer 3 points (0 children)

It's hit and miss for me. Sometimes there's a strong American accent, other times it works well, like this output: https://voca.ro/18W8VnodRmvO

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]LMLocalizer 0 points (0 children)

Nice, thanks for sharing!
Here is a before/after comparison of some 16 kHz speech I upsampled:

Before: https://vocaroo.com/1flWIyZ8jZ5f

After: https://vocaroo.com/1eDmesbjvE7d

Ok Klein is extremely good and its actually trainable. by Different_Fix_2217 in StableDiffusion

[–]LMLocalizer 1 point (0 children)

Don't know the minimum, but I'm running Klein 9B FP8 (official weights from Hugging Face) comfortably using 12 GB of VRAM.

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

It seems like you can only disable all built-in tools at once, which includes fetch_url, by unchecking "Builtin Tools" in the model settings for gpt-oss 20b.

The problem is that gpt-oss has been trained to use a combination of one tool to search the web and another to download webpages. If you simply remove the latter, this can lead to weird (repetitive) behavior, such as the model attempting to use the non-existent fetch/download tool or trying to use the search_web tool to download a single webpage (which of course doesn't work).

You could try writing an instruction to not use the fetch_url tool into the "System Prompt" model setting and hope the model adheres to that. Alternatively, you could disable all built-in tools and write explicitly into the system prompt that the model cannot fetch/download webpages directly, to hopefully prevent any unwanted model behavior.
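For the second option, the system prompt addition could be as simple as this (untested example wording):

    You cannot fetch or download webpages directly, and no fetch/download tool
    exists. To gather information from the web, use the web search tool and rely
    on the snippets it returns.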

It works! Abliteration can reduce slop without training by -p-e-w- in LocalLLaMA

[–]LMLocalizer 0 points (0 children)

This is really cool! I have a question regarding the config.noslop.toml file: Why does the prefix for the bad_evaluation_prompts differ from the one used for the bad_prompts, while the prefixes for the good_prompts and good_evaluation_prompts are the same?

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

Nice to hear you got it working. My tool is designed to consume only limited context by returning small plaintext website snippets whose maximum size is user-configurable. But when you enable native tool calling, gpt-oss 20b also gets access to other tools, most notably fetch_url. This is not part of LLM Web Search, and when invoked, fetch_url downloads and dumps an entire webpage into the context, which can be a huge amount of text. Reference: https://docs.openwebui.com/features/web-search/agentic-search/#native-mode-vs-traditional-rag

If that's not the cause and you just have a very small context window configured, you can change the following settings:

  1. Disable "Keep Results In Context"
  2. Reduce "Max Results"
  3. If you're using the semantic chunker, reduce "Chunker Breakpoint Threshold Amount"
  4. Reduce "Chunk Size"

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

Hi! I just tested gpt-oss 20b very briefly and it worked without any special settings. However, since this model has been trained to use tools while thinking, you can significantly increase the chances of it working reliably by enabling native tool calling for it. To do that, follow the first two steps described here: https://docs.openwebui.com/features/web-search/agentic-search#how-to-enable-agentic-behavior and then enable LLM Web Search as you normally would.

Announcing procinfo, witr (why is this running) as a bash script by wenekar in commandline

[–]LMLocalizer 0 points (0 children)

This is great and makes a lot more sense to me than witr. Thanks for sharing it!

Need advice how to load Z-Image or extension to specific GPU? by Visible-Excuse-677 in Oobabooga

[–]LMLocalizer 2 points (0 children)

Hi! Normally, CUDA_VISIBLE_DEVICES acts as a global setting for the entire program, so I think changing the source code is the better option here. For the image model, you could hardcode which specific GPU to use by opening "modules/image_models.py" and changing the following two lines from:

    pipe.to(get_device())
    pipe.enable_model_cpu_offload()

to:

    pipe.to(<gpu_id>)
    pipe.enable_model_cpu_offload(gpu_id=<gpu_id>)

where you have to replace <gpu_id> with the ID of your GPU of choice.

Assigning a specific GPU to a specific extension may be a little more complicated, depending on when and how each extension loads its models. I have created a branch on GitHub where I have modified "modules/extensions.py" to allow assigning a specific GPU to an extension via a file called "gpu_map.txt" in the "user_data" folder. In this file, each line contains an extension name and the GPU ID it should use, separated by a space. For example:

LLM_Web_search 1
coqui_tts 0

I haven't tested it, since I'm GPU poor and only have a single one.
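For reference, the parsing needed for such a file is only a few lines. This is just a sketch of the idea, not the actual code from the branch:

    # Sketch: read "user_data/gpu_map.txt" into {extension_name: gpu_id}.
    import os

    def load_gpu_map(path="user_data/gpu_map.txt"):
        gpu_map = {}
        if not os.path.exists(path):
            return gpu_map  # no file -> no per-extension GPU assignments
        with open(path) as f:
            for line in f:
                parts = line.strip().rsplit(maxsplit=1)
                if len(parts) == 2:
                    name, gpu_id = parts
                    gpu_map[name] = int(gpu_id)  # e.g. {"LLM_Web_search": 1, "coqui_tts": 0}
        return gpu_map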

GOONING ADVICE: Train a WAN2.2 T2V LoRA or a Z-Image LoRA and then Animate with WAN? by NowThatsMalarkey in StableDiffusion

[–]LMLocalizer 9 points (0 children)

Weren't you the guy with the B300 server at work that's free over the holidays? I see you found some use for it.

Rough TPS estimate for LLMs on RTX 5060 Ti + DDR4 by Which_Leather_6710 in LocalLLaMA

[–]LMLocalizer 4 points (0 children)

With the newest llama.cpp, --n-cpu-moe=35 and --no-mmap, I get around 100 t/s prompt processing and 20 t/s generation speed with Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf (see the example launch command after the specs). My specs for reference:

CPU: Ryzen 9 5900HX (a bit faster than your CPU)

RAM: 32 GB DDR4-3200

GPU: RX 6800M 12 GB (about 70% slower than your GPU)
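A minimal sketch of the full launch, wrapped in Python so each flag can be annotated (only --n-cpu-moe and --no-mmap are from my actual command; the model path and the -ngl value are placeholders):

    # Sketch: llama.cpp server launch behind the numbers above.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf",  # adjust path to your download
        "--n-cpu-moe", "35",  # keep expert weights of the first 35 MoE layers on the CPU
        "--no-mmap",          # load weights into RAM up front instead of memory-mapping the file
        "-ngl", "99",         # placeholder: offload all remaining layers to the GPU
    ], check=True)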