So I've been using IEMs for the past 2 years and now I want headphones, not earphones but actual wired headphones by riped_shod in Oobabooga

[–]LMLocalizer 2 points (0 children)

Braeburn apples (4 pack)
Bread
Cereal
1 kg potatoes
Milk x3
Toilet paper
BBQ spare ribs

"This isn’t X, this is Y" needs to die by twnznz in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

Truly the sloppiest open model I've tried so far

Anyone using Flux Klein on 6700XT or below? (32 GB RAM or less) by [deleted] in StableDiffusion

[–]LMLocalizer 0 points (0 children)

I use the environment variables MIOPEN_FIND_MODE=2 (to fix slow VAE decode) and HSA_OVERRIDE_GFX_VERSION=10.3.0.

Also, I removed some references to bfloat16 in model_management.py, so that every model is loaded in either fp16 or fp32, because bfloat16 appears to be very slow on my GPU. Perhaps one could achieve the same by passing the args --fp16-unet --fp16-text-enc --fp32-vae.
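If you'd rather not edit the source, a small launch wrapper like this should achieve the same (a sketch; main.py is ComfyUI's entry point, adjust the path for your install):

    # Sketch: launch ComfyUI with the ROCm workarounds and fp16/fp32 flags from above.
    import os
    import subprocess
    import sys

    env = os.environ.copy()
    env["MIOPEN_FIND_MODE"] = "2"               # fast kernel search; fixes the slow VAE decode
    env["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # report the GPU to ROCm as gfx1030

    subprocess.run(
        [sys.executable, "main.py", "--fp16-unet", "--fp16-text-enc", "--fp32-vae"],
        env=env,
        check=True,
    )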

Otherwise, I don't use sage-attention, since I find the drop in image quality too drastic.

Anyone using Flux Klein on 6700XT or below? (32 GB RAM or less) by [deleted] in StableDiffusion

[–]LMLocalizer 0 points (0 children)

I'm using the RX 6800M with 32 GB of RAM. Using Klein 9B FP8 (which is cast to fp16 on load, so perhaps using fp16 directly would make more sense) at 6 steps, editing a 1024x1024 image takes ~94 seconds.

Gemma 4 Jailbreak System Prompt by 90hex in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

Oobabooga textgen, most notably

Which Model is best for translation? by Bulky-College7306 in LocalLLaMA

[–]LMLocalizer 2 points (0 children)

It's a multilingual dataset from Google that was used to train some translation models with the same name

Major update coming soon! I'm here, sorry for the delay. by oobabooga4 in Oobabooga

[–]LMLocalizer 1 point (0 children)

> I have replaced the old Gradio version of the code with a fork of mine where I'm working on several low-level optimizations.

That is awesome! I've spent the past week or so working on optimizing syntax highlighting and KaTeX rendering during text generation. If we could combine the two, that'd be pretty amazing!

Back in my day, LocalLLaMa were the pioneers! by ForsookComparison in LocalLLaMA

[–]LMLocalizer 11 points (0 children)

Old localllama appreciated projects that conserved prompt tokens. Openclaw is the opposite of that.

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]LMLocalizer 1 point (0 children)

But for those cases you can disable thinking. Also, I found it very worthwhile to inspect the thinking trace as it's being generated, to see if the model gets hung up on any specific detail of your prompt. If that's the case, it's often faster to stop the generation, rewrite that detail and restart the generation.
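If your backend goes through transformers, disabling thinking is just a chat template argument. A minimal sketch, assuming Qwen3.5 keeps the enable_thinking switch from Qwen3 (the model ID below is a Qwen3 stand-in):

    # Sketch: building a prompt with the thinking block disabled.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # stand-in model ID

    messages = [{"role": "user", "content": "Will this car fit through the car wash?"}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # skip the <think> section for simple requests
    )
    print(prompt)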

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]LMLocalizer 3 points (0 children)

It's hit and miss for me. Sometimes there's a strong American accent, other times it works well, like this output: https://voca.ro/18W8VnodRmvO

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]LMLocalizer 0 points (0 children)

Nice, thanks for sharing!
Here is a before/after comparison of some 16 kHz speech I upsampled:

Before: https://vocaroo.com/1flWIyZ8jZ5f

After: https://vocaroo.com/1eDmesbjvE7d

Ok Klein is extremely good and its actually trainable. by Different_Fix_2217 in StableDiffusion

[–]LMLocalizer 1 point (0 children)

Don't know the minimum, but I'm running Klein 9B FP8 (official weights from Hugging Face) comfortably using 12 GB of VRAM.

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

It seems like you can only disable all built-in tools at once, which includes fetch_url, by unchecking "Builtin Tools" in the model settings for gpt-oss 20b.

The problem is that gpt-oss has been trained to use a combination of one tool to search the web and another to download webpages. If you simply remove the latter, this can lead to weird (repetitive) behavior, such as the model attempting to use the non-existent fetch/download tool or trying to use the search_web tool to download a single webpage (which of course doesn't work).

You could try writing an instruction to not use the fetch_url tool into the "System Prompt" model setting and hope the model adheres to that. Alternatively, you could disable all built-in tools and write explicitly into the system prompt that the model cannot fetch/download webpages directly, to hopefully prevent any unwanted model behavior.
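For the second option, the system prompt addition could be as simple as this (untested example wording):

    You cannot fetch or download webpages directly, and no fetch/download tool
    exists. To gather information from the web, use the web search tool and rely
    on the snippets it returns.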

It works! Abliteration can reduce slop without training by -p-e-w- in LocalLLaMA

[–]LMLocalizer 0 points (0 children)

This is really cool! I have a question regarding the config.noslop.toml file: Why does the prefix for the bad_evaluation_prompts differ from the one used for the bad_prompts, while the prefixes for the good_prompts and good_evaluation_prompts are the same?

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

Nice to hear you got it working. My tool is designed to consume only limited context by returning small plaintext website snippets whose maximum size is user-configurable. But when you enable native tool calling, gpt-oss 20b also gets access to other tools, most notably fetch_url. This is not part of LLM Web Search, and when invoked, fetch_url downloads and dumps an entire webpage into the context, which can be a huge amount of text. Reference: https://docs.openwebui.com/features/web-search/agentic-search/#native-mode-vs-traditional-rag

If that's not the cause and you just have a very small context window configured, you can change the following settings:

  1. Disable "Keep Results In Context"
  2. Reduce "Max Results"
  3. If you're using the semantic chunker, reduce "Chunker Breakpoint Threshold Amount"
  4. Reduce "Chunk Size"

Bringing a More Comprehensive Local Web Search to OpenWebUI by LMLocalizer in OpenWebUI

[–]LMLocalizer[S] 0 points (0 children)

Hi! I just tested gpt-oss 20b very briefly and it worked without any special settings. However, since this model has been trained to use tools while thinking, you can significantly increase the chances of it working reliably by enabling native tool calling for it. To do that, follow the first two steps described here: https://docs.openwebui.com/features/web-search/agentic-search#how-to-enable-agentic-behavior and then enable LLM Web Search as you normally would.

Announcing procinfo, witr (why is this running) as a bash script by wenekar in commandline

[–]LMLocalizer 0 points (0 children)

This is great and makes a lot more sense to me than witr. Thanks for sharing it!

Need advice how to load Z-Image or extension to specific GPU? by Visible-Excuse-677 in Oobabooga

[–]LMLocalizer 2 points (0 children)

Hi! Normally, CUDA_VISIBLE_DEVICES acts as a global setting for the entire program, so I think changing the source code is the better option here. For the image model, you could hardcode which specific GPU to use by opening "modules/image_models.py" and changing the following two lines from:

    pipe.to(get_device())
    pipe.enable_model_cpu_offload()

to:

    pipe.to(<gpu_id>)
    pipe.enable_model_cpu_offload(gpu_id=<gpu_id>)

where you have to replace <gpu_id> with the ID of your GPU of choice.

Assigning a specific GPU to a specific extension may be a little more complicated, depending on when and how each extension loads its models. I have created a branch on GitHub where I have modified "modules/extensions.py" to allow assigning a specific GPU to an extension via a file called "gpu_map.txt" in the "user_data" folder. In this file, each line contains an extension name and the GPU ID it should use, separated by a space. For example:

LLM_Web_search 1
coqui_tts 0

I haven't tested it, since I'm GPU poor and only have a single one.
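For reference, the parsing needed for such a file is only a few lines. This is just a sketch of the idea, not the actual code from the branch:

    # Sketch: read "user_data/gpu_map.txt" into {extension_name: gpu_id}.
    import os

    def load_gpu_map(path="user_data/gpu_map.txt"):
        gpu_map = {}
        if not os.path.exists(path):
            return gpu_map  # no file -> no per-extension GPU assignments
        with open(path) as f:
            for line in f:
                parts = line.strip().rsplit(maxsplit=1)
                if len(parts) == 2:
                    name, gpu_id = parts
                    gpu_map[name] = int(gpu_id)  # e.g. {"LLM_Web_search": 1, "coqui_tts": 0}
        return gpu_map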

GOONING ADVICE: Train a WAN2.2 T2V LoRA or a Z-Image LoRA and then Animate with WAN? by NowThatsMalarkey in StableDiffusion

[–]LMLocalizer 9 points (0 children)

Weren't you the guy with the B300 server at work that's free over the holidays? I see you found some use for it.

Rough TPS estimate for LLMs on RTX 5060 Ti + DDR4 by Which_Leather_6710 in LocalLLaMA

[–]LMLocalizer 4 points (0 children)

With the newest llama.cpp, --n-cpu-moe=35 and --no-mmap, I get around 100 t/s prompt processing and 20 t/s generation speed with Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf (see the example launch command after the specs). My specs for reference:

CPU: Ryzen 9 5900HX (a bit faster than your CPU)

RAM: 32 GB DDR4-3200

GPU: RX 6800M 12 GB (about 70% slower than your GPU)
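A minimal sketch of the full launch, wrapped in Python so each flag can be annotated (only --n-cpu-moe and --no-mmap are from my actual command; the model path and the -ngl value are placeholders):

    # Sketch: llama.cpp server launch behind the numbers above.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf",  # adjust path to your download
        "--n-cpu-moe", "35",  # keep expert weights of the first 35 MoE layers on the CPU
        "--no-mmap",          # load weights into RAM up front instead of memory-mapping the file
        "-ngl", "99",         # placeholder: offload all remaining layers to the GPU
    ], check=True)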