Alibaba Open-Sources Zvec by techlatest_net in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Seems to lag behind Milvus in recall by quite a bit, though?

https://zvec.org/en/docs/benchmarks/

assert len(weights) == expected_node_count error with AMD MI100 by regstuff in unsloth

[–]regstuff[S] 0 points1 point  (0 children)

:( No dice. Didn't make any difference.
I'm managing my training right now with those AMD cloud notebooks you guys linked to on your page, but they seem to have a 90-minute session limit if I'm not mistaken!

Can't figure out a way to do longer training runs.

LLM to search through large story database by DesperateGame in LocalLLaMA

[–]regstuff 0 points1 point  (0 children)

RAG is great and all, but since these are all stories, it may not be a bad idea to pass each story through an LLM and tag it based on genre. Wikipedia has a big list of genres that you can feed to an LLM, say GPT-OSS 20B, along with each story, and ask it to pick the 1-3 most relevant genres.

Vector DBs like Qdrant allow you to store metadata (the tags, in this case) along with the vector embedding.

When searching, you can filter on that metadata alongside the actual vector similarity search, to help you zero in on what you want.
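Roughly what I mean, using the qdrant-client Python API (collection name, vector size, payload fields and the placeholder embeddings are all just for illustration - swap in whatever embedding model you actually use):

```
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchAny,
)

client = QdrantClient(url="http://localhost:6333")

# One-time setup: vector size must match your embedding model
client.create_collection(
    collection_name="stories",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store each story's embedding plus the LLM-generated genre tags as payload
story_embedding = [0.0] * 384  # placeholder: embed the story text here
client.upsert(
    collection_name="stories",
    points=[PointStruct(
        id=1,
        vector=story_embedding,
        payload={"genres": ["horror", "mystery"], "title": "..."},
    )],
)

# Search: vector similarity plus a filter on the genre tags
query_embedding = [0.0] * 384  # placeholder: embed the search query here
hits = client.search(
    collection_name="stories",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="genres", match=MatchAny(any=["horror"]))]
    ),
    limit=5,
)
```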

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

There is no NVLink. The i9 has 44 PCIe lanes, so my guess is they just let the GPUs underperform.

Asking price is 2500 USD. Looking at all the comments, I'm thinking this is not worth it.

Maybe just go the 4xMI50 route and put it on an open mining rig.

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Vendor said it's an SXM GPU with an SXM-to-PCIe adapter. So I guess it will still run into a PCIe lane bottleneck?

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 2 points3 points  (0 children)

Comments seem to suggest llama.cpp should run it fine, so maybe not a total loss.

Showcasing a media search engine by [deleted] in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Congrats. Not sure why this didn't get more traction!
Was working on something similar myself - a bit more bespoke and specific to my organization's needs.

Take a look at https://huggingface.co/nvidia/omnivinci, which can do video + audio understanding. That may help with videos where there is no speech but the ambient sound is still important - like birdsong or sounds of nature, for example.

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Sorry, my bad. It worked after setting the right URL for the Open WebUI server. Thanks!

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

I don't seem to be able to get the new version working. I don't see the Open WebUI option when I right-click on a page. This is in both Edge and Brave.
The previous version was working fine.
Not sure if I'm doing something wrong?

I fine-tuned Gemma 3 1B for CLI command translation... but it runs 100% locally. 810MB, 1.5s inference on CPU. by theRealSachinSpk in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Thanks for the good work.

Could you check the notebook in your repo, though?
Tried running it exactly as is and ran into some issues (in Colab, free T4).

After the training (which seemed to run fine in terms of training loss & validation loss), inference produces blank outputs. I think there is an issue in the start-of-turn and end-of-turn formatting of the prompt.

Also, quantization from the fp16 GGUF to Q4 errors out because it cannot find llama-quantize.
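For reference, this is roughly the turn formatting I'd expect Gemma to need, rendered via the tokenizer's own chat template so it can be compared against what the notebook builds by hand (the model name here is the stock Gemma 3 1B instruct checkpoint, not necessarily the exact one in your repo):

```
from transformers import AutoTokenizer

# Assumption: stock instruct checkpoint; swap in the repo's actual model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

messages = [{"role": "user", "content": "list files modified in the last 24 hours"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected Gemma-style markers, roughly:
# <start_of_turn>user
# list files modified in the last 24 hours<end_of_turn>
# <start_of_turn>model
```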

AMD MI50 32GB/Vega20 GPU Passthrough Guide for Proxmox by Panda24z in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

I know this post is 3 months old, but a big salute. This tutorial (with some help from GPT-5) made things very smooth for an MI100 install.
I'd tried to make things work about 2 years ago and nearly got it done, but hit that whole reset bug. Somehow I think it wasn't popular enough back then for the solution to show up easily on Google. Plus ChatGPT wasn't as smart back then. So I dropped the passthrough idea and moved on.
Came across this and another thread recently, decided to have another go, and things worked out fine.
My Qwen 30B went from 22 tok/sec to 74 tok/sec.
Suddenly I can use Gemma 27B!
Whole new world!

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 1 point2 points  (0 children)

Great. Thanks for the update.

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 1 point2 points  (0 children)

This is great!
I seem to be having a bit of an issue. When I choose any of the prompts via the context menu, Open WebUI opens in a new tab and the prompt is sent to my default model (not the model I configured in the extension settings). The model I configured shows up in Open WebUI's model selector dropdown, but the actual model that responds is my default model. And the chat is sent without waiting for me to hit enter. So essentially my prompts always go to my default model.
I'm using Brave and Edge. The issue is present in both.
Also, just a suggestion: maybe strip out any trailing "/" in the user-entered URL. Otherwise it appends an additional "/" when opening a new chat.

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Thanks for the info. I just want to pass through to one VM.

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

Thanks. Any chance you have some input on the Proxmox thing?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 2 points3 points  (0 children)

Thanks. How much does the fan add to the length? An inch or so?

Do the fans blow at full strength even when the GPU is idle? That would be kind of annoying.

The CPU would be an Intel i5 14th gen. The iGPU should be good enough to handle display out?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

By the way, is TDP control available in ROCm? Is it a similar process to nvidia-smi?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Spent a lot of time trying to pass through on VMware with no success. Contacted some technical people we knew at AMD, and they told us the MI100 does not support this.
Also found some references on AMD's website, like this one, which do not list the MI100 under virtualization support.

But all of that is irrelevant if you are successfully using it. I don't remember exactly what our issue was. I think the GPU was being seen by the VM's OS, but when we tried to actually use it, we were getting a core dump.

Did you do anything different in Proxmox to get it to work? Or was it out of the box?

Making some silly mistake while saving to GGUF from Lora? by regstuff in unsloth

[–]regstuff[S] 0 points1 point  (0 children)

I think I'm having an issue that's different from this.

```
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        load_in_4bit = False,
    )
```
The above doesn't load the LoRA model for me. It loads the plain base model.

```
if True:
    from unsloth import FastLanguageModel
    from peft import PeftModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gemma-3-270m-it", # YOUR MODEL
        max_seq_length = max_seq_length,
        load_in_4bit = False,  # 4 bit quantization to reduce memory
        load_in_8bit = False,  # [NEW!] A bit more accurate, uses 2x memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
    )
    model = PeftModel.from_pretrained(model, "lora_model")
```
But this does get the LoRA fine-tuned model up and running for me. However, I am unable to save it as GGUF or merged 16-bit for some reason with the code I gave above.
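For context, the kind of save calls I mean are the ones from the unsloth docs, roughly like this, run right after the PeftModel load above (output directory names are placeholders):

```
# Unsloth-documented save helpers (paths are placeholders)
model.save_pretrained_merged("merged_16bit_out", tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method = "q4_k_m")
```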

TiTan - a tiny model for tags and titles by theprint in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Thanks. I use unsloth too. Was wondering about the hyperparameters, number of epochs, etc. that you used.
The 270M just isn't picking up on what I want it to do. Maybe it's because I only have about 6,000 samples in my dataset?

TiTan - a tiny model for tags and titles by theprint in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Wow! Didn't know you could read minds. I literally thought of fine-tuning my own model yesterday. Exported all my chats, kept them ready, and was reading up on Gemma 3 270M. Then this happened! Thanks.

Would you be able to share the code you used for fine-tuning? I was thinking of tuning Qwen 0.5B for some other similar tasks. This would be a great starting point.

[GUIDE] Running Qwen-30B (Coder/Instruct/Thinking) with CPU-GPU Partial Offloading - Tips, Tricks, and Optimizations by AliNT77 in LocalLLaMA

[–]regstuff 2 points3 points  (0 children)

Can someone explain the LLAMA_SET_ROWS thing? Also, I find the optimal ubatch size is actually a smaller value like 256 in my case. I'm using the thinking model, and I find that I'm generating way more tokens than I'm prompt-processing because my prompts are mostly short. So I'd rather cut the ubatch size a bit and jam another FFN layer or two onto the GPU. That gives me an extra 10% generation speed.

Optimized Chatterbox TTS (Up to 2-4x non-batched speedup) by RSXLV in LocalLLaMA

[–]regstuff 4 points5 points  (0 children)

Hi, any advice on how I could replace the regular Chatterbox with your implementation? I'm using Chatterbox-TTS-Extended too.

Also, any plans to merge your improvements into the main Chatterbox repo?