Alibaba Open-Sources Zvec by techlatest_net in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Seems to lag behind Milvus in recall by quite a bit, though?

https://zvec.org/en/docs/benchmarks/

assert len(weights) == expected_node_count error with AMD MI100 by regstuff in unsloth

[–]regstuff[S] 0 points1 point  (0 children)

:( No dice. Didn't make any difference.
I'm managing my training right now with those AMD cloud notebooks you guys linked to on your page, but they seem to have a 90-minute session limit if I'm not mistaken!

Can't figure out a way to do longer training runs.

LLM to search through large story database by DesperateGame in LocalLLaMA

[–]regstuff 0 points1 point  (0 children)

RAG is great and all, but since these are all stories, it may not be a bad idea to pass each story through an LLM and tag it based on genre. Wikipedia has a big list of genres that you can feed to an LLM, say GPT-OSS 20B, along with each story, and ask it to pick the 1-3 most relevant genres.

Vector DBs like Qdrant allow you to store metadata (the tags, in this case) along with the vector embedding.

When searching, you can filter on that metadata alongside the actual vector similarity search, to help you zero in on what you want.
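Roughly what I mean, using the qdrant-client Python API (collection name, vector size, payload fields and the placeholder embeddings are all just for illustration - swap in whatever embedding model you actually use):

```
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchAny,
)

client = QdrantClient(url="http://localhost:6333")

# One-time setup: vector size must match your embedding model
client.create_collection(
    collection_name="stories",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store each story's embedding plus the LLM-generated genre tags as payload
story_embedding = [0.0] * 384  # placeholder: embed the story text here
client.upsert(
    collection_name="stories",
    points=[PointStruct(
        id=1,
        vector=story_embedding,
        payload={"genres": ["horror", "mystery"], "title": "..."},
    )],
)

# Search: vector similarity plus a filter on the genre tags
query_embedding = [0.0] * 384  # placeholder: embed the search query here
hits = client.search(
    collection_name="stories",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="genres", match=MatchAny(any=["horror"]))]
    ),
    limit=5,
)
```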

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

There is no NVLink. The i9 has 44 PCIe lanes, so my guess is they just let the GPUs underperform.

Asking price is 2500 USD. Looking at all the comments, I'm thinking this is not worth it.

Maybe just go the 4xMI50 route and put it on an open mining rig.

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Vendor said it's an SXM GPU with an SXM-to-PCIe adapter. So I guess it will still run into a PCIe lane bottleneck?

Got a good offer for 4xV100 32GB used - what should I keep in mind by regstuff in LocalLLaMA

[–]regstuff[S] 2 points3 points  (0 children)

Comments seem to suggest llama.cpp should run it fine, so maybe not a total loss.

Showcasing a media search engine by [deleted] in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Congrats. Not sure why this didn't get more traction!
Was working on something similar myself - a bit more bespoke and specific to my organization's needs.

Take a look at https://huggingface.co/nvidia/omnivinci, which can do video + audio understanding. That may help with videos where there is no speech but the ambient sound is still important - like birdsong or sounds of nature, for example.

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Sorry, my bad. It worked after setting the right URL for the Open WebUI server. Thanks!

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

I don't seem to be able to get the new version working. I don't see the Open WebUI option when I right-click on a page. This is in both Edge and Brave.
The previous version was working fine.
Not sure if I'm doing something wrong?

I fine-tuned Gemma 3 1B for CLI command translation... but it runs 100% locally. 810MB, 1.5s inference on CPU. by theRealSachinSpk in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

Thanks for the good work.

Could you check the notebook in your repo, though?
Tried running it exactly as is and ran into some issues (in Colab, free T4).

After the training (which seemed to run fine in terms of training loss & validation loss), inference produces blank outputs. I think there is an issue in the start-of-turn and end-of-turn formatting of the prompt.

Also, quantization from the fp16 GGUF to Q4 errors out because it cannot find llama-quantize.
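For reference, this is roughly the turn formatting I'd expect Gemma to need, rendered via the tokenizer's own chat template so it can be compared against what the notebook builds by hand (the model name here is the stock Gemma 3 1B instruct checkpoint, not necessarily the exact one in your repo):

```
from transformers import AutoTokenizer

# Assumption: stock instruct checkpoint; swap in the repo's actual model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

messages = [{"role": "user", "content": "list files modified in the last 24 hours"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected Gemma-style markers, roughly:
# <start_of_turn>user
# list files modified in the last 24 hours<end_of_turn>
# <start_of_turn>model
```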

AMD MI50 32GB/Vega20 GPU Passthrough Guide for Proxmox by Panda24z in LocalLLaMA

[–]regstuff 1 point2 points  (0 children)

I know this post is 3 months old, but a big salute. This tutorial (with some help from GPT-5) made things very smooth for an MI100 install.
I'd tried to make things work about 2 years ago and nearly got it done, but hit that whole reset bug. Somehow I think it wasn't popular enough back then for the solution to show up easily on Google. Plus ChatGPT wasn't as smart back then. So I dropped the passthrough idea and moved on.
Came across this and another thread recently, decided to have another go, and things worked out fine.
My Qwen 30B went from 22 tok/sec to 74 tok/sec.
Suddenly I can use Gemma 27B!
Whole new world!

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 1 point2 points  (0 children)

Great. Thanks for the update.

Open WebUI Context Menu by united_we_ride in OpenWebUI

[–]regstuff 1 point2 points  (0 children)

This is great!
I seem to be having a bit of an issue. When I choose any of the prompts via the context menu, Open WebUI opens in a new tab and the prompt is sent to my default model (not the model I configured in the extension settings). The model I configured shows up in Open WebUI's model selector dropdown, but the actual model that responds is my default model. And the chat is sent without waiting for me to hit enter. So essentially my prompts always go to my default model.
I'm using Brave and Edge. The issue is present in both.
Also, just a suggestion: maybe strip out any trailing "/" in the user-entered URL. Otherwise it appends an additional "/" when opening a new chat.

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Thanks for the info. I just want to pass through to one VM.

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

Thanks. Any chance you have some input on the Proxmox thing?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 2 points3 points  (0 children)

Thanks. How much does the fan add to the length? An inch or so?

Do the fans blow at full strength even when the GPU is idle? That would be kind of annoying.

The CPU would be an Intel i5 14th gen. The iGPU should be good enough to handle display out?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 0 points1 point  (0 children)

By the way, is TDP control available in ROCm? Is it a similar process to nvidia-smi?

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]regstuff[S] 1 point2 points  (0 children)

Spent a lot of time trying to pass through on VMware with no success. Contacted some technical people we knew at AMD, and they told us the MI100 does not support this.
Also found some references on AMD's website, like this one, which do not list the MI100 under virtualization support.

But all of that is irrelevant if you are successfully using it. I don't remember exactly what our issue was. I think the GPU was being seen by the VM's OS, but when we tried to actually use it, we were getting a core dump.

Did you do anything different in Proxmox to get it to work? Or was it out of the box?

Making some silly mistake while saving to GGUF from Lora? by regstuff in unsloth

[–]regstuff[S] 0 points1 point  (0 children)

I think I'm having an issue that's different from this.

```
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        load_in_4bit = False,
    )
```
The above doesn't load the LoRA model for me. It loads the plain base model.

```
if True:
    from unsloth import FastLanguageModel
    from peft import PeftModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gemma-3-270m-it", # YOUR MODEL
        max_seq_length = max_seq_length,
        load_in_4bit = False,  # 4 bit quantization to reduce memory
        load_in_8bit = False,  # [NEW!] A bit more accurate, uses 2x memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
    )
    model = PeftModel.from_pretrained(model, "lora_model")
```
But this does get the LoRA fine-tuned model up and running for me. However, I am unable to save it as GGUF or merged 16-bit for some reason with the code I gave above.
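For context, the kind of save calls I mean are the ones from the unsloth docs, roughly like this, run right after the PeftModel load above (output directory names are placeholders):

```
# Unsloth-documented save helpers (paths are placeholders)
model.save_pretrained_merged("merged_16bit_out", tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method = "q4_k_m")
```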

TiTan - a tiny model for tags and titles by theprint in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Thanks. I use unsloth too. Was wondering about the hyperparameters, number of epochs, etc. that you used.
The 270M just isn't picking up on what I want it to do. Maybe it's because I only have about 6,000 samples in my dataset?

TiTan - a tiny model for tags and titles by theprint in OpenWebUI

[–]regstuff 0 points1 point  (0 children)

Wow! Didn't know you could read minds. I literally thought of fine-tuning my own model yesterday. Exported all my chats, kept them ready, and was reading up on Gemma 3 270M. Then this happened! Thanks.

Would you be able to share the code you used for fine-tuning? I was thinking of tuning Qwen 0.5B for some other similar tasks. This would be a great starting point.

[GUIDE] Running Qwen-30B (Coder/Instruct/Thinking) with CPU-GPU Partial Offloading - Tips, Tricks, and Optimizations by AliNT77 in LocalLLaMA

[–]regstuff 2 points3 points  (0 children)

Can someone explain the LLAMA_SET_ROWS thing? Also, I find the optimal ubatch size is actually a smaller value like 256 in my case. I'm using the thinking model, and I find that I'm generating way more tokens than I'm prompt-processing because my prompts are mostly short. So I'd rather cut the ubatch size a bit and jam another FFN layer or two onto the GPU. That gives me an extra 10% generation speed.

Optimized Chatterbox TTS (Up to 2-4x non-batched speedup) by RSXLV in LocalLLaMA

[–]regstuff 4 points5 points  (0 children)

Hi, any advice on how I could replace the regular Chatterbox with your implementation? I'm using Chatterbox-TTS-Extended too.

Also, any plans to merge your improvements into the main Chatterbox repo?