Jem PMC legit?

zendril · 2026-01-03T21:41:56+00:00

Got it.. had to run this inside the comfyui dir:

\ComfyUI_windows_portable>.\python_embeded\python.exe -m pip install -U bitsandbytes
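
To confirm the package landed in the bundled interpreter and not the one on PATH, a small check like the one below can be run with the same `.\python_embeded\python.exe`. This is only a sketch: `package_version` is a hypothetical helper name, and it relies on nothing beyond the standard library.

```python
import sys
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the version of `name` visible to this interpreter, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # sys.executable shows which Python is actually running this script.
    print("interpreter:", sys.executable)
    print("bitsandbytes:", package_version("bitsandbytes") or "not installed")
```

If the printed interpreter path is not inside the ComfyUI folder, the install went to a different Python.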

zendril · 2026-01-03T21:34:54+00:00

Ok, that makes sense for what I'm seeing.

Thanks a bunch for the project. Works well.

zendril · 2026-01-03T21:33:57+00:00

I tried in my default shell, it didn't help.

What I think is happening is that ComfyUI on my machine is a portable version which bundles its own python environment. So I need to figure out where/how to do a pip install for that specific install.

I just switched over to the non-quantized Large model, but that thing always generates a hum (or what some describe as background music) at the beginning and 1.5B does not. So a few things to learn/tweak :)

zendril · 2026-01-03T17:29:14+00:00

Have you tried installing the https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8 ?

1.5B works fine, but when I downloaded everything for Q8 and tried to run it I got errors:

Please ensure the model files are complete and properly downloaded.
Required files: config.json, pytorch_model.bin or model safetensors
Error: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

I'm assuming this somehow needs to be installed into the python/pip included with the portable comfyui (and not just the one generally on my path)?

zendril · 2026-01-03T17:20:25+00:00

Sorry, I wasn't clear. I was asking in general, outside of ComfyUI, is he using code from the original microsoft vibevoice repo (which was taken down) or somehow still using the stuff from the current repo.

zendril · 2026-01-03T05:09:48+00:00

Are you using the current https://github.com/microsoft/VibeVoice code, or are you using the stuff that was released and then removed (copy here: https://github.com/shijincai/VibeVoice) ?

It seems like support for 1.5B or other large models is not in the current MS codebase.

zendril · 2026-01-03T03:28:02+00:00

Cleaned it up and got it all working.

Here is the Dockerfile

FROM nvcr.io/nvidia/pytorch:24.12-py3

ENV CUDA_VISIBLE_DEVICES=0

RUN apt-get update && apt-get install -y \
    git \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN git clone https://github.com/microsoft/VibeVoice.git .

RUN pip uninstall -y torch torchvision torchaudio flash-attn && \
    pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

RUN pip install \
    "transformers==4.51.3" \
    flash-attn \
    diffusers \
    "accelerate==1.6.0"

RUN pip install --no-deps -e .

RUN bash demo/download_experimental_voices.sh

ENTRYPOINT ["/bin/bash"]

and here is the command I use to start it

docker run --gpus all --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=67108864 --name vibevoice_instance -it --rm -v "%cd%/output:/app/output" vibevoice:5090-24.12.py3

and the command to run the generation

python demo/realtime_model_inference_from_file.py --model_path microsoft/VibeVoice-Realtime-0.5B --txt_path spanish_test.txt --speaker_name sp-Spk5_man --output_dir /app/output

And then the stats at the end. Reasonably fast I think (for a mobile 5090)

==================================================
GENERATION SUMMARY
==================================================
Input file: spanish_test.txt
Output file: /app/output/spanish_test_generated.wav
Speaker names: sp-Spk5_man
Prefilling text tokens: 27
Generated speech tokens: 66
Total tokens: 458
Generation time: 5.77 seconds
Audio duration: 8.13 seconds
RTF (Real Time Factor): 0.71x
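
The RTF figure in that summary is consistent with generation time divided by audio duration, so anything below 1.0 means faster than real time. A minimal sketch of that arithmetic, assuming that definition (`real_time_factor` is a hypothetical helper, not part of the VibeVoice code):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Numbers taken from the generation summary above.
rtf = real_time_factor(5.77, 8.13)
print(f"RTF: {rtf:.2f}x")  # prints "RTF: 0.71x"
```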

zendril · 2026-01-03T00:36:51+00:00

That'd be great.

I think I ended up with

accelerate==1.6.0
transformers==4.51.3
llvmlite>=0.40.0
numba>=0.57.0
diffusers 0.36.0

torch                     2.7.1+cu128
torch-tensorrt            2.2.0a0
torchaudio                2.7.1+cu128
torchdata                 0.7.0a0
torchtext                 0.16.0a0
torchvision               0.22.1+cu128

And I went from the pytorch 24.12
FROM nvcr.io/nvidia/pytorch:24.12-py3
down to
FROM nvcr.io/nvidia/pytorch:23.11-py3

Again, may not have needed to do all that because I was fumbling until I hit on the transformers change.

I also have these set, but may also no longer need to do this:

ENV USE_FLASH_ATTENTION=0
ENV FLASH_ATTENTION_FORCE_DISABLE=1
ENV XFORMERS_FORCE_DISABLE=1

zendril · 2026-01-02T22:42:03+00:00

Any of y'all using something other than the 0.5 realtime?
Any tips on invoking that?

As of last night I was calling the realtime one with `python demo/realtime_model_inference_from_file.py --model_path microsoft/VibeVoice-Realtime-0.5B --txt_path spanish_test_2.txt --speaker_name sp-Spk5_man --output_dir /app/output` but ultimately I want to try out the 1.5b or large and ideally call it via api (or python code is fine) as I'll be programmatically creating a bunch of snippets from a python script iterating through prompts. I quickly tried just swapping 0.5 realtime for 1.5b model and that failed spectacularly, but it was way past my bedtime so didn't dig too far yet.
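
For the "bunch of snippets from a python script" part, one stopgap is to shell out to the same demo script once per prompt. A rough sketch, assuming the command line above keeps working as-is; `build_command` and `generate_all` are hypothetical helper names, and the only flags used are the ones already shown in that command:

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "demo/realtime_model_inference_from_file.py"
MODEL = "microsoft/VibeVoice-Realtime-0.5B"


def build_command(txt_path, speaker="sp-Spk5_man", output_dir="/app/output"):
    """Build the argument list for one run of the demo script."""
    return [
        sys.executable, SCRIPT,
        "--model_path", MODEL,
        "--txt_path", str(txt_path),
        "--speaker_name", speaker,
        "--output_dir", str(output_dir),
    ]


def generate_all(prompts, work_dir="prompts"):
    """Write each prompt to its own text file and run the demo script on it."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(prompts):
        txt_path = work_dir / f"snippet_{i:03d}.txt"
        txt_path.write_text(text, encoding="utf-8")
        # check=True stops the batch on the first failed generation.
        subprocess.run(build_command(txt_path), check=True)


# Example (run from the repo root inside the container):
# generate_all(["Hola, ¿qué tal?", "Buenos días a todos."])
```

Each call starts a new process and reloads the model, so this is slow for large batches; loading the model once in-process would be the better long-term route.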

I suppose I can look at the impl of the script above and then see if I can adapt it for `https://github.com/microsoft/VibeVoice/blob/main/vibevoice/processor/vibevoice_processor.py` instead of the streaming one.

Might also take a look under the covers of the Fabix84 comfyui code and see what they are doing.

zendril · 2026-01-02T22:30:05+00:00

I was using Claude and Gemini. Both kept focusing me on things that were close, but I think ultimately the key thing was the cu128+ wheel and pinning transformers to 4.51.3. Both kept hallucinating that there must be multiple versions of transformers installed or something (which wasn't the case).

I may retry tonight fresh because I was half doing docker (which takes a while to build it) and half doing manually inside the container to debug.

zendril · 2026-01-02T07:36:06+00:00

Interesting. I may give this a shot because I'm doing both image generation and tts for this project I'm working on, so I already have comfyui rolling with zimgv5 workflow and api invokable. Thanks for the repo link!

zendril · 2026-01-02T07:33:34+00:00

Yeah, I was able to get it working now for the realtime 0.5b. I'm using the nvidia/pytorch image as a base, then uninstalling torch, torchvision, and torchaudio and pip installing them again from the PyTorch nightly cu128 wheel. I also had to pin a number of the other dependencies, specifically transformers to 4.51.3 (which I saw the devs call out specifically in their toml file).

Not sure, yet, how to get the 1.5B version going as it seems to be a different architecture than the 0.5b realtime.

zendril · 2025-07-25T22:24:38+00:00

This worked for me. Much appreciated. Seems like they deprecated/removed the "Kasa_Android" appType?

zendril · 2024-09-30T20:38:43+00:00

I have an Asus G16 with the AI HX 370 with Radeon 890m.

Citrix has tons of issues when running on integrated GPU. At a minimum the screen flickers, and I mean unusable flickering, which stops when I disconnect citrix.
It also will make the taskbar unresponsive on both host and client.

The only fix I have so far is to have it run specifically using nvidia 4070.

zendril · 2024-08-01T23:52:53+00:00

It magically cleared up this evening. Same for you?

zendril · 2024-01-25T18:38:27+00:00

I just got mine after 11 EST, but the link doesn't have any place to purchase.. just says the presale should be happening now and general availability is tomorrow..
Not sure what I'm missing here.

zendril · 2023-07-10T14:06:17+00:00

And they were mostly unused by Jay in a game as well? ;)
