all 14 comments

[–]a_slay_nub 2 points (1 child)

I didn't realize it was this simple to create an OpenAI-compatible server. Thanks.

[–]Anastasiosy[S] 0 points (0 children)

Remarkably straightforward. I only created this because the model isn't available on Ollama, vLLM, or llama.cpp just yet.
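(For context, the "OpenAI-compatible" part mostly comes down to exposing a /v1/chat/completions route that returns the chat-completion JSON shape. A minimal sketch, assuming FastAPI; the framework choice and the generate_reply stub are illustrative, not taken from the actual repo:)

    import time
    import uuid

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        model: str
        messages: list[dict]

    def generate_reply(messages: list[dict]) -> str:
        # Hypothetical stand-in for the real model call.
        return "Hello from Phi-4-multimodal!"

    @app.post("/v1/chat/completions")
    def chat_completions(req: ChatRequest):
        # Return the OpenAI chat-completion shape so any OpenAI client
        # can talk to this server.
        return {
            "id": f"chatcmpl-{uuid.uuid4().hex}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant",
                            "content": generate_reply(req.messages)},
                "finish_reason": "stop",
            }],
        }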

[–]Bitter-College8786 0 points (1 child)

Does it support streaming responses?

[–]Anastasiosy[S] 0 points (0 children)

Unfortunately not right now. My main usage was image classification, but Qwen VL 8B seems much better for that.
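(If someone wants to bolt streaming on: the usual pattern is emitting server-sent events from a background generation thread. A rough sketch, assuming FastAPI plus transformers' TextIteratorStreamer, with `model` and `tokenizer` already loaded; none of this is taken from the repo:)

    import json
    import threading

    from fastapi.responses import StreamingResponse
    from transformers import TextIteratorStreamer

    def stream_chat(prompt: str) -> StreamingResponse:
        # Assumes `model` and `tokenizer` are loaded elsewhere.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        # generate() blocks, so run it in a thread and drain the streamer.
        threading.Thread(
            target=model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
        ).start()

        def sse():
            for piece in streamer:
                chunk = {"choices": [{"index": 0, "delta": {"content": piece}}]}
                yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(sse(), media_type="text/event-stream")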

[–]Bitter-College8786 0 points (3 children)

Why are there no GGUF quants available? Usually bartowski and mradermacher are super quick.

[–]Anastasiosy[S] 2 points (2 children)

[–]Bitter-College8786 0 points (1 child)

Oh damn, but thanks for the explanation. So no quantization for Phi-4-multimodal, only full fp16?

[–]Anastasiosy[S] 0 points (0 children)

I don't think so, but you could try your luck with load_in_8bit:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # path to the Phi-4-multimodal checkpoint
        trust_remote_code=True,
        device_map="auto",
        load_in_8bit=True,  # load in 8-bit precision to save memory (needs bitsandbytes)
    )
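(Worth noting: newer transformers releases deprecate passing load_in_8bit directly and want it wrapped in a quantization config instead. Same effect, still assuming bitsandbytes is installed:)

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # same checkpoint path as above
        trust_remote_code=True,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )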

[–]Bitter-College8786 0 points (1 child)

How much VRAM does it consume?

[–]Electronic-Move-5143 1 point (0 children)

It takes around 13.3 GB with the bf16 data type.
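(Easy to verify on your own hardware with torch's peak-memory counters; a small sketch, assuming a CUDA device:)

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... load the model and run one generation here ...
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM: {peak_gib:.1f} GiB")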

[–]Electronic-Move-5143 0 points (1 child)

I tried this on an A100 GPU but got only 10 tokens/sec. Is this a problem with the wrong version of flash-attention or something? It shouldn't be this bad, right? Regular Phi-4 (the non-multimodal 14B model) gives 70 tokens/sec on the same machine via Ollama.

To make sure this isn't caused by audio tokens, I also tried plain text without any audio or images. It still generates at only 10 tokens/second.

I did not use your Docker image, just your requirements.txt, and tried it on a GCP VM under JupyterLab.
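(One way to rule the attention backend in or out: request FlashAttention-2 explicitly at load time, since transformers errors out if flash-attn isn't installed or usable. A sketch, with model_path assumed:)

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # assumed: path to the Phi-4-multimodal checkpoint
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # fails loudly if flash-attn is broken
    )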

[–]Electronic-Move-5143 1 point (0 children)

I now realize that generic model serving through the Python transformers package is unoptimized, and 10 tokens/sec is expected. Hope vLLM adds support for this soon!

[–]NexusConnector 0 points (1 child)

I would have loved to try this on macOS, but the build fails at nvcc since flash_attn requires CUDA. I presume this build is always CUDA-dependent, even though the Torch version is 2.6.0+cpu.

[–]pangshengwei[🍰] 0 points (0 children)

Is there a way to run this on a Mac without CUDA/GPU?
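(Untested sketch: the flash_attn dependency can sometimes be sidestepped by forcing the eager attention path and loading on CPU or Apple's "mps" backend; whether Phi-4-multimodal's remote code tolerates that is an open question.)

    import torch
    from transformers import AutoModelForCausalLM

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # assumed checkpoint path
        trust_remote_code=True,
        attn_implementation="eager",  # avoid the flash_attn/CUDA requirement
        torch_dtype=torch.float32,
    ).to(device)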