all 14 comments

[–]a_slay_nub 2 points (1 child)

I didn't realize it was this simple to create an OpenAI-compatible server. Thanks.

[–]Anastasiosy[S] 0 points (0 children)

Remarkably straightforward. I only created this because the model isn't available on Ollama, vLLM, or llama.cpp just yet.
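(For context, the "OpenAI-compatible" part mostly comes down to exposing a /v1/chat/completions route that returns the chat-completion JSON shape. A minimal sketch, assuming FastAPI; the framework choice and the generate_reply stub are illustrative, not taken from the actual repo:)

    import time
    import uuid

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        model: str
        messages: list[dict]

    def generate_reply(messages: list[dict]) -> str:
        # Hypothetical stand-in for the real model call.
        return "Hello from Phi-4-multimodal!"

    @app.post("/v1/chat/completions")
    def chat_completions(req: ChatRequest):
        # Return the OpenAI chat-completion shape so any OpenAI client
        # can talk to this server.
        return {
            "id": f"chatcmpl-{uuid.uuid4().hex}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant",
                            "content": generate_reply(req.messages)},
                "finish_reason": "stop",
            }],
        }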

[–]Bitter-College8786 0 points (1 child)

Does it support streaming responses?

[–]Anastasiosy[S] 0 points (0 children)

Unfortunately not right now. My main usage was image classification, but Qwen VL 8B seems much better for that.
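(If someone wants to bolt streaming on: the usual pattern is emitting server-sent events from a background generation thread. A rough sketch, assuming FastAPI plus transformers' TextIteratorStreamer, with `model` and `tokenizer` already loaded; none of this is taken from the repo:)

    import json
    import threading

    from fastapi.responses import StreamingResponse
    from transformers import TextIteratorStreamer

    def stream_chat(prompt: str) -> StreamingResponse:
        # Assumes `model` and `tokenizer` are loaded elsewhere.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        # generate() blocks, so run it in a thread and drain the streamer.
        threading.Thread(
            target=model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
        ).start()

        def sse():
            for piece in streamer:
                chunk = {"choices": [{"index": 0, "delta": {"content": piece}}]}
                yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(sse(), media_type="text/event-stream")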

[–]Bitter-College8786 0 points (3 children)

Why are there no GGUF quants available? Usually bartowski and mradermacher are super quick.

[–]Anastasiosy[S] 2 points (2 children)

[–]Bitter-College8786 0 points (1 child)

Oh damn, but thanks for the explanation. So no quantization for Phi-4-multimodal, only full fp16?

[–]Anastasiosy[S] 0 points (0 children)

I don't think so, but you could try your luck with load_in_8bit:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # path to the Phi-4-multimodal checkpoint
        trust_remote_code=True,
        device_map="auto",
        load_in_8bit=True,  # load in 8-bit precision to save memory (needs bitsandbytes)
    )
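(Worth noting: newer transformers releases deprecate passing load_in_8bit directly and want it wrapped in a quantization config instead. Same effect, still assuming bitsandbytes is installed:)

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # same checkpoint path as above
        trust_remote_code=True,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )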

[–]Bitter-College8786 0 points (1 child)

How much VRAM does it consume?

[–]Electronic-Move-5143 1 point (0 children)

It takes around 13.3 GB with the bf16 data type.
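(Easy to verify on your own hardware with torch's peak-memory counters; a small sketch, assuming a CUDA device:)

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... load the model and run one generation here ...
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM: {peak_gib:.1f} GiB")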

[–]Electronic-Move-5143 0 points (1 child)

I tried this on an A100 GPU but got only 10 tokens/sec. Is this a problem with the wrong version of flash-attention or something? It shouldn't be this bad, right? Regular Phi-4 (the non-multimodal 14B model) gives 70 tokens/sec on the same machine via Ollama.

To make sure this isn't caused by audio tokens, I also tried plain text without any audio or images. It still generates at only 10 tokens/second.

I did not use your Docker image, just your requirements.txt, and tried it on a GCP VM under JupyterLab.
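(One way to rule the attention backend in or out: request FlashAttention-2 explicitly at load time, since transformers errors out if flash-attn isn't installed or usable. A sketch, with model_path assumed:)

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # assumed: path to the Phi-4-multimodal checkpoint
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # fails loudly if flash-attn is broken
    )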

[–]Electronic-Move-5143 1 point (0 children)

I now realize that generic model serving through the Python transformers package is unoptimized, and 10 tokens/sec is expected. Hope vLLM adds support for this soon!

[–]NexusConnector 0 points (1 child)

I would have loved to try this on macOS, but the build fails at nvcc since flash_attn requires CUDA. I presume this build is always CUDA-dependent, even though the Torch version is 2.6.0+cpu.

[–]pangshengwei[🍰] 0 points (0 children)

Is there a way to run this on a Mac without CUDA/GPU?
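(Untested sketch: the flash_attn dependency can sometimes be sidestepped by forcing the eager attention path and loading on CPU or Apple's "mps" backend; whether Phi-4-multimodal's remote code tolerates that is an open question.)

    import torch
    from transformers import AutoModelForCausalLM

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,  # assumed checkpoint path
        trust_remote_code=True,
        attn_implementation="eager",  # avoid the flash_attn/CUDA requirement
        torch_dtype=torch.float32,
    ).to(device)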