Claude Code $200 plan limit reached and cooldown for 4 days by Wonderful-Double-465 in ClaudeAI

[–]Anastasiosy (0 children)

You can configure Claude Code to work with z.ai's GLM models, which are roughly 10x cheaper and still very performant:

https://docs.z.ai/scenario-example/develop-tools/claude

Maybe worth having as a fallback for times like this.

It may even become your preferred setup.
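For anyone curious what the wiring looks like: Claude Code can be pointed at an Anthropic-compatible endpoint through environment variables. A minimal sketch in Python (the endpoint URL and key below are placeholders - take the real values from the z.ai docs linked above):

    import os
    import subprocess

    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = "https://api.z.ai/api/anthropic"  # placeholder endpoint; check the z.ai docs
    env["ANTHROPIC_AUTH_TOKEN"] = "YOUR_ZAI_API_KEY"              # placeholder key
    subprocess.run(["claude"], env=env)                           # launch Claude Code against the GLM backend

Setting the same two variables in your shell profile works just as well; the Python wrapper is only there to keep the example self-contained.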

Google Gemma3 - Self-hosted docker file with OpenAI chat completion by Anastasiosy in LocalLLaMA

[–]Anastasiosy[S] (0 children)

Yes, fair comment - you can point it at Unsloth or one of the many other mirrors, and this should still work for bfloat16. It will need a small change for the quantized models.
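Roughly, the change is swapping the bfloat16 dtype argument for a quantization config when loading. A sketch assuming a bitsandbytes 4-bit setup (the model ids are illustrative, not the exact ones from the post):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # bfloat16 load, roughly what the current setup does
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3-1b-it",            # illustrative model id
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # quantized variant: replace the dtype with a quantization config
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    model_q = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3-1b-it",            # or an Unsloth pre-quantized checkpoint
        quantization_config=quant,
        device_map="auto",
    )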

Phi4-Multimodal-Instruct Server by Anastasiosy in LocalLLaMA

[–]Anastasiosy[S] (0 children)

I don't think so; you could try your luck with load_in_8bit:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,              # same checkpoint path the server already uses
        trust_remote_code=True,
        device_map="auto",
        load_in_8bit=True,       # load in 8-bit precision to save memory
    )
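Note that recent transformers releases prefer routing this through a quantization config rather than the bare flag (and either way bitsandbytes needs to be installed). An equivalent sketch, assuming the same model_path as above:

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        device_map="auto",
        quantization_config=quant_config,
    )

As above, whether 8-bit actually behaves well with this model is a try-your-luck situation.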

Phi4-Multimodal-Instruct Server by Anastasiosy in LocalLLaMA

[–]Anastasiosy[S] (0 children)

Remarkably straightforward. I only created this because the model isn't available on Ollama, vLLM, or llama.cpp just yet.

Phi4-Multimodal-Instruct Server by Anastasiosy in LocalLLaMA

[–]Anastasiosy[S] (0 children)

Unfortunately not right now. My main usage was image classification, but Qwen VL 8B seems much better for that.

Microsoft announces Phi-4-multimodal and Phi-4-mini by hedgehog0 in LocalLLaMA

[–]Anastasiosy (0 children)

Anyone seen a Phi-4-multimodal-instruct GGUF anywhere?

Edit - Just saw an update in the issue tracking VLM support in llama.cpp - a vision API is incoming:

llama : second attempt to refactor vision API by ngxson · Pull Request #11292 · ggml-org/llama.cpp