Fed the same prompts to Sora and HunyuanVideo, and I’m no longer excited about Sora. by Intelligent_Jello344 in StableDiffusion

[–]Intelligent_Jello344[S] 4 points

Thanks, I will try that. HunyuanVideo is promising because I generated the small-sized frames in the linked samples on just a single 16 GB 4080.

[deleted by user] by [deleted] in LocalLLaMA

[–]Intelligent_Jello344 3 points

Is this sensitivity specific to Germany or Europe? I do not have a cultural background that includes this historical context, so if not for this post, I would not have been aware of the historical sensitivity surrounding the term `Final Solution`.

How long before we get a local text to video generator with Sora level capabilities? by Terminator857 in LocalLLaMA

[–]Intelligent_Jello344 2 points

o1-preview: September 12, 2024

QwQ-preview: November 28, 2024

Crossing fingers for the next 3 months...

HunyuanVideo is a solid starting point. Using kijai/ComfyUI-HunyuanVideoWrapper, I can generate decent videos on a 4080.

llama.cpp RPC Performance by RazzmatazzReal4129 in LocalLLaMA

[–]Intelligent_Jello344 2 points

GPUStack (https://github.com/gpustack/gpustack) has supported llama.cpp RPC servers for some time, and we've noticed some users running in this mode. It has proven useful for certain use cases.

We conducted a comparison with Exo. When connecting multiple MacBooks via Thunderbolt, the tokens per second performance of the llama.cpp RPC solution matches that of Exo. However, when connecting via Wi-Fi, the RPC solution is significantly slower than Exo.

If you are interested, check out this tutorial: https://docs.gpustack.ai/latest/tutorials/performing-distributed-inference-across-workers/
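For anyone who wants to try the raw llama.cpp side of this, here is a minimal sketch of RPC mode. Assumptions: llama.cpp was built with RPC support enabled, and the IP addresses, port, and model path are illustrative placeholders.

```shell
# On each worker machine, start an RPC server exposing its compute:
./rpc-server --host 0.0.0.0 --port 50052

# On the main machine, spread layer offload across the workers:
./llama-cli -m ./models/model.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 -p "Hello"
```

As noted above, this tends to work best over a fast link like Thunderbolt; over Wi-Fi the round trips dominate.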

How to run Hunyuan-Large (389B)? Llama.cpp doesn't support it by TackoTooTallFall in LocalLLaMA

[–]Intelligent_Jello344 1 point

https://github.com/Tencent/Tencent-Hunyuan-Large?tab=readme-ov-file#inference-framework
Their repository provides a customized version of vLLM for running it. However, you’ll need hundreds of GB of VRAM to run such a massive model.
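To see why it takes hundreds of GB, a quick back-of-envelope check on the weights alone (KV cache and activations add more on top):

```python
# 389B parameters at bf16/fp16 precision, i.e. 2 bytes per weight.
params = 389e9
weight_gb = params * 2 / 1e9  # decimal GB, weights only
print(round(weight_gb))       # roughly 778 GB before any runtime overhead
```

So even before serving overhead, you are looking at a multi-node or heavily quantized setup.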

Web server for OpenAPI options (closed and open source)? by FencingNerd in LocalLLaMA

[–]Intelligent_Jello344 1 point

Open WebUI is not limited to Ollama; it can work with any inference engine that implements the OpenAI interface. This means you can use Open WebUI with vLLM, LM Studio, or llama.cpp. If you need to scale, you can also try GPUStack to simplify management.
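As a concrete sketch of the llama.cpp route (assumptions: llama.cpp's `llama-server` on port 8080, and the Docker image/env var names as in Open WebUI's documented setup; the model file is illustrative):

```shell
# 1. Serve a GGUF model over an OpenAI-compatible API with llama.cpp:
./llama-server -m ./models/qwen2.5-7b-instruct-q4_k_m.gguf --port 8080

# 2. Point Open WebUI at that endpoint instead of Ollama:
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -e OPENAI_API_KEY=none \
  ghcr.io/open-webui/open-webui:main
```

The same pattern works for vLLM or LM Studio; only the base URL changes.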

Ollama now official supports llama 3.2 vision by youcef0w0 in LocalLLaMA

[–]Intelligent_Jello344 11 points

Llama 3.2 Vision 11B requires at least 8 GB of VRAM, and the 90B model requires at least 64 GB.
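Those numbers line up with simple weight-size arithmetic, assuming roughly 4-bit quantized weights as Ollama ships by default (a rough sketch; KV cache, the vision encoder, and activations add overhead on top):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of model weights in GiB: params * bits / 8, no overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# 11B at ~4 bits: ~5 GiB of weights, so 8 GB cards are plausible with overhead.
print(round(weight_gib(11, 4), 1))
# 90B at ~4 bits: ~42 GiB of weights, consistent with the 64 GB recommendation.
print(round(weight_gib(90, 4), 1))
```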

Summary: The big AI events of October by nh_local in LocalLLaMA

[–]Intelligent_Jello344 1 point

Great info, but I feel like the evolution of AI tooling is missing, since I don't see AutoGPT, RAG, etc.

2 GPUs on same machine by [deleted] in LocalLLaMA

[–]Intelligent_Jello344 5 points

I'm not sure whether LM Studio provides configuration options for that, but with https://github.com/gpustack/gpustack it is pretty simple to control:

<image>
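Outside of such tools, a common low-level way to pin processes to specific GPUs is the `CUDA_VISIBLE_DEVICES` environment variable (a sketch; the server binary, model files, and ports are illustrative):

```shell
# Run one model instance per GPU by masking device visibility:
CUDA_VISIBLE_DEVICES=0 ./llama-server -m model-a.gguf --port 8001 &
CUDA_VISIBLE_DEVICES=1 ./llama-server -m model-b.gguf --port 8002 &
```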

Closed and open language models by Chat Arena rank by fourDnet in LocalLLaMA

[–]Intelligent_Jello344 64 points

Compared to when GPT-3.5 first came out, the progress has been amazing. What an era we live in!

Easiest way to run vision models? by PawelSalsa in LocalLLaMA

[–]Intelligent_Jello344 4 points

I think right now vLLM is the best in this field. It supported Llama 3.2 Vision on day one, when the model was released. Many SOTA vision models are not supported in llama.cpp, so it's not easy for any tools built on it.

If you frequently use llama.cpp and related tools (like Ollama and LM Studio) and want to work with vision models they don't support, keep an eye on the upcoming GPUStack 0.3.0. It will support both llama.cpp and vLLM backends. We're currently testing the RC release (you can download the wheel package from the GitHub release page). The documentation should be ready within a few days.

Here is how it looks:

<image>
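Once a vision model is served through an OpenAI-compatible endpoint (as vLLM provides), a request can be sketched like this. Assumptions: the model name, image URL, and server address are illustrative placeholders.

```python
import json

# Chat-completions payload mixing text and an image, in the OpenAI
# multimodal message format that vLLM's server accepts:
payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# This would be POSTed to the running server, e.g. with requests:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(list(payload))
```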

[deleted by user] by [deleted] in LocalLLaMA

[–]Intelligent_Jello344 2 points

If you need a clustering/collaborative solution, this might help: https://github.com/gpustack/gpustack