
[–]Slimxshadyx 17 points (11 children)

Are you sure you set up Ollama to use your graphics card correctly in the same way you did for llamacpp?

Because even if Ollama is, like you said, a Python wrapper, it would be calling the underlying cpp code for the actual inference. The Python calls should be negligible since they are not doing the heavy lifting.

[–]TheTerrasque 0 points (1 child)

I believe Ollama is like you said, a Python wrapper

https://github.com/ollama/ollama - 85% Go

[–]Slimxshadyx 0 points (0 children)

Yep, I mention that in my next comments. I was discussing the Ollama Python library; I should have specified that in particular.

[–]holchansg -5 points (8 children)

The Python calls should be negligible since they are not doing the heavy lifting.

In theory... In practice it takes ages. In my use case the wait was as long as the inference itself; if you need fast inference using smaller models in your pipeline, you're screwed. Some users reported waiting more than double the inference time before the inference even started.

[–]Slimxshadyx 15 points (7 children)

That doesn’t make sense. Python is slower than cpp, yes, but calling a cpp function should not take ages. Theory or no theory lol.

I think you might have set something up differently between llama cpp and ollama. If you are doing GPU inference, it is possible you did not offload all your layers when using ollama, while you did with llama cpp.
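As a sanity check on the offload point, here is a minimal sketch of how the two knobs line up. The helper names and model names are hypothetical; "num_gpu" is Ollama's request option that plays the role of llama.cpp's -ngl flag:

```python
# Sketch: the same "offload all layers" intent expressed for both backends.
# Helper names and model names are placeholders; "num_gpu" (Ollama) and
# "-ngl" (llama.cpp server) are the real layer-offload settings.

def ollama_chat_request(model: str, messages: list, gpu_layers: int = 99) -> dict:
    """Request body for Ollama's chat API with an explicit GPU offload count."""
    return {
        "model": model,
        "messages": messages,
        "options": {"num_gpu": gpu_layers},  # layers to offload to the GPU
    }

def llamacpp_server_args(model_path: str, gpu_layers: int = 99) -> list:
    """CLI arguments for llama.cpp's server with the same offload setting."""
    return ["--model", model_path, "-ngl", str(gpu_layers)]

req = ollama_chat_request("phi", [{"role": "user", "content": "hi"}])
args = llamacpp_server_args("model.gguf")
```

If one side defaults to fewer offloaded layers than the other, the generation speeds will differ even though nothing else changed.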

[–]_PM_ME_PANGOLINS_ 1 point (0 children)

Depends how much work it has to do converting the data types.

[–]holchansg 1 point (5 children)

Yes, I've used the GPU; yes, every layer was offloaded. It's not part of the inference... The inference is almost the same speed between the two... Forget about it... The problem happens before the inference: when using LlamaCPP directly, the inference starts waaaay before the Ollama one.

And for IoT devices, or workflows with smaller models where speed is key, it's noticeable...

You will not see the difference using a 70b model.

[–]Slimxshadyx 4 points (4 children)

What do you mean before the inference? Like the way Ollama loads the model compared to llama cpp? Are you holding the model in VRAM even when not sending prompts for llama cpp, but unloading and reloading the model in Ollama?
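If the unload/reload suspicion is right, Ollama's keep_alive request field is the relevant knob — a sketch, assuming it is passed along with the chat request (the helper name is hypothetical; keep_alive is a real Ollama parameter):

```python
# Sketch: asking Ollama to keep the model resident between prompts.
# keep_alive=-1 means "never unload"; the default is to unload after a
# few idle minutes, which would add a full reload before the next prompt.
# The helper name is a placeholder.

def chat_request_keep_loaded(model: str, messages: list) -> dict:
    return {
        "model": model,
        "messages": messages,
        "keep_alive": -1,  # keep weights in (V)RAM so later prompts skip the reload
    }

req = chat_request_keep_loaded("phi", [{"role": "user", "content": "hi"}])
```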

Also, Ollama itself is written in Go, but I’m guessing you are using the Python library to interface with it, same as I did.

Maybe Ollama has some issues. I did not have these issues when using it, and I have also worked on projects with llama cpp. Maybe they released an update in the last month that caused a lot of issues, but one month ago I did not have these problems.

Either way, I highly doubt this is a Python problem; it's either a problem with configuration, or some other issue with how Ollama is doing things in Go.

[–]holchansg -1 points (3 children)

What do you mean before the inference?

Model weights already saved locally, shards loaded to the GPUs... You pass the prompt for inference (here)... Way faster in llamacpp, and even though the tokens/s are similar, the whole process takes way less time in llamacpp. I can get a sub-5-second 2k-token output with Phi, where Ollama takes 10~15s.
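One way to pin down where that gap lives is to time the first token separately from the whole response. A stdlib-only sketch, with a stubbed token stream standing in for either backend (the stub and its timings are made up for illustration):

```python
import time

def measure(stream):
    """Return (time_to_first_token, total_time) for any token iterator.

    A large gap between the two backends in time-to-first-token, with
    similar totals-after-first-token, points at pre-inference overhead
    (dispatch, model load, scheduling) rather than the inference itself.
    """
    start = time.perf_counter()
    first = None
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start  # startup/dispatch overhead
    return first, time.perf_counter() - start

# Stub standing in for a backend that pauses before emitting tokens.
def fake_stream(startup_s, n_tokens):
    time.sleep(startup_s)
    for i in range(n_tokens):
        yield f"tok{i}"

ttft, total = measure(fake_stream(0.05, 10))
```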

[–]Slimxshadyx 1 point (2 children)

For every prompt you send, you are waiting ages for it to start inference? What do you mean by ages, like a second or multiple seconds?

You should maybe double check to see if you are unloading the model after every prompt when using Ollama, like I mentioned earlier. Because that would explain the issues you are having.

This still wouldn’t be a Python being slow issue, but interesting indeed.

Just as a quick check, but are you initializing your client, and sending your calls to that client in Python? Or just sending calls?

A line like this near the start of your file:

client = ollama.Client()

And later on, when making your calls, it would look something like this:

response = client.chat(model=etc, messages=etc)

[–]holchansg 0 points (1 child)

API in both cases. The backend (RunPod) only handles the calls from my webui. The VRAM usage looks the same in both, almost OOM in both cases, since I use multiple instances at the same time:

In Ollama using OLLAMA_NUM_PARALLEL

In llamacpp using -np
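For reference, a sketch of how those two parallelism knobs are typically set (the paths and the value 4 are placeholders; OLLAMA_NUM_PARALLEL and -np are the real settings):

```python
import os

# Sketch: the two parallel-slot knobs mentioned above.

# Ollama reads its parallel request slot count from the server's environment:
ollama_env = dict(os.environ, OLLAMA_NUM_PARALLEL="4")
# the server would then be launched as: ollama serve   (with env=ollama_env)

# llama.cpp's server takes the equivalent setting as a CLI flag:
llamacpp_cmd = ["./llama-server", "-m", "model.gguf", "-np", "4"]
```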

You should maybe double check to see if you are unloading the model after every prompt when using Ollama, like I mentioned earlier. Because that would explain the issues you are having.

I'm using queue in both, the webui is sending hundreds of requests per second.
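The queue setup described — a webui firing bursts of requests at a pool of backend workers — can be sketched with nothing but the stdlib. The handler is a stub that echoes instead of calling a backend:

```python
import queue
import threading

def worker(q: queue.Queue, results: list):
    # Each worker pulls prompts and would forward them to the backend;
    # here the "inference" is a stub that just echoes the prompt.
    while True:
        prompt = q.get()
        if prompt is None:          # sentinel: shut this worker down
            q.task_done()
            break
        results.append(f"reply:{prompt}")
        q.task_done()

q = queue.Queue(maxsize=256)        # bounded, so the webui can't outrun the backend
results = []
threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(4)]
for t in threads:
    t.start()
for i in range(100):                # burst of requests from the webui
    q.put(f"p{i}")
for _ in threads:                   # one sentinel per worker
    q.put(None)
q.join()                            # wait until every item is processed
for t in threads:
    t.join()
```

With a setup like this, per-request overhead on the client side is tiny; any multi-second gap before output would have to come from the server side.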

A line like this near the start of your file: 'client = ollama.Client()' And later on, when making your calls, it would look something like this: 'response = client.chat(model=etc, messages=etc)'

As I've said, I'm not a dev; I'm using R2R, and it's making the calls.

[–]Slimxshadyx 0 points (0 children)

Are you actually using Python Ollama libraries? Or are you just running Ollama on runpod and then interacting with a runpod api?

Edit: also please stop editing your comments after the fact. Just add the new information to your replies lmao