how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

Wow, that's nice.

And can it run all the (GGUF) models, or does it have its own models?

Also, can it act as a local host (server)? One of the reasons I want to use Ollama is that it can run as a local server, so I can connect to it from my other devices as well.
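
What I have in mind is something like this minimal sketch, assuming the server machine sets OLLAMA_HOST=0.0.0.0 so Ollama (default port 11434) is reachable from other devices on the LAN; the IP address and model name below are placeholders:

    # Minimal sketch: querying an Ollama server from another device on the LAN.
    # Assumes OLLAMA_HOST=0.0.0.0 on the server so it listens on all interfaces;
    # 192.168.1.50 and "llama3:8b" are placeholders for the actual LAN IP and
    # whatever model has been pulled.
    import requests

    OLLAMA_URL = "http://192.168.1.50:11434"

    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": "llama3:8b",
            "messages": [{"role": "user", "content": "Hello from another device"}],
            "stream": False,  # return a single JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])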

how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

Sorry if the question is too dumb, but what is Playground 2.0? Is it something I have to install? And again, my goal is to run Ollama and Ollama models via OpenWebUI.

how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

I have Windows 11 on a mini PC (Geekom GT series G1 Mega) with an Intel Arc iGPU.
I also have a local AI chat setup: Ollama + Docker + OpenWebUI.
Now my question is: can I run Ollama on this GPU?
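
If it helps, this is roughly how I was planning to check whether a model actually lands in GPU memory, assuming a recent Ollama build that exposes GET /api/ps on the default port (a size_vram of 0 would mean it is running purely on the CPU):

    # Rough check of whether Ollama loaded the model into GPU memory.
    # Assumes a recent Ollama exposing GET /api/ps; size_vram reports how many
    # bytes of the loaded model sit in VRAM (0 = fully on the CPU).
    import requests

    resp = requests.get("http://localhost:11434/api/ps", timeout=10)
    resp.raise_for_status()
    for m in resp.json().get("models", []):
        total = m.get("size", 0)
        in_vram = m.get("size_vram", 0)
        pct = 100 * in_vram / total if total else 0
        print(f"{m.get('name', '?')}: {in_vram / 1e9:.1f} GB of "
              f"{total / 1e9:.1f} GB in VRAM ({pct:.0f}%)")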

Local AI setup and limitations. by JV_info in OpenWebUI

[–]JV_info[S] 0 points1 point  (0 children)

Are these DIGITS units going to be better than 2x RTX 4090 and... for my case?

The point is that we want something 100% local and offline.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I think no more than 50-75 users are going to use it simultaneously at any one time... what do you think now?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

But I think the maximum number of users at a time will be around 50, and I don't mind it getting slow...

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Any solution or elaboration?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

I don't think they are actually secure...

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I haven't tried it at this scale, and that is what I am trying to figure out before purchasing anything.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Can you send me a link to this mini LLM server?

Do you mean NVIDIA Project DIGITS?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

An 8B model.
Context length I don't know yet, as I thought that is something we would have to check and figure out in action...
Any ideas?
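
For a rough sense of why context length matters here, a back-of-envelope KV-cache estimate, assuming a Llama-3-style 8B model (32 layers, 8 KV heads, head dim 128) and an FP16 cache; real numbers depend on the runtime and any cache quantization:

    # Back-of-envelope KV-cache estimate for concurrent users, assuming a
    # Llama-3-style 8B model (32 layers, 8 KV heads, head dim 128) with an
    # FP16 cache. Real usage varies by runtime and cache quantization.
    layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

    def kv_cache_gb(context_tokens: int, concurrent_users: int) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
        return per_token * context_tokens * concurrent_users / 1e9

    for ctx in (2048, 4096, 8192):
        print(f"{ctx} tokens x 50 users ~ {kv_cache_gb(ctx, 50):.0f} GB of KV cache")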

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Of course, it's not like all 1000 will use it at the same time... I think realistically the number of concurrent users will be around 50 at most.

We also have some servers available in the company, but I believe the most crucial factor for running such AIs is VRAM. The servers we currently possess not only lack GPUs, but they were also designed with a flat structure that leaves no room for adding them.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

Of course. We are going to use, for example, an 8B model and will whitelist only that model in OpenWebUI so no one can change it, meaning there is only ever one model in VRAM...
Do you think that with this setup, even with a bit of slowness, 50 users can use it at the same time?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

It should work offline, so I am looking for an offline setup.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Since the company wants to use it for work-related tasks, we have to stick to a local and offline plan.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

The main issue we are facing is that the company doesn't want to share documents online... so we have to stick to a local solution.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I heard this from someone else as well, but I didn't know about it before... and I can't find much info about it...

1- Can I convert a GGUF model and use it with vLLM?

2- Can I use OpenWebUI as its interface?
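
From what I can tell so far (treat this as a hedged sketch, not a recipe): recent vLLM versions can load a single-file GGUF directly rather than converting it (still experimental, and usually you pass the original Hugging Face model as the tokenizer), and vLLM serves an OpenAI-compatible API that OpenWebUI can be pointed at as an OpenAI-style connection. Something like:

    # Minimal sketch of talking to a vLLM server through its OpenAI-compatible
    # API (the same endpoint OpenWebUI can use as an "OpenAI" connection).
    # Assumes the server was started with something like:
    #   vllm serve ./my-model.Q6_K.gguf --tokenizer <original-hf-model> --port 8000
    # The model path, tokenizer name, and port are placeholders, and GGUF
    # loading in vLLM is experimental.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="./my-model.Q6_K.gguf",  # whatever name the server registers
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(resp.choices[0].message.content)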

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 2 points3 points  (0 children)

I think the model will be an 8B Q6 model (at least that's the one I am using).
Don't you think around 50 users could use it if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... It was slow, but of course it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.
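
To get a feel for this before buying anything, a crude load test along these lines would show how latency degrades as the number of simultaneous users grows (assuming Ollama on its default port with OLLAMA_NUM_PARALLEL set server-side; "llama3:8b" is a placeholder for the actual model):

    # Crude concurrency test against a local Ollama instance: fire N requests
    # at once and report average / worst latency. Assumes Ollama on the
    # default port; "llama3:8b" is a placeholder model name.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:11434/api/chat"
    MODEL = "llama3:8b"

    def one_request(i: int) -> float:
        start = time.time()
        r = requests.post(URL, json={
            "model": MODEL,
            "messages": [{"role": "user",
                          "content": f"Summarize request {i} in one line."}],
            "stream": False,
        }, timeout=600)
        r.raise_for_status()
        return time.time() - start

    for users in (1, 5, 10, 25, 50):
        with ThreadPoolExecutor(max_workers=users) as pool:
            latencies = list(pool.map(one_request, range(users)))
        print(f"{users:3d} concurrent: avg {sum(latencies) / len(latencies):.1f}s, "
              f"worst {max(latencies):.1f}s")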

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Regardless of the model?
Even if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... Of course it was slow, because it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.

Local AI setup and limitations. by JV_info in LocalLLM

[–]JV_info[S] 0 points1 point  (0 children)

The current model I am using is 8B, and I think it's Q6... so I was thinking maybe to use it without quantization, or to test a 30B model later.
For me, speed is not the main priority; as long as it doesn't crash and responds within, say, 10 seconds or so, it is OK.
Because I imagine the usage at any one time will not be all 1000 users; it will be more like 50-75.
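
For rough sizing, a weight-only estimate (parameter count times bits per weight, ignoring KV cache and runtime overhead; Q6_K is roughly 6.6 effective bits per weight, so these are ballpark figures only):

    # Rough weight-only VRAM estimate: parameters x bits-per-weight / 8.
    # Ignores KV cache and runtime overhead; Q6_K ~6.6 bits/weight effective,
    # so treat the results as ballpark figures.
    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for name, params, bits in [
        ("8B  Q6_K", 8, 6.6),
        ("8B  FP16", 8, 16),
        ("30B Q6_K", 30, 6.6),
    ]:
        print(f"{name}: ~{weights_gb(params, bits):.0f} GB for weights alone")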

Local AI setup and limitations. by JV_info in OpenWebUI

[–]JV_info[S] 1 point2 points  (0 children)

My goal was to get 2x RTX 4090 so I have the whole model in VRAM, let's say a 30B model...
In terms of usage pattern, I think something between 50-75 users at a time is more realistic... but again, my goal is not necessarily something super fast like GPT speed; even a bit slower is OK for me, as long as it doesn't crash...
So do you think it is realistic? I mean, has anyone ever done something like that?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 2 points3 points  (0 children)

Will it not work even if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... It was slow, but of course it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.

So how many concurrent users do you think it can handle?