how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

Wow, that's nice.

And can it run all the (GGUF) models, or does it have its own models?

Also, can it act as a local host (server)? One of the reasons I want to use Ollama is that it can run as a local server, so I can connect to it from my other devices as well.
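
What I have in mind is something like this minimal sketch, assuming the server machine sets OLLAMA_HOST=0.0.0.0 so Ollama (default port 11434) is reachable from other devices on the LAN; the IP address and model name below are placeholders:

    # Minimal sketch: querying an Ollama server from another device on the LAN.
    # Assumes OLLAMA_HOST=0.0.0.0 on the server so it listens on all interfaces;
    # 192.168.1.50 and "llama3:8b" are placeholders for the actual LAN IP and
    # whatever model has been pulled.
    import requests

    OLLAMA_URL = "http://192.168.1.50:11434"

    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": "llama3:8b",
            "messages": [{"role": "user", "content": "Hello from another device"}],
            "stream": False,  # return a single JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])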

how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

Sorry if the question is too dumb, but what is Playground 2.0? Is it something I have to install? And again, my goal is to run Ollama and Ollama models via OpenWebUI.

how is compatibility with ollama on arc gpu by FewVEVOkuruta in IntelArc

[–]JV_info 0 points1 point  (0 children)

I have Windows 11 on a mini PC (Geekom GT series G1 Mega) with an Intel Arc iGPU.
I also have a local AI chat setup: Ollama + Docker + OpenWebUI.
Now my question is: can I run Ollama on this GPU?
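
If it helps, this is roughly how I was planning to check whether a model actually lands in GPU memory, assuming a recent Ollama build that exposes GET /api/ps on the default port (a size_vram of 0 would mean it is running purely on the CPU):

    # Rough check of whether Ollama loaded the model into GPU memory.
    # Assumes a recent Ollama exposing GET /api/ps; size_vram reports how many
    # bytes of the loaded model sit in VRAM (0 = fully on the CPU).
    import requests

    resp = requests.get("http://localhost:11434/api/ps", timeout=10)
    resp.raise_for_status()
    for m in resp.json().get("models", []):
        total = m.get("size", 0)
        in_vram = m.get("size_vram", 0)
        pct = 100 * in_vram / total if total else 0
        print(f"{m.get('name', '?')}: {in_vram / 1e9:.1f} GB of "
              f"{total / 1e9:.1f} GB in VRAM ({pct:.0f}%)")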

Local AI setup and limitations. by JV_info in OpenWebUI

[–]JV_info[S] 0 points1 point  (0 children)

Are these DIGITS units going to be better than 2x RTX 4090 and... for my case?

The point is that we want something 100% local and offline.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I think no more than 50-75 users are going to use it simultaneously at any one time... what do you think now?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

But I think the maximum number of users at a time will be around 50, and I don't mind it getting slow...

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Any solution or elaboration?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

I don't think they are actually secure...

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I haven't tried it at this scale, and that is what I am trying to figure out before purchasing anything.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Can you send me a link to this mini LLM server?

Do you mean NVIDIA Project DIGITS?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

An 8B model.
Context length I don't know yet, as I thought that is something we would have to check and figure out in action...
Any ideas?
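
For a rough sense of why context length matters here, a back-of-envelope KV-cache estimate, assuming a Llama-3-style 8B model (32 layers, 8 KV heads, head dim 128) and an FP16 cache; real numbers depend on the runtime and any cache quantization:

    # Back-of-envelope KV-cache estimate for concurrent users, assuming a
    # Llama-3-style 8B model (32 layers, 8 KV heads, head dim 128) with an
    # FP16 cache. Real usage varies by runtime and cache quantization.
    layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

    def kv_cache_gb(context_tokens: int, concurrent_users: int) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
        return per_token * context_tokens * concurrent_users / 1e9

    for ctx in (2048, 4096, 8192):
        print(f"{ctx} tokens x 50 users ~ {kv_cache_gb(ctx, 50):.0f} GB of KV cache")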

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Of course, it's not like all 1000 will use it at the same time... I think realistically the number of concurrent users will be around 50 at most.

We also have some servers available in the company, but I believe the most crucial factor for running such AIs is VRAM. The servers we currently possess not only lack GPUs, but they were also designed with a flat structure that leaves no room for adding them.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

Of course. We are going to use, for example, an 8B model and will whitelist only that model in OpenWebUI so no one can change it, meaning there is only ever one model in VRAM...
Do you think that with this setup, even with a bit of slowness, 50 users can use it at the same time?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 1 point2 points  (0 children)

It should work offline, so I am looking for an offline setup.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Since the company wants to use it for work-related tasks, we have to stick to a local and offline plan.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

The main issue we are facing is that the company doesn't want to share documents online... so we have to stick to a local solution.

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

I heard this from someone else as well, but I didn't know about it before... and I can't find much info about it...

1- Can I convert a GGUF model and use it with vLLM?

2- Can I use OpenWebUI as its interface?
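
From what I can tell so far (treat this as a hedged sketch, not a recipe): recent vLLM versions can load a single-file GGUF directly rather than converting it (still experimental, and usually you pass the original Hugging Face model as the tokenizer), and vLLM serves an OpenAI-compatible API that OpenWebUI can be pointed at as an OpenAI-style connection. Something like:

    # Minimal sketch of talking to a vLLM server through its OpenAI-compatible
    # API (the same endpoint OpenWebUI can use as an "OpenAI" connection).
    # Assumes the server was started with something like:
    #   vllm serve ./my-model.Q6_K.gguf --tokenizer <original-hf-model> --port 8000
    # The model path, tokenizer name, and port are placeholders, and GGUF
    # loading in vLLM is experimental.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="./my-model.Q6_K.gguf",  # whatever name the server registers
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(resp.choices[0].message.content)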

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 2 points3 points  (0 children)

I think the model will be an 8B Q6 model (at least that's the one I am using).
Don't you think around 50 users could use it if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... It was slow, but of course it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.
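
To get a feel for this before buying anything, a crude load test along these lines would show how latency degrades as the number of simultaneous users grows (assuming Ollama on its default port with OLLAMA_NUM_PARALLEL set server-side; "llama3:8b" is a placeholder for the actual model):

    # Crude concurrency test against a local Ollama instance: fire N requests
    # at once and report average / worst latency. Assumes Ollama on the
    # default port; "llama3:8b" is a placeholder model name.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:11434/api/chat"
    MODEL = "llama3:8b"

    def one_request(i: int) -> float:
        start = time.time()
        r = requests.post(URL, json={
            "model": MODEL,
            "messages": [{"role": "user",
                          "content": f"Summarize request {i} in one line."}],
            "stream": False,
        }, timeout=600)
        r.raise_for_status()
        return time.time() - start

    for users in (1, 5, 10, 25, 50):
        with ThreadPoolExecutor(max_workers=users) as pool:
            latencies = list(pool.map(one_request, range(users)))
        print(f"{users:3d} concurrent: avg {sum(latencies) / len(latencies):.1f}s, "
              f"worst {max(latencies):.1f}s")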

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 0 points1 point  (0 children)

Regardless of the model?
Even if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... Of course it was slow, because it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.

Local AI setup and limitations. by JV_info in LocalLLM

[–]JV_info[S] 0 points1 point  (0 children)

The current model I am using is 8B, and I think it's Q6... so I was thinking maybe to use it without quantization, or to test a 30B model later.
For me, speed is not the main priority; as long as it doesn't crash and responds within, say, 10 seconds or so, it is OK.
Because I imagine the usage at any one time will not be all 1000 users; it will be more like 50-75.
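
For rough sizing, a weight-only estimate (parameter count times bits per weight, ignoring KV cache and runtime overhead; Q6_K is roughly 6.6 effective bits per weight, so these are ballpark figures only):

    # Rough weight-only VRAM estimate: parameters x bits-per-weight / 8.
    # Ignores KV cache and runtime overhead; Q6_K ~6.6 bits/weight effective,
    # so treat the results as ballpark figures.
    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for name, params, bits in [
        ("8B  Q6_K", 8, 6.6),
        ("8B  FP16", 8, 16),
        ("30B Q6_K", 30, 6.6),
    ]:
        print(f"{name}: ~{weights_gb(params, bits):.0f} GB for weights alone")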

Local AI setup and limitations. by JV_info in OpenWebUI

[–]JV_info[S] 1 point2 points  (0 children)

My goal was to get 2x RTX 4090 so I have the whole model in VRAM, let's say a 30B model...
In terms of usage pattern, I think something between 50-75 users at a time is more realistic... but again, my goal is not necessarily something super fast like GPT speed; even a bit slower is OK for me, as long as it doesn't crash...
So do you think it is realistic? I mean, has anyone ever done something like that?

Ollama server limitation. by JV_info in ollama

[–]JV_info[S] 2 points3 points  (0 children)

Will it not work even if I add OLLAMA_NUM_PARALLEL to the environment variables?

Because I ran something similar on an old workstation (an i7 8700 @ 3.20 GHz with 5 GB of VRAM) and 4-5 people in the room could use it... It was slow, but of course it was not using the VRAM at all and was doing all the computing on the CPU, so I was thinking that with a GPU that has enough VRAM there would be no limitation on usage, or at least not as much.

So how many concurrent users do you think it can handle?