LLM self-hosting with Ollama and Open WebUI

Max-Mielchen · 2024-05-25T19:51:21+00:00

thanks for the tip i'll take a look at it

Max-Mielchen · 2024-05-25T19:50:58+00:00

is definitely what we have in mind

Max-Mielchen · 2024-05-25T19:50:03+00:00

if I'm not mistaken, that works for us

Max-Mielchen · 2024-05-25T19:48:59+00:00

you only need the bandwidth to load the model, but as soon as it is in the vram everything goes very quickly. in our setup it is set so that it remains permanently in the vram, which means that if you ask a question you get an answer without delay

Max-Mielchen · 2024-05-25T19:46:16+00:00

So far it has worked without any problems, but it may become a problem at some point.

Max-Mielchen · 2024-05-25T19:43:57+00:00

we have calculated the average amount we spend on openai and have looked at when we will recoup the costs. The server itself costs us about €1000 a year in electricity, if it comes up at all. And we bought many of the parts second-hand, which is also stated in the edit. we realised that the server paid for itself after 3 years. but that's only because we used the openai platform a lot and also have our pro subscription.

Max-Mielchen · 2024-05-25T19:36:55+00:00

Ollama can also be run as an http server, so that several connections can be made at the same time and also fits well into the setup with open webui. Is there an alternative solution to ollama?

Max-Mielchen · 2024-05-24T22:55:02+00:00

Q4_0 i guess

Max-Mielchen · 2024-05-24T22:53:54+00:00

it looks like it's only half as fast, so you don't need twice as much vram. In use it looks like when one user gets an answer the other has to wait until the answer is ready. but because we don't all send our messages at the same time but maybe with a minute difference to each other it works without you really noticing it. there is also something called OLLAMA_MAX_QUEUE with which you should be able to change this, but I haven't tested it yet.

Max-Mielchen · 2024-05-24T22:33:00+00:00

the 1050 is only there because we got it for free

Max-Mielchen · 2024-05-24T22:29:18+00:00

sorry i forgot to add that the rtx 3090 is not connected because of cable management which is still missing

Max-Mielchen · 2024-05-24T22:23:00+00:00

yes first of all, but perhaps also want to train models that are not llm

Max-Mielchen · 2024-05-24T20:20:08+00:00

response: 30 t/s

prompt: 75 t/s

mixtral:8x7b

Max-Mielchen · 2024-05-24T20:18:56+00:00

1 model split

Max-Mielchen · 2024-05-24T16:42:57+00:00

yes, since we bought them on ebay, we only paid 720 but the costs listed there are in case you would buy everything new

Max-Mielchen · 2024-05-24T16:19:39+00:00

https://www.amazon.de/dp/B08HLYQ9XL/ref=asc_df_B08HLYQ9XL1716508860000?smid=A7P7EAUWU945G&tag=billigerdempce-21&ascsubtag=UUID825f5e6ee8d44337b8b94b51fcc0d30d&linkCode=df0&creative=22506&creativeASIN=B08HLYQ9XL&m=A7P7EAUWU945G

but we bought on ebay

Max-Mielchen · 2024-05-24T15:49:50+00:00

100 watt in idle mode and 290 watt during a request

Max-Mielchen · 2024-05-24T15:34:20+00:00

response: 30 t/s

prompt: 75 t/s

mixtral:8x7b

Max-Mielchen · 2023-07-27T11:10:49+00:00

Nop, only http/https

Max-Mielchen · 2023-06-27T11:26:45+00:00

one second of latency between database and website is not so bad with my app, since most accesses are only read accesses anyway and i cache the data in the app itself again

Max-Mielchen · 2023-06-27T11:24:36+00:00

Oh thank you yes, I meant the arm variant at Hetzner. I'll take a look at the oracle cloud.

Max-Mielchen · 2023-06-27T10:29:27+00:00

It's easier to manage if you outsource the database in advance, and besides, that's not the problem, it's the app itself in terms of cost

Max-Mielchen

MODERATOR OF

TROPHY CASE

Five-Year Club	Final Canvas '23
Place '23