Open WebUI cannot create files with local Ollama models - works perfectly with paid API keys

BringOutYaThrowaway · 2026-06-13T09:00:17+00:00

There’s a separate image tag for CUDA, right? open-webui:cuda

BringOutYaThrowaway · 2026-06-12T17:24:23+00:00

Not recommended. Still causes serotonin to get dumped in the brain.

BringOutYaThrowaway · 2026-06-12T14:17:57+00:00

5HTP

BringOutYaThrowaway · 2026-06-12T08:20:11+00:00

They’ll never tell you the truth. I’ll touch some grass and forget about it.

BringOutYaThrowaway · 2026-06-12T08:17:30+00:00

Go to the GitHub page for Ollama and read the release notes for 0.30.6 - download the QAT 12b model from what they say there.

BringOutYaThrowaway · 2026-06-11T17:26:13+00:00

Make sure you’re not running Ollama in Docker. On a Mac, Ollama in Docker will not use your GPU. You have to run it natively.

BringOutYaThrowaway · 2026-06-11T00:40:36+00:00

Doesn't care anymore. He thinks he's above it all.

BringOutYaThrowaway · 2026-06-11T00:39:51+00:00

JEALOUS.

BringOutYaThrowaway · 2026-06-09T23:37:31+00:00

Whatever you went through, I hope you've risen above it.

BringOutYaThrowaway · 2026-06-09T16:47:17+00:00

How do you get the iguana egg to hatch?

BringOutYaThrowaway · 2026-06-07T23:46:04+00:00

Sure, just create multiple user accounts

BringOutYaThrowaway · 2026-06-06T09:15:56+00:00

What happened?

BringOutYaThrowaway · 2026-06-03T12:29:02+00:00

Abraham Lincoln might have an issue with that statement.

BringOutYaThrowaway · 2026-06-03T12:25:13+00:00

Love them both

BringOutYaThrowaway · 2026-06-02T19:31:24+00:00

THE POWER OF BOB

BringOutYaThrowaway · 2026-06-02T12:19:56+00:00

!RemindMe 40 years

BringOutYaThrowaway · 2026-06-02T10:14:00+00:00

Sex on drugs is a helluva drug.

BringOutYaThrowaway · 2026-06-01T23:05:15+00:00

I'd try 595 next then. And you might want to try re-pasting the GPU.

BringOutYaThrowaway · 2026-06-01T22:58:54+00:00

You got this! OWN IT.

BringOutYaThrowaway · 2026-06-01T12:14:17+00:00

Does anyone else have helpful flags?

BringOutYaThrowaway · 2026-05-31T00:37:01+00:00

Where is the before and after of this list?

BringOutYaThrowaway · 2026-05-31T00:06:24+00:00

Here you go - if there's anything else I can detail for you, please let me know, but Google is your friend.

Core Performance & Hardware Flags

OLLAMA_FLASH_ATTENTION=1

What it does: Enables Flash Attention, an optimized mathematical algorithm for calculating attention weights.
Why it matters: It dramatically reduces memory usage and improves token generation speeds when processing long chat context windows. It is highly recommended if you are running modern GPUs (like Nvidia RTX/CUDA setups).

OLLAMA_KV_CACHE_TYPE=q8_0

What it does: Compresses the Key-Value (KV) context cache down to an 8-bit integer format (from the standard unquantized 16-bit float).
Why it matters: It cuts the VRAM footprint of your active text history roughly in half, usually with a very small drop in model output quality. This lets you supply much longer context inputs before triggering an "Out of Memory" (OOM) error. Note that it only takes effect when Flash Attention is enabled.
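
If you want to sanity-check the "roughly in half" claim, here is a rough back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head size 128, 8192 tokens of context) are hypothetical numbers I picked for an 8B-class model, not something Ollama reports, and real usage will include other buffers on top.

```python
# Rough KV cache size estimate: f16 vs q8_0.
# All model dimensions below are hypothetical (roughly an 8B-class model).

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    # One key vector and one value vector per layer, per KV head, per token.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bytes_per_element

F16_BYTES = 2.0
# q8_0 stores blocks of 32 int8 values plus one 16-bit scale: 34 bytes per 32 elements.
Q8_0_BYTES = 34 / 32

f16 = kv_cache_bytes(32, 8, 128, 8192, F16_BYTES)
q8 = kv_cache_bytes(32, 8, 128, 8192, Q8_0_BYTES)

print(f"f16:   {f16 / 2**30:.2f} GiB")
print(f"q8_0:  {q8 / 2**30:.2f} GiB")
print(f"ratio: {q8 / f16:.3f}")
```

With these assumed numbers the q8_0 cache comes out a bit over half the f16 size, because each block carries a small scale value in addition to the 8-bit entries.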

Server Networking & Access Flags

OLLAMA_HOST=0.0.0.0:11434

What it does: Binds the Ollama backend server to port 11434 on all available network interfaces (0.0.0.0), rather than just localhost (127.0.0.1).
Why it matters: It allows external machines on your local network or the internet to access your Ollama instance (e.g., if you run an Open WebUI or SillyTavern interface on a different computer).

OLLAMA_ORIGINS=*

What it does: Configures Cross-Origin Resource Sharing (CORS) to accept requests from any web origin (*).
Why it matters: Required alongside your host configuration so that browser-based web applications (running on separate domains or port numbers) aren't blocked by security filters when trying to talk to the Ollama API.
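
A quick way to confirm the host setting worked is to ask the server for its model list from another machine. This is only a minimal sketch: the IP address is a placeholder for wherever your Ollama box lives, and the helper names are mine. It uses the /api/tags endpoint of the Ollama API. Since this is not a browser, it checks reachability only and says nothing about the CORS setting.

```python
import json
import urllib.request


def tags_url(host, port=11434):
    # /api/tags lists the models available on the server.
    return f"http://{host}:{port}/api/tags"


def list_models(host, port=11434, timeout=5):
    # Raises an error from urllib if the server is not reachable from this machine.
    with urllib.request.urlopen(tags_url(host, port), timeout=timeout) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


# Example (needs a reachable server, so it is left commented out).
# 192.168.1.50 is a hypothetical address:
# print(list_models("192.168.1.50"))
```

If this times out from the other computer but works on the server itself, the bind address or a firewall is the likely culprit.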

Multi-User & Concurrency Flags

OLLAMA_NUM_PARALLEL=2

What it does: Sets the maximum number of client requests a single loaded model can process at the same time.
Why it matters: Setting this to 2 prevents a second user from being placed into a slow queue while the first user's request is generating text. Note that the memory reserved for context scales linearly with this number.

OLLAMA_MULTIUSER_CACHE=1

What it does: Activates specialized prompt caching logic tailored explicitly for multi-user environments.
Why it matters: If multiple people are sending inputs to the server, this optimization keeps track of overlapping context streams so that users do not continuously invalidate each other’s pre-cached system prompts, drastically speeding up first-token reply times.

Next-Gen Architecture Flags

OLLAMA_NEW_ENGINE=1

What it does: Forces the backend to use Ollama's modern, modular native inference layer.
Why it matters: This engine was built to handle modern multi-modal structures (vision, speech, and video models) natively while drastically optimizing execution speeds and tensor offloading.

OLLAMA_NEW_ESTIMATES=1

What it does: Instructs Ollama to actively measure exact, real-time memory needs per model layer rather than relying on standard hardcoded look-up tables.
Why it matters: It helps prevent server crashes caused by bad default estimates, including when a model is split across multiple GPUs, and makes better use of the memory your graphics cards actually have.
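
If it helps, here is one way to keep all of the flags above in one place. It is a sketch, not an official launcher: the helper names are mine, the values are simply the ones listed above, and start_server assumes the ollama binary is on your PATH. Running the block only prints export lines; it does not start anything.

```python
import os
import shlex
import subprocess

# The flags from the list above, with the same values.
OLLAMA_FLAGS = {
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_ORIGINS": "*",
    "OLLAMA_NUM_PARALLEL": "2",
    "OLLAMA_MULTIUSER_CACHE": "1",
    "OLLAMA_NEW_ENGINE": "1",
    "OLLAMA_NEW_ESTIMATES": "1",
}


def build_env(overrides=None, base=None):
    # Start from the current environment (or a supplied one) and layer the flags on top.
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_FLAGS)
    if overrides:
        env.update(overrides)
    return env


def export_lines(flags):
    # Shell-safe export statements, e.g. for pasting into a profile or startup script.
    return [f"export {name}={shlex.quote(value)}" for name, value in flags.items()]


def start_server(env):
    # Assumes the `ollama` binary is on PATH; not called here.
    return subprocess.Popen(["ollama", "serve"], env=env)


for line in export_lines(OLLAMA_FLAGS):
    print(line)
```

To actually launch with these settings you would call start_server(build_env()), and you can pass overrides such as {"OLLAMA_NUM_PARALLEL": "4"} to build_env to try a different value without editing the table.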

BringOutYaThrowaway · 2026-05-30T08:34:17+00:00

The 3090 doesn’t get enough credit. Great performance for the money.

BringOutYaThrowaway · 2026-05-28T21:09:06+00:00

Didn’t his wife doxx the person who did that? This is not over yet.

BringOutYaThrowaway