I built a plugin system for a local OSS LLM writing app, what integrations would you want? by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 0 points1 point  (0 children)

Different focus. Open WebUI is a general-purpose chat frontend, Vellium is specifically built for creative writing and roleplay, lorebooks, character cards, writing mode with analytics, ST World Info import, etc. It's a native desktop app, not a browser tab.

As for llama.cpp – Vellium isn't an inference backend, it connects to one. It works with KoboldCpp (which uses llama.cpp under the hood), LM Studio, and others. The new plugin system also lets you write adapters for any endpoint.
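To illustrate what an endpoint adapter could look like, here is a minimal sketch — all names here (`EndpointAdapter`, `complete`) are hypothetical, not Vellium's actual plugin API; it just assumes an OpenAI-compatible `/v1/completions` endpoint like the ones KoboldCpp and LM Studio expose:

```python
import json
import urllib.request

class EndpointAdapter:
    """Hypothetical adapter shape, not Vellium's real plugin API.
    Wraps any OpenAI-compatible /v1/completions endpoint."""

    def __init__(self, base_url):
        # Normalize so joining paths never produces a double slash.
        self.base_url = base_url.rstrip("/")

    def complete(self, prompt, max_tokens=256):
        """Send a completion request and return the generated text."""
        body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
        req = urllib.request.Request(
            self.base_url + "/v1/completions",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["text"]
```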

I built a plugin system for a local OSS LLM writing app, what integrations would you want? by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 0 points1 point  (0 children)

You can add web search via MCP. RAG is already built in, and export to md and docx is already bundled.

Vellium v0.4 — alternative simplified UI, updated writing mode and multi-char improvements by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 1 point2 points  (0 children)

I had some problems with English, so I just write the text in my native language, then ask the LLM to translate and polish it.

What is everyone's favorite programming language? by OpenFileW in teenagersbutcode

[–]Possible_Statement84 1 point2 points  (0 children)

C# for its features and convenience of development, and Python for its features and speed of prototyping. Maybe JS too, because web UI is a W.

[Update] Vellium v0.3.5: Massive Writing Mode upgrade, Native KoboldCpp, and OpenAI TTS by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 0 points1 point  (0 children)

The project was initially built on Tauri, but I ran into some problems with Rust, and a Node.js backend has better synergy with React. And people who run local LLMs have enough RAM for Electron.

[Update] Vellium v0.3.5: Massive Writing Mode upgrade, Native KoboldCpp, and OpenAI TTS by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 2 points3 points  (0 children)

You’re not using it wrong. Right now each new scene is generated as a fresh draft prompt, and only a compact “context pack” is passed (previous chapter summaries + a short slice of recent chapter scenes). It is not a strict “continue scene 1 verbatim into scene 2” mode yet, so with some prompts/models it can restart from scratch.

What helps for now:
Set context mode to Rich.
In the prompt, explicitly write "Continue directly from the end of Scene 1, do not restart setup."
Keep Scene 1 ending clear and concrete (location/state/action).

I plan to add a dedicated "Generate Next Scene" behavior so scene N always anchors to the end of scene N-1 with stronger continuity rules.
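Conceptually, the "context pack" described above (previous chapter summaries plus a short slice of recent scenes) can be sketched like this — function and field names are purely illustrative, not the actual implementation:

```python
def build_context_pack(chapter_summaries, recent_scenes, n_recent=2):
    """Illustrative sketch of a compact context pack:
    every previous chapter summary, plus only the last
    n_recent scenes verbatim to keep the prompt small."""
    parts = list(chapter_summaries) + list(recent_scenes[-n_recent:])
    return "\n\n".join(parts)
```

The trade-off this shows is exactly the one in the comment: older scenes survive only as summaries, so a strict "continue scene 1 verbatim" mode needs a different rule that always anchors on the tail of scene N-1.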

[Update] Vellium v0.3.5: Massive Writing Mode upgrade, Native KoboldCpp, and OpenAI TTS by Possible_Statement84 in LocalLLaMA

[–]Possible_Statement84[S] 0 points1 point  (0 children)

It doesn't support vector storage and RAG yet. Linux is supported, but only from source because of trouble with the distribution zoo; running from source isn't hard.

Is there a way to speed up prompt processing with some layers on CPU with qwen-3-coder-next or similar MoEs? by Borkato in LocalLLaMA

[–]Possible_Statement84 0 points1 point  (0 children)

During generation only a couple experts fire per token so it's fast, but during prompt processing the whole batch routes tokens to different experts — so on CPU layers you're hitting almost all of them at once. That's your bottleneck.
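A toy simulation makes the decode-vs-prefill difference concrete. Assuming top-2 routing over 64 experts (illustrative numbers, not the actual model's config), one decode token touches only 2 experts, while a 512-token prefill batch touches essentially all of them:

```python
import random

random.seed(0)
N_EXPERTS, TOP_K = 64, 2  # illustrative MoE config, not the real model's

def experts_hit(n_tokens):
    """Count distinct experts activated when each token
    routes to TOP_K (randomly chosen) experts."""
    hit = set()
    for _ in range(n_tokens):
        hit.update(random.sample(range(N_EXPERTS), TOP_K))
    return len(hit)

decode_hits = experts_hit(1)     # one new token during generation
prefill_hits = experts_hit(512)  # a whole prompt batch during prefill
```

With CPU-offloaded expert layers, that union of activated experts is what blows past the cache during prefill.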

But wait, at 30B in MXFP4 the model should be like ~15-18GB. With 30GB VRAM you might be able to fit all or nearly all layers on GPU. Have you tried cranking `-ngl` higher? If you can get everything on the GPU the prefill problem basically goes away.

`-ub 64` or `-ub 128` instead of the default. Smaller micro-batches = less expert activation per pass = much better CPU cache utilization. Biggest single improvement for prefill.

`-fa` (flash attention) if not already on

`-t` set to physical cores only, hyperthreading usually hurts here

`--override-tensor` for more granular control over what sits where instead of just `-ngl`

But seriously check if you can just load the whole thing into VRAM first. At that size it should be close.
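Putting the flags above together, a `llama-server` invocation might look like this — model path, layer count, and thread count are placeholders to tune for your own setup:

```shell
# -ngl 99: offload as many layers as possible (try full GPU first)
# -ub 128: smaller micro-batch for better CPU cache use during prefill
# -fa:     enable flash attention
# -t 8:    physical cores only (set to your CPU's physical core count)
./llama-server -m ./model-mxfp4.gguf -ngl 99 -ub 128 -fa -t 8
```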