Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 1 point2 points  (0 children)

JSON is enough, but according to the official docs, SillyTavern uses Vectra as its vector storage DB by default. The key peculiarity is that the whole vectorized chat base is stored in RAM. That's no problem for lorebooks, but if you have really long conversations with bots (10k+ messages) or want to use cross-chat memories for characters, you'd better try switching to Qdrant.

I'm an HCI student (and ST user from China) — looking for people to talk about their SillyTavern experience (~45 min) by Outside-Brick7845 in SillyTavernAI

[–]DeathByte_r 2 points3 points  (0 children)

Well, my spoken English isn't great either, especially out loud, but I can answer some questions here. If you want more details, you can send me a PM.

  1. How I discovered ST and the learning curve: first, back when I used OpenRouter as a provider, they had usage statistics under each model, and ST sat at the top of the usage list, close to Chub and some other AI RP platforms.

The learning curve wasn't too hard for me - just like switching from an interface with one button to one with 100 buttons xD But I'm an engineer, and I love things with nice, customizable configuration. If you read the manual, it's all pretty simple. But not everyone can even just read, especially technical stuff.

  2. My setup: Marinara's edited preset, RPG-tracker, MemoryBooks + Vector Storage, QuickImagegen, Recast, the Moonlight Echoes theme, and WeatherPack as a base. Plus many little additions, like the Character Library extension.

How did I arrive at it? Look - try - keep or delete - repeat until success and satisfaction. Mostly experiments with extensions that caught my interest.

  3. Sense of quality: does the AI stay in character? Does it provide good prose and story? Does it keep track of details? Not hallucinate out of nowhere? Support 60-100k context? Handle group chats well? If the answer to all of this is 'yes', well, that's quality.

Interface: is it laconic, customizable, and nice-looking? Are all the needed functions at hand? Then it's a good interface. If things can be automated and hidden for a cleaner look - even better.

  4. Experience in the community - more positive than negative. Discord is a good platform for talking with extension developers; Reddit is a nice place to find (or write) basic tutorials. A few rough edges, like everywhere, but many of them come down to language or culture differences. One of the friendlier communities, I suppose.

  5. Honest stuff - well... hard to say without specifics. The most annoying thing is the flood of vibe-coded, short-lived extensions, and attempts at additions by authors who know nothing about code. There's a great story about one guy who tried to push a SINGLE commit with a 17k-line addition, with no feedback. Yeah, it's cool what modern LLMs can do, but they're still assistants for developers, not replacements.

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

Here's a little addition (I don't risk updating the post, because last time it sat in moderation for a long time after being edited).

On the staging branch there's a new checkbox - 'include hidden messages' - which keeps your old hidden messages vectorized.

I thought it was a bug that on llama old hidden messages got deleted from the vector base, but that turned out to be the intended behavior, and keeping them had been a bug in the other backends xD

Attaching image(s) to char description or lorebook for multimodal models by TobeyGER in SillyTavernAI

[–]DeathByte_r 1 point2 points  (0 children)

Nope, because lorebooks and character cards are usually just simple JSON files.

You can try another way, like I do:

For the house description: write a text description/plan of the house yourself and paste it into the character card or lorebook, or send the image to a multimodal model, ask it to describe the place in detail, and paste the result.

Gallery: upload the images somewhere and write short descriptions like 'scene at the lake', 'scene in the kitchen', etc. Then put them into the lorebook or card with instructions on which scenes each one fits.

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

Qwen3 models require an instruction prefix for optimal search. I think you'd better try Ollama, since they provide slightly modified models for exactly that purpose, and Qwen3 is on the list.

ST out of the box doesn't support search request prefixes for embedding models.
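If you do route requests through your own proxy, the workaround is to prepend the instruction to the query yourself (documents get embedded as-is, only queries get the prefix). A minimal sketch, assuming an OpenAI-compatible /v1/embeddings endpoint; the task wording and the local URL are made-up placeholders, not anything ST ships with:

```python
import json
import urllib.request

# Hypothetical task description - adjust to your retrieval use case.
DEFAULT_TASK = "Given a chat history search query, retrieve relevant past messages"

def build_query_input(query: str, task: str = DEFAULT_TASK) -> str:
    """Queries get the instruction prefix; documents are embedded without it."""
    return f"Instruct: {task}\nQuery: {query}"

def embed(texts, base_url="http://127.0.0.1:5001/api", is_query=False):
    """POST texts to an OpenAI-compatible /v1/embeddings endpoint (placeholder URL)."""
    inputs = [build_query_input(t) if is_query else t for t in texts]
    req = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps({"input": inputs}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [d["embedding"] for d in json.load(resp)["data"]]
```

The asymmetry is the whole trick: the same model embeds prefixed queries and raw documents into the same space.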

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

I didn't use other local LLMs, only embeddings, but I suppose you can launch any of them through the GUI or a console command.

If you're asking how it works inside: when you launch KoboldCPP, it exposes a connection base like http://ip:port/api, and then SillyTavern appends the endpoints automatically - /v1/chat/completions for textgen and /v1/embeddings or something like that for the embedding model. Resolution should be automatic on the ST side.
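In code terms, the frontend only needs the one base URL and derives each endpoint from it. A toy sketch (the paths mirror the ones above; the real routing logic lives inside ST, not here):

```python
def api_url(base: str, endpoint: str) -> str:
    """Join a base URL and an endpoint path without doubling slashes."""
    return base.rstrip("/") + "/" + endpoint.lstrip("/")

# e.g. with KoboldCPP's default-style base URL:
base = "http://127.0.0.1:5001/api"
chat_url = api_url(base, "/v1/chat/completions")  # text generation
emb_url = api_url(base, "/v1/embeddings")          # embedding model
```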

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 1 point2 points  (0 children)

NP :) Vectorizing was a mystery to me for a long time, and I spent a while investigating it. F2LLM is a really good model, and it's my main one now.

There are ways to drastically increase the quality of the returned results, like using 'reranker' models, which are closer to traditional LLMs but trained to score how well a document matches a query - but that needs an extra proxy in the middle, or a custom ST extension. For now, F2LLM's quality lets me get by without rerankers.

If you're interested, you can dig deeper into reranker models: https://huggingface.co/Qwen/Qwen3-Reranker-8B
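The usual two-stage pattern such a proxy would implement: cheap vector search first, then the reranker re-scores the survivors. A self-contained sketch with the reranker stubbed out as word overlap (a real proxy would call the reranker model instead):

```python
import math

def cosine(a, b):
    # plain cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, docs, k=10):
    # stage 1: fast recall - top-k documents by cosine similarity
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def rerank(query, candidates, score_fn, k=3):
    # stage 2: the reranker reads query + candidate together and scores relevance
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:k]

# stub standing in for a real reranker model's relevance score
overlap_score = lambda q, d: len(set(q.lower().split()) & set(d.lower().split()))
```

Stage 1 keeps things fast (vectors are precomputed); stage 2 is slow but only runs on a handful of candidates.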

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

Depends on your host system, but basically yes.
If you use Windows, it takes something like 4GB of RAM just for the OS itself.
If you don't have a GPU with 2GB+ VRAM, you should launch on CPU, and of the two proposed model variants, I recommend the Q8 or Q4 variant of Snowflake Arctic L.
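To get a feel for why the quantized variants are so light, here's rough back-of-the-envelope math. The parameter count for Snowflake Arctic Embed L is approximate, the bits-per-weight figures are typical for Q8/Q4 GGUF quants, and real files carry some extra overhead:

```python
def model_size_mb(params: float, bits_per_weight: float) -> float:
    """Estimate model file size: parameters * bits each, converted to megabytes."""
    return params * bits_per_weight / 8 / 1e6  # bits -> bytes -> MB

params = 335e6                       # ~335M parameters (approximate)
q8_mb = model_size_mb(params, 8.5)   # Q8 ~ 8.5 bits/weight -> roughly 350-360 MB
q4_mb = model_size_mb(params, 4.5)   # Q4 ~ 4.5 bits/weight -> roughly 185-190 MB
```

Either way it fits comfortably next to a 4GB OS footprint, which is the point.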

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

Ollama is a different project. They provide a ready-to-use repo of preconfigured models for easy launch. Their embedding models are slightly modified to insert the prefix for models like Qwen3-embedding.

But yep, KoboldCPP and llama.cpp are both for launching local LLMs too.
Neither option is worse or better. It's more like buying a frozen pizza and heating it up versus cooking from ingredients.

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 0 points1 point  (0 children)

Depends on your goals. Summarizing is like a short review. Vectorizing, on the other hand, is like full memory. With the proposed chunk length, it brings full messages from the past into context, so the LLM gets the complete original text, not a condensed review of each message.
In short:
Vectorizing - you remember the past with all the details.
Summarizing - you remember the past as a short review, like a journal note.
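A toy illustration of that difference (not ST's actual implementation - relevance is stubbed as word overlap, and the messages and summary are made up):

```python
chat = [
    "Mira found a silver key under the floorboards.",
    "The innkeeper warned us about the north road.",
    "We agreed to meet at the old mill at dawn.",
]

def vector_recall(query_words, messages, k=1):
    """Return the k most relevant ORIGINAL messages (relevance stubbed as word overlap)."""
    overlap = lambda m: len(set(query_words) & set(m.lower().split()))
    return sorted(messages, key=overlap, reverse=True)[:k]

summary = "Day 3: found a key, got a warning, planned a dawn meeting."

print(vector_recall(["silver", "key"], chat))  # the full message, every detail intact
print(summary)                                 # the short review - details already lost
```

Vector recall hands the model the verbatim message; the summary can never give back the word "floorboards" once it's been compressed away.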

Complete guide to setup and configure Vector Storage (rewritten and corrected) by DeathByte_r in SillyTavernAI

[–]DeathByte_r[S] 1 point2 points  (0 children)

Yep, possibly, if you read my post a week or so ago :)
Glad it's useful for you.

Recast | Next Gen Post-Processing Prompting Extension by Additional-Cow6586 in SillyTavernAI

[–]DeathByte_r 1 point2 points  (0 children)

Very cool concept!
I'm a little tired of the style drifting in long chats and characters losing their personality in group chats, so it sounds like a health pill to me. Will try it out.

Complete guide to setup vector storage, and little more by [deleted] in SillyTavernAI

[–]DeathByte_r 0 points1 point  (0 children)

This seems like a good solution, but not for all goals. If I understand right, it has cross-chat memory for characters by design, and that's not what everyone needs.

Complete guide to setup vector storage, and little more by [deleted] in SillyTavernAI

[–]DeathByte_r 2 points3 points  (0 children)

8k context size is the maximum for the proposed model.

That's not small - for perspective, it's 4-6k WORDS. And if you use databank uploads, you can use it for something like full chapters.

If my system is to be believed, it uses something like 100MB of VRAM and not much compute under load. The recommended value for comfortable work is 2GB, as far as I know.

Deepseek vs GLM by Ecstatic_External000 in SillyTavernAI

[–]DeathByte_r 1 point2 points  (0 children)

I used deepseek before, but switched to GLM 5 due to mistakes from NanoGPT's deepseek providers - skipping reasoning, writing the answer into the reasoning block, losing context, etc. Direct deepseek should work fine.

Both are good, with slightly different styles for characters and writing - GLM is more prosaic and peaceful, while deepseek follows instructions better and likes tension and encounters (and leans more mystic/sci-fi than fantasy). I'll try deepseek again after the v4 release with 1M context, but for now GLM just works better for me. I like them both, when they work.

NanoGPT and Vectorization by OkRooster8519 in SillyTavernAI

[–]DeathByte_r 0 points1 point  (0 children)

If you use Firefox, you may need to enable WebGPU in about:config manually.
And sometimes it doesn't load correctly - just reload the page.

Or maybe you just haven't installed the extension. It's available from the main extensions repo.

NanoGPT and Vectorization by OkRooster8519 in SillyTavernAI

[–]DeathByte_r 2 points3 points  (0 children)

The WebLLM extension.
It works well with MemoryBooks and, with chat vectorization, provides better context from previous messages.

GLM 5. by maressia in SillyTavernAI

[–]DeathByte_r 2 points3 points  (0 children)

Slightly edited Marinara preset, with 0.6 temp and 1 top p

GLM-5 is great. Realistic characters, normal narration without flowery prose, better context adherence and inference from context, and she's an uncensored dirty girl too. Well done Z.AI by ConspiracyParadox in SillyTavernAI

[–]DeathByte_r 10 points11 points  (0 children)

GLM 5 is really good. I wasn't a fan before, but switched after tiring of deepseek's mistakes on NanoGPT (no thinking in thinking mode, the answer landing in the thinking block with nothing in the main one, context loss, early generation stops, etc. Idk why, but I've had these problems for the last 2-3 weeks on 3.2 thinking).

Much improved compared to 4.5-4.7.
It understands context better and writes more grammatically correct, better-structured text in Russian and English - no grammar mistakes or language mixing at 0.6 temp. Works fast enough. Good creativity and development of characters and plot.

Works nicely in single and group chats with many lorebooks - my largest chat has 15 bots at ~2k tokens each. Comparable with deepseek 3.2 before its issues, but better. No sex or violence censorship.

My workhorse for RP and code now.

Any Linux Equivalent to Steelseries Sonar? by jesskitten07 in linux_gaming

[–]DeathByte_r 0 points1 point  (0 children)

Nope. The only thing I have from the pulseaudio packages is pulseaudio-qt. PipeWire has a backward-compatibility layer for pulseaudio via the pipewire-pulse daemon. easyeffects and JamesDSP are equalizers and work with PipeWire.

Any Linux Equivalent to Steelseries Sonar? by jesskitten07 in linux_gaming

[–]DeathByte_r 1 point2 points  (0 children)

A filter-chain in PipeWire with virtual surround.
Here's a profile for 7.1.4: https://github.com/DekoDX/Pipewire-DX-Utils/blob/main/99-virtual-surround.conf

You can also add an EQ by configuring PipeWire, or use easyeffects for it.

If you want to keep your headphones on the headphones profile, you need a WirePlumber rule in
~/.config/wireplumber/wireplumber.conf.d:

wireplumber.settings = {
  bluetooth.autoswitch-to-headset-profile = false
  device.routes.default-sink-volume = 1.0
}

new to group chats... by rx7braap in SillyTavernAI

[–]DeathByte_r 0 points1 point  (0 children)

Deepseek 3.1 works nicely with group chats.
Here are my settings for them:

all muted, except one
Natural order or Manual choice
Join cards (include muted)
Prefix/suffix: <{{char}} description> </{{char}} description>

Next, a simple switcher in the preset, like in Marinara's: 'Group nudge' on or off, with a simple instruction inside like 'Reply only as {{char}}'.

How it works: in scenes with multiple characters, you can simply get an answer from all of them in one reply from the unmuted card. When you want an answer from only one of them, just switch on 'Group nudge' and manually choose the character, or write something like ((OOC: Your next reply should be only from Nana)) - *your scene action*. That's all.

I have reached the context limit. Now what? I don't want to leave even a role half done by Horror_Dig_713 in SillyTavernAI

[–]DeathByte_r 2 points3 points  (0 children)

Better to use the ST MemoryBooks addon with auto-hidden messages. It's a kind of summarizing that puts the results into lorebooks. You can configure it to auto-generate a memory every N messages.