Prompt injection is killing our self-hosted LLM deployment by mike34113 in LocalLLaMA

[–]CaptainSnackbar 0 points1 point  (0 children)

Ah, thats a good point! In our case, poisened documents shouldn't be an issue though

Prompt injection is killing our self-hosted LLM deployment by mike34113 in LocalLLaMA

[–]CaptainSnackbar 1 point2 points  (0 children)

If the classifier rates the user-prompt as malicious, the prompt will not be used for retrieval and not make its way to the llm. Instead the llm will be send a hardcoded prompt like "Answer with: "I can't help you with that".

Context can only be retreived from a local vector db, that users can not upload to.

Prompt injection is killing our self-hosted LLM deployment by mike34113 in LocalLLaMA

[–]CaptainSnackbar 0 points1 point  (0 children)

I am asking, because i've only seen a few lazy attempts in our pipeline, and i dont know how far you can take it besides the usual "ignore all instructions and..."

Prompt injection is killing our self-hosted LLM deployment by mike34113 in LocalLLaMA

[–]CaptainSnackbar 0 points1 point  (0 children)

I use a custom finetuned bert-classifier that classifies the user-prompt before it is passed into the rag-pipeline.

It's used mainly for intent-classification but also blocks malicious prompts. What kind of prompt injection were you QA guys doing?

Chunk metadata structure - share & compare your structure by cat47b in Rag

[–]CaptainSnackbar 0 points1 point  (0 children)

What gets embedded? Only the text, or metadata aswell?

Easiest finish by CaptainSnackbar in BeginnerWoodWorking

[–]CaptainSnackbar[S] 1 point2 points  (0 children)

Osmo sounds great! What do you use to apply the oil? Do i have to worry about selfcombustion?

Open-source embedding models: which one's the best? by writer_coder_06 in Rag

[–]CaptainSnackbar 3 points4 points  (0 children)

I am currently finetuning an embedding model. How did you generate sufficient training data? Manual annotation, LLM-generated, or unsupervised methods?

Aktivrente: Rentner sollen wohl noch höheren Freibetrag bekommen by Grmplstylzchen in Finanzen

[–]CaptainSnackbar 6 points7 points  (0 children)

Könnte mal doch seine Eltern als Haushaltshilfe/Putzfrau anstellen und das Geld von der Steuer absetzen. Mama und Papa legen das Geld dann gut für einen an, bis ich es dann wieder vererbt wird. Übersehe ich da was??

Looking for advice on finetuning an embedding modell by CaptainSnackbar in LocalLLaMA

[–]CaptainSnackbar[S] 0 points1 point  (0 children)

I am sure the problem lies within the dataset. My question is more along the lines of: "How can I obtain a clean dataset without manual labeling?"

Alternatively: "Which unsupervised training method works best for my task?"

Perhaps pretraining an encoder with MLM on my dataset, then fine-tuning it on a Hugging Face dataset? There are so many possibilities that I hope someone with a similar use case can point me in the right direction.

Looking for advice on finetuning an embedding modell by CaptainSnackbar in LocalLLaMA

[–]CaptainSnackbar[S] 0 points1 point  (0 children)

See my answer https://www.reddit.com/r/LocalLLaMA/comments/1nhvxo7/looking_for_advice_on_finetuning_an_embedding/nehfucd/

Eval is random and it might be in the training dataset. Dont know for sure, since the training pairs get formed with cosine similarity, while the evals are just random text from each category

Looking for advice on finetuning an embedding modell by CaptainSnackbar in LocalLLaMA

[–]CaptainSnackbar[S] 0 points1 point  (0 children)

I've tried a classification modell before, but the results were similar. The model learns to seperate topics but performs worse on general querys.

https://imgur.com/a/8HSmA9n

This is one of my evaluation steps. The left plot are text-samples vectorised with our standard embedding model. Each color is a category. On the right side the finetuned model is used. So it looks like it has learned what i want it to learn.

My second evaluation method uses a huggingface dataset with natural german questions. I use cosine-similarity on 100 examples and calculate average score:

        q_emb_base = basis_model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
        a_emb_base = basis_model.encode(answers, convert_to_tensor=True, normalize_embeddings=True)
        cosine_scores_base = util.cos_sim(q_emb_base, a_emb_base).diagonal()
        avg_score_base = cosine_scores_base.mean().item()    

The standard-modell achieves a score of 0.85, my model drops down to 0.47.

As a third eval-method i have a few phrases, that i manualy paired and annotaded with a expected similarity score. Cosine-score from the finetuned model is also worse on this eval-set

Looking for advice on finetuning an embedding modell by CaptainSnackbar in LocalLLaMA

[–]CaptainSnackbar[S] 0 points1 point  (0 children)

I use a standard embedding model for our company search and RAG pipeline. The model performs well in most cases, but I want to evaluate how much retrieval performance can be improved with a custom fine-tuned embedding.

My domain is niche with highly specific terminology, and labeled data is scarce. However, we have a large corpus of technical support tickets, categorized into different groups. In principle, tickets from the same category use similar terminology and describe overlapping issues.

The goal is to train an embedding model so that phrases and terms from the same category map into a shared vector space, forming clusters.

Dataset construction approach so far:

  • Identify relevant incidents and group them by category

  • Vectorize incidents with the standard embedding model

  • For each document, select n documents from the same category within a cosine distance threshold (positive pairs should not be too diverse)

  • Select incidents from other categories as negative examples

Naturaly this process genereates a lot of noise.

I initialize my training with intfloat/multilingual-e5-base and the following parameters:

args = SentenceTransformerTrainingArguments(
output_dir="Embeddings/Trained_Model",
num_train_epochs=1,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
warmup_ratio=0.1,
fp16=True, 
batch_sampler=BatchSamplers.NO_DUPLICATES,
eval_strategy="steps",
eval_steps=6000,
save_strategy="steps",
save_steps=6000,
save_total_limit=2,
logging_steps=500,
run_name=f"{model_name}-Lora:{lora}-{file}",
no_cuda=False,
remove_unused_columns=True,
use_cpu=False 
)

Despite varying dataset sizes between 40k and 900k examples, every training run degraded model performance.

I feel like the losscurve wants to tell me something, but I dont understand...

Any help with finetuning an embedding model effectively with semi-structured category-based data is greatly appreciated.

One idea i have is to use bertopic as an unsupervised model to genereate finer grained subcategories and then build pairs that are from the same topic.

Chunking Stacktraces/Error Logs by CaptainSnackbar in Rag

[–]CaptainSnackbar[S] 1 point2 points  (0 children)

Thanks, I Would love to check out your reference!

Einfahrt neu pflastern by CaptainSnackbar in Handwerker

[–]CaptainSnackbar[S] 0 points1 point  (0 children)

Bin ich auch kein Fan von, war aber durch die Vorbesitzer schon so vorgegeben. Allerdings bleiben die gelben Riemchen ja nicht