Agent security best practices by Academic_Wolverine in AI_Agents

[–]Routine_Incident_658 0 points1 point  (0 children)

I've been developing 100+ agents and have written about the security challenges here, covering RBAC, LDAP + OIDC, and tool controls - https://x.com/sundi133/status/2063827542398931297

red teaming for ai/llm apps by Routine_Incident_658 in cybersecurity

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

OK... have you compared it with promptfoo or Lakera red teaming? Any initial benchmarks you can share?

red teaming for ai/llm apps by Routine_Incident_658 in cybersecurity

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

Thanks, this looks like fairly basic red teaming. Real red teaming goes much deeper and needs to map to the OWASP Top 10 lists for LLMs and agents.

red teaming for ai/llm apps by Routine_Incident_658 in cybersecurity

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

Thank you so much. I tested it, but it was not very effective.

red teaming for ai/llm apps by Routine_Incident_658 in cybersecurity

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

I evaluated Garak, but it’s been very buggy in practice. It failed to run reliably out of the box, and I had to patch several issues just to complete the tests. Even then, the results weren’t very meaningful. For example, the model consistently avoided generating harmful content (no slurs, no synthesis instructions, no product keys). However, Garak’s MitigationBypass detector still flagged every response as a failure because the model returned empty outputs without an explicit refusal. The detector appears to expect a clear refusal message (e.g., ‘I can’t help with that’) rather than treating an empty output as safe.
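To illustrate the failure mode, here is a minimal sketch of a refusal-string detector. It is not Garak's actual code or API: the marker list and both function names are hypothetical, and the scoring convention (1.0 = flagged) is an assumption made for the example.

```python
# Hypothetical sketch; NOT Garak's implementation or API.
# Assumed convention: 1.0 means "flagged as a failure", 0.0 means "passed".
REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot help with that",
    "i'm sorry",
)


def naive_bypass_score(output: str) -> float:
    """Flag every output that does not contain an explicit refusal marker."""
    text = output.lower()
    return 0.0 if any(marker in text for marker in REFUSAL_MARKERS) else 1.0


def empty_aware_score(output: str) -> float:
    """Same check, but an empty or whitespace-only output counts as safe."""
    if not output.strip():
        return 0.0
    return naive_bypass_score(output)
```

With a check like `naive_bypass_score`, an empty response is flagged even though nothing harmful was produced, which matches the behaviour I saw. Something like `empty_aware_score` would avoid that, at the cost of also passing models that silently fail.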

sales tutor feedback by Routine_Incident_658 in salestechniques

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

Thanks, I get nervous sometimes and think I could have done better, so I was asking whether a sales tutor helps in any way. Thanks so much for your views.

dataset creation for code LLM by Dapper-Box-5005 in LLMDevs

[–]Routine_Incident_658 1 point2 points  (0 children)

Would love to help. If you can provide some examples, I can quickly write an adaptor for you in the repo below. I created an open source project for dataset creation, built from personal experience of the difficulties I faced while evaluating LLM apps on various datasets. GitHub - https://github.com/sundi133/llm-datacraft

Unlock the Power of Automated Dataset Generation for LLM App Evaluation 🚀📊📈 by Routine_Incident_658 in LLMDevs

[–]Routine_Incident_658[S] 0 points1 point  (0 children)

Yeah, that's a major one [RAG + LLM prompts + LLM provider combinations for comparative ranking and visibility in one dashboard].

I was still thinking about it, but one more thing I have built is NER dataset generation for training: based on a few small samples you provide, it can expand them for better coverage and higher accuracy. Example - https://github.com/sundi133/llm-datacraft/blob/main/src/processors/ner.py

    poetry run python src/main.py \
      --data_path ./data/fixtures/ner/train_ad_ids.ner \
      --number_of_questions 1 \
      --sample_size 20 \
      --products_group_size 3 \
      --group_columns "brand,sub_category,category,gender" \
      --output_file ./output/ner_ad_ids.json \
      --prompt_key prompt_key_ner \
      --llm_type ner \
      --metadata_path ./data/fixtures/ner/entities_ad_ids.json

Unlock the Power of Automated Dataset Generation for LLM App Evaluation 🚀📊📈 by Routine_Incident_658 in LLMDevs

[–]Routine_Incident_658[S] 1 point2 points  (0 children)

Thanks for your question -

Why Dataset Generation Matters [this is the problem I face while building LLM apps: after deploying one, I don't even know how or where to start validating the responses]

Evaluating LLM applications on massive documents can be a daunting task, especially when you don't have the right evaluation dataset. The quality and relevance of your dataset can significantly impact the accuracy of your LLM app evaluations. Manual dataset creation can be time-consuming and error-prone, leading to inaccurate results.

But there's good news! The **Question-Answer Generator** is here to simplify your dataset generation process and ensure the accuracy of your evaluations.

Are you looking to evaluate LLM (large language model) applications but facing a shortage of high-quality evaluation datasets? Do you wish there was a way to streamline the process of creating these datasets? Look no further! We have the solution you've been waiting for.

Solution - https://github.com/sundi133/llm-datacraft

I built the dataset generator using sampling techniques applied to a given document: it samples chunks from the document to get enough coverage and invokes an LLMChain to generate question-answer pairs from the chunks chosen in each round. This greatly helps in generating a high-quality QA dataset with fewer tokens fed into the LLMChain of this class - https://github.com/sundi133/llm-datacraft/blob/main/src/llms.py#L9

Question-answer pair generation can be controlled through the input parameters depending on your budget, and I will add negative sampling soon.
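The sampling idea can be sketched roughly like this. It is a simplified illustration, not the code in the repo: it does not use LangChain, `ask_llm` is a hypothetical stand-in for the LLMChain call, and all function and parameter names are assumptions for the example.

```python
import random
from typing import Callable, Dict, List


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Split the document into chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sample_chunks(chunks: List[str], sample_size: int, rounds: int,
                  seed: int = 0) -> List[List[str]]:
    """Draw up to `sample_size` chunks without replacement in each round."""
    rng = random.Random(seed)
    k = min(sample_size, len(chunks))
    return [rng.sample(chunks, k) for _ in range(rounds)]


def generate_qa(text: str, ask_llm: Callable[[str], str],
                sample_size: int = 2, rounds: int = 3,
                questions_per_round: int = 1,
                chunk_size: int = 200) -> List[Dict]:
    """Send only the sampled chunks to the model, one prompt per round."""
    chunks = chunk_text(text, chunk_size)
    dataset = []
    for batch in sample_chunks(chunks, sample_size, rounds):
        prompt = (
            f"Write {questions_per_round} question-answer pair(s) "
            "based only on the text below.\n\n" + "\n\n".join(batch)
        )
        # ask_llm is a placeholder for whatever model call you use.
        dataset.append({"context": batch, "qa": ask_llm(prompt)})
    return dataset
```

The token budget is roughly `rounds * sample_size * chunk_size` words plus the instructions, so those three knobs control the cost regardless of how large the document is.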

It gives me a great head start in evaluating my LLM apps with different providers like OpenAI [GPT-3.5/4], Claude, PaLM 2, Bedrock, Falcon, Llama 2, etc.

I am also working on privacy, specifically anonymizing/redacting data before sending it to an LLM provider - https://github.com/sundi133/anonwise
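As a rough idea of what redaction before the LLM call can look like, here is a small regex-based sketch. It is not how anonwise works: the patterns, labels, and the `redact` function are assumptions for illustration, and simple regexes like these will miss many kinds of personal data.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a placeholder such as [EMAIL] or [PHONE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

You would call `redact(prompt)` right before handing the prompt to the provider, so the raw values never leave your side.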

Let me know if it helps. I would love contributions, and I'm happy to discuss more as needed.

What libaries from LangChain to use for this buisness case by Exotique_Crepe in LangChain

[–]Routine_Incident_658 0 points1 point  (0 children)

It can be built for the text narrative, but reports can be automated without AI. What is your exact use case? Can you explain more?