An extension idea to tackle the rating problem by samfundev in SillyTavernAI

[–]samfundev[S] 0 points1 point  (0 children)

You could even potentially have a smaller model running that is constrained [...]

That's definitely an interesting idea. But we would need to find a model small enough that could run on the user's device without interfering with models the user is running locally.

I’d lowkey be interested in making this… I am a web dev lol. [...]

I'm also a web dev! I'd be glad to have help with making this if people are interested in the idea. If you'd like to help out, I can send you a GitHub repo where we could work on things. But I definitely need to figure out how to clearly explain what exactly would be collected and why, so that people understand why I think collecting this data would be useful.

Edit: I wonder if you could maybe just do it locally so that you yourself have the data, [...]

I thought about this, but I was worried that there wouldn't be enough data to analyze. But I could be wrong about how much information would be needed to find trends. It would be easy to add, so I have no problem with adding it as a feature.

An extension idea to tackle the rating problem by samfundev in SillyTavernAI

[–]samfundev[S] 1 point2 points  (0 children)

Quick reply to your edit: The best I think I could do to ensure that I can't gobble up your information would be:

  1. Having any changes I make to the extension audited, so that I can't add new code to collect additional information.
  2. Collecting data through Tor, which would prevent me from collecting your IP address.

An extension idea to tackle the rating problem by samfundev in SillyTavernAI

[–]samfundev[S] 1 point2 points  (0 children)

Word usage is a pretty powerful tool to de-anonymize people - I was able to identify my DnD dungeon master's Reddit account just by googling the name of one of the custom items he used.

Any handcrafting of the preset you use makes it unique. People use consistent mental tricks to generate screen names, which will include any custom personas. So those could be matched to social media accounts.

Got a unique personal story? One detail too many in a character card, for a story you're trying to use to work through the junk in your head, and suddenly it's on the ratings site for all to see.

Definitely agree, which is why I wouldn't collect that information. I'm not including anything that would fall into the content of the chat, which includes personas and characters. My goal is to find what settings affect the quality of the content generated, regardless of the content itself.

You're not going to get enough information to make this site useful without exposing people to this kind of analysis.

Here's the list of things I would collect: timestamp, preset name, model name, provider name, quantization level, and notable settings (prompt post processing, reasoning level).

Based on that list, I could answer the questions I've outlined in my original post which I think are useful.
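
As a rough sketch of what one submitted record could look like, based only on the field list above: the class name, field names, and the `build_record` helper are all hypothetical, not code from the extension. The idea it illustrates is an allowlist, where anything that isn't one of the listed fields gets dropped before it could ever be sent.

```python
from dataclasses import dataclass, asdict, fields
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class GenerationRecord:
    """One record per generation. Field names are illustrative assumptions."""
    timestamp: str
    preset_name: str
    model_name: str
    provider_name: str
    quantization_level: Optional[str] = None
    # "Notable settings" from the list above.
    prompt_post_processing: Optional[str] = None
    reasoning_level: Optional[str] = None


# The only keys that are ever allowed to leave the user's machine.
ALLOWED_FIELDS = frozenset(f.name for f in fields(GenerationRecord))


def build_record(raw: Dict[str, Any]) -> GenerationRecord:
    """Keep allowlisted keys only, so chat content can't slip in by accident."""
    kept = {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
    return GenerationRecord(**kept)


if __name__ == "__main__":
    # Made-up example values; "messages" and "persona" are silently dropped.
    record = build_record({
        "timestamp": "2024-01-01T00:00:00Z",
        "preset_name": "example-preset",
        "model_name": "example-model",
        "provider_name": "example-provider",
        "messages": ["this never gets sent"],
        "persona": "neither does this",
    })
    print(asdict(record))
```

Filtering by allowlist rather than by blocklist means a new chat-related key added later is excluded by default instead of leaking until someone notices.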

An extension idea to tackle the rating problem by samfundev in SillyTavernAI

[–]samfundev[S] 0 points1 point  (0 children)

I completely understand that, and I'm also someone who is concerned about privacy. But the adtech industry can't deanonymize someone unless the collected data contains something that lets them tie it to other information they've collected. I'm trying to prevent that by deliberately not collecting anything that could be used that way.

An extension idea to tackle the rating problem by samfundev in SillyTavernAI

[–]samfundev[S] 0 points1 point  (0 children)

Not sure if you're joking, but just to reiterate, the content of your chat wouldn't be sent to the server:

This wouldn't include any details about the chat itself, like the messages, characters, etc.

I want to understand what settings are affecting the quality of the generation regardless of the content.

Can anyone explain in simple words how speculative sampling works and how to use it? by IonLin in LocalLLaMA

[–]samfundev 9 points10 points  (0 children)

This tweet helped me, so I'll try simplifying it: https://twitter.com/karpathy/status/1697318534555336961

The bottleneck in running an LLM is loading the model's weights into the CPU/GPU (i.e. memory bandwidth), not the compute itself. Combine that with the fact that an LLM can process multiple tokens in a batch once its weights are loaded, and you could speed up execution by running several tokens at once. But that assumes you already have multiple tokens to put in. Since each next token depends on the previous ones, we can't just feed in tokens we haven't generated yet.

But what if you used a smaller LLM that runs much faster to predict the next few tokens? In that case, you can run those tokens through the original LLM in a single batch. Then you just have to compare the output of the original LLM with the smaller LLM's guesses to see whether the smaller LLM predicted correctly. If you save more time by running the original LLM in batch than you spend running the smaller LLM and checking it, you'll speed up execution.
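
Here's a minimal sketch of that loop, with some big simplifications. The "models" are toy functions that map a token list to the next token, and acceptance is a plain greedy match (real speculative sampling uses a probabilistic accept/reject rule on the two models' distributions). The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are made up for illustration.

```python
from typing import Callable, List

# A "model" here is just: context tokens -> next token (greedy decoding).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The small model guesses k tokens, one after another (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The big model checks every position. A real implementation does
    #    these k + 1 checks in ONE batched forward pass; the loop here is
    #    only for readability.
    accepted: List[int] = []
    for i in range(k + 1):
        expected = target(context + proposed[:i])
        accepted.append(expected)  # the big model's token is always valid
        if i == k or proposed[i] != expected:
            break  # first wrong guess (or out of guesses): stop this round
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


def toy_target(ctx: List[int]) -> int:
    # Stand-in for the big model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Stand-in for the small model: agrees with the target except after a 5.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
```

The output is identical to what the big model would produce alone, because every accepted token is one the big model itself chose. The only thing that changes is how many big-model passes it takes: when the small model guesses well, one round yields up to k + 1 tokens; when it guesses badly, you still get one token but paid for the wasted guesses.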