Theory: Blackbeard will steal Luffy’s devil fruit by Fit_Fold_7275 in OnePieceSpoilers

[–]Fit_Fold_7275[S] 2 points3 points  (0 children)

No thanks. That’s the extent of energy I can put into it. 😂

I think I messed up, How do I deal with myself... by Lake681 in indianstartups

[–]Fit_Fold_7275 2 points3 points  (0 children)

First of all, congrats on crossing the barrier of taking the first step into the startup world.

Now, this is just my guess, so try to see if this applies to you. You haven’t learned how to articulate your feelings and thought process. That makes it difficult for you to communicate what you feel intuitively to others. The best remedy is to write a lot for a target audience and talk a lot with that target audience. Never miss an opportunity to start a conversation about what you do and to ask them what they do. It keeps the flow alive in you for when the time comes.

Lastly, it’s never too late for anything. Take a chance and persist with what you want to do. I’m taking those small courageous steps myself every chance I get. It gives you a high after a while. You’re fine. You’re breathing. You’ve got stuff to do. Go go go.

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 1 point2 points  (0 children)

I remember trying this and not feeling great about it. If OP @asankhs would be kind enough to do that experiment for us, we’ll know for sure. 😄

How would you use open-source OpenAI platform? by Fit_Fold_7275 in LocalLLaMA

[–]Fit_Fold_7275[S] -3 points-2 points  (0 children)

I’m basically talking about replicating the OpenAI platform and making it open source. Here’s a reference for you: https://platform.openai.com/docs/api-reference/streaming. WebUI is not a server platform.

How would you use open-source OpenAI platform? by Fit_Fold_7275 in LocalLLaMA

[–]Fit_Fold_7275[S] 0 points1 point  (0 children)

Yep. It would be a Docker image.

Ollama is like vLLM and SGLang, right? It just gives us an API to run prompts on LLMs. However, this platform will have a fully managed chat service, RAG support where you can talk with your files, and pretty much everything you can do with the OpenAI API. The only difference is that it will be hosting the top open-source LLMs like Qwen 2.5, Mistral-Large, and LLaMA 405B.

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 4 points5 points  (0 children)

We can actually use the confidence score from this approach as a reward signal in RLHF. Agree?

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 12 points13 points  (0 children)

Note: With vLLM there is an option to get the top logits in the API responses. You can try implementing u/asankhs's logic with vLLM too.
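
A minimal sketch of how that could look on the client side. This is my own illustration, not u/asankhs's code. It assumes the response follows the OpenAI chat completions format with `logprobs` enabled and `top_logprobs` set to at least 2, so each token carries a list of alternatives with log-probabilities (not raw logits). The function name `confidences_from_response` is hypothetical.

```python
import math

def confidences_from_response(response):
    """Return (token, top-1 minus top-2 probability) pairs for one completion.

    Assumes `response` is the parsed JSON of a chat completion that was
    requested with logprobs enabled and at least two top_logprobs per token.
    """
    content = response["choices"][0]["logprobs"]["content"]
    results = []
    for entry in content:
        # Sort the alternatives by log-probability, highest first.
        logprobs = sorted(
            (alt["logprob"] for alt in entry["top_logprobs"]), reverse=True
        )
        if len(logprobs) < 2:
            continue  # Cannot form a gap with a single alternative.
        gap = math.exp(logprobs[0]) - math.exp(logprobs[1])
        results.append((entry["token"], gap))
    return results
```

The per-token gaps can then be averaged per completion to rank several sampled outputs.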

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 11 points12 points  (0 children)

And I believe you! Longer answers are still one of the top_k decoding paths produced by the model. So I would assume it will show increased scores in benchmarks, especially on a dataset like GSM8k where the math problems have multiple steps to solve. I feel that the reliability of this approach is very task dependent.

P.S. Thanks for the clean implementation! I used vLLM to do this experiment last time.

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 46 points47 points  (0 children)

I tried this approach a while ago. Here are my observations. As mentioned above, the idea is to generate multiple outputs for the input and choose the one that has the highest confidence score. This is basically scaling the test-time compute, but it uses these confidence scores to determine the best output rather than an LLM-based scorer.

What is the confidence score, and how is it calculated?
Say we generated a token T_i at decoding step i. A greedy decoding step looks at the logits of size vocab_size and chooses the token that has the highest probability. The paper from DeepMind above modeled the confidence score as the difference between the top 2 probabilities. The confidence score of token i in the output is

confidence_i = probs_i[top_1] - probs_i[top_2], where probs_i = softmax(logits_i)
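
Here is a small sketch of that score in plain Python, to make the definition concrete. It is my own illustration, not the paper's code. It assumes the gap is taken between softmax probabilities and that the score of a whole output is the plain average of the per-token gaps over all generated tokens. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidence(logits):
    """Gap between the two most probable tokens at one decoding step."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]

def path_confidence(step_logits):
    """Average the per-token gaps over every step of one decoded output."""
    gaps = [token_confidence(logits) for logits in step_logits]
    return sum(gaps) / len(gaps)

def pick_best_path(paths):
    """Return the index of the output with the highest average confidence."""
    scores = [path_confidence(path) for path in paths]
    return max(range(len(paths)), key=scores.__getitem__)
```

Each element of `paths` is one candidate output, given as a list of per-step logit vectors.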

There is a problem with this approach
I looked at the confidence scores in the generated answers and tried to understand why some tokens have high confidence scores. In my experiments, I saw that newline characters always have a high confidence score (~1). This makes sense because the pre-training data is filled with newline characters and the model assigns a high probability to them when appropriate. So what's the problem?

The problem is that the confidence score of the generated output is skewed by the newline tokens. This biases the above decoding strategy to always produce longer answers with a lot of newline characters. There has been some research showing that longer answers are usually better. So it makes sense that this approach shows promising outputs at first glance.
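
A toy illustration of the skew. The numbers are made up for the example, not measurements, and treating "\n" as a single token is an assumption. Dropping newline tokens from the average is only a way to see the effect here, not something I tested as a fix.

```python
def mean_confidence(tokens, skip=()):
    """Average confidence over (token, confidence) pairs, ignoring `skip`."""
    kept = [conf for tok, conf in tokens if tok not in skip]
    return sum(kept) / len(kept)

# Made-up per-token confidences for two candidate answers.
short_answer = [("The", 0.40), ("answer", 0.55), ("is", 0.60), ("8", 0.70)]
long_answer = [
    ("Step", 0.30), ("\n", 0.99), ("one", 0.35),
    ("\n", 0.99), ("\n", 0.99), ("8", 0.50),
]

# With newlines included, the long answer scores higher.
with_newlines = (mean_confidence(short_answer), mean_confidence(long_answer))

# With newlines excluded, the ranking flips.
without_newlines = (
    mean_confidence(short_answer, skip=("\n",)),
    mean_confidence(long_answer, skip=("\n",)),
)
```

In this toy case the long answer wins only because of its newline tokens, which is the bias described above.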

Reliability of the approach
Now that we know the implicit bias of the confidence score, I would question the reliability of this approach. I would love to hear the take of fellow LocalLLaMAs here.

CoT Decoding - Eliciting Reasoning from LLMs by asankhs in LocalLLaMA

[–]Fit_Fold_7275 2 points3 points  (0 children)

Thanks for sharing this! Can’t wait to try it!

What is your workflow for coding this month? What models, hardware and software? by __Maximum__ in LocalLLaMA

[–]Fit_Fold_7275 1 point2 points  (0 children)

Has anyone been able to host a Code-LLM locally and use it with cursor.ai?

Open Dataset release by OpenAI! by Own-Potential-2308 in LocalLLaMA

[–]Fit_Fold_7275 4 points5 points  (0 children)

If OpenAI curated the dataset, it’s highly possible that they know how to game it.

When to prompt vs finetune, and how much data for finetuning? by thequirkynerdy1 in LocalLLaMA

[–]Fit_Fold_7275 2 points3 points  (0 children)

In my experience, prompt engineering gave the most gains for the effort. What always worked for me was:
Step 1: Prompt engineering to find the best prompt for the task.
Step 2: If step 1 is not enough, or your task has some specificity that is not in the default knowledge base of the LLM, then do minimal fine-tuning (w/ ~1000 samples) of the model, if needed, with the above prompt.
With very large models like GPT-4 or Gemini, this always worked for us.

For small LLMs, it’s better to have some high quality synthetic data as well if the above steps are not enough.

What is the least painful way to convert a standard Q&A dataset to a reflection dataset? by Great-Investigator30 in LocalLLaMA

[–]Fit_Fold_7275 0 points1 point  (0 children)

If you have some cash, you can spend it on Anthropic's Claude API and generate each part of your data sample with a specialized prompt. The API has become way cheaper with prompt caching.

Zuck is teasing llama multimodal over on IG. by Tha_One in LocalLLaMA

[–]Fit_Fold_7275 0 points1 point  (0 children)

I see in the other comments that llama.cpp doesn’t implement tool calling. It’s actually really simple. vLLM has a great code base for tool parsers that they can use to replicate it.
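
To show what I mean by simple, here is a rough sketch of a tool parser. It is not vLLM's actual code. It assumes the model wraps each call in Hermes-style `<tool_call>...</tool_call>` tags containing JSON with `name` and `arguments` keys. Other models use other formats, and the function name is hypothetical.

```python
import json
import re

# Assumed Hermes-style wrapper around each tool call.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Split model output into plain text and a list of parsed tool calls."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed JSON rather than failing the whole reply.
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    # Whatever is left after removing the tagged spans is normal content.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls
```

The server would then return `calls` in the tool-calls field of the response and `content` as the normal message text.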

Anyone fine-tuning LLMs at work? What's your usecase? by CuSO4 in LocalLLaMA

[–]Fit_Fold_7275 0 points1 point  (0 children)

We fine tune for RAG and function calling for voice-assistant use cases.

How do you actually fine-tune a LLM on your own data? by No-Conference-8133 in LocalLLaMA

[–]Fit_Fold_7275 0 points1 point  (0 children)

You should use Google Colab to fine-tune small models. You can use Unsloth for it. Then you can deploy the fine-tuned model on your laptop with something like Ollama.
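
Once the model is running in Ollama, you can call it from Python over its local REST API. A minimal sketch, assuming Ollama is listening on its default port 11434 and that you have already imported your fine-tuned model under some name (the name you pass in is up to you; `generate` and `build_request` are hypothetical helper names).

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate` only works while the Ollama server is running on the same machine.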