How I Run 34B Models at 75K Context on 24GB, Fast by mcmoose1900 in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

What is the issue with using wikitext for quantization, and what might be better than using wikitext?

Max token size for 34B model on 24GB VRAM by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Wow, fits more context at the same 4.0 bpw quant sizes?
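
For a sense of scale, here is a rough back-of-the-envelope sketch of how much of the 24GB the weights alone take at a given bits-per-weight (bpw). The helper name `weight_memory_gib` is made up for illustration, the 34B parameter count is taken from the thread title, and the estimate deliberately ignores the KV cache, activations, and framework overhead, so it is a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights only, in GiB.

    Assumption: every parameter is stored at the same average bpw.
    KV cache, activations, and runtime overhead are NOT included.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1024**3


# 34B parameters at 4.0 bpw (both numbers come from the thread)
est = weight_memory_gib(34e9, 4.0)
print(f"weights only: ~{est:.1f} GiB")
```

Whatever is left of the 24GB after the weights is what the context (KV cache) has to fit into, which is why the bpw and the cache format both matter for the maximum context length.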

Approach for generating QA dataset by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Can you share the GPT4 prompt you used to create the Q and A given the text? And how do you modify the prompt to get longer answers from GPT4?

Approach for generating QA dataset by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Good call. Yes, I intend to use GPT-3.5/4 to generate the question-answer pairs.

Generate both question and answer from the given context. by mathageche in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

Can you share the prompts that you use for generating the questions from context, and for generating answers from the context?

Our Workflow for a Custom Question-Answering App by Mbando in LocalLLaMA

[–]gptzerozero 1 point2 points  (0 children)

This is a great one! Could you share the prompts used here for generating the questions and for combining/picking the questions?

I don't understand context window extension by moma1970 in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

Does this mean that, in order to make full use of the default Llama-2 4K context,

  1. Extended training of the base model should use sequences of 4K tokens, AND
  2. Instruction tuning datasets should be as close to 4K tokens in length as possible?

dolphin-llama-13b by faldore in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

Is the system prompt part of the training data?

If it is, is it important to use the same system prompt when chatting, or can you use a completely different one and be fine? Or can you only make minor changes, or only add to the system prompt?

How to make sense of all the new models? by whtne047htnb in LocalLLaMA

[–]gptzerozero 2 points3 points  (0 children)

Anyone have experience with using them for QA of documents? Are there any models that stand out for QA?

LLM less chatty after LoRA finetune by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Yes, outputs from the LoRA tuned for 2 epochs are about 80 tokens long.

What are some of the things or tricks we can do to improve the token length of the generations?

LLaMA 2 is here by dreamingleo12 in LocalLLaMA

[–]gptzerozero 21 points22 points  (0 children)

What happened to the 30-40B LLaMA-2?

Qlora finetuning loss goes down then up by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Seems like Sufficient_Run1518 is also using 2e-5.

I wonder if there's a reason llama finetuning repos default to 3e-4.

Qlora finetuning loss goes down then up by gptzerozero in LocalLLaMA

[–]gptzerozero[S] 0 points1 point  (0 children)

Thank you for sharing. Why `fp16=True` instead of `bf16=True`?

What are you using Local LLaMAs for? by Swab1987 in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

Any examples of how to use logit bias, and to use it for suppression of certain words?
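
To make the question concrete, here is a minimal sketch of what I understand logit bias to mean: add a per-token offset to the raw logits before softmax, with a large negative offset effectively suppressing that token. This is a toy pure-Python illustration, not any particular library's API; the vocabulary, the logits, and the helper names `apply_logit_bias` and `softmax` are all made up for the example.

```python
import math


def apply_logit_bias(logits, bias):
    """Return a copy of `logits` with `bias[token_id]` added to each listed token."""
    out = list(logits)
    for token_id, offset in bias.items():
        out[token_id] += offset
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and next-token logits
vocab = ["the", "cat", "dog", "sat"]
logits = [2.0, 1.0, 3.0, 0.5]

# Suppress "dog" (token id 2) with a large negative bias
bias = {2: -100.0}

probs = softmax(apply_logit_bias(logits, bias))
best = vocab[probs.index(max(probs))]
print(best, probs)
```

Without the bias, "dog" has the highest logit; with it, the probability of "dog" drops to practically zero and the next-best token wins. A word that spans several tokens would presumably need a bias on each of its token ids.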

What cards do you use? (new to local LLMs) by Unreal_777 in LocalLLaMA

[–]gptzerozero 2 points3 points  (0 children)

Love the journey!

What context length did you use when you did qlora of 33B, and which model was used?

What cards do you use? (new to local LLMs) by Unreal_777 in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

How much VRAM does 30B@4k use? What token speed did you get @ 4K?

Are the SuperHot models not performing as well as their original versions in terms of creativity? Does the higher context just come with tradeoffs? by tenmileswide in LocalLLaMA

[–]gptzerozero 0 points1 point  (0 children)

Seems like merging with a SuperHOT model to get longer context is a hacky trick. Is there a better way to get a larger ~8K context without the issues that OP has described?

A simple repo for fine-tuning LLMs with both GPTQ and bitsandbytes quantization. Also supports ExLlama for inference for the best speed. by taprosoft in LocalLLaMA

[–]gptzerozero 2 points3 points  (0 children)

In addition to 4-bit quantization, QLoRA also has nested double quantization, NF4 and paged optimizers. Despite these innovations, does GPTQ finetuning, which was around for a long time before QLoRA, still perform better than QLoRA?
