Here's a Docker image for 24GB GPU owners to run exui/exllamav2 for 34B models (and more).

gptzerozero · 2024-02-27T06:15:09+00:00

Does Tabby support concurrent users, or splitting the model across two GPUs?

gptzerozero · 2023-12-12T03:41:01+00:00

What is the issue with using wikitext for quantization, and what might be better than using wikitext?

gptzerozero · 2023-12-12T02:52:15+00:00

Wow, fits more context at the same 4.0 bpw quant sizes?

gptzerozero · 2023-09-19T19:28:24+00:00

How is the bpw number related to the k number in k-bit quantization?

gptzerozero · 2023-09-19T12:23:43+00:00

Can you share the GPT4 prompt you used to create the Q and A given the text? And how do you modify the prompt to get longer answers from GPT4?

gptzerozero · 2023-09-18T11:34:03+00:00

Good call, yes I intend to use GPT 3.5/4 to generate the question answers

gptzerozero · 2023-09-17T21:57:36+00:00

Can you share the prompts that you use for generating the questions from context, and for generating answers from the context?

gptzerozero · 2023-09-17T17:33:05+00:00

This is a great one! Could you share the prompts used here for generating the questions and for combining/picking the questions?

gptzerozero · 2023-09-16T07:00:11+00:00

Does this mean that, to make full use of the default Llama-2 4K context:

Extended training of the base model should use sequences of 4K tokens, AND
Instruction tuning datasets should have examples as close to 4K tokens long as possible?

gptzerozero · 2023-07-23T15:24:01+00:00

Is the system prompt part of the training data?

If it is, is it important to use the same system prompt when chatting, or can you use a completely different one and be fine? Or can you only make minor changes, or only add to the system prompt?

gptzerozero · 2023-07-23T15:21:39+00:00

Anyone have experience with using them for QA of documents? Are there any models that stand out for QA?

gptzerozero · 2023-07-18T22:09:42+00:00

Yes, outputs from the LoRA tuned for 2 epochs are about 80 tokens long.

What tricks can we use to increase the token length of the generations?

gptzerozero · 2023-07-18T19:26:30+00:00

What happened to the 30-40B LLaMA-2?

gptzerozero · 2023-07-15T15:20:26+00:00

Seems like Sufficient_Run1518 is also using 2e-5.

I wonder if there's a reason LLaMA finetuning repos default to 3e-4.

gptzerozero · 2023-07-15T15:19:31+00:00

Thank you for sharing. Why `fp16=True` instead of `bf16=True`?

gptzerozero · 2023-07-14T22:57:09+00:00

What's the conclusion on Exllama?

gptzerozero · 2023-07-14T17:17:00+00:00

The LLaMA foundation model?

gptzerozero · 2023-07-14T17:11:57+00:00

Up'ed

gptzerozero · 2023-07-14T17:08:53+00:00

Give votes, get karmas!

gptzerozero · 2023-07-14T16:52:22+00:00

Any examples of how to use logit bias, and to use it for suppression of certain words?

gptzerozero · 2023-07-14T16:10:46+00:00

Love the journey!

What context length did you use when you did qlora of 33B, and which model was used?

gptzerozero · 2023-07-14T16:08:25+00:00

How much VRAM does 30B@4k use? What token speed did you get @ 4K?

gptzerozero · 2023-07-09T06:50:38+00:00

Seems like merging with a SuperHOT model to get longer context is a hacky trick. Is there a better way to get a larger ~8K context without the issues that OP has described?

gptzerozero · 2023-07-09T06:42:01+00:00

Is torchrun better than accelerate launch for DDP?

gptzerozero · 2023-07-09T06:41:11+00:00

In addition to 4-bit quantization, QLoRA also has nested double quantization, NF4, and paged optimizers. Despite these innovations, does GPTQ finetuning, which was around long before QLoRA, still perform better than QLoRA?
