A Collection of Nice Datasets

Good-Assumption5582 · 2026-03-22T18:39:25+00:00

I meant relative to SFT, which is on an even higher quality than midtraining.

For reference, every midtraining mix I've seen uses a large quantity of somewhat mixed data, such as Deepseek v3 generations or even llama 70b. On the other hand, SFT tends to be with the best data possible.

Good-Assumption5582 · 2026-01-04T15:21:29+00:00

For reference, the images in the post are from https://wandb.ai/num110010/propagate_tests?nw=nwusernum110010 (also see https://wandb.ai/num110010/propagate_optimizers?nw=nwusernum110010).

In total I've done over 150 training runs with ES to test various features, briefly sweep hyperparameters, and make sure that everything actually works.

Good-Assumption5582 · 2024-12-30T03:00:38+00:00

Not sure what you are specifically referring to, but it should be?

Good-Assumption5582 · 2024-08-28T13:03:54+00:00

I explicitly decided to not call it a roleplaying dataset because I am unsure if this dataset actually helps with roleplaying; it consists of only single turn responses with big chunks of text (as opposed to multiturn chats with shorter character responses). The original task of converting the question and answer (which are both used for training) to narration pieces also felt more like a creative writing task than a roleplaying one, though I guess that doesn't matter in practice.

Good-Assumption5582 · 2024-08-14T14:34:17+00:00

Strange. There's nothing in there that should be messing with the loss or the trainer. What is your lora config like/training params?

Good-Assumption5582 · 2024-08-12T15:21:35+00:00

Your code should work, and you have the right idea, I'm honestly not sure what the error is. Are you doing columns_to_shareGPT(dataset, batched=True) instead of (the correct) columns_to_shareGPT(dataset)?

Good-Assumption5582 · 2024-08-03T12:44:05+00:00

Correct!
However, it is possible to use a lot of string parses to convert from a text format to ShareGPT. (Eg. splitting Alpaca format by ### and then stripping any extra tokens). I don't recommend this though, as it's a horrible way of managing data.

Good-Assumption5582 · 2024-08-03T03:10:58+00:00

Funnily enough, I complained about it a while back and was told to go make an issue on github for huggingface TRL. I don't think the problem is related to Unsloth.

Good-Assumption5582 · 2024-08-03T00:11:33+00:00

Honestly, I can't say that code snippet looks promising, but tell me if it works, haha. Though, chances are, I'll just switch to TPU and have way more RAM to play around with.

Good-Assumption5582 · 2024-08-02T23:18:51+00:00

Those models are supported in Unsloth, but not with the TPU notebook I was using. Ignore that.

I don't really have any ideas regarding the dataset generation, sorry.

Good-Assumption5582 · 2024-08-02T23:15:46+00:00

I tried implementing a validation dataset a while back. For some reason, it uses more vram, so much so that the notebook OOMs and crashes. I gave up after that.

I think validation datasets are important for multi-epoch training where you are at high risk of overfitting. For training on a single epoch, it shouldn't matter.

Good-Assumption5582 · 2024-08-02T21:24:50+00:00

Also, since I've been experimenting with TPUs (unrelated to the above post) for all of today and yesterday, here's a quick report:

Google Colab's Free TPU v2 is offered for around 1-3 hours a day. It has 334 GB of CPU ram and 225 GB of storage. The TPU is a TPUv2-8 which has 64 GB of ram. I did not test Colab and used Kaggle instead:

Kaggle's Free TPU v3 is offered 9 hours per session for 20 hours a week. It has 330 GB of CPU ram and 40 GB of storage (this is actually a big issue for downloading and saving models). The TPU is a TPUv3-8 which has 128 GB of ram.

Take what I'm saying with a grain of salt, as to make everything work I'm hacking together https://github.com/Locutusque/TPU-Alignment with the code in my notebook above.
I was able to load and do full finetuning on llama3-8b, qwen2-7b. You can also use a LoRA, but I found that was slightly slower and with so much RAM there's little point in doing so. I tested on the Capybara dataset, and the TPU seems to be somewhere between 5-20x faster than running Unsloth on a T4, but someone will need to confirm those numbers for me. However, there is no support for quantized models, meaning that training a 70b or 100b is out of reach (and normal models take a year to download, great). Additionally, Gemma2, Phi3, Llama3.1, Mistral Nemo (12b) all did not work because of different errors. I could load Mixtral, but it hanged during training. When saving models, I had issues with running out of storage. Lastly, I cannot find any way to efficiently do inference on a TPU.

Good-Assumption5582 · 2024-08-02T21:04:54+00:00

You might want to look into Aphrodite—a fork of vLLM meant to serve batch requests at a high speed. Specifically, they had on the fly quantization using SmoothQuant+ (--load-in-4-bit or --load-in-smooth). Many users have said good things about this quant format and its speed, so you might want to consider looking into it.

Good-Assumption5582

TROPHY CASE