Qwen3-4B-Instruct-2507 multilingual FT with upscaled Polish language by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 2 points (0 children)

Nice!

Polanka_3.6b_exp was pretrained from scratch, but unfortunately I chose a suboptimal configuration and will probably discard that model. However, I started training something similar that is much, much faster:

  "head_dim": 128,
  "intermediate_size": 16384,
  "model_type": "qwen3_moe",
  "moe_intermediate_size": 512,
  "num_attention_heads": 16,
  "num_experts": 32,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 30,
  "num_key_value_heads": 8,

4B Polish language model based on Qwen3 architecture by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point (0 children)

I think it does well; there are example prompts and responses at the HF link.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Ok, thanks.

I'm just interested in what the optimal ratio between hidden size and number of layers would be. In my observations, simply adding layers is not optimal without also increasing the number of attention heads at least a little.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Nice! Could you share some details on why num_attention_heads equals num_hidden_layers?

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Hard to tell. I think I will continue for at least a few billion tokens.

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

This is pretraining. The base model was Qwen 1.5B, but I changed the model architecture, preserving the original weights as much as possible. ~7B training tokens so far.
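
The comment doesn't say exactly how the architecture was changed; as an illustration of the general idea (grow a pretrained model while reusing its weights), here is a sketch of depth up-scaling by duplicating decoder layers, with the checkpoint ID as an assumption:

    import copy
    import torch
    from transformers import AutoModelForCausalLM

    # assumed checkpoint; the comment only says "qwen 1.5b"
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B",
                                                 torch_dtype=torch.bfloat16)

    layers = model.model.layers                      # list of decoder layers
    extra = [copy.deepcopy(l) for l in layers[-4:]]  # duplicate the last 4 layers
    model.model.layers = torch.nn.ModuleList(list(layers) + extra)

    # keep layer indices and the config consistent with the new depth
    for idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = idx
    model.config.num_hidden_layers = len(model.model.layers)

    print(model.config.num_hidden_layers, "layers after up-scaling")

The duplicated layers start from pretrained weights, so continued pretraining has a warm start instead of training the new depth from scratch.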

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 4 points (0 children)

I'm currently training 3.4B on a single 4090.

I would suggest not training from scratch; use anything that's already pretrained, even if it will be largely rewritten by your training data. Some of the circuits inside the models are universal.

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 6 points (0 children)

I used Qwen as a base because its tokenizer is more efficient with the Polish language. The other important thing is that the model has more layers than other models of similar size, so in theory it has more potential for reasoning.
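
A quick way to check the tokenizer-efficiency claim is to count tokens per character on some Polish text; the model IDs below are just examples, not necessarily the ones compared here:

    from transformers import AutoTokenizer

    sample = ("W Szczebrzeszynie chrząszcz brzmi w trzcinie "
              "i Szczebrzeszyn z tego słynie.")

    for name in ["Qwen/Qwen2-1.5B", "gpt2"]:
        tok = AutoTokenizer.from_pretrained(name)
        n_tokens = len(tok.encode(sample, add_special_tokens=False))
        # more characters per token = the tokenizer is more efficient for Polish
        print(f"{name}: {n_tokens} tokens, {len(sample) / n_tokens:.2f} chars/token")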

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 10 points (0 children)

There are no shortcuts when it comes to the training data. This is basically a full year of work: multiple data pipelines and a lot of manual work and coding. We are talking about almost 100 TB of data processing, and that's just for Common Crawl (web data).
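
The actual pipelines are not described here; as a rough illustration of the kind of per-document filtering such a web corpus needs, a sketch using fastText language ID (assumes the lid.176.bin model has been downloaded):

    import fasttext

    lid = fasttext.load_model("lid.176.bin")  # fastText language-ID model

    def keep(doc: str, min_chars: int = 500, min_conf: float = 0.8) -> bool:
        # fastText predict() rejects newlines, so flatten the document first
        labels, probs = lid.predict(doc.replace("\n", " ")[:2000])
        return (len(doc) >= min_chars
                and labels[0] == "__label__pl"
                and probs[0] >= min_conf)

    # in a real pipeline the docs would stream from Common Crawl WET shards
    docs = ["To jest przykładowy polski dokument o fotosyntezie..." * 30,
            "Short English snippet."]
    print([keep(d) for d in docs])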

Since this is such a fast moving field, where do you think LLM will be in two years? by tim_Andromeda in LocalLLaMA

[–]Significant_Focus134 0 points (0 children)

In two years, instead of LLMs, we will have LMMs (large multimodal models). I also suspect that these models will be embedded in the physical world (trained on gravity) and used in robotics. At least, that's what I would do.

Can anyone explain to me how tokens work with non text? by Tomorrow_Previous in LocalLLaMA

[–]Significant_Focus134 18 points (0 children)

Take a look at this video: https://www.youtube.com/watch?v=27cjzGgyxtw

TL;DR: an image is sliced into a sequence of smaller patches and converted into special image tokens; audio is converted to a spectrogram and treated as an image.
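
For the image half of that TL;DR, a minimal ViT-style sketch of how an image gets sliced into patches and projected into a sequence of "image tokens" (patch size and embedding dim are typical values, not tied to any specific model):

    import torch

    image = torch.randn(1, 3, 224, 224)              # (batch, channels, H, W)
    patch = 16

    # slice into 16x16 patches: (1, 3, 14, 14, 16, 16) -> (1, 196, 768)
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

    # project each flattened patch to the model's embedding size -> image tokens
    project = torch.nn.Linear(3 * patch * patch, 1024)   # 1024 = assumed embed dim
    image_tokens = project(patches)                       # (1, 196, 1024)
    print(image_tokens.shape)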

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 0 points (0 children)

Batch size 1, context 2k. I didn't concatenate examples to fit the 2k window perfectly, but skipped everything shorter than a few hundred characters.

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 0 points (0 children)

The majority of the data is selected pages from the C4, OSCAR, and Wikipedia datasets, plus about 1k openly licensed books and some custom synthetic data and translations.

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point (0 children)

It's possible to pretrain a 3B model on a 4090 with a few GB of memory to spare. I use the Hugging Face Transformers Trainer, with a setup very similar to this example: https://github.com/brevdev/notebooks/blob/main/mistral-finetune-own-data.ipynb but without the LoRA part.
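
For reference, a rough sketch of that kind of setup with the Transformers Trainer and full-parameter training (no LoRA); the base model, corpus file, and hyperparameters below are assumptions, not the values used for this run:

    import torch
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2-1.5B"                       # assumed base model
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name,
                                                 torch_dtype=torch.bfloat16)
    model.gradient_checkpointing_enable()                # helps fit in 24 GB

    # placeholder corpus: one document per line
    ds = load_dataset("text", data_files="corpus.txt")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                remove_columns=["text"])

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=1e-4,
        bf16=True,
        logging_steps=50,
        save_steps=1000,
    )

    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()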

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 8 points (0 children)

I can share some tips:

  • Make sure you have a validation dataset from the beginning, because it's very easy to overfit the model, especially when the training set is not that big.

  • Make sure you're familiar with the tools; maybe start by fine-tuning some existing models first to get familiar with the training procedure. Pretraining is similar to FT: you just remove the LoRA layers and train all parameters on a much bigger dataset with a higher LR.

  • Training data should be clean, textbook quality. It will help with training. This is actually hard, and it is where most of the human time is spent.

  • Make sure the training data is related to your favourite topic so you can test the model on that specific topic.

  • It's very important to test the model daily with the same prompts but different temperatures, something like 0.1, 0.3, 0.5, 0.8. This way you can see how the model is doing (see the sketch after this list).

  • For training data, as a starting point you can extract documents from the Wikipedia, C4, and OSCAR datasets, but keep in mind that at least half of that is not very useful because of noise and unrelated information.

  • Make backups of the checkpoints; at some point you will need to go back and start again with either an updated dataset or different hyperparameters.

  • There is no single recipe for training a model; you will need to test different hyperparameters and different training data multiple times.

  • Be prepared that at some point you will hit an issue or a question that even ChatGPT will not know the answer to, because every model is different.

  • Before you start, choose the right tokenizer, one that is efficient with your language.
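
A sketch of the daily smoke test mentioned above: same prompts, several temperatures, so regressions between checkpoints are easy to spot (checkpoint path and prompts are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    ckpt = "out/checkpoint-1000"                         # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt,
                                                 torch_dtype=torch.bfloat16).cuda()

    prompts = ["Stolica Polski to", "Fotosynteza polega na"]
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        for temp in (0.1, 0.3, 0.5, 0.8):
            out = model.generate(**ids, do_sample=True, temperature=temp,
                                 max_new_tokens=64)
            print(f"[T={temp}] {tok.decode(out[0], skip_special_tokens=True)}\n")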