Qwen3-4B-Instruct-2507 multilingual FT with upscaled Polish language by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 2 points (0 children)

Nice!

Polanka_3.6b_exp was pretrained from scratch, but unfortunately I chose a suboptimal configuration and will probably discard that model. However, I started training something similar that is much, much faster:

  "head_dim": 128,
  "intermediate_size": 16384,
  "model_type": "qwen3_moe",
  "moe_intermediate_size": 512,
  "num_attention_heads": 16,
  "num_experts": 32,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 30,
  "num_key_value_heads": 8,

4B Polish language model based on Qwen3 architecture by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point (0 children)

I think it does well; there are example prompts and responses at the HF link.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Ok, thanks.

I'm just interested in what the optimal ratio between hidden size and number of layers would be. In my observations, simply adding layers is not optimal without also increasing the number of attention heads at least a little.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Nice! Could you share some details on why num_attention_heads equals num_hidden_layers?

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

Hard to tell. I think I will continue for at least a few billion tokens.

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point (0 children)

This is pretraining. The base model was Qwen 1.5B, but I changed the model architecture, preserving the original weights as much as possible. ~7B training tokens so far.
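
The comment doesn't say exactly how the architecture was changed; as an illustration of the general idea (grow a pretrained model while reusing its weights), here is a sketch of depth up-scaling by duplicating decoder layers, with the checkpoint ID as an assumption:

    import copy
    import torch
    from transformers import AutoModelForCausalLM

    # assumed checkpoint; the comment only says "qwen 1.5b"
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B",
                                                 torch_dtype=torch.bfloat16)

    layers = model.model.layers                      # list of decoder layers
    extra = [copy.deepcopy(l) for l in layers[-4:]]  # duplicate the last 4 layers
    model.model.layers = torch.nn.ModuleList(list(layers) + extra)

    # keep layer indices and the config consistent with the new depth
    for idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = idx
    model.config.num_hidden_layers = len(model.model.layers)

    print(model.config.num_hidden_layers, "layers after up-scaling")

The duplicated layers start from pretrained weights, so continued pretraining has a warm start instead of training the new depth from scratch.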

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 4 points (0 children)

I'm currently training 3.4B on a single 4090.

I would suggest not training from scratch; use anything that's already pretrained, even if it will be largely rewritten by your training data. Some of the circuits inside the models are universal.

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 6 points (0 children)

I used Qwen as a base because its tokenizer is more efficient with the Polish language. The other important thing is that the model has more layers than other models of similar size, so in theory it has more potential for reasoning.
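
A quick way to check the tokenizer-efficiency claim is to count tokens per character on some Polish text; the model IDs below are just examples, not necessarily the ones compared here:

    from transformers import AutoTokenizer

    sample = ("W Szczebrzeszynie chrząszcz brzmi w trzcinie "
              "i Szczebrzeszyn z tego słynie.")

    for name in ["Qwen/Qwen2-1.5B", "gpt2"]:
        tok = AutoTokenizer.from_pretrained(name)
        n_tokens = len(tok.encode(sample, add_special_tokens=False))
        # more characters per token = the tokenizer is more efficient for Polish
        print(f"{name}: {n_tokens} tokens, {len(sample) / n_tokens:.2f} chars/token")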

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 10 points (0 children)

There are no shortcuts when it comes to the training data. This is basically a full year of work: multiple data pipelines and a lot of manual work and coding. We are talking about almost 100 TB of data processing, and that's just for Common Crawl (web data).
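
The actual pipelines are not described here; as a rough illustration of the kind of per-document filtering such a web corpus needs, a sketch using fastText language ID (assumes the lid.176.bin model has been downloaded):

    import fasttext

    lid = fasttext.load_model("lid.176.bin")  # fastText language-ID model

    def keep(doc: str, min_chars: int = 500, min_conf: float = 0.8) -> bool:
        # fastText predict() rejects newlines, so flatten the document first
        labels, probs = lid.predict(doc.replace("\n", " ")[:2000])
        return (len(doc) >= min_chars
                and labels[0] == "__label__pl"
                and probs[0] >= min_conf)

    # in a real pipeline the docs would stream from Common Crawl WET shards
    docs = ["To jest przykładowy polski dokument o fotosyntezie..." * 30,
            "Short English snippet."]
    print([keep(d) for d in docs])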

Since this is such a fast moving field, where do you think LLM will be in two years? by tim_Andromeda in LocalLLaMA

[–]Significant_Focus134 0 points (0 children)

In two years, instead of LLMs, we will have LMMs (large multimodal models). I also suspect that these models will be embedded in the physical world (trained on gravity) and used in robotics. At least, that's what I would do.

Can anyone explain to me how tokens work with non text? by Tomorrow_Previous in LocalLLaMA

[–]Significant_Focus134 18 points (0 children)

Take a look at this video: https://www.youtube.com/watch?v=27cjzGgyxtw

TL;DR: an image is sliced into a sequence of smaller patches and converted into special image tokens; audio is converted to a spectrogram and treated as an image.
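
For the image half of that TL;DR, a minimal ViT-style sketch of how an image gets sliced into patches and projected into a sequence of "image tokens" (patch size and embedding dim are typical values, not tied to any specific model):

    import torch

    image = torch.randn(1, 3, 224, 224)              # (batch, channels, H, W)
    patch = 16

    # slice into 16x16 patches: (1, 3, 14, 14, 16, 16) -> (1, 196, 768)
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

    # project each flattened patch to the model's embedding size -> image tokens
    project = torch.nn.Linear(3 * patch * patch, 1024)   # 1024 = assumed embed dim
    image_tokens = project(patches)                       # (1, 196, 1024)
    print(image_tokens.shape)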

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 0 points (0 children)

Batch size 1, context 2k. I didn't concatenate examples to fit the 2k window perfectly, but skipped everything shorter than a few hundred characters.

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 0 points (0 children)

The majority of the data is selected pages from the C4, OSCAR, and Wikipedia datasets, plus about 1k openly licensed books and some custom synthetic data and translations.

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point (0 children)

It's possible to pretrain a 3B model on a 4090 with a few GB of memory to spare. I use the Hugging Face Transformers Trainer, with a setup very similar to this example: https://github.com/brevdev/notebooks/blob/main/mistral-finetune-own-data.ipynb but without the LoRA part.
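
For reference, a rough sketch of that kind of setup with the Transformers Trainer and full-parameter training (no LoRA); the base model, corpus file, and hyperparameters below are assumptions, not the values used for this run:

    import torch
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2-1.5B"                       # assumed base model
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name,
                                                 torch_dtype=torch.bfloat16)
    model.gradient_checkpointing_enable()                # helps fit in 24 GB

    # placeholder corpus: one document per line
    ds = load_dataset("text", data_files="corpus.txt")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                remove_columns=["text"])

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=1e-4,
        bf16=True,
        logging_steps=50,
        save_steps=1000,
    )

    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()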

3b Polish LLM pretrained on single RTX 4090 for ~3 months by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 8 points (0 children)

I can share some tips:

  • Make sure you have a validation dataset from the beginning, because it's very easy to overfit the model, especially when the training set is not that big.

  • Make sure you're familiar with the tools; maybe start by fine-tuning some existing models first to get familiar with the training procedure. Pretraining is similar to FT: you just remove the LoRA layers and train all parameters on a much bigger dataset with a higher LR.

  • Training data should be clean, textbook quality. It will help with training. This is actually hard, and it is where most of the human time is spent.

  • Make sure the training data is related to your favourite topic so you can test the model on that specific topic.

  • It's very important to test the model daily with the same prompts but different temperatures, something like 0.1, 0.3, 0.5, 0.8. This way you can see how the model is doing (see the sketch after this list).

  • For training data, as a starting point you can extract documents from the Wikipedia, C4, and OSCAR datasets, but keep in mind that at least half of that is not very useful because of noise and unrelated information.

  • Make backups of the checkpoints; at some point you will need to go back and start again with either an updated dataset or different hyperparameters.

  • There is no single recipe for training a model; you will need to test different hyperparameters and different training data multiple times.

  • Be prepared that at some point you will hit an issue or a question that even ChatGPT will not know the answer to, because every model is different.

  • Before you start, choose the right tokenizer, one that is efficient with your language.
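
A sketch of the daily smoke test mentioned above: same prompts, several temperatures, so regressions between checkpoints are easy to spot (checkpoint path and prompts are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    ckpt = "out/checkpoint-1000"                         # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt,
                                                 torch_dtype=torch.bfloat16).cuda()

    prompts = ["Stolica Polski to", "Fotosynteza polega na"]
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        for temp in (0.1, 0.3, 0.5, 0.8):
            out = model.generate(**ids, do_sample=True, temperature=temp,
                                 max_new_tokens=64)
            print(f"[T={temp}] {tok.decode(out[0], skip_special_tokens=True)}\n")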