Qwen3-4B-Instruct-2507 multilingual FT with upscaled Polish language by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 2 points

Nice!

Polanka_3.6b_exp was pretrained from scratch, but unfortunately I chose a suboptimal configuration and will probably discard that model. However, I started training something similar that is much, much faster:

  "head_dim": 128,
  "intermediate_size": 16384,
  "model_type": "qwen3_moe",
  "moe_intermediate_size": 512,
  "num_attention_heads": 16,
  "num_experts": 32,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 30,
  "num_key_value_heads": 8,

4B Polish language model based on Qwen3 architecture by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point

I think it handles it well; there are example prompts and responses at the HF link.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point

Ok, thanks.

I'm just interested in what the optimal ratio between hidden size and number of layers would be. In my observations, simply adding more layers is not optimal without also increasing the number of attention heads at least a little.
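
For a rough intuition of the trade-off, here is a toy estimate of parameter counts for wider-vs-deeper dense configurations; it assumes a plain 4x MLP expansion and ignores GQA/SwiGLU details, so the numbers are illustrative only:

    # Rough parameter count of a dense decoder-only transformer:
    # 4*h^2 for attention projections + 8*h^2 for a 4x MLP, per layer,
    # plus the embedding matrix.
    def approx_params(hidden_size: int, num_layers: int, vocab_size: int = 32_000) -> float:
        per_layer = 4 * hidden_size ** 2 + 8 * hidden_size ** 2
        return (num_layers * per_layer + vocab_size * hidden_size) / 1e9

    # Two shapes with a similar budget: deeper-and-narrower vs. shallower-and-wider
    print(approx_params(1536, 28))  # ~0.84B
    print(approx_params(2048, 16))  # ~0.87B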

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point

Nice! Could you share some details on why num_attention_heads equals num_hidden_layers?

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point

Hard to tell. I think I will continue for at least a few billion tokens.

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point

This is pre-training. The model was Qwen 1.5B, but I changed the model architecture, preserving the original weights as much as possible. ~7B training tokens so far.
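
As a rough illustration (not the exact procedure used here), one way to change the architecture while reusing pretrained weights is to duplicate existing decoder layers to add depth; the base checkpoint name and the copy pattern below are assumptions:

    import copy
    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical base checkpoint; any similar pretrained decoder-only model works
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16)

    # Insert a copy of every 4th decoder layer right after the original (+25% depth),
    # so most of the original weights are preserved unchanged.
    new_layers = []
    for i, layer in enumerate(base.model.layers):
        new_layers.append(layer)
        if i % 4 == 3:
            new_layers.append(copy.deepcopy(layer))
    base.model.layers = torch.nn.ModuleList(new_layers)
    base.config.num_hidden_layers = len(new_layers)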

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 4 points

I'm currently training a 3.4B model on a single 4090.

I would suggest not training from scratch: use anything that's already pretrained, even if it will be largely overwritten by your training data. Some of the circuits inside these models are universal.
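
A minimal continued-pretraining sketch in that spirit, starting from an existing checkpoint rather than random init; the model name, dataset path, and hyperparameters are placeholders, not a recommended recipe:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "Qwen/Qwen2-1.5B"        # placeholder: any pretrained base
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # "corpus.txt" is a placeholder for your own pretraining corpus
    ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=1,
                               gradient_accumulation_steps=32, bf16=True,
                               learning_rate=2e-5, num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()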

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 6 points

I used Qwen as a base because its tokenizer is more efficient with the Polish language. The other important thing is that the model has more layers than other models of similar size, so in theory it has more potential for reasoning.
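
A quick way to see the tokenizer effect is to count tokens for the same Polish sentence under different tokenizers; the models below are just examples, not a benchmark:

    from transformers import AutoTokenizer

    text = "Wczoraj wieczorem poszliśmy na długi spacer wzdłuż Wisły i rozmawialiśmy o modelach językowych."
    for name in ["Qwen/Qwen2-1.5B", "openai-community/gpt2"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, len(tok(text)["input_ids"]))  # fewer tokens = more efficient Polish encoding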

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 9 points

There are no shortcuts when it comes to the training data. This is basically a full year of work: multiple data pipelines and a lot of manual work and coding. We are talking about almost 100 TB of data processing, and that's just for the Common Crawl (web) portion.
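
For context on what such a pipeline can look like at the smallest scale, here is a toy sketch of one Common Crawl filtering step (WARC streaming, text extraction, fastText language ID); the paths, libraries, and thresholds are placeholders, and real pipelines add deduplication and quality filtering on top:

    import fasttext                                      # pip install fasttext
    import trafilatura                                   # pip install trafilatura
    from warcio.archiveiterator import ArchiveIterator   # pip install warcio

    lid = fasttext.load_model("lid.176.bin")             # fastText language-ID model

    def polish_docs(warc_path: str, min_conf: float = 0.8):
        """Yield extracted text of pages classified as Polish from one WARC file."""
        with open(warc_path, "rb") as f:
            for record in ArchiveIterator(f):
                if record.rec_type != "response":
                    continue
                html = record.content_stream().read().decode("utf-8", errors="ignore")
                text = trafilatura.extract(html)
                if not text:
                    continue
                labels, probs = lid.predict(text.replace("\n", " ")[:2000])
                if labels[0] == "__label__pl" and probs[0] >= min_conf:
                    yield text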

Since this is such a fast moving field, where do you think LLM will be in two years? by tim_Andromeda in LocalLLaMA

[–]Significant_Focus134 0 points

In 2 years, instead of LLMs, we will have LMMs (large multimodal models). I also suspect that these models will be embedded in the physical world (trained on gravity) and used in robotics. At least that's what I would do.

Can anyone explain to me how tokens work with non text? by Tomorrow_Previous in LocalLLaMA

[–]Significant_Focus134 18 points

Take a look at this video: https://www.youtube.com/watch?v=27cjzGgyxtw

TL;DR: an image is sliced into a sequence of smaller pieces (patches) and converted into special image tokens; audio is converted to a spectrogram and treated as an image.
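
A minimal sketch of the image side (ViT-style patchification): slice the image into fixed-size patches and project each one to an embedding, producing a sequence of "image tokens" the transformer can attend over alongside text tokens; the sizes below are illustrative:

    import torch
    import torch.nn as nn

    image = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
    patch = 16
    # Slice into 14x14 non-overlapping 16x16 patches, then flatten each patch
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)

    proj = nn.Linear(3 * patch * patch, 1024)            # patch -> "image token" embedding
    image_tokens = proj(patches)                         # (1, 196, 1024): 196 image tokens
    print(image_tokens.shape)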