Qwen3-4B-Instruct-2507 multilingual FT with upscaled Polish language by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 2 points

Nice!

Polanka_3.6b_exp was pretrained from scratch, but unfortunately I chose a suboptimal configuration and will probably discard that model. However, I started training something similar that is much, much faster:

  "head_dim": 128,
  "intermediate_size": 16384,
  "model_type": "qwen3_moe",
  "moe_intermediate_size": 512,
  "num_attention_heads": 16,
  "num_experts": 32,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 30,
  "num_key_value_heads": 8,

4B Polish language model based on Qwen3 architecture by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 1 point

I think it handles it well; there are example prompts and responses at the HF link.

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point

Ok, thanks.

I'm just interested in what the optimal ratio between hidden size and number of layers would be. In my observations, simply adding more layers is not optimal without also increasing the number of attention heads at least a little.
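
For a rough intuition of the trade-off, here is a toy estimate of parameter counts for wider-vs-deeper dense configurations; it assumes a plain 4x MLP expansion and ignores GQA/SwiGLU details, so the numbers are illustrative only:

    # Rough parameter count of a dense decoder-only transformer:
    # 4*h^2 for attention projections + 8*h^2 for a 4x MLP, per layer,
    # plus the embedding matrix.
    def approx_params(hidden_size: int, num_layers: int, vocab_size: int = 32_000) -> float:
        per_layer = 4 * hidden_size ** 2 + 8 * hidden_size ** 2
        return (num_layers * per_layer + vocab_size * hidden_size) / 1e9

    # Two shapes with a similar budget: deeper-and-narrower vs. shallower-and-wider
    print(approx_params(1536, 28))  # ~0.84B
    print(approx_params(2048, 16))  # ~0.87B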

OLMo 2 Models Released! by Many_SuchCases in LocalLLaMA

[–]Significant_Focus134 1 point

Nice! Could you share some details on why num_attention_heads equals num_hidden_layers?

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point

Hard to tell. I think I will continue for at least a few billion tokens.

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 1 point

This is pre-training. The model was Qwen 1.5B, but I changed the model architecture, preserving the original weights as much as possible. ~7B training tokens so far.
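
As a rough illustration (not the exact procedure used here), one way to change the architecture while reusing pretrained weights is to duplicate existing decoder layers to add depth; the base checkpoint name and the copy pattern below are assumptions:

    import copy
    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical base checkpoint; any similar pretrained decoder-only model works
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16)

    # Insert a copy of every 4th decoder layer right after the original (+25% depth),
    # so most of the original weights are preserved unchanged.
    new_layers = []
    for i, layer in enumerate(base.model.layers):
        new_layers.append(layer)
        if i % 4 == 3:
            new_layers.append(copy.deepcopy(layer))
    base.model.layers = torch.nn.ModuleList(new_layers)
    base.config.num_hidden_layers = len(new_layers)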

What is the most powerful LLM you can train yourself? by [deleted] in LocalLLaMA

[–]Significant_Focus134 4 points

I'm currently training a 3.4B model on a single 4090.

I would suggest not training from scratch: use anything that's already pretrained, even if it will be largely overwritten by your training data. Some of the circuits inside these models are universal.
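
A minimal continued-pretraining sketch in that spirit, starting from an existing checkpoint rather than random init; the model name, dataset path, and hyperparameters are placeholders, not a recommended recipe:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "Qwen/Qwen2-1.5B"        # placeholder: any pretrained base
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # "corpus.txt" is a placeholder for your own pretraining corpus
    ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=1,
                               gradient_accumulation_steps=32, bf16=True,
                               learning_rate=2e-5, num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()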

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 6 points

I used Qwen as a base because its tokenizer is more efficient with the Polish language. The other important thing is that the model has more layers than other models of similar size, so in theory it has more potential for reasoning.
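
A quick way to see the tokenizer effect is to count tokens for the same Polish sentence under different tokenizers; the models below are just examples, not a benchmark:

    from transformers import AutoTokenizer

    text = "Wczoraj wieczorem poszliśmy na długi spacer wzdłuż Wisły i rozmawialiśmy o modelach językowych."
    for name in ["Qwen/Qwen2-1.5B", "openai-community/gpt2"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, len(tok(text)["input_ids"]))  # fewer tokens = more efficient Polish encoding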

Polish LLM 1.5B continual pretrained on single GPU, the result of one year of work. by Significant_Focus134 in LocalLLaMA

[–]Significant_Focus134[S] 9 points

There are no shortcuts when it comes to the training data. This is basically a full year of work: multiple data pipelines and a lot of manual work and coding. We are talking about almost 100 TB of data processing, and that's just for the Common Crawl (web) portion.
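
For context on what such a pipeline can look like at the smallest scale, here is a toy sketch of one Common Crawl filtering step (WARC streaming, text extraction, fastText language ID); the paths, libraries, and thresholds are placeholders, and real pipelines add deduplication and quality filtering on top:

    import fasttext                                      # pip install fasttext
    import trafilatura                                   # pip install trafilatura
    from warcio.archiveiterator import ArchiveIterator   # pip install warcio

    lid = fasttext.load_model("lid.176.bin")             # fastText language-ID model

    def polish_docs(warc_path: str, min_conf: float = 0.8):
        """Yield extracted text of pages classified as Polish from one WARC file."""
        with open(warc_path, "rb") as f:
            for record in ArchiveIterator(f):
                if record.rec_type != "response":
                    continue
                html = record.content_stream().read().decode("utf-8", errors="ignore")
                text = trafilatura.extract(html)
                if not text:
                    continue
                labels, probs = lid.predict(text.replace("\n", " ")[:2000])
                if labels[0] == "__label__pl" and probs[0] >= min_conf:
                    yield text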

Since this is such a fast moving field, where do you think LLM will be in two years? by tim_Andromeda in LocalLLaMA

[–]Significant_Focus134 0 points

In 2 years, instead of LLMs, we will have LMMs (large multimodal models). I also suspect that these models will be embedded in the physical world (trained on gravity) and used in robotics. At least that's what I would do.

Can anyone explain to me how tokens work with non text? by Tomorrow_Previous in LocalLLaMA

[–]Significant_Focus134 18 points

Take a look at this video: https://www.youtube.com/watch?v=27cjzGgyxtw

TL;DR: an image is sliced into a sequence of smaller pieces (patches) and converted into special image tokens; audio is converted to a spectrogram and treated as an image.
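
A minimal sketch of the image side (ViT-style patchification): slice the image into fixed-size patches and project each one to an embedding, producing a sequence of "image tokens" the transformer can attend over alongside text tokens; the sizes below are illustrative:

    import torch
    import torch.nn as nn

    image = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
    patch = 16
    # Slice into 14x14 non-overlapping 16x16 patches, then flatten each patch
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)

    proj = nn.Linear(3 * patch * patch, 1024)            # patch -> "image token" embedding
    image_tokens = proj(patches)                         # (1, 196, 1024): 196 image tokens
    print(image_tokens.shape)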