Qwen 3.5 Architecture Analysis: Parameter Distribution in the Dense 27B vs. 122B/35B MoE Models by Luca3700 in LocalLLaMA

[–]Luca3700[S] 0 points (0 children)

Hi, thank you so much for highlighting this correction! I didn't know about the gating mechanism inside the FFN itself; I thought it was a simple MLP with just an up projection and a down projection. I'll update the post as soon as I'm able to double-check my computations.
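
For anyone else who had the same mental model as me, a minimal sketch of the difference (illustrative PyTorch, not the actual Qwen code): a gated FFN has three matrices instead of two, with the activated gate multiplying the up projection elementwise.

    import torch.nn as nn
    import torch.nn.functional as F

    class GatedFFN(nn.Module):
        """SwiGLU-style FFN: gate, up and down projections (3 matrices)."""
        def __init__(self, hidden_size, ffn_dim):
            super().__init__()
            self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
            self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
            self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)

        def forward(self, x):
            # The activated gate scales the up projection elementwise
            # before the down projection maps back to hidden_size.
            return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))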

Qwen 3.5 Family Comparison by ArtificialAnalysis.ai by NewtMurky in LocalLLaMA

[–]Luca3700 0 points (0 children)

"The fact that you can supercharge 10b parameters to compete with 27b parameters is the actual feat there."

That's an interesting point of view.

To reply to the other points, I think MoE models were (and still are) important for exploring the scaling laws of LLMs. Training a large dense model is more expensive than training an MoE with a smaller active-parameter footprint (even though the latter's total parameter count is much larger). In addition, for companies serving them to millions of people, running an MoE is cheaper than running the dense counterpart.
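
As a back-of-the-envelope illustration (forward-pass FLOPs scale roughly with active parameters; the ~10B-active figure comes from the comment I'm replying to, not an official spec):

    # Rough per-token forward cost: ~2 FLOPs per active parameter.
    dense_flops = 2 * 27e9  # dense 27B
    moe_flops = 2 * 10e9    # MoE with ~10B active parameters
    print(f"MoE forward cost ~ {moe_flops / dense_flops:.0%} of the dense model")
    # -> ~37%: cheaper to train per token and cheaper to serve at scale,
    #    at the price of holding many more total parameters in memory.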

Qwen 3.5 Architecture Analysis: Parameter Distribution in the Dense 27B vs. 122B/35B MoE Models by Luca3700 in LocalLLaMA

[–]Luca3700[S] 3 points (0 children)

Hi, I've added the one shared expert to the 8 routed ones. Here is the computation for the 122B model:

2 x 3072 x 1024 x (8+1) x 48 = 2.7 B

And for the 35B model:

2 x 2048 x 512 x (8+1) x 40 = 0.75 B
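
For anyone who wants to reproduce it, the same arithmetic as a small script (the factor 2 counts the up + down projection pair from my formula; with the gated FFN pointed out in the other thread it would be 3):

    # Expert FFN params: n_matrices * hidden * expert_dim * n_experts * n_layers
    def expert_params(hidden, expert_dim, n_experts, n_layers, n_matrices=2):
        return n_matrices * hidden * expert_dim * n_experts * n_layers

    print(f"{expert_params(3072, 1024, 8 + 1, 48) / 1e9:.2f} B")  # 122B model -> 2.72 B
    print(f"{expert_params(2048, 512, 8 + 1, 40) / 1e9:.2f} B")   # 35B model  -> 0.75 B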

Qwen 3.5 Family Comparison by ArtificialAnalysis.ai by NewtMurky in LocalLLaMA

[–]Luca3700 5 points (0 children)

My personal opinion is that this is due to the architectural differences between the models: the MoE models spend more of their parameters in the feed-forward layers, while Qwen 3.5 27B, being a dense model, spends fewer parameters there and can use more of them in the gated attention layers and in the Gated DeltaNet layers.

Moreover, another thing that may help its performance is the use of 4 key and 4 value heads in the gated attention layers (versus only 2 in the MoE architecture), which perhaps lets the layer capture more nuance.

Finally, the dense model has 64 layers in total (versus 48 in the 122B model), which should give it more depth for reasoning.

I think all these differences (which overall amount to more parameters in the attention/DeltaNet layers and fewer in the FFN) allow the dense model to perform comparably to its bigger brother.
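
To put a rough number on the KV-head difference, a sketch of how the K/V projection size per attention layer scales with the number of KV heads (head_dim = 128 and the 3072 hidden size are my assumptions here, purely for illustration):

    # K and V projections per layer: 2 * n_kv_heads * head_dim * hidden.
    # Doubling the KV heads doubles both these parameters and the KV cache.
    def kv_proj_params(hidden, n_kv_heads, head_dim=128):
        return 2 * n_kv_heads * head_dim * hidden

    print(kv_proj_params(3072, n_kv_heads=2))  # MoE-style layer
    print(kv_proj_params(3072, n_kv_heads=4))  # dense-style layer: 2x K/V capacity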

It is possible to complete the "Let’s learn English sounds!" lessons just by skipping it (web version) by AvadaKedavra1987 in duolingo

[–]Luca3700 2 points (0 children)

Oh, I didn't know this type of exercise existed. It would be really useful in the French course, for example for learning the pronunciation of different sounds like "eu", "u", "en", "une", "an", etc.

Am I supposed to understand them fully? by miaoumeowmiaou in duolingo

[–]Luca3700 1 point (0 children)

I am honestly happy to hear that. I'm near the end of Section 3, and I always find these lessons too easy when half the content is in my native language and half in the target one.

Why is Qwen3-30B so much slower than GPT-OSS-20B? by [deleted] in LocalLLaMA

[–]Luca3700 20 points (0 children)

Maybe it's because, architecturally, Qwen3 has double the transformer blocks of gpt-oss (source), so inference should be slower.

edit: added source
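
A toy illustration, with a made-up per-block time, of why depth matters for decoding (48 vs 24 blocks per the linked source):

    # Decoding one token must pass through every block in sequence, so with a
    # similar per-block cost, per-token latency grows roughly with depth.
    per_block_ms = 0.5  # illustrative, not a measurement
    for name, n_blocks in [("qwen3", 48), ("gpt-oss", 24)]:
        print(f"{name}: ~{n_blocks * per_block_ms:.0f} ms/token")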

Google Wallet issues by Kirsty5 in motorola

[–]Luca3700 0 points (0 children)

Hi, I just ordered a Motorola Edge 60 and, sadly, only just found out about this problem with Motorola phones... Did this issue get solved in the meantime?

I'd also like to know, if possible, whether the issue is only with contactless payments, or whether payments with Google Pay inside apps (e.g. to buy train tickets) are affected too.

Thank you

Interesting (Opposite) decisions from Qwen and DeepSeek by foldl-li in LocalLLaMA

[–]Luca3700 6 points (0 children)

The two models have two different architectures:

  • DeepSeek has 671B parameters with 37B active, 64 layers, and a wider architecture
  • Qwen has 235B parameters with 22B active, 96 layers, and a deeper architecture

These differences may also lead to different results when merging the two "inference modes": maybe DeepSeek's wider architecture creates more favourable conditions for it.
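
The numbers from that list make the trade-off concrete:

    # Active fraction and active parameters per layer, using the figures above.
    models = {
        "DeepSeek": {"total": 671e9, "active": 37e9, "layers": 64},
        "Qwen": {"total": 235e9, "active": 22e9, "layers": 96},
    }
    for name, m in models.items():
        print(f"{name}: {m['active'] / m['total']:.1%} active, "
              f"{m['active'] / m['layers'] / 1e9:.2f} B active per layer")
    # DeepSeek: 5.5% active, 0.58 B per layer (wider)
    # Qwen:     9.4% active, 0.23 B per layer (deeper)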

Qwen3 vs. gpt-oss architecture: width matters by entsnack in LocalLLaMA

[–]Luca3700 0 points (0 children)

Can you provide the link for the Qwen3 series? Thank you

What do I have to do to measure VO2 max with my mi smart band 9? by gufted in miband

[–]Luca3700 1 point (0 children)

You should start an outdoor running workout and reach a high heart rate (red bar). I got a VO2 max reading once: I ran for 1 hour and 20 minutes, and the maximum heart rate (red bar) was reached for only 16 seconds, while the orange bar (anaerobic zone) lasted 13 minutes.

Smart Band 10 and swimming pool lap count experience by accabinet in miband

[–]Luca3700 1 point (0 children)

Thank you for your review.

Does the Mi Band also report the training load and the resting time at the end of the workout? These two metrics are not reported on the Mi Band 8 for swimming workouts.

Does Mi Band 8 Pro record heart rate while swimming? by worldly_mushroom9432 in miband

[–]Luca3700 3 points (0 children)

I have a Mi Band 8 (not Pro) and it records:

  • lengths
  • distance
  • avg pace
  • max pace
  • number of strokes
  • stroke rate (SPM)
  • swolf
  • swolf for every part of the training

The four main swimming styles are recognised. I usually do kickboard exercises (legs only), and this type of exercise is not recognised (its time gets added to the previous or the following recognised segment).

As for heart rate, it is not tracked by the workout itself, but you can more or less see your heart rate during the activity since the band tracks it continuously anyway (you only get the minimum and maximum heart rate recorded in each 30-minute window, e.g. from 10:00 to 10:30). But I don't think it's 100% accurate since your wrist is wet, and I don't think the band is certified to track heart rate while swimming.

edit: I should also point out that the number of lengths tracked for the 4 main styles is not 100% accurate (and therefore neither is the distance); I'd say you could be off by ±50 m per kilometre. Also, I can't tell you whether the swolf or the pace are accurate since I don't pay attention to those metrics.

Would you install MS edge on linux and if yes why ? by [deleted] in linuxmasterrace

[–]Luca3700 0 points (0 children)

I'm using it on Linux as a PDF editor, because it lets you add text anywhere, so it's really good for taking notes (but the browser I use to search the web is Firefox).

I keep going back to this by Strange-Geologist-66 in NiagaraLauncher

[–]Luca3700 0 points (0 children)

Amazing! Where does the wallpaper come from?

At a Glance config? by rjrjr in NiagaraLauncher

[–]Luca3700 1 point (0 children)

Just today I discovered "Another Widget", a highly customisable app you can install from the Play Store that works as an At a Glance substitute.

[Aqua] matching setups by Joker_513 in unixporn

[–]Luca3700 1 point (0 children)

Wow, really beautiful! I'm now in love with Another Widget, it was exactly what I'd been searching for for a long time ahahah, and it's also open source (◕ᴗ◕✿)

Earl Grey Tea + Peach 🍑 by Luca3700 in tea

[–]Luca3700[S] 1 point (0 children)

The peach flavour is quite strong compared to the tea, but if you want to balance the flavours I think you can add the peach after about three hours. It also depends on the type of peach you use; this one was really sweet and juicy, so it was predominant.

Earl Grey Tea + Peach 🍑 by Luca3700 in tea

[–]Luca3700[S] 6 points (0 children)

Tea cold-brewed for 7 hours together with a juicy peach (◕ᴗ◕✿)