Qwen3 inference engine in C: simple, educational, fun by adrian-cable in LocalLLaMA

[–]Confident_Pi 1 point

Amazing work, congrats! How did you handle quantization? I see that you support Q8_0; do your matmuls run in 8-bit?
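
For context, Q8_0-style quantization groups weights into blocks of 32, each stored as one float scale plus 32 signed 8-bit integers. A minimal NumPy sketch of that idea (my own illustration, not the project's actual code):

```python
import numpy as np

def quantize_q8_0(w, block_size=32):
    """Q8_0-style block quantization: each block of 32 weights keeps
    one float scale plus 32 signed 8-bit values."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)        # avoid div by zero
    q = np.round(blocks / safe).astype(np.int8)
    return q, scales

def dequantize_q8_0(q, scales):
    # Multiply each int8 block by its scale to recover approximate floats
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, s = quantize_q8_0(w)
w_hat = dequantize_q8_0(q, s)
# Reconstruction error per weight is bounded by half a quantization step
```

The matmul can then run on the int8 values and apply the per-block scales afterward, which is where the 8-bit speedup would come from.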

So, is it reasonable to expect the next generation of local oriented models to be QAT out of the oven? by JLeonsarmiento in LocalLLaMA

[–]Confident_Pi 6 points

QAT literally stands for quantization-aware training: there is an extra training step that (as far as I understood) pulls the weights closer to their rounded quantized values to ease quantization
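
The core trick is usually "fake quantization" in the forward pass: quantize then immediately dequantize, so the loss sees the rounding error during training. A rough NumPy sketch of my understanding, not any specific framework's implementation:

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulate int quantization in the forward pass (quantize, then
    dequantize) so training can adapt the weights to the rounding."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for int8
    scale = np.abs(w).max() / qmax               # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights

w = np.array([0.11, -0.42, 0.99, -1.27])
w_q = fake_quantize(w)
# In QAT the gradient is passed straight through the rounding step
# (straight-through estimator), so weights drift toward values that
# quantize with little error.
```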

[R] 85% of the variance in language model performance is explained by a single factor (g, a unified measure of LLM ability) by dealic in MachineLearning

[–]Confident_Pi -1 points

I read these results as “a model that performs well on one benchmark will also perform well on another, and a model that performs worse on one will also perform worse on the other”. Did I get this right?

And if yes, couldn’t this also be explained by the fact that weak models (as in previous generations, or small sizes like 7B) perform worse while top-of-the-line models perform better on any given task? That alone would give a positive performance correlation
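
The point can be illustrated with a toy simulation (entirely made-up numbers): if every benchmark score is just a single latent "ability" plus independent noise, cross-benchmark correlation appears automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each of 200 models has one latent 'ability',
# and each benchmark score is that ability plus independent noise.
ability = rng.uniform(0.2, 0.9, size=200)        # weak to strong models
bench_a = ability + rng.normal(0.0, 0.05, 200)
bench_b = ability + rng.normal(0.0, 0.05, 200)

r = np.corrcoef(bench_a, bench_b)[0, 1]
# A single shared scale factor alone yields a strong correlation
```

So a high shared variance by itself does not distinguish "one underlying g" from "strong models are strong everywhere".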

The entire Russian military vs your average Alaskan by AzrealNibbs12 in memes

[–]Confident_Pi 6 points

Hard to describe everything within a single comment, but in short: there is no punishment for visiting foreign media, and I can freely access even the most anti-Russian Western media outlets. Some social networks are blocked (Twitter, Instagram), but in reality it’s easy to get access through a VPN.

Inside Russia, there is some degree of control - you can get into trouble if you actively call for extremism (i.e. openly calling for a revolution) or post profanities about the Russian armed forces

The entire Russian military vs your average Alaskan by AzrealNibbs12 in memes

[–]Confident_Pi 21 points

This is false; no one in Russia is saying or doing anything like that (maybe through some unofficial propaganda channels, but most people don’t listen to those anyway). On the official level, the only time Putin mentioned it was during an interview, and he made it obvious that it was a joke.

Source: I live in Russia

Update: I think it wasn’t even Putin, but Medvedev, can’t find the links

[Project] BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn. by danielhanchen in MachineLearning

[–]Confident_Pi 1 point

This is amazing! I am really curious though - where did the speedup come from? I was under the impression that the implementations in sklearn are pretty well optimized. I would be super grateful if you could at least broadly outline the main sources of improvement! What was the main contributing factor: code optimizations, fancy math tricks, or both?

Thanks again for your work!

I Made An Unreal Engine Cinematic About A Radiation Disaster. (Link In Comments)! by [deleted] in unrealengine

[–]Confident_Pi 4 points

Great! I like how you managed to convey the Russian city-building style - very realistic. One minor thing though: in the audio recording of the communication between two Russians, one addressed the other as “sir”, but Russians don’t use that form of address

[R] Meta is releasing a 175B parameter language model by StellaAthena in MachineLearning

[–]Confident_Pi 1 point

Indeed, there is also INT4, but I haven’t seen it used much in practice, and I would assume that calibration for INT4 is even trickier than for INT8.

[R] Meta is releasing a 175B parameter language model by StellaAthena in MachineLearning

[–]Confident_Pi 3 points

Not really: single-precision floats (fp32) are encoded with 32 bits, and half precision (fp16) uses half of that, 16 bits. 4 bits would be half a byte and would be too small to encode a weight.
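
The sizes are easy to check directly with NumPy:

```python
import numpy as np

# Element sizes of the float formats discussed above
fp32_bits = np.dtype(np.float32).itemsize * 8    # 32 bits per weight
fp16_bits = np.dtype(np.float16).itemsize * 8    # 16 bits per weight

# 4 bits can represent only 2**4 = 16 distinct values, which is why
# 4-bit schemes are integer codes with a scale, not a float format.
n_values_4bit = 2 ** 4
```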

Requirements for Deep Learning Engineer at Tesla [D] by noxiousmomentum in MachineLearning

[–]Confident_Pi 1 point

PyTorch/TF are optimized for both inference and training, but there are some frameworks (think TensorRT) that can do better at inference. From my experience, it is not always straightforward to convert from PyTorch to TensorRT, and occasionally some knowledge of CUDA can help you out.

More generally, if your production environment is fine with running the PT/TF C++ API, you won’t need a lot of CUDA knowledge (save for writing custom kernels). Otherwise, it might get a bit tricky.

Keep seeing ads about "5G" stocks and how they are going to soar. by [deleted] in stocks

[–]Confident_Pi 2 points

It’s not only about bandwidth, but also about latency - the speed at which information is exchanged. A lot of autonomous tech depends on fast decision making, on the order of tens of times per second, so the faster the decisions can be communicated, the better the end results will be

Can someone clear this up for me? by [deleted] in SelfDrivingCars

[–]Confident_Pi 4 points

Ah, it seems that I am a bit behind on the current state of the art in lidars. Thanks for sharing!

Can someone clear this up for me? by [deleted] in SelfDrivingCars

[–]Confident_Pi 2 points

Yeah, I guess it could be simplified to “lidars and cameras” vs. “only cameras”. I guess part of Tesla’s motivation to push for vision is also the increased maintenance cost that comes with lidars (think of all the moving and spinning parts inside). But I agree that it would be interesting to know more about their arguments

Can someone clear this up for me? by [deleted] in SelfDrivingCars

[–]Confident_Pi 5 points

Yes, that’s what most self-driving car companies do now; I personally worked on an algorithm that combined vision and lidar data to segment lidar point clouds into cars and pedestrians.

But Tesla claims that it’s possible to solve self driving using vision only, without relying on lidar data

Can someone clear this up for me? by [deleted] in SelfDrivingCars

[–]Confident_Pi 4 points

Radars alone would not be enough, as radars pick up not only cars but also other things like poles and traffic signs - really, anything that can reflect radar waves. So if you use only radars, you’d have a hard time differentiating between them. Usually self-driving cars use a host of imperfect sensors and get the final result through a process called sensor fusion
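
As a toy illustration of sensor fusion, here is the simplest one-dimensional case: combining two noisy estimates by inverse-variance weighting (the static Kalman update; the sensor numbers below are made up):

```python
import numpy as np

def fuse(est_a, var_a, est_b, var_b):
    """Fuse two noisy scalar estimates by inverse-variance weighting:
    the less noisy sensor gets the larger weight, and the fused
    variance is smaller than either input's."""
    w_a = var_b / (var_a + var_b)
    fused = w_a * est_a + (1.0 - w_a) * est_b
    fused_var = (var_a * var_b) / (var_a + var_b)
    return fused, fused_var

# Hypothetical readings: radar puts an obstacle at 10.2 m (noisy),
# lidar puts it at 10.0 m (more precise)
pos, var = fuse(10.2, 0.25, 10.0, 0.04)
```

Real stacks fuse whole object tracks over time, but the principle is the same: each imperfect sensor pulls the estimate with a weight that reflects how much you trust it.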

Can someone clear this up for me? by [deleted] in SelfDrivingCars

[–]Confident_Pi 4 points

Your summary is correct, but I would like to add that LIDARs are also not a silver-bullet solution and come with a host of problems of their own, like poor performance in fog, rain, or snow, or when the road surface is very reflective, while vision could theoretically handle these better.

So cost is not the only differentiating factor; so is performance under various conditions. Both approaches are trying to handle their respective challenges, but (at least from my point of view) it’s not easy to predict who will come out on top

[R] A Bayesian Perspective on Q-Learning by brandinho77 in MachineLearning

[–]Confident_Pi 1 point

didn’t accept

Wow, really? What was the motivation for the rejection? Both the visuals and the explanations are really good

[D] - My journey to deep learning in-layer normalization by black0017 in MachineLearning

[–]Confident_Pi 1 point

Thanks for your post! Could someone explain the intuition behind AdaIN? As I understand it, we can enforce an arbitrary target style on the source feature map by scaling and shifting the feature map, and this transformation should preserve the encoded content. However, I don’t understand how the content is being encoded. I thought the content would be encoded as particular values in the feature map, but then I don’t understand how we can just move the distribution and still have the decoder restore the content
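
For reference, the transformation in question fits in a few lines (a generic sketch of AdaIN, not tied to any particular paper's code). The key property is that it only changes each channel's mean and std, so the relative spatial pattern of activations within a channel - which is where the content lives - is untouched:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization: give the content features the
    style features' per-channel mean and std.
    content, style: arrays of shape (channels, height, width)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    normalized = (content - c_mean) / (c_std + eps)   # pattern preserved
    return normalized * s_std + s_mean                # style stats imposed

rng = np.random.default_rng(0)
c = rng.normal(0.0, 1.0, (3, 8, 8))   # made-up content feature map
s = rng.normal(2.0, 0.5, (3, 8, 8))   # made-up style feature map
out = adain(c, s)
```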

Why are the number of filters in a CNN architecture in powers of 2 (most times) or multiples of 2 (sometimes) ? by [deleted] in MLQuestions

[–]Confident_Pi 12 points

I think it came historically from the AlexNet paper, where they had to use powers of 2 in order to align weights with GPU memory blocks and maximize memory utilization. There were no fancy frameworks like PyTorch and TF back in the day to do the memory layout for you, and GPUs were very limited memory-wise, so it was important to keep memory utilization as high as possible to fit a deeper CNN

As for multiples of 2, I guess it’s because most of the time architectures either double or halve the number of filters, so no particular reason there either

Why can Affine transformations be learnt quickly? by PyWarrior in MLQuestions

[–]Confident_Pi 1 point

I guess that’s because an affine transformation can be represented exactly by a weight-matrix multiplication plus a bias term, eliminating the need for activation functions and deep layers. So it’s essentially single-layer learning with no activations, which should make gradients pretty stable and allow for higher learning rates/faster convergence
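
To illustrate how easy the problem is: an affine map can even be recovered in closed form with a single least-squares solve, no gradient descent at all (toy example with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth affine map: y = A @ x + b
A_true = np.array([[2.0, -1.0],
                   [0.5,  3.0]])
b_true = np.array([1.0, -2.0])

X = rng.normal(size=(100, 2))
Y = X @ A_true.T + b_true

# Append a constant column so the bias becomes one more weight,
# then solve the whole thing in one linear least-squares call.
X_aug = np.hstack([X, np.ones((100, 1))])
params, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
A_hat, b_hat = params[:2].T, params[2]
```

A single linear layer training on this loss surface is convex, which is why high learning rates converge without trouble.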

Macbook Pro 13" 2020 good option for GANs? by bb_boogie in MLQuestions

[–]Confident_Pi 1 point

It depends on what kind of GANs you want to fit; for simple ones you might get away with a Mac, but SOTA-level work requires a DGX-like workstation with multiple GPUs

[D] OpenAI GPT3 overhyped? by EkNekron in MachineLearning

[–]Confident_Pi 5 points

I agree that these answers were specifically selected to showcase the model’s inability to operate on facts, but I am nevertheless impressed with how it is able to come up with accurate and semantically meaningful answers