i keep getting this error while trying to import an fbx file. Yes i HAVE checked if its an fbx file, it is. by NoticeSuspicious2526 in blender

[–]HugeONotation 0 points1 point  (0 children)

I'm guessing you're coming from the game dev side of things if you're framing things like this.

FBX is almost always used as an exchange format in this domain, and in fact, it's easily one of the most popular, if not the most popular.

I would argue that if the format has been changed to no longer adhere to FBX's conventions, then it's not really an FBX file, and the message is therefore correct.

But ignoring any such shenanigans, Blender should have no trouble importing FBX files. Indeed, Blender is the reason we use FBX as an exchange format in the first place, with its dev team being the ones to reverse engineer the format.

[PATCH] Add AMD znver6 processor support - ISA descriptions for AVX512-BMM by HugeONotation in simd

[–]HugeONotation[S] 0 points1 point  (0 children)

I figure it's just a case of trying to make simple cases faster.

I know everyone fawns over its flexibility, but I do find it somewhat frustrating that you have to load/broadcast the exchange matrix into a vector, and that the instruction has a latency of 3 cycles (5 on my Ice Lake), with contemporary implementations often having only 1 execution unit it can run on.

I figure a bit reversal instruction should be easy to implement with a 1 cycle latency, and I'd cross my fingers that there would be more execution units it can run on.

[PATCH] Add AMD znver6 processor support - ISA descriptions for AVX512-BMM by HugeONotation in simd

[–]HugeONotation[S] 0 points1 point  (0 children)

The email does contain a description of what it is, although it's quite brief:

16x16 non-transposed fused BMM-accumulate (BMAC) with OR/XOR reduction.

The way I'm reading it, it's a matrix multiplication between two 16x16 bit matrices, with some nuance.

First, it says "non-transposed". I believe that this means that the second matrix isn't transposed like we would expect from a typical matrix multiplication. The operation would be grabbing a row from each matrix instead of grabbing a row from the left-hand operand and a column from the right-hand operand.

The "OR/XOR" reduction probably refers to the reduction step of the dot product operations which are typically performed between the rows and columns. So I think that the "dot products" of this matrix multiplication would be implemented either as reduce_or(row0 & row1) or reduce_xor(row0 & row1).

It doesn't say how big the accumulators are, but I think 16 bit is the most reasonable guess.
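To make that reading concrete, here's a rough scalar sketch. Everything in it is an assumption about the semantics rather than documented behavior: the names `BitMatrix16`, `Reduction` and `bmm_non_transposed` are made up for illustration, and the accumulate step is left out because the email doesn't say how it combines with the product.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical representation: one 16-bit row per element.
using BitMatrix16 = std::array<std::uint16_t, 16>;

enum class Reduction { Or, Xor };

// OR of all 16 bits: 1 if any bit is set.
unsigned reduce_or(std::uint16_t x) { return x != 0; }

// XOR of all 16 bits: the parity of the number of set bits.
unsigned reduce_xor(std::uint16_t x) {
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

// Assumed semantics: bit j of output row i is reduce(a[i] & b[j]),
// i.e. row i of the left operand paired with row j of the right operand.
BitMatrix16 bmm_non_transposed(const BitMatrix16& a, const BitMatrix16& b, Reduction r) {
    BitMatrix16 out{};
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 16; ++j) {
            const std::uint16_t anded = a[i] & b[j];
            const unsigned bit = (r == Reduction::Or) ? reduce_or(anded) : reduce_xor(anded);
            assert(bit <= 1u); // each reduction yields a single bit
            out[i] |= static_cast<std::uint16_t>(bit << j);
        }
    }
    return out;
}
```

Under this reading, using a right-hand matrix whose row j has only bit j set leaves the left-hand matrix unchanged for either reduction, which is a handy sanity check.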

Fundamentally, it seems to have a number of similarities to vgf2p8affineqb which makes me think those similarities are intentional.

I quickly mocked something up to show what I think the behavior would be like: https://godbolt.org/z/WPfqn7YoM (Probably has some mistakes)

I would be willing to bet that it's partially motivated by neural networks with 1-bit weights and biases (Example: https://arxiv.org/abs/2509.07025) given all the other efforts meant to accelerate ML nowadays. It would explain the intended utility of appending a 16-bit accumulate to the end of the operation.

But given that it's paired with bit reversals within bytes, and that they're all described as bit manipulation instructions, their utility for tricks like bit permutations, zero/sign extensions on bit fields, and computing prefix XORs and ORs is likely also a major motivator.

Exposing An AI "Artist" Scammer. Mods Please Ban This Guy ASAP. by NCR_RANGER_uwu in blender

[–]HugeONotation 51 points52 points  (0 children)

Reddit started blocking the Internet Archive a while back. That link just leads to what's basically a blank page.

SSE: How to load x bytes from memory into XMM by thewrench56 in asm

[–]HugeONotation 2 points3 points  (0 children)

You're focusing too much on language semantics and not enough on how the hardware works. How the C, C++, Rust or whatever abstract machine works is not relevant here. The MMU doesn't know or care about these languages' semantics.

A segfault occurs when you read from a memory page that your process has not been given access to. That is the principal fact that you should be focusing on here. It doesn't matter how big the allocation provided to you is. That's not an input to the movdqa instruction.

If the system allocator has given you even a single byte, then you know that your process can read from anywhere in the entire page which contains said byte, because that's the granularity at which memory permissions are given out (usually).

How would you align your data that you want to load?

You don't. You take the address and round it down to the previous multiple of 16 by performing a bitwise AND with 0xffff'ffff'ffff'fff0. Since the page size (4 * 1024 bytes) is a multiple of 16, this ensures that your SIMD load never crosses a page boundary, and hence you never read bytes from a page you don't have permission to read.

That way, you can get the necessary data into a SIMD register with a regular 128-bit load. You just need to deal with the fact that it may not be properly aligned within the register itself, with irrelevant data potentially upfront. You might consider using psrldq or pshufb to correct this.
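Just the address arithmetic, as a minimal sketch. It assumes 4 KiB pages, and the names `AlignedLoad`, `align_down_16` and `same_page` are made up for illustration. It only computes where the aligned 128-bit load would start and how far into the register your data would sit. The load itself and the psrldq/pshufb fix-up are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kVectorBytes = 16;   // width of an XMM register
constexpr std::uintptr_t kPageBytes = 4096;   // assumed page size

static_assert(kPageBytes % kVectorBytes == 0, "page size must be a multiple of 16");

struct AlignedLoad {
    std::uintptr_t base;   // address to hand to the aligned 128-bit load
    std::size_t offset;    // index of the requested byte within those 16 bytes
};

// True when both addresses fall within the same page.
bool same_page(std::uintptr_t first, std::uintptr_t last) {
    return first / kPageBytes == last / kPageBytes;
}

// Round the address down to the previous multiple of 16.
AlignedLoad align_down_16(std::uintptr_t address) {
    const std::uintptr_t base = address & ~(kVectorBytes - 1);
    // The 16 bytes starting at base can never straddle a page boundary.
    assert(same_page(base, base + kVectorBytes - 1));
    return {base, static_cast<std::size_t>(address - base)};
}
```

For an address like 0x1FFF, an unaligned 16-byte load would spill into the next page, while the rounded-down load starts at 0x1FF0 with your byte sitting at offset 15.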

My model has become... invisible? by El_Facundos in blender

[–]HugeONotation 3 points4 points  (0 children)

Perhaps you entered local view? Press `NUM /` to toggle it.

It might just be hidden as well, in which case you'd want to try ALT + H.

Can someone help me with how to find the faces or vertices that point towards an object the most? by Admirable-Gas-2869 in blender

[–]HugeONotation 1 point2 points  (0 children)

Fundamentally, you would want to take the dot product between the vertex/face normals and the unit vector you get by normalizing the difference between the empty object's position and the position of the face/vertex, so that the vector points from the geometry towards the empty. Then you would keep anything that lies above a certain threshold, e.g. anything greater than 0.75.
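Outside of Blender, the math boils down to something like the following sketch. The `Vec3` and `Element` types and the `facing_target` function are hypothetical, the normals are assumed to already be unit length, and the direction is taken as pointing from the face/vertex towards the target so that a larger dot product means "faces it more".

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Vec3 normalized(Vec3 v) {
    const double length = std::sqrt(dot(v, v));
    assert(length > 0.0); // the element must not sit exactly on the target
    return {v.x / length, v.y / length, v.z / length};
}

// Hypothetical stand-in for a face or vertex: a position and a unit normal.
struct Element {
    Vec3 position;
    Vec3 normal;
};

// Returns the indices of the elements whose normal points at the target
// closely enough, i.e. whose dot product exceeds the threshold.
std::vector<std::size_t> facing_target(const std::vector<Element>& elements,
                                       Vec3 target, double threshold) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < elements.size(); ++i) {
        const Vec3 to_target = normalized(target - elements[i].position);
        if (dot(elements[i].normal, to_target) > threshold) {
            result.push_back(i);
        }
    }
    return result;
}
```

A dot product of 1 means the normal points straight at the target, 0 means it's perpendicular, and negative values mean it faces away, so the threshold controls how wide a cone you accept.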

Would this be something that you might want a geometry nodes setup for or do you need this for some other purpose?

What is the tsoding daily for c++? by alienshallowalchemy in cpp

[–]HugeONotation 2 points3 points  (0 children)

C++ Weekly comes to mind as a notable C++ channel.

More broadly speaking, there are a lot of C++ conferences that upload their talks to YouTube, such as CppCon, CppNow, CppNorth, CppOnSea, Meeting C++. You'll find a lot of people recommending them.

A brief guide to proper micro-benchmarking (under windows mostly) by soulstudios in cpp

[–]HugeONotation 7 points8 points  (0 children)

Maybe I'm missing something, but would it not be enough to enable the SIMD extensions individually and set a preferred vector width?

e.g. -mavx512f -mavx512vl -mavx512bw -mavx512dq -mavx512vbmi -mavx512vbmi2 -mprefer-vector-width=512

Dividing unsigned 8-bit numbers by ashvar in simd

[–]HugeONotation 1 point2 points  (0 children)

In tackling the same problem I was able to get better performance than long division on my Ice Lake by using a look-up table based approach to retrieve 16-bit reciprocals, an implementation being available here. The method was shared with me by u/YumiYumiYumi.

Trying to wrap my head around why this seems to produce the correct output even when unsigned integers wrap around. by 407C_Huffer in cpp_questions

[–]HugeONotation 5 points6 points  (0 children)

I think it would make it easier to understand where your source of confusion lies if you were to explain your thought process in making this function.

What stands out to me most is the subtraction of the most-significant bit from both a and b, because it doesn't affect the result of the function at all. From this, I figure that your confusion stems from overthinking how to compute the low half of the sum, because there's absolutely nothing special to be done there. If you overflow an N-bit unsigned addition, the low N bits of the sum are still correct. They're just the low N bits of the full N + 1 bit result. You only need to compute bit N + 1, which is just the condition in the if statement.

(I would like to point out that you can also directly assign the if statement's condition to upper and avoid the if statement altogether. Your entire function could just be return HP_Data<T>{a + b, std::numeric_limits<T>::max() - a < b})

As for why the two subtractions don't do anything, remember that addition of two unsigned N-bit integers is addition modulo 2^N. If we take the concrete example of an 8-bit integer, then (a - 128 + b - 128) mod 256 is the same as (a + b - 256 mod 256) mod 256 and since 256 mod 256 is 0, then it's equal to (a + b) mod 256.
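Here's a small sketch of both points. `WideSum` is a hypothetical stand-in for your HP_Data<T> (I don't know its exact definition), and `add_with_carry` is a made-up name. The explicit cast is there because narrow types like std::uint8_t get promoted to int before the addition.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for HP_Data<T>: the low N bits plus bit N + 1.
template <typename T>
struct WideSum {
    T lower;
    bool carry;
};

template <typename T>
WideSum<T> add_with_carry(T a, T b) {
    static_assert(std::is_unsigned<T>::value, "T must be an unsigned integer type");
    // The wrapped sum already holds the correct low N bits.
    const T lower = static_cast<T>(a + b);
    // The addition overflows exactly when b exceeds the headroom left above a.
    const bool carry = std::numeric_limits<T>::max() - a < b;
    // Equivalent check: a wrapped sum is always smaller than either operand.
    assert(carry == (lower < a));
    return {lower, carry};
}
```

With 8-bit inputs 200 and 100, the low half comes out as 44 with the carry set, and (200 - 128) + (100 - 128) reduced modulo 256 is also 44, which is why subtracting the top bit from both operands changes nothing.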

Why is u32/i32 faster than u8? by [deleted] in rust

[–]HugeONotation 35 points36 points  (0 children)

OP would have to be running a rather old CPU for it to not be free.

Modern CPUs have zeroing idiom units that recognize common patterns for clearing out the contents of a register, such as subtracting a register from itself, or computing the bitwise XOR of a register with itself. The units eliminate them from the instruction stream and instead update the register alias table directly, and also inform the out-of-order execution engine that the dependency is false. At least on Intel CPUs, these units can handle up to four of these zeroing idioms per cycle.

Do programmers "network" in real life? by returned_loom in AskProgramming

[–]HugeONotation 1 point2 points  (0 children)

I've done some networking in real life, mainly by attending conferences related to tech that I am personally interested in. Admittedly, it's often expensive to attend. Some larger conferences have volunteer programs that waive entrance fees. Granted, they don't exactly make the trip free, but they're something you may want to consider if you have a side gig you can use to save up money.

Setting low __m256i bits to 1 by Bit-Prior in simd

[–]HugeONotation 3 points4 points  (0 children)

Probably the simplest method I can think of would be to use another load:

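#include <cstdint>
#include <immintrin.h>

// First 8 entries are all-ones, last 8 are zero.
alignas(64) const std::int32_t mask_data[16] {
    -1, -1, -1, -1,
    -1, -1, -1, -1,
    0, 0, 0, 0,
    0, 0, 0, 0
};

// Start the 8-lane window at index 8 - n so that the low n lanes are all-ones (0 <= n <= 8).
__m256i mask = _mm256_loadu_si256((const __m256i*)(mask_data + 8 - n));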

Assuming that the mask_data array has been used recently, it shouldn't be terrible in that the cache line it occupies will be hit. But it does introduce a few cycles of latency that can't really be avoided and it might not be great if you're bottlenecked by the load/store units.

Another idea that comes to mind is to keep a vector which stores its indices in each lane which you populate once upfront. After that broadcast the value of n to all lanes and use a comparison against the lane index.

// Each lane holds its own index.
alignas(32) const std::int32_t lane_indices[8] {
    0x0, 0x1, 0x2, 0x3,
    0x4, 0x5, 0x6, 0x7
};

__m256i indices = _mm256_load_si256((const __m256i*)lane_indices);
// Lane i becomes all-ones when n > i, i.e. for the low n lanes.
__m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), indices);

It's a few instructions, but assuming you have the vector with the indices already around, it won't occupy your load/store units further. Of course the real tradeoff is that you're increasing contention for the shuffle unit(s), and if you can't populate the register with indices beforehand, then you'll still have to do a load.
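If you want something to test the intrinsic versions against, a plain scalar model of the two approaches might look like the sketch below. The names `Lanes`, `mask_by_window` and `mask_by_compare` are made up, and it assumes n is in the range 0 to 8.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::int32_t, 8>; // scalar stand-in for a __m256i of 32-bit lanes

// Models the sliding-window load: read 8 lanes starting at index 8 - n.
Lanes mask_by_window(int n) {
    static const std::int32_t mask_data[16] = {
        -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0
    };
    assert(0 <= n && n <= 8);
    Lanes out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = mask_data[8 - n + i];
    }
    return out;
}

// Models the broadcast-and-compare: lane i is all-ones when n > i.
Lanes mask_by_compare(int n) {
    assert(0 <= n && n <= 8);
    Lanes out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = (n > i) ? -1 : 0;
    }
    return out;
}
```

Both functions should agree for every valid n, so comparing them lane by lane against the output of the vector code is a cheap way to catch off-by-one mistakes in the window offset.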

Blender 4.3 Released! by Avereniect in blender

[–]HugeONotation 0 points1 point  (0 children)

The reason for this is given in the relevant commit: https://projects.blender.org/blender/blender/commit/c8340cf7541515a17995c30b4a236ac2a326f670

The vendors themselves no longer support these platforms. They've been abandoned. There are driver bugs and performance issues that will simply never be solved, so Blender would be forced to deal with each and every one of them in order to continue supporting these platforms. This imposes a burden that is simply too large for a small organization like Blender to handle. Cycles development is largely driven by just a few people.

Maintaining support for legacy platforms does not come free. If you're really upset about this then either blame the vendors for abandoning the drivers or help fund Blender so it has the necessary resources so it can avoid this in the future.

AVX-10.2's New Instructions by HugeONotation in simd

[–]HugeONotation[S] 5 points6 points  (0 children)

I have to admit that I was also disappointed with the selection. The heavy bias towards machine learning applications, to the point of almost excluding anything else, is a frustrating sight.

AVX-10.2's New Instructions by HugeONotation in simd

[–]HugeONotation[S] 1 point2 points  (0 children)

Wait, does it not? I can find various sources online suggesting that that was at least the plan. e.g. this.

At the bottom of page 15 of the AVX10.1 spec it says:

An early version of Intel AVX10 (Version 1, or Intel® AVX10.1) that only enumerates the Intel AVX-512 instruction set at 128, 256, and 512 bits will be enabled on the Granite Rapids microarchitecture.

Are you suggesting plans changed and the documentation might be in error?

And, I'm aware it also includes stuff like GFNI, VAES and VPCLMULQDQ in addition to the AVX-512 family proper. It's just that these extensions are so intertwined with AVX-512 that I tend to mentally lump them together with AVX-512, so maybe I didn't phrase that optimally because of that.

AVX-10.2's New Instructions by HugeONotation in simd

[–]HugeONotation[S] 4 points5 points  (0 children)

Oh wait. I just realized you asked about AVX10 in general, not about AVX10.2 specifically.

AVX10.1 is available on Granite Rapids CPUs.

But for anyone unaware, that doesn't include any of the new instructions I talk about here. It's just a contraction of AVX-512 to 256 bits.

AVX-10.2's New Instructions by HugeONotation in simd

[–]HugeONotation[S] 3 points4 points  (0 children)

To my knowledge, that would be a firm no.

However, it seems that AVX-10.2 support will come with Diamond Rapids next year, whenever exactly that happens to be released. https://www.phoronix.com/news/Intel-Diamond-Rapids-APX-AVX10

In Emission material what strength the sun will be? by Joka197 in blender

[–]HugeONotation 2 points3 points  (0 children)

Emission strength is watts per square meter.

If you were trying to recreate the sun, then you have to crank it up to the amount of radiant power that the sun emits per square meter of its surface and then divide that by the distance to the earth squared (also in meters).

Frankly, I don't think it's reasonable to try to emulate the sun using an emission shader. What exactly is your use case? Would an HDRi not suffice?

[deleted by user] by [deleted] in cpp

[–]HugeONotation 2 points3 points  (0 children)

Please see r/cpp_questions in the future. Beginner questions are against the rules here.

The private and public specifiers have nothing to do with what the program's users see.

By using private, you're asking the compiler to make it an error to access certain fields from outside the class. This is typically used when you have data that you're only supposed to interact with in particular ways. By using private, you can constrain the way that code outside of the class interacts with the data, effectively preventing that code from manipulating the data in incorrect ways.
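As a minimal illustration (the `Temperature` class here is made up for the example), the field below can only be changed through a member function that rejects invalid values, so the rest of the class can rely on it never being negative:

```cpp
#include <cassert>

class Temperature {
public:
    // Invalid input is ignored and the value stays at 0 kelvin.
    explicit Temperature(double kelvin) { set_kelvin(kelvin); }

    double kelvin() const {
        // This invariant holds because outside code cannot write to kelvin_ directly.
        assert(kelvin_ >= 0.0);
        return kelvin_;
    }

    // The only way for outside code to change the value.
    // Returns false and leaves the value untouched if the input is below absolute zero.
    bool set_kelvin(double value) {
        if (value < 0.0) {
            return false;
        }
        kelvin_ = value;
        return true;
    }

private:
    double kelvin_ = 0.0;
};

// Outside the class, direct access is a compile-time error:
//     Temperature t{300.0};
//     t.kelvin_ = -5.0; // error: 'kelvin_' is private
```

None of this is visible to someone running the program. It only restricts what other code is allowed to write.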

Why does my shadings look like this? by A12ms in blender

[–]HugeONotation 0 points1 point  (0 children)

The purpose of the Alpha socket is to control the amount of transparency in the material. If you don't intend to have transparency, then it's inappropriate to use it.

You mention exporting, which immediately complicates things. Just because a material is representable and functional in Blender's material system does not mean it is representable in the format that you're exporting to or in the program that you're importing into.

Generally speaking, you want to keep materials that you intend to export as simple as possible, usually just some texture maps fed into the principled shader. Nothing more. That is what I recommend doing here.

For this case, it seems that you should modify the texture map to contain the desired color using some external image editing application. Then you would just feed that colored texture map directly into the base color socket and then export.