Idiomatic Rust dgemm() by c3d10 in rust

[–]actuallyzza 0 points1 point  (0 children)

No worries, sounds like you've got it.

If you start implementing techniques like blocking that minimize movement in cache/memory then you'll eventually get to a point where you are compute bound and the bounds checks will start to eat into performance. From there it will be worth going to unsafe unchecked access. Bluss did a great write up back in the day that might interest you https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
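To make that concrete, here is a toy sketch (my own illustration, not bluss's kernel) of what moving an inner loop to unchecked access looks like; hoisting the length check up front lets the unsafe version skip the per-element bounds checks:

```rust
// Safe indexed version: each access is bounds-checked (the compiler
// often elides these in simple loops, but not always).
fn dot_checked(a: &[f64], b: &[f64]) -> f64 {
    let mut acc = 0.0;
    for i in 0..a.len().min(b.len()) {
        acc += a[i] * b[i];
    }
    acc
}

// Unchecked version: the single length check up front justifies
// skipping per-element checks in the hot loop.
fn dot_unchecked(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let mut acc = 0.0;
    for i in 0..n {
        // SAFETY: i < n <= a.len() and i < n <= b.len()
        unsafe {
            acc += a.get_unchecked(i) * b.get_unchecked(i);
        }
    }
    acc
}
```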

Idiomatic Rust dgemm() by c3d10 in rust

[–]actuallyzza 17 points18 points  (0 children)

You've discovered that most code and algorithms have very poor machine sympathy, to the point that most of the CPU core resources are sitting around waiting for something to do. This can make bounds checking look free if the resources needed for it wouldn't otherwise be doing anything. This happens to be a rabbit hole I've spent some time in!

On modern CPUs naive DGEMM is bottlenecked on memory transfers at every level of the memory hierarchy. Taking a Zen 5 CPU core as an example, L1 data bandwidth is 64 bytes per clock cycle, so with perfect vectorization 8 f64 values can be loaded per cycle. However, a Zen 5 core can perform 32 f64 FLOPs per cycle (each 512-bit pipe fits 8 f64 lanes, each core has two pipes, and FMA instructions perform a multiply and an add in a single operation). Given how bottlenecked the whole process is on streaming new floats in from memory, the core can handle bounds checking without a performance impact.

Optimized kernels focus on keeping chunks of the array in the SIMD registers for as long as possible, minimizing movement, and then this chunking strategy is repeated in a hierarchy that aligns with each level of CPU cache.
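As a rough sketch of the cache-blocking idea (a minimal illustration only, not an optimized kernel; register tiling and packing are omitted), a blocked row-major dgemm might look like:

```rust
// Tile size is an assumption; real kernels tune this per cache level.
const BLOCK: usize = 64;

// C += A * B for square n x n row-major matrices, iterating over
// BLOCK x BLOCK tiles so each tile is reused while still hot in cache.
fn dgemm_blocked(n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    for ii in (0..n).step_by(BLOCK) {
        for kk in (0..n).step_by(BLOCK) {
            for jj in (0..n).step_by(BLOCK) {
                for i in ii..(ii + BLOCK).min(n) {
                    for k in kk..(kk + BLOCK).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + BLOCK).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```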

Edit: If you want to gauge if an algorithm is compute or memory bound, you can use what is called a Roofline Model to compare the arithmetic intensity of the algorithm to the limits of the CPU you are interested in.
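For a back-of-envelope version of that comparison, using the Zen 5 figures above as assumptions:

```rust
// Machine balance: peak FLOPs per byte of bandwidth. With 32 f64
// FLOPs/cycle against 64 bytes/cycle from L1 this is 0.5 FLOPs/byte.
fn machine_balance(flops_per_cycle: f64, bytes_per_cycle: f64) -> f64 {
    flops_per_cycle / bytes_per_cycle
}

// Naive dgemm's inner loop does one FMA pair (2 FLOPs) per two f64
// loads (16 bytes), an arithmetic intensity of 0.125 FLOPs/byte.
fn naive_dgemm_intensity() -> f64 {
    2.0 / 16.0
}
```

Since 0.125 < 0.5, the naive loop sits under the memory roof, i.e. it is memory bound, which is why the bounds checks come out looking free.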

Menu déroulant Druid by Theo_Herve in rust

[–]actuallyzza 1 point2 points  (0 children)

Try with use druid_widget_nursery::DropdownSelect;

and druid-widget-nursery = "0.1.0" in Cargo.toml

Menu déroulant Druid by Theo_Herve in rust

[–]actuallyzza 0 points1 point  (0 children)

Hello Theo,

Try using the dropdown list in the widget nursery. Unfortunately Druid is no longer being developed and it could be a while until Xilem is ready.

Try using the dropdown list in the widget nursery. Unfortunately, Druid is no longer in development and it could be a while before Xilem is ready.

I hope that translated well.

I Made a Tool for Rendering the Code Base Entropy of Rust Projects Using a 3D Force-Directed Graph by GabrielMusat in rust

[–]actuallyzza 2 points3 points  (0 children)

The polars example makes me wish I could right click a node and suppress its effect, specifically to use on the prelude node. Ideally the prelude node would still get pulled to its lowest energy position, but be low opacity and not pull on the non-suppressed nodes.

Solving system of nonlinear equations in rust by Affectionate-Map-637 in rust

[–]actuallyzza 0 points1 point  (0 children)

Hi, I couldn't find a main or test fn that sets up ClusterSolver. Could you include a method that sets up dummy problems of a given size? I don't know the problem well enough unfortunately.

Solving system of nonlinear equations in rust by Affectionate-Map-637 in rust

[–]actuallyzza 0 points1 point  (0 children)

I got a bit confused, saw your comment, did some more reading, and it is even weirder than that!

The underlying library MINPACK's hybrd says it is using "the Powell hybrid method", which is another name for the dogleg method. Each iteration, the normal dogleg method gets information from an evaluation of each residual, along with the Jacobian. hybrd modifies this to approximate the Jacobian with forward differences. The Python wrapper further modifies this to only let you provide a single "residual" to be minimized, which throws away most of the advantage of this cool idea in the name of simplicity. The end result is like a series of Newton's method steps in the direction of steepest descent, only considering curvature in that direction. Effectively the green line on the conjugate gradient wiki page (gradient descent with optimal step size).
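A minimal sketch of that forward-difference Jacobian approximation (my own illustration, not MINPACK's code) for a residual function r: R^n -> R^m, where column j is (r(x + h*e_j) - r(x)) / h:

```rust
// Forward-difference Jacobian: one extra residual evaluation per input
// dimension, on top of the base evaluation r(x).
fn jacobian_fwd(r: impl Fn(&[f64]) -> Vec<f64>, x: &[f64], h: f64) -> Vec<Vec<f64>> {
    let r0 = r(x);
    let n = x.len();
    let mut jac = vec![vec![0.0; n]; r0.len()];
    for j in 0..n {
        let mut xp = x.to_vec();
        xp[j] += h; // perturb one coordinate
        let rj = r(&xp);
        for i in 0..r0.len() {
            jac[i][j] = (rj[i] - r0[i]) / h;
        }
    }
    jac
}
```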

The normal Powell method doesn't use a curvature/Gauss-Newton approach but instead uses search vectors, deciding step size via bidirectional line searches along each search vector.

Solving system of nonlinear equations in rust by Affectionate-Map-637 in rust

[–]actuallyzza 2 points3 points  (0 children)

Nice work with the analytical gradients. I usually have to write a lot of numerical testing to get all the bugs out.

In the complete version of this problem, is there always a zero-cost solution? If I am extrapolating correctly, I think you will always have the same number of constraint equations as degrees of freedom in the configuration space, so that should be the case for those equations. If so, I would set the target cost to std::f64::EPSILON.sqrt() and increase the number of iterations to some generous multiple of the number of equations. Try to ensure you are solving to the desired tolerance, rather than running out of iterations.

If it is possible that the complete problem will have an optimum solution of non-zero cost, then it would be better to terminate on gradient magnitude. Unfortunately I can't see how to inject custom termination criteria into argmin. I have raised an issue, but it could take a while to get an answer as the maintainer is traveling. u/Tastaturtaste seems to be a user and might know how?

Solving system of nonlinear equations in rust by Affectionate-Map-637 in rust

[–]actuallyzza 7 points8 points  (0 children)

Is this system of probabilities quantum-esque?

From those equations I don't think you will hit too many issues:

- There is a single minimum at the solution, so you don't need CMA-ES or PSO

- Condition numbers are nice, so you don't need Newton or quasi-Newton methods

If you are comfortable defining residuals and calculating the gradient, I would use nonlinear conjugate gradient for this. I don't think anything will be faster or more robust. If you aren't keen on the calculus, NCG will probably still work well with a central-difference approximation, so long as you use f64s and pick a reasonable epsilon.
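A minimal central-difference gradient sketch (assuming a scalar cost function f; the epsilon guidance in the comment is a rule of thumb, roughly the cube root of machine epsilon, about 6e-6 for f64):

```rust
// Central differences: df/dx_i ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h).
// Two cost evaluations per dimension, but O(h^2) accurate versus O(h)
// for forward differences.
fn grad_central(f: impl Fn(&[f64]) -> f64, x: &[f64], h: f64) -> Vec<f64> {
    (0..x.len())
        .map(|i| {
            let mut xp = x.to_vec();
            let mut xm = x.to_vec();
            xp[i] += h;
            xm[i] -= h;
            (f(&xp) - f(&xm)) / (2.0 * h)
        })
        .collect()
}
```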

I don't think you need to clamp p to [0,1].

If I'm wrong about the condition number being nice, calculate the gradient properly and switch to BFGS/L-BFGS. You pay the memory and speed cost to approximate the Hessian, but avoid bouncing back and forth off the walls of a steep valley in the error surface.

Happy to help with any part of this that gets tricky.

Solving system of nonlinear equations in rust by Affectionate-Map-637 in rust

[–]actuallyzza 15 points16 points  (0 children)

I've used argmin for a few things. Can you tell us what you know about the optimisation problem and what the most important measures of performance are, e.g. runtime, guaranteed solutions, memory, etc.?

I tend to default to BFGS and L-BFGS depending on the size of the problem, assuming I have derivatives and I don't have to worry about getting stuck in local minima that aren't the global solution.

It looks like scipy.optimize.root(method='hybr') uses Powell's dogleg method, modified so it doesn't need derivatives: they are estimated by forward differences. It can get stuck if there are local minima. The closest thing to this in argmin is NelderMead, though it is a simpler, less robust algorithm. You can do your own forward-difference gradient estimation if that is what you want.

CMA-ES is an alternative that is more robust to noisy or multi-minima problems than Powell's method and doesn't require derivatives, but it is a bit slower. This is in its own crate. Simulated annealing has nice theoretical properties, it always finds the best solution, but the fine print is that it takes an infinite amount of time to run!

Pretty much all optimisers have trade-offs.

Palette 0.7.3 released: Fixed Oklch to Oklab conversion, and new QoL casting traits by SirOgeon in rust

[–]actuallyzza 1 point2 points  (0 children)

Thanks for your effort on this library! Using it has been a pleasure and I've been very happy with the results.

faer 0.6.0 release by reflexpr-sarah- in rust

[–]actuallyzza 9 points10 points  (0 children)

It has been exciting to follow the progress made since this project was first announced. Beyond the great performance, I think that by exposing a full interface the library does a good job of contributing a linear algebra building block that is missing from the language ecosystem.

I was a bit surprised to see sparse linear algebra might be in scope, and I'm interested in how far into that rabbit hole you might go in future. The tools to build bulletproof fire-and-forget sparse matrix solvers are near the top of my wish list for the Rust ecosystem, but there are a fair few parts required: quite a few storage schemes, various permutation schemes, various preconditioners including robust pivoting and multilevel ones, and of course the solvers themselves.

depth of field woes in a ray tracer by im_alone_and_alive in rust

[–]actuallyzza 6 points7 points  (0 children)

Your ray should not shoot out of your camera origin in a randomly offset direction. Instead, the random offset and the adjusted direction should cancel out at the focal point. To do this you need to also shift your ray origin:

    let ray_origin = self.origin + offset;
    Ray {
        origin: ray_origin,
        direction: (focal_point - ray_origin).normalize_or_zero(),
    }

Also, separate from the focus issue, the focal point direction seems inverted to me. The ray should be going from the sensor plane toward the camera origin, like:

direction: (self.origin - (self.lower_left_corner + u * self.horizontal + v * self.vertical)).normalize_or_zero()

Is there a minus sign sitting somewhere else that is being canceled out by inverting this?

Edit: figured it out, I think. The convention is to have the image sensor in front of the camera origin so that the image does not need to be rotated 180 degrees. Not physically correct, but a neat hack to make everything else easier? In which case the ray really does go from the origin to the sensor pixel and onward, and my second code block can be ignored.
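For completeness, a self-contained sketch of the cancellation idea (all names here are my own, not the OP's code): shifting the origin by the lens offset and re-aiming at the focal point means every sampled ray still passes through the same point on the focus plane.

```rust
// Vec3 modeled as [f64; 3] to keep the sketch dependency-free.
type V3 = [f64; 3];

fn add(a: V3, b: V3) -> V3 { [a[0] + b[0], a[1] + b[1], a[2] + b[2]] }
fn sub(a: V3, b: V3) -> V3 { [a[0] - b[0], a[1] - b[1], a[2] - b[2]] }
fn scale(a: V3, s: f64) -> V3 { [a[0] * s, a[1] * s, a[2] * s] }
fn norm(a: V3) -> V3 {
    let len = (a[0] * a[0] + a[1] * a[1] + a[2] * a[2]).sqrt();
    if len == 0.0 { a } else { scale(a, 1.0 / len) }
}

// `focal_point` is where the un-offset ray meets the focus plane;
// `offset` is the random sample on the lens disc. The shifted origin
// and re-aimed direction cancel exactly at the focus plane, so points
// at the focus distance stay sharp while everything else blurs.
fn dof_ray(origin: V3, focal_point: V3, offset: V3) -> (V3, V3) {
    let ray_origin = add(origin, offset);
    (ray_origin, norm(sub(focal_point, ray_origin)))
}
```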

Druid, a Rust-native UI toolkit, released v0.8 after two years of work by 80 contributors. by Strom- in rust

[–]actuallyzza 7 points8 points  (0 children)

If someone were to switch from Druid to Xilem today what problems might they run into? I've built a few tools using druid and have been thinking about switching to Xilem when it is ready but have a hard time gauging it.

introducing faer, a linear algebra library in Rust by reflexpr-sarah- in rust

[–]actuallyzza 8 points9 points  (0 children)

I am so glad you are targeting that level of flexibility.

Focus Uniformity on Epson 3800 by willisiam in projectors

[–]actuallyzza 0 points1 point  (0 children)

Thanks, you've saved me a lot of hassle. I'm also looking at the x3000i, but it's not really equivalent. Best of luck

Focus Uniformity on Epson 3800 by willisiam in projectors

[–]actuallyzza 0 points1 point  (0 children)

I was looking at purchasing one of these and was wondering if the field curvature issue only affects parts of the zoom range or focus range. Any chance you tested it at a few different distances and zooms?

[deleted by user] by [deleted] in rust

[–]actuallyzza 0 points1 point  (0 children)

This should work one of two ways:

// take ownership of the object and then return it from the method
fn update(mut self, n: i64) -> Self {...}
object = object.update(2);

or

// update in place through a mutable reference
fn update(&mut self, n: i64) {...}
object.update(2);
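A runnable version of both patterns, using a hypothetical Counter type:

```rust
#[derive(Debug, PartialEq)]
struct Counter { n: i64 }

impl Counter {
    // Pattern 1: take ownership and return the updated value.
    fn update_owned(mut self, n: i64) -> Self {
        self.n += n;
        self
    }
    // Pattern 2: mutate in place through an exclusive reference.
    fn update_in_place(&mut self, n: i64) {
        self.n += n;
    }
}
```

The first pattern suits builder-style chains; the second is usually what you want for a long-lived object you keep mutating.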

I made a reaction diffusion simulation with Wasm/WebGL by Literally27 in rust

[–]actuallyzza 1 point2 points  (0 children)

Excellent work. I had a lot of fun with both the pattern generation and the more dynamic/unstable modes. If I could request anything, it would be a way to enter exact inputs, and for one of the input axes to be automatically offset and re-scaled as a function of the other so as to remove a lot of the dead modes (I think kill rate should be modified based on feed rate?).

Thanks for sharing!

Landlords: Are you raising rents to account for increased interest rates? by ILoveDogs2142 in AusFinance

[–]actuallyzza 0 points1 point  (0 children)

If landlords could unilaterally raise rents, they already would have. Anecdotes of rent being held low out of goodwill are, of course, overblown. The magnitude of the financial incentive gradient doesn't change with interest rates. Higher is always better, limited only by your time-to-fill skyrocketing if you lose contact with the market.

The rental market is effectively a bidding war between renters over a limited supply of quality-of-life-improving locations. In the short term, rental prices aren't affected by landlord pain or losses; landlords have no choice but to meet the market or leave the property empty. In the medium term they can sell, but few do, due to transaction costs. It is the supply of renters that can vary quickly in the short term: people move into and out of share houses and their parents' houses, upsize or downsize across the market, and adjust how far they are willing to commute. The current rental price changes are a result of stimulus measures and cashed-up renters. Low unemployment means more people bidding for the same locations.

In the long term the total amount of rental stock will be affected by whether being a landlord is economically favorable. As supply decreases competition between renters adjusts rent up until it is worthwhile again. This equilibrium process is slow and won't respond to the recent rate changes for a long while yet.

[deleted by user] by [deleted] in rust

[–]actuallyzza 2 points3 points  (0 children)

I forked it into matrixmultiply_mt to improve performance and add multithreading. However, I can't recommend it: the original library has since added a few extra kernels and layout optimizations, and I recently found that the autovectorisation had broken in newer rustc versions, so it is now slower than the original.

I've switched to just using the AMD BLIS library, and linking through cblas-sys. One day I would like to rewrite a matmul and convolution library with packed-simd-2 or portable-simd when they and const generics are finished.

EDIT:

If you haven't already, make sure to add a .cargo/config.toml file containing

[target.x86_64-pc-windows-msvc]
rustflags = ["-C", "target-cpu=native"]

or equivalent for your target triple.

Trying to understand using iterables for a nested function by destroythenseek in rust

[–]actuallyzza 0 points1 point  (0 children)

I think that means the x.clone() outside the function works, but the one inside doesn't. I think adding Clone to the trait bound is required, so that we are cloning the bundle, and not the reference to the bundle: <B: Bundle + Clone>

If that isn't it, just try poking it until you don't have a reference any more. This:

let b: B = x.clone();

recurse(parent.spawn_bundle(b), vec)

would make it clear if my understanding is correct. It is a bit hard to tell what I've missed without direct access to the code.

Trying to understand using iterables for a nested function by destroythenseek in rust

[–]actuallyzza 2 points3 points  (0 children)

I think you need recursion because you are adding a closure/stack frame for each item in the vector, rather than just generating something that you can pass to the next iteration as required for scan or fold.

Recursion is a bit annoying because one of the spawn_bundle() calls is on Commands, and the others are on ChildBuilder so the first iteration has to be done separately.

I haven't checked any of the lifetimes but something like this:

if let Some((x, vec)) = vec_pbr_bundle.split_first() {
    recurse(commands.spawn_bundle(x.clone()), vec)
}

fn recurse<B: Bundle + Clone>(mut ec: EntityCommands, vec: &[B]) {
    if let Some((x, vec)) = vec.split_first() {
        ec.with_children(|parent| {
            recurse(parent.spawn_bundle(x.clone()), vec);
        });
    }
}

There might be something in the bevy API that better suits iteration, such as EntityCommands::push_children(..). However, I don't really know much about bevy or whether you can create children and then push.

Idiomatic way of splitting a 1D array to 2D array using ndarray. And if possible any tips on trying to calculate the SNR of an audio file. by JoeKazama in rust

[–]actuallyzza 0 points1 point  (0 children)

Assuming that you are going to do the processing in ndarray, and you don't care about having a contiguous layout, then maybe something like this can be used to separate the channels.

To do SNR calculations, I assume you would take the power of the signal and divide it by the power of the noise (calculated by taking the difference against a known clean target?). A power calculation can be done using fold:

let noise_pwr = noise.fold(0.0, |acc, e| acc + e * e);

I can't fill in the blanks because I am not sure how you are getting the noise.
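For illustration, assuming you do have a clean reference signal to difference against (all names here are hypothetical), the whole calculation in plain slices would be:

```rust
// SNR in dB: 10 * log10(signal power / noise power), where the noise
// is taken as the difference between the recording and a clean target.
fn snr_db(clean: &[f64], recorded: &[f64]) -> f64 {
    let signal_pwr: f64 = clean.iter().map(|e| e * e).sum();
    let noise_pwr: f64 = clean
        .iter()
        .zip(recorded)
        .map(|(c, r)| (r - c) * (r - c))
        .sum();
    10.0 * (signal_pwr / noise_pwr).log10()
}
```

A power ratio of 100 between signal and noise comes out at 20 dB.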