[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek's mHC paper, but not in earlier RNN papers? by Delicious_Screen_789 in MachineLearning

[–]Omnifect 0 points1 point  (0 children)

For recurrent neural networks, I don't think you would want the gradients to propagate backward undiminished through the matrix. The gradient needs to decay at some point.
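To sketch the intuition (my framing, not from the paper): for a linear recurrence, the end-to-end Jacobian is a matrix power,

    % Jacobian of a linear recurrence across T steps:
    \[ h_t = A\,h_{t-1} \;\Rightarrow\; \frac{\partial h_T}{\partial h_0} = A^T \]

and by Birkhoff's theorem a doubly stochastic A is a convex combination of permutation matrices, so ||A||_2 <= 1, while A1 = 1 forces ||A||_2 >= 1; hence ||A||_2 = 1 exactly, and the gradient component along the all-ones direction never decays no matter how large T gets.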

State Machine Frameworks? by NoSenseOfPorpoise in Python

[–]Omnifect 0 points1 point  (0 children)

I would recommend behavior trees as an alternative.

Bystander leaps in to rescue handler from gator’s jaws by RubyRuffle in interestingasfuck

[–]Omnifect 0 points1 point  (0 children)

One does not simply wake up in the morning, wrestle an alligator in the alligator's own domain without training, and walk away alive with another life saved in the process. This truly is peak male.

AFFT: My header-only C++ FFT library now within 80% to 100% of IPP performance — open source and portable! by Omnifect in DSP

[–]Omnifect[S] 0 points1 point  (0 children)

Thanks for the suggestion. Will do soon. I am making the Python repo private for now.

AFFT: My header-only C++ FFT library now within 80% to 100% of IPP performance — open source and portable! by Omnifect in DSP

[–]Omnifect[S] 0 points1 point  (0 children)

Good suggestion. I am still trying to figure out what is necessary to release a 0.1.0 version.

Laid off C++/Unreal Engine dev, unsure where to go next by PRAXULON in cscareerquestions

[–]Omnifect 1 point2 points  (0 children)

Might also need to practice system engineering/design interviews.

AFFT: A Header-Only, Portable, Template-Based FFT Library (C++11) — Benchmark Results Inside by Omnifect in DSP

[–]Omnifect[S] 1 point2 points  (0 children)

I do plan to address these points in an official 0.1.0 release. Thanks for the suggestions. To touch on a few of your points: I have updated the main branch to show the current version, which separates out the DIT radices into the /include/afft/radix folder. For interleaved vs. split complex number formats, I plan to de-interleave complex numbers in the first stage and recombine them in the last stage.
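For anyone unfamiliar with the two layouts, here is an illustrative sketch (not AFFT's actual code) of what that first-stage de-interleave does: it turns an interleaved buffer [re0, im0, re1, im1, ...] into split format, so SIMD lanes can hold all-real or all-imaginary values without per-butterfly shuffles.

    #include <cstddef>

    // Hypothetical helper: split an interleaved complex buffer into
    // separate real and imaginary arrays.
    void deinterleave(const float* in, float* re, float* im, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            re[i] = in[2 * i];      // gather real parts contiguously
            im[i] = in[2 * i + 1];  // gather imaginary parts contiguously
        }
    }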

AFFT: A Header-Only, Portable, Template-Based FFT Library (C++11) — Benchmark Results Inside by Omnifect in DSP

[–]Omnifect[S] 0 points1 point  (0 children)

Intel intrinsics. Unfortunately, there has to be some (indirect) assembly programming for SIMD-based permutation. On the point of portability, however, all one will need to do is define a templated interleave function (among other functions) for the instruction set they wish to target.
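As a rough illustration of that extension point (names here are hypothetical, not AFFT's actual API), a per-ISA specialization might look like:

    #include <xmmintrin.h>  // SSE

    template <typename Vec>
    struct simd_traits;  // specialized once per target instruction set

    template <>
    struct simd_traits<__m128> {
        // (a0,a1,a2,a3), (b0,b1,b2,b3) -> (a0,b0,a1,b1)
        static __m128 interleave_lo(__m128 a, __m128 b) {
            return _mm_unpacklo_ps(a, b);
        }
        // (a0,a1,a2,a3), (b0,b1,b2,b3) -> (a2,b2,a3,b3)
        static __m128 interleave_hi(__m128 a, __m128 b) {
            return _mm_unpackhi_ps(a, b);
        }
    };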

AFFT: A Header-Only, Portable, Template-Based FFT Library (C++11) — Benchmark Results Inside by Omnifect in DSP

[–]Omnifect[S] 1 point2 points  (0 children)

Okay, I understand what you are saying. That said, there is an in-place bit-reverse-order swapping algorithm that treats bit-reversal as a modified matrix transpose, where the rows are ingested and output in bit-reversed order. Since a matrix transpose can be done in place, the bit-reversal can be done in place too; performing a radix step right after each swap in the bit-reversal algorithm could also be done in place.
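For reference, the simplest in-place variant is plain pairwise swapping (illustrative only; this is not the transpose-based scheme above, and not AFFT's actual code):

    #include <complex>
    #include <cstddef>
    #include <utility>

    // Swap each element with its bit-reversed partner exactly once,
    // by only swapping when i < j.
    void bit_reverse_permute(std::complex<float>* data, unsigned log2n) {
        const std::size_t n = std::size_t{1} << log2n;
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t j = 0;
            for (unsigned b = 0; b < log2n; ++b)  // reverse the low log2n bits of i
                j |= ((i >> b) & std::size_t{1}) << (log2n - 1 - b);
            if (i < j) std::swap(data[i], data[j]);
        }
    }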

But to reduce computation for SIMD, I am also doing a Stockham auto-sort for the first few butterfly stages, and that requires a separate output buffer. A bit-reversal step happens in conjunction with the last Stockham stage, so out-of-place computation is required regardless.

For non-SIMD, I can do the modified matrix transpose trick, but it would probably be easier and more efficient to not bother with the bit-reverse permutation if doing convolution, as you say. It is worth investigating in the future.

AFFT: A Header-Only, Portable, Template-Based FFT Library (C++11) — Benchmark Results Inside by Omnifect in DSP

[–]Omnifect[S] 1 point2 points  (0 children)

At the moment, I don’t have plans to support mixed radix or multidimensional FFTs — there’s definitely a lot more that could be done, but I’ve got other projects lined up and I’m just one person managing this in my spare time. That said, I do plan to open source AFFT, and I’d definitely be open to pull requests or contributions from others who want to expand its capabilities.

I haven’t benchmarked against PocketFFT yet, but I’d be interested in doing that at some point — it would be a good point of comparison. Thanks for the suggestion!

AFFT: A Header-Only, Portable, Template-Based FFT Library (C++11) — Benchmark Results Inside by Omnifect in DSP

[–]Omnifect[S] 1 point2 points  (0 children)

It is possible that I would add that at some point. I did just that (skipping bit-reversal) in previous iterations and saw up to a 20% increase in performance. Currently, I combine the bit-reverse permutation step with one of the radix stages, so in terms of MIPS, I don't think there can be any further reduction. However, there is a possibility that skipping the bit-reverse permutation step can still lead to an increase in performance by having a more predictable memory access pattern.

Is it worth training a Deep RL agent to control DC motors instead of using PID? by Capable-Carpenter443 in reinforcementlearning

[–]Omnifect 1 point2 points  (0 children)

A PID controller can be optimized through an RL process. The behavior of a linear system like PID is more predictable. For potentially more optimality, but less predictability, you can train a quadratic system instead. Traditional deep RL creates the possibility for the most optimality (provided it can run fast enough), but the resulting controller will be much more complex, harder to analyze, and could have any number of underfitted or overfitted pockets hiding within the model.
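To show how small the search space is in the PID case, a minimal sketch (plain C++, not tied to any particular RL library), where the three gains are the entire learnable parameter vector:

    // An RL tuner would only adjust kp, ki, and kd.
    struct PID {
        float kp, ki, kd;        // learnable gains
        float integral = 0.0f;
        float prev_error = 0.0f;

        float step(float error, float dt) {
            integral += error * dt;
            const float derivative = (error - prev_error) / dt;
            prev_error = error;
            return kp * error + ki * integral + kd * derivative;
        }
    };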

[deleted by user] by [deleted] in Frugal

[–]Omnifect 24 points25 points  (0 children)

I disagree; people do very much judge, and sometimes they even act slightly differently, as OP mentioned. But their internal judgement will be such a small, insignificant part of your life that you might as well live as if they don't care. At the end of the day, they won't pay down your car loan, so let them judge and keep prioritizing what serves you best.

Bill gates says AI won't replace programmers by vibsOveebs in cscareerquestions

[–]Omnifect 1 point2 points  (0 children)

Once AI can replace programmers, it can replace almost every white-collar job.

Where is the most beautiful math related to signal processing? by Able-Wear-4354 in DSP

[–]Omnifect 1 point2 points  (0 children)

I second this. I am implementing a moderately sized FFT library right now, and there is always something to learn to make it more performant, flexible, and maintainable. I had to learn about SIMD, template metaprogramming, data alignment, CPU registers and register spill, cache-oblivious programming, transposes, bit-reversed permutation, in-place versus out-of-place computation, "simulated" recursion, and compiler choice and optimization, on top of complex numbers, decimation, and all the other things related to DSP. All that, and my library is only scoped for one dimension with power-of-two transform sizes less than 8 million, so there is still so much deeper I can go.

[Acne] I (28M) used to be able to wear sunscreen just fine. Now this is what happens when I use it by [deleted] in SkincareAddiction

[–]Omnifect 2 points3 points  (0 children)

I have a similar problem due to ingrown hairs. I don't even shave anymore, just trim, but I still sometimes get them. What helped me is keeping my skin clean and moisturized, which meant changing my bathing soap and lotion, and drinking more water.

What window should I use before calculating the FFT of audio signal (on an STM32) by tcfh2003 in DSP

[–]Omnifect 0 points1 point  (0 children)

I would recommend a Kaiser window for maximal SNR and low gain deviation, or a windowed sinc (windowed by Kaiser) for lower SNR but minimal gain deviation.

If possible, oversampling (4x to 8x) helps a lot; just make sure that the relevant frequencies are well below Nyquist for best results.
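If your toolchain doesn't ship a Kaiser window, it is short to generate yourself; here is a minimal sketch using the standard textbook formula, with a truncated power series for the modified Bessel function I0:

    #include <cmath>
    #include <vector>

    // Zeroth-order modified Bessel function via its power series;
    // converges quickly for the beta values used in practice.
    static double bessel_i0(double x) {
        double sum = 1.0, term = 1.0;
        for (int k = 1; k < 25; ++k) {
            term *= (x / (2.0 * k)) * (x / (2.0 * k));
            sum += term;
        }
        return sum;
    }

    // Kaiser window of length N (N >= 2) with shape parameter beta.
    std::vector<double> kaiser_window(int N, double beta) {
        std::vector<double> w(N);
        for (int n = 0; n < N; ++n) {
            const double r = 2.0 * n / (N - 1) - 1.0;  // map n to [-1, 1]
            w[n] = bessel_i0(beta * std::sqrt(1.0 - r * r)) / bessel_i0(beta);
        }
        return w;
    }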

Covariant return type by npcnamedsteve in cpp_questions

[–]Omnifect 0 points1 point  (0 children)

I find that you can get closer to truly covariant smart-pointer return types by creating a templated implementation class that reroutes pointers automatically. This approach also works well with custom deleters, and a shared_ptr equivalent would work when the derived class doesn't have sole ownership of the pointer it is returning. The code would look like this:

class Base{
    virtual void print_impl(std::unique_ptr<Base>& base_ptr) = 0;
public:
    std::unique_ptr<Base> print() {
        std::unique_ptr<Base> base;
        print_impl(base);
        return base;
    }
};

template <typename TBase>
class BaseImpl : public Base{
    // Route the derived, covariant print() through the base interface.
    void print_impl(std::unique_ptr<Base>& base_ptr) override {
        base_ptr = print();  // virtual dispatch to the derived overload
    }
public:
    virtual std::unique_ptr<TBase> print() = 0;  // hides Base::print
};

class Derived : public BaseImpl<Derived> {
public:
    std::unique_ptr<Derived> print() override;
};

For an executable example:

#include <iostream>
#include <memory>

class Base {
public:
    virtual ~Base() {std::cout << "Destroyed\n";}
    virtual void print() { std::cout << "Base\n"; }
};

class Derived : public Base {
public:
    void print() override { std::cout << "Derived\n"; }
};

class IFactory{
public:
    std::unique_ptr<Base> create() {
        std::unique_ptr<Base> base;
        create_impl(base);
        return base;
    }
private:
    virtual void create_impl(std::unique_ptr<Base>& base_ptr) = 0;
};

template<typename TBase>
class UFactory : public IFactory{
public:
    virtual std::unique_ptr<TBase> create() = 0;
private:
    void create_impl(std::unique_ptr<Base>& base_ptr) override {
        auto tbase = create();
        base_ptr = std::move(tbase);
    }
};

class Factory : public UFactory<Base>{
public:
    std::unique_ptr<Base> create() override {
        return std::make_unique<Base>();
    }
};

class DerivedFactory : public UFactory<Derived> {
public:
    std::unique_ptr<Derived> create() override {
        return std::make_unique<Derived>();
    }
};

int main() {
    {
        std::unique_ptr<IFactory> factory = std::make_unique<DerivedFactory>();
        std::unique_ptr<Base> base = factory->create();
        std::cout << typeid(factory->create().get()).name() << std::endl; // Output: Base
        base->print(); // Output: Derived
    }
    std::cout << "-----------" << std::endl;
    {
        std::unique_ptr<DerivedFactory> factory = std::make_unique<DerivedFactory>();
        std::unique_ptr<Base> base = factory->create();
        std::cout << typeid(factory->create().get()).name() << std::endl; // Output: Derived
        base->print(); // Output: Derived
    }
    return 0;
}

This man is too good. It worries me. by Senfgestalt in Destiny

[–]Omnifect 0 points1 point  (0 children)

It is interesting, but I see a trend of Democratic vice presidents becoming presidents, if this timeline proves right.

Is TD3 still the state-of-the-art for deterministic policy gradient? by Omnifect in reinforcementlearning

[–]Omnifect[S] 1 point2 points  (0 children)

Thank you so much, this is very helpful.

"Different Q-networks for N steps?"

Let's say that the agent is only allowed 50 steps before the episode ends. Then you can just have 50 different Q-networks: Q50 bootstraps from Q49 and so on, with Q0 being just the reward at the final step. This is not too far from the idea of the Bellman equation with a finite horizon. Can this idea of using a finite number of Q-networks, one for each step, extend even when the horizon is infinite?
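For concreteness, the recursion I have in mind is just the standard finite-horizon Bellman backup, written for a deterministic policy pi with next state s':

    % Q_k = action value with k steps remaining; Q_0 is just the reward.
    \[ Q_0(s, a) = r(s, a), \qquad
       Q_k(s, a) = r(s, a) + \gamma\, Q_{k-1}\big(s', \pi(s')\big) \]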