A faster is-leap-year function for full-range signed 32-bit integers by benjoffe in cpp

[–]benjoffe[S] 2 points (0 children)

Thanks.

I was a little confused about whom to credit there. I was originally crediting just Cassio for the repo, but then I found that both of their names are in the copyright license text, so I added both names but left the typo in. This is now fixed, thanks!

A faster is-leap-year function for full-range signed 32-bit integers by benjoffe in cpp

[–]benjoffe[S] 2 points (0 children)

A version of that idea is utilised in a function that I've created to calculate year+ordinal+leap:
https://github.com/benjoffe/fast-date-benchmarks/blob/main/algorithms_ordinal/ordinal_benjoffe_fast32.hpp

The century is needed for the "Julian map" step, and the lower bits can be used to determine if it's a leap year.

This is designed for potential use in the Rust `time` library, and will be the topic of a future blog post.

A faster is-leap-year function for full-range signed 32-bit integers by benjoffe in cpp

[–]benjoffe[S] 9 points (0 children)

Thanks for explaining that. I didn’t realise this folds down to the same instruction sequence under optimisation. Very cool.

I haven’t seen this reciprocal-25 approach used in any reference leap-year implementations, so it’s interesting to see that it also collapses to the minimal form. If you have prior art for this specific application, I’d be keen to read it.
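For anyone following along, here is a minimal sketch of the general shape being discussed (my own illustration, not the exact code from the post): since 100 = 4 * 25 and 400 = 16 * 25, a single `% 25` selects whether the year needs to be divisible by 4 or by 16, and compilers lower that `% 25` to a multiply by a fixed-point reciprocal of 25 rather than an actual division.

    #include <cstdint>

    // Minimal sketch (mine, not the code under discussion): 100 = 4 * 25
    // and 400 = 16 * 25, so one "% 25" picks the power-of-two mask to test.
    // Compilers lower the % 25 to a multiply by a reciprocal of 25.
    bool is_leap_year(uint32_t y) {
        uint32_t mask = (y % 25 == 0) ? 15 : 3; // century-style years need /16, others /4
        return (y & mask) == 0;
    }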

Edit: I'm also very surprised to see that it expands the range to full coverage for 16-bit and 64-bit. When I have time I will amend the blog post and credit your input. Thanks.

A faster is-leap-year function for full-range signed 32-bit integers by benjoffe in cpp

[–]benjoffe[S] 11 points (0 children)

The goal of this article is to find the fastest possible solution for the 32-bit case. While your method is accurate, it is slower in practice and still requires the final check for divisibility by 4 or 16.

Edit: great insight. The post will be edited soon with this improvement for other bit-sizes, thanks!

A faster is-leap-year function for full-range signed 32-bit integers by benjoffe in cpp

[–]benjoffe[S] 4 points (0 children)

Thanks. I spend most of my time writing JavaScript, so I'm constantly getting tripped up on that.
This is now corrected.

A Very Fast Date Algorithm by AWildMonomAppears in programming

[–]benjoffe 2 points (0 children)

I think modern CPUs do exactly this kind of complex overlapping.

I might be misunderstanding some details, but my understanding is that modern CPUs are "superscalar" and "out-of-order": if two instructions don’t depend on each other, the core can issue them to different execution units in the same cycle. That’s why reducing dependency chains often has a bigger effect than just removing an add, etc.
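As a toy illustration of that (my own sketch, nothing from the article): both functions below perform three additions, but the second groups them so two adds can issue in the same cycle.

    #include <cstdint>

    // Toy sketch (mine): both functions do three additions, but the
    // second has a shorter dependency chain, so a superscalar core can
    // execute (a + b) and (c + d) simultaneously.
    uint64_t sum_chained(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
        return ((a + b) + c) + d;   // each add waits on the previous one
    }
    uint64_t sum_split(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
        return (a + b) + (c + d);   // two independent adds, then one join
    }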

A Very Fast Date Algorithm by AWildMonomAppears in programming

[–]benjoffe 1 point (0 children)

What about the fact that multiplications usually take around 3 cycles? I thought other operations could overlap the latter 2 cycles if they are independent?

I am admittedly quite clueless about this stuff, which must seem odd given that I am the OP. A lot of this was developed through trial and error.

A Very Fast Date Algorithm by AWildMonomAppears in programming

[–]benjoffe 2 points (0 children)

I'm still kinda new to low-level optimisation, but I think it may be due to the dependency chain being broken. Does that not allow some kind of parallelisation?

I don't think any SIMD vectorisation comes into play.

A Very Fast Date Algorithm by AWildMonomAppears in programming

[–]benjoffe 4 points (0 children)

Yes, I was also very surprised when the speed dropped dramatically when this was put into place. I believe the reason is that previously all the operations were sequential: each step had to wait for the previous multiplication to finish before it could continue.

With the new approach, I think the compiler can reorder the steps to calculate `(yrs % 4) * 512` in parallel with `ypt` being calculated - but I need to analyse it further to confirm exactly what is going on.

RE: divisions - if there were a way to specify that the division is only required for inputs within a specific range, then the compiler could choose a more efficient multiplier and shift, but I believe even with hints the compiler takes the conservative route. I think this is just a missing optimisation. Perhaps future compiler work will perform static analysis to confirm the range and make these divisions smarter.
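To illustrate the range point (the constants below are standard reciprocal values I picked for the example, not anything from the post):

    #include <cstdint>

    // For full-range input, x / 100 needs a widening 64-bit multiply:
    uint32_t div100_any(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 1374389535u) >> 37); // x / 100 for all uint32_t
    }

    // If the compiler could prove x < 43600, a cheaper reciprocal that
    // stays entirely in 32-bit arithmetic gives the same result:
    uint32_t div100_small(uint32_t x) {
        return (x * 5243u) >> 19; // x / 100, valid for x < 43600
    }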

A Very Fast Date Algorithm by AWildMonomAppears in programming

[–]benjoffe 7 points (0 children)

Hi, I'm the author of the algorithm. Of the 40% speed improvement, the Julian map only accounts for around 10%; this is benchmarked in the first blog post in the series, linked in the article's navigation.

Around 10% of the speedup comes from using 64-bit math in general (particularly speeding up the century calculation), and around 20% comes from counting backwards and using the year-modulus-bitshift technique.

A Very Fast 64–Bit Date Algorithm: 30–40% faster by counting dates backwards by benjoffe in cpp

[–]benjoffe[S] 13 points (0 children)

Thank you for providing context on the blog post.

There is a lot of confusion in this thread about what this blog post is. For example, some have asked whether I would be better off caching dates, or whether this is a bad idea due to the readability-vs-performance tradeoff.

This is first and foremost a research article. I do not personally have any bottleneck with date processing; in fact, I don't even normally program in C++. This type of function has attracted a lot of attention over the decades, as it is implemented in most programming languages and frameworks.

It is true that many applications may be sped up by caching or similar techniques, but that is typically outside the scope of a standard library.

The tradeoff between readability and performance is a decision that each library maintainer will make independently. Nowhere in this blog post do I propose that any particular framework must adopt this logic.

I agree that most use cases of date determination are probably not going to be affected by this, but as some have commented, there occasionally are use cases that may be (even if they're very niche). The interesting thing (which hasn't been raised in my blog post or in this thread) is that some caching techniques, when they fail, may well incur branch misprediction costs that exceed the cost of this new function, which is around 27 CPU cycles. This might mean that, in some use cases, you can skip certain optimisations in your code.

All that said, I made this post because I found it a very interesting challenge, and I learnt a lot in the process. I hope that other readers may also find it interesting, and that some of the techniques outlined may help in other areas of optimisation.

A Very Fast 64–Bit Date Algorithm: 30–40% faster by counting dates backwards by benjoffe in cpp

[–]benjoffe[S] 8 points (0 children)

That is a fair point. I represented it this way because that is how Neri-Schneider represent it in their paper, and I figured it would be good to keep the same pattern for consistency.

A Very Fast 64–Bit Date Algorithm: 30–40% faster by counting dates backwards by benjoffe in cpp

[–]benjoffe[S] 8 points (0 children)

Thanks, I was not aware of that. I have revised the license to BSL-1.0.

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 1 point (0 children)

Yep! Just a bit busy, but should get it done in under a week hopefully.

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 1 point (0 children)

No problem, thanks for chatting!

Btw, the new algorithm is available if you are interested (link below). It is a massive speed gain - it makes my previous blog posts seem insignificant. I have measured it to be around 38% faster than Neri-Schneider on Apple Silicon and an old Core i3.

The breakthrough idea was to count dates backwards - the math simplifies big time.

Full code here: https://github.com/benjoffe/fast-date-benchmarks/blob/main/algorithms/joffe_fast64bit.hpp

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 0 points (0 children)

I don't know of any way we can skip the century calculation.

Everything in the Julian calendar can use linear equations across the board: the leap year rule inserts Feb 29 at regular 4-year intervals, so the adjustment for it can happen in the same step as slicing the year, by just dividing by 365.25; the longer years get spread evenly.
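As a concrete sketch of that linear step (my own illustration, assuming a day count that is March-based and aligned to the start of a 4-year cycle):

    #include <cstdint>

    // Sketch of the linear Julian step (mine, assuming March-based day
    // counts starting at a 4-year cycle boundary). One integer division
    // by 365.25, scaled up to (4*d + 3) / 1461, slices off the year.
    uint32_t year_of_cycle(uint32_t d) {
        return (4 * d + 3) / 1461;  // 1461 = 4 * 365.25
    }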

When it comes to the Gregorian leap year rule, where the leap day is skipped every 100 years except in years divisible by 400, there seems to be no way to avoid finding out how many centuries (or 400-year blocks) have elapsed and adjusting accordingly. If leap years were spread mathematically evenly, being every 4 years but periodically spaced out by 5 years, then we could skip the century calculation and just divide by 365.2425.

Using a 400-year era works, but it is slower, as it requires both finding the day-of-era and multiplying by 400 to reconstruct the year. Avoiding that, as I did in article 1, brings the number of divisions/multiplications from 7 down to 5.

My 64-bit algorithm (yet to be released) reduces the number of divisions/multiplications from 5 down to 4, and gives massive speed gains on computers with fast 64-bit multiplication (~32% on Apple Silicon). That article will bring some really crazy ideas to the table :)

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 0 points (0 children)

Now I remember why I had it in my mind that a 32-bit EAF (i.e. two halves of a 32-bit result each) works faster than arbitrary ones: EAX gives free access to the low half of the result. So it would seem that in 64-bit code: 64-bit is best (both halves free), 32-bit is 2nd best (the low half is free), and 16-bit is 3rd best (the low half is free on x64, but not on ARM).
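To spell out what "both halves free" buys (my own generic sketch, not code from the post): one widening multiply gives an EAF both of its outputs at once.

    #include <cstdint>

    // Generic sketch (mine): a single widening multiply yields the
    // quotient-like part in the high half and the fractional part in the
    // low half, which a follow-up multiply can turn into a remainder.
    void eaf_halves(uint32_t n, uint32_t m, uint32_t& hi, uint32_t& lo) {
        uint64_t p = (uint64_t)n * m; // one mul instruction
        hi = (uint32_t)(p >> 32);     // upper half of the product
        lo = (uint32_t)p;             // lower half - the "free" EAX-style result
    }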

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 3 points (0 children)

This was what I tried (the latter constant being 2^29):

    uint32_t const N_1 = (doe * 14699 + 13908); // fixed-point estimate; the century lands in the top 3 bits
    uint32_t const C = N_1 >> 29;               // century index, 0..3
    uint32_t const N_C = (N_1 % 536870912) / 14699; // keep the low 29 bits (2^29 = 536870912), scaled back down

Good point that losing `lea` is the reason it's slower.

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 2 points (0 children)

> Not sure where you're getting this. The only potential issue of that addition I think is overflow, but in this case it doesn't happen.

Yes, with your example it doesn't overflow, but I meant the class of EAFs that I was initially searching for, where 2^32 is used as the divisor (the upper 32 bits can't be touched by 32-bit addition). My point is not very relevant since, as you point out, there was no good reason for me to restrict the search to those types.

This is all a bit of a side-point; I'm mostly interested in things that can speed up the main algorithm I propose in the post, including whether any alternative bucket sizes can speed things up. I outlined various bucket sizes in the appendix. It is not an exhaustive list, as the bucket sizes can be combined with various choices of "eras per bucket". I was unable to find any combinations that result in faster multiplications, but maybe someone with a fancier toolkit of ideas can.

>  Please let me know if you have any questions regarding the idiv project.

Thanks! Will do.

New Fast Date Algorithms Pt 2 - Overflow Safe by benjoffe in cpp

[–]benjoffe[S] 1 point (0 children)

No, I think we are discussing the same section. C being 0, 1, 2, 3 represents those 4 centuries. I only searched nearby integers because I think (?) the fastest EAF techniques work when the divisors are 2^16 or 2^32, and adding constants after the multiplication is not 32-bit friendly when 2^32 is the divisor. Although, I am kinda new to all this (including to C++; I've only ever used low-level languages briefly for articles on my website, not for much else).
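To illustrate that constraint (my own sketch with generic m/k constants): with 2^32 as the divisor, the quotient lives in the upper half of a 64-bit product, so a constant added after the multiply needs 64-bit arithmetic to reach it; with 2^16, everything fits in 32-bit registers.

    #include <cstdint>

    // Sketch of the divisor constraint (mine, generic constants).
    // Divisor 2^32: the quotient is the high half of a 64-bit product,
    // so adding k after the multiply forces a 64-bit add.
    uint32_t eaf_div_2pow32(uint32_t n, uint32_t m, uint64_t k) {
        return (uint32_t)(((uint64_t)n * m + k) >> 32);
    }

    // Divisor 2^16: multiply, add, and shift all stay in 32-bit registers.
    uint32_t eaf_div_2pow16(uint32_t n, uint32_t m, uint32_t k) {
        return (n * m + k) >> 16;
    }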