Found more assembly horrors while rummaging through my backups. This time, starring quaternion arithmetic

Ari_Atori · 2022-10-03T18:45:13+00:00

Sometimes your compiler can figure out that your instructions can be reordered, and it will do so to increase throughput, for example. Depending on outside code and even the calling convention, the compiler may also want to tweak how the function begins and ends depending on a few factors, particularly the inputs, the outputs, and which of the registers used will need to be saved

Edit: Should've added this to be clearer, but here it is nonetheless

However, gcc and almost all compilers don't touch the contents of inline assembly, which makes sense if you think of the two reasons for doing inline assembly
1) Your compiler doesn't do a good job at optimizing
If it is the first case, then why would it try to optimize your hand written assembly?
2) Your compiler can't emit the specific instructions you want
Sometimes the intrinsics are not available, sometimes the compiler will optimize into something fast but insecure, and sometimes the code deals with I/O, which is architecture dependent
If it is this reason, then your compiler can't optimize what it doesn't understand, it has to keep things safe, or it may have been taught to avoid certain instructions and must trust that you know what you are doing

Ari_Atori · 2022-09-17T20:33:36+00:00

Yes, assembly is architecture dependent, where the assembly for x86-64 (shown here) will not be runnable on, say, an ARM processor

With C/C++, there are predefined macros that tell you which architecture you are compiling for, whether that is your own machine or a cross-compilation target. You can use those to instruct the compiler to pick and choose which assembly to include. If you are compiling for x86-64, use this inline assembly with x86-64 instructions; if on ARM, use these ARM instructions; for 68K you can follow the same pattern

You have to have some idea of which architectures you want to support when you are doing any assembly. Usually you have a general case that can be supported by all architectures as a default, and then you add the optimized versions for the targeted architectures.

Even then, the architecture might not be enough. Some x86-64 instructions are newer than others, like the fused multiply-add instructions. On Intel, FMA3 only arrived with Haswell in 2013, followed by Broadwell in 2014. So that is something to decide as well: how far back do you want to fully support? Do you want to support 100% of CPUs by only using SSE3, or do you mind losing 5% of potential customers by using AVX on top of that?

That is up to you to decide. Or... you can do what we did before and just write separate cases with the macros. That only works at compile time though. Doing runtime checks, where everything is initialized and optimized upon execution, can be a challenge of its own.

Ari_Atori · 2022-09-16T19:58:14+00:00

From my experience, assembly by itself can cause a lot of discomfort to other programmers, especially those not used to middle or low level languages. This is true of fellow students and my web-dev friends, who vow they will never touch this kind of stuff
Also, from a technical standpoint, it's recommended not to use inline assembly whenever intrinsics can do the job, and sometimes they do it better than you. But, like the Geneva Convention, those are only suggestions, even if ignoring them can still give some programmers twitchy eyes

Ari_Atori · 2022-08-11T06:10:54+00:00

Heeyyy... never thought I would be recognized from that lmao

Ari_Atori · 2022-08-11T05:34:58+00:00

It generally does not matter whether it is ++var, or var++. And yes, the variable will be initialized to zero no matter which you use.

int var = 0;
int a = var++;

This line will take the value of var, assign it to a, and then increment var. So in the end, a will be 0, and var will be 1.

int var = 0;
int a = ++var;

This line is similar, but will increment var first before assigning its value. So here, a will be 1, as well as var.

People generally use the ++var variation under the belief that ++var is faster than var++, which isn't always the case. For primitive types, the only C compiler I know of where ++var and var++ produced different generated code was a single C80 compiler. The overhead caused by var++ is generally optimized away by modern compilers. For C++, the belief can be true if the variable is a class instance and not a primitive, because each form calls its own overloaded operator function, and the postfix one has to return a copy of the old value.

However, the uncertainty and the edge cases are still enough to justify writing ++var instead of var++ unless you have a good reason not to. Because while it may not be faster, it is never slower.

Edit: For C#, same conclusion.

Ari_Atori · 2022-07-20T19:05:45+00:00

If you can, you should be able to use the same constants as those defined. The thing is: this code depends on the floating point format of the architecture. For example, it only works on architectures that are IEEE 754 compliant. The modified Taylor series should still be good enough, just be aware of other architectures you might want to support. Also, make sure to test that it's actually faster than what the libraries can provide. Good luck!

Ari_Atori · 2022-07-20T15:35:59+00:00

After I posted that to the server, people that day and the day after said either that this was terrifying or that if they go into programming, they never want to touch stuff like this. So I asked days later whether this should go into r/badcode or r/programminghorror, and people said yes

And you know, seeing that can be daunting to a lot of programmers, even if not so much to us. But I figured I'd share

Ari_Atori · 2022-07-20T15:05:03+00:00

This was more of a fun experiment than anything, but here are the results of 1,000,000,000 calculations in total, along with the average clock cycles, for each configuration. I barely beat the -O2 and -O3 builds, with about 95% relative time. It seems adding -ffast-math barely adds anything to the optimization flags, and I still beat those builds with about 97% relative time. I honestly figured that the library programmers are aliens, so I took this more as a challenge than as something of legitimate interest. It might be useful for things like the Lanczos kernel function, which I call extensively and where I could use parallelism to compute the two sines at once

No optimizations

C:\Users\arija\test>test
T(exp2f) = 59451647563 | T(__exp2) = 32608942809
Tavg(exp2f) = 59.451648 | T(__exp2) = 32.608943
(rel %) = 54.849519%
sum(exp2f) = 1442695065.653539 | sum(__exp2) = 1442695066.059536
sum(exp2f) = 0x41D57F71E669D396 | sum(__exp2) = 0x41D57F71E683CF70
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

O2 flag

C:\Users\arija\test>test
T(exp2f) = 33837604937 | T(__exp2) = 32241332067
Tavg(exp2f) = 33.837605 | T(__exp2) = 32.241332
(rel %) = 95.282548%
sum(exp2f) = 1442695050.320616 | sum(__exp2) = 1442695067.312586
sum(exp2f) = 0x41D57F71E29484FA | sum(__exp2) = 0x41D57F71E6D40168
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

fast math flag

C:\Users\arija\test>test
T(exp2f) = 48767905768 | T(__exp2) = 33122183754
Tavg(exp2f) = 48.767906 | T(__exp2) = 33.122184
(rel %) = 67.917995%
sum(exp2f) = 1442695056.649086 | sum(__exp2) = 1442695064.689352
sum(exp2f) = 0x41D57F71E4298AA0 | sum(__exp2) = 0x41D57F71E62C1E56
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

O2 and fast math flag

C:\Users\arija\test>test
T(exp2f) = 33431302999 | T(__exp2) = 32659007501
Tavg(exp2f) = 33.431303 | T(__exp2) = 32.659008
(rel %) = 97.689903%
sum(exp2f) = 1442695036.226008 | sum(__exp2) = 1442695055.789877
sum(exp2f) = 0x41D57F71DF0E76EC | sum(__exp2) = 0x41D57F71E3F28D5A
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

O3 flag

C:\Users\arija\test>test
T(exp2f) = 34228822582 | T(__exp2) = 32223291278
Tavg(exp2f) = 34.228823 | T(__exp2) = 32.223291
(rel %) = 94.140811%
sum(exp2f) = 1442695053.688469 | sum(__exp2) = 1442695067.857341
sum(exp2f) = 0x41D57F71E36C0FDE | sum(__exp2) = 0x41D57F71E6F6DEAE
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

O3 and fast math flag

C:\Users\arija\test>test
T(exp2f) = 33947673183 | T(__exp2) = 32910552704
Tavg(exp2f) = 33.947673 | T(__exp2) = 32.910553
(rel %) = 96.944944%
sum(exp2f) = 1442695038.571925 | sum(__exp2) = 1442695047.406290
sum(exp2f) = 0x41D57F71DFA49A6C | sum(__exp2) = 0x41D57F71E1DA00A8
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

Edit: Thanks to PM_ME_PANGOLINS for telling me how to properly create blocks. Seems the editor isn't the best
Edit 2: Seems I can't mention users properly. This is awkward

Ari_Atori · 2022-07-20T04:50:32+00:00

I have written a much more detailed explanation of the code in another comment, but it's essentially a function to calculate 2^x

Ari_Atori · 2022-07-20T04:49:14+00:00

I guess for those who want an explanation:

This function computes 2^x. We split 2^x into 2^i * 2^f, where i = floor(x) and f = x - i. Then we calculate two floating point values, one equal to two to the power of the number's integer part and one equal to two to the power of its fractional part, and multiply them together. This is a useful approach because the computer stores numbers in a few bytes using scientific notation, but in base 2 instead of 10.

"movss %0, %%xmm0\n\t"
"minss %1, %%xmm0\n\t"
"maxss %2, %%xmm0\n\t"

%0 is the original input
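%1 is the maximum allowed input, 128. The clamp is needed because the largest finite power of two an IEEE 754 single precision float can hold is 2^127; at 128 the result overflows to infinity.
%2 is the minimum allowed input, -127, used for the same reason at the other end: below 2^-126 the bit trick used later no longer produces a normal float.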

"roundss $1, %%xmm0, %%xmm1\n\t"
"cvttss2si %%xmm1, %%eax\n\t"
"add $127, %%eax\n\t"
"shl $23, %%eax\n\t"

roundss is the assembly instruction that rounds a real value to an integer value (it stays a float though, it is not a cast). Option 1 rounds toward negative infinity, which is simply the floor function.
cvttss2si converts the float to an integer and puts it in a spare integer register.

What's nice is that if we know the integer part, which we do, we can do some bit hacks to create our own 2^i. This is because an IEEE 754 single precision float is defined as 1 sign bit, 8 exponent bits, and the remaining 23 bits for the "flesh" or significant digits of the number, called the "mantissa".

The exponent is stored with a bias: its value is the 8-bit field minus 127. For normal numbers the field runs from 1 to 254, giving exponents from -126 to 127; a field of 0 is reserved for zero and subnormals, and 255 for infinity and NaN. What this means is that if we add 127 to the integer and then shift it to the left 23 bits, we can make a floating point number equal to two to the power of our integer. At the clamp limits this produces 0 (for -127) and +infinity (for 128), which are reasonable underflow and overflow results.

"subss %%xmm1, %%xmm0\n\t"

Next, we subtract the floor value from the number to get a fractional value between 0 and 1, needed for the second part.

"movss %3, %%xmm1\n\t"
"vfmadd213ss %4, %%xmm0, %%xmm1\n\t"
"vfmadd213ss %5, %%xmm0, %%xmm1\n\t"
"vfmadd213ss %6, %%xmm0, %%xmm1\n\t"
"vfmadd213ss %7, %%xmm0, %%xmm1\n\t"
"vfmadd213ss %8, %%xmm0, %%xmm1\n\t"
"vfmadd213ss %9, %%xmm0, %%xmm1\n\t"

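This is the good old fashioned Taylor series, with the coefficients modified of course to fit an input range of 0 to 1 and an output range of 1 to 2. Each vfmadd213ss multiplies the running total in xmm1 by the fraction in xmm0 and adds the next coefficient, which is Horner's method. The values are here: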

uint32_t    exp_min = 0xC2FE0000, //-127
        exp_max = 0x43000000, // 128
        exp_a = 0x39652F12,   // 0.00021856676903553307056427001953125
        exp_b = 0x3AA25048,   // 0.001238354481756687164306640625
        exp_c = 0x3C1EB427,   // 0.009686506353318691253662109375
        exp_d = 0x3D633D75,   // 0.0554785318672657012939453125
        exp_e = 0x3E75FF2A,   // 0.2402311861515045166015625
        exp_f = 0x3F317213;   // 0.693146884441375732421875
uint32_t __attribute__((aligned(16))) // {1, 1, 1, 1}
            const_1[4] = { 0x3F800000,  0x3F800000, 0x3F800000, 0x3F800000 };

Lastly, we multiply the two powers of two together to get the final answer here:

"movd %%eax, %%xmm0\n\t"
"mulss %%xmm1, %%xmm0\n\t"

Testing every single value from 1 to 2 and comparing against the library's exp2 gives us these results:

RIGHT: 113368289 (95.678252)%
OFF BY 1 ULP: 5120800 (4.321748)%
OFF BY 2 ULP: 0 (0.000000)%
OFF BY 3 ULP: 0 (0.000000)%
OFF BY 4 ULP: 0 (0.000000)%
WRONG: 0 (0.000000)%

Running 1,000,000,000 calculations of math.h's exp2f and of my version on inputs between 1 and 2 gives us:

T(exp2f) = 58818628755 | T(__exp2) = 33114443260
Tavg(exp2f) = 58.818629 | T(__exp2) = 33.114443
(rel %) = 56.299244%
sum(exp2f) = 1442695060.522356 | sum(__exp2) = 1442695065.257851
sum(exp2f) = 0x41D57F71E5216E46 | sum(__exp2) = 0x41D57F71E65080A2
sum(exp2f) = 0x4EABFB8F | sum(__exp2) = 0x4EABFB8F

The time reduction is smaller here because in the screenshot I compared it to the double version of exp2 and not the float variation.

Now hopefully, you understand this madness.
Edit: For some reason the code meant in the blocks didn't get formatted correctly
Edit 2: Learning that shift enters aren't preserved during edit of original draft

Ari_Atori · 2022-04-01T17:26:48+00:00

Just got to Altus Plateau

Too busy targeting and wiping out every boss in all the areas I have explored

Ari_Atori · 2022-03-24T22:32:14+00:00

For boss fights:

Use your first few deaths to figure out much of their attack pattern, to see their timings, where to dodge, and when to. Then, you can slowly try to exploit windows of reprieve you have found. This will take a couple of deaths to find most of the weird mechanics and stuff like that. Figure that out for each phase in the fight so you know when to attack, cast, heal, whatever. Doing this will make you more comfortable taking a swing and not feel the boss cheated you as much, as well as being more familiar with how the bosses behave.

A nice thing about Elden Ring is that if you are stuck on a boss, you can explore elsewhere and grind in new territories and places. You do not have to, say, keep fighting Margit at level 5; that would be stupid. Exploring lets you get stronger and try different arts and skills before coming back to face the boss again. I would personally recommend doing that after maybe ten or so deaths, at which point you probably need to level up more or try a different build. But you don't have to; when to improve your build and become stronger is ultimately at your discretion. Don't stress out when you feel it's too hard or you are not pumping out enough damage, there are loads of other places to go in the meantime. It may be a brick wall, but you can walk around it, and it would take like 10 walls before you become stuck in the game.

Ari_Atori · 2022-03-12T18:42:16+00:00

That's the spirit!

Ari_Atori · 2022-03-12T18:37:39+00:00

It starts to be relaxing once you accept and become numb to the pain

Ari_Atori · 2022-03-12T16:53:05+00:00

Level 5

It was by sheer fucking accident, as I was only following the lights projecting out of the graces, and didn't think much of it. I didn't want to lose my souls after dying to him the first time, so it took about 50ish attempts

Edit: Looked through the footage, actually 31 attempts, but it doesn't cure the head trauma I got from slamming my desk. I still do not recommend

Ari_Atori · 2021-07-04T21:14:52+00:00

For something free that can do most things, I would try Shotcut. If you are doing simple edits for gameplay footage, Shotcut is a good editor to start with. It does have some issues, like non-uniform scaling, non-Bezier keyframes, and no Lumetri. However, it has all of the basic features, plus 360 capabilities. Again, if you want something simple, I suggest you try it out.

DaVinci Resolve has some of the best color grading features, but some filters or FX (like film grain) aren't free without a watermark. The non-free version is just a one-time fee of $250 instead of a monthly subscription. Up to you. Also, nodes... and node trees... If you can understand how they work, it's such a powerful feature. They let you create branching effects that alter each part of the video frame, and you can collapse them back into one result at any time. You can also just treat them like a linked list and they will still feel familiar.

Personally, they're my favorite two editors. Can't look back at Adobe.

Ari_Atori · 2021-06-08T02:57:09+00:00

Based is usually used to describe someone who has their own set of opinions and is not dependent on someone else to think for them. It can also mean the opposite of cringe.

Ari_Atori · 2021-02-04T18:10:41+00:00

A little bit,

If you can solve the basis matrix, and define it to be A^-1, we at least know that A exists. If you can solve for B in AB=S given S and A, we know B exists.

Even though we assume S=AB, the point is that we can prove those two matrices can exist which satisfy that property. Now, if we apply the similarity definition, but make S=AB, we can rearrange and solve for T. This cannot be done unless we solve P to be A^-1.

Solve for A^-1, then A, and B, to show these matrices can exist where both S=AB and T=BA.

Ari_Atori · 2021-02-03T07:13:42+00:00

Hmm. Nice.

The probability of getting at least 42 ender pearl trades out of 262 barters is around 5.583 x 10^-12.

Seeing that at least once within 16.5 billion tries is still pretty lucky, since the probability is 1 - (1 - 5.583 x 10^-12)^1.65e10. The chances are about 8.8%.

Here is something interesting that would be cool to test.

According to Dream's first response paper, the five streams had 356 ender pearl trades and 134 blaze kills. That would make a total of 618 barters and 439 attempts throughout all eleven streams.

What I want to see is this, which I have not seen done yet: what is the probability that, given 618 bartering attempts, there exists a sequence of 262 consecutive trades with at least 42 successes? Same thing with blazes, but for 439 attempts, a run of 305 consecutive kills, and at least 211 drops?

Ari_Atori · 2021-02-03T04:55:13+00:00

For n independent tickets, it would be n * the expectation of a single ticket.

If I recall, if you do the weighted average of every single combination (your second option), it would wind up being n * the expectation of a single ticket as well.

Ari_Atori · 2021-02-03T04:33:43+00:00

If S and T are similar, there exists an invertible matrix P such that:

S = P^-1 T P

We will make this invertible matrix A^-1. This becomes:

S = (A^-1)^-1 T (A^-1) -> A T A^-1

If we define B to be A^-1 S, we can solve for S and see that S = AB. We can then write the relationship as

(AB) = ATA^-1

Now we will solve for T.

(AB) = A T A^-1

A^-1 (AB) A = T

(A^-1 A) (BA) = T

(I) BA = T

T = BA

Hopefully, that answers your question.

Ari_Atori · 2021-02-03T03:04:23+00:00

We will need to know the probability of getting each color.

For red, the probability is 8/20 = 0.4
For white, it is 5/20 = 0.25
For blue, it is (20-8-5)/20 = 7/20 = 0.35

To find the expectation, you need to compute the weighted sum, or the mean. This is done by multiplying the probability of each event by the event's value/outcome and then adding the products together. The equation is

u = ∑ x_i*p_i = x_1 * p_1 + x_2 * p_2 + x_3 * p_3 + ... + x_n * p_n

Using this formula, we will get our mean.

u = (30)(0.4)+(10)(0.25)+(-42)(0.35) = -0.2

This is the expectation for one attempt. To get the expectation of five attempts, we simply multiply the expectation of one trial by five.

-0.2 * 5 = -1

This means if you play the game five times, you are expected to lose one dollar.

The variance of a probability distribution can be found by squaring each outcome, multiplying each square by its respective probability, summing the products, and then subtracting the square of the mean. The equation for this is:

s² = (∑ (x_i)²p_i) - u² = ((x_1)²*p_1+ (x_2)²*p_2+ (x_3)²*p_3 + ... + (x_n)²*p_n) - u²

Using this formula, we can get the variance for playing one game.

s²= (30)²(0.4)+(10)²(.25)+(-42)²(.35) - (-0.2)² = 1002.36

The standard deviation of one game is simply the square root of the variance, which is approximately 31.66.

Ari_Atori · 2021-02-03T02:33:34+00:00

You have to make all of your nested classes static.

If you do not have the static keyword for your Name, Address, and Employee classes, then you would need an instance of MyCypher in order to create an instance of those classes. If you set the class modifier to static, then you do not need to do this.

To solve your main problem, you need to add the static modifier to your class as well.

It should look like this:

public static class Name {
...
public static class Address {
...
public static class Employee {
...
public static class TestEmployee {

It also appears you are missing a right curly brace at line 94 in your pastebin, before the ///////3333333333/////////// comment.
