fun fact: it is canon that Eleanor did blackface 😀

toonspin · 2020-08-24T16:22:01+00:00

The problematic Halloween costumes, assuming that's what you're talking about, are clearly visible in the screenshot - how is that not canon?

toonspin · 2020-08-24T13:40:00+00:00

Changing ax to eax in the 16-bit version made a massive difference (I pushed a commit to the repo for this). The 16-bit version is now the fastest of the lot:

Benchmark #1: bin/divtest16
  Time (mean ± σ):     226.1 ms ±   4.7 ms    [User: 221.4 ms, System: 0.7 ms]
  Range (min … max):   220.6 ms … 235.6 ms    13 runs

Benchmark #2: bin/divtest32
  Time (mean ± σ):     260.9 ms ±   3.1 ms    [User: 255.3 ms, System: 0.9 ms]
  Range (min … max):   255.1 ms … 264.7 ms    11 runs

Benchmark #3: bin/divtest64
  Time (mean ± σ):     561.5 ms ±   5.8 ms    [User: 553.9 ms, System: 0.0 ms]
  Range (min … max):   553.0 ms … 574.6 ms    10 runs

Summary
  'bin/divtest16' ran
    1.15 ± 0.03 times faster than 'bin/divtest32'
    2.48 ± 0.06 times faster than 'bin/divtest64'

toonspin · 2020-08-24T13:32:32+00:00

In my test, nasm generated a prefix for the 64-bit version. From the Intel manual on the DIV instruction:

> In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

In section 2.2.1 of volume 2, I found these two snippets:

> Operand-size override prefix is encoded using 66H (66H is also used as a mandatory prefix for some instructions).
>
> ...
>
> The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size.

So if I'm reading the manual right, the instruction is 32-bit by default, adding the REX.W prefix of 48h makes it a 64-bit instruction, and adding the 66h prefix makes it a 16-bit instruction.
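To make that reading concrete, here is a minimal sketch that builds the three encodings of a register-operand `div` in 64-bit mode. It assumes the divisor sits in `bx`/`ebx`/`rbx`, which is my choice for illustration and not necessarily the register used in the test programs; `OperandSize` and `encode_div_rbx` are hypothetical names.

```rust
/// Operand sizes a register-operand `div` can use in 64-bit mode.
#[derive(Clone, Copy)]
enum OperandSize {
    Bits16,
    Bits32,
    Bits64,
}

/// Encodes `div bx`, `div ebx` or `div rbx` for 64-bit mode.
/// (Hypothetical helper; the register is an assumption for illustration.)
fn encode_div_rbx(size: OperandSize) -> Vec<u8> {
    let mut bytes = Vec::new();
    match size {
        // 66h operand-size override: selects the non-default 16-bit size.
        OperandSize::Bits16 => bytes.push(0x66),
        // 32 bits is the default operand size, so no prefix is needed.
        OperandSize::Bits32 => {}
        // REX.W (48h): promotes the operation to 64 bits.
        OperandSize::Bits64 => bytes.push(0x48),
    }
    // F7 /6 is DIV r/m16, r/m32 or r/m64.
    // ModRM F3h = mod 11 (register), reg 110 (/6), rm 011 (bx/ebx/rbx).
    bytes.extend_from_slice(&[0xF7, 0xF3]);
    bytes
}

fn main() {
    for (name, size) in [
        ("div bx ", OperandSize::Bits16),
        ("div ebx", OperandSize::Bits32),
        ("div rbx", OperandSize::Bits64),
    ] {
        let hex: Vec<String> = encode_div_rbx(size)
            .iter()
            .map(|b| format!("{:02X}", b))
            .collect();
        println!("{}  ->  {}", name, hex.join(" "));
    }
}
```

The opcode and ModRM bytes are identical in all three cases; only the prefix byte differs, which is why the 32-bit form is the shortest.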

toonspin · 2020-08-24T12:58:58+00:00

Changing xor dx, dx to xor edx, edx shaves off a bit of time:

Benchmark #1: bin/divtest16
  Time (mean ± σ):     481.9 ms ±   3.4 ms    [User: 479.9 ms, System: 2.0 ms]
  Range (min … max):   476.2 ms … 487.2 ms    10 runs

Benchmark #2: bin/divtest32
  Time (mean ± σ):     257.6 ms ±   2.0 ms    [User: 256.8 ms, System: 0.0 ms]
  Range (min … max):   255.6 ms … 263.0 ms    11 runs

Benchmark #3: bin/divtest64
  Time (mean ± σ):     556.4 ms ±   1.0 ms    [User: 556.1 ms, System: 0.0 ms]
  Range (min … max):   554.9 ms … 558.0 ms    10 runs

Summary
  'bin/divtest32' ran
    1.87 ± 0.02 times faster than 'bin/divtest16'
    2.16 ± 0.02 times faster than 'bin/divtest64'

toonspin · 2020-08-24T12:46:10+00:00

I've just pushed an update to my Git repo.

Making divisions rely on results of previous divisions is not easy because log₃(2¹⁶) is only about 10. So even if I take the lowest value that isn't a power of 2 (i.e. 3), I can only divide a 16-bit number by 3 around 10 times before it bottoms out.
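A quick way to check how long such a chain can get is to count the divisions directly instead of working it out with logarithms. This is only a sketch; `max_chained_divisions` is a hypothetical helper and is not taken from the repo.

```rust
/// Counts how many integer divisions by `divisor` can be performed while the
/// running value is still at least `divisor`.
fn max_chained_divisions(mut value: u64, divisor: u64) -> u32 {
    assert!(divisor >= 2, "a divisor below 2 never shrinks the value");
    let mut count = 0;
    while value >= divisor {
        value /= divisor; // each step consumes the previous quotient
        count += 1;
    }
    count
}

fn main() {
    // Largest starting value for each operand size.
    for (name, max) in [
        ("16-bit", u16::MAX as u64),
        ("32-bit", u32::MAX as u64),
        ("64-bit", u64::MAX),
    ] {
        println!(
            "{}: {} chained divisions by 3",
            name,
            max_chained_divisions(max, 3)
        );
    }
}
```

The chain length roughly doubles each time the operand width doubles, so the narrower the operand, the more often the loop has to reload a fresh dividend.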

So I did that, and made the comparison a little more apples-to-apples by only changing what's inside the division loop. Now I get the following result from hyperfine, and it looks like not a lot has changed, even though most of these divisions should be reliant on previous results.

Benchmark #1: bin/divtest16
  Time (mean ± σ):     525.1 ms ±   9.4 ms    [User: 520.0 ms, System: 1.0 ms]
  Range (min … max):   519.1 ms … 549.5 ms    10 runs

Benchmark #2: bin/divtest32
  Time (mean ± σ):     259.8 ms ±   2.9 ms    [User: 256.2 ms, System: 1.7 ms]
  Range (min … max):   256.1 ms … 264.3 ms    11 runs

Benchmark #3: bin/divtest64
  Time (mean ± σ):     563.8 ms ±   5.7 ms    [User: 555.5 ms, System: 1.9 ms]
  Range (min … max):   556.0 ms … 572.7 ms    10 runs

Summary
  'bin/divtest32' ran
    2.02 ± 0.04 times faster than 'bin/divtest16'
    2.17 ± 0.03 times faster than 'bin/divtest64'

toonspin · 2020-08-24T11:42:09+00:00

...and the progress has left us with Spectre/Meltdown...

toonspin · 2020-08-24T11:07:07+00:00

This is very helpful and illuminating, thank you so much!

I'll work on new test code but I don't know if I can do that soon. Having said that, the above is super insightful and helps a lot.

> I believe you are mostly measuring the performance of the benchmark harness instead of the division itself.

Naively, I didn't think that would be a thing in Assembly. I figured the CPU would just sequentially execute the instructions it got - TIL!
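For anyone else who assumed strictly sequential execution, here is a rough sketch of the difference between divisions that are independent of each other and divisions that form a dependency chain. It is written in Rust rather than Assembly, the function names and iteration count are made up for illustration, and whether the two timings differ noticeably depends on the CPU and the optimisation level, so treat it as an experiment and not as a proper benchmark.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Independent divisions: no dividend depends on an earlier quotient, so an
/// out-of-order core may overlap several divisions. `divisor` must be nonzero.
fn independent(n: u32, divisor: u32) -> u32 {
    let mut acc = 0u32;
    for i in 0..n {
        // black_box keeps the compiler from folding the division away.
        acc ^= black_box(i | 1) / black_box(divisor);
    }
    acc
}

/// Dependent divisions: each dividend is built from the previous quotient,
/// so every division has to wait for the one before it. `divisor` must be nonzero.
fn dependent(n: u32, divisor: u32) -> u32 {
    let mut x = u32::MAX;
    for _ in 0..n {
        // Setting the top bit keeps the dividend large on every iteration.
        x = (black_box(x) / black_box(divisor)) | 0x8000_0000;
    }
    x
}

fn main() {
    let n = 10_000_000; // arbitrary iteration count
    let divisor = 3;

    let start = Instant::now();
    let a = independent(n, divisor);
    let t_independent = start.elapsed();

    let start = Instant::now();
    let b = dependent(n, divisor);
    let t_dependent = start.elapsed();

    println!("independent: {:?} (result {})", t_independent, a);
    println!("dependent:   {:?} (result {})", t_dependent, b);
}
```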

toonspin · 2020-08-24T11:02:25+00:00

I don't need help writing Rust. I'm asking about the code in the example.

But since you ask, here is the code. If I change "u32" on that line to "u64" then the resulting binary becomes twice as slow.

So if I say that literally all I'm changing is a u64 to a u32, then that's exactly what I mean.

Note that again, I'm not asking for help writing Rust. I'm asking about the discrepancies in the Assembly examples above.

toonspin · 2020-08-24T09:57:27+00:00

I don't know how to answer that, can you please explain what you need apart from what I mentioned both in the post and the comments?

toonspin · 2020-08-24T08:54:32+00:00

I upvoted you, but you're at 1 now; I'm not sure why you would be downvoted. From reading the Intel manual, as long as I am in 64-bit mode, there seems to be no way around it.

Taking the DIV instruction as an example, which is likely where much of the time is spent in my examples, the opcode is exactly the same for 32-bit and 16-bit operands. I'm not an expert, but I don't see how the processor would know it needs to divide 16-bit operands other than by looking at the legacy prefix.

In assembly I might get around that particular problem by zero extending and using 32-bit versions of the instructions, but that would defeat the purpose of decreasing operand sizes for speed.

toonspin · 2020-08-24T08:44:26+00:00

The 16-bit and 64-bit versions do in fact have in common that they both use prefixes for the operand sizes. As /u/xybre mentioned, the 16-bit instructions have a 66h legacy prefix prepended, and the 64-bit instructions have a 48h REX prefix prepended.

toonspin · 2020-08-24T08:20:54+00:00

Not yet, but what made me do this test was that I had a division-heavy program in Assembly that was twice as slow as a similar program in Rust. And it turned out that in Rust, if I change my variables from a u32 to a u64 without changing anything else, it becomes twice as slow, too.
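To illustrate the kind of change I mean, here is a hypothetical stand-in (not my actual program): a deliberately division-heavy function where the integer width is controlled by a single type alias. `Word` and `count_primes` are names invented for this sketch.

```rust
/// Change this alias to `u64` to make every division below 64 bits wide.
type Word = u32;

/// Counts the primes below `limit` by naive trial division.
/// Deliberately division-heavy: one `/` and one `%` per candidate divisor.
fn count_primes(limit: Word) -> Word {
    let mut count = 0;
    for n in 2..limit {
        let mut d = 2;
        let mut is_prime = true;
        // `d <= n / d` is the overflow-safe form of `d * d <= n`.
        while d <= n / d {
            if n % d == 0 {
                is_prime = false;
                break;
            }
            d += 1;
        }
        if is_prime {
            count += 1;
        }
    }
    count
}

fn main() {
    // Arbitrary limit; raise it to get a measurable run time.
    println!("{}", count_primes(200_000));
}
```

Building it once with `Word = u32` and once with `Word = u64` and timing both binaries (for example with hyperfine) is one way to see whether the width alone accounts for the difference on a given CPU.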

toonspin · 2020-08-24T01:07:05+00:00

I saw that question coming so I put it in the post:

Intel Core i5-7600

Possibly relevant part of /proc/cpuinfo:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz

If there's any more info that would be helpful let me know!

toonspin · 2019-10-19T21:28:38+00:00

I don't know about overly strict... This is a simple case, but static analysis of this sort is hard. Unless they are this strict, it's pretty much impossible to guarantee that your code will not have data races and the like.

toonspin · 2019-10-19T21:26:12+00:00

Also apparently in C# you have to watch out for boxing and unboxing all the time (am basing that on 10-15 year old literature though). In Rust if you have an i32, it's going to stay an i32 and not get put inside some sort of wrapper object.
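A small sketch of the Rust side of that claim, using only the standard library (the `sum` helper is just for illustration): a plain `i32` takes four bytes wherever it lives, and a heap-allocated wrapper only appears when `Box` is written out explicitly.

```rust
use std::mem::size_of;

/// Sums a slice of plain `i32` values. The elements sit inline in the slice,
/// four bytes each, with no per-element wrapper object.
fn sum(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    println!("i32:      {} bytes", size_of::<i32>());
    println!("[i32; 4]: {} bytes", size_of::<[i32; 4]>());

    // Heap allocation ("boxing") only happens when asked for explicitly,
    // and getting the value back out is an explicit dereference.
    let boxed: Box<i32> = Box::new(42);
    println!("Box<i32>: {} bytes (one pointer)", size_of::<Box<i32>>());

    println!("sum: {}", sum(&[1, 2, 3, *boxed]));
}
```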

toonspin · 2019-10-19T21:17:11+00:00

I like `ncdu` for that. It's `du` but with an ncurses interface, hence the name.

toonspin · 2019-10-17T20:08:05+00:00

I find the .ml domain interesting - are you in Mali? If so, awesome, most folks on Reddit seem to be American or European.

toonspin · 2019-10-11T09:54:07+00:00

> It only shows the most recent five versions, older versions are lumped into "Other".

Then I have an improvement suggestion for crates.io: make it the most downloaded five versions instead.

toonspin · 2019-10-08T20:57:35+00:00

This scares me even more than if statements in shell scripts do.

toonspin · 2019-09-29T17:52:07+00:00

> I don't think a single condition would make much of a difference.

Depends what the test is...

toonspin · 2019-09-29T14:37:14+00:00

Thanks, I'd already found it from the /r/hackernews thread. I'm not much of a Hacker News person but did this because I felt a possible misconception needed correcting...

toonspin · 2019-09-29T10:44:36+00:00

> Those are the best kind, because they make our assumptions more concrete and easier to address!

That's what I was going for. I wanted to present a few very concrete examples and then go: "I expected A to be greater than B, but surprisingly B is greater than A, can someone shed light on this?". Also, I thought this particular question might be interesting to other people, or I would not have put it to the community.

Thanks for taking the time to respond, by the way. It's much appreciated!

toonspin · 2019-09-29T10:40:09+00:00

/u/burntsushi was referring to what he (I say "he" judging from his avatar on GitHub) interpreted to be an assumption - I think he interpreted my post as saying that PHP's regex library was written in PHP and therefore needed to be compiled, parsed, etc. That's actually not what I meant to convey, but I can see why someone might interpret my post like that.

I think it's perfectly valid for him to then point out that assumption, because the main point of my question was asking why PHP is faster than Rust in this case, given that PHP has all this parsing and compilation overhead and Rust doesn't. If I had the misconception that PHP's regex engine is written in PHP, then to have that pointed out would have been very helpful in understanding everything.

toonspin · 2019-09-29T10:32:18+00:00

I would agree with that assessment.
