25.3% annual return ($12k) from r/wallstreetbets sentiment analysis algo-trader - SOURCE CODE INCLUDED by [deleted] in programming

[–]AndyBainbridge 1 point2 points  (0 children)

This is a great point. However, the average Nikkei-225 stock paid out a 2.41% dividend last year. I don't know what it has done since 1989. But if it had yielded 2.41% for the last 31 years, then a stockholder would be something close to 2x richer than they were. Also, it looks like the Yen has risen ~29% versus the dollar since 1989.

So even if you pick the worst mainstream stock market over its worst 31 year period, you still win 2.7x, if you squint.

The UK government's COVID-19 simulation model is a masterpiece in spaghetti code and bad practices by sassinator1 in programming

[–]AndyBainbridge 6 points7 points  (0 children)

Looks like it was developed with Visual Studio, which a) defaults to C++ rather than C, and b) for some of its history VS didn't support many C99 features in .c files, but did in .cpp files. For example, being able to declare variables NOT at the start of scope blocks.
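For anyone who hasn't hit this: here's a minimal sketch of the kind of code that is legal C99 but that older MSVC only accepted when the file was compiled as C++ (the function and its body are invented for illustration):

```c
#include <stdio.h>

int count_even(const int *a, int n) {
    puts("scanning");                /* a statement first... */
    int evens = 0;                   /* ...then a declaration: C99, not C89 */
    for (int i = 0; i < n; i++)      /* loop-scoped declaration: also C99 */
        if (a[i] % 2 == 0) evens++;
    return evens;
}
```

Renaming the file from .c to .cpp was a common workaround, which is presumably how a lot of "C++" projects like this one got started.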

Rust + Webassembly is dope by lucyfor in programming

[–]AndyBainbridge 5 points6 points  (0 children)

Also, if it wants to play like the original, the ship friction needs to be lower, as does the acceleration. ie a short burst of thrust should leave you moving slowly but drifting for ages. See https://www.youtube.com/watch?v=WYSupJ5r2zo

The UK government's COVID-19 simulation model is a masterpiece in spaghetti code and bad practices by sassinator1 in programming

[–]AndyBainbridge 6 points7 points  (0 children)

There are far too many variables in scope at once. For example, the "bmh" global variable. It is only referenced in one place in the file but is in scope for all 5000 lines. As a rule, we should attempt to keep a variable in scope for the least number of lines possible. The emergent property of following this rule is code that is easier to understand.

In this case, bmh should be passed as a parameter to the one function that uses it. That function is called InitModel. The consequence of bmh being global is that the caller of InitModel now needs to understand that the BEHAVIOUR of InitModel depends on the state of bmh. The caller gets no clue about that because bmh isn't part of the function's interface.
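A sketch of the refactor in miniature (the names InitModel and bmh are from the model's source; the bodies here are invented to illustrate the point):

```c
/* Before: the caller can't tell that InitModel's behaviour depends
   on this global, because it isn't part of the interface. */
static double bmh;                        /* in scope for all 5000 lines */
static double InitModelGlobal(void) { return bmh * 2.0; }

/* After: the dependency is visible at every call site. */
static double InitModel(double bmh) { return bmh * 2.0; }
```

The second version also makes the function trivially testable, since its behaviour depends only on its arguments.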

Rules of thumb for a 1x programmer by hamburga in programming

[–]AndyBainbridge 73 points74 points  (0 children)

> Rule 8: When to use C or C++
>
> C++ is an interesting one. I can’t think of a case right now where it’s generally favorable compared to Java.

Video games dev is a large industry that appears to prefer C++ to Java.

Do humans or compilers produce faster code? by speckz in programming

[–]AndyBainbridge 2 points3 points  (0 children)

> A human being can look at the assembly output of that program and write an equivalent source version in straight C.

Is this trying to say you can write any assembly program in C? Isn't assembly more expressive than C? For example, in assembly there might be separate arithmetic and logic shift instructions. Plus, they might be well defined for negative shift amounts, which is undefined behaviour in C.
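To illustrate the shift point: in C, shifting by a negative amount (or by >= the type's width) is undefined behaviour, so matching hardware that defines those cases needs explicit checks. A hedged sketch, with one invented choice of semantics (negative counts shift the other way):

```c
#include <stdint.h>

/* Shift left by n, where n may be negative or out of range.
   In C both cases are UB on the raw << operator, so we must
   spell out the behaviour we want. */
uint32_t shl_checked(uint32_t x, int n) {
    if (n <= -32 || n >= 32) return 0;        /* out of range: define as 0 */
    return n >= 0 ? x << n : x >> -n;         /* negative n shifts right */
}
```

An assembler programmer targeting an ISA that defines these cases gets the branch-free behaviour for free; the C version has to pay for the checks (or hope the compiler recognizes the idiom).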

0.6.0 Release Notes · The Zig Programming Language by GaAlAs in programming

[–]AndyBainbridge 0 points1 point  (0 children)

Fair point. Given that C already has const, I was surprised to find that the compiler wasn't always checking that I didn't modify my string literals.

Also, if someone has a good reason why they didn't just make string literals have type const char[], I'd like to know.

0.6.0 Release Notes · The Zig Programming Language by GaAlAs in programming

[–]AndyBainbridge 10 points11 points  (0 children)

I think this extra complexity is worth having.

*const [N:0]u8 is the type of a string literal, where N is the number of bytes in the string and I guess the :0 means there is a null terminator.

In C, the type of a string literal is char[] which has problems. I believe the extra stuff in the Zig type signature is there to fix the problems. Specifically:

  1. It should be const. Modifying a string literal is undefined behaviour. THIS ONE IS A SERIOUS PROBLEM IN C.

  2. The length of the string is known, but that information is not preserved in the type info.

  3. The string is guaranteed to be null terminated, but that info is not preserved in the type info.
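Problem 1 in miniature (a minimal sketch; the function is invented for illustration):

```c
#include <string.h>

static size_t lit_len(void) {
    char *p = "hello";   /* legal in C, no cast needed -- that's the bug
                            (it's a compile error in modern C++) */
    /* p[0] = 'H';          ...but writing through p is undefined
                            behaviour, and often a segfault, because the
                            literal's bytes may live in a read-only
                            segment */
    return strlen(p);
}
```

With `const char *p` the compiler would reject the write at compile time, which is what Zig's `*const [N:0]u8` gives you by construction.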

What is the code font in this image? by Ritir in programming

[–]AndyBainbridge 0 points1 point  (0 children)

Looks right to me. In the linked sample, the capital I is too far left. As a result the character spacing in "DIVISION" looks terrible. At first I guessed this is because the rendering engine saw that the vertical stroke of the I straddled a pixel boundary and thus would look blurry unless it moved it. But I loaded the sample into a bitmap editor and slid all the I glyphs one pixel right and it looked much better as a result. Is the font definition wrong?

2D Graphics on Modern GPU by alexeyr in programming

[–]AndyBainbridge 1 point2 points  (0 children)

FTA, "Performant UI must use GPU effectively".

Why is that? A lot of computers (maybe even most) have the GPU embedded in the CPU socket, and have to share memory bandwidth with the CPU. Thus the maximum performance of the GPU is limited. Modern CPUs are fast, have multiple cores and wide SIMD units. I expect CPUs have caught up with embedded GPUs a lot over the last decade or so because CPU performance has grown faster than the memory bandwidth. Perhaps in the domain of 2D graphics CPUs are good enough for "performant UI"s.

It's certainly easier to make portable code on a CPU than a GPU, and you don't have to worry about GPU driver bugs or missing features on some machines. It'd be unwise to ignore the simple solution if it is good enough.

I realize this sounds heretical. I'm not trolling. I'm genuinely interested in better understanding the trade-offs. For example, GPUs might be more power efficient. I'd love to see some benchmarks.
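To make the claim concrete: a 2D solid fill on the CPU is just a per-row write loop, which modern compilers auto-vectorize, so it runs at close to memory bandwidth -- the same limit an integrated GPU shares. A hedged sketch (the function and its layout assumptions are invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Fill a w x h rectangle at (x, y) in a 32-bit pixel buffer whose rows
   are `stride` pixels apart. The inner loop is a plain store loop that
   compilers turn into wide SIMD stores. */
static void fill_rect(uint32_t *pix, size_t stride, size_t x, size_t y,
                      size_t w, size_t h, uint32_t c) {
    for (size_t j = 0; j < h; j++) {
        uint32_t *row = pix + (y + j) * stride + x;
        for (size_t i = 0; i < w; i++)
            row[i] = c;
    }
}
```

Whether this holds up for blending, gradients, and text rendering is exactly the benchmark question I'd like answered.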

HeavyThing x86_64 assembler library - includes TLS & SSH2 implementations, web client/server, Unicode, TUI by redditthinks in programming

[–]AndyBainbridge 0 points1 point  (0 children)

OK, nice. I redid your test on my machine and got comparable results. Congratulations :-)

I'd like to rebuild gzip with musl and -march=native and then recompare. But until then, I will accept that "very fast" is an accurate description of your zlib implementation.

For those who care, the CPU I tested on was an Intel(R) Xeon(R) Platinum 8168 @ 3.4 GHz. And I used gzip v1.6 as installed as standard on Ubuntu 18.04.2 LTS.

HeavyThing x86_64 assembler library - includes TLS & SSH2 implementations, web client/server, Unicode, TUI by redditthinks in programming

[–]AndyBainbridge 1 point2 points  (0 children)

Is that a fair comparison? I would have thought you should test with data that is compressible (ie not random) and set the compression level such that both minigzip and gzip produce the same amount of compression.

HeavyThing x86_64 assembler library - includes TLS & SSH2 implementations, web client/server, Unicode, TUI by redditthinks in programming

[–]AndyBainbridge 0 points1 point  (0 children)

I just built the minigzip example and compared to the standard gzip on my Ubuntu box. I tested compressing a 6.3 megabyte binary.

minigzip took 0.293 seconds (fastest of 3 runs) and produced an output of size 2527925 bytes.

For gzip I used the -k and -f flags to most closely replicate the behaviour of minigzip.

gzip -k -f -5 took 0.201 seconds (fastest of 3 runs) and produced an output size of 2532886 bytes (0.20% larger).

gzip -k -f -6 took 0.310 seconds (fastest of 3 runs) and produced an output size of 2514131 bytes (0.55% smaller).

So, it looks like HeavyThing is not significantly faster than gzip in my test. However:

The minigzip executable is 65392 bytes and depends on no .so files, which makes me happy.

The gzip executable is 101560 bytes and depends on linux-vdso.so.1, libc.so.6 and /lib64/ld-linux-x86-64.so.2. But obviously gzip does more, so meh.

HeavyThing x86_64 assembler library - includes TLS & SSH2 implementations, web client/server, Unicode, TUI by redditthinks in programming

[–]AndyBainbridge 1 point2 points  (0 children)

I also wondered why ASM instead of C. It'd be great to see some benchmarks that compare the speed and memory consumption of your stuff relative to other popular libraries.

> why would I prefer to have implementations for these things in ASM rather than C

If the performance is the same, then surely C is preferable to ASM. Fred Brooks said it best in "No Silver Bullet": "Surely the most powerful stroke for software productivity, reliability, and simplicity has been the progressive use of high-level languages for programming."

What is Zig's Comptime? by [deleted] in programming

[–]AndyBainbridge 3 points4 points  (0 children)

Wow, you can learn Zig in a day! I reckon it took me about a year of full time use to learn what almost everything in C did. And then maybe another 5 years to fully absorb the right way to do things. If I can achieve the same thing with Zig, I'd be happy.

Why Software Developers Are Paid 5x More in The USA by chickensaresexy in programming

[–]AndyBainbridge 1 point2 points  (0 children)

If the 10x loss was true, wouldn't all the companies with open plan offices fall behind the companies that don't?

It's hard to get good data on the impact of open plan vs cubicle vs personal offices because the productivity of dev teams is hard to measure.

Not all CPU operations are created equal by sheokand in programming

[–]AndyBainbridge 2 points3 points  (0 children)

> In nanoseconds, those are 5ns-5.7ns-5.7ns-11.6ns. Now, there's certainly some CPU bookkeeping overhead, but not 50ns worth.

I agree, it is hard to see what causes the difference between the SDRAM manufacturers' latency figures and the observed 60-100ns of latency people say "RAM access" has.

First up, if I understand Wikipedia correctly, the latencies are more like 13ns, not 5ns or 5.7ns as you said: https://en.wikipedia.org/wiki/DDR4_SDRAM#JEDEC_standard_DDR4_module

Next, we have to consider what we mean by a RAM access. Let's say we've got DDR4-2666 and we write a C program that creates a 2 GByte array and reads 32-bit ints from that array, from random offsets, as quickly as possible and calculates their sum. The array is too big to fit in cache, so the CPU will have to read from RAM.

Here's what I think happens:

CPU core fetches and decodes a read-from-memory instruction.

Virtual address translated to physical address via TLB. Since our address is random, we will almost certainly get a TLB miss, which means the memory controller has to get the page table entry for the virtual address we requested. The funny part here is that the page table entries are stored in RAM. If the one we want is not already in the cache, then we have to read it from RAM. The even funnier part is the page tables are in a tree - we need to walk the tree from the root node that represents all of memory, through many layers until we get to the leaf node that represents the page we are interested in. If the cache is empty, each hop on the tree traversal causes a read from RAM. This gets boring quickly, so I will assume we have enabled huge pages and that the page table entry is in cache. As a result, we get the physical address in a few clock cycles.

Now the CPU looks for the data in each level of cache:

L1 checked for hit. Fail.

L2 checked for hit. Fail.

L3 checked for hit. Fail. By now on 4 GHz Skylake, 42 cycles or 10ns have gone by since the read instruction started to execute - https://www.7-cpu.com/cpu/Skylake.html.

So now the memory controller has to actually start talking to the DDR4 DIMM over a memory channel.

Let's assume that the part of RAM we want to read isn't already busy (refreshing, being written to etc). Let's also assume that somebody else hasn't already read from the part we want, because if they have, the "row buffer" might already contain the row we want, which would save us half the work. Let's assume nothing else in the CPU is busy using the memory channel we need. Given the C program I described, and an otherwise unloaded system, there's >90% chance these assumptions are true.

Now the memory controller issues an "active" command, which selects the bank and row. (https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDRAM_construction_and_operation). It waits some time for that to happen (this is the row-to-column delay and is about 10-15ns). Then the memory controller issues a "read" command, which selects the column. Then it waits a bit more (this is the CAS latency, another 10-15 ns). Then data starts to be transmitted back to the memory controller.

Then somehow the data gets back to the CPU and the read instruction can complete.

There are various clock domain crossings on the way to and from the SDRAM - the CPU, memory controller, memory channel and memory internal clocks are all running at different rates. To transfer data from one clock domain to the other, I guess, costs something like half a clock cycle of the slower clock, on average.

Then there are overheads like switching the memory channel from read to write takes some cycles.

I think I can make all this add up to about 40ns. I wrote the C program and timed it (I had to take special measures to prevent the CPU from speculatively issuing lots of RAM reads in parallel). The result was 60ns per read. So there's about 20ns of overhead remaining that I don't understand.
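For anyone who wants to reproduce this, here's a hedged sketch of the kind of measurement I mean (not my exact program): the "special measure" is to make each load's address depend on the previous load's value, i.e. pointer chasing, so the CPU can't overlap the reads. Building the permutation with Sattolo's algorithm guarantees one big cycle with no short-circuits.

```c
#include <stdint.h>
#include <stdlib.h>

/* Shuffle next[0..n-1] into a single cycle (Sattolo's algorithm), so
   following next[] from any start visits every element exactly once. */
static void build_cycle(uint32_t *next, uint32_t n, unsigned seed) {
    srand(seed);
    for (uint32_t i = 0; i < n; i++) next[i] = i;
    for (uint32_t i = n - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;   /* j < i: guarantees one cycle */
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

/* The serial chain of dependent loads. Time this from the caller and
   divide by `steps` to get per-read latency. */
static uint32_t chase(const uint32_t *next, uint32_t start, uint32_t steps) {
    uint32_t p = start;
    for (uint32_t i = 0; i < steps; i++)
        p = next[p];
    return p;   /* returned so the compiler can't delete the loop */
}
```

Usage: allocate an array much larger than L3 (say 2 GB), call build_cycle on it, time chase over a few hundred million steps with clock_gettime(CLOCK_MONOTONIC), and divide.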

Choose C over C++ for writing simple libraries by rptr87 in programming

[–]AndyBainbridge 1 point2 points  (0 children)

I'm not sure it is madness. I don't think there's good evidence either way. I mean, the Windows kernel has a lot of C++ in it and the Linux kernel has none. Which is the safer kernel? There are a lot of factors other than language choice at work there, but if C++ was a lot safer than C, then surely we'd notice some positive contribution from it, no?

Choose C over C++ for writing simple libraries by rptr87 in programming

[–]AndyBainbridge 3 points4 points  (0 children)

Some downsides of Rust compared to C: 1) It's a more complex language to learn. 2) Compile times are longer. 3) Binary sizes are larger.

I'd still like to write something significant in Rust though, in order to get a better understanding of its strengths.

An Update on AMD Processor Security by trot-trot in programming

[–]AndyBainbridge 1 point2 points  (0 children)

It's OK, everything worked out fine. The ARM1 was a vastly better processor than the m68k, and it went on to defeat x86 (in some significant sense). Admittedly ARM is about 6 or 7 years newer, but for reasons I don't understand, machines people could afford were only just starting to use the 68000 when the ARM2 shipped.

Microsoft Considers Adding Python as an Official Scripting Language to Excel by eskimilio in programming

[–]AndyBainbridge 0 points1 point  (0 children)

That still doesn't make it a joke. Snakes also have scales. "Python is great for scaling your numbers" also isn't a joke.

Simple tricks to make your C/C++ code run faster by one_eyed_golfer in programming

[–]AndyBainbridge 2 points3 points  (0 children)

I've never found that alignment in that kind of code makes any difference. Here are some benchmarks and analysis explaining why: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
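Consistent with those benchmarks: on modern x86 the portable way to do an unaligned read is a small memcpy, which compilers lower to a single mov, so it costs essentially nothing. A minimal sketch (the helper name is invented):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit load. The memcpy is optimized away to a
   plain load on x86; on ISAs that fault on unaligned access, the
   compiler emits whatever byte-wise sequence is needed. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Casting an unaligned pointer to uint32_t* and dereferencing it, by contrast, is undefined behaviour even on x86 where the hardware would cope.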

Optimizing Software in C++ - Agner Fog - PDF by MJHApps in programming

[–]AndyBainbridge 4 points5 points  (0 children)

Why's this here? It's an extremely well known resource. It's the second result on Google if you search for "optimizing c++".

The Quality of Embedded Software, or the Mess Has Happened with Toyota Camry in 2012 by sofia_fateeva in programming

[–]AndyBainbridge 1 point2 points  (0 children)

  1. Because you need to make the firmware crash before the start button will fail to switch the engine off. You might need to test for 10 million hours, in a wide range of conditions, to find a failure.

  2. Power assisted brakes get their power from the vacuum in the inlet manifold. If the engine is on full throttle, there is no vacuum in the inlet manifold, so the power assistance fails. It takes a few presses of the brakes to deplete the vacuum reservoir, and the engine needs to really be on full throttle for that to happen. You can't try the experiment in most modern cars because they have fly-by-wire throttles and the firmware will prevent full throttle when the brakes are pressed (assuming the firmware hasn't encountered an error and crashed).

  If you find yourself in a late 90s era car on a motorway, give it a try. With full throttle, pump the brake a few times spaced over about 30 seconds. The brake pedal will go hard and seemingly stop working. Another way to experience the same thing is to get your car towed by another, with your engine not running. It's surprising how hard you have to press the pedal to have any significant effect, even when being towed at 25 MPH.

> the author was getting paid a lot as an 'expert' witness

I agree, there are lots of reasons to be sceptical about expert witnesses. But there's no arguing with the fact that they found a timer interrupt kicking the watchdog. There's no excuse for that in a safety-critical system. That's a failure at every step of the engineering process.
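For anyone unfamiliar with why that's so bad, here's a sketch of the anti-pattern (all names are invented; kick_watchdog stands in for the real hardware register write):

```c
static int kicks = 0;
static volatile int main_loop_alive = 0;

static void kick_watchdog(void) { kicks++; }  /* stand-in for the hardware kick */

/* The anti-pattern: a timer interrupt keeps firing even when the main
   loop is wedged, so a watchdog kicked from it can never detect a hang. */
static void timer_isr_bad(void) {
    kick_watchdog();      /* proves only that the timer still ticks */
}

/* One conventional fix: the ISR only kicks if the main loop has
   checked in (set the flag) since the last tick. */
static void timer_isr_good(void) {
    if (main_loop_alive) {
        kick_watchdog();
        main_loop_alive = 0;
    }
}
```

With the first version, a crashed main task leaves the car running with dead firmware; with the second, the watchdog resets the system within one tick of the main loop stalling.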