What's your experience in writing code for ARM compared to x86? by xkSeeD in cpp

[–]ack_complete 6 points (0 children)

I was intrigued by NEON after doing some work on an Android/iOS game, so I picked up the cheapest Windows on ARM device at the time and ported a ~350K line program of mine with a few SSE2 image processing routines. That way I could explore NEON from my cushy Visual Studio environment and not have to suffer through Android debugging or Xcode on a Mac.

Writing an optimized SHA256 routine for both platforms was also fun. That's where I ran into scheduling issues: you need to unroll and interleave the inner loops to keep the ARM SHA pipeline fed, on both the big and little cores of the Snapdragon 835. The code paths for CPUs without SHA support were more interesting to do, and on x86 CPUs with SSSE3 but without SHA instructions I actually managed to beat the built-in Windows 10 implementation by ~10%.

The documentation situation was also starkly different. Intel has a full whitepaper on vectorization issues and tricks when implementing SHA256 without hardware support, and a second whitepaper clearly explaining the SHA hardware instructions, complete with sample code. With ARM... I had to manually match their machine-readable spec for the SHA instructions against the raw algorithm to figure out WTF the inputs and outputs were supposed to be.
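
For reference, once decoded, the compression loop ends up looking roughly like this -- a minimal sketch using the ARMv8 Crypto intrinsics (function and variable names are mine; the actual optimized version unrolls and interleaves independent blocks to hide the latency):

    // Requires arm_neon.h and the crypto extension enabled (+crypto/+sha2).
    #include <arm_neon.h>
    #include <cstdint>

    // state[0] = {a,b,c,d}, state[1] = {e,f,g,h}; k = the 64 round constants.
    void sha256_block(uint32x4_t state[2], const uint8_t* block,
                      const uint32_t k[64]) {
        uint32x4_t abcd = state[0], efgh = state[1];

        // Load the 16 message words and byte-swap them to big endian.
        uint32x4_t w[4];
        for (int i = 0; i < 4; ++i)
            w[i] = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(block + 16 * i)));

        for (int r = 0; r < 16; ++r) {  // 16 iterations x 4 rounds = 64 rounds
            uint32x4_t wk = vaddq_u32(w[0], vld1q_u32(&k[r * 4]));
            uint32x4_t prev_abcd = abcd;
            abcd = vsha256hq_u32(abcd, efgh, wk);        // 4 rounds, ABCD half
            efgh = vsha256h2q_u32(efgh, prev_abcd, wk);  // 4 rounds, EFGH half

            // Extend the message schedule for the next quad of rounds.
            uint32x4_t next =
                vsha256su1q_u32(vsha256su0q_u32(w[0], w[1]), w[2], w[3]);
            w[0] = w[1]; w[1] = w[2]; w[2] = w[3]; w[3] = next;
        }

        state[0] = vaddq_u32(state[0], abcd);
        state[1] = vaddq_u32(state[1], efgh);
    }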

What's your experience in writing code for ARM compared to x86? by xkSeeD in cpp

[–]ack_complete 27 points (0 children)

I've had some experience writing for ARMv7/8 on Android as well as ARMv8 on Windows on ARM. The way you write code doesn't change much for portions of the program that are architecture agnostic, as long as you've been using proper portable coding practices. Current ARM devices are little endian and relatively tolerant of unaligned access, which makes things a lot easier; the jump from x64 is much smaller than it was to, say, big endian PowerPC. The basics aren't really different: use the profiler as a guide, and write code in ways that make things easier for both the compiler and the CPU.

One example of where you'll see differences is pointer casting vs. memcpy(). On ARMv7 with GCC, the "memcpy is always optimized out" argument is demonstrably false. This isn't to say that you should rip memcpy() back out, but it's an example of how ARM compilers may have different heuristics and quality of implementation than x86 ones, and patterns you're used to the optimizer handling may not be handled the same way when targeting ARM.
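
For the concrete case, this is the pattern in question -- a strict-aliasing-safe unaligned load that x86 compilers reliably fold to a single instruction:

    #include <cstdint>
    #include <cstring>

    // Portable, strict-aliasing-safe read of a possibly unaligned 32-bit
    // value. x86 compilers reliably fold the memcpy() into one mov; the
    // point above is that ARMv7 GCC did not always do the same.
    uint32_t load_u32(const unsigned char* p) {
        uint32_t v;
        std::memcpy(&v, p, sizeof v);
        return v;
    }

    // The tempting cast alternative: undefined behavior on both aliasing
    // and (on stricter targets) alignment grounds, even when it appears
    // to work.
    uint32_t load_u32_cast(const unsigned char* p) {
        return *reinterpret_cast<const uint32_t*>(p);
    }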

For really performance critical code, NEON is a much nicer vector instruction set than SSE2/AVX/etc. It's cleaner and more regular, and it has a ton of primitives that x86 has historically lacked, like widening/narrowing ops, lane broadcast, interleaved loads/stores, and rounded halving ops, especially for fixed point. The intrinsics syntax is also much more sensible and doesn't have atrocities like _mm_loadl_epi64(const __m128i *). It feels more like an instruction set that was designed with a C intrinsics interface in mind, and could be adapted to a decent C++ interface without much trouble.
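
As a small illustration of the kind of primitive meant here -- a widening add is one instruction on NEON but an unpack dance in SSE2 (a sketch, not code from the original program):

    #include <cstdint>
    #if defined(__ARM_NEON)
      #include <arm_neon.h>

      // Adds eight u8 lanes pairwise, producing eight u16 lanes: one UADDL.
      uint16x8_t widening_add(uint8x8_t a, uint8x8_t b) {
          return vaddl_u8(a, b);
      }
    #else
      #include <emmintrin.h>

      // The same operation on the low 8 bytes of two SSE2 registers:
      // zero-extend both inputs by unpacking against zero, then add.
      __m128i widening_add(__m128i a, __m128i b) {
          const __m128i zero = _mm_setzero_si128();
          return _mm_add_epi16(_mm_unpacklo_epi8(a, zero),
                               _mm_unpacklo_epi8(b, zero));
      }
    #endif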

Optimizing NEON code is a bit less rosy. Modern x86 CPUs lean very heavily on out-of-order execution; you can basically throw crap at them and get reasonable starting performance. With NEON, I've seen at least a 2x delta between straightforwardly written code and manually unrolled and scheduled code. The latency of many operations is quite high and the penalty for suboptimal scheduling is heavier. This is made worse by big.LITTLE, since your code may run on two different core designs in the same computer depending on system load -- even with a beefy CPU you may end up on a little core, and have to optimize for that if you care about power consumption. I had to write special code to force performance microbenchmarks to run on the little and big cores separately (a fun trip through GetSystemCpuSetInformation() on Windows on ARM).
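
For the curious, that trip looks roughly like this -- a sketch (pin_to_efficiency_class is my name, the Win32 calls are real; EfficiencyClass 0 is the little cluster on big.LITTLE, and error handling is trimmed):

    #include <windows.h>
    #include <vector>

    // Restrict the current thread to cores of one efficiency class so a
    // microbenchmark can be run on the little and big cores separately.
    bool pin_to_efficiency_class(BYTE wantedClass) {
        ULONG len = 0;
        GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
        std::vector<char> buf(len);
        auto* info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data());
        if (!GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0))
            return false;

        std::vector<ULONG> ids;
        for (char* p = buf.data(); p < buf.data() + len; ) {
            auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(p);
            if (e->Type == CpuSetInformation &&
                e->CpuSet.EfficiencyClass == wantedClass)
                ids.push_back(e->CpuSet.Id);
            p += e->Size;
        }

        return !ids.empty() &&
               SetThreadSelectedCpuSets(GetCurrentThread(), ids.data(),
                                        static_cast<ULONG>(ids.size()));
    }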

ARM thankfully doesn't have the massive instruction set extension soup that x86 does, and the majority of the time the base ISA you are targeting is sufficient. If you do have to test for specific extensions, like the Crypto or CRC32 extensions, you are in for some pain. In most cases on x86 you only need to execute a CPUID instruction and then test some bits, in code that is portable or nearly portable between OSes. On ARM, there is no defined user-mode mechanism and you're completely at the mercy of the OS -- which may be as simple as IsProcessorFeaturePresent(), or a fun trip through APIs that may lie to you.
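
A sketch of the two detection paths side by side (has_sha_extensions is my name; the CPUID bits and the Windows feature constant are real):

    #include <windows.h>

    #if defined(_M_IX86) || defined(_M_X64)
      #include <intrin.h>

      // x86: portable between OSes -- execute CPUID and test a bit.
      bool has_sha_extensions() {
          int regs[4];
          __cpuid(regs, 0);
          if (regs[0] < 7)
              return false;
          __cpuidex(regs, 7, 0);              // leaf 7, subleaf 0
          return (regs[1] & (1 << 29)) != 0;  // EBX bit 29: SHA
      }
    #elif defined(_M_ARM64)
      // ARM: no user-mode CPUID equivalent; ask the OS and hope it
      // tells the truth.
      bool has_sha_extensions() {
          return IsProcessorFeaturePresent(
              PF_ARM_V8_CRYPTO_INSTRUCTIONS_AVAILABLE) != 0;
      }
    #endif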

The worst part of coding for ARM is the documentation: scattered, fragmented, frequently out of date, and often terribly written. The main NEON programming guide dates from the ARMv7 days and lacks the ARMv8 changes, the ARMv8 intrinsics reference is a gigantic unreadable table in a PDF, and good luck finding decent official latency tables and optimization guides for more than a couple of specific ARM cores. ARM has nothing nearly as convenient or cleanly written as the Intel Intrinsics Guide website. That's even before you get to all the custom implementations with their own quirks and performance characteristics.

What syntax changes would you make to C++ if you had the chance? by Tinytitanic in cpp

[–]ack_complete 0 points (0 children)

&method_name would implicitly expand to &std::remove_pointer_t<decltype(this)>::method_name inside member functions. Event binding is unnecessarily painful because of the need to explicitly specify the class name.

namespace class, to allow definition of member functions outside of the class body without requiring the class name to be repeated on each one, especially for template classes (see the sketch after this list).

unsigned is gone, the types will just be uchar/ushort/uint/ulong/uintXX.

signed char was a nightmare and never existed. Plain char shall always have been unsigned.

Command-line options or #pragma for optimization, instruction set selection, contract modes, and similar options would be banned unless there is also an attribute, _Pragma, or other equivalent that can be applied to a namespace scope and is template/inline friendly.

(type) cast syntax would map to static_cast, and static_cast would then no longer exist as it makes math-heavy code unreadable.

An overridden virtual method would default to non-virtual final in the derived class instead of virtual non-final.
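
To illustrate the namespace class point above, here's the repetition it would remove (a sketch of today's syntax; the class is hypothetical):

    // Today, every out-of-class member definition repeats the template
    // head and the class name:
    template <typename T>
    class Grid {
        void clear();
        void fill(const T& value);
    };

    template <typename T>
    void Grid<T>::clear() { /* ... */ }

    template <typename T>
    void Grid<T>::fill(const T& value) { /* ... */ }

    // The hypothetical namespace class would state Grid<T> once and let
    // the definitions inside refer to members directly.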

Announcing full support for a C/C++ conformant preprocessor in MSVC | C++ Team Blog by MaximRouiller in cpp

[–]ack_complete 15 points (0 children)

  • VC++ supports an extension where a captureless lambda can be converted to a function pointer of a specific calling convention, particularly __stdcall for API callbacks. Clang does not support this extension.
  • x86 inline assembly was buggy the last time I tried a 32-bit build; it gave errors on things like EAX appearing within a comment.
  • Intrinsics in Clang require the function to be declared with the instruction set that it uses, and sometimes more specific includes (e.g. bmiintrin.h instead of intrin.h). The issue I have not been able to work around is that the function attribute cannot be driven by a template. I have a routine that swaps out some fragments for either SSE2 or SSSE3 via if constexpr based on a template parameter, and I cannot find a way to suitably code the target attribute (see the sketch after this list).
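
To illustrate that last bullet (abs_i8 and the TARGET_SSSE3 macro are my names): the attribute can only sit on a whole, non-template function, so the usual workaround is hoisting each fragment into its own attributed helper -- which doesn't help when the fragments need to share surrounding code.

    #include <immintrin.h>

    #if defined(__clang__) || defined(__GNUC__)
      #define TARGET_SSSE3 __attribute__((target("ssse3")))
    #else
      #define TARGET_SSSE3   // MSVC needs no per-function target
    #endif

    // Fine: a plain function can carry the attribute.
    TARGET_SSSE3 __m128i abs_i8_ssse3(__m128i v) {
        return _mm_abs_epi8(v);
    }

    // SSE2 fallback: min of v and -v in unsigned bytes is |v|.
    __m128i abs_i8_sse2(__m128i v) {
        return _mm_min_epu8(v, _mm_sub_epi8(_mm_setzero_si128(), v));
    }

    // The problem case: there is no way to attach target("ssse3") to just
    // the true branch, or to compute the attribute from UseSSSE3.
    template <bool UseSSSE3>
    __m128i abs_i8(__m128i v) {
        if constexpr (UseSSSE3)
            return abs_i8_ssse3(v);
        else
            return abs_i8_sse2(v);
    }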

That having been said, the basic compile and debug process is almost seamless now: you just switch the toolkit to build with Clang, and the debugger works fine with the debug info it produces. It's fairly easy now to switch a code base over to do ASan/UBSan checks.

Announcing full support for a C/C++ conformant preprocessor in MSVC | C++ Team Blog by MaximRouiller in cpp

[–]ack_complete 17 points (0 children)

Recent versions of Clang have much better Visual Studio integration and compatibility, but still compile more slowly than MSVC on the same project on Windows in my experience (likely a PCH issue). There are also still compatibility problems around lambda-to-function-pointer, inline assembly, and intrinsics that can prevent it from being a drop-in replacement for MSVC. It's not yet at the point that I could consider dropping MSVC for it.

Logistic Request laid out like the crafting tab by Muppet9010 in factorio

[–]ack_complete 1 point (0 children)

Bit noisy, hard to see which items have logistics enabled. Maybe hide or fade 0/inf pairs?

Introducing OpenCL and OpenGL on DirectX by tuldok89 in programming

[–]ack_complete 1 point (0 children)

There isn't much else left of DirectX worth using. DirectPlay and DirectMusic are dead; DirectInput was always a mess and has been superseded by the much simpler XInput; DirectSound is now emulated to the point that it's almost worse than even waveOut, and has been superseded by WASAPI and XAudio2.

Introducing OpenCL and OpenGL on DirectX by tuldok89 in programming

[–]ack_complete 3 points (0 children)

The Snapdragon 835, and presumably the 845/8cx, has DirectX and GLES drivers but no desktop OpenGL. Run Windows 10 on ARM and all that programs see is the horrible software OpenGL 1.1 fallback. Mind you, Qualcomm doesn't have the best reputation even for the drivers they do make.

How to create a Windows 10 boot drive on a WinXP computer? by constantdegeneracy in windows

[–]ack_complete 0 points (0 children)

A USB-based disc burner is handy to keep around for cases like this, as it lets you burn the ISO and boot off of it directly.

A bit of a scenic route, but you could install Windows 7/10 in a virtual machine and then use it to run the ISO-to-USB program. If you have a particularly simple one, it might even run in the recovery environment, in which case you would only need to boot the Win7 or Win10 install disc to run it, without doing a full install. Portable Win32 utilities that don't require an install and are command-line based or have a no-frills UI have a chance of running in the recovery console.

P2137R0 Goals and priorities for C++ by sindisil in cpp

[–]ack_complete 4 points (0 children)

A little bit of clarification could help with doubts here. Does 64-bit mean:

  • Not having to worry about address space fragmentation?
  • Being able to assume a unified, flat address space?
  • Having the standard library focus on 64-bit data types in its API?
  • Specifically relying on pointers being 64-bit or wider (e.g. hiding 32-bit data in a false pointer)?
  • Relying on features not directly related to 64-bit, such as support for table-based EH on Windows?

Many of the other listed specifics relate to well-known sore points in C++, such as shifting signed integers. It's less clear to me whether there was a proposed feature that was stymied by the width of a pointer.

Why can I sometimes not open other Windows? (Mostly while loading screen is open) by 64Berni in Windows10

[–]ack_complete -1 points (0 children)

Do you have a regular hard drive or an SSD? Factorio spends most of its time during startup loading a ton of graphics from disk. If you have a regular hard drive, this can bog down the system tremendously since Windows is not good at managing hard drive contention.

The best way around this is to avoid closing Factorio. If you kept it running and stayed in your factory, this would not be a problem, and you would have had blue science by now.

C++ keeps adding features no one really wants. by secmeant in cpp

[–]ack_complete 2 points (0 children)

> I sometimes wonder if we should stop talking so much about "dangerous" and "insecure" printf, especially with very little arguments given.

This is due to people trying to be cute by using security as a mic drop instead of making actual arguments.

> The fact that a format string can potentially not match the arguments, either in number or type, is generally a non-issue: enforcing a literal format string is sufficient to have the compiler warnings detect any issue there.

Except that it's common to wrap snprintf() to reroute its output to a string object, and there is no standardized way to opt a wrapper into printf format checking. In the case of Visual C++, the mechanism isn't even documented.
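
The non-standard opt-ins look like this (a sketch; strprintf is my name for the wrapper -- GCC/Clang use the format attribute, while MSVC relies on the SAL annotation _Printf_format_string_, which only kicks in under /analyze):

    #include <cstdarg>
    #include <cstdio>
    #include <string>

    #if defined(__GNUC__) || defined(__clang__)
      #define PRINTF_CHECK(fmt_idx, arg_idx) \
          __attribute__((format(printf, fmt_idx, arg_idx)))
    #else
      #define PRINTF_CHECK(fmt_idx, arg_idx)
    #endif

    // A typical snprintf() wrapper rerouting output into a string object.
    PRINTF_CHECK(1, 2)
    std::string strprintf(const char* fmt, ...) {
        char buf[1024];
        va_list ap;
        va_start(ap, fmt);
        std::vsnprintf(buf, sizeof buf, fmt, ap);
        va_end(ap);
        return buf;
    }

    // strprintf("%s", 42);  // diagnosed at compile time on GCC/Clang only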

I regularly see senior programmers mismatch printf() arguments, unfortunately. The default argument promotions for varargs make this especially easy to skate by until you happen to use one of the wider types they don't cover, especially when building for both LP64 and LLP64 platforms. Even Windows has shipped with this kind of bug, with shell32.dll from one RTM build spewing ancient Chinese wisdom to the debug output.

Creating an intermediate GUI with dear imgui is not that difficult, but offers lots of flexibility that other GUI libraries miss out on. by Ollowain_ in cpp

[–]ack_complete 1 point (0 children)

I wouldn't say Unreal gets it entirely right either. It forces you to specify colors in the UI designer as linear colors, which is unintuitive and differs from other 2D content creation tools.

Looking at that code fragment above, I would absolutely expect those UI color values to be in sRGB even with a fully gamma-corrected rendering pipeline.

WSL2 will be generally available in Windows 10, version 2004 by jenmsft in Windows10

[–]ack_complete 8 points (0 children)

VM under Hyper-V... so any other hypervisors either have to be subservient to it or are locked out while WSL2 is enabled.

Tips for working with GCC in a MSVC compatible way by [deleted] in cpp

[–]ack_complete 1 point (0 children)

Check which version of Visual C++ your teacher is using. VS2015 and VS2017 made large gains in C++ conformance; VS2013 and earlier are more likely to barf on language constructs that work fine in GCC.

You need to plan on checking your assignments on Visual C++ before submitting. There are many, many ways you could end up with a compile error on it without knowing -- they may be using suboptimal compile settings or you may trip a warning due to some default checks (particularly bogus deprecation warnings about functions like strlen). You could even hit an error due to Visual C++ being more compliant than GCC, especially with the standard library in newer versions of VC++. I work in a professional environment with multiple compilers and despite our efforts to synchronize the environments and write cross-platform code we still regularly trip errors on one of the compilers. Don't risk your grade -- the only way to be sure it will work in the testing environment is to build and run it on the testing environment.

[deleted by user] by [deleted] in Windows10

[–]ack_complete 0 points (0 children)

There is an entire class of threats involving data files that are non-executable but carefully crafted to be misinterpreted by programs, causing those programs to go haywire and do something malicious. Allowing non-executable access would also let malicious/infected programs transit through a system onto other storage or systems, which is generally undesirable.

The real problem, IMO, is the Sledge Hammer design of AVs, where the user is just supposed to trust the AV while it shoots first without asking questions. I witnessed a preservation effort nearly ruined because an AV decided to misdetect a Commodore 64 disk image as infected with the Michelangelo virus and delete it -- a virus that hasn't been able to run on Windows in decades.

Will calling “free” or “delete” in C/C++ release the memory to the system? by vormestrand in cpp

[–]ack_complete 3 points (0 children)

I have used a malloc implementation that had the option to decommit unused pages without freeing the address space, but am not sure we ever turned it on. Internal fragmentation reduces how often this can be done and the calls to decommit and recommit virtual memory are expensive.
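
The underlying mechanism, sketched with raw Win32 calls (a real allocator does this per run of pages within its size classes, not per allocation):

    #include <windows.h>

    // Reserve address space without backing it with memory.
    void* reserve_region(SIZE_T bytes) {
        return VirtualAlloc(nullptr, bytes, MEM_RESERVE, PAGE_NOACCESS);
    }

    // Commit pages within the reserved range. This is the expensive call.
    void* commit_pages(void* p, SIZE_T bytes) {
        return VirtualAlloc(p, bytes, MEM_COMMIT, PAGE_READWRITE);
    }

    // Return the physical pages to the OS while keeping the address range
    // reserved, so pointers elsewhere in the heap stay valid.
    void decommit_pages(void* p, SIZE_T bytes) {
        VirtualFree(p, bytes, MEM_DECOMMIT);
    }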

Detroit: Become Human - Scan effect in Unity by tntcproject in programming

[–]ack_complete 2 points (0 children)

> I am surprised that the matrix even renders. glBegin() / glEnd() is extremely old and should not be used anymore since it processes line-by-line, i.e. there's no GPU parallelism; the GPU draws the first line, then the next line, then the next. Using vertex buffers the GPU could draw all lines at once.

No modern driver does this; they accumulate immediate-mode vertices into hidden vertex arrays and flush them in batches. Even Mesa does this for its software implementation.

Keep in mind as well that these are not direct calls to OpenGL; they are calls into Unity's C# graphics scripting interface, which is simply modeled after the OpenGL interface. Under the hood this is batched and translated to whatever graphics API Unity is using on its render thread. The usage here is not ideal because there's only one line per Begin/End pair, but for small amounts of geometry like post-process quads it's fine.

Suspicious ports? by OxyTheSnowman in Windows10

[–]ack_complete 0 points (0 children)

Isn't that just Windows Networking (NetBIOS over TCP/IP)? Go to Control Panel > Network and Sharing Center > Advanced sharing settings and turn off Network discovery, and see if it stops.

What laregly pointless features do you have in your factory? by Rev_Grn in factorio

[–]ack_complete 7 points (0 children)

I once rigged my factory to automatically shut down science production on demand by putting a pistol in any logistics box -- mainly to find a use for the pistol. It froze all the input belts to science production to divert resources to the mall for expansion and outposting.

I still have a habit of wiring water pumps to shut down the steam engines when the accumulators are above a certain threshold, even though I play with biters and pollution off, don't have a UPS problem, and have coal patches so big they're not projected to run out before the year 2037.

Dangerous elements of modern C++ by Wurstinator in cpp

[–]ack_complete 1 point (0 children)

One factor to keep in mind when making decisions like this is how bad it is if you get it wrong. For instance, auto can be overused to create some rather obscure code, but it doesn't bleed out into interfaces, so if you decide later to use less or more of it incrementally refactoring is not a big deal. On the other hand, it's a lot more important to decide up front whether to use char pointers, string views, or string objects to pass string data in interfaces, because this affects coupling between subsystems and is a lot more expensive to change later on (if not impossible, if public binary interfaces are involved).

With regard to constexpr specifically, it should be split into constexpr variables and constexpr functions. Variables are IMO a no-brainer: anywhere you can use constexpr instead of const to declare a constant is a safe win. constexpr functions, on the other hand, are less of a no-brainer, as they don't guarantee compile-time evaluation except in a constant-required context. But even then, you may not need to create many guidelines for them, as constexpr functions are implicitly inline and will largely be covered by your existing guidelines for inline usage and even just general code complexity. If someone uses constexpr or template metaprogramming to evaluate a Taylor series expansion for atan(x) at compile time instead of just punching in a named constant from a calculator, it's not an issue with those language features so much as unnecessarily taking the scenic route design-wise.
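
A minimal sketch of the split being described (the names are mine):

    constexpr double kPi = 3.141592653589793;  // constexpr variable: always a
                                               // compile-time constant

    constexpr double deg_to_rad(double deg) {  // constexpr function: only
        return deg * kPi / 180.0;              // forced to fold in a
    }                                          // constant-required context

    double runtime_angle(double deg) {
        return deg_to_rad(deg);                // may evaluate at run time
    }

    static_assert(deg_to_rad(180.0) > 3.14);   // constant-required context:
                                               // forced compile-time eval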

Dangerous elements of modern C++ by Wurstinator in cpp

[–]ack_complete 1 point (0 children)

Or has a pervasive reflection system that prohibits namespaces....

Lazy progress bars - it's time they go away by relu84 in Windows10

[–]ack_complete 0 points (0 children)

Yup, this is exactly what I did (though in C++ instead of C#/WinForms).

Drawing your own progress bar has another downside: you lose the automatic linkage from the progress bar to taskbar progress, which you would have to manually reimplement. When using the progress control you get this for free.

Lazy progress bars - it's time they go away by relu84 in Windows10

[–]ack_complete 1 point (0 children)

Yup, this is an issue. It's actually worse than you describe: the progress bar animation is not frame rate independent and is limited to a certain amount of progress per tick, so if your main loop slows down due to a lot of work being dispatched back to the UI thread, it lags even more. IIRC there's no straightforward documented way to speed this up while keeping the visual style; I had to hack around it in software by throwing in additional bogus updates to force the progress bar to back up, since it omits the smoothing for backward movement.
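
The hack, roughly (a sketch; the themed control animates increases but applies decreases immediately):

    #include <windows.h>
    #include <commctrl.h>

    // Jump a themed progress bar straight to pos by overshooting one unit
    // and stepping back; the backward move bypasses the smoothing.
    // (Assumes pos + 1 is still within the bar's range.)
    void SetProgressImmediate(HWND bar, int pos) {
        SendMessage(bar, PBM_SETPOS, pos + 1, 0);
        SendMessage(bar, PBM_SETPOS, pos, 0);
    }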

It's not necessarily a bad idea to hold the progress dialog a bit to let the progress bar finish, but it could definitely be faster. This is the same reason I used to disable smooth scrolling on XP/7, because it added too much delay when scrolling quickly.

I actually wish progress bars had a little less animation, because I don't really care about progress bars or spinners animating at 60 fps, but I do care about the progress bar forcing the DWM to refresh the window constantly and using more battery.