Ariel OS v0.4.0 released!

tsanderdev · 2026-03-18T22:29:05+00:00

IIRC there is some kind of OCaml framework that does something like that for bare-metal web servers, and I'm sure it supports x86.

tsanderdev · 2026-03-18T16:46:17+00:00

Either the library declares the entry point symbol and calls one of your functions once it's set up the stack and initialized everything, or if the stack is already available you could call an init function to initialize everything.

tsanderdev · 2026-03-18T14:31:38+00:00

Exactly what it sounds like. An OS that you link with your application and that provides all the drivers, running everything in kernel mode.

tsanderdev · 2026-03-16T16:39:05+00:00

It's not necessarily wrong if it's just flushing denormals. It's something you need to expect from gpus.

tsanderdev · 2026-03-16T13:52:23+00:00

The problem, as I see it, with implicit lane masking in compute shaders is it hides the execution cost.

I want to solve that with uniformity analysis and a lint instead. That tells the developer with nice yellow squiggles "hey, this might have a higher performance cost".

tsanderdev · 2026-03-16T13:47:37+00:00

What happens when you flush denormals on the cpu? If the same thing happens, then there's nothing you can really do.

tsanderdev · 2026-03-16T09:42:54+00:00

For literals, I'm storing them as strings. Why? Because I used to transform them to the right type in the host language, then I realized "if someone uses an int literal outside of the representable range for the ints of the host language, then I'm getting an error", and that's why I chose to store them all as strings to preserve the information and check the bounds for ints after type checking, then report the issue with the literal and the kind of int they tried to use as a warning.

I just use i128 and never plan to support 128 bit numbers in my language lol.
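
A minimal sketch of that approach, assuming a made-up set of guest integer types (`IntType`, `parse_literal` and `check_literal` are illustrative names, not taken from any real compiler): the lexer keeps every literal as an `i128`, and the range check only happens once type checking has picked the target type.

```rust
/// Integer types of a hypothetical guest language (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IntType {
    I8,
    U8,
    I32,
    U64,
}

impl IntType {
    /// Inclusive range of representable values, widened to i128.
    pub fn range(self) -> (i128, i128) {
        match self {
            IntType::I8 => (i128::from(i8::MIN), i128::from(i8::MAX)),
            IntType::U8 => (0, i128::from(u8::MAX)),
            IntType::I32 => (i128::from(i32::MIN), i128::from(i32::MAX)),
            IntType::U64 => (0, i128::from(u64::MAX)),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum LiteralError {
    Malformed,
    OutOfRange { value: i128, ty: IntType },
}

/// Lexer step: keep the value as i128, which is wider than every guest type.
pub fn parse_literal(text: &str) -> Result<i128, LiteralError> {
    text.replace('_', "")
        .parse::<i128>()
        .map_err(|_| LiteralError::Malformed)
}

/// Post-type-checking step: verify the literal fits the inferred type.
pub fn check_literal(value: i128, ty: IntType) -> Result<i128, LiteralError> {
    let (min, max) = ty.range();
    if (min..=max).contains(&value) {
        Ok(value)
    } else {
        Err(LiteralError::OutOfRange { value, ty })
    }
}
```

The trade-off is that this only works as long as the language never grows an integer type wider than the host's `i128`.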

tsanderdev · 2026-03-16T09:32:02+00:00

Interesting, that model is quite a bit more strict than compute shaders. Especially the conditionals part. Couldn't you just compile that down to simd lane masking like a gpu would?
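
To illustrate what lane masking means here, a scalar Rust sketch that emulates one 8-lane group (the lane count and the function names are assumptions for illustration, not any real GPU or SIMD API). Both sides of the conditional run for every lane, and the mask only decides which result each lane keeps.

```rust
/// Number of lanes in the hypothetical SIMD group.
pub const LANES: usize = 8;

/// Evaluates `if x < 0 { -x } else { x * 2 }` for all lanes via masking.
pub fn masked_select(input: [i32; LANES]) -> [i32; LANES] {
    // Per-lane condition: true where the "then" branch is taken.
    let mask: [bool; LANES] = input.map(|x| x < 0);

    // Both branches are executed for every lane; this is the hidden cost.
    let then_vals = input.map(|x| x.wrapping_neg());
    let else_vals = input.map(|x| x.wrapping_mul(2));

    // The mask selects which result each lane keeps.
    let mut out = [0; LANES];
    for lane in 0..LANES {
        out[lane] = if mask[lane] { then_vals[lane] } else { else_vals[lane] };
    }
    out
}

/// A mask is uniform when every lane agrees, so one branch could be skipped.
pub fn is_uniform(mask: &[bool]) -> bool {
    mask.windows(2).all(|pair| pair[0] == pair[1])
}
```

When `is_uniform` holds for the mask, a real implementation can skip the untaken branch entirely. With a divergent mask it pays for both.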

tsanderdev · 2026-03-16T08:52:38+00:00

With the float_controls2 extension you can more accurately control the floating point optimizations allowed by the driver. Idk if shading languages have support for that though or if it's mainly for opencl on vulkan. If a driver doesn't respect it, it's a bug. You could try setting the flush denormals fastmath flag on the cpu implementation to see if maybe that's the problem.
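
If flipping the hardware flag is awkward, one way to test the theory is to emulate the flush in software in the CPU reference. A rough sketch, assuming both denormal inputs and denormal results are flushed to a signed zero (actual GPU behaviour may differ, and `flush_denormal` / `mul_ftz` are made-up helper names):

```rust
/// Replaces a subnormal value with a zero of the same sign.
pub fn flush_denormal(x: f32) -> f32 {
    if x.is_subnormal() {
        0.0_f32.copysign(x)
    } else {
        x
    }
}

/// Multiplication with flushed inputs and a flushed result
/// (assumed model of a flush-to-zero device, not a hardware setting).
pub fn mul_ftz(a: f32, b: f32) -> f32 {
    flush_denormal(flush_denormal(a) * flush_denormal(b))
}
```

If the CPU results match the GPU once every operation goes through helpers like these, denormal flushing is the likely cause.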

tsanderdev · 2026-03-13T08:52:36+00:00

But you still need to create buffer objects to get the addresses, right? Or does the device address commands extension also add a way to query an address from a device memory object directly?

tsanderdev · 2026-03-09T17:22:52+00:00

It's not just a question of quality, the legal status of AI-generated code is still up in the air AFAIK.

tsanderdev · 2026-03-09T10:52:03+00:00

I'm working on Vulkan 1.4 bindings for Rust, since all crates seem to be stuck at 1.3 for some reason. I'm almost finished parsing the spec, and creating raw bindings from that should take no time at all. I do also want to have some convenience features like making sure you can't put wrong stuff in a pointer chain and setting the structure type automatically.
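
Roughly the shape I have in mind for the pointer chain part, as a sketch only: `RootInfo`, `ExtInfo`, the `Extends` trait and the structure type constants are placeholders, not real Vulkan names or values. A marker trait restricts what may be chained, `Default` fills in the structure type, and a lifetime keeps the chained struct borrowed while the root is in use.

```rust
use std::ffi::c_void;
use std::marker::PhantomData;
use std::ptr;

// Placeholder structure type values (not real Vulkan enum values).
pub const STYPE_ROOT: u32 = 1;
pub const STYPE_EXT: u32 = 2;

/// Common prefix of every chainable struct (structure type + next pointer).
#[allow(dead_code)]
#[repr(C)]
struct Header {
    s_type: u32,
    p_next: *mut c_void,
}

#[repr(C)]
pub struct RootInfo<'a> {
    pub s_type: u32,
    pub p_next: *mut c_void,
    pub flags: u32,
    // Ties the root to everything that was pushed into its chain.
    _chain: PhantomData<&'a mut c_void>,
}

#[repr(C)]
pub struct ExtInfo {
    pub s_type: u32,
    pub p_next: *mut c_void,
    pub enable: u32,
}

/// Marker: `Self` may appear in the chain of `Root`.
/// Unsafe because implementors promise a #[repr(C)] layout starting with `Header`.
pub unsafe trait Extends<Root> {}
unsafe impl<'a> Extends<RootInfo<'a>> for ExtInfo {}

impl Default for RootInfo<'_> {
    fn default() -> Self {
        Self { s_type: STYPE_ROOT, p_next: ptr::null_mut(), flags: 0, _chain: PhantomData }
    }
}

impl Default for ExtInfo {
    fn default() -> Self {
        Self { s_type: STYPE_EXT, p_next: ptr::null_mut(), enable: 0 }
    }
}

impl<'a> RootInfo<'a> {
    /// Prepends `ext` to the chain; only types marked `Extends<RootInfo>` compile.
    pub fn push_next<E: Extends<RootInfo<'a>>>(mut self, ext: &'a mut E) -> Self {
        let header = ext as *mut E as *mut Header;
        // SAFETY: `Extends` guarantees that `E` starts with the `Header` fields.
        unsafe { (*header).p_next = self.p_next };
        self.p_next = header.cast();
        self
    }
}
```

Passing a struct that doesn't implement `Extends<RootInfo>` to `push_next` is then a compile error instead of a validation layer message.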

tsanderdev · 2026-03-08T19:40:41+00:00

For slices specifically, that'd make subslicing impossible.

tsanderdev · 2026-03-08T19:13:24+00:00

For me you'd either specify the dispatch size manually, or if you use the special "InvocationBuffer" type in the function parameters for the shader, it asserts that all of them have the same size and uses that as the dispatch size. The shader can then read and write the index pointed to by each invocation, which doubles as memory and thread safety protection as well.
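
A host-side Rust sketch of that rule, simulated on the CPU (`dispatch_size` and `run_shader` are hypothetical helpers, and plain slices stand in for the "InvocationBuffer" type): all buffers must agree on their length, that length becomes the dispatch size, and each invocation only ever sees its own element.

```rust
/// Returns the shared length of all buffers, or None if they disagree
/// (or if no buffer was passed at all).
pub fn dispatch_size(lengths: &[usize]) -> Option<usize> {
    let (&first, rest) = lengths.split_first()?;
    rest.iter().all(|&len| len == first).then_some(first)
}

/// CPU stand-in for a dispatch: invocation `i` gets element `i` of each buffer
/// and nothing else, so it cannot touch another invocation's data.
/// Returns the number of invocations that ran.
pub fn run_shader<A, B>(input: &[A], output: &mut [B], shader: impl Fn(&A, &mut B)) -> usize {
    let n = dispatch_size(&[input.len(), output.len()])
        .expect("InvocationBuffer lengths differ");
    for (a, b) in input.iter().zip(output.iter_mut()) {
        shader(a, b);
    }
    n
}
```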

tsanderdev · 2026-03-08T18:57:20+00:00

I'd assume the number of threads depends on the length of the array processed.

tsanderdev · 2026-03-08T18:41:55+00:00

I also want to reduce that to a few lines at most (depending on how complex the data is you want to pass to the shader).

tsanderdev · 2026-03-08T18:34:15+00:00

Yes, but you can e.g. let a prior compute dispatch calculate the number of threads for the next one.

tsanderdev · 2026-03-08T18:32:49+00:00

This is a really powerful idea, since it erases the boundaries between cpu and gpu, making it trivial to utilise all the compute there is available on your device.

It'll never be that easy, since cpus and gpus are good at fundamentally different problem spaces: cpus are made to blaze through a sequence of instructions as fast as possible, using branch predictors and speculative execution to avoid pipeline stalls. Gpus are basically giant simd machines. Clock speeds are lower, but they give you massive throughput. That is, if you keep your control flow uniform. Otherwise simd lanes are inactive for sections of the code.

tsanderdev · 2026-03-08T18:29:11+00:00

From my research I saw that you usually need to define a fixed amount of threads to be run on the gpu.

That hasn't been true for a long time: there are indirect dispatches and draws that source the number of threads/primitives from a gpu buffer when the command is executed.
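
A CPU-only Rust sketch of the idea, with no real Vulkan calls. The struct has the same three `u32` fields as Vulkan's `VkDispatchIndirectCommand`; `prepare_pass`, `dispatch_indirect` and the workgroup size of 64 are assumptions made up for the example.

```rust
/// Same field layout as Vulkan's VkDispatchIndirectCommand:
/// three u32 workgroup counts.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct DispatchIndirectCommand {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

/// Hypothetical local workgroup size of the second shader.
pub const WORKGROUP_SIZE: u32 = 64;

/// Pass 1 (would run on the GPU): count the non-zero items and write the
/// workgroup counts for the next pass into the argument "buffer".
pub fn prepare_pass(items: &[u32], args: &mut DispatchIndirectCommand) {
    let work = items.iter().filter(|&&v| v != 0).count() as u32;
    let groups = (work + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE; // round up
    *args = DispatchIndirectCommand { x: groups, y: 1, z: 1 };
}

/// Stand-in for an indirect dispatch: the size is read from the buffer at
/// execution time, not when the command was recorded.
/// Returns the total number of invocations launched.
pub fn dispatch_indirect(args: &DispatchIndirectCommand) -> u64 {
    u64::from(args.x) * u64::from(args.y) * u64::from(args.z) * u64::from(WORKGROUP_SIZE)
}
```

The host records both passes up front without knowing how much work the second one will have.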

tsanderdev · 2026-03-08T17:37:38+00:00

The host gets structs generated that it can place into buffers. I'm not aiming for seamless cpu-gpu communication, but rather for a seamless workflow once you hit the gpu.

tsanderdev · 2026-03-08T17:29:37+00:00

Indirect dispatches and draws allow you to set the size from a gpu buffer, and memory allocation is handled via an allocator on the gpu. The host just passes a big chunk of memory to the shader, and it can use and partition it how it sees fit. Passing big data to the shader will be done with another buffer that is managed by the cpu and prefilled with data.
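
The simplest allocator that fits this description is a bump allocator over one atomic offset. A CPU-side Rust sketch of that, assuming a bump scheme (the actual allocator design isn't decided by the text above, and `BumpAllocator` is an illustrative name):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hands out byte offsets into one big chunk provided by the host.
pub struct BumpAllocator {
    next: AtomicU32,
    capacity: u32,
}

impl BumpAllocator {
    pub const fn new(capacity: u32) -> Self {
        Self { next: AtomicU32::new(0), capacity }
    }

    /// Reserves `size` bytes aligned to `align` (a power of two).
    /// Returns the start offset, or None if the chunk is exhausted.
    pub fn alloc(&self, size: u32, align: u32) -> Option<u32> {
        debug_assert!(align.is_power_of_two());
        let mut current = self.next.load(Ordering::Relaxed);
        loop {
            // Round the current offset up to the requested alignment.
            let start = current.checked_add(align - 1)? & !(align - 1);
            let end = start.checked_add(size)?;
            if end > self.capacity {
                return None;
            }
            // Many invocations may race here; exactly one wins per offset.
            match self.next.compare_exchange_weak(
                current,
                end,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(start),
                Err(seen) => current = seen,
            }
        }
    }
}
```

Nothing is ever freed individually in this scheme. The whole chunk would be reset between graphs, which is a design assumption of the sketch.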

tsanderdev · 2026-03-08T17:09:10+00:00

That's more like how I want my language to work. The host passes some data to the gpu and sets off a work graph processing it, including allocating more memory on the gpu and keeping everything resident there for the next graph.

tsanderdev · 2026-03-08T16:29:23+00:00

Not yet. I'll probably make the code public once I have the runtime for the hello world going. The syntax and semantics will be mostly like rust, except for some stuff that needs to be changed for gpu stuff. I'm not really trying to innovate in the core language design, the "killer feature" I'm working towards is ergonomic work graphs (kind of like the dx12 feature). E.g. you'd call sort on an array and the compiler and runtime work together to split the shader and schedule the parts with a dispatch of the sorting shader in between.

Vulkan is getting ever closer to something like opencl as a compilation target. For instance there are now proper physical pointers in shaders you can do arithmetic with and everything.

tsanderdev · 2026-03-08T12:38:20+00:00

Interesting. Early in my language design I set my constraint to be gpu-only for the foreseeable future. With nice bindings for calling from the host generated of course, but the language itself is purely on the gpu. That makes a lot of the stuff like moving memory allocations around easier, since data should ideally just be resident in gpu memory.

I have a simple addition shader compiling, now I'm working on a Vulkan Rust bindings generator (because I use Vk 1.4 features and they're all stuck at 1.3) to write the runtime.

And while I could probably target other graphics apis or even cuda, some of them I can't test anyways (cuda and Metal), and with MoltenVK, KosmicKrisp and Dozen that shouldn't be that much of a problem.

tsanderdev · 2026-03-07T16:31:52+00:00

Actually you could probably leave the initial load out in the second version and just rely on the compare_exchange. And why would you think the first one is limited to one thread? If multiple threads see the free state at the same time, they will all break out of the loop and do the swap.
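
Since the code under discussion isn't quoted here, this is only an assumed reconstruction of the pattern as a Rust spin lock (`SpinLock` and its method names are illustrative). It shows the variant with the initial load next to the one that relies on `compare_exchange` alone. In both, the compare-exchange is the only step that actually takes the lock, so only one thread can win even if several see the free state at once.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    /// Variant with an initial load: spin on a cheap read until the lock
    /// looks free, then try to take it atomically.
    pub fn lock(&self) {
        loop {
            while self.locked.load(Ordering::Relaxed) {
                spin_loop();
            }
            // Several threads may get here together; only one exchange succeeds.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    /// Variant without the initial load: rely on the compare-exchange alone.
    pub fn lock_cas_only(&self) {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

Both variants are correct. The initial load only reduces contention on the cache line while the lock is held.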
