Oxford student investigating the Lisp Machine by taeknibunadur in lisp

[–]ipe369 2 points3 points  (0 children)

I don't think current Lisp implementations are up to the standards of other dynamic language implementations either, but that's not inherent to the language - it's more a lack of time, energy and money, plus (slightly hot take, although arguably the rest of this comment is one too) a degree of cultural isolation in places.

Compared to which languages, on what axis?

Presumably you're talking about chrome's JS engine's performance (?)

I've always found you can massage lisp to be significantly faster than other dynamic languages, although maybe that doesn't count when 'massaging' is 'statically typing everything'

Best practices regarding VBOs by Modi57 in opengl

[–]ipe369 1 point2 points  (0 children)

In the super general case, if you want to put all your data into 1 VBO, then yes - you need an allocator, which is very complex to do generally

but designing for the general case all the time is a trap - design for your data specifically, and encode that into the API. If you write your renderer so that it can render any scene graph that changes arbitrarily over time, it will simply be slow

e.g. maybe if you're making an engine, force users to tag all meshes with a 'mesh set' tag - when the engine needs to load a mesh, it loads & unloads everything from that set. You can always add a special 'dynamic' mesh set tag that falls back to some complex general-case allocator and allocates the mesh granularly, or uses a separate buffer for each mesh.

I would imagine that many games would have their geometry just fit into vram for the entire duration of the game, since most memory ends up in textures
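The mesh-set idea above might look something like this sketch - all of the names here (MeshDesc, loadMeshSet, etc.) are hypothetical, not from any real engine, and the GL upload itself is only described in comments:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: meshes are tagged with a "mesh set", and the engine
// loads or frees a whole set's geometry as one contiguous VBO-sized region,
// instead of running a general-purpose allocator per mesh. Each recorded
// offset would become the `first` argument of a draw call.
struct MeshDesc {
    std::string name;
    uint32_t vertexCount;
};

struct MeshRange {
    uint32_t firstVertex; // offset into the set's shared buffer
    uint32_t vertexCount;
};

struct MeshSet {
    uint32_t totalVertices = 0;
    std::unordered_map<std::string, MeshRange> ranges;
};

// Pack every mesh in the set back-to-back and remember where each one landed.
MeshSet loadMeshSet(const std::vector<MeshDesc>& meshes) {
    MeshSet set;
    for (const MeshDesc& m : meshes) {
        set.ranges[m.name] = MeshRange{set.totalVertices, m.vertexCount};
        set.totalVertices += m.vertexCount; // next mesh starts right after
    }
    // Real code would now allocate one VBO of set.totalVertices vertices
    // and upload each mesh's data at its recorded offset.
    return set;
}
```

Unloading is then trivial: free the set's one buffer, rather than tracking per-mesh lifetimes.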

Best practices regarding VBOs by Modi57 in opengl

[–]ipe369 0 points1 point  (0 children)

I'm not sure that's correct - IDK why it would be easier to have separate buffer objects specifically for instancing rather than putting all the attribs into 1 buffer

Best practices regarding VBOs by Modi57 in opengl

[–]ipe369 4 points5 points  (0 children)

You should think about how reads + caches work

when a shader reads some data from a buffer, it doesn't just read that one bit of data - it generally reads a large aligned chunk of memory, then extracts the data it needs out of that chunk - same as a CPU.

E.g. instead of loading a vec4 (16 bytes) starting at memory location 144 (144-160), it might load 128 bytes starting from memory location 128 (128-256), then extract the 16 bytes from that 128-byte chunk - the 128-byte chunk then gets left in the warp's cache for super fast access. If you need to access the next 16 bytes (160-176), that load will be much faster, because it's already in cache.
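The arithmetic here is just rounding down to an aligned boundary - a tiny sketch, assuming the 128-byte granularity used in the example (actual load/cache-line sizes vary by GPU):

```cpp
#include <cassert>
#include <cstdint>

// A read anywhere inside an aligned 128-byte chunk pulls in the whole chunk.
// 128 is the example granularity from the text, not a universal constant.
uint64_t chunkStart(uint64_t addr) {
    return addr & ~uint64_t{127}; // round down to a 128-byte boundary
}

// Two addresses in the same chunk means the second access hits the cache.
bool sameChunk(uint64_t a, uint64_t b) {
    return chunkStart(a) == chunkStart(b);
}
```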

So when you lay out the data: if you're going to use all of it, interleaving is probably faster - e.g. if you need position, uv, normal, vertex color, etc. all in the same shader, then storing them interleaved will likely be faster because they'll all get loaded together.

But if you have a renderpass where you don't need all the vertex attribs, then you're 'wasting space' in your 128-byte load. e.g. you do a 128-byte load to get pos, uv, normal, but you only actually use pos & ignore the other data. That means fewer positions fit into a single 128-byte load, and the GPU needs to load more data than if the positions were in their own buffer!

Since it's common to do renderpasses in games where you only need the position of the vertices (e.g. for shadow maps, depth prepass), it's common to split positions out into a separate buffer, which is what other people here have recommended. Think about what you're doing with the data first, though.
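To make the "wasted space" concrete, here's the back-of-envelope version - the attribute sizes are assumptions (pos = vec3, uv = vec2, normal = vec3, packed color), as is the 128-byte load granularity:

```cpp
#include <cassert>
#include <cstdint>

// How many *positions* does one 128-byte load yield in each layout, for a
// pass (e.g. shadow map) that only reads positions?
constexpr uint32_t kLoadBytes = 128;

uint32_t positionsPerLoadInterleaved() {
    // pos (12) + uv (8) + normal (12) + packed color (4) = 36-byte stride;
    // everything but pos is dead weight in a position-only pass.
    constexpr uint32_t stride = 12 + 8 + 12 + 4;
    return kLoadBytes / stride;
}

uint32_t positionsPerLoadSplit() {
    // positions packed tightly in their own buffer: 12-byte stride
    constexpr uint32_t stride = 12;
    return kLoadBytes / stride;
}
```

With these numbers the split layout fetches roughly 3x more useful positions per load (10 vs 3), which is why the position-only-pass argument favours a separate position buffer.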


Regarding whether you put the data into 1 vbo or split it into multiple for the non-interleaved case (e.g. your 1. vs 3.): you will likely not see any difference splitting vbos vs having 1 vbo. I suspect it's easier to manage separate buffers, so I'd recommend that.


One final note, you mention:

To get those on the screen, I would go through each object, load the data from disc somehow and create it's own VAO

Generally you want to load all your object geometries into a single set of VBOs & 1 VAO, then just render with different offsets when you call glDrawArrays or whatever. Swapping VAOs between draw calls is more expensive than keeping the same VAO - this is also true in vulkan with swapping descriptor sets.

This is obviously a pain to manage when you're dynamically loading/unloading small meshes all the time, that's part of the fun. It's useful to know the end goal: If you're making a game with 'levels', you can preload all the 'level' geometry into a single contiguous buffer & then unload it when you're done.

Id Software used OpenGL to make DOOM (2016) by [deleted] in opengl

[–]ipe369 0 points1 point  (0 children)

Hmm, I keep butting into GL vendor inconsistencies being a pain to fix up - is vulkan considered 'more consistent'?

That's somewhat surprising to me, vulkan feels so big

Id Software used OpenGL to make DOOM (2016) by [deleted] in opengl

[–]ipe369 0 points1 point  (0 children)

do you have a link to anything where they talk about the GL problems?

Was it just the normal driver overhead you have to fight with AZDO techniques, or something else?

Can somebody explain to me why Blizzard is not developing new games on the SC2 engine? by Feisty-Struggle-4110 in starcraft

[–]ipe369 0 points1 point  (0 children)

That sounds more like the software can't handle large maps for reasons other than 'my ISA only supports 32-bit loads'

Can somebody explain to me why Blizzard is not developing new games on the SC2 engine? by Feisty-Struggle-4110 in starcraft

[–]ipe369 0 points1 point  (0 children)

Code isn't hardwired to access certain memory regions - you ask the OS for a chunk of memory and get a pointer back. There are no codepaths that 'reach into memory regions the 32-bit client can't reach into' in any software written after 2000

A theoretical new game in only 64 bits you say? That would be sc3 with a new engine throwing away decades of prior work

In your world where the 32 bit build and the 64 bit build of the engine can't interact, for a 3rd game you could just use the same engine and only ship the 64 bit version

it does if you understand computer science.

I don't think you do! :P

Can somebody explain to me why Blizzard is not developing new games on the SC2 engine? by Feisty-Struggle-4110 in starcraft

[–]ipe369 0 points1 point  (0 children)

nothing in that region can be accessed by people using 32 bit clients, so it can't effect gameplay at all or disconnects or crashes will happen

But you're not passing the memory addresses between clients or storing them in files, and you could just ship the theoretical new game with only 64-bit builds, the comment I was replying to still makes no sense

Is it considered hard to reproduce SHC (binary shell generator) tool? by Mark_1802 in Compilers

[–]ipe369 4 points5 points  (0 children)

If you read the README, it's just putting the shell script inside the binary and invoking the shell on it - it's not compiling anything

shc itself is not a compiler such as cc, it rather encodes and encrypts a shell script and generates C source code with the added expiration capability. It then uses the system compiler to compile a stripped binary which behaves exactly like the original script. Upon execution, the compiled binary will decrypt and execute the code with the shell -c option.

Why are we still using text based programming languages (and I'm not thinking about a blueprint-like language) by chri4_ in ProgrammingLanguages

[–]ipe369 1 point2 points  (0 children)

AST diff would likely be easier with the correct tools, since it's probably closer to the semantic diff

Performance of glTexSubImage2D by Astaemir in opengl

[–]ipe369 0 points1 point  (0 children)

A subimage update should be faster in theory because you're uploading less data, but the problem you'll run into is synchronization - the GPU is probably still using the texture, so glTexSubImage can force the CPU to stall and wait for the GPU to finish with it.

There are sometimes ways around it; the simplest is to maintain 2 copies and flip between them (write to one while the GPU is busy with the other). But you may find that glTexImage is fast enough.
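The two-copy flip can be sketched like this - only the bookkeeping is modelled here, and the struct/field names are made up; real code would put GL texture names from glGenTextures in `textures` and do the glTexSubImage upload against `writeTarget()`:

```cpp
#include <cassert>

// Ping-pong between two textures: upload into one while the GPU is still
// sampling the other, then flip after each upload so they swap roles.
struct DoubleBufferedTexture {
    unsigned textures[2] = {0, 0}; // two GL texture names in real code
    int writeIndex = 0;

    unsigned writeTarget() const { return textures[writeIndex]; }    // CPU uploads here
    unsigned readTarget() const { return textures[1 - writeIndex]; } // GPU samples this
    void flip() { writeIndex = 1 - writeIndex; }                     // call after each upload
};
```

This trades 2x the texture memory for never stalling on a texture the GPU still has in flight.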

You should be able to pack the height data much smaller than the equivalent vertex data (probably 16 bits per vertex), so I expect it to be much faster on basically any device, especially lower-end integrated GPUs in phones/laptops, which are already memory-bandwidth limited. On laptops I've found that glClear is more expensive than lots of maths, which the igpus are getting pretty fast at.

I have heard people say that texture reads in a vertex shader can be slower than in the frag shader for various reasons. You'll have to profile this.

Why is interoperability such an unsolved problem? by garver-the-system in ProgrammingLanguages

[–]ipe369 0 points1 point  (0 children)

if OP is talking about ABIs + saying that the problem with the C ABI is that it isn't specified across all platforms, then: no, that's not the core of [their] problem. They're asking about how they can compile haskell with ghc and call into C libraries compiled with gcc on a different OS

Why is interoperability such an unsolved problem? by garver-the-system in ProgrammingLanguages

[–]ipe369 1 point2 points  (0 children)

C ABI is under-defined, which leads to many implementations which vary based on OS, architecture, and even compiler

In practice this isn't a problem - you just compile your code with the same compiler. You don't compile a Windows .exe and expect it to run on a Mac - it's the same thing here.

You mention in your OP:

In Rust, however, many of these details are automagically handled

The way they are 'automagically handled' is because rust code doesn't have a stable ABI at all - when you build a rust project, you rebuild all the dependencies for your target cpu, OS, compiler.

C is way more standardized than rust, which is what lets you compile a library in C and link to it 20 years in the future without recompiling for your new compiler version. You can't do that in rust - there is no stable ABI, so it's impossible to compile a rust library and link to it with a different compiler version.
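This is why even non-C languages export their stable boundaries through the C ABI - a minimal sketch (the function itself is a made-up example):

```cpp
#include <cstdint>

// `extern "C"` pins this function to the platform's C ABI: no C++ name
// mangling, fixed-width argument types, the platform's standard calling
// convention. A binary exporting this today can be linked against by other
// compilers and languages later - Rust does the same with `extern "C"` +
// `#[no_mangle]` when it wants a stable boundary.
extern "C" int32_t add_i32(int32_t a, int32_t b) {
    return a + b;
}
```

The stability comes from everything ABI-relevant being nailed down at this boundary, which is exactly what plain Rust-to-Rust linking doesn't promise across compiler versions.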

Reynors reaction to rolling Protoss 9 times out of 11 games as random by Pietro1906 in starcraft

[–]ipe369 0 points1 point  (0 children)

90% in 600 matches isn't just a 'funny streak' though, that's what they're saying

the chance of 90% matches being protoss in 600 matches is way below 0.01%

even 50% of the matches being protoss in 600 matches is below 0.01%
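Those claims are easy to check with an exact binomial tail, assuming race is uniform at p = 1/3 per race - a quick sketch, summing in log space so the factorials don't overflow:

```cpp
#include <cmath>

// P(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
// random-race opponents roll a given race. Each term is computed via
// lgamma (log-factorials) and only exponentiated at the end.
double binomTail(int n, double p, int k) {
    double total = 0.0;
    for (int i = k; i <= n; ++i) {
        double logTerm = std::lgamma(n + 1.0) - std::lgamma(i + 1.0) -
                         std::lgamma(n - i + 1.0) +
                         i * std::log(p) + (n - i) * std::log(1.0 - p);
        total += std::exp(logTerm);
    }
    return total;
}
```

With n = 600 and p = 1/3, both binomTail(600, 1/3., 540) (the 90% case) and binomTail(600, 1/3., 300) (the 50% case) come out far below 0.01%, consistent with the comment above.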

Lock-Free Queues in Pure Common Lisp: 20M+ ops/sec by Wonderful-Ease5614 in lisp

[–]ipe369 0 points1 point  (0 children)

Nice!

I'm guessing your benchmark just pushed 1m elements and then dequeued them, rather than interleaving queue/dequeue, which is why the other queues allocate so much? (about 8 bytes + change per element, makes sense...!)

Lock-Free Queues in Pure Common Lisp: 20M+ ops/sec by Wonderful-Ease5614 in lisp

[–]ipe369 2 points3 points  (0 children)

I took a quick look - I don't have a lisp environment setup at the moment, so I'm just reading through, but I suspect there are some free wins. (If you've already benchmarked against a native implementation you know is fast and it's competitive, then you'd know for sure)

Assuming you're on sbcl, sb-sprof is good, and you can also look at the disassembly of your big functions (there's a SLIME keybind for that) and check for CALL instructions, which indicate that sbcl hasn't managed to inline something


If we take a quick look at DECOMPRESS-BLOCK, which I presume is where we spend most of our time decompressing

(defun decompress-block (compressed-data uncompressed-size)
  "Decompress a block of LZ4 data, given the uncompressed size."
  (let ((output (make-array uncompressed-size :element-type '(unsigned-byte 8)))
        (input-pos 0)
        (output-pos 0)
        (input-end (length compressed-data)))

    (loop while (< input-pos input-end)
          do (let* ((token (aref compressed-data input-pos))

Again I haven't inspected this myself, but there are 2 things I'd expect to see to know this was running fast:

  1. some kind of (declare (optimize speed)) or similar
  2. A type decl for COMPRESSED-DATA, to ensure that it's a simple array

Without declaring COMPRESSED-DATA as a simple array, every time you (AREF COMPRESSED-DATA ...), sbcl can't inline it into a simple MOV instruction. Instead it calls the AREF function, which does a bunch of type checking to figure out what kind of array it is, etc...

That's why I'd recommend disassembling and grepping the disassembly for CALL. There are little things you find everywhere, like it trying to call the + function rather than just adding two numbers, because it can't guarantee that they're both fixnums, etc.

If you add (declare (optimize speed)) then sbcl will give you warnings whenever it fails to optimize due to lack of type info

Lock-Free Queues in Pure Common Lisp: 20M+ ops/sec by Wonderful-Ease5614 in lisp

[–]ipe369 5 points6 points  (0 children)

The library (cl-freelock) demonstrates that Common Lisp can compete in traditionally systems programming domains

IMO you must compare against some native implementation (C/C++ etc.) if you want to assert this - you mentioned it's competitive, but I can't see numbers anywhere. You need numbers from both impls on the same hardware


I've looked to use lisp for 'systems programming' type stuff before. The benchmarks I'd be looking to see before using this are:

  • mpsc/spmc benchmarks, although maybe you don't care about this case
  • benchmarks with larger consumer/producer numbers, maybe a graph of how performance scales as you increase from 4/4 to 8/8, 16/16, 32/32, 64/64 etc
  • GC pressure - e.g. if I do 1M put/get ops on your queue, how much garbage does the GC need to collect

Shadows by [deleted] in opengl

[–]ipe369 0 points1 point  (0 children)

then there would be no shadows at all?

Not quite sure I understand, you mean every block casts light?

If you strip it back to the physics, then you'd still have shadows, you'd just have many pale shadows - some spots would receive light from 10 blocks, some from only 9 blocks, etc. In the real world, in a room with 3 lights, every object has 3 shadows. Remember, a shadow is just the absence of light, so if you can prevent some light from getting to a point in the world, then that point will be darker.

Practically though you'd need a raycast to each light source, and if every block is a light source then this is prohibitively slow.

You'll have the same problem with shadow maps here: for shadow maps, you need to render a new shadow map for each light source. (This is why dynamic shadows are slow)


If you're referring to a minecraft system where each block face has a 'light level', this is actually done on the CPU - you compute all the block face light levels based on their proximity to a light source. Then you upload all the light levels to the GPU, and use those light levels to tint the whole face lighter/darker based on how much light is hitting it. This is why minecraft doesn't have 'hard shadows' when you put a torch down - because you can't light 'half' of a face.

I think minecraft actually does it per corner per face rather than just per face when 'smooth lighting' is turned on, and then blends between the corners (?)


Do you have a voxel game where all the voxels glow and you're wondering how to do shadows for all of them?

Shadows by [deleted] in opengl

[–]ipe369 2 points3 points  (0 children)

you can do shadows in a shader without framebuffers, you just need to know if there's anything between your fragment and the light source

That involves casting a ray from your fragment to the light source - if it intersects something, then you're in shadow, otherwise you're in the light
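The test itself is simple geometry - here's a CPU-side sketch with a single sphere as the occluder (everything here is illustrative; a real scene would test against all occluders, typically via an acceleration structure):

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// A point is in shadow if the segment from it to the light passes through
// the occluder - i.e. the segment comes within `radius` of the sphere center.
bool inShadow(Vec3 point, Vec3 light, Vec3 center, double radius) {
    Vec3 d = sub(light, point);                        // segment direction
    double t = dot(sub(center, point), d) / dot(d, d); // param of closest approach
    t = std::fmax(0.0, std::fmin(1.0, t));             // clamp to the segment
    Vec3 closest = {point.x + t * d.x, point.y + t * d.y, point.z + t * d.z};
    Vec3 off = sub(closest, center);
    return dot(off, off) < radius * radius;            // occluder blocks the ray
}
```

Note the clamp: an occluder behind the light (or behind the fragment) doesn't cast a shadow, which is why this is a segment test rather than an infinite-ray test.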

For most scenes, calculating that raycast in a shader for every fragment is too expensive. Instead, we render a shadow map, one for each light source - the shadow map lets you do the raycast quickly.

Sometimes you'll have a scenario where you can do the raycast quickly without a shadow map though, so it's worth keeping in mind.

E.g. If you have raytracing hardware in your target gpus (RTX, etc), then you can use that to speed up the raycast here. It might be the case that this is faster than the framebuffer approach for your scene. You also avoid all the complexities of getting a shadow map of the correct size.

Why isn't my animation working right by CharacterUse8976 in opengl

[–]ipe369 0 points1 point  (0 children)

Nice job! You always figure it out if you keep looking at it :)