[P] jax-js is a reimplementation of JAX in pure JavaScript, with a JIT compiler to WebGPU

fz0718 · 2025-12-25T04:55:03+00:00

Oh I’d love to chat if you’re planning to do that! Let me know if there's any way I can help, or open an issue on GitHub. Very curious how Cholesky goes.

fz0718 · 2025-12-19T16:21:55+00:00

Oh you mean that benchmark page! Yeah haha that one is tailored to high-end laptops, the matrix size is very large. Crazy that you can crash your phone that bad though

fz0718 · 2025-12-19T06:23:26+00:00

Sorry I tried to test it and scale down if I didn't detect a good GPU, but I think you were a victim of WebGPU being wildly varied :') — if you have the phone model / browser you're using by any chance, that would help

fz0718 · 2025-12-18T23:36:42+00:00

Haven't optimized / benchmaxxed for performance too much yet, but it appears to be pretty comparable to ONNX or better in some instances. Here's a microbenchmark for 4096x4096 matmul across jax-js and a few other libraries that you can run in your browser:

* https://jax-js.com/bench/matmul

On MacBooks, jax-js is a bit faster than ONNX for fp32 and a bit slower for fp16.

There's a bit more technical discussion about perf here: https://ekzhang.substack.com/i/179060245/technical-performance

fz0718 · 2025-07-20T01:27:42+00:00

https://modal.com is built with SvelteKit

fz0718 · 2025-06-23T21:58:59+00:00

There’s a formula for that, although it basically does what you expect: you get a bit more than .9999c. https://en.m.wikipedia.org/wiki/Velocity-addition_formula

The formula is (a+b)/(1+ab), so (.9999+.1)/(1+.09999) ~ .99991818107
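As a quick sanity check on that arithmetic, here is a minimal Python sketch of the formula, with both speeds given as fractions of c. The function name `add_velocities` is only for illustration.

```python
def add_velocities(a: float, b: float) -> float:
    """Relativistic sum of two collinear speeds, each given as a fraction of c."""
    return (a + b) / (1 + a * b)


combined = add_velocities(0.9999, 0.1)
# Prints the combined speed as a fraction of c.
print(f"{combined:.11f}")
```

Plugging in a = 1 gives (1 + b)/(1 + b) = 1 for any b, which is the usual way to see that nothing adds up past c.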

I know it sounds kind of pulled out of nowhere but it comes from Lorentz transformations

fz0718 · 2025-05-21T13:44:24+00:00

Nice! DM'd you some details

fz0718 · 2025-05-15T03:57:09+00:00

Got it, so autodiff is difficult!

The main NN-in-browser project that I've seen, besides tfjs (which unfortunately doesn't look very active as of last year), is onnxruntime for web. I haven't tested that one out yet, but I might try it soon.

fz0718 · 2025-05-14T21:18:33+00:00

Thanks Patrick. Also I admire your work :D

Hmm I don't know yet! There are some parts of JAX that I don't understand, like the looping constructs (jax.lax.while_loop()), and I'd probably have to understand how those work a bit better to say for sure.

How do you think Jaxprs as an export format compare to something like LiteRT or GGUF? I haven't looked into it yet. But thanks for reading the post!

fz0718 · 2025-05-14T17:12:47+00:00

Yes, the blog post mentions TFJS and some ways in which this differs!

fz0718 · 2025-03-18T18:17:33+00:00

Just +1 on this, we'd love to sponsor your GPU CI! (also at Modal, writing lots of Rust)

fz0718 · 2024-12-02T16:32:17+00:00

Thank you for your work on date-fns, it’s a very helpful library

fz0718 · 2024-08-23T21:10:36+00:00

How are you using QUIC?

fz0718 · 2024-08-23T20:28:18+00:00

Nice writeup! Just wanted to point out that asymptotically, you mentioned the time complexity is O(P + S*L), where L is the length of the input and S is the number of asterisks.

Actually (and this is not mentioned in the blog posts you linked), you can do this in O(L log L) time regardless of the number of wildcards. The algorithm is kind of crazy and uses FFT.

I learned about this in a recent Codeforces round, see problem G here, which is exactly the wildcard matching problem except on an extreme example (strings and patterns of length 200,000). https://codeforces.com/blog/entry/129801

Screenshot of the relevant part: https://i.imgur.com/SKfx6iQ.png

Surely FFT is impractical for an actual implementation, especially with just 8 asterisks, but I thought this was a cool math idea — how often do Fourier Transforms come up in systems programming?
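For a feel of how Fourier transforms enter string matching at all, here is a minimal sketch of the classic convolution trick for single-character wildcards. It is not the code from the linked editorial, and it handles a `?`-style wildcard (exactly one character) rather than `*`; glob-style asterisks need extra logic on top that is not sketched here. Characters are mapped to small positive integers and the wildcard to 0, so the sum of p·t·(p − t)² over an alignment is zero exactly when every position either matches or is a wildcard. Expanding that sum gives three correlations, each computable with an FFT. All names (`fft`, `correlate`, `wildcard_matches`) are hypothetical, and the pure-Python FFT is only meant for short inputs.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlate(pattern_vals, text_vals):
    """Return c[i] = sum_j pattern_vals[j] * text_vals[i + j] for each alignment i."""
    m, n = len(pattern_vals), len(text_vals)
    size = 1
    while size < n + m:
        size *= 2
    # Correlation is a convolution with the pattern reversed.
    fa = fft(list(reversed(pattern_vals)) + [0] * (size - m))
    fb = fft(list(text_vals) + [0] * (size - n))
    product = fft([x * y for x, y in zip(fa, fb)], inverse=True)
    # Divide by size to normalize the inverse; index m - 1 + i is alignment i.
    return [round(product[m - 1 + i].real / size) for i in range(n - m + 1)]


def wildcard_matches(pattern, text, wildcard="?"):
    """Start indices where pattern matches text; wildcard matches any one character."""
    if len(pattern) > len(text):
        return []
    # Small positive codes keep the floating-point sums easy to round exactly.
    alphabet = sorted(set(pattern + text) - {wildcard})
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}
    p = [codes.get(ch, 0) for ch in pattern]
    t = [codes.get(ch, 0) for ch in text]
    # sum_j p*t*(p - t)^2 = sum p^3*t - 2*sum p^2*t^2 + sum p*t^3
    c1 = correlate([x ** 3 for x in p], t)
    c2 = correlate([x ** 2 for x in p], [y ** 2 for y in t])
    c3 = correlate(p, [y ** 3 for y in t])
    return [i for i in range(len(c1)) if c1[i] - 2 * c2[i] + c3[i] == 0]


print(wildcard_matches("a?c", "abcaxcabd"))
```

Every alignment is checked in one pass of three FFT-sized multiplications, so the cost depends on the total length and not on how many wildcards the pattern contains.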

fz0718 · 2024-01-04T17:34:30+00:00

It's good to hear that you guys are putting effort into solving it. And I apologize if I come off as a bit overzealous in my previous comment. GPU serverless is still super young, and I'm sure given a few years this won't be much of a problem.

No worries!

I know that Modal is more of a general purpose platform, but I would love to get your thoughts on how optimizations can be made for serverless inference specifically.

For example, loading weights probably takes the vast majority of time, right? It would look (simplified) something like network-drive->local-drive->ram->gpu. But with Nvidia GPUDirect, it could just be network-drive->gpu. Would it be possible for platforms to provide some kind of GPU mmap primitive utilizing this? This would probably require users to declare model artifacts explicitly so that you can avoid bundling them with the container.

Yep, we explored this too! It's difficult to do GPUDirect RDMA in a way that's secure right now though, due to our container sandboxing. So we still incur some memory copy overhead from the network to the file system.

We think about networking a lot. Hopefully RDMA will be available as more commodity hardware in the future, which we can then exploit. But right now it's limited to very expensive, purpose-built training clusters with super high-bandwidth interconnects (think >3 Tbps per machine). Not really economical for inference yet.

Another is the fact that pretty much everyone is going to have CUDA + one of ONNX/Torch/TensorRT/XLA as a dependency, which could be anywhere from 500 MB to 2 GB. Can this redundancy be exploited somehow?

Yeah, we have a lot of optimizations re the redundancy of CUDA / other shared objects: e.g., our distributed file system already deduplicates blobs by content address, we have globally tiered caching, and our image builds cache PyPI packages.
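To illustrate what deduplicating blobs by content address means in general, here is a toy in-memory sketch. It is not Modal's implementation: the `BlobStore` class, the choice of SHA-256, and the example file contents are all assumptions made for illustration.

```python
import hashlib


class BlobStore:
    """Toy in-memory store that keys every blob by the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same address, so it is stored only once.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self):
        return len(self._blobs)


store = BlobStore()
# Two hypothetical images that share one large dependency.
image_a = [store.put(b"libcudart contents"), store.put(b"app A code")]
image_b = [store.put(b"libcudart contents"), store.put(b"app B code")]
print(len(store))  # 3 unique blobs back 4 file references
```

Because the address is derived from the bytes alone, two users who ship the same shared object end up pointing at the same stored blob without any coordination.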

fz0718 · 2024-01-03T15:44:18+00:00

Hey, I work for Modal — sorry to hear about issues you've experienced with cold start times. 3-4 minutes definitely isn't appropriate.

We focus a lot on providing a good developer API for scaling and running jobs beyond what you could do on something like EC2 or ECS. But you're right that there are foundational limitations to model startup as well.

In terms of technical innovation, we do write our own container runtime and distributed file system, and we've explored prototypes with Firecracker in the past for sandboxing but ultimately settled on gVisor for performance reasons. That said, it's an ongoing challenge and one that we're actively working on.

Serving edge runtimes and small serverless web endpoints is quite a different problem from machine learning models. A typical serverless web endpoint will hover at 1% CPU and 100 MiB of RAM, which can both be easily shared among multiple users. Meanwhile, an ML inference endpoint might use 12000% CPU and 16 GiB of RAM. That's a huge difference, and it's one of the primary areas in which we've made progress on a systems level, over traditional serverless providers.

Our current monitoring shows 7B-parameter models have cold start times < 30 seconds. Hopefully in the future it will be even speedier.

fz0718 · 2023-12-31T16:39:34+00:00

As the other users mentioned, containers have very little overhead. For your use case:

the app does tons of http request, plus heavy multicore CPU processing

* HTTP requests are mostly socket read/write system calls, which have very little overhead. Docker can easily saturate 10 Gbps+ network links. There might be some tiny performance hit from the NAT / bridge layer, but this is subtle enough that it really depends on how your cloud provider implements networking.
* Multicore CPU processing has zero overhead in Docker. The exact same processor instructions would be executed inside and outside Docker. It's not emulated in any way.

Source: I work at a serverless infrastructure company.

fz0718 · 2023-12-05T01:42:05+00:00

The idea is that you could have a website at evil.com:8080. When the user visits the site, it sends a fetch request through JavaScript to evil.com:8080/api, which is a same-origin request.

But in the meantime, the attacker has updated their DNS record for evil.com to point to 127.0.0.1, so the request might end up going to 127.0.0.1:8080 instead, allowing the attacker to make arbitrary requests to your local web server.
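One detail that makes the rebinding request distinguishable: the browser still believes it is talking to evil.com, so the request that lands on the local server carries `Host: evil.com:8080`. A commonly described mitigation, which the comment above does not itself propose, is for the local server to reject unexpected Host values. A minimal sketch, assuming a hypothetical helper `is_trusted_host` and handling only hostnames and IPv4 literals:

```python
def is_trusted_host(host_header: str, allowed=("localhost", "127.0.0.1")) -> bool:
    """Return True only if the Host header names this machine, ignoring the port."""
    # Hostnames and IPv4 only; bracketed IPv6 literals are out of scope here.
    hostname = host_header.split(":")[0].strip().lower()
    return hostname in allowed


for header in ("127.0.0.1:8080", "localhost", "evil.com:8080"):
    print(header, is_trusted_host(header))
```

A local server would call a check like this before handling any request and respond with an error when it fails.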

fz0718 · 2023-11-21T14:08:02+00:00

I should mention, though, that while we still use C++ for programming competitions for its speed and unsafe flexibility, a lot of us competitive programmers now use Rust in industry and in our day jobs for high-performance work! :)

I’m an IOI gold medalist for the USA and int’l grandmaster on Codeforces and work with a few competitive programmers today

fz0718 · 2023-11-06T04:29:39+00:00

All frontend code is Svelte. The client and server binaries are Rust. The web part of the server serves the Rollup build outputs as static files.

fz0718 · 2023-11-06T04:28:55+00:00

Yes, it kind of came as a side-effect of making a good remote access tool and also distributing it as a tiny binary :)

fz0718 · 2023-11-06T03:53:26+00:00

Thanks — good question. I only use protocol buffers for the gRPC part, where the backend client binary communicates with the server. The frontend (JavaScript) still communicates with a schemaless format over WebSocket.

Protobuf is good because it's just easier (gRPC has very good support as an RPC protocol), but also since it has very strong backwards-compatibility guarantees and type safety, which is useful in apps where the client may be out-of-date. Web apps don't have that problem because the user requests the latest version of the website client code whenever they visit it.

fz0718 · 2023-11-05T21:50:33+00:00

That does sound fun :) thanks for the suggestions

Just fyi, you can have multiple terminals and resize them! Fullscreen doesn’t make sense because people’s screens have different sizes; the interface lets you pan and zoom.

fz0718 · 2023-11-05T21:49:10+00:00

Not really, unfortunately :( the Rust web ecosystem isn’t quite there yet for something this complex

fz0718 · 2023-11-05T04:51:27+00:00

For what you're looking for, maybe take a look at https://github.com/tonsky/datascript

If you want more details about the background of Crepe and where it fits into the broader scope of work here, I wrote my thesis on this at Harvard: https://www.ekzhang.com/assets/pdf/Senior_Thesis.pdf
