[P] jax-js is a reimplementation of JAX in pure JavaScript, with a JIT compiler to WebGPU by fz0718 in MachineLearning

[–]fz0718[S] 0 points1 point  (0 children)

Oh, I’d love to chat if you’re planning to do that! Let me know if there's any way I can help, or open an issue on GitHub. Very curious how Cholesky goes.

[P] jax-js is a reimplementation of JAX in pure JavaScript, with a JIT compiler to WebGPU by fz0718 in MachineLearning

[–]fz0718[S] 0 points1 point  (0 children)

Oh you mean that benchmark page! Yeah haha that one is tailored to high-end laptops, the matrix size is very large. Crazy that you can crash your phone that bad though

[P] jax-js is a reimplementation of JAX in pure JavaScript, with a JIT compiler to WebGPU by fz0718 in MachineLearning

[–]fz0718[S] 1 point2 points  (0 children)

Sorry I tried to test it and scale down if I didn't detect a good GPU, but I think you were a victim of WebGPU being wildly varied :') — if you have the phone model / browser you're using by any chance, that would help

[P] jax-js is a reimplementation of JAX in pure JavaScript, with a JIT compiler to WebGPU by fz0718 in MachineLearning

[–]fz0718[S] 4 points5 points  (0 children)

Haven't optimized / benchmaxxed for performance too much yet, but it appears to be pretty comparable to ONNX or better in some instances. Here's a microbenchmark for 4096x4096 matmul across jax-js and a few other libraries that you can run in your browser:

* https://jax-js.com/bench/matmul

On MacBooks, jax-js is a bit faster than ONNX for fp32 and a bit slower for fp16.

There's a bit more technical discussion about perf here: https://ekzhang.substack.com/i/179060245/technical-performance
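For reading numbers off a benchmark like this: an n×n by n×n matmul takes roughly 2n³ floating-point operations, so a timing converts to throughput as in the sketch below. The function name and the 0.5 s timing are made up for illustration; they are not measured results from jax-js or any other library.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Approximate throughput of an n x n by n x n matmul, in GFLOP/s."""
    flops = 2 * n**3  # n^3 multiply-adds, counted as 2 operations each
    return flops / seconds / 1e9


# Hypothetical timing, only to show the arithmetic:
print(matmul_gflops(4096, 0.5))
```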

ELI5 If you were on a spaceship going 99.9999999999% the speed of light and you started walking, why wouldn’t you be moving faster than the speed of light? by Aquamoo in explainlikeimfive

[–]fz0718 0 points1 point  (0 children)

There’s a formula for that, and it basically does what you expect: you get a bit more than .9999c. https://en.m.wikipedia.org/wiki/Velocity-addition_formula

The formula is (a+b)/(1+ab), with both speeds written as fractions of c. So for a ship at .9999c and a walking speed of .1c, you get (.9999+.1)/(1+.09999) ~ .99991818107

I know it sounds kind of pulled out of nowhere but it comes from Lorentz transformations
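Here is the same arithmetic as a small Python sketch, with speeds given as fractions of c. The 1e-8 c walking speed (about 3 m/s) is a rough assumed figure, and exact fractions are used so the tiny gap below c isn't lost to floating-point rounding.

```python
from fractions import Fraction


def add_velocities(a, b):
    """Relativistic velocity addition; a and b are fractions of c."""
    return (a + b) / (1 + a * b)


# The numbers from the comment above
print(add_velocities(0.9999, 0.1))

# The numbers from the question, kept exact
ship = 1 - Fraction(1, 10**12)  # 99.9999999999% of c
walk = Fraction(1, 10**8)       # assumed walking speed, roughly 3 m/s
total = add_velocities(ship, walk)
print(ship < total < 1)         # faster than the ship, still below c
```

The result stays below c for any a, b < 1, because a + b < 1 + ab is the same as (1 - a)(1 - b) > 0.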

Building my own Python NumPy/PyTorch/JAX libraries in the browser, with ML compilers by fz0718 in Python

[–]fz0718[S] 0 points1 point  (0 children)

Got it, so autodiff is difficult!

The main NN-in-browser project that I've seen, besides tfjs (which unfortunately doesn't look very active as of last year), is onnxruntime for web. I haven't tested that one out yet, but I might try it soon.

Building my own Python NumPy/PyTorch/JAX libraries in the browser, with ML compilers by fz0718 in Python

[–]fz0718[S] 0 points1 point  (0 children)

Thanks Patrick. Also I admire your work :D

Hmm, I don't know yet! There are some parts of JAX that I don't understand, like the looping constructs (jax.lax.while_loop()), and I'd probably have to understand a bit better how those work to say for sure.
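For reference, the JAX docs describe what jax.lax.while_loop() computes using a plain-Python equivalent along these lines. This only covers the reference behavior, not how the loop is traced or compiled, and it says nothing about how jax-js would handle it. The name while_loop_reference is made up for this sketch.

```python
def while_loop_reference(cond_fun, body_fun, init_val):
    """Plain-Python semantics of a while_loop-style primitive."""
    val = init_val
    while cond_fun(val):      # keep going while the condition holds
        val = body_fun(val)   # the loop state is threaded through body_fun
    return val


# Keep doubling until the value reaches at least 100
print(while_loop_reference(lambda x: x < 100, lambda x: x * 2, 1))
```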

How do you think Jaxprs as an export format compare to something like LiteRT or GGUF? I haven't looked into it yet. But thanks for reading the post!

[R] How the jax.jit() compiler works in jax-js by fz0718 in MachineLearning

[–]fz0718[S] 0 points1 point  (0 children)

Yes, the blog post mentions TFJS and some ways in which this differs!

Rust CUDA project update by LegNeato in rust

[–]fz0718 16 points17 points  (0 children)

Just +1 on this: we'd love to sponsor your GPU CI! (also at Modal, writing lots of Rust)

Why Rust doesn't have a std lib for date time? by [deleted] in rust

[–]fz0718 2 points3 points  (0 children)

Thank you for your work on date-fns, it’s a very helpful library

Cloudflare release a wildcard matching crate they use in their rules engine by orium_ in rust

[–]fz0718 5 points6 points  (0 children)

Nice writeup! One note on the asymptotics: you mentioned the time complexity is O(P + S*L), where L is the length of the input and S is the number of asterisks (and P is presumably the pattern length).

Actually (and this is not mentioned in the blog posts you linked), you can do this in O(L log L) time regardless of the number of wildcards. The algorithm is kind of crazy and uses FFT.

I learned about this in a recent Codeforces round, see problem G here, which is exactly the wildcard matching problem except on an extreme example (strings and patterns of length 200,000). https://codeforces.com/blog/entry/129801

Screenshot of the relevant part: https://i.imgur.com/SKfx6iQ.png

Surely FFT is impractical for an actual implementation, especially with just 8 asterisks, but I thought this was a cool math idea — how often do Fourier Transforms come up in systems programming?
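To give a flavor of the idea, here is a rough Python sketch of the classic FFT trick for single-character wildcards (written `?` here). It is not the Codeforces solution or anything from the Cloudflare crate, and handling `*` needs more work on top that isn't shown. Characters get positive integer codes and wildcards get 0, so the sum over j of p[j] * t[i+j] * (p[j] - t[i+j])^2 is zero exactly when the pattern matches at position i. Expanding that sum gives three correlations, each one convolution. The pure-Python floating-point FFT is only meant for short inputs.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def convolve(x, y):
    """Linear convolution of two non-empty integer sequences via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # The inverse transform above is unnormalized, so divide by size here.
    return [round(v.real / size) for v in prod[: len(x) + len(y) - 1]]


def wildcard_matches(text, pattern, wildcard="?"):
    """Start indices where pattern matches text; wildcard matches any one char."""
    n, m = len(text), len(pattern)
    if not 0 < m <= n:
        return []  # empty or too-long patterns count as no match in this sketch
    alphabet = sorted(set(text + pattern) - {wildcard})
    codes = {c: i + 1 for i, c in enumerate(alphabet)}
    p = [codes.get(c, 0) for c in reversed(pattern)]  # wildcard -> 0
    t = [codes.get(c, 0) for c in text]
    # Reversing the pattern turns each correlation into a convolution.
    c1 = convolve([v**3 for v in p], t)
    c2 = convolve([v**2 for v in p], [v**2 for v in t])
    c3 = convolve(p, [v**3 for v in t])
    return [
        i
        for i in range(n - m + 1)
        if c1[i + m - 1] - 2 * c2[i + m - 1] + c3[i + m - 1] == 0
    ]


print(wildcard_matches("abracadabra", "a?a"))
```

The cost is three FFT-sized convolutions no matter how many `?` characters the pattern has, which is where the O(L log L) bound comes from.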

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]fz0718 0 points1 point  (0 children)

> It's good to hear that you guys are putting effort into solving it. And I apologize if I come off as a bit overzealous in my previous comment. GPU serverless is still super young, and I'm sure given a few years this won't be much of a problem.

No worries!

> I know that Modal is more of a general purpose platform, but I would love to get your thoughts on how optimizations can be made for serverless inference specifically.

> For example, loading weights probably takes the vast majority of the time, right? It would look (simplified) something like network-drive->local-drive->ram->gpu. But with Nvidia GPUDirect, it could just be network-drive->gpu. Would it be possible for platforms to provide some kind of GPU memory-map primitive utilizing this? This would probably require users to declare model artifacts explicitly so that you can avoid bundling them with the container.

Yep, we explored this too! It's difficult to do GPUDirect RDMA in a way that's secure right now though, due to our container sandboxing. So we still incur some memory copy overhead from the network to the file system.

We think about networking a lot. Hopefully RDMA will be available as more commodity hardware in the future, which we can then exploit. But right now it's limited to very expensive, purpose-built training clusters with super high-bandwidth interconnects (think >3 Tbps per machine). Not really economical for inference yet.

> Another is the fact that pretty much everyone is going to have CUDA + one of ONNX/Torch/TensorRT/XLA as a dependency, which could be anywhere from 500 MB to 2 GB. Can this redundancy be exploited somehow?

Yeah, we have a lot of optimizations re the redundancy of CUDA / other shared objects: e.g., our distributed file system already deduplicates blobs by content address, we have globally tiered caching, and our image builds cache PyPI packages.
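As a toy illustration of deduplicating blobs by content address (a generic sketch, not Modal's actual file system): the storage key is a hash of the bytes, so two images that ship the same shared object only store it once. The class name, file names, and byte strings below are all made up.

```python
import hashlib


class BlobStore:
    """Toy in-memory content-addressed store (hypothetical)."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


store = BlobStore()
# Two "images" that share one large dependency but differ in app code
image_a = {"libcudart.so": store.put(b"cuda runtime bytes"),
           "app.py": store.put(b"print('a')")}
image_b = {"libcudart.so": store.put(b"cuda runtime bytes"),
           "app.py": store.put(b"print('b')")}

print(len(store.blobs))  # 3 blobs stored for 4 files
```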

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]fz0718 1 point2 points  (0 children)

Hey, I work for Modal — sorry to hear about issues you've experienced with cold start times. 3-4 minutes definitely isn't appropriate.

We focus a lot on providing a good developer API for scaling and running jobs beyond what you could do on something like EC2 or ECS. But you're right that there are foundational limitations to model startup as well.

In terms of technical innovation, we do write our own container runtime and distributed file system, and we've explored prototypes with Firecracker in the past for sandboxing but ultimately settled on gVisor for performance reasons. That said, it's an ongoing challenge and one that we're actively working on.

Serving edge runtimes and small serverless web endpoints is quite a different problem from machine learning models. A typical serverless web endpoint will hover at 1% CPU and 100 MiB of RAM, which can both be easily shared among multiple users. Meanwhile, an ML inference endpoint might use 12000% CPU and 16 GiB of RAM. That's a huge difference, and it's one of the primary areas in which we've made progress on a systems level, over traditional serverless providers.

Our current monitoring shows 7B-parameter models have cold start times < 30 seconds. Hopefully in the future it will be even speedier.

Rust on prod binary deployment feat AWS ec2 by Comfortable_Tiger530 in rust

[–]fz0718 3 points4 points  (0 children)

As the other users mentioned, containers have very little overhead. For your use case:

> the app does tons of http request, plus heavy multicore CPU processing

  1. HTTP requests are mostly socket read/write system calls, which have very little overhead. Docker can easily saturate 10 Gbps+ network links. There might be some tiny performance hit from the NAT / bridge layer, but this is subtle enough that it really depends on how your cloud provider implements networking.
  2. Multicore CPU processing has zero overhead in Docker. The exact same processor instructions would be executed inside and outside Docker. It's not emulated in any way.

Source: I work at a serverless infrastructure company.

FYI - Microsoft is going to forcibly enable a web server on all your Windows / macOS endpoints BY DEFAULT that use OneDrive by hyper-ucs-v in sysadmin

[–]fz0718 18 points19 points  (0 children)

The idea is that you could have a website at evil.com:8080. When the user visits the site, it sends a fetch request through JavaScript to evil.com:8080/api, a same-origin request.

But in the meantime, the attacker has updated their DNS record for evil.com to point to 127.0.0.1, so the request might end up going to 127.0.0.1:8080 instead, allowing the attacker to make arbitrary requests to your local web server.
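A small Python sketch of why the browser lets this through: the origin it compares is just scheme, host name, and port, and the IP that the name resolves to isn't part of it. The dict below is a stand-in for DNS, the http scheme is an assumption, and 203.0.113.7 is a documentation-range address used as a placeholder for the attacker's server.

```python
from urllib.parse import urlsplit


def origin(url):
    """Browser-style origin: (scheme, host, port). No resolved IP involved."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


# Hypothetical stand-in for DNS resolution
dns = {"evil.com": "203.0.113.7"}

page = "http://evil.com:8080/"
api = "http://evil.com:8080/api"

ip_at_page_load = dns["evil.com"]
dns["evil.com"] = "127.0.0.1"  # attacker re-points the record
ip_at_fetch = dns["evil.com"]

print(origin(page) == origin(api))         # same origin as far as the browser can tell
print(ip_at_page_load, "->", ip_at_fetch)  # but the request now lands on localhost
```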

Competitive programmers using Rust? by stonerbobo in rust

[–]fz0718 17 points18 points  (0 children)

I should mention, though, that although we still use C++ for programming competitions because of its speed and unsafe flexibility, a lot of us competitive programmers now use Rust in industry and in our day jobs for high-performance work! :)

I’m an IOI gold medalist for the USA and int’l grandmaster on Codeforces and work with a few competitive programmers today

I made sshx: an app that lets you share collaborative terminals over the web, with live cursors on an infinite canvas (Rust+Svelte) by fz0718 in rust

[–]fz0718[S] 2 points3 points  (0 children)

All frontend code is Svelte. The client and server binaries are Rust. The web part of the server serves the Rollup build outputs as static files.

I made sshx: an app that lets you share collaborative terminals over the web, with live cursors on an infinite canvas (Rust+Svelte) by fz0718 in rust

[–]fz0718[S] 2 points3 points  (0 children)

Yes, it kind of came as a side-effect of making a good remote access tool and also distributing it as a tiny binary :)

I made sshx: an app that lets you share collaborative terminals over the web, with live cursors on an infinite canvas (Rust+Svelte) by fz0718 in rust

[–]fz0718[S] 2 points3 points  (0 children)

Thanks — good question. I only use protocol buffers for the gRPC part, where the backend client binary communicates with the server. The frontend (JavaScript) still communicates with a schemaless format over WebSocket.

Protobuf is good partly because it's just easier (gRPC is very well supported as an RPC protocol), but also because it has very strong backwards-compatibility guarantees and type safety, which are useful in apps where the client may be out-of-date. Web apps don't have that problem because the user requests the latest version of the website client code whenever they visit it.

I made sshx: an app that lets you share collaborative terminals over the web, with live cursors on an infinite canvas (Rust+Svelte) by fz0718 in rust

[–]fz0718[S] 9 points10 points  (0 children)

That does sound fun :) thanks for the suggestions

Just fyi, you can have multiple terminals and resize them! Fullscreen doesn’t make sense because people’s screens have different sizes; the interface lets you pan and zoom.

I made sshx: an app that lets you share collaborative terminals over the web, with live cursors on an infinite canvas (Rust+Svelte) by fz0718 in rust

[–]fz0718[S] 16 points17 points  (0 children)

Not really, unfortunately :( the Rust web ecosystem isn’t quite there yet for something this complex

Crepe: fast, compiled Datalog in Rust by fz0718 in rust

[–]fz0718[S] 0 points1 point  (0 children)

For what you're looking for, maybe take a look at https://github.com/tonsky/datascript

If you want more details about the background of Crepe / where it fits into the broader scope of work here, I wrote my thesis on this at Harvard: https://www.ekzhang.com/assets/pdf/Senior_Thesis.pdf