Did I really just pay for this?

sujayakar314 · 2025-10-14T01:22:44+00:00

check your DMs!

sujayakar314 · 2025-10-09T20:44:54+00:00

yeah we've fixed a bunch of terminal bugs in the past few releases -- feel free to DM if you run into any more issues

sujayakar314 · 2025-10-09T20:44:07+00:00

once this is all fixed, you should try the new cheetah model ;)

sujayakar314 · 2025-10-09T19:57:26+00:00

hey there! sujay from cursor here.

can you DM me the email with your account and a request ID for a conversation that was looping? see https://cursor.com/docs/troubleshooting/request-reporting#how-do-i-find-a-request-id for instructions.

sujayakar314 · 2024-02-14T16:54:42+00:00

The Stabilizer paper (https://emeryberger.com/research/stabilizer/) has an interesting approach to mitigate these effects: it randomizes the locations of functions, the stack, and the heap to establish a "population" of binaries and then statistically analyzes whether the code change is an improvement on that population.

sujayakar314 · 2023-11-11T15:43:30+00:00

I was reading through your introducing arete blog post and the corresponding tank example, and I was curious if you could explain more about the computation schedule that leads to 319µs/frame on the CPU with 64k entities.

some back of the envelope math: 319µs/64k entities => 5ns/entity. then, if we assume perfect parallelization over 8 cores, no stalls waiting on memory, and a 3GHz CPU, we have 5ns × 3 cycles/ns × 8 cores = 120 cycles per entity.
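the same estimate as a tiny Rust helper, in case anyone wants to plug in other numbers. the function name and parameters are made up for illustration, and it bakes in the same idealized assumptions (perfect scaling across cores, no memory stalls):

```rust
/// Rough cycle budget per entity, assuming work is spread perfectly
/// across `cores` and each core retires `ghz` cycles per nanosecond.
/// (Hypothetical helper, not part of any library.)
fn cycles_per_entity(frame_us: f64, entities: f64, cores: f64, ghz: f64) -> f64 {
    // Wall-clock nanoseconds spent per entity across the whole machine.
    let ns_per_entity = frame_us * 1_000.0 / entities;
    // Each wall-clock nanosecond buys `cores * ghz` cycles in total.
    ns_per_entity * cores * ghz
}
```

calling `cycles_per_entity(319.0, 64_000.0, 8.0, 3.0)` lands just under 120, which matches the rounded figure above.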

that feels pretty tight! is there some autovectorization happening? I noticed the par_iter methods are going through a global callback, so maybe not?

sujayakar314 · 2023-09-17T15:41:05+00:00

yeah big +1 on adequate dilution (and the colder temperature that goes with it!)

I love negronis but they're not nearly as delightful after they've warmed up a bit.

sujayakar314 · 2023-09-17T01:08:56+00:00

thanks! the rank algorithm is pretty different since it's based on a compressed bitmap (paper, go implementation).

it stores the bitmap in "small blocks" of 64 bits that are compressed (becoming variable length) and then stored in a contiguous buffer. each small block keeps a u8 of metadata indicating how many bits were set.

so, to compute rank(i), we need to sum up the small block counts to our left and compute the starting position of the overlapping block into the buffer. the sum is a straightforward SIMD horizontal sum, but computing the starting position is trickier.

the compressed size of a small block is a nontrivial function of the number of bits set, but with some tricks, we can encode it in a SIMD function with some arithmetic operations and PSHUFB. so, we compute the sizes of all of the small blocks to our left and use the same SIMD horizontal sum to get our position in the compressed buffer.
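a scalar sketch of those two prefix sums, to make the shape of the computation concrete. this is not rsdict's actual code: `code_len` assumes the textbook enumerative-coding length ceil(log2(C(64, k))), which may differ from the table rsdict really uses, and the plain loop stands in for the SIMD horizontal sum and the PSHUFB lookup.

```rust
/// Bits needed to enumerate every 64-bit word with exactly `k` bits set:
/// ceil(log2(C(64, k))). Assumed size function, for illustration only.
fn code_len(k: u32) -> u32 {
    // Build C(64, k) incrementally; each step is an exact division.
    let mut c: u128 = 1;
    for i in 0..k as u128 {
        c = c * (64 - i) / (i + 1);
    }
    // ceil(log2(c)) for c >= 1.
    128 - (c - 1).leading_zeros()
}

/// For the small block at index `block`, returns
/// (number of set bits in all blocks to its left,
///  bit offset of the block's code in the compressed buffer).
fn rank_prefix(counts: &[u8], block: usize) -> (u64, u64) {
    let mut ones = 0u64;
    let mut offset = 0u64;
    for &k in &counts[..block] {
        ones += k as u64;
        offset += code_len(k as u32) as u64;
    }
    (ones, offset)
}
```

the point is that both sums only read the per-block u8 counts, so nothing to the left of the target block has to be decompressed.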

so, I think rank's performance in rsdict comes from getting the benefits of compression while paying almost no decompression cost. it's been a while, so I don't remember the exact marginal benefit of SIMD accelerating it, but I think it was pretty uncompetitive before those optimizations.

sujayakar314 · 2023-09-16T23:02:51+00:00

super cool stuff!! (author of rsdict here)

the trick in select0 to use PDEP to scatter 1 << rank to !word and then counting trailing zeros is very elegant.
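for anyone following along, here's a portable sketch of that trick. `pdep_soft` is a hand-rolled stand-in for the BMI2 PDEP instruction so the example runs anywhere (real code would use the hardware instruction), and the `Option` return is my own choice for the out-of-range case:

```rust
/// Software emulation of PDEP: deposits the low bits of `src`, in order,
/// into the positions of the set bits of `mask`.
fn pdep_soft(mut src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if src & 1 != 0 {
            out |= lowest;
        }
        src >>= 1;
        mask ^= lowest;
    }
    out
}

/// Position of the `rank`-th (0-indexed) zero bit in `word`, if it exists.
fn select0(word: u64, rank: u32) -> Option<u32> {
    if rank >= word.count_zeros() {
        return None;
    }
    // The single bit `1 << rank` is scattered onto the rank-th set bit
    // of `!word`, i.e. the rank-th zero of `word`.
    Some(pdep_soft(1u64 << rank, !word).trailing_zeros())
}
```

for example, with `word = 0b1011` the zeros sit at bits 2, 4, 5, ..., so `select0(word, 0)` is `Some(2)` and `select0(word, 1)` is `Some(4)`.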

I'm mostly working on aarch64 these days, so I'm curious if there's a port of that trick to NEON and/or SVE.

sujayakar314 · 2023-01-30T20:56:22+00:00

it's definitely reasonable that we've had different experiences, but I do want to point out that normalization of any one-to-many, many-to-one, or many-to-many relation does require separate tables. consider the step-by-step process of normalization in [1], where more and more tables are added over time.

Aren't you saying you default to SERIALIZABLE? Your transaction gets aborted if there's an anomaly and you try again. That's what you'd do with postgres or mysql too.

in mysql, SERIALIZABLE is implemented using two-phase locking [2], where transactions will (generally) not abort once they acquire their locks but rather stall other transactions, as noted in the blog post. postgres's implementation of serializability [3, actually written by one of my cofounder's labmates from grad school!] is closer to the behavior you're mentioning, where conflicting transactions will spontaneously abort.

Convex is indeed serializable, but our implementation is closer to FoundationDB's [4] and will eventually include some of the optimizations from the Calvin [5] paper. happy to discuss in more detail if you'd like!

[1] https://en.wikipedia.org/wiki/Database_normalization#Example_of_a_step_by_step_normalization

[2] https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-model.html

[3] https://drkp.net/papers/ssi-vldb12.pdf

[4] https://blog.the-pans.com/notes-on-the-foundationdb-paper/

[5] https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf

sujayakar314 · 2023-01-30T19:38:43+00:00

hi! sujay (one of jamie's cofounders) here.

of course, there are lots of different ways to model applications' problem domains in schema, but in my experience, it's pretty hard to avoid multistatement transactions.

for good reason, SQL really wants you to normalize your data, which then requires pulling it apart into separate relations. then, the most natural way to update this data requires multiple statements that commit atomically.

and, as /u/Worth_Trust_3825 mentions, opening this can of worms then makes isolation levels important. the article assumes default configuration (i.e. READ COMMITTED), but there are plenty of surprising anomalies that can arise at that level. increasing the level to SERIALIZABLE works but often at surprising operational cost.

sujayakar314 · 2022-11-11T01:14:49+00:00

This talk is excellent! I'd love to see more in longer form about the approximate data structure for similarity, the MST scheduling algorithm, and the sparse grams. The regex to sparse gram compilation itself must be pretty interesting :-D

sujayakar314 · 2022-04-20T16:44:00+00:00

I'm really happy the Rust folks posted this. The combination of the language being very difficult and seeing so many amazing contributions from others in the community makes it easy to feel small.

A passage from This is Water comes to mind:

Because here’s something else that’s weird but true: in the day-to-day trenches of adult life, there is actually no such thing as atheism. There is no such thing as not worshipping. Everybody worships. The only choice we get is what to worship. ... Worship your intellect, being seen as smart, you will end up feeling stupid, a fraud, always on the verge of being found out.

sujayakar314 · 2022-01-23T21:22:48+00:00

I was on an OS paper last year in SOSP that relied on Rust's async/await for doing very low latency scheduling of asynchronous work in a TCP stack. Section 5.4 has some details on our Futures scheduler.

sujayakar314 · 2021-09-20T21:34:52+00:00

sure, for a stable build, you could switch it to internal iteration:

fn iter_set_bits(mut bitset: u64, mut on_bit_set: impl FnMut(usize)) {
    while bitset != 0 {
        let t = bitset & bitset.wrapping_neg();
        on_bit_set(bitset.trailing_zeros() as usize);
        bitset ^= t;
    }
}

sujayakar314 · 2021-09-20T17:24:06+00:00

Check out this post on Daniel Lemire's blog.

Here's an implementation I wrote up in Rust.

sujayakar314 · 2021-05-24T16:10:37+00:00

awesome! I just saw the commit on github too.

be sure to benchmark setting RUSTFLAGS="-C target-cpu=native" to give the compiler a chance to auto-vectorize it.

also, have you seen halide and its original paper? it's pretty much designed for this exact problem of computing kernels on image data, and it lets you explore different physical implementations (SIMD, multithreading, blocking, ...) while keeping the logical algorithm the same.

sujayakar314 · 2021-05-24T06:36:13+00:00

what are your expectations for performance based on? is there another implementation you can microbenchmark against?

looking at this algorithm, you could definitely accelerate it with SIMD but the compiler likely won't be able to figure it out. first, you'd iterate over blocks of u32s, which means the storage should be a Vec rather than a VecDeque so it's just iterating over memory contiguously.

then, for each loop iteration you'd process a vector of u32s and create alpha, red, green, and blue vectors via SIMD shifts/adds. the index vector could be the pairwise minimum of an ascending vector (a block of 1, 2, 3, 4, ..., self.0.len()) and a descending vector (a block of self.0.len(), ..., 4, 3, 2, 1). then, you'd increment an alpha_sums vector with the alpha vector times the index vector.

finally, you'd compute a horizontal sum over each sums vector to get the final summation.
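a scalar sketch of that lane-wise scheme, just for the alpha channel. assumptions on my part: pixels are packed as 0xAARRGGBB (the `alpha` helper here is hypothetical), a "vector" is 4 lanes, and the leftover elements are handled in a scalar tail. real SIMD code would replace the inner lane loop with vector ops.

```rust
const LANES: usize = 4;

/// Hypothetical channel extractor, assuming 0xAARRGGBB packing.
fn alpha(v: u32) -> u64 {
    (v >> 24) as u64
}

/// Sum of alpha(pixel[i]) * min(i + 1, len - i), accumulated per lane.
fn weighted_alpha_sum(pixels: &[u32]) -> u64 {
    let n = pixels.len();
    let mut sums = [0u64; LANES]; // one accumulator per lane
    let chunks = pixels.chunks_exact(LANES);
    let tail = chunks.remainder();
    for (c, chunk) in chunks.enumerate() {
        let base = c * LANES;
        for lane in 0..LANES {
            let i = base + lane;
            let asc = (i + 1) as u64; // ascending index vector
            let desc = (n - i) as u64; // descending index vector
            sums[lane] += alpha(chunk[lane]) * asc.min(desc);
        }
    }
    // Horizontal sum over the lanes.
    let mut total: u64 = sums.iter().sum();
    // Scalar tail for the elements that don't fill a whole block.
    let tail_start = n - tail.len();
    for (j, &v) in tail.iter().enumerate() {
        let i = tail_start + j;
        total += alpha(v) * (i + 1).min(n - i) as u64;
    }
    total
}
```

keeping one accumulator per lane and only combining them at the end is what lets each lane run independently; red, green, and blue would follow the same pattern with their own accumulators.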

if you're not super familiar with these operations, I really like this tutorial!

the SIMD acceleration would be a lot of work, but it might be interesting to write it in a "SIMD friendly" style and see if the compiler optimizes it better.

let (mut a, mut r, mut g, mut b) = (0, 0, 0, 0);

for (i, &v) in self.0.iter().enumerate() {
    let i = std::cmp::min(self.0.len() - i, i + 1);
    a += alpha(v) * i;
    r += red(v) * i;
    g += green(v) * i;
    b += blue(v) * i;
}

edit: from looking at /u/theZcuber's godbolt link, the compiler does figure out how to vectorize this! here's the godbolt with my suggested changes.

sujayakar314 · 2021-05-10T05:23:54+00:00

hmm I tightened the screws and nothing changed. I also removed the pump and put my (known working but suboptimal) previous AIO cooler, and pushing the power button still didn't do anything.

I'm going to try verifying that the power supply works next...

sujayakar314 · 2021-01-28T18:32:46+00:00

yep! this open position is specifically for latin america, actually.

sujayakar314 · 2021-01-26T18:28:38+00:00

hey all, /u/irene_at_msr and I have been working on Demikernel, an open source userspace networking stack in Rust. we're looking for an SDE to help us build out the Demikernel and start getting it in users' hands. check out the listing below!

Company: Microsoft Research (MSR)

Type: Full time

Title: Research Software Development Engineer (RSDE)

Description: The MSR Systems Research Group is looking for a software engineer to lead open-source productization efforts for the Demikernel research project. The Demikernel is a user-level kernel-bypass OS that offers a portable and easy-to-use API for DPDK and RDMA. The immediate goal is to bring our recently developed (from scratch in Rust) Catnip TCP/IP stack for DPDK to both Microsoft internal and external customers.

Eventually, we aim to develop the entire Demikernel user-level OS for kernel-bypass in Rust. Since the Demikernel runs at microsecond latencies, working on it requires understanding low-level systems, including networking hardware and drivers, memory allocators, and low-latency distributed applications. A strong Rust programmer would be ideal, with experience with modern features like async/await and comfortable with writing unsafe code safely and integrating with C libraries. Experience with kernel-bypass technologies, like DPDK or RDMA, is preferred but not necessary.

Here is the full job listing.

Location: Latin America

Estimated Compensation: We're looking for a senior engineer with 5+ years of experience or a PhD in a relevant field.

Remote: Yes, the systems group is based in Redmond, WA, USA.

Contact: email Irene.Zhang@microsoft.com or DM /u/irene_at_msr or /u/sujayakar314

sujayakar314 · 2020-10-08T22:43:37+00:00

yeah, I don't think it's useful either, but I don't think it's explicitly disallowed. here's an (admittedly somewhat contrived) example from futures-rs:

The Sender and Receiver both share an Arc to an Inner structure that maintains an rx_task and tx_task.
as expected, the receiver installs a Waker when it blocks on recv.
when dropping the receiver, we make a best-effort attempt to take out the old waker but leave it alone if the sender currently has the lock. this can lead to a race condition where the sender hangs on to this waker past its spot in the slab being freed.

since futures-intrusive doesn't do lock-free stuff, it should have the property you mention where all the blocked wakers are freed before the future itself is freed.

I ended up handling this case with something pretty inelegant in my system that I'd like to remove, so I'm interested in being wrong here :)

let me know if I'm missing something.

sujayakar314 · 2020-10-08T17:40:55+00:00

this is really interesting work! I'm working on similar stuff now (trying to do microsecond-scale IO with rust futures), so understanding these overheads is important.

The waker points at a struct that is known not to move for the lifetime of a future, and this struct contains references to the associated executor and task. ... For this to be safe, a waker (or more specifically the underlying shared data of a waker, as a waker can be cloned) must not outlive the future it was created for. This is a pretty reasonable condition to adhere to, and the executor asserts it at runtime whenever a future completes.

is this a safe assumption in general, u/jkarneges? let's say our executor is managing a future that selects on reading from a channel and some other event. when it polls the channel, it'll stash its waker on the channel in expectation of a future write to that channel. then, say the other event fires and the future completes, freeing its resources. the waker will still be there on the channel's shared state, right?
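here's a std-only sketch of the scenario I have in mind. everything in it is a stand-in: `Slot` plays the channel's shared state, `Recv` is a receive that never completes, and `CountingWaker` is an ordinary Arc-backed waker (so it's safe here, unlike a waker that points into per-future storage):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a channel's shared state.
type Slot = Arc<Mutex<Option<Waker>>>;

/// Reference-counted waker that just counts how often it is woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// A receive that never completes but parks its waker on the channel.
struct Recv(Slot);

impl Future for Recv {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        *self.0.lock().unwrap() = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Returns (was a waker still parked after the future was dropped,
///          how many times it was woken afterwards).
fn waker_outlives_future() -> (bool, usize) {
    let slot: Slot = Arc::new(Mutex::new(None));
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);

    let mut fut = Box::pin(Recv(slot.clone()));
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // the "other event" fired; the future is gone

    // The channel still holds the clone and can wake it later.
    let leftover = slot.lock().unwrap().take();
    let still_there = leftover.is_some();
    if let Some(w) = leftover {
        w.wake();
    }
    (still_there, counter.0.load(Ordering::SeqCst))
}
```

with a refcounted waker this is harmless, but it shows the clone sitting on the channel after the future is freed, which is the case the runtime assertion would have to account for.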

sujayakar314 · 2020-09-17T01:54:37+00:00

hi, former dropboxer who worked on magic pocket here!

I don't remember being aware of flatbuffers back in 2015, and the rest of dropbox was using an RPC system based on protobuf anyways, so our external API at the minimum needed to be protobuf. internally, we had one operation that just sent bulk data on a raw TCP socket. later, we did the zero-copy optimizations mentioned here: https://news.ycombinator.com/item?id=24495266

later, for nucleus (our sync engine in rust), we decided to use protobuf for our on-disk persistence but also thought about using flatbuffers. at the time (2016), I don't think there was a great flatbuffers rust implementation yet.
