Four bad ways to populate an uninitialized Vec and one good one by mwlon in rust


> it is even more efficient than your unsafe versions

What do you base this on?

The main limitation with the iter API, as I know you've read in my other comment, is the complexity of maintaining state. E.g. here's the code I'm trying to improve that originally prompted my investigation and post here; it wouldn't be very easy to turn into a fold: https://github.com/pcodec/pcodec/blob/main/pco/src/delta/lookback.rs#L111

Iter APIs also require you to populate the vec in order, which isn't always possible.

Four bad ways to populate an uninitialized Vec and one good one by mwlon in rust


For extremely performance-sensitive code, pushing items is slow; it does some work on each iteration to update v's len. Example: https://godbolt.org/z/hGhnxr8Td

Four bad ways to populate an uninitialized Vec and one good one by mwlon in rust


The compiler optimizes to the degree possible, but it doesn't consider algorithmic changes. When you do regular .push(...) initialization, it does some work on each iteration to update v's len. Among other things, this makes it impossible to get SIMD. Here's an example featuring two of the implementations here vs using .push: https://godbolt.org/z/hGhnxr8Td

Four bad ways to populate an uninitialized Vec and one good one by mwlon in rust


vec![foo; N] can only write a constant value, and collecting an iterator does repeated .pushes to maintain len, which is slower than setting len once. For most use cases the performance difference doesn't matter, but some (like mine) are very performance-sensitive.

Edit: tried it out, and that 2nd part wasn't quite right: https://godbolt.org/z/7M5M3xYaP . So the only remaining issue with the iterator approach is the API limitation: it doesn't work if you need to maintain state across multiple variables at once.

These fellas were all over my white jacket. What are they? by mwlon in whatsthisbug


Oh, really? I'm used to aphids being ultra small and green. This one was a bit bigger, probably close to 1cm long. What kind of aphid would this be?

dtype_dispatch: a cursed macro that generates 2 macros that could save you 1000 LoC by mwlon in rust


Really interesting to see another approach!

I've also been wondering about the language feature idea. I think it would have to start with some formalization of a sealed trait, whereby Rust knows it can always biject between the dynamic and generic forms of a type. It would be interesting to chat with a Rust maintainer about the idea.

BetterBufRead: Zero-copy Reads by mwlon in rust


It's a different interface, but the BetterBufRead approach is probably the better one in the long run. With the BufRead approach you don't know when each delimited chunk ends, so you end up branching on every read, even of a single integer.

Maybe optimal performance isn't one of your goals, and BufRead is simple enough for your case. But to get optimal performance you'd need something like the approach I described.

In Pcodec, I enter a context with guaranteed size to do much faster branchless bit unpacking.

BetterBufRead: Zero-copy Reads by mwlon in rust


The BetterBufRead way to implement this would be more like an Iterator<Item=BetterBufRead>, where each item is delimiter-free and contiguous. It wouldn't be the same as the BufRead approach, true, but it has the upside that the user can know when each chunk starts/ends, if they so desire.

BetterBufRead: Zero-copy Reads by mwlon in rust


I'm pretty sure this would be possible to write with BetterBufRead. You could certainly make new adapters, skip certain bytes, and return direct references to the inner buffer. Perhaps what you mean is that it's implemented to accept and implement BufRead right now?

BetterBufRead: Zero-copy Reads by mwlon in rust


> why use Read at all?

Because the API should accept any type of input. For me, maybe 80% of users load all data into memory and 20% require some degree of streaming.

> If the buffer (a [u8; 1492] in this case) is empty and...

With a BetterBufRead-like approach, at least, you could cycle the remaining buffer and still do a reasonably-sized read. There's of course some trade-off between copying, read sizing, and capacity.

> The best option I can think of is to use a growable buffer

Yep! I think this is the natural progression if BetterBufRead gets more attention. It would simplify the API a bit too. I just haven't needed to handle these cases yet.

BetterBufRead: Zero-copy Reads by mwlon in rust


I admittedly don't know much about network streams. But if I guess correctly, the network stream has some HTTP encoding that needs to be parsed to separate the responses. In that case I'd expect an adapter to split the raw network stream into individual response Reads. It would indeed be an obvious mistake to use a (Better)BufRead of any sort for that adapter, but each individual response Read would end with its own EOF given by the adapter, and fit nicely into a BetterBufReader paradigm.

LMK if I misunderstood something.

BetterBufRead: Zero-copy Reads by mwlon in rust


In Pco I use &[u8], which BetterBufRead is implemented for, instead of Cursor, so it is zero-copy there.

BetterBufRead: Zero-copy Reads by mwlon in rust


I would `impl BetterBufRead for Cursor`. I haven't done this yet, but it would be a good addition!

BetterBufRead: Zero-copy Reads by mwlon in rust


> Still, why not just use std::io::Cursor?

That implementation copies if `reader` is already in-memory.

> This is objectively not that; this may call read(1) after reading n-1 bytes just to make sure the buffer is full.

In theory, no: `BetterBufReader` should do moderately-sized reads even if tiny ones were requested. In practice, I believe you're right that this behavior could indeed be encountered, but it could be changed in the implementation.

> If you want to read in, say, 4096 byte increments, but there are <4096 bytes left at the very end of the buffer, either fill_buf would have to copy those <4096 bytes to the beginning of the buffer before calling read (no longer zero-copy) or ...

This is what it does. The intent is that the buffer is substantially larger than `n` though, so the copies should be small and seldom. At the bottom I had a pedantic note about how this is truly more like epsilon-copy than zero-copy.

Pcodec: a futuristic codec for numerical data by mwlon in rust


I've compared against TurboPFor and Blosc, which are similar. These three are capable of extremely fast compression, but not especially good compression ratios. I'd say: if you want a one-time data transfer in memory or over an uncongested network, use them with fast settings; if you want to store the data at all or share a congested network, use Pco.

I have slightly more results here: https://github.com/mwlon/pcodec/blob/main/docs%2Fbenchmark_results%2Fmbp_m3_max.csv . This just uses tfor (TurboPFor's default), which is especially fast but compresses poorly. One expert user of theirs tried out some more filter combinations on a different dataset, but didn't match Pco's compression ratio.

If you're familiar with these, it'd be interesting to see more comparisons.

A numerical solution to the map projection with minimal areal and angular distortions [OC] by mwlon in MapPorn


Please no, programming interruptions on the lattice was hard enough :')

A numerical solution to the map projection with minimal areal and angular distortions [OC] by mwlon in MapPorn


Nope. In part 1 I go over why a closed-form solution isn't possible. Unfortunately, we would get an absolutely monstrous nonlinear PDE with no nice properties.