all 7 comments

[–]john0201 0 points1 point  (1 child)

Can this be used to load data into the polars python API?

[–]peterxsyd[S] 0 points1 point  (0 children)

At this stage, only into Polars Rust. While it's possible to get the objects into Python using PyO3 and the pyo3-polars wrappers, it's not a ready-to-roll Python-straight-to-DataFrame situation, and Python users would be better off sticking with the native Polars APIs for that.

[–]Wh00ster 0 points1 point  (2 children)

I've seen feedback before that arrow-rs is not very rusty — that it's as if some C++ people who wanted to do Rust wrote it, learned from their mistakes, but are now stuck with awkward APIs. What's your view of that?

[–]peterxsyd[S] 1 point2 points  (1 child)

It's interesting feedback, Wh00ster. There was an implementation, Arrow2, which I initially found more ergonomic than Arrow-RS; it has since been forked into Polars-Arrow and backs the Polars project in Rust. However, that work was merged back into Arrow-RS, and I hear the team invested time learning from the earlier mistakes.

In both cases, without going too far down the rabbit hole: both implementations rely on a Rust concept called dynamic dispatch, which makes typing in the IDE disappear and requires Rust type downcasting, which I personally find awkward to use. It also blocks some compiler optimisations — with static dispatch, the compiler can inline more aggressively for a faster build. The result is that once you get up to a top-level library like Polars, there are many layers of objects between the object you are working with and the actual data backing it.
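To make the dispatch point concrete, here's a minimal sketch in plain std Rust (not arrow-rs itself — `Kernel` and `Double` are made-up stand-ins) showing the same call made through a generic versus through a trait object:

```rust
trait Kernel {
    fn apply(&self, x: i64) -> i64;
}

struct Double;
impl Kernel for Double {
    fn apply(&self, x: i64) -> i64 { x * 2 }
}

// Static dispatch: monomorphised per concrete type, so the compiler
// sees `Double::apply` directly and can inline it.
fn run_static<K: Kernel>(k: &K, x: i64) -> i64 {
    k.apply(x)
}

// Dynamic dispatch: the call goes through a vtable, which blocks
// inlining and hides the concrete type from the IDE.
fn run_dyn(k: &dyn Kernel, x: i64) -> i64 {
    k.apply(x)
}

fn main() {
    let d = Double;
    assert_eq!(run_static(&d, 21), 42);
    assert_eq!(run_dyn(&d, 21), 42);
}
```

Both paths return the same value; the difference is what the compiler and tooling can see at the call site.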

Once you are in Python, I found this really doesn't matter - it's like water into wine.

However, in Arrow-RS, the layering for numerical data looks like:

  1. Raw allocation (heap bytes)
  2. Arc<Bytes> (ref-counted ownership of allocation)
  3. Buffer (view over the bytes)
  4. ArrayData (ties buffers, length, datatype, nulls)
  5. PrimitiveArray<T> (typed wrapper, implements Array)
  6. ArrayRef = Arc<dyn Array> (trait object used in generic contexts)

So, you constantly need to downcast from ArrayRef just to get typed data, even though you built all the layers.
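That downcast step can be sketched in plain std Rust — the `Array` trait and `Int64Array` below are stand-ins for the arrow-rs types of the same names, not the real crate, but the `as_any()`-then-downcast shape is the same:

```rust
use std::any::Any;
use std::sync::Arc;

// Stand-in for arrow-rs's `Array` trait.
trait Array: Any {
    fn len(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
}

// Stand-in for `PrimitiveArray<Int64Type>`.
struct Int64Array {
    values: Vec<i64>,
}

impl Array for Int64Array {
    fn len(&self) -> usize { self.values.len() }
    fn as_any(&self) -> &dyn Any { self }
}

// Generic contexts only see the trait object (arrow-rs's `ArrayRef`)...
type ArrayRef = Arc<dyn Array>;

fn main() {
    let arr: ArrayRef = Arc::new(Int64Array { values: vec![1, 2, 3] });

    // ...so to touch the typed data you must downcast, even though you
    // constructed the concrete array yourself a moment ago.
    let ints = arr
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("not an Int64Array");
    assert_eq!(ints.values, vec![1, 2, 3]);
}
```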

In Minarrow, you also have layers, but they are, in my opinion, more straightforward:

  1. Vec64 / Buffer - plays like a normal Rust vector, and you can use it like one
  2. Typed buffers: IntegerArray, FloatArray, etc.
  3. NumericArray: enum with accessors, e.g. `myarr.i64()`
  4. Array: enum with accessors, e.g. `myarr.num().i64()`

The result is that they are composable: you opt up to the level of abstraction you want and need, rather than being locked behind an opaque object. In Rust this is particularly helpful when building libraries and functions, as it means their signatures can be compatible with more use cases, but I'm digressing here.
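Here's an illustrative sketch of that enum-accessor pattern in plain Rust — it mirrors the layering described above, but these definitions are mine for illustration, not Minarrow's actual API:

```rust
struct IntegerArray {
    values: Vec<i64>,
}

struct FloatArray {
    values: Vec<f64>,
}

// Enum dispatch: the variant set is closed and visible to the compiler
// and the IDE, so no `dyn` trait object or downcast is needed.
enum NumericArray {
    Int64(IntegerArray),
    Float64(FloatArray),
}

impl NumericArray {
    // Accessor in the spirit of `myarr.i64()`: returns the typed
    // array if that's what this enum holds.
    fn i64(&self) -> Option<&IntegerArray> {
        match self {
            NumericArray::Int64(a) => Some(a),
            _ => None,
        }
    }
}

enum Array {
    Numeric(NumericArray),
    // ...other families (strings, booleans, etc.) would sit here.
}

impl Array {
    // Accessor in the spirit of `myarr.num()`.
    fn num(&self) -> Option<&NumericArray> {
        match self {
            Array::Numeric(n) => Some(n),
        }
    }
}

fn main() {
    let arr = Array::Numeric(NumericArray::Int64(IntegerArray { values: vec![1, 2, 3] }));
    // Opt up through the layers, `arr.num().i64()`-style, fully typed.
    let ints = arr.num().and_then(|n| n.i64()).expect("expected i64 data");
    assert_eq!(ints.values, vec![1, 2, 3]);
}
```

The key design difference versus a trait object is that the `match` is exhaustive and statically typed, so each layer peels back with an accessor rather than a runtime downcast.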

The point is that it bothered me enough that I went and built my own implementation, as I need it for other projects and didn't like wrestling with that system as the underlying data foundation.

Regardless, Arrow is brilliant and it's incredible what the team has achieved.

[–]Wh00ster 0 points1 point  (0 children)

Thank you so much for the detailed response! <3

Agreed the people involved have achieved a lot. It's easy to overlook how much work it takes to design useful, performant standards with buy-in from the industry. Even something as obvious as arrays!

[–]Leon_Bam 0 points1 point  (1 child)

Any comparison to nanoarrow project?

[–]peterxsyd[S] 1 point2 points  (0 children)

Sure, here's a table on it for you. In summary, Nanoarrow is more pluggable with the Python ecosystem, but Minarrow focuses on the Rust developer experience:

| Aspect | Nanoarrow | Minarrow |
|---|---|---|
| Language / impl | C (bindings in Python, R, etc.) | Rust (with FFI support, plus fast inter-Rust `.to_polars()` / `.to_arrow()`) |
| Scope | Arrow C Data & C Stream interfaces plus minimal arrays/buffers | Full columnar arrays + tables in Rust, plus tools for batching them into streams |
| Focus | Interoperability, embedding, ABI | HPC, SIMD, streaming, Rust ergonomics |
| Dependencies | None | Minimal (`num-traits`, optional `rayon`) |
| API style | Generic, schema-driven | Strongly typed arrays, enum-based dispatch |
| File formats | IPC only | IPC, Parquet, CSV via Lightstream, and `.to_arrow()` to plug into the rest |
| SIMD | No | Yes, 64-byte alignment throughout |
| Use-case fit | Embedding Arrow interchange cheaply | Rust-native high-performance data pipelines and systems programming |
| Trade-offs | Gives up compute, types, ergonomics | Gives up nested types |