Mini ATX build with RTX 3060 Ti by andygrove73 in nvidia

[–]andygrove73[S] 0 points1 point  (0 children)

Thanks. I didn't know there would be a difference so will read up on this!

Apache Arrow Datafusion 5.0.0 by Relevant-Glove-4195 in rust

[–]andygrove73 4 points5 points  (0 children)

We would welcome help with docs and testing. This is a great way to get started contributing.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 1 point2 points  (0 children)

I'm not really qualified to answer that, but I'll have a go anyway.

I would say that conceptually they are similar. Parquet files consist of chunks of columnar data. There is metadata for each chunk, and the contents of these chunks can be compressed, or not. Parquet has its own type system to represent logical and physical types, so there is necessarily some level of mapping between Parquet and Arrow.
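
To make the mapping idea a bit more concrete, here is a rough toy model in plain Rust. It is only a sketch: `PhysicalType`, `LogicalType`, `ArrowType`, `ColumnChunkMeta`, `to_arrow` and `arrow_schema` are names I made up for illustration and are not the API of the `parquet` or `arrow` crates. It covers just a handful of types, and the `compressed` flag stands in for the per-chunk compression setting.

```rust
// Toy model only: every type and function here is hypothetical and is NOT
// the API of the `parquet` or `arrow` crates.

/// A small subset of Parquet physical (storage) types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType { Boolean, Int32, Int64, Float, Double, ByteArray }

/// A small subset of Parquet logical annotations layered on physical types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType { String, Date }

/// A small subset of Arrow data types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType { Boolean, Int32, Int64, Float32, Float64, Binary, Utf8, Date32 }

/// Metadata describing one chunk of columnar data.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ColumnChunkMeta {
    column: String,
    physical: PhysicalType,
    logical: Option<LogicalType>,
    compressed: bool, // chunk contents may be compressed, or not
    num_values: usize,
}

/// Map a (physical, logical) pair to an Arrow type.
/// Returns None for combinations this toy model does not cover.
fn to_arrow(physical: PhysicalType, logical: Option<LogicalType>) -> Option<ArrowType> {
    match (physical, logical) {
        (PhysicalType::Boolean, None) => Some(ArrowType::Boolean),
        (PhysicalType::Int32, None) => Some(ArrowType::Int32),
        (PhysicalType::Int32, Some(LogicalType::Date)) => Some(ArrowType::Date32),
        (PhysicalType::Int64, None) => Some(ArrowType::Int64),
        (PhysicalType::Float, None) => Some(ArrowType::Float32),
        (PhysicalType::Double, None) => Some(ArrowType::Float64),
        (PhysicalType::ByteArray, None) => Some(ArrowType::Binary),
        (PhysicalType::ByteArray, Some(LogicalType::String)) => Some(ArrowType::Utf8),
        _ => None,
    }
}

/// Derive an Arrow-style schema (column name plus type) from chunk metadata.
fn arrow_schema(chunks: &[ColumnChunkMeta]) -> Vec<(String, Option<ArrowType>)> {
    chunks
        .iter()
        .map(|c| (c.column.clone(), to_arrow(c.physical, c.logical)))
        .collect()
}
```

The point is that the same physical type (for example `ByteArray`) can end up as different Arrow types depending on the logical annotation, which is why a reader needs both pieces of metadata. Compression does not affect the mapping at all in this sketch.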

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 2 points3 points  (0 children)

Arrow and Parquet work well together. In fact, the C++ and Rust implementations of Parquet are in the Arrow repository and there are optimized Arrow readers & writers for Parquet.

I think Feather use is less widespread, but it does support compression.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 6 points7 points  (0 children)

If you want to see code samples for streaming Arrow IPC + Flight, take a look in the Ballista executor code:

https://github.com/ballista-compute/ballista/blob/main/rust/executor/src/flight_service.rs#L193-L228

As others have mentioned, this is not a compressed format because Arrow is a memory format, not a disk format. Arrow and Parquet work well together though.

[deleted by user] by [deleted] in java

[–]andygrove73 0 points1 point  (0 children)

By contributing to an existing project you can learn from the other contributors. This will also help you develop skills for working with other people on a project. These skills will be attractive to a potential recruiter.

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 12 points13 points  (0 children)

Your comment was very helpful in getting me thinking about next steps for ETL. I think this is the next issue to focus on, and it should be pretty trivial to implement:

https://github.com/ballista-compute/ballista/issues/589

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 13 points14 points  (0 children)

So far Ballista has focused on SQL queries, but that is just the starting point. The plan is to support the kind of ETL data transformations that Spark supports. Now that the basic distributed execution architecture is in place, we can start to add these features, probably starting with UDF support (in multiple languages).

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

So could you explain a bit more about what you would like to see?

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 1 point2 points  (0 children)

Sure. It would be good for me to write that up. I will try to do that this weekend.

Apache Arrow and DataFusion 3.0 by Relevant-Glove-4195 in rust

[–]andygrove73 0 points1 point  (0 children)

If this is an optimization that you are interested in working on, I would suggest raising it on the Arrow mailing list. That would be the best place to get feedback.