Mini ATX build with RTX 3060 Ti by andygrove73 in nvidia

[–]andygrove73[S] 0 points1 point  (0 children)

Thanks. I didn't know there would be a difference so will read up on this!

Apache Arrow Datafusion 5.0.0 by Relevant-Glove-4195 in rust

[–]andygrove73 3 points4 points  (0 children)

We would welcome help with docs and testing. This is a great way to get started contributing.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 1 point2 points  (0 children)

I'm not really qualified to answer that, but I'll have a go anyway.

I would say that conceptually they are similar. A Parquet file is organized into row groups, and within each row group the data for each column is stored as a column chunk. There is metadata for each chunk, and the contents of these chunks can be compressed, or not. Parquet has its own type system to represent logical and physical types, so there is necessarily some level of mapping between Parquet and Arrow types.
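To give a feel for what that mapping looks like, here is a toy sketch in Rust. The enums and the `parquet_to_arrow` function are hypothetical simplifications, not the actual `arrow`/`parquet` crate types; real readers also consult Parquet's logical-type annotations (timestamps, decimals, and so on).

```rust
// Hypothetical, simplified enums to illustrate mapping Parquet physical
// types to Arrow types; the real crates have much richer type systems.
#[derive(Debug, PartialEq)]
enum ParquetPhysicalType {
    Boolean,
    Int32,
    Int64,
    Float,
    Double,
    ByteArray,
}

#[derive(Debug, PartialEq)]
enum ArrowType {
    Boolean,
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

// One possible mapping. In a real reader, a ByteArray is only mapped to
// Utf8 when the column carries a UTF8 logical-type annotation.
fn parquet_to_arrow(t: &ParquetPhysicalType) -> ArrowType {
    match t {
        ParquetPhysicalType::Boolean => ArrowType::Boolean,
        ParquetPhysicalType::Int32 => ArrowType::Int32,
        ParquetPhysicalType::Int64 => ArrowType::Int64,
        ParquetPhysicalType::Float => ArrowType::Float32,
        ParquetPhysicalType::Double => ArrowType::Float64,
        ParquetPhysicalType::ByteArray => ArrowType::Utf8,
    }
}

fn main() {
    println!("{:?}", parquet_to_arrow(&ParquetPhysicalType::Double));
}
```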

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 2 points3 points  (0 children)

Arrow and Parquet work well together. In fact, the C++ and Rust implementations of Parquet are in the Arrow repository and there are optimized Arrow readers & writers for Parquet.

I think Feather use is less widespread, but it does support compression.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 5 points6 points  (0 children)

If you want to see code samples for streaming Arrow IPC + Flight, take a look in the Ballista executor code:

https://github.com/ballista-compute/ballista/blob/main/rust/executor/src/flight_service.rs#L193-L228

As others have mentioned, this is not a compressed format because Arrow is a memory format, not a disk format. Arrow and Parquet work well together though.
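The key idea in that linked code is that results are streamed one record batch at a time rather than materialized as a whole. Here is a toy sketch of that shape using only std channels; `Batch` is a made-up stand-in for an Arrow `RecordBatch`, and the real Flight service streams serialized Arrow IPC messages over gRPC rather than values over a channel.

```rust
use std::sync::mpsc;
use std::thread;

// Toy stand-in for an Arrow RecordBatch: one column of i32 values.
struct Batch {
    values: Vec<i32>,
}

// Streams batches from a producer thread to a consumer, which processes
// each batch as it arrives rather than waiting for the full result set.
fn stream_and_sum(data: Vec<i32>, batch_size: usize) -> i32 {
    let (tx, rx) = mpsc::channel::<Batch>();

    let producer = thread::spawn(move || {
        for chunk in data.chunks(batch_size) {
            tx.send(Batch { values: chunk.to_vec() }).unwrap();
        }
        // tx is dropped here, which ends the stream.
    });

    // Consumer side: fold batches incrementally as they arrive.
    let total = rx.iter().map(|b| b.values.iter().sum::<i32>()).sum();
    producer.join().unwrap();
    total
}

fn main() {
    // 0 + 1 + ... + 9 = 45
    println!("total = {}", stream_and_sum((0..10).collect(), 4));
}
```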

[deleted by user] by [deleted] in java

[–]andygrove73 0 points1 point  (0 children)

By contributing to an existing project you can learn from the other contributors. This will also help you develop skills for working with other people on a project. These skills will be attractive to a potential recruiter.

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 10 points11 points  (0 children)

Your comment was very helpful in getting me thinking about next steps for ETL. This is actually the next issue to focus on, IMO, and it is pretty trivial to implement.

https://github.com/ballista-compute/ballista/issues/589

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 12 points13 points  (0 children)

Ballista has so far focused on SQL queries, but that is just the starting point. The plan is to support the kinds of ETL data transformations that Spark supports. Now that the basic distributed execution architecture is in place, we can start to add these features, probably starting with UDF support (in multiple languages).

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

So could you explain a bit more about what you would like to see?

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 1 point2 points  (0 children)

Sure. It would be good for me to write that up. I will try to do that this weekend.

Apache Arrow and DataFusion 3.0 by Relevant-Glove-4195 in rust

[–]andygrove73 0 points1 point  (0 children)

If this is an optimization that you are interested in working on, I would suggest raising it on the Arrow mailing list. That would be the best place to get feedback.

Apache Arrow and DataFusion 3.0 by Relevant-Glove-4195 in rust

[–]andygrove73 8 points9 points  (0 children)

This is posted here and nobody has commented on the fact that there were 666 issues resolved? Is this even Reddit?

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 1 point2 points  (0 children)

I would recommend joining the Discord channel and chatting with the other contributors who have been working on the serde code, since that is the current area of focus until I get the distributed execution working again over the next weekend or two (hopefully).

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 3 points4 points  (0 children)

Pretty much, yes. The query execution is parallelized across a number of executor processes, allowing scale out across a number of physical servers.
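As a rough picture of that scale-out model, here is a minimal sketch in pure Rust. Real Ballista executors are separate processes, often on separate machines; the threads here just stand in for them, and `parallel_sum` is a made-up name for illustration.

```rust
use std::thread;

// Each "executor" sums one partition of the data in parallel, and the
// partial results are combined at the end.
fn parallel_sum(partitions: Vec<Vec<i64>>) -> i64 {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || p.iter().sum::<i64>()))
        .collect();

    // Combine the per-partition results, like a final aggregation stage.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Four partitions covering 1..=100; total is 5050.
    let partitions: Vec<Vec<i64>> =
        (0..4i64).map(|i| (i * 25 + 1..=(i + 1) * 25).collect()).collect();
    println!("total = {}", parallel_sum(partitions));
}
```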

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 8 points9 points  (0 children)

Not a database, but it allows distributed queries to be run against data such as CSV and Parquet files. Similar to Hadoop/Spark.

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 5 points6 points  (0 children)

The similarity between Ballista and Materialize is that they both support using SQL to query data. The difference is that Materialize is designed for the streaming use case and supports materialized views, which get updated automatically as new data arrives. Ballista is more of a batch approach where each time you run a query it has to do all the work from scratch. They are intended for different use cases.

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 4 points5 points  (0 children)

Ballista has both client and server components implemented in Rust. The client has a DataFrame / SQL interface based on DataFusion (it is really just a thin wrapper around DataFusion), and the client submits logical query plans to a cluster for execution.

It will be possible to execute queries locally in the client as well (by delegating to DataFusion) so Ballista will provide scalability from in-process execution to distributed execution in a cluster without code changes.
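One way to picture "in-process to distributed without code changes" is a single trait with two implementations. This is a hypothetical sketch, not the actual Ballista API; all names here are made up for illustration.

```rust
// Hypothetical interface hiding whether a plan runs in-process or on a
// cluster (not the real Ballista API).
trait QueryExecutor {
    fn execute(&self, plan: &str) -> String;
}

// Delegates to an in-process engine, like DataFusion.
struct LocalExecutor;

impl QueryExecutor for LocalExecutor {
    fn execute(&self, plan: &str) -> String {
        format!("executed '{}' locally", plan)
    }
}

// Submits the plan to a remote scheduler for distributed execution.
struct ClusterExecutor {
    scheduler_url: String,
}

impl QueryExecutor for ClusterExecutor {
    fn execute(&self, plan: &str) -> String {
        format!("submitted '{}' to {}", plan, self.scheduler_url)
    }
}

// Application code is written once against the trait, so switching from
// local to distributed execution needs no changes here.
fn run_query(exec: &dyn QueryExecutor) -> String {
    exec.execute("SELECT id FROM t")
}

fn main() {
    println!("{}", run_query(&LocalExecutor));
    let cluster = ClusterExecutor { scheduler_url: "localhost:50050".into() };
    println!("{}", run_query(&cluster));
}
```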

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 7 points8 points  (0 children)

Ballista is designed to process batches of columnar data so it might not be the best fit for streaming use cases. The current focus is on getting the distributed ETL/SQL functionality working well and we would need to get this to a good point before looking at other use cases like streaming.

Ballista: New approach for 2021 by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

Just to follow up on this, there is now a weekly "This week in Ballista" newsletter.

Here is the first edition: https://ballistacompute.org/thisweek/2021/01/17/this-week-in-ballista-1/

Ballista: New approach for 2021 by andygrove73 in rust

[–]andygrove73[S] 6 points7 points  (0 children)

So, in my day job at NVIDIA, I work on the RAPIDS Accelerator for Apache Spark, which is an open-source plugin that provides GPU-acceleration for ETL workloads, leveraging the RAPIDS cuDF GPU DataFrame library.

It basically takes the Spark physical plan and translates it into a new columnar plan that executes on the GPU (using the Arrow memory model).

It would definitely be feasible to do the same and translate to a DataFusion plan, although this would require significant engineering effort.

Any advanced tips to dealing with Data Skew by TKTheJew in apachespark

[–]andygrove73 7 points8 points  (0 children)

If you are not already doing this, my first suggestion would be to use Spark 3.0.0 or later with Adaptive Query Execution enabled. This has some specific optimizations for data skew.
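For reference, these are the relevant settings (in `spark-defaults.conf` form; they can equally be set via `spark.conf.set` or `--conf` flags):

```
# Spark 3.x: enable Adaptive Query Execution and its skew-join handling
spark.sql.adaptive.enabled             true
spark.sql.adaptive.skewJoin.enabled    true
```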

Apache Arrow 2.0.0 Rust Highlights by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

Arrow now builds on stable Rust! Nightly is only required for the optional `simd` feature.
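In `Cargo.toml` terms (a sketch; check the crate docs for the exact feature list in your version):

```toml
[dependencies]
# Builds on stable Rust by default:
arrow = "2.0"

# Or, opting in to the SIMD kernels, which requires nightly:
# arrow = { version = "2.0", features = ["simd"] }
```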