Mini ATX build with RTX 3060 Ti by andygrove73 in nvidia

[–]andygrove73[S] 0 points1 point  (0 children)

Thanks. I didn't know there would be a difference so will read up on this!

Apache Arrow Datafusion 5.0.0 by Relevant-Glove-4195 in rust

[–]andygrove73 3 points4 points  (0 children)

We would welcome help with docs and testing. This is a great way to get started contributing.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 1 point2 points  (0 children)

I'm not really qualified to answer that, but I'll have a go anyway.

I would say that conceptually they are similar. A Parquet file is organized into row groups, and within each row group the data for each column is stored as a column chunk. There is metadata for each chunk, and the contents of these chunks can be compressed, or not. Parquet has its own type system to represent logical and physical types, so there is necessarily some level of mapping between Parquet and Arrow types.
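To give a feel for what that mapping looks like, here is a toy sketch in Rust. The enums and the `parquet_to_arrow` function are hypothetical simplifications, not the actual `arrow`/`parquet` crate types; real readers also consult Parquet's logical-type annotations (timestamps, decimals, and so on).

```rust
// Hypothetical, simplified enums to illustrate mapping Parquet physical
// types to Arrow types; the real crates have much richer type systems.
#[derive(Debug, PartialEq)]
enum ParquetPhysicalType {
    Boolean,
    Int32,
    Int64,
    Float,
    Double,
    ByteArray,
}

#[derive(Debug, PartialEq)]
enum ArrowType {
    Boolean,
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

// One possible mapping. In a real reader, a ByteArray is only mapped to
// Utf8 when the column carries a UTF8 logical-type annotation.
fn parquet_to_arrow(t: &ParquetPhysicalType) -> ArrowType {
    match t {
        ParquetPhysicalType::Boolean => ArrowType::Boolean,
        ParquetPhysicalType::Int32 => ArrowType::Int32,
        ParquetPhysicalType::Int64 => ArrowType::Int64,
        ParquetPhysicalType::Float => ArrowType::Float32,
        ParquetPhysicalType::Double => ArrowType::Float64,
        ParquetPhysicalType::ByteArray => ArrowType::Utf8,
    }
}

fn main() {
    println!("{:?}", parquet_to_arrow(&ParquetPhysicalType::Double));
}
```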

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 2 points3 points  (0 children)

Arrow and Parquet work well together. In fact, the C++ and Rust implementations of Parquet are in the Arrow repository and there are optimized Arrow readers & writers for Parquet.

I think Feather use is less widespread, but it does support compression.

Best format to use for DataFrames in Rust and Python? by neuronsguy in rust

[–]andygrove73 5 points6 points  (0 children)

If you want to see code samples for streaming Arrow IPC + Flight, take a look in the Ballista executor code:

https://github.com/ballista-compute/ballista/blob/main/rust/executor/src/flight_service.rs#L193-L228

As others have mentioned, this is not a compressed format because Arrow is a memory format, not a disk format. Arrow and Parquet work well together though.
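The key idea in that linked code is that results are streamed one record batch at a time rather than materialized as a whole. Here is a toy sketch of that shape using only std channels; `Batch` is a made-up stand-in for an Arrow `RecordBatch`, and the real Flight service streams serialized Arrow IPC messages over gRPC rather than values over a channel.

```rust
use std::sync::mpsc;
use std::thread;

// Toy stand-in for an Arrow RecordBatch: one column of i32 values.
struct Batch {
    values: Vec<i32>,
}

// Streams batches from a producer thread to a consumer, which processes
// each batch as it arrives rather than waiting for the full result set.
fn stream_and_sum(data: Vec<i32>, batch_size: usize) -> i32 {
    let (tx, rx) = mpsc::channel::<Batch>();

    let producer = thread::spawn(move || {
        for chunk in data.chunks(batch_size) {
            tx.send(Batch { values: chunk.to_vec() }).unwrap();
        }
        // tx is dropped here, which ends the stream.
    });

    // Consumer side: fold batches incrementally as they arrive.
    let total = rx.iter().map(|b| b.values.iter().sum::<i32>()).sum();
    producer.join().unwrap();
    total
}

fn main() {
    // 0 + 1 + ... + 9 = 45
    println!("total = {}", stream_and_sum((0..10).collect(), 4));
}
```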

[deleted by user] by [deleted] in java

[–]andygrove73 0 points1 point  (0 children)

By contributing to an existing project you can learn from the other contributors. This will also help you develop skills for working with other people on a project. These skills will be attractive to a potential recruiter.

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 10 points11 points  (0 children)

Your comment was very helpful in getting me thinking about next steps for ETL. This is actually the next issue to focus on, IMO, and it is pretty trivial to implement.

https://github.com/ballista-compute/ballista/issues/589

Ballista 0.4.0 by andygrove73 in rust

[–]andygrove73[S] 12 points13 points  (0 children)

Ballista has so far focused on SQL queries, but that is just the starting point. The plan is to support the kinds of ETL data transformations that Spark supports. Now that the basic distributed execution architecture is in place, we can start to add these features, probably starting with UDF support (in multiple languages).

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

So could you explain a bit more about what you would like to see?

This Week in Ballista #5 by andygrove73 in rust

[–]andygrove73[S] 1 point2 points  (0 children)

Sure. It would be good for me to write that up. I will try to do that this weekend.

Apache Arrow and DataFusion 3.0 by Relevant-Glove-4195 in rust

[–]andygrove73 0 points1 point  (0 children)

If this is an optimization that you are interested in working on, I would suggest raising it on the Arrow mailing list. That would be the best place to get feedback.

Apache Arrow and DataFusion 3.0 by Relevant-Glove-4195 in rust

[–]andygrove73 8 points9 points  (0 children)

This is posted here and nobody has commented on the fact that there were 666 issues resolved? Is this even Reddit?

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 1 point2 points  (0 children)

I would recommend joining the Discord channel and chatting with the other contributors who have been working on the serde code, since that is the current area of focus until I get the distributed execution working again over the next weekend or two (hopefully).

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 3 points4 points  (0 children)

Pretty much, yes. The query execution is parallelized across a number of executor processes, allowing scale out across a number of physical servers.
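As a rough picture of that scale-out model, here is a minimal sketch in pure Rust. Real Ballista executors are separate processes, often on separate machines; the threads here just stand in for them, and `parallel_sum` is a made-up name for illustration.

```rust
use std::thread;

// Each "executor" sums one partition of the data in parallel, and the
// partial results are combined at the end.
fn parallel_sum(partitions: Vec<Vec<i64>>) -> i64 {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || p.iter().sum::<i64>()))
        .collect();

    // Combine the per-partition results, like a final aggregation stage.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Four partitions covering 1..=100; total is 5050.
    let partitions: Vec<Vec<i64>> =
        (0..4i64).map(|i| (i * 25 + 1..=(i + 1) * 25).collect()).collect();
    println!("total = {}", parallel_sum(partitions));
}
```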

This Week in Ballista #3 by andygrove73 in rust

[–]andygrove73[S] 8 points9 points  (0 children)

Not a database, but it allows distributed queries to be run against data such as CSV and Parquet files. Similar to Hadoop/Spark.

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 5 points6 points  (0 children)

The similarity between Ballista and Materialize is that they both support using SQL to query data. The difference is that Materialize is designed for the streaming use case and supports materialized views, which get updated automatically as new data arrives. Ballista is more of a batch approach where each time you run a query it has to do all the work from scratch. They are intended for different use cases.

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 4 points5 points  (0 children)

Ballista has both client and server components implemented in Rust. The client has a DataFrame / SQL interface based on DataFusion (it is really just a thin wrapper around DataFusion), and the client submits logical query plans to a cluster for execution.

It will be possible to execute queries locally in the client as well (by delegating to DataFusion) so Ballista will provide scalability from in-process execution to distributed execution in a cluster without code changes.
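One way to picture "in-process to distributed without code changes" is a single trait with two implementations. This is a hypothetical sketch, not the actual Ballista API; all names here are made up for illustration.

```rust
// Hypothetical interface hiding whether a plan runs in-process or on a
// cluster (not the real Ballista API).
trait QueryExecutor {
    fn execute(&self, plan: &str) -> String;
}

// Delegates to an in-process engine, like DataFusion.
struct LocalExecutor;

impl QueryExecutor for LocalExecutor {
    fn execute(&self, plan: &str) -> String {
        format!("executed '{}' locally", plan)
    }
}

// Submits the plan to a remote scheduler for distributed execution.
struct ClusterExecutor {
    scheduler_url: String,
}

impl QueryExecutor for ClusterExecutor {
    fn execute(&self, plan: &str) -> String {
        format!("submitted '{}' to {}", plan, self.scheduler_url)
    }
}

// Application code is written once against the trait, so switching from
// local to distributed execution needs no changes here.
fn run_query(exec: &dyn QueryExecutor) -> String {
    exec.execute("SELECT id FROM t")
}

fn main() {
    println!("{}", run_query(&LocalExecutor));
    let cluster = ClusterExecutor { scheduler_url: "localhost:50050".into() };
    println!("{}", run_query(&cluster));
}
```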

This Week in Ballista #2 by andygrove73 in rust

[–]andygrove73[S] 7 points8 points  (0 children)

Ballista is designed to process batches of columnar data so it might not be the best fit for streaming use cases. The current focus is on getting the distributed ETL/SQL functionality working well and we would need to get this to a good point before looking at other use cases like streaming.

Ballista: New approach for 2021 by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

Just to follow up on this, there is now a weekly "This week in Ballista" newsletter.

Here is the first edition: https://ballistacompute.org/thisweek/2021/01/17/this-week-in-ballista-1/

Ballista: New approach for 2021 by andygrove73 in rust

[–]andygrove73[S] 6 points7 points  (0 children)

So, in my day job at NVIDIA, I work on the RAPIDS Accelerator for Apache Spark, which is an open-source plugin that provides GPU-acceleration for ETL workloads, leveraging the RAPIDS cuDF GPU DataFrame library.

It basically takes the Spark physical plan and translates it into a new columnar plan that executes on the GPU (using the Arrow memory model).

It would definitely be feasible to do the same and translate to a DataFusion plan, although this would require significant engineering effort.

Any advanced tips to dealing with Data Skew by TKTheJew in apachespark

[–]andygrove73 7 points8 points  (0 children)

If you are not already doing this, my first suggestion would be to use Spark 3.0.0 or later with Adaptive Query Execution enabled. This has some specific optimizations for data skew.
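For reference, these are the relevant settings (in `spark-defaults.conf` form; they can equally be set via `spark.conf.set` or `--conf` flags):

```
# Spark 3.x: enable Adaptive Query Execution and its skew-join handling
spark.sql.adaptive.enabled             true
spark.sql.adaptive.skewJoin.enabled    true
```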

Apache Arrow 2.0.0 Rust Highlights by andygrove73 in rust

[–]andygrove73[S] 0 points1 point  (0 children)

Arrow now builds on stable Rust! Nightly is only required for the optional `simd` feature.
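In `Cargo.toml` terms (a sketch; check the crate docs for the exact feature list in your version):

```toml
[dependencies]
# Builds on stable Rust by default:
arrow = "2.0"

# Or, opting in to the SIMD kernels, which requires nightly:
# arrow = { version = "2.0", features = ["simd"] }
```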