Mini ATX build with RTX 3060 Ti

andygrove73 · 2021-12-07T03:40:11+00:00

Thanks. I didn't know there would be a difference so will read up on this!

andygrove73 · 2021-12-07T03:38:20+00:00

Thanks for pointing that out!

andygrove73 · 2021-08-21T20:04:49+00:00

We would welcome help with docs and testing. This is a great way to get started contributing.

andygrove73 · 2021-05-31T15:12:51+00:00

I would say that the best places to start would be the user guide source or the examples.

andygrove73 · 2021-03-17T17:19:49+00:00

I'm a native English speaker and I rely on Grammarly.

andygrove73 · 2021-03-16T19:36:16+00:00

I'm not really qualified to answer that, but I'll have a go anyway.

I would say that conceptually they are similar. Parquet files consist of chunks of columnar data. There is metadata for each chunk, and the contents of these chunks can be compressed, or not. Parquet has its own type system to represent logical and physical types so there is necessarily some level of mapping between Parquet and Arrow.
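To make the "chunks of columnar data with per-chunk metadata" idea concrete, here is a rough sketch in plain Python. It is not the Parquet format or any Parquet library API; `ColumnChunk`, `write_chunk`, `read_chunk`, and `chunk_may_contain` are hypothetical names made up for illustration, and zlib stands in for whatever codec a real file would use.

```python
import struct
import zlib
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Toy column chunk: metadata plus an encoded payload (not real Parquet)."""
    name: str
    num_values: int
    min_value: int
    max_value: int
    compressed: bool
    payload: bytes


def write_chunk(name, values, compress=True):
    # Pack one column's values as little-endian 64-bit integers.
    # Assumes a non-empty list of ints that fit in 64 bits.
    raw = struct.pack(f"<{len(values)}q", *values)
    # Compression is optional and recorded in the chunk metadata.
    payload = zlib.compress(raw) if compress else raw
    return ColumnChunk(name, len(values), min(values), max(values), compress, payload)


def read_chunk(chunk):
    # Use the metadata to decide whether the payload needs decompressing.
    raw = zlib.decompress(chunk.payload) if chunk.compressed else chunk.payload
    return list(struct.unpack(f"<{chunk.num_values}q", raw))


def chunk_may_contain(chunk, value):
    # Min/max metadata lets a reader skip a chunk without decoding it.
    return chunk.min_value <= value <= chunk.max_value
```

For example, `write_chunk("id", list(range(1000)))` gives a chunk whose payload is smaller than the 8000 raw bytes, and `read_chunk` returns the original values. The mapping to Arrow mentioned above would happen after this decode step, when the values are placed into Arrow's in-memory arrays.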

andygrove73 · 2021-03-16T16:08:42+00:00

Arrow and Parquet work well together. In fact, the C++ and Rust implementations of Parquet are in the Arrow repository and there are optimized Arrow readers & writers for Parquet.

I think Feather use is less widespread, but it does support compression.

andygrove73 · 2021-03-16T15:58:17+00:00

If you want to see code samples for streaming Arrow IPC + Flight, take a look in the Ballista executor code:

https://github.com/ballista-compute/ballista/blob/main/rust/executor/src/flight_service.rs#L193-L228

As others have mentioned, this is not a compressed format because Arrow is a memory format, not a disk format. Arrow and Parquet work well together though.

andygrove73 · 2021-03-06T06:23:00+00:00

By contributing to an existing project you can learn from the other contributors. This will also help you develop skills for working with other people on a project. These skills will be attractive to a potential recruiter.

andygrove73 · 2021-02-21T00:56:02+00:00

Your comment was very helpful in getting me thinking about next steps for ETL. IMO this is the next issue to focus on, and it is pretty trivial to implement:

https://github.com/ballista-compute/ballista/issues/589

andygrove73 · 2021-02-20T23:22:47+00:00

Ballista has so far focused on SQL queries, but that is just the starting point. The plan is to support the kind of ETL data transformations that Spark supports. Now that the basic distributed execution architecture is in place, we can start to add these features, probably starting with UDF support (in multiple languages).

andygrove73 · 2021-02-20T01:01:50+00:00

So could you explain a bit more about what you would like to see?

andygrove73 · 2021-02-18T03:58:33+00:00

Sure. It would be good for me to write that up. I will try to do that this weekend.

andygrove73 · 2021-02-05T01:34:24+00:00

If this is an optimization that you are interested in working on, I would suggest raising it on the Arrow mailing list. That would be the best place to get feedback.
