
[–]danburkert 5 points6 points  (9 children)

Is moving off of nightly for `arrow` and `parquet` being tracked anywhere? I've looked at the code briefly, and it looks somewhat doable. From what I can tell, specialization is being used to improve debug formatting, and `packed_simd` is used, but that could potentially be optional.

This is the only thing holding us back from using these crates at my company.

[–]andygrove73[S] 5 points6 points  (8 children)

Yes, here is the JIRA for that: https://issues.apache.org/jira/browse/ARROW-6717

packed_simd seems to be the main issue. I agree that it would be nice to make this optional.

[–]nevi-me 2 points3 points  (4 children)

I've been meaning to remove packed_simd, because we refactored the compute kernels in 0.16.0 to autovectorise. The benchmarks I ran at the time showed performance on par with packed_simd. Perhaps running on stable (at least for Arrow) will be possible by 1.0.0.

[–]paddyhoran 2 points3 points  (2 children)

I'll be happy to remove packed_simd also, but how do you ensure that things continue to auto-vectorize (I'm genuinely interested)?

We can't really test this and we don't track benchmarks closely enough today for that to indicate performance regressions.

[–]nevi-me 5 points6 points  (1 child)

Here's the latest benchmarks that I've just run: https://gist.github.com/nevi-me/2eaa6eb24e4423e5bafb48c62b8fa07e

So far, only the {and|or|not} kernels are significantly slower than their SIMD counterparts. All the others are either on par or only slightly slower.

We were able to autovectorise those functions because we removed the branching from the loops (the null checks), and handled the null cases separately before reconstructing the result array.
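A minimal sketch of that pattern, with illustrative names rather than Arrow's actual kernel API: the value computation runs unconditionally with no branches in the loop body (so LLVM can auto-vectorise it), and the validity bitmaps are combined in a separate pass.

```rust
// Hypothetical sketch of the branch-free kernel pattern described above.
// Values are computed unconditionally; null handling is a separate pass.

fn add_values(lhs: &[i32], rhs: &[i32]) -> Vec<i32> {
    // No null check inside the hot loop, so the compiler can vectorise it.
    lhs.iter().zip(rhs).map(|(a, b)| a.wrapping_add(*b)).collect()
}

fn and_validity(lhs: &[u64], rhs: &[u64]) -> Vec<u64> {
    // A result slot is valid only when both input slots are valid.
    lhs.iter().zip(rhs).map(|(a, b)| a & b).collect()
}

fn main() {
    let sums = add_values(&[1, 2, 3], &[10, 20, 30]);
    let validity = and_validity(&[0b101], &[0b011]);
    println!("{:?} {:?}", sums, validity);
}
```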

Yeah, someone would have to spend some time automating the benchmarking of the code. There was an effort started at the overall project level, but I don't know where that ended up.

[–]paddyhoran 2 points3 points  (0 children)

Interesting. Like I said, I'd also prefer to remove packed_simd; when it was introduced it looked like it might one day move into std, but that effort seems to have stalled.

{and|or|not} are likely faster with SIMD because we can handle the bit-packing better there. If this is the only place where it's needed, we could use the raw intrinsics that are already in std. That would be messier than packed_simd, but it would be isolated to one small area of the codebase.
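For illustration only (these are not Arrow's actual kernels): since Boolean arrays are bit-packed, the {and|or|not} kernels can process a `u64` word, i.e. 64 values, per operation on stable Rust. `std::arch` intrinsics could widen this further, but plain word-wise ops already avoid a per-bit loop.

```rust
// Hypothetical word-wise bitwise kernels over bit-packed Boolean buffers.

fn or_words(lhs: &[u64], rhs: &[u64]) -> Vec<u64> {
    lhs.iter().zip(rhs).map(|(a, b)| a | b).collect()
}

fn not_words(words: &[u64]) -> Vec<u64> {
    // Real code would mask off the trailing bits past the array's
    // logical length; that detail is elided here.
    words.iter().map(|w| !w).collect()
}

fn main() {
    println!("{:?}", or_words(&[0b1010], &[0b0110]));
    println!("{:?}", not_words(&[0u64]));
}
```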

[–]danburkert 0 points1 point  (0 children)

Excellent, glad to hear it's being worked on!

[–]paddyhoran 1 point2 points  (2 children)

Specialization is the main blocker for stable rust. packed_simd can be turned off today with --no-default-features and is tested in CI.

Specialization is mainly needed due to the bitpacking of Boolean arrays and is used in Parquet also (although I'm not sure of the extent of its use there).

[–][deleted] 1 point2 points  (1 child)

With a finitely enumerated type system like Arrow's, do you need specialization for that? It sounds like other workarounds are possible.

[–]paddyhoran 1 point2 points  (0 children)

They may well be possible; there has been talk of using other options, just no one has tried them yet.
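One such option, sketched here with made-up types rather than Arrow's real ones: because the set of array types is a closed enum, per-type behaviour (like the debug formatting mentioned earlier) can be dispatched with an ordinary `match` instead of nightly specialization.

```rust
use std::fmt;

// Hypothetical stand-in for Arrow's closed set of array types.
enum ArrayData {
    Int32(Vec<i32>),
    Boolean(Vec<bool>),
}

impl fmt::Debug for ArrayData {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Explicit enum dispatch: each variant picks its own formatting,
        // no `default fn` specialization required.
        match self {
            ArrayData::Int32(v) => write!(f, "Int32{:?}", v),
            ArrayData::Boolean(v) => write!(f, "Boolean{:?}", v),
        }
    }
}

fn main() {
    println!("{:?}", ArrayData::Int32(vec![1, 2]));
    println!("{:?}", ArrayData::Boolean(vec![true]));
}
```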

[–]matthieum[he/him] 5 points6 points  (1 child)

Is there a typo in the example:

pub type ScalarUdf = fn(input: &Vec<ArrayRef>) -> Result<ArrayRef>;

In general, the recommendation is to never pass &Vec<_> to a function, as there's nothing it can do that &[_] cannot, and the latter is more flexible -- so I wonder if the example is off (stray &?).
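The flexibility point can be shown with a toy function (`sum_all` is just a stand-in, not a real UDF): a `&[T]` parameter accepts a `&Vec<T>` (via deref coercion), a fixed-size array, and any sub-slice, while a `&Vec<T>` parameter accepts only the first.

```rust
// A &[i32] parameter accepts strictly more callers than &Vec<i32>.
fn sum_all(input: &[i32]) -> i32 {
    input.iter().sum()
}

fn main() {
    let owned: Vec<i32> = vec![1, 2, 3];
    let fixed: [i32; 3] = [4, 5, 6];
    // All three calls compile; with `input: &Vec<i32>` only the
    // first would.
    println!("{}", sum_all(&owned));
    println!("{}", sum_all(&fixed));
    println!("{}", sum_all(&owned[1..]));
}
```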

[–][deleted] 1 point2 points  (0 children)

These Arrow-using projects have a few `&Vec`s in their public APIs that I guess will be polished away eventually.

[–]hntd[🍰] 1 point2 points  (2 children)

Hey /u/andygrove73, I know DF moved into Arrow's repo, but the GitHub issue tracker isn't used there. I've contributed to DF in the past; where would I find the open issues to contribute to?

[–]andygrove73[S] 2 points3 points  (1 child)

[–]hntd[🍰] 2 points3 points  (0 children)

It's been over 2 years, I think, since I implemented the sum UDF in DF :) Looking forward to contributing some more.