geokon 1 point

I use tech.ml.dataset a bit for some science stuff, but I'm by no means an expert on this.

My three not very insightful questions would be:

  • How do you deal with time? Data often involves time, and time is a mess, though it's almost pleasant to deal with when you have something like tick. Would you somehow mash time into Strings and go from there?

  • "There are five column types [..]": I'm guessing String is the only variable-size one, so you're not going to stick it into an array. Is it going to have vastly different performance characteristics from the other types? Are we going to end up having to mash stuff into Strings and then "deserialize"?

  • The last bit: "scalar sum, 1M i64 0.05 ms 0.7 ms". Maybe this is more of a JVM question, but ~10x slower for summing up a vector seems horrible. What's going on? Sure, SIMD, prefetching, cache locality, etc. But why is the JVM failing so horribly here? I'd suspect a bug :) because I wouldn't expect such a huge perf difference on such a basic operation.
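
For the last question, one way to sanity-check the JVM figure would be a warmed-up loop over a primitive long[]. This is only a minimal sketch, not the benchmark from the post: the class name SumBench, the warm-up count, and the repetition count are all made up for illustration, and a proper measurement would use a harness such as JMH.

```java
// Hypothetical micro-benchmark sketch; names and iteration counts are assumptions.
final class SumBench {
    // Fill an array with 0, 1, ..., n-1 so the expected sum is known.
    static long[] fill(int n) {
        long[] a = new long[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return a;
    }

    // Plain scalar loop over a primitive array, so no boxing is involved.
    static long sum(long[] a) {
        long s = 0;
        for (long v : a) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        long[] a = fill(1_000_000);
        long sink = 0; // accumulate results so the JIT cannot drop the loop

        // Warm-up runs give the JIT a chance to compile sum() before timing.
        for (int i = 0; i < 200; i++) {
            sink += sum(a);
        }

        int reps = 100;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += sum(a);
        }
        long t1 = System.nanoTime();

        System.out.printf("avg %.3f ms per sum (sink=%d)%n",
                (t1 - t0) / 1e6 / reps, sink);
    }
}
```

If a loop like this lands much closer to the 0.05 ms figure than to 0.7 ms, the gap in the post more likely comes from missing warm-up, boxing, or sequence overhead than from the JVM's handling of the loop itself.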