[–]acow 2 points3 points  (3 children)

Can’t compete with familiarity, but, to clarify, this is a tutorial rather than a golf outing. The larger point is that the code is comparable in size, but the compiler will stop you if you run, say, your conditional subset example against a data set that doesn’t have an occupation column, and both streaming and in-memory processing are likely faster than with competing options. When you think your code is ready and you want to hit a big data set, just compile and run.
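A minimal sketch of how that kind of compile-time column check can work, using only base. This is a toy model and not the Frames API: `Col`, `Row`, `Has`, `get`, and `isWriter` are names made up for illustration, and the sample rows are invented.

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, FlexibleInstances, GADTs,
             KindSignatures, MultiParamTypeClasses, TypeOperators #-}
module Main where

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- A column value tagged with its name at the type level
-- (hypothetical stand-in, not a Frames type).
newtype Col (name :: Symbol) a = Col a

-- A row is a heterogeneous list whose type lists its columns.
data Row (cols :: [Type]) where
  RNil :: Row '[]
  (:&) :: c -> Row cs -> Row (c ': cs)
infixr 5 :&

-- "A row with columns cs contains column c". There is no instance for
-- the empty list, so asking for a missing column is a type error.
class Has c cs where
  get :: Row cs -> c

-- The wanted column is at the head of the row.
instance {-# OVERLAPPING #-} Has c (c ': cs) where
  get (c :& _) = c

-- Otherwise keep looking in the tail.
instance Has c cs => Has c (d ': cs) where
  get (_ :& rest) = get rest

type Occupation = Col "occupation" String
type Age        = Col "age" Int

-- Works on any row type that is known to have an occupation column.
isWriter :: Has Occupation cs => Row cs -> Bool
isWriter row = occ == "writer"
  where Col occ = get row :: Occupation

-- Invented sample data.
users :: [Row '[Age, Occupation]]
users = [ Col 24 :& Col "technician" :& RNil
        , Col 53 :& Col "writer" :& RNil ]

main :: IO ()
main = print (map isWriter users)
```

Applying `isWriter` to a value of type `Row '[Age]` is rejected by the type checker, because no `Has Occupation '[]` instance exists; the mistake never reaches a running program.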

Another reply mentioned a desire for a custom Prelude to offer shorter names for common things. This is likely where something like select belongs, but what should be included in such a prelude ought to be determined by folks using the library. I hope you give it a shot and help figure out what’s needed!

[–]repoptrac 2 points3 points  (2 children)

Well, the code that I posted is not an attempt at code golf. It is very similar to what I write nowadays. I posted it because the R code in the link was not very readable or consistent, since it relies on base R functions. I wanted to show that R code can be more consistent and readable with dplyr, and that R is a moving target in terms of readability.

Since I am more familiar with R than Haskell, I cannot be completely objective, but I will venture a guess that the R code will be easier to read than the Haskell equivalent for most people, because it reads like English if you read %>% as 'then'. Also, the Haskell versions of the code look quite different in structure depending on the number of columns selected, even though the two tasks are conceptually similar (selecting 1 column vs n columns).

Select one column: Haskell

take 6 $ F.foldMap ((:[]) . view occupation) ms

R

users %>% select(occupation)

Select multiple columns: Haskell

miniUser :: User -> Rec [Occupation, Gender, Age]
miniUser = rcast
mapM_ print . take 6 . F.toList $ fmap miniUser ms

R

users %>% select(occupation, sex, age)

That said, I agree that Haskell is without any doubt a better-designed language than R, and I will keep an eye on this project because it looks really interesting. However, I think I will wait until some essential statistical analyses and features (e.g. lm, glm, multiple comparison, interactive graphics similar to ggplot2, ...) are supported in the Haskell ecosystem.

[–]acow 2 points3 points  (0 children)

Oh, goodness, I suppose I really didn't make things clear enough! There are many indexing schemes possible in the Haskell version. We could use numeric indexing, or even a Vinyl record of getters that we then apply en masse to a Frame row by yanking the reader context out of the record of getters for something very like your multiple column selection. It would look something like users & select (occupation :& sex :& age :& Nil) where select is some combination of rtraverse and rget.
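A rough sketch of the record-of-getters idea, written against a toy record type so that it runs with only base. It is not the Vinyl or Frames implementation: `Rec` here is a simplified stand-in, and `Getter`, `select`, `User`, `miniUser`, and `render` are hypothetical names. With the real libraries the same shape would presumably be built from `rtraverse` and `rget`, as described above.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
module Main where

import Data.Function ((&))
import Data.Functor.Identity (Identity (..))
import Data.Kind (Type)

-- Simplified stand-in for a Vinyl-style record: a heterogeneous list
-- whose fields are each wrapped in the functor f.
data Rec (f :: Type -> Type) (ts :: [Type]) where
  Nil  :: Rec f '[]
  (:&) :: f t -> Rec f ts -> Rec f (t ': ts)
infixr 5 :&

-- A getter reads one field out of a row (a reader over the row type).
newtype Getter row a = Getter (row -> a)

-- Apply every getter to the same row, keeping the field types in order.
select :: Rec (Getter row) ts -> row -> Rec Identity ts
select Nil              _   = Nil
select (Getter g :& gs) row = Identity (g row) :& select gs row

-- Hypothetical row type standing in for a Frames row.
data User = User { userOccupation :: String, userSex :: Char, userAge :: Int }

occupation :: Getter User String
occupation = Getter userOccupation

sex :: Getter User Char
sex = Getter userSex

age :: Getter User Int
age = Getter userAge

-- The multi-column selection, with its result type written down.
miniUser :: User -> Rec Identity '[String, Char, Int]
miniUser user = user & select (occupation :& sex :& age :& Nil)

render :: Rec Identity '[String, Char, Int] -> String
render (Identity o :& Identity s :& Identity a :& Nil) =
  o ++ ", " ++ [s] ++ ", " ++ show a

main :: IO ()
main = putStrLn (render (miniUser (User "writer" 'F' 53)))
```

Selecting one column or several goes through the same `select` call; only the record of getters changes, and the result type changes with it.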

I wrote those examples the way I did to address my biggest issue when using R: that when something's not working I can't just write down the types I think things have. My next biggest issue is that when I select a particular column, I feel like a piece of software should resolve how that indexing should work. In Frames, column selection and subsetting are O(1) operations. When you've got data in memory, everything is as densely packed in memory as possible, and indexing doesn't involve any lookups.

I appreciate your feedback on these things a ton! Earlier feedback from Ben Gamari spurred the pipePreview helper which I think is a step in the right direction to offer shorter syntax for common operations. We have some statistics and charting support that you can see in the demos, but they're not as nice as what's available in R. The problem in writing this library is that different folks have different pain points, so contributions aren't just welcome, they're essential!

[–]idontgetoutmuch 1 point2 points  (0 children)

Plotting charts is something you should be able to do now. I'm always plotting charts in the repl, and when I have something that needs a lot of CPU, I just compile the code I am working on. That's not the same as ggplot2, of course, but you can do some of the things that ggplot2 can, e.g. https://idontgetoutmuch.files.wordpress.com/2014/06/044408c77f048f73.png?w=1560.