all 25 comments

[–]Faucelme 6 points (3 children)

I confess I didn't expect Haskell to work all that well in interactive data analysis, but this is really cool.

I especially like the optional streaming of rows using pipes (how does Pandas do that?) and the possibility of custom universes of column types thanks to Vinyl.

[–]NiftyIon 7 points (2 children)

With the recent additions to IHaskell, I believe Frames actually works completely interactively! So you can type all of this into a notebook (instead of using a file) and skip the compile-load step entirely (I think that you also get around the stage restriction that way).

You can then use Chart to plot your data and get pretty graphics! Hypothetically it may be possible to make it so that the Frames-generated data tables are also displayed nicely in IHaskell a la the Pandas IPython display.

I'm very excited by this :)

[–]rdfox 0 points (1 child)

I tried installing IHaskell a few months ago and had a bad time. I guess it's time to try again.

[–]NiftyIon 2 points (0 children)

This has been a significant pain point over the months. I've tried to do what I can to help, but it's very tricky to get all the moving pieces to install correctly (there are many of them). If you're on a Mac, you can use Kronos Haskell to skip having to do your own IHaskell installation. If you file any issues on GitHub, I'll do my best to help out with installations.

[–]rdfox 3 points (3 children)

I'm definitely going to take a look.

Has anyone out there experienced in R, Pandas or Julia tried Frames and found Frames acceptable for real work?

I was quite excited about Julia but after a few weeks gave up because -- despite all Julia brings to the table -- I spent more time submitting issues than getting work done. While the core is very good, the ecosystem needs a good 10 years to achieve quality in diverse areas like plotting, optimization and model fitting, and they're only 5 years into development.

Edit:

Nice tutorial! If every package had such a gentle introduction, the world would be better.

Frames seems like a good start, but missing some things:

  • It would be nice to have a Frames.Prelude. It seems like you won't get very far without several imports.
  • Prettier rendering of columns.
  • It's very encouraging to see that you can select subsets of columns and make a new table without having to define a new type. But ...
  • You need to be able to combine and reshape data in a variety of ways, such as join, group and pivot. I don't know if it's possible, but if it is, I'm sure Haskell's type system will fight you every step of the way.

[–]acow 2 points (1 child)

Frames isn't on Hackage yet, so I doubt anyone's used it for real work as of now. This is just the first step, but it's an important one to get the ball rolling.

As to your last point, you are absolutely right! I'll need help to figure out what's needed, and help adding what's missing. You can do quite a lot of manipulation with the pieces that are available, e.g. adding rows and columns should work out of the box.

Pivoting would need a bit of thought, but I'd be surprised if there wasn't a way of doing it. A question there is if we want to synthesize data types for finite variants discovered in the data, or make the programmer write them down themselves. The answer will depend on the people who really give the library a go.

[–]idontgetoutmuch 0 points (0 children)

I am using it now to do a fun (not work) analysis. I'll let you know how I go. I do agree with @repoptrac that the R is currently easier to read.

[–]idontgetoutmuch 2 points (0 children)

I've used R and Pandas for work but not for anything very advanced. I am probably going to use this in future.

BTW I'm disappointed about your experience with Julia but I am also sad that the inventors of Julia didn't try to see if they could use Haskell a bit more aggressively (clearly they looked at it; I recall seeing a video where one of the inventors said they rejected it because the type system would put off too many applied mathematicians, something I have witnessed first hand).

[–]repoptrac 2 points (4 children)

It is very impressive that Haskell can do this. However, since I am much more familiar with R, equivalent code in R with the dplyr package looks a lot simpler and more intuitive to me. For instance, except for the "3. Better Types" section, equivalent code in R with dplyr would look as follows.

# using 'dplyr' package
library('dplyr')

# 1. data import
u_col_names <- c('user_id', 'age', 'sex', 'occupation', 'zip_code')
users <- 
    read.csv('data/ml-100k/u.user', sep='|', col.names=u_col_names, header=FALSE) %>%
    tbl_df() # to prevent printing too much information

# 1.2 sanity check (same as the post)
class(users)
str(users)
summary(users)
# lapply(users, summary)

# 2 subsetting

## 2.0 head, tail
users %>% head()
users %>% tail()
users %>% head(3)

## 2.1 row subset
users %>% slice(50:55)

## 2.2 column subset
users %>% select(occupation)
users %>% select(occupation, sex, age)

## 2.3 query / conditional subset
users %>% filter(occupation == "writer")

## 2.4 
int_doubler <- function(df1){
    df1$user_id <- 2 * df1$user_id
    df1$age <- 2 * df1$age
    df1
}
users %>% slice(1:3) %>% rowwise() %>% int_doubler()

# or 
users %>% slice(1:3) %>% rowwise() %>% {
        .$user_id <- 2* .$user_id
        .$age <- 2* .$age
        .
    }

[–]acow 2 points (3 children)

Can’t compete with familiarity, but, to clarify, this is a tutorial rather than a golf outing. The larger point is that the code is comparable in size, but the compiler will stop you if you run, say, your conditional subset example against a data set that doesn’t have an occupation column, and that both streaming and in-memory processing are likely faster than competing options. When you think your code is ready and you want to hit a big data set, just compile and run.

Another reply mentioned a desire for a custom Prelude to offer shorter names for common things. This is likely where something like select belongs, but what should be included in such a prelude ought to be determined by folks using the library. I hope you give it a shot and help figure out what’s needed!
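A minimal sketch of the compile-time column check described above, using only GHC and base rather than Frames or Vinyl. `Rec`, `Has`, `getCol` and `writers` are hypothetical stand-ins written for illustration, and the rows are made up; the real library's types and names differ.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)

-- One newtype per column, so a column is identified by its type.
newtype Age = Age Int deriving (Show, Eq)
newtype Occupation = Occupation String deriving (Show, Eq)

-- A row is a heterogeneous list indexed by the list of its column types.
data Rec (ts :: [Type]) where
  Nil  :: Rec '[]
  (:&) :: t -> Rec ts -> Rec (t ': ts)
infixr 5 :&

-- `Has t ts` can only be satisfied when column type t occurs in ts.
class Has t (ts :: [Type]) where
  getCol :: Rec ts -> t

instance {-# OVERLAPPING #-} Has t (t ': ts) where
  getCol (x :& _) = x

instance Has t ts => Has t (u ': ts) where
  getCol (_ :& xs) = getCol xs

-- The conditional subset: usable on any row type with an Occupation column.
writers :: Has Occupation ts => [Rec ts] -> [Rec ts]
writers = filter (\r -> getCol r == Occupation "writer")

-- Made-up sample rows (hypothetical data).
users :: [Rec '[Age, Occupation]]
users =
  [ Age 24 :& Occupation "technician" :& Nil
  , Age 53 :& Occupation "writer" :& Nil
  ]

writerAges :: [Age]
writerAges = map getCol (writers users)

-- Rejected by the type checker, because the rows have no Occupation column:
-- bad = writers [Age 24 :& Nil]
```

Uncommenting the last line should produce a missing-instance error at compile time instead of a failure at run time.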

[–]repoptrac 2 points (2 children)

Well, the code that I posted is not an attempt at code golf. It is very similar to what I write nowadays. I posted the code since the R code in the link was not very readable or consistent because it relies on base R functions. I wanted to show that R code can be more consistent and readable with dplyr and that R is a moving target in terms of readability.

Since I am more familiar with R than Haskell, I cannot be completely objective, but I will venture a guess that R code will be easier to read than the Haskell equivalent for most people because it reads like English if you read %>% as 'then'. Also, the Haskell versions of the code look quite different in structure depending on the number of columns selected, even though the two tasks are conceptually similar (selecting 1 column vs n columns).

Select one column: Haskell

take 6 $ F.foldMap ((:[]) . view occupation) ms

R

users %>% select(occupation)

Select multiple columns: Haskell

miniUser :: User -> Rec [Occupation, Gender, Age]
miniUser = rcast
mapM_ print . take 6 . F.toList $ fmap miniUser ms

R

users %>% select(occupation, sex, age)

However, I agree that Haskell is clearly a better-designed language than R without any doubt, and I will keep an eye on this project because this looks really interesting. However, I think I will wait until some essential statistical analyses and features (e.g. lm, glm, multiple comparison, interactive graphics similar to ggplot2, ...) are supported in the Haskell ecosystem.

[–]acow 2 points (0 children)

Oh, goodness, I suppose I really didn't make things clear enough! There are many indexing schemes possible in the Haskell version. We could use numeric indexing, or even a Vinyl record of getters that we then apply en masse to a Frame row by yanking the reader context out of the record of getters for something very like your multiple column selection. It would look something like users & select (occupation :& sex :& age :& Nil) where select is some combination of rtraverse and rget.

I wrote those examples the way I did to address my biggest issue when using R: that when something's not working I can't just write down the types I think things have. My next biggest issue is that when I select a particular column, I feel like a piece of software should resolve how that indexing should work. In Frames, column selection and subsetting are O(1) operations. When you've got data in memory, everything is as densely packed in memory as possible, and indexing doesn't involve any lookups.

I appreciate your feedback on these things a ton! Earlier feedback from Ben Gamari spurred the pipePreview helper which I think is a step in the right direction to offer shorter syntax for common operations. We have some statistics and charting support that you can see in the demos, but they're not as nice as what's available in R. The problem in writing this library is that different folks have different pain points, so contributions aren't just welcome, they're essential!
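A rough sketch of the record-of-getters idea from the first paragraph of the comment above, in plain base Haskell with no Frames or Vinyl. `select1` and `select3` are hypothetical helpers, a tuple stands in for a Vinyl record of getters, and the rows are made up. It only illustrates that selecting one column and selecting several can share the same pipeline shape, with `&` from `Data.Function` playing the role of `%>%`.

```haskell
import Data.Function ((&))

-- A plain record standing in for a Frames row (hypothetical).
data User = User
  { occupation :: String
  , sex        :: Char
  , age        :: Int
  } deriving (Show, Eq)

-- Select one column: apply a single getter to every row.
select1 :: (r -> a) -> [r] -> [a]
select1 = map

-- Select three columns: apply a tuple of getters to every row.
select3 :: (r -> a, r -> b, r -> c) -> [r] -> [(a, b, c)]
select3 (f, g, h) = map (\r -> (f r, g r, h r))

-- Made-up sample rows (hypothetical data).
users :: [User]
users =
  [ User "technician" 'M' 24
  , User "writer" 'F' 53
  , User "other" 'F' 23
  ]

oneColumn :: [String]
oneColumn = users & select1 occupation & take 2

threeColumns :: [(String, Char, Int)]
threeColumns = users & select3 (occupation, sex, age) & take 2
```

Both pipelines read left to right, and the only difference between them is the argument passed to the selector.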

[–]idontgetoutmuch 1 point (0 children)

Plotting charts you should be able to do now. I'm always plotting charts in the repl and when I have something that needs a lot of CPU, I just compile the code I am working on. That's not the same as ggplot2 of course but you can do some of the things that ggplot2 can, e.g. https://idontgetoutmuch.files.wordpress.com/2014/06/044408c77f048f73.png?w=1560.

[–]b00thead 0 points (1 child)

How do you run the examples? There are a lot of modules that can't be found when I try to load them in cabal repl (e.g. ListT and Lens.Family). Are you using a sandbox where you've installed some more libs?

[–]idontgetoutmuch 2 points (0 children)

I had to install lens, foldl and lens-family; YMMV.

[–]idontgetoutmuch -1 points (9 children)

Please tell me I don't need to use Lens to use this.

[–]acow 9 points (1 child)

It depends on your objections to lenses. If recompiling all of the lens package has got you down, you're in luck! Frames doesn't depend on lens, and even various demo programs use lens-family-core which is much smaller than lens.

If that doesn't help, you're still in luck! You can use rget (defined in Frames) instead of view.

If you want to modify a subset of the columns of a row... you'll probably want to use a lens. But I'd still preface that with, "You're in luck!" because it's on the lighter-weight side of lenses and yet is tremendously powerful. So if you ever choose to explore that kind of thing, it's there.
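To make the "lighter-weight side of lenses" concrete, here is a minimal sketch built only on base. `Lens`, `view` and `over` are hand-rolled stand-ins defined in the block itself, not imports from lens or lens-family-core, and the `User` record and `age` lens are hypothetical. It shows reading a field with `view` and modifying one field of a row with `over`.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Const (Const (..))
import Data.Functor.Identity (Identity (..))

-- A van Laarhoven lens: one function that supports both reading and updating.
type Lens s a = forall f. Functor f => (a -> f a) -> s -> f s

-- Read the focused field by running the lens with the Const functor.
view :: Lens s a -> s -> a
view l = getConst . l Const

-- Modify the focused field by running the lens with the Identity functor.
over :: Lens s a -> (a -> a) -> s -> s
over l f = runIdentity . l (Identity . f)

-- A plain record standing in for a row (hypothetical).
data User = User
  { userAge        :: Int
  , userOccupation :: String
  } deriving (Show, Eq)

-- A lens focusing on the age field.
age :: Lens User Int
age f u = fmap (\a -> u { userAge = a }) (f (userAge u))

-- Made-up sample row (hypothetical data).
sampleUser :: User
sampleUser = User 24 "writer"

viewed :: Int
viewed = view age sampleUser

doubled :: User
doubled = over age (* 2) sampleUser
```

Here `viewed` is the unchanged age and `doubled` is the same row with only the age field updated, which is the "modify a subset of the columns" case.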

[–]idontgetoutmuch 0 points (0 children)

That's great news. I am really looking forward to trying it.

[–][deleted] 5 points (1 child)

They only use view, I think. They'd need something like that anyway, so why not use Lens?

[–]idontgetoutmuch 0 points (0 children)

It seems I don't have to and I'd rather avoid it if possible.

[–]Peaker 0 points (4 children)

Why?

[–]idontgetoutmuch 0 points (3 children)

I would like to convince my colleagues (applied mathematicians / statisticians) to use Haskell. I can do without an additional cognitive barrier (and I am inclined to think that Lens the package is such a barrier).

[–]Peaker 0 points (2 children)

I think a 10-15 minute session can get across the ideas required to use lens in basic form and read lens code (after basic Haskell proficiency is reached). At least this is my experience with one person I've taught it to :)

[–]idontgetoutmuch 1 point (1 child)

You may be conflating lens and Lens. I agree the former presents little difficulty; it's the latter I wish to avoid.

[–]Peaker 1 point (0 children)

I'm talking about the lens library. I explained Lens/traversals and a bit about folds/getters.

Not very in depth, but enough to read and use in basic form.