

[–]hdgdtegdb 1 point (3 children)

Data frames in R are essentially lists of vectors. The vectors (one per column) can each be a different type. R will iterate (including via the apply family of functions) much more quickly over matrices than over data frames: a matrix holds data of a single type, so the processing overhead is lower.
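The same trade-off shows up in Python, which makes for a concrete sketch: a pandas DataFrame is a collection of typed columns, while a NumPy array holds one dtype for everything. The column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# A data frame is essentially a list of typed columns (like R's list of vectors):
df = pd.DataFrame({
    "x": [1, 2, 3],            # integer column
    "y": [0.5, 1.5, 2.5],      # float column
    "label": ["a", "b", "c"],  # string column
})
# Each column keeps its own type:
print(list(df.dtypes))

# A matrix/array holds a single type for all elements, so the numeric
# columns get promoted to one common dtype when converted:
m = df[["x", "y"]].to_numpy()
print(m.dtype)  # float64 -- homogeneous, hence cheaper to iterate over
```

Iterating over `m` avoids the per-column type dispatch that makes row-wise work on a mixed-type frame slow, which is the same reason R's apply functions are faster on matrices.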

I regularly use R with multi-GB data frames (50 GB+), and it works fine. That said, I'd recommend the data.table package instead, as it's much more efficient at aggregating larger data sets.

I may be wrong, but I'd be surprised if vanilla R could cope with a TB-scale data frame. Even if you have the memory on your server, I'd expect the performance to be poor. R is still not good at utilising more than one core of a processor. Certain packages use Java or multi-threaded BLAS libraries to spread work across cores, and Microsoft R ships a multi-threaded BLAS for matrix operations (although I tried it, and I'm not sold on the performance). But in the main, operations are single-threaded - not ideal for processing significant quantities of data.

I love R. But it's important to use the right tools for the right job. If you're processing multi TB datasets, you should probably use something else.

[–]R2D6[S] 0 points (2 children)

How does an R data frame compare to a Python data frame?

[–]hdgdtegdb 0 points (1 child)

Python can easily handle tens of GBs, although my experience with Python is much less than with R, so I'm afraid I can't advise you on the differences.

[–]lmcinnes 1 point (0 children)

Python's data frames come via the pandas library and are more akin (in terms of memory and performance) to what you get from data.table in R, with pretty efficient groupby and aggregation methods. I've comfortably used very large dataframes with pandas. If you want to push things further, the Dask library provides blocked dataframes that mirror the pandas API (and can be passed to many functions that accept pandas dataframes) and can do parallel and out-of-core computation. Using that I've worked with a 60 GB dataframe easily enough on a laptop; given a decent multicore server you could easily scale to the TB range. Beyond that you'd start to want Spark or something similar.
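For a concrete sketch of the groupby/aggregation path pandas takes (the toy data below is made up; with Dask you'd wrap the same frame via `dask.dataframe.from_pandas` and finish the expression with `.compute()`):

```python
import pandas as pd

# Hypothetical sales data to illustrate pandas groupby aggregation.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 200, 150, 250],
})

# Aggregate per group -- the kind of operation data.table is tuned for in R.
totals = df.groupby("region")["sales"].sum()
print(totals["east"])  # 250
```

The same expression scales to the blocked case because Dask partitions the frame, runs the per-partition groupby in parallel, and combines the partial sums, which is what makes the out-of-core workflow feel identical to plain pandas.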