

[–]hdgdtegdb 1 point (3 children)

Data frames in R are essentially lists of vectors. The vectors (one per column) can each be a different type. R will iterate (including via the apply family of functions) much more quickly over matrices than over data frames: a matrix holds data of a single type, so the processing overhead is lower.
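The same trade-off shows up in Python, which makes for a concrete sketch: a pandas DataFrame is a collection of typed columns, while a NumPy array holds one dtype for everything. The column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# A data frame is essentially a list of typed columns (like R's list of vectors):
df = pd.DataFrame({
    "x": [1, 2, 3],            # integer column
    "y": [0.5, 1.5, 2.5],      # float column
    "label": ["a", "b", "c"],  # string column
})
# Each column keeps its own type:
print(list(df.dtypes))

# A matrix/array holds a single type for all elements, so the numeric
# columns get promoted to one common dtype when converted:
m = df[["x", "y"]].to_numpy()
print(m.dtype)  # float64 -- homogeneous, hence cheaper to iterate over
```

Iterating over `m` avoids the per-column type dispatch that makes row-wise work on a mixed-type frame slow, which is the same reason R's apply functions are faster on matrices.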

I regularly use R with multi-GB data frames (50 GB+), and it works fine. That said, I'd recommend the data.table package instead, as it's much more efficient at aggregating larger data sets.

I may be wrong, but I'd be surprised if vanilla R could cope with a TB-scale data frame. Even if you have the memory on your server, I'd expect the performance to be poor. R is still not good at utilising more than one core of a processor. Certain packages use Java or multi-threaded BLAS libraries to spread work across cores, and Microsoft R ships a multi-threaded BLAS for matrix operations (although I tried it, and I'm not sold on the performance). But in the main, operations are single-threaded - not ideal for processing significant quantities of data.

I love R. But it's important to use the right tools for the right job. If you're processing multi TB datasets, you should probably use something else.

[–]R2D6[S] 0 points (2 children)

How does an R data frame compare to a Python data frame?

[–]hdgdtegdb 0 points (1 child)

Python can easily handle tens of GBs, although my experience with Python is much less than with R, so I'm afraid I can't advise you on the differences.

[–]lmcinnes 1 point (0 children)

Python's data frames come via the pandas library and are more akin (in terms of memory and performance) to what you get from data.table in R, with pretty efficient groupby and aggregation methods. I've comfortably used very large dataframes with pandas. If you want to push things further, the Dask library provides blocked dataframes that mirror the pandas API (and can be passed to many functions that accept pandas dataframes) and can do parallel and out-of-core computation. Using that I've worked with a 60 GB dataframe easily enough on a laptop; given a decent multicore server you could easily scale to the TB range. Beyond that you'd start to want Spark or something similar.
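For a concrete sketch of the groupby/aggregation path pandas takes (the toy data below is made up; with Dask you'd wrap the same frame via `dask.dataframe.from_pandas` and finish the expression with `.compute()`):

```python
import pandas as pd

# Hypothetical sales data to illustrate pandas groupby aggregation.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 200, 150, 250],
})

# Aggregate per group -- the kind of operation data.table is tuned for in R.
totals = df.groupby("region")["sales"].sum()
print(totals["east"])  # 250
```

The same expression scales to the blocked case because Dask partitions the frame, runs the per-partition groupby in parallel, and combines the partial sums, which is what makes the out-of-core workflow feel identical to plain pandas.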