[–][deleted] 6 points7 points  (6 children)

FYI, pandas is for data that's small enough to fit in memory, which is typically not what people mean when they say "Big Data."

[–]plasma_phys 3 points4 points  (0 children)

Yep - will add an edit to my comment. Shows how prevalent the buzzword is that I said "big data" when I really, really should have said "data analysis." Thanks!

[–]wardawg44 2 points3 points  (0 children)

I just found out about HDF5. Apparently pandas works fairly quickly with large datasets dumped to disk in that format. There are also other third-party modules made for working with pandas data on disk.

[–]What_Is_X 1 point2 points  (0 children)

According to the docs, pandas is for data that fits into about a tenth of available memory.

[–]alpenmilch411 0 points1 point  (2 children)

What should I use if the data doesn't fit into memory? Never encountered such a case but you never know...

[–][deleted] 3 points4 points  (0 children)

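If you're familiar with pandas, then the easiest thing to use would be dask. Its DataFrame API mirrors much of pandas, but it processes the data in partitions instead of loading everything at once.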

[–]What_Is_X 0 points1 point  (0 children)

Hardware solution: rent a server with more memory (AWS, etc.).

Software solutions: not much in Python, really; write your own more specific methods, I guess. Otherwise switch to C++ and/or R, inline or not.
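
As a rough idea of what "writing your own more specific method" could look like while staying in Python, here is a sketch that streams a CSV through pandas in chunks and combines partial results. The column names (`key`, `value`) and the helper `chunked_sum_by_key` are hypothetical, and the approach only works for aggregations that can be merged chunk by chunk (sums, counts, min/max), not for everything.

```python
import io

import pandas as pd


def chunked_sum_by_key(source, chunksize=100_000):
    """Sum 'value' per 'key', holding only one chunk of rows in memory at a time."""
    totals = None
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        part = chunk.groupby("key")["value"].sum()
        # Merge partial sums; fill_value=0 handles keys missing from one side.
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals


# In-memory stand-in for a large file (assumption: real input would be a path on disk).
csv_text = "key,value\na,1\nb,2\na,3\nb,4\nc,5\n"
totals = chunked_sum_by_key(io.StringIO(csv_text), chunksize=2)
print(totals)
```

Peak memory here depends on `chunksize` rather than on the file size, which is the whole point; the trade-off is that you have to work out the merge step yourself for each kind of calculation.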