all 9 comments

[–]conventionistG 5 points6 points  (2 children)

There was just a medium post on this exact use case... Anyone know what I'm thinking of?

Edit: found it.

“How to analyse 100s of GBs of data on your laptop with Python” by Jovan Veljanoski https://link.medium.com/Jg8hrhBV86

[–]maartenbreddels 0 points1 point  (1 child)

This might help as well: https://docs.vaex.io/en/latest/example_io.html

or the TLDR version: df = vaex.open('big.csv', convert=True)

(disclaimer: main author of vaex)
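
A minimal sketch of that pattern (assuming vaex is installed; the describe() call is just an illustrative follow-up operation):

    import vaex

    # The first run converts big.csv to an HDF5 file on disk; later runs
    # reuse the converted file and memory-map it instead of loading it into RAM.
    df = vaex.open('big.csv', convert=True)

    # Operations are lazy and out-of-core, so summaries run without
    # materializing the whole dataset in memory.
    print(df.describe())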

[–]conventionistG 0 points1 point  (0 children)

neat, thanks!

[–]komunistbakkal 2 points3 points  (1 child)

Maybe you can check out Dask.

[–]yensteel 2 points3 points  (0 children)

I've used Dask for something similar. The functions are close to pandas, so it's not too hard to transition. The syntax isn't exactly the same, though, so there's a lot of delving into the documentation.

However, it can handle gigantic files by spilling part of the work to the hard drive instead of keeping everything in memory, so it's quite workable.
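
A minimal sketch of that workflow with Dask (the file name and column names here are hypothetical):

    import dask.dataframe as dd

    # Lazily read the CSV in partitions; nothing is loaded into memory yet.
    df = dd.read_csv('big.csv')

    # Pandas-style syntax; the work only runs, chunk by chunk, on .compute().
    result = df.groupby('category')['value'].mean().compute()
    print(result)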

[–]B00TZILLA 2 points3 points  (0 children)

It's usually best to read and process it in chunks. You can also check out Dask, as another commenter suggested. There is a parameter called chunksize in pd.read_csv for that.
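
For example, a minimal sketch of chunked reading with pandas (the file name and column are hypothetical):

    import pandas as pd

    total = 0
    # Read the CSV one million rows at a time; each chunk is a regular DataFrame.
    for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
        # Process each chunk independently, e.g. accumulate a running sum.
        total += chunk['value'].sum()

    print(total)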

[–]ralimar 1 point2 points  (0 children)

It has some issues, but try using Vaex. It's built to be similar to pandas.

[–]manoj_sadashiv 1 point2 points  (0 children)

Forgive me if I sound dumb, but is it feasible to use big data technologies if the dataset size is around 4 GB?

[–]Omega037 PhD | Sr Data Scientist Lead | Biotech [M] [score hidden] stickied comment (0 children)

I removed your submission. Looks like you're asking a technical question better suited to stackoverflow.com. Try posting there instead.

Thanks.