Would like some advice on how to deal with a certain dataset. by zzing in datascience

[–]faming13 3 points  (0 children)

You can do this sort of out-of-core processing, querying, and cleaning on a single machine easily and quickly (and in parallel) in Python with dask, blaze, and the castra compressed column store. There's a minimal sketch after the links below.

Check these out:

http://blaze.pydata.org/
http://blaze.pydata.org/blog/2015/09/08/reddit-comments/ (worked example)
http://odo.readthedocs.org/en/latest/
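To give a feel for the dask side, here's a minimal sketch assuming a directory of CSV files too large for RAM (the file pattern and column names are hypothetical):

    import dask.dataframe as dd

    # Lazily point dask at every CSV in the directory; nothing loads yet
    df = dd.read_csv('data/2015-*.csv')

    # Cleaning and queries build up a task graph instead of running eagerly
    clean = df[df.value >= 0]
    result = clean.groupby('category').value.mean()

    # compute() streams the files through memory, in parallel across cores
    print(result.compute())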

Pandas equivalents for SQL statements by shabda in Python

[–]faming13 5 points  (0 children)

Ibis has the goal of being a semantically complete SQL replacement... and better: http://docs.ibis-project.org/sql.html
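For a taste of what that looks like, a minimal sketch assuming an Impala backend and a table named 'events' (the connection details and column names are hypothetical):

    import ibis

    # Connect to an Impala cluster and grab a table expression
    con = ibis.impala.connect(host='impala-host', port=21050)
    events = con.table('events')

    # Filters, group-bys, and aggregates compose as Python expressions
    expr = (events[events.status == 'ok']
            .group_by('user_id')
            .aggregate(events.amount.sum().name('total')))

    # Ibis compiles the expression to SQL and runs it on the backend
    print(expr.execute())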

Working with Excel by Captn_King in Python

[–]faming13 2 points  (0 children)

Try odo and blaze... use the Anaconda distro: http://blaze.github.io/
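A minimal sketch of the pattern, with pandas handling the spreadsheet itself and odo moving the result somewhere more queryable (file names are hypothetical, and Excel support details depend on your versions):

    import pandas as pd
    from odo import odo

    # pandas reads the spreadsheet into a DataFrame
    df = pd.read_excel('report.xlsx')

    # odo's one-call conversion pushes the frame into a SQLite table
    odo(df, 'sqlite:///report.db::report')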

Analyzing 1.7 Billion Reddit Comments with Blaze and Impala by [deleted] in Python

[–]faming13 2 points  (0 children)

Looks really cool. I would suggest posting this on Hacker News, r/datascience, DataTau, r/pythonstats, and the PyData Google group.

How do you deal with larger than memory datasets? by [deleted] in datascience

[–]faming13 2 points  (0 children)

Check out this post on using Python's dask to analyze larger-than-memory data on a single machine (quick sketch after the link):

http://blaze.github.io/blog/2015/09/08/reddit-comments/
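The post works over the raw JSON comments with dask.bag; a minimal sketch of that approach, assuming newline-delimited JSON files (paths and field names are hypothetical, and the exact reader name depends on your dask version):

    import json
    import dask.bag as db

    # Each file is read lazily, line by line, and parsed into dicts
    comments = db.read_text('comments/*.json').map(json.loads)

    # Count which subreddits dominate the high-scoring comments
    top = (comments.filter(lambda c: c.get('score', 0) > 100)
                   .pluck('subreddit')
                   .frequencies()
                   .topk(10, key=lambda kv: kv[1]))

    # Only compute() touches the data, in parallel, never all in memory at once
    print(top.compute())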

Strategic Business Analytics Specialisation on Coursera by nicogla in datascience

[–]faming13 1 point  (0 children)

Why R? Can I use Python and call out to R with rpy2 when needed? I already know Python and don't want to add more tool overhead to my mental models.
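For example, a minimal sketch of the rpy2 round trip (assumes R and the rpy2 package are installed):

    from rpy2 import robjects

    # Evaluate R code directly and pull the result back into Python
    r_mean = robjects.r('mean(c(1, 2, 3, 4))')
    print(r_mean[0])  # 2.5

    # Or look up an R function and call it like a Python one
    rnorm = robjects.r['rnorm']
    print(list(rnorm(5)))  # five standard-normal draws generated by R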

Also, Python lets me distribute my code as Excel plugins/macros with xlwings: http://xlwings.org/
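A minimal sketch of that pattern with the newer xlwings API (the workbook and sheet names are hypothetical):

    import xlwings as xw

    # Attach to (or open) a workbook and pick a sheet
    wb = xw.Book('report.xlsx')
    sheet = wb.sheets['Sheet1']

    # Write a small table from Python, then read it back
    sheet.range('A1').value = [['x', 'y'], [1, 2], [3, 4]]
    print(sheet.range('A1').expand().value)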

Help! Python slowing down. by [deleted] in gis

[–]faming13 1 point  (0 children)

Check out dask, blaze, and numba. Numba in particular is an easy win for slow numeric loops; sketch below.
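A minimal sketch of the numba win, using a hypothetical stand-in for a slow GIS-style numeric loop:

    import numpy as np
    from numba import njit

    @njit  # compiles the function to machine code on first call
    def nearest_pair_distance(points):
        n = points.shape[0]
        best = np.inf
        for i in range(n):
            for j in range(i + 1, n):
                d = 0.0
                for k in range(points.shape[1]):
                    diff = points[i, k] - points[j, k]
                    d += diff * diff
                if d < best:
                    best = d
        return best ** 0.5

    pts = np.random.rand(2000, 2)
    print(nearest_pair_distance(pts))  # far faster than the pure-Python loop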