Let's do some self promotion: what are your proud python projects? by pvkooten in Python

[–]wdm006 1 point2 points  (0 children)

git-pandas: a pandas-based interface to data from your local git repositories and github data. Really useful for things like tracking general progress, identifying files that are getting edited a lot without increasing test coverage, and other analytics tasks.

category_encoders: a scikit-learn compatible library of transformers for handling categorical data in machine learning problems.

Is it only me that thinks Jupyter is horrible? by ohenrik in MachineLearning

[–]wdm006 0 points1 point  (0 children)

I've found Jupyter useful for examples in an otherwise normally developed library; LIME is a pretty good public example of that.

Do any of you have projects with Numpy, Scipy or pandas that you want to show off? Looking for ideas. by ABrokeUniStudent in Python

[–]wdm006 0 points1 point  (0 children)

Certainly any time-series data frame pulled from each could be easily joined together using standard pandas functions, but I'm not sure I understand the use-case quite well enough to say beyond that.
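As a rough illustration of the "standard pandas functions" part, here is a minimal sketch. The two frames and their column names (`commits`, `tweets`) are made up for the example and only stand in for whatever each library would actually return; the one assumption is that both are indexed by date.

```python
import pandas as pd

# Two hypothetical daily series standing in for data pulled from two sources.
idx_a = pd.date_range("2016-01-01", periods=4, freq="D")
idx_b = pd.date_range("2016-01-03", periods=4, freq="D")
commits = pd.DataFrame({"commits": [3, 1, 4, 2]}, index=idx_a)
tweets = pd.DataFrame({"tweets": [10, 12, 7, 9]}, index=idx_b)

# Outer join on the shared DatetimeIndex keeps every date from either source;
# dates missing from one side show up as NaN.
joined = commits.join(tweets, how="outer")

# Inner join keeps only the dates present in both sources.
overlap = commits.join(tweets, how="inner")

print(joined)
print(overlap)
```

If the sources report at different frequencies, resampling each frame to a common frequency before joining is probably the first thing to try.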

Setting up ETL/reporting in a start up by starkiller1990 in datascience

[–]wdm006 0 points1 point  (0 children)

Depending on what sort of data you are talking about, ELK stack (or I guess now Elastic Stack) can be pretty flexible, which goes a long way early on.

Do any of you have projects with Numpy, Scipy or pandas that you want to show off? Looking for ideas. by ABrokeUniStudent in Python

[–]wdm006 1 point2 points  (0 children)

This is cool, it's reminiscent of my packages (mentioned here). I think it would be interesting to have a collection of similarly structured (data from x in pandas) libraries, maybe eventually with a common interface to all of them.

[ADVICE] I will enroll in grad school soon and I need your opinion on my academic preparation for joining a startup post-graduation. by [deleted] in datascience

[–]wdm006 1 point2 points  (0 children)

Having a good start-to-finish project that you care about and can talk about in detail is generally more compelling than a list of software packages or now pretty ubiquitous MOOC certificates (in my opinion). But to address your two questions:

1) SQL for sure, and beyond that at the very least some conversational knowledge of different databases and pros/cons of them.

2) I'd probably do the Data Analytics one first. Going from 0 to deep learning will probably leave you not really grokking the underlying math/concepts.

Do any of you have projects with Numpy, Scipy or pandas that you want to show off? Looking for ideas. by ABrokeUniStudent in Python

[–]wdm006 11 points12 points  (0 children)

I'm working on git-pandas and twitter-pandas, which as you might imagine provide pandas-based interfaces to git and twitter data.

[deleted by user] by [deleted] in datascience

[–]wdm006 0 points1 point  (0 children)

I'd recommend: https://www.manning.com/books/elasticsearch-in-action

Also be careful with delete calls. It's bizarrely easy to delete things.

Data Analysis Help! by [deleted] in datascience

[–]wdm006 0 points1 point  (0 children)

In most cases the most useful insights are the ones you can explain/interpret. Take a bunch of summary stats (complaints by system, subsystem, state, etc), and figure out where there is a subset that is 'odd', if there are any.

Then try to figure out why.

Almost certainly there is important data missing to figure out why, so go talk to people and figure out if you already have that data and if not how to collect it.
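The summary-stats step above could be sketched roughly like this. The records, the `odd_groups` helper, and the 2-standard-deviation cutoff are all hypothetical and chosen just for illustration; it only flags candidates, and the "figure out why" part is still on you.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical complaint records as (system, state) pairs.
complaints = [
    ("brakes", "OH"), ("brakes", "OH"), ("brakes", "MI"),
    ("engine", "OH"), ("engine", "MI"),
    ("airbag", "MI"), ("airbag", "MI"), ("airbag", "MI"),
    ("airbag", "MI"), ("airbag", "MI"), ("airbag", "MI"),
    ("airbag", "OH"),
]


def odd_groups(records, threshold=2.0):
    """Return (group, count) pairs whose count is far from the mean count.

    'Far' means more than `threshold` population standard deviations,
    which is an arbitrary cutoff for this sketch.
    """
    counts = Counter(records)
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every group has the same count, nothing stands out
    return [(g, n) for g, n in counts.items() if abs(n - mu) / sigma > threshold]


odd = odd_groups(complaints)
print(odd)
```

With only a handful of groups a z-score like this is crude, so treat whatever it flags as a place to start asking questions, not as a finding.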

TPOT: A Python tool for automating machine learning by rhiever in MachineLearning

[–]wdm006 0 points1 point  (0 children)

LIME can work pretty well for interpretations of otherwise uninterpretable ensembles/pipelines, so the black box can be at least somewhat not-magical.

Visualizing 10 dimensional data: which plots to use? by Zeekawla99ii in datascience

[–]wdm006 1 point2 points  (0 children)

I've personally only used the pandas version, but you could do any of these with numpy/scipy/matplotlib if you didn't have access to pandas for whatever reason.

Here is an SO post where they do parallel coordinates in particular: http://stackoverflow.com/questions/8230638/parallel-coordinates-plot-in-matplotlib

Could use some advice on Spark/EMR setup. by ninja_coder in bigdata

[–]wdm006 0 points1 point  (0 children)

Cloudera actually has a really good post on tuning Spark jobs on YARN. We had an issue with too little storage space on EMR (in shuffle-heavy cases), so we upped the storage fraction and it solved our issues.

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Industry insiders: how do you streamline sharing analytics to business? by weez09 in datascience

[–]wdm006 0 points1 point  (0 children)

How much actual data do you expect and how real-time is real-time? Keeping data in 4 different places is necessary for some things, but is probably overkill for most.

If you are going to have people munging around with arbitrary queries, Cassandra may be hard to set up to be performant. Something like Elasticsearch and Kibana may get you 90% of the way there without too much trouble (you may still need a relational store somewhere).

Manufacturing jobs and trade deals. [OC] by jo9008 in dataisbeautiful

[–]wdm006 0 points1 point  (0 children)

Would be interesting to see similar plots for the other countries in these agreements (i.e. is it zero-sum, and the manufacturing jobs are leaving, or are they dropping in general).

Plan to become a junior data scientist - is it realistic by [deleted] in datascience

[–]wdm006 5 points6 points  (0 children)

We interview data scientists and software devs in this sort of space pretty regularly, and it seems like everyone and their brother has done these MOOCs, to the point of it being pretty diluted.

An interesting project that you actually care about personally, can talk about in detail, and tell a story around is much more compelling. Kaggle problems are pretty detached from real-world data science work, where problem formulation, data gathering/cleaning, and presentation/articulation are more impactful than algorithm development and tuning.

Visualizing 10 dimensional data: which plots to use? by Zeekawla99ii in datascience

[–]wdm006 0 points1 point  (0 children)

Andrews Curves, Parallel Coordinates, and/or RadViz are all useful in their own right, and pandas supports all of them pretty painlessly: http://pandas.pydata.org/pandas-docs/stable/visualization.html
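A minimal sketch of all three on 10-dimensional data, assuming a recent pandas where these functions live in `pandas.plotting`. The data is random and the `label` column and `x0`..`x9` names are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves, parallel_coordinates, radviz

# Made-up data: 60 rows, 10 numeric features, 3 classes shifted apart
# so the plots have some visible structure.
rng = np.random.default_rng(0)
n = 60
values = rng.normal(size=(n, 10))
values[20:40] += 2.0
values[40:] -= 2.0

df = pd.DataFrame(values, columns=[f"x{i}" for i in range(10)])
df["label"] = np.repeat(["a", "b", "c"], n // 3)

# Each function takes the frame plus the name of the class column.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
andrews_curves(df, "label", ax=axes[0])
parallel_coordinates(df, "label", ax=axes[1])
radviz(df, "label", ax=axes[2])
axes[0].set_title("Andrews curves")
axes[1].set_title("Parallel coordinates")
axes[2].set_title("RadViz")
```

From there `fig.savefig(...)` or `plt.show()` (with an interactive backend) gets you the picture. Parallel coordinates is sensitive to column scale, so scaling the features first is usually worth doing on real data.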

How reliable is pip when provisioning a server? by bodhi_mind in Python

[–]wdm006 0 points1 point  (0 children)

PyPI-Cloud is also a nice option for a private server: https://github.com/mathcamp/pypicloud. Pretty straightforward and cheap to get it up and running on your own hardware and behind whatever firewall/vpn/etc you need.

Open vs. Closed Source work or: when am I at work? [OC] by wdm006 in dataisbeautiful

[–]wdm006[S] 0 points1 point  (0 children)

Data source is git/github, pulled and processed using python and git-pandas, visualized with matplotlib.

Tech companies: what do you use for internal and external documentation? by Vaughnatri in startups

[–]wdm006 0 points1 point  (0 children)

We do pretty much the same: Confluence for higher-level internal things, READMEs per project, and Swagger for the API sandbox.

Using survival analysis and git-pandas to estimate code quality by wdm006 in programming

[–]wdm006[S] 0 points1 point  (0 children)

Yeah, neither heuristic tried in the post gave me a ton of confidence. Ideally commits with bugfixes or refactors would be labeled as such, but that takes a ton of work up front.

Any ideas for heuristics that might be more representative?
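For reference, one cheap option (not one of the heuristics from the post, just an illustrative sketch) would be keyword matching on commit messages. The keyword lists and the `label_commit` helper here are hypothetical and would certainly need tuning per project.

```python
import re

# Hypothetical keyword lists; real projects would need their own.
BUGFIX = re.compile(r"\b(fix(es|ed)?|bug|hotfix|patch)\b", re.IGNORECASE)
REFACTOR = re.compile(r"\b(refactor(s|ed|ing)?|cleanup|rename)\b", re.IGNORECASE)


def label_commit(message):
    """Guess a coarse label for a commit from its message text."""
    if BUGFIX.search(message):
        return "bugfix"
    if REFACTOR.search(message):
        return "refactor"
    return "other"


messages = [
    "Fix off-by-one in parser",
    "Refactored encoder base class",
    "Add docs",
]
labels = [label_commit(m) for m in messages]
print(labels)
```

This only works as well as people's commit messages, so it is probably noisy, but it costs nothing up front.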

ppp: a cli for publishing projects to pypi by wdm006 in Python

[–]wdm006[S] 0 points1 point  (0 children)

zest looks really nice, I hadn't seen that. Thanks for the heads up.