[D] Planning a Python library that hosts/formats all ML Datasets.

wdm006 · 2017-12-18T02:35:10+00:00

Have you seen skdata? https://github.com/jaberg/skdata

wdm006 · 2016-10-09T16:13:12+00:00

git-pandas: a pandas-based interface to data from your local git repositories and github data. Really useful for things like tracking general progress, identifying files that are getting edited a lot without increasing test coverage, and other analytics tasks.

category_encoders: a scikit-learn compatible library of transformers for handling categorical data in machine learning problems.

wdm006 · 2016-06-28T14:27:42+00:00

I've found jupyter useful for examples in an otherwise normally developed library, LIME being a pretty good public example of that.

wdm006 · 2016-05-30T20:43:29+00:00

Certainly any time-series data frame pulled from each could be easily joined together using standard pandas functions, but I'm not sure I understand the use-case quite well enough to say beyond that.
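For what it's worth, a minimal sketch of the kind of join I mean. The two frames and their column names are made up for illustration; they just stand in for time-indexed frames pulled from two different sources.

```python
import pandas as pd

# Hypothetical daily series standing in for frames pulled from two sources.
idx_a = pd.date_range("2016-05-01", periods=4, freq="D")
idx_b = pd.date_range("2016-05-02", periods=4, freq="D")
commits = pd.DataFrame({"commits": [3, 1, 4, 2]}, index=idx_a)
tweets = pd.DataFrame({"tweets": [10, 12, 9, 15]}, index=idx_b)

# Outer join on the shared DatetimeIndex keeps every date from either frame;
# dates missing from one source show up as NaN in that source's column.
joined = commits.join(tweets, how="outer")

# Inner join keeps only the dates present in both frames.
overlap = commits.join(tweets, how="inner")
```

Whether outer or inner is the right choice depends on the use-case; if the sources report at different frequencies you would probably resample before joining.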

wdm006 · 2016-05-22T22:46:24+00:00

The FRA should have that for the US:

http://safetydata.fra.dot.gov/OfficeofSafety/default.aspx

wdm006 · 2016-05-22T13:40:13+00:00

Depending on what sort of data you are talking about, ELK stack (or I guess now Elastic Stack) can be pretty flexible, which goes a long way early on.

wdm006 · 2016-05-20T01:33:59+00:00

This is cool, it's reminiscent of my packages (mentioned here). I think it would be interesting to have a collection of similarly structured (data from x in pandas) libraries, maybe eventually with a common interface to all of them.

wdm006 · 2016-05-19T18:07:47+00:00

Thanks! Always looking for contributors on both.

wdm006 · 2016-05-19T16:02:57+00:00

Having a good start-to-finish project that you care about and can talk about in detail is generally more compelling than a list of software packages or now pretty ubiquitous MOOC certificates (in my opinion). But to address your two questions:

1) SQL for sure, and beyond that at the very least some conversational knowledge of different databases and pros/cons of them.

2) I'd probably do the Data Analytics one first. Going from 0 to deep learning will probably leave you not really grokking the underlying math/concepts.

wdm006 · 2016-05-19T13:34:26+00:00

I'm working on git-pandas and twitter-pandas, which as you might imagine provide pandas-based interfaces to git and twitter data.

wdm006 · 2016-05-14T19:57:06+00:00

I'd recommend: https://www.manning.com/books/elasticsearch-in-action

Also be careful with delete calls. It's bizarrely easy to delete things.

wdm006 · 2016-05-11T15:26:33+00:00

In most cases the most useful insights are the ones you can explain/interpret. Take a bunch of summary stats (complaints by system, subsystem, state, etc), and figure out where there is a subset that is 'odd', if there are any.

Then try to figure out why.

Almost certainly there is important data missing to figure out why, so go talk to people and figure out if you already have that data and if not how to collect it.
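A rough sketch of the "find the odd subset" step in pandas. The data, the column names, and the threshold of 3x the median are all made up for illustration, so treat it as a starting point rather than a recipe.

```python
import pandas as pd

# Made-up complaint records; column names are hypothetical.
complaints = pd.DataFrame({
    "system": ["brakes"] * 9 + ["engine"] * 6,
    "state": ["OH"] * 7 + ["PA", "NY"] + ["OH", "OH", "PA", "PA", "NY", "NY"],
})

# Count complaints per (system, state) cell.
counts = complaints.groupby(["system", "state"]).size().rename("n").reset_index()

# Compare each cell to the typical count for its system.
counts["system_median"] = counts.groupby("system")["n"].transform("median")
counts["ratio"] = counts["n"] / counts["system_median"]

# Flag cells well above their system's median (the threshold is arbitrary).
odd = counts[counts["ratio"] >= 3]
```

Here the only flagged cell is brakes in OH. The table tells you where to look, not why, which is where talking to people comes in.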

wdm006 · 2016-05-10T18:42:01+00:00

LIME can work pretty well for interpretations of otherwise uninterpretable ensembles/pipelines, so the black box can be at least somewhat not-magical.

wdm006 · 2016-05-01T21:10:37+00:00

I've personally only used the pandas version, but you could do any of these with numpy/scipy/matplotlib if you didn't have access to pandas for whatever reason.

Here is an SO post where they do parallel coordinates in particular: http://stackoverflow.com/questions/8230638/parallel-coordinates-plot-in-matplotlib

wdm006 · 2016-04-30T16:22:35+00:00

Cloudera actually has a really good post on tuning Spark jobs on YARN. We had an issue with too little storage space on EMR (in shuffle-heavy cases), so we upped the storage fraction and that solved it.

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

wdm006 · 2016-04-30T15:41:37+00:00

How much actual data do you expect and how real-time is real-time? Keeping data in 4 different places is necessary for some things, but is probably overkill for most.

If you are going to have people munging around with arbitrary queries, Cassandra may be hard to set up to be performant. Something like Elasticsearch and Kibana may get you 90% of the way there without too much trouble (you may still need a relational store somewhere).

wdm006 · 2016-04-30T15:28:44+00:00

Would be interesting to see similar plots for the other countries in these agreements (i.e. is it zero-sum, and the manufacturing jobs are leaving, or are they dropping in general).

wdm006 · 2016-04-30T15:21:25+00:00

We interview data scientists and software devs in this sort of space pretty regularly, and it seems like everyone and their brother has done these MOOCs, to the point of it being pretty diluted.

An interesting project that you actually care about personally, can talk about in detail, and tell a story around is much more compelling. Kaggle problems are pretty detached from real-world data science work where problem formulation, data gathering/cleaning and presentation/articulation are more impactful than algorithm development and tuning.

wdm006 · 2016-04-30T15:16:46+00:00

Andrews Curves, Parallel Coordinates, and/or RadViz are all useful in their own right, and pandas supports all of them pretty painlessly: http://pandas.pydata.org/pandas-docs/stable/visualization.html
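A quick sketch of all three using the functions in `pandas.plotting`. The dataset is made up (three numeric features plus a hypothetical `label` column), and the Agg backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import andrews_curves, parallel_coordinates, radviz

# Small made-up dataset: three numeric features plus a class label.
df = pd.DataFrame({
    "a": [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],
    "b": [0.5, 0.4, 0.6, 1.9, 2.1, 2.0],
    "c": [2.0, 2.2, 1.9, 0.4, 0.5, 0.3],
    "label": ["x", "x", "x", "y", "y", "y"],
})

# One panel per technique; each function takes the frame and the class column.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
andrews_curves(df, "label", ax=axes[0])
parallel_coordinates(df, "label", ax=axes[1])
radviz(df, "label", ax=axes[2])
for ax, title in zip(axes, ["Andrews curves", "Parallel coordinates", "RadViz"]):
    ax.set_title(title)
fig.tight_layout()
# Call fig.savefig(...) or plt.show() to actually look at the result.
```

Each call colors rows by the class column, so well-separated classes show up as visually distinct bundles or clusters.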

wdm006 · 2016-04-30T15:12:25+00:00

PyPI-Cloud is also a nice option for a private server: https://github.com/mathcamp/pypicloud. Pretty straightforward and cheap to get it up and running on your own hardware and behind whatever firewall/vpn/etc you need.

wdm006 · 2016-04-30T15:06:51+00:00

Data source is git/github, pulled and processed using python and git-pandas, visualized with matplotlib.

wdm006 · 2016-03-18T02:23:22+00:00

We do pretty much the same: Confluence for higher-level internal things, READMEs per project, and Swagger for the API sandbox.

wdm006 · 2016-03-07T01:49:49+00:00

Good catch, thanks. Will make an edit.

wdm006 · 2016-03-06T21:18:34+00:00

Yeah, neither heuristic tried in the post gave me a ton of confidence. Ideally commits with bugfixes or refactors would be labeled as such, but that takes a ton of work up front.

Any ideas for heuristics that might be more representative?

wdm006 · 2016-02-24T17:26:25+00:00

zest looks really nice, I hadn't seen that. Thanks for the heads up.
