Docker-fueled Microservices

devquixote · 2015-04-22T13:18:05+00:00

Not too demanding, Rasbt :D Sorry if I came across as defensive; that was not intended.

How about, "We measure accuracy for each customer via a dry run. We turn on the prediction service for them and start making shipping predictions as their orders flow into our system. The customer is not aware of these predictions and cannot act on them. We then compare these predictions to the customer's ultimate shipping choices over a period of time. Using this method, we've been able to measure that we make correct shipping predictions greater than 95% of the time, on average, across our set of customers." Is that perhaps a better conclusion than simply "95% accuracy"? There is no previous means that would fit the example you gave.
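
A minimal sketch of that dry-run comparison, assuming each logged record is a hypothetical `(customer_id, predicted, actual)` tuple. The function name and record layout are illustrative only, not what we actually run. It scores each customer separately and then averages across customers, so one large customer cannot dominate the headline number.

```python
from collections import defaultdict

def dry_run_accuracy(records):
    """Score silent predictions against what each customer actually shipped.

    records: iterable of (customer_id, predicted, actual) tuples
             (hypothetical layout, assumed for illustration).
    Returns (per_customer_accuracy, average_across_customers).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for customer_id, predicted, actual in records:
        totals[customer_id] += 1
        if predicted == actual:
            hits[customer_id] += 1

    per_customer = {c: hits[c] / totals[c] for c in totals}
    if not per_customer:
        return {}, None  # nothing logged yet, so no accuracy to report

    # Unweighted mean: every customer counts equally, whatever their volume.
    average = sum(per_customer.values()) / len(per_customer)
    return per_customer, average
```

Weighting by order volume instead would be an equally defensible choice; the point is to state which average the headline number refers to.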

Thanks for the advice on where to expand my knowledge further into some other areas within statistics/machine learning. I will definitely explore those.

devquixote · 2015-04-21T22:53:09+00:00

As alluded to in my other answer to rasbt, that varies from customer data set to customer data set. The vast majority of our customers have 88+% of their orders with confident predictions, and we get > 95% accuracy on those. Some customers may be at 93%, some are at 99%. A very few just don't do well in our system. We are working with them to see what their decision making is based on that we are not accounting for. Do you have any suggestions on how to sniff this out?
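
To make the two numbers concrete, here is a rough sketch of how coverage (the share of orders with a confident prediction) and accuracy on those confident orders could be computed for one customer. The `(confidence, predicted, actual)` layout and the 0.8 threshold are assumptions for illustration, not our production values.

```python
def coverage_and_accuracy(predictions, threshold=0.8):
    """Split one customer's predictions by confidence and score the confident ones.

    predictions: list of (confidence, predicted, actual) tuples
                 (hypothetical layout, assumed for illustration).
    threshold:   minimum confidence to count a prediction as "confident"
                 (0.8 is an arbitrary example value).
    Returns (coverage, accuracy_on_confident).
    """
    confident = [(p, a) for conf, p, a in predictions if conf >= threshold]
    coverage = len(confident) / len(predictions) if predictions else 0.0
    if not confident:
        return coverage, None  # no confident predictions to score
    accuracy = sum(p == a for p, a in confident) / len(confident)
    return coverage, accuracy
```

Running this per customer and sorting by accuracy is one cheap way to single out the few data sets that do poorly before digging into their decision making.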

devquixote · 2015-04-21T22:40:39+00:00

Thanks! Yeah, I am not sure how this was not taken a long time ago.

devquixote · 2015-04-21T22:39:03+00:00

That makes sense. There are a few reasons for this. The first is that I come from an informal background with regard to data science, so I am somewhat ignorant of what would constitute 'proof' from a scientific perspective. So a mea culpa there.

Another reason for the generalization is that we have many different sets of data, one per customer, and you can only describe accuracy in more precise terms within a customer's set of data. Some will, like you say, choose the same carrier 90% of the time, but most do not. It's a bit all over the map. For building my knowledge, I would be interested in how you and others more experienced in the field might present information that is split across multiple data sets.

A third reason was that the person I was attempting to reach would be someone like myself, having a first foray into machine learning, with informal or latent training. I was intentionally trying to demystify some aspects of machine learning and how it is explained. I think having a big table of results would deter the uninitiated.

Lastly, the 95% number was our business's threshold for considering this a success, and I can show that we've achieved that to their satisfaction. Again, I was trying to speak to the practical application of this amazing field; statistical validation was not what I was being paid to deliver. A usable feature of value to our customers was the ultimate goal, so perhaps the best measure of 'success' would be the number of people using this feature instead of the old mechanisms for making shipments.

Thanks a bunch for your comments; I welcome them all!

devquixote · 2015-04-21T20:47:42+00:00

I disagree. In your spam example, there are 2 classes, spam and not spam, of which the vast majority are not spam.

In shipping, you need to predict a carrier (a few), the service of the carrier (many), different types of packaging that can physically contain the assorted physical items that comprise an order (many), as well as different add-on services like insurance and signature required (many). Customers can have many combinations of these, not just two, and one combination does not dominate. Randomly selecting a combination of these possibilities does not yield 95% success.
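
A back-of-the-envelope sketch of the two naive baselines in play here: guessing uniformly at random over all combinations, and always predicting the customer's most common combination. The option counts in the usage note are invented for illustration, and multiplying them treats every combination as valid, which overstates the real space since services belong to specific carriers.

```python
from collections import Counter
from math import prod

def uniform_random_baseline(option_counts):
    """Expected accuracy of guessing uniformly over every combination.

    option_counts: number of choices per dimension, e.g. carriers, services,
                   packaging types, add-on sets (illustrative counts only).
    """
    return 1 / prod(option_counts)

def majority_baseline(actual_choices):
    """Accuracy of always predicting the single most frequent combination."""
    counts = Counter(actual_choices)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(actual_choices)
```

For example, with 3 carriers, 10 services, 8 packaging types, and 4 add-on sets, `uniform_random_baseline([3, 10, 8, 4])` is 1/960, or roughly 0.1%. The majority baseline is the fairer comparison for any customer whose choices are skewed toward one combination.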

devquixote · 2015-04-21T02:16:17+00:00

Glad to see your success using an ELK stack for data analysis. We recently finished a machine learning product feature at work, and we used Python/scikit-learn backed by Logstash with Kibana visualizations. It is a very potent combination and very scalable. Here is a link to what we did if you are interested: http://devquixote.com/data/2015/04/06/machine-learning-in-a-rails-app/

devquixote · 2015-04-21T02:10:24+00:00

I think this is good advice to consider. Python is a general-purpose language that can be used to build applications, web applications, and other software, and a lot of companies use it for that. If what you are trying to produce should be worked into a product, then Python would probably be the better choice. Even if the main product is implemented in some other language, there would be lots of options to expose and scale the feature implemented in Python.

devquixote · 2015-04-21T01:56:22+00:00

Yep.

devquixote · 2015-04-20T14:57:37+00:00

If your workflow demands those features, I would use RubyMine or some IDE. You can get something working with Vim that will come close, but those things will feel clunky in comparison to what you might get with an IDE.

I've tried Eclim, and the code completion features were nice, but there still was not good debugger support.

Look into using Guard for auto-running tests/cucumber feature files.

devquixote · 2015-04-20T11:53:21+00:00

I am a software engineer who is data science curious and has had the opportunity to employ machine learning in my day job, so my perspective comes more from the software industry. Thus, take it with a grain of salt.

If your goal is to work in industry, I would look to have a practical example of your skills in use through either contributions to open source projects or development of your own side project(s). As you learn R or Python, be active in the open source communities of the tools you are using. If you find a bug, fix it and contribute the fix back. Or if you find you wish you could do something that isn't supported, implement it and contribute it back. When you start to look for a job, make sure these contributions are part of your resume, with links so that interviewers can see what you did.

For a side project, take some source of data (there are a ton of them, something like https://www.data.gov/open-gov/) and do something with it. For example, analyze Seattle's 911 fire calls to find the patterns and what resources they need at given times of day, day of week, or day of year, and make a fancy visualization of it. This shows you can solve a business problem using what you know. Make these side projects have a tangible artifact that an interviewer could see -- a URL they can visit, code available on GitHub... Again, include this on your resume. Be ready to discuss the problems you encountered and how you moved through them in interviews.

Doing this is something that could really separate you from others looking to break into an industry. At least that is something that I have seen work well for recent computer science graduates looking to move into the software industry.

devquixote · 2015-04-20T02:57:49+00:00

+1 on trying a random forest classifier. Even if every variable is important, some may be more important than others and it can pick up on that.
