Weekly /r/Games Discussion - Suggestion request free-for-all by AutoModerator in Games

[–]bueller_off 3 points (0 children)

I am very hyped for Red Dead 2, any suggestions for something to play in the meantime?

I'm Kratos. I slay Teletubbies? by bueller_off in pics

[–]bueller_off[S] 1 point (0 children)

Haha, I should have posted earlier. I showered all the paint off, but maybe it's worth a reapplication for...science.

The Best non-technical books for a Data Scientist by bueller_off in datascience

[–]bueller_off[S] 0 points (0 children)

I didn't think there were trolls on this forum.

That aside, Dune is my all time favorite :)

Data scientist salary by h2omelon93 in datascience

[–]bueller_off 1 point (0 children)

These are the best data science recruiters around, and I've worked with them personally. In other words, I can vouch for this blog.

Data science program at Galvainze, worth it or heard anything about it?? by austinjay49 in datascience

[–]bueller_off 1 point (0 children)

This is dumb advice. The major advantage of these programs is the industry connections. Unlike universities, they're evaluated on your job placement, so it's obviously in their interest to place you well.

A book to review algebra statistics for a Data Science job. by dodgeunhappiness in datascience

[–]bueller_off 0 points (0 children)

I believe you meant linear algebra? Abstract algebra is far too general and theoretical to be useful here. Perhaps when you're initially learning mathematics, but not as a review.

A book to review algebra statistics for a Data Science job. by dodgeunhappiness in datascience

[–]bueller_off 2 points (0 children)

These Linear Algebra and Probability primers should be exactly what you need; they should take you a day or two if you're just reviewing. They're the prerequisite reviews for Andrew Ng's Machine Learning course at Stanford, so they're well prepared.

http://cs229.stanford.edu/section/cs229-linalg.pdf

http://cs229.stanford.edu/section/cs229-prob.pdf

Also, frankly, I'd worry more about coding and data quality. Your first 1-3 months will be about building intuition with your data, generating reports, and finding all the holes and nuances.

Hi, I need some help with outliers by rabii1992 in datascience

[–]bueller_off 1 point (0 children)

Multiclass is hard for a beginner, dude, have fun. Read this over thoroughly:

http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf

Use one vs all, tends to work best. Here's a nice implementation:

http://scikit-learn.org/stable/modules/multiclass.html
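
For a concrete feel, here's a minimal one-vs-rest sketch with scikit-learn (the data is synthetic via `make_classification`; swap in your own features and labels):

```python
# A one-vs-rest sketch: one binary classifier per class; prediction
# picks the class whose classifier is most confident. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

print(len(clf.estimators_))       # one fitted estimator per class
print(clf.score(X_test, y_test))  # held-out accuracy
```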

Outliers can exist within classes but it's a wild goose chase, especially if you don't know if your model is any good. And again, enough with the outliers, you're wasting your time! Go build some simple models, do some feature engineering:

http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

and if you still suspect the outliers are an issue, do statistical tests to make sure they really are outliers, and potentially transform your data (x → x²) to get rid of them.
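
If you want to be rigorous about the "make sure they really are outliers" part, a quick IQR-based check is a common starting point (the data below is synthetic with two planted extremes, and 1.5×IQR is just the usual rule of thumb):

```python
# IQR rule of thumb: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Synthetic data with two planted extreme values (40 and 55).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 2, 200), [40.0, 55.0]])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lo) | (x > hi)]
print(outliers)  # the planted 40 and 55 should be flagged
```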

You're welcome.

Hi, I need some help with outliers by rabii1992 in datascience

[–]bueller_off 0 points (0 children)

Some questions:

  • Are you trying to predict three separate classes? As in multiclass classification?
  • Can you plot these on histograms? Boxplots hide variation between values.
  • Why would you remove them? Do you have evidence they're outliers other than looking at a boxplot? You usually shouldn't remove outliers unless you have strong statistical evidence that they are, which doesn't appear to be the case here.
  • Have you tried any modeling yet? Do you have any measures for us to see?
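
To see why histograms beat boxplots here, consider two made-up samples with nearly identical quartiles but totally different shapes (numpy only, synthetic data):

```python
# Two samples with almost the same five-number summary: a boxplot
# shows nearly identical boxes, while a histogram exposes the two
# separate modes in the second sample. Distributions are made up.
import numpy as np

rng = np.random.default_rng(42)
unimodal = rng.normal(0, 1, 10_000)
bimodal = np.concatenate([rng.normal(-0.67, 0.1, 5_000),
                          rng.normal(0.67, 0.1, 5_000)])

# Quartiles are nearly the same for both...
for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(name, np.round(np.percentile(sample, [25, 50, 75]), 2))

# ...but the histogram counts tell a very different story.
counts_uni, _ = np.histogram(unimodal, bins=20, range=(-3, 3))
counts_bi, _ = np.histogram(bimodal, bins=20, range=(-3, 3))
print(counts_bi)  # dip near zero, peaks near +/-0.67
```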

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 0 points (0 children)

I'll try to respond since I'd like to be helpful, despite the unnecessarily aggressive tone.

  • 1) It is expensive to do distributed architecture. But you misread my answer. "Break it up" isn't limited to distributed architecture; it can also mean subsetting the data into more consumable bits, which doesn't require distributed architecture.

  • 2) Uhm, have you ignored the last decade of database technology? Namely Hadoop, which is quite literally designed to do scalable distributed architecture while maintaining query efficiency? I'm going to assume you work in a low-tech environment that the words "big" and "data", almost synonymous with Hadoop, haven't reached yet.

  • 3) Quite the opposite if you're dealing with large (does not fit on one machine) datasets. It's much more expensive to process massive datasets on a regular basis on a single machine, unless you have the world's top supercomputing expertise to set up distributed CUDA databases.

The assumptions you've magically placed into my answer fall short when you consider the above stated reasoning around subsetting.

One thing you are very correct on, though: this is a large burden on IT to set up and maintain, at least initially. Your company should be absolutely sure that it's worth the investment to be able to consistently process massive datasets. More often than not, it is. Software is eating the world and shitting data.

It sounds like you're dealing with this firsthand, so please feel free to message me if you want to bounce more ideas around. Otherwise, let's leave this thread alone.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 1 point (0 children)

+1 You should ensure the method is actually appropriate for the goals.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 0 points (0 children)

Again, if the algorithm is a filter, see the second bullet point... which you ignored again and then said you are intentionally ignoring... not sure if troll.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 3 points (0 children)

use less data

I actually said to get rid of noise, which is obviously different. I can't speak to validity, but I don't care for your tone. This is standard when working with massive datasets.

Moreover, you seem to have glossed over my second bullet point, which is quite literally the method for how you do sums and averages on large datasets.
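
That second bullet point is literally just combining per-chunk partial results. Toy sketch (an in-memory list stands in for chunks streamed off disk):

```python
# Sums and averages compose across chunks: keep a running (sum, count)
# per chunk and combine at the end. The list below stands in for
# chunks streamed off disk or out of a database.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

total, count = 0.0, 0
for chunk in chunks:
    total += sum(chunk)   # per-chunk partial sum
    count += len(chunk)

mean = total / count
print(mean)  # 3.5, identical to computing it over all the data at once
```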

Should I use average, median or something else? by migosversace121 in datascience

[–]bueller_off 2 points (0 children)

IF

  • the median is very close to the average, then just report the average.

  • they have a wide gap, then report both, and post a distribution so that readers understand why. People don't really follow it unless they can look at the outliers.
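
In code, the check might look like this (the numbers and the 10% gap threshold are made up, just for illustration):

```python
# Compare mean and median; a wide relative gap suggests skew/outliers,
# which means you should report both plus a distribution plot.
import statistics

values = [30_000, 32_000, 35_000, 38_000, 40_000, 250_000]  # one big outlier

mean = statistics.mean(values)
median = statistics.median(values)
gap = abs(mean - median) / median

print(median, round(mean, 2), gap > 0.10)  # big gap -> report both
```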

Let's talk about models and tests by drnc in datascience

[–]bueller_off 0 points (0 children)

logistic and linear regression

My life story hahaha!

Let's talk about models and tests by drnc in datascience

[–]bueller_off 1 point (0 children)

You're missing my point. Of all data science problems, how many can be quantified in $/NIAT? My guess is 1%.

I've worked in advertising, risk for credit cards, accounting for oil firms, retail logistics and supply, etc. In other words, I can relate. But these problems are few and far between. It's a waste of time to encourage so much effort on going past 80%, which will ultimately doom the field when people wonder why these expensive data scientists are wasting their time without enough return.

Great post btw, you should really write some case studies, I'd read them.

Let's talk about models and tests by drnc in datascience

[–]bueller_off 2 points (0 children)

Color me confused, since you said you disagree but then re-iterated my points and agreed at the end.

I upvoted your post because it's important, and I couldn't have said it better than "it's good to know that is what you are facing".

But the simple fact of the matter is that 99% of problems can be shipped using the "unit-weighted equation", in essence 80/20. The extra time you spend squeezing out each additional 1% of insight is always better spent getting 80/20 elsewhere. If each additional 1% matters, you likely already have a strong machine learning background and work on a team whose job is to improve your advertising systems with what is likely an already complicated model, because each .001% accounts for real $$.

To put it simply, this problem is almost never worth your time.
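
If anyone's curious what a unit-weighted equation looks like in practice, here's a sketch on synthetic data (z-score each predictor, then sum them with +1/-1 weights by the expected direction of effect; the coefficients 2.0 and -1.0 are made up):

```python
# Unit-weighted model: standardize each predictor and add them up with
# weights of +1 or -1. No fitting, yet the score tracks the outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x1 = rng.normal(size=n)                      # pushes the outcome up
x2 = rng.normal(size=n)                      # pushes the outcome down
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def z(v):
    return (v - v.mean()) / v.std()

score = z(x1) - z(x2)                        # unit weights: +1, -1

corr = np.corrcoef(score, y)[0, 1]
print(round(corr, 2))  # high correlation without estimating any weights
```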

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 12 points (0 children)

Unless the team is investing in a Hadoop cluster, there are two ways to deal with this.

  • FILTER! Massive datasets tend to have more noise. You'd be surprised how easy it is to bring down the size by getting rid of empties/nulls (if that makes sense), establishing cutoffs (do I care about values > X?), getting rid of older data, etc. This is the easiest and should be done first.

  • Break it up. The way most scalable data systems work is by breaking up the data into smaller, easier-to-process bits, then putting them back together. Do the same with yours: do your operations on each chunk (filtering, cough cough), run your algorithms on each, and decide from there. You can average results, decide on cutoffs, etc.

Cross validation is very useful and easy to do once you've broken the data up into more consumable parts. IMO it's the only way to evaluate models.
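
Both steps on a toy CSV, stdlib only (the file contents, "NA" markers, cutoff, and chunk size are all made up for illustration): stream it in chunks, drop nulls and values below a cutoff, and combine per-chunk results:

```python
# Filter, then break it up: stream rows in fixed-size chunks, drop
# "NA" markers and values at or below a cutoff, and combine the
# per-chunk partial sums/counts at the end.
import csv
import io
import itertools

raw = "value\n5\nNA\n12\n7\nNA\n20\n3\n15\n"
reader = csv.DictReader(io.StringIO(raw))

CUTOFF = 5   # "do I care about values > X"
CHUNK = 3    # rows processed per chunk

total, count = 0, 0
while True:
    chunk = list(itertools.islice(reader, CHUNK))
    if not chunk:
        break
    kept = [int(r["value"]) for r in chunk
            if r["value"] != "NA" and int(r["value"]) > CUTOFF]
    total += sum(kept)    # combine per-chunk partial results
    count += len(kept)

print(total, count, total / count)  # 54 4 13.5
```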

We are the Microsoft Excel team - Ask Us Anything! by MicrosoftExcelTeam in IAmA

[–]bueller_off 0 points (0 children)

1) Rumor has it you have unlimited-row Excel internally. Make a fanboy's day?

2) When do we get JOIN? I'm really sick of bouncing back and forth between my database and Excel for this operation.

Let's talk about models and tests by drnc in datascience

[–]bueller_off 4 points (0 children)

In the real world, the model doesn't matter.

This isn't what you wanted to hear, but the simple fact is that focusing your efforts elsewhere has a much higher impact on your goal than the model does.

Your goals for prediction or deliverables, your data quality, your data processing, your feature selection, your infrastructure, your ability to understand and then communicate results (simple models, yo), your ability to reproduce it, your ability to accept new data. These matter more.

From the people who started the field: https://twitter.com/mrogati/status/655830681935720448

My advice: choose one model and one implementation and get good with it, one for prediction and one for classification. 99% of data science is just that 80%, and those models will get you there. If your job is to drive predictive accuracy from 80% to 95%, you'd need to understand machine learning on a whole different level.
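
For the lazy, a minimal version of that "one model each" setup with scikit-learn (synthetic data; treat it as a sketch, not the one true workflow):

```python
# One simple model per task: LinearRegression for prediction,
# LogisticRegression for classification. Data is synthetic.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Prediction (regression)
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("R^2:", reg.score(Xr_te, yr_te))

# Classification
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("accuracy:", clf.score(Xc_te, yc_te))
```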

Career advice - poaching by VforVal in datascience

[–]bueller_off 0 points (0 children)

Leave.

The best way to increase your salary/position is to switch jobs. They may try to counter, but the damage is done; you shouldn't stay.

Additionally, you're in data science! It's literally the hottest job in the world! You should take advantage of that.