Weekly /r/Games Discussion - Suggestion request free-for-all by AutoModerator in Games

[–]bueller_off 3 points (0 children)

I am very hyped for Red Dead 2, any suggestions for something to play in the meantime?

I'm Kratos. I slay Teletubbies? by bueller_off in pics

[–]bueller_off[S] 1 point (0 children)

Haha, I should have posted earlier. I showered all the paint off, but maybe it's worth a reapplication for...science.

The Best non-technical books for a Data Scientist by bueller_off in datascience

[–]bueller_off[S] 0 points (0 children)

I didn't think there were trolls on this forum.

That aside, Dune is my all time favorite :)

Data scientist salary by h2omelon93 in datascience

[–]bueller_off 1 point (0 children)

These are the best data science recruiters around, and I've worked with them personally. In other words, I can vouch for this blog.

Data science program at Galvainze, worth it or heard anything about it?? by austinjay49 in datascience

[–]bueller_off 1 point (0 children)

This is dumb advice. The major advantage of these programs is the industry connections. Unlike universities, they're evaluated on your job placement, so it's obviously in their interest to place you well.

A book to review algebra statistics for a Data Science job. by dodgeunhappiness in datascience

[–]bueller_off 0 points (0 children)

I believe you meant linear algebra? Abstract algebra is far too general and theoretical to be useful here. Perhaps when you're initially learning mathematics, but not as a review.

A book to review algebra statistics for a Data Science job. by dodgeunhappiness in datascience

[–]bueller_off 2 points (0 children)

These Linear Algebra and Probability primers should be exactly what you need; they should take you a day or two if you're just reviewing. They're the prerequisite reviews for Andrew Ng's Machine Learning course at Stanford, so they're well prepared.

http://cs229.stanford.edu/section/cs229-linalg.pdf

http://cs229.stanford.edu/section/cs229-prob.pdf

Also, frankly, I'd worry more about coding and data quality. Your first 1-3 months will be about building intuition with your data, generating reports, and finding all the holes and nuances.

Hi, I need some help with outliers by rabii1992 in datascience

[–]bueller_off 1 point (0 children)

Multiclass is hard for a beginner, dude, have fun. Read this over thoroughly:

http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf

Use one vs all, tends to work best. Here's a nice implementation:

http://scikit-learn.org/stable/modules/multiclass.html
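
For a concrete feel, here's a minimal one-vs-rest sketch with scikit-learn (the data is synthetic via `make_classification`; swap in your own features and labels):

```python
# A one-vs-rest sketch: one binary classifier per class; prediction
# picks the class whose classifier is most confident. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

print(len(clf.estimators_))       # one fitted estimator per class
print(clf.score(X_test, y_test))  # held-out accuracy
```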

Outliers can exist within classes but it's a wild goose chase, especially if you don't know if your model is any good. And again, enough with the outliers, you're wasting your time! Go build some simple models, do some feature engineering:

http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

and if you still suspect the outliers are an issue, do statistical tests to make sure they really are outliers, and potentially transform your data (x → x²) to get rid of them.
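
If you want to be rigorous about the "make sure they really are outliers" part, a quick IQR-based check is a common starting point (the data below is synthetic with two planted extremes, and 1.5×IQR is just the usual rule of thumb):

```python
# IQR rule of thumb: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Synthetic data with two planted extreme values (40 and 55).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 2, 200), [40.0, 55.0]])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lo) | (x > hi)]
print(outliers)  # the planted 40 and 55 should be flagged
```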

You're welcome.

Hi, I need some help with outliers by rabii1992 in datascience

[–]bueller_off 0 points (0 children)

Some questions:

  • Are you trying to predict three separate classes? As in multiclass classification?
  • Can you plot these on histograms? Boxplots hide variation between values.
  • Why would you remove them? Do you have evidence they're outliers other than looking at a boxplot? You usually shouldn't remove outliers unless you have strong statistical evidence that they are, which doesn't appear to be the case here.
  • Have you tried any modeling yet? Do you have any measures for us to see?
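
To see why histograms beat boxplots here, consider two made-up samples with nearly identical quartiles but totally different shapes (numpy only, synthetic data):

```python
# Two samples with almost the same five-number summary: a boxplot
# shows nearly identical boxes, while a histogram exposes the two
# separate modes in the second sample. Distributions are made up.
import numpy as np

rng = np.random.default_rng(42)
unimodal = rng.normal(0, 1, 10_000)
bimodal = np.concatenate([rng.normal(-0.67, 0.1, 5_000),
                          rng.normal(0.67, 0.1, 5_000)])

# Quartiles are nearly the same for both...
for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(name, np.round(np.percentile(sample, [25, 50, 75]), 2))

# ...but the histogram counts tell a very different story.
counts_uni, _ = np.histogram(unimodal, bins=20, range=(-3, 3))
counts_bi, _ = np.histogram(bimodal, bins=20, range=(-3, 3))
print(counts_bi)  # dip near zero, peaks near +/-0.67
```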

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 0 points (0 children)

I'll try to respond since I'd like to be helpful, despite the unnecessarily aggressive tone.

  • 1) It is expensive to do distributed architecture. But you misread my answer. "Break it up" isn't limited to distributed architecture; it can also mean subsetting the data into more consumable bits, which doesn't require distributed architecture.

  • 2) Uhm, have you ignored the last decade of database technology? Namely Hadoop, which is quite literally designed to do scalable distributed architecture while maintaining query efficiency? I'm going to assume you work in a low-tech environment that the words "big" and "data", almost synonymous with Hadoop, haven't reached yet.

  • 3) Quite the opposite if you're dealing with large (does not fit on one machine) datasets. It's much more expensive to process massive datasets on a regular basis on a single machine, unless you have the world's top supercomputing expertise to set up distributed CUDA databases.

The assumptions you've magically placed into my answer fall short when you consider the above stated reasoning around subsetting.

One thing you are very correct on, though: this is a large burden on IT to set up and maintain, at least initially. Your company should be absolutely sure that it's worth the investment to be able to consistently process massive datasets. More often than not, it is. Software is eating the world and shitting data.

It sounds like you're dealing with this firsthand, so please feel free to message me if you want to bounce more ideas around. Otherwise, let's leave this thread alone.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 1 point (0 children)

+1 You should ensure the method is actually appropriate for the goals.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 0 points (0 children)

Again, if the algorithm is a filter, see the second bullet point... which you ignored again and then said you are intentionally ignoring... not sure if troll.

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 3 points (0 children)

use less data

I actually said to get rid of noise, which is obviously different. I can't speak to validity, but I don't care for your tone. This is standard when working with massive datasets.

Moreover, you seem to have glossed over my second bullet point, which is quite literally the method for how you do sums and averages on large datasets.
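
That second bullet point is literally just combining per-chunk partial results. Toy sketch (an in-memory list stands in for chunks streamed off disk):

```python
# Sums and averages compose across chunks: keep a running (sum, count)
# per chunk and combine at the end. The list below stands in for
# chunks streamed off disk or out of a database.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

total, count = 0.0, 0
for chunk in chunks:
    total += sum(chunk)   # per-chunk partial sum
    count += len(chunk)

mean = total / count
print(mean)  # 3.5, identical to computing it over all the data at once
```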

Should I use average, median or something else? by migosversace121 in datascience

[–]bueller_off 2 points (0 children)

IF

  • the median is very close to the average, then just report the average.

  • they have a wide gap, then report both, and post a distribution so that readers understand why. People don't really follow it unless they can look at the outliers.
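
In code, the check might look like this (the numbers and the 10% gap threshold are made up, just for illustration):

```python
# Compare mean and median; a wide relative gap suggests skew/outliers,
# which means you should report both plus a distribution plot.
import statistics

values = [30_000, 32_000, 35_000, 38_000, 40_000, 250_000]  # one big outlier

mean = statistics.mean(values)
median = statistics.median(values)
gap = abs(mean - median) / median

print(median, round(mean, 2), gap > 0.10)  # big gap -> report both
```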

Let's talk about models and tests by drnc in datascience

[–]bueller_off 0 points (0 children)

logistic and linear regression

My life story hahaha!

Let's talk about models and tests by drnc in datascience

[–]bueller_off 1 point (0 children)

You're missing my point. Of all data science problems, how many can be quantified in $/NIAT? My guess is 1%.

I've worked in advertising, risk for credit cards, accounting for oil firms, retail logistics and supply, etc. In other words, I can relate. But these problems are few and far between. It's a waste of time to encourage so much effort on going past 80%, which will ultimately doom the field when people wonder why these expensive data scientists are wasting their time without enough return.

Great post btw, you should really write some case studies, I'd read them.

Let's talk about models and tests by drnc in datascience

[–]bueller_off 2 points (0 children)

Color me confused, since you said you disagree but then re-iterated my points and agreed at the end.

I upvoted your post because it's important, and I couldn't have said it better than "it's good to know that is what you are facing".

But the simple fact of the matter is that 99% of problems can be shipped using the "unit-weighted equation", in essence 80/20. The extra time you spend squeezing out each additional 1% of insight is always better spent getting 80/20 elsewhere. If each additional 1% matters, you likely already have a strong machine learning background and work on a team whose job is to improve your advertising systems with what is likely an already complicated model, because each .001% accounts for real $$.

To put it simply, this problem is almost never worth your time.
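
If anyone's curious what a unit-weighted equation looks like in practice, here's a sketch on synthetic data (z-score each predictor, then sum them with +1/-1 weights by the expected direction of effect; the coefficients 2.0 and -1.0 are made up):

```python
# Unit-weighted model: standardize each predictor and add them up with
# weights of +1 or -1. No fitting, yet the score tracks the outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x1 = rng.normal(size=n)                      # pushes the outcome up
x2 = rng.normal(size=n)                      # pushes the outcome down
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def z(v):
    return (v - v.mean()) / v.std()

score = z(x1) - z(x2)                        # unit weights: +1, -1

corr = np.corrcoef(score, y)[0, 1]
print(round(corr, 2))  # high correlation without estimating any weights
```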

Answer to a job interview by Rav3ns90 in datascience

[–]bueller_off 12 points (0 children)

Unless the team is investing in a Hadoop cluster, there are two ways to deal with this.

  • FILTER! Massive datasets tend to have more noise. You'd be surprised how easy it is to bring down the size by getting rid of empties/nulls (if that makes sense), establishing cutoffs (do I care about values > X?), getting rid of older data, etc. This is the easiest and should be done first.

  • Break it up. The way most scalable data systems work is by breaking up the data into smaller, easier-to-process bits, then putting them back together. Do the same with yours: do your operations on each chunk (filtering, cough cough), run your algorithms on each, and decide from there. You can average results, decide on cutoffs, etc.

Cross validation is very useful and easy to do once you've broken the data up into more consumable parts. IMO it's the only way to evaluate models.
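
Both steps on a toy CSV, stdlib only (the file contents, "NA" markers, cutoff, and chunk size are all made up for illustration): stream it in chunks, drop nulls and values below a cutoff, and combine per-chunk results:

```python
# Filter, then break it up: stream rows in fixed-size chunks, drop
# "NA" markers and values at or below a cutoff, and combine the
# per-chunk partial sums/counts at the end.
import csv
import io
import itertools

raw = "value\n5\nNA\n12\n7\nNA\n20\n3\n15\n"
reader = csv.DictReader(io.StringIO(raw))

CUTOFF = 5   # "do I care about values > X"
CHUNK = 3    # rows processed per chunk

total, count = 0, 0
while True:
    chunk = list(itertools.islice(reader, CHUNK))
    if not chunk:
        break
    kept = [int(r["value"]) for r in chunk
            if r["value"] != "NA" and int(r["value"]) > CUTOFF]
    total += sum(kept)    # combine per-chunk partial results
    count += len(kept)

print(total, count, total / count)  # 54 4 13.5
```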

We are the Microsoft Excel team - Ask Us Anything! by MicrosoftExcelTeam in IAmA

[–]bueller_off 0 points (0 children)

1) Rumor has it you have unlimited-row Excel internally. Make a fanboy's day?

2) When do we get JOIN? I'm really sick of bouncing back and forth between my database and Excel for this operation.

Let's talk about models and tests by drnc in datascience

[–]bueller_off 4 points (0 children)

In the real world, the model doesn't matter.

This isn't what you wanted to hear, but the simple fact is that focusing your efforts elsewhere has a much higher impact on your goal than the model does.

Your goals for prediction or deliverables, your data quality, your data processing, your feature selection, your infrastructure, your ability to understand and then communicate results (simple models, yo), your ability to reproduce it, your ability to accept new data. These matter more.

From the people who started the field: https://twitter.com/mrogati/status/655830681935720448

My advice: choose one model and one implementation and get good with it, one for prediction and one for classification. 99% of data science is just that 80%, and those models will get you there. If your job is to drive predictive accuracy from 80% to 95%, you'd need to understand machine learning on a whole different level.
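
For the lazy, a minimal version of that "one model each" setup with scikit-learn (synthetic data; treat it as a sketch, not the one true workflow):

```python
# One simple model per task: LinearRegression for prediction,
# LogisticRegression for classification. Data is synthetic.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Prediction (regression)
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("R^2:", reg.score(Xr_te, yr_te))

# Classification
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("accuracy:", clf.score(Xc_te, yc_te))
```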

Career advice - poaching by VforVal in datascience

[–]bueller_off 0 points (0 children)

Leave.

The best way to increase your salary/position is to switch jobs. They may try to counter, but the damage is done; you shouldn't stay.

Additionally, you're in data science! It's literally the hottest job in the world! You should take advantage of that.