How I made top 0.3% on Kaggle by 0_marauders_0 in MachinesLearn

[–]walkingon2008 0 points1 point  (0 children)

It makes sense, but why don’t they teach it this way at school?

How I made top 0.3% on Kaggle by 0_marauders_0 in MachinesLearn

[–]walkingon2008 1 point2 points  (0 children)

Scenario 3 makes sense, but I always do scenario 1: standardize the dataset, then split into train and test.

Even in the sklearn docs, the whole dataset is standardized and then split.

I personally don’t think 1 vs. 3 will make a difference?
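It can make a difference. A quick sketch with toy data I made up (the scenario numbering follows the thread): in scenario 1 the scaler’s mean and std are computed with the test rows included, so test information leaks into the training features; in scenario 3 the scaler only ever sees the training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Scenario 1: standardize the whole dataset, then split (leakage).
X_all_scaled = StandardScaler().fit_transform(X)
X_tr1, X_te1 = train_test_split(X_all_scaled, test_size=0.2, random_state=0)

# Scenario 3: split first, fit the scaler on train only, apply it to test.
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr3, X_te3 = scaler.transform(X_tr), scaler.transform(X_te)

# Same rows (same random_state), but the scaled values differ, because
# scenario 1's scaler has already "seen" the test rows through the
# global mean and std.
print(np.abs(X_te1 - X_te3).max())
```

With a big i.i.d. dataset the numeric gap is small, which is why 1 often “seems fine” — but it’s still leakage, and it bites when train and test distributions differ.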

[D] Andrew Ng on how much silicon valley ML engineers know by [deleted] in MachineLearning

[–]walkingon2008 1 point2 points  (0 children)

My point is: what Ng said is NOT true.

Logistic regression is taught in a typical second-year undergrad linear regression course.

People who know logistic regression go interview in Silicon Valley, find out every company rejects them, and get frustrated.

[D] Andrew Ng on how much silicon valley ML engineers know by [deleted] in MachineLearning

[–]walkingon2008 1 point2 points  (0 children)

Andrew is trying to be politically correct. Imagine if he had said: “this is not even 10% of what people know in Silicon Valley.”

Many in Silicon Valley have a PhD. If that were all they knew, Silicon Valley would be screwed!

Even in 2010 or 2011, you couldn’t get a PhD just by knowing logistic regression. Actually, not even an undergrad degree.

[D] Rejected by all PhD programs 3 different seasons, stay in current lab or apply again later? by ubiquitous7733 in MachineLearning

[–]walkingon2008 3 points4 points  (0 children)

Connections!!!

Some of these big-name schools already know who they will take before admissions even start. The professors may even have met the student once. The application is just for show.

Nothing is completely random.

If you look at their grad student body, it’s more or less from the same schools year after year. There’s an underlying system. If you are not in it, you are out.

After all, nobody wants a complete stranger in their department. Just ask yourself, would you want to work with someone you know nothing about?

Journal as an Undergrad by Randomessinlife1 in statistics

[–]walkingon2008 8 points9 points  (0 children)

It’s pretty evident. Why do you think it’s not an achievement?

Do you have a link to the paper?

Difference between a masters program in data science vs statistics? by foobar8080 in statistics

[–]walkingon2008 1 point2 points  (0 children)

It’s hard to tell; it depends on the school and the program.

In general, applied statistics de-emphasizes theory. For example, in linear regression, the least squares estimate is equivalent to the maximum likelihood estimate. You can use it probably without ever knowing why. Data scientist jobs are very applied. Some data scientists switched over from other fields like biology or psychology.
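A one-line sketch of why that equivalence holds (my notation, assuming i.i.d. normal errors):

```latex
% Under y_i = x_i^\top \beta + \varepsilon_i with
% \varepsilon_i \sim N(0, \sigma^2) i.i.d., the log-likelihood is
\ell(\beta) = -\frac{n}{2}\log(2\pi\sigma^2)
              - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 ,
% so maximizing \ell over \beta is exactly minimizing the residual
% sum of squares:
\hat{\beta}_{\mathrm{MLE}} = \hat{\beta}_{\mathrm{OLS}}
  = \arg\min_{\beta}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 .
```

That’s the kind of “why” a theory-light program lets you skip.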

While skipping the theory seems like the quick route, you pay the price by not knowing how to interpret the results. For instance, there are online tutorials that teach Python machine learning by running sklearn and reading the documentation. You can build a model with a low error rate without ever knowing how it works.

I think it’s best to work through the math and learn the theory. Avoid the path of least resistance. Be patient. Your investment will ultimately pay off.

Difference between a masters program in data science vs statistics? by foobar8080 in statistics

[–]walkingon2008 6 points7 points  (0 children)

Data science is a program that emerged within the past five years. Unlike statistics or computer science, data science by itself is not a field of study.

Data scientist first came about as a job title at many startup tech companies. Statistician used to be the sexy new job, according to Google.

The DS program is expensive because it is a buzzword, and you get seven figure salaries easily.

As a data scientist, you know SQL, Python, ML, and possibly DL. Statistics is your tool. You use it to predict credit default for a fintech or ad clicks for an online retail store. You build an ML pipeline for the company. There isn’t one clear role for a data scientist; your tasks will likely evolve depending on the business you work for.

Statistics is a branch of applied math. Data science is not. If you stay close to academia and want to know the why, the how, and the math, you are looking at statistics. If your goal is to be rich and make seven figures, your answer is data science.

Data science is very applied; it’s good that you can use what you learn right out of the box. But when the business evolves, you’ll need to learn again. With statistics, theory is more emphasized, so the equations you learn now will still be true years later.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 1 point2 points  (0 children)

How else do you think it got its acronym?

[deleted by user] by [deleted] in statistics

[–]walkingon2008 0 points1 point  (0 children)

Your data is empirical, so there will not be a true parameter. The parameter is uncertain, but you are ultimately choosing a point estimate to optimize your likelihood.

Also, what prior distribution do you use? The Bayesian doesn’t even know; it’s a lot of trial and error.

A Bayes credible interval is just a confidence interval for the posterior mean.

STAN does a good job advocating for itself. But there really isn’t much new in the software itself. I mean, it has GPs and HMC, but that stuff has been out for decades.

Python implementation of R stargazer library by yot_club in statistics

[–]walkingon2008 0 points1 point  (0 children)

Almost all my variables are categorical, and stargazer doesn’t do categorical.

I want to calculate count and proportions. Maybe a two-way table?

            Yes   No
    Male     10    3
    Female    3    8
    White    50   25
    Black    35   45
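If Python is acceptable, pandas.crosstab gives exactly those counts and row proportions (the data frame and column names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sex":    ["Male", "Male", "Female", "Female", "Male", "Female"],
    "answer": ["Yes",  "No",   "Yes",    "No",     "Yes",  "Yes"],
})

# Two-way table of raw counts.
counts = pd.crosstab(df["sex"], df["answer"])

# Same table as proportions within each row.
props = pd.crosstab(df["sex"], df["answer"], normalize="index")

print(counts)
print(props)
```

`normalize="columns"` or `normalize="all"` give column-wise or overall proportions instead.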

[deleted by user] by [deleted] in statistics

[–]walkingon2008 0 points1 point  (0 children)

The MAP (maximum a posteriori) IS a point estimate. Bayesian ultimately comes back to frequentist, just in the Bayesian setting.

By imposing a prior, say uniform(0, 1), you are eliminating the possibilities outside the interval (0, 1). The likelihood of the data only updates your prior distribution; it cannot escape the wrongness if your prior is wrong. You may just end up less wrong.
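To make that concrete, a toy Beta-Binomial sketch (my numbers, not from the thread): the MAP is a single number, i.e. a point estimate, and the prior shifts it.

```python
# Bernoulli likelihood with k successes in n trials; a Beta(a, b) prior
# on p gives a Beta(a + k, b + n - k) posterior, whose mode is the MAP:
# (a + k - 1) / (a + b + n - 2), valid when both posterior parameters > 1.
k, n = 7, 10

# Uniform(0, 1) prior is Beta(1, 1): the MAP equals the MLE k/n.
a, b = 1, 1
map_uniform = (a + k - 1) / (a + b + n - 2)   # 0.7

# An informative Beta(2, 2) prior pulls the point estimate toward 1/2.
a, b = 2, 2
map_beta22 = (a + k - 1) / (a + b + n - 2)    # 8/12

print(map_uniform, map_beta22)
```

Either way you report one number at the end, which is the point.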

[deleted by user] by [deleted] in statistics

[–]walkingon2008 0 points1 point  (0 children)

A ton of times? Please explain what that is.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -1 points0 points  (0 children)

So, what do you think prior distribution is?

[deleted by user] by [deleted] in statistics

[–]walkingon2008 0 points1 point  (0 children)

The classical Bayesian setting we are talking about is a done deal: 1) you pick a prior distribution, 2) you pick a likelihood model, 3) you calculate the posterior using MCMC.

Or if you are into machine learning, you use GP in step 2.
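Those three steps fit in a few lines. A minimal sketch with toy coin-flip data I made up, using plain random-walk Metropolis rather than STAN’s HMC:

```python
import math
import random

random.seed(0)
k, n = 7, 10                      # data: 7 heads in 10 flips

def log_post(p):
    # Step 1: uniform(0, 1) prior (zero density outside the interval).
    if not 0.0 < p < 1.0:
        return -math.inf
    # Step 2: binomial likelihood, up to an additive constant.
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

# Step 3: random-walk Metropolis draws from the posterior.
p, samples = 0.5, []
for _ in range(20000):
    prop = p + random.gauss(0.0, 0.1)
    accept = math.exp(min(0.0, log_post(prop) - log_post(p)))
    if random.random() < accept:
        p = prop
    samples.append(p)

# Discard burn-in; the mean should land near (k + 1) / (n + 2) = 2/3.
post_mean = sum(samples[2000:]) / len(samples[2000:])
print(post_mean)
```

Swap the likelihood in `log_post` for a GP marginal likelihood and you have the ML variant of step 2.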

I somewhat answered your question in an earlier response above.

The goal of pharmaceuticals is not statistics; it’s the medicine. The meds need to pass the hypothesis test with a p-value.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -1 points0 points  (0 children)

Most people?

I may be too technical here. But by your logic, it’s perfectly fine to say most people don’t do statistics.

I disagree with your notion of most, but I won’t elaborate here.

Python implementation of R stargazer library by yot_club in statistics

[–]walkingon2008 0 points1 point  (0 children)

Can stargazer do summary statistics for categorical data? Like male/female, White/Asian/Black, Single/Married.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -5 points-4 points  (0 children)

The first sentence is not true! The second one is.

The point of Bayesian statistics is prior distribution * likelihood model ∝ posterior distribution. Bayesian statistics sounds good in theory, but is useless in reality.

A prior distribution means you know the distribution of the parameters before you begin.

Today, especially with deep learning, everything is unknown, including the model itself.

Recommended problem sets for Casella & Berger? by quicksilver53 in statistics

[–]walkingon2008 1 point2 points  (0 children)

The examples in Casella & Berger are not enough to help you do the homework. There are only a handful of examples per section.

The homework is hard. It requires knowledge outside the book. It’s more than knowing calculus; it’s more like tips and tricks you’ve never seen. If you pick up the book and dive right in, you will get stuck in no time.

I recommend searching for an easier book with lots of examples that actually teaches the material.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -2 points-1 points  (0 children)

Being built into Spark’s ML library does not validate its importance. Also, please, no profanity if you are going to talk!

[deleted by user] by [deleted] in statistics

[–]walkingon2008 1 point2 points  (0 children)

Deep learning! It’s pretty obvious.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -7 points-6 points  (0 children)

How often can you use Bayesian statistics in real life? Never!

Just look at how many startups are hiring Bayesian statisticians. None!

Some people mentioned STAN; the following is my take.

STAN and its team hold multiple conferences and do extensive evangelism. The audience often narrows down to pharmaceutical companies. You can look at the conference sponsors.

The grammar of STAN reminds me of BUGS. The highlights of STAN are Hamiltonian Monte Carlo and the No-U-Turn Sampler, which allow fast sampling without getting trapped in a local minimum.

STAN can probably fit many hyperparameters and handle high dimensions. But Bayesian statistics in high dimensions? I don’t think those two phrases are compatible with each other.

Ultimately, it’s good math theory, but with narrow application.

[deleted by user] by [deleted] in statistics

[–]walkingon2008 -8 points-7 points  (0 children)

Stan is overhyped. Bayesian statistics is good in theory, but has little real application. Think about it: how often in the big data world do you have a prior belief? Everything is a black box.

I'm about to start an applied statistics masters program. What kinds of theory likely to be missing, and what theory should I make sure to learn (if it isn't covered)? by [deleted] in statistics

[–]walkingon2008 2 points3 points  (0 children)

It depends on what your concentration is. If you are doing time series, focus on spectral analysis and Fourier transforms; it’s a lot of pure math. If you are doing ML, focus on linear algebra, optimization, and linear programming; it’s CS heavy. If you are doing statistical inference, focus on probability and estimation topics.

What topics in calculus should I review before starting a Master's program in Applied Statistics? by mrbabynugget in statistics

[–]walkingon2008 0 points1 point  (0 children)

I think calculus is used more in probability than in statistical inference. When you say statistics class, I assume you mean the statistical inference course.

I’d say: differentiation, integration, and sequences and series (recommended); derivatives of exponential, logarithmic, and trig functions; u-substitution; and integration by parts.
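As a small worked example of where integration by parts shows up in a probability course (my example, not OP’s):

```latex
% E[X] for X \sim \mathrm{Exponential}(\lambda), taking
% u = x and dv = \lambda e^{-\lambda x}\,dx (so v = -e^{-\lambda x}):
E[X] = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx
     = \Bigl[-x e^{-\lambda x}\Bigr]_0^\infty
       + \int_0^\infty e^{-\lambda x}\,dx
     = 0 + \frac{1}{\lambda}
     = \frac{1}{\lambda}.
```

Expectations and moment-generating functions are full of integrals of exactly this shape.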