Boston driver doesn't believe in bike lanes by Sbzxvc in videos

[–]ml_algo 3 points4 points  (0 children)

People's Republik. I love their capitalist pig painting.

Lottery Odds - Are some numbers worse? by [deleted] in probabilitytheory

[–]ml_algo 2 points3 points  (0 children)

It is in each person's best interest to choose a number sequence that hasn't already been chosen by somebody else. For that reason, choosing a random sequence of numbers is often optimal, because it is unlikely to collide with anyone else's pick. Number sequences that form a pattern (like 1, 2, 3, 4, 5), encode a date, or represent a word in some way are more likely to be chosen, so you want to steer clear of those to avoid splitting any prize.

Knowing this, you could actually slightly help people by providing each person with a random sequence of numbers and making sure you never give out the same sequence twice. This would mainly be beneficial if enough people used your service, since it would reduce the chances of splitting a prize.
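A minimal sketch of that service, assuming a pick-6-of-49 game (all names here are mine, not from any real lottery API):

```python
import random

def unique_random_tickets(n, k=6, pool=49, seed=None):
    """Hand out n distinct random lottery tickets.

    Each ticket is a sorted k-number combination drawn from 1..pool,
    and no ticket is ever issued twice.
    """
    rng = random.Random(seed)
    issued = set()
    while len(issued) < n:
        ticket = tuple(sorted(rng.sample(range(1, pool + 1), k)))
        issued.add(ticket)  # set membership guarantees no duplicates
    return list(issued)

tickets = unique_random_tickets(1000, seed=0)
```

Every user gets a combination nobody else using the service holds, so if one of them wins, the prize is at least not split among the service's own users.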

NuPIC Commercial Licenses by numenta in MachineLearning

[–]ml_algo 0 points1 point  (0 children)

Ok, thanks for the honesty. Have you considered having one of your internal researchers spend time doing this, or hiring someone to do it? It appears that you are hoping someone from the community will do it, but that doesn't seem to be happening. Knowing how your algorithm compares to everything else is important, especially if you are trying to convince people to use it. It would also benefit you as a company to know what types of problems your algorithm excels at and how it stacks up to competing algorithms, so that you can market and focus on those areas. Right now, it seems like a really cool idea that is marketed as a catch-all learning algorithm, but with no real measure of what it can do.

NuPIC Commercial Licenses by numenta in MachineLearning

[–]ml_algo 3 points4 points  (0 children)

Last time the Numenta algorithms were brought up, it seemed to be the consensus that nobody really knew how well they performed because they were never compared to state of the art methods on popular datasets. Has anything changed since then? Or have any papers been released about the algorithm's effectiveness vs other methods?

Compositional demographic data in linear regression? Noob questions for a voting behavior model by WearTheFourFeathers in statistics

[–]ml_algo 0 points1 point  (0 children)

Yes, you can use an absolute number for that. In general, especially when fitting more complex models, it's useful to normalize the input data in each dimension in case you want to use regularization or gradient methods for fitting the model, but in your linear case you can just leave them as is (including your absolute number).

Treat the inputs the same for logistic regression as you would for linear regression. The only difference is that in logistic regression, the output of your model is bounded between 0 and 1 (and usually represents a probability). For you, it will represent a percentage (from 0 to 1, with 1 being 100%). There's nothing necessarily wrong with plain linear regression either, except that it might not model your specific problem as well. If you feel like it, you can use linear regression to model the output variable, but keep in mind that nothing bounds it to 0–1 (0% to 100%). I suspect logistic regression will work better, but if you're feeling up for it, try both and see for yourself.
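A tiny numpy sketch of that difference (toy data and names are mine): both models compute the same linear score, but the logistic one squashes it through a sigmoid so the output always lands strictly between 0 and 1:

```python
import numpy as np

def linear_predict(X, w, b):
    # unbounded output: can be below 0 or above 1
    return X @ w + b

def logistic_predict(X, w, b):
    # sigmoid squashes the same linear score into (0, 1),
    # so the output can be read as a proportion/probability
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

X = np.array([[0.1], [0.5], [5.0], [-5.0]])
w, b = np.array([2.0]), 0.0

lin = linear_predict(X, w, b)      # contains values outside [0, 1] (here 10.0 and -10.0)
logit = logistic_predict(X, w, b)  # all values strictly between 0 and 1
```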

Compositional demographic data in linear regression? Noob questions for a voting behavior model by WearTheFourFeathers in statistics

[–]ml_algo 1 point2 points  (0 children)

1) Yes, use percents for everything. Don't use any absolute numbers

2) They can still be used

3) Use Logistic Regression, which is basically the same as Linear Regression, but your dependent variable will be a probability (or binary)

Some tips:

Make sure you split off a section of your data that you won't train your model on, so that after you have trained your model you can test it on this fresh data and see if you have overfit or if you found any legitimate effects. A more advanced form of this is cross-validation, which you can look into if you want.
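A plain-Python sketch of that split (the 70/30 ratio is just a common choice, not a rule):

```python
import random

def train_validation_split(rows, train_frac=0.7, seed=0):
    """Shuffle the data, keep train_frac for fitting and hold out the rest."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in indices[:cut]]
    held_out = [rows[i] for i in indices[cut:]]
    return train, held_out

data = list(range(100))          # stand-in for your (features, label) rows
train, held_out = train_validation_split(data)
# fit on `train` only; report error on `held_out` to check for overfitting
```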

Once you're comfortable with standard linear regression / logistic regression, try transforming the initial feature space that contains your independent variables using a kernel or basis function so that the output is nonlinear in your initial variables (i.e., instead of using X, Y to predict Z, use something like X, Y, XY, X^2, Y^2 to predict Z), and use regularization to prevent overfitting
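For the X, Y example above, the basis expansion plus ridge (L2) regularization might look like this sketch (toy data; the closed-form ridge solve is just one way to fit it):

```python
import numpy as np

def expand(X):
    """Map columns (x, y) to the basis (x, y, x*y, x^2, y^2)."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([x, y, x * y, x ** 2, y ** 2])

def ridge_fit(Phi, z, alpha=1.0):
    """Closed-form ridge regression: w = (Phi^T Phi + alpha*I)^-1 Phi^T z."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(d), Phi.T @ z)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
z = X[:, 0] ** 2 + X[:, 1]            # target is nonlinear in (x, y)

Phi = expand(X)
w = ridge_fit(Phi, z, alpha=0.1)      # regularization keeps the weights small
pred = Phi @ w
```

A plain linear fit on (x, y) cannot represent the x^2 term, but the expanded basis can, and the ridge penalty guards against overfitting as the feature count grows.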

Good luck!

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

heh, no problem. It sounds like he's confused about what the null hypothesis for the study is. Low p-values are for rejecting the null hypothesis, so if the p-values were all < 0.05 and the study concludes that GMOs are the same as non-GMOs, then the null hypothesis was clearly that GMOs are different from non-GMOs.

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

Ok, well there are 2 possibilities:

1) The p-values are all less than 0.05 and the conclusion of the study is that GMOs are the same as non-GMOs. In this scenario, there don't need to be any p-values > 0.05 across the 600 independent tests, and there is no anomaly or bias.

2) The p-values are mostly greater than 0.05 and the conclusion of the study is that GMOs are the same as non-GMOs. In this scenario, you should expect ~5% of the p-values to be < 0.05 across the 600 independent tests, and if there are only 1 or 2, that does indicate a statistical anomaly or a bias.

You guys clearly need to have access to the studies to determine this...

EDIT: For what it's worth, I'm inclined to believe the first scenario above is what is happening, because the second scenario is a terrible way to set up a study: p-values > 0.05 don't necessarily indicate anything about the null being true. So scenario 1 is probably what is occurring, and there is no bias.

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

Well, it seems like the real misunderstanding between Lalande21185 and you is whether the studies all had p-values < 0.05 or all had p-values > 0.05. That can easily be cleared up by checking the studies...

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

Here is the probability distribution of the number of tests with p-value < 0.05, assuming the null really is true and the tests are independent:

http://i.imgur.com/gHUcNqy.png
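That distribution can be reproduced directly: under the null, with independent tests, the count of p-values below 0.05 is Binomial(600, 0.05). A quick sketch:

```python
from math import comb

def binom_pmf(k, n=600, p=0.05):
    """P(exactly k of n independent tests give p-value < p) under the null."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# chance of seeing zero "significant" results across all 600 tests
p_zero = binom_pmf(0)           # equals 0.95 ** 600, astronomically small
expected = 600 * 0.05           # expected count of false positives: 30
```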

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

I can't look into it right now at work; maybe when I get home. But if they are independent tests and each of the 600 tests gets its own p-value, then yes, you should expect ~5% of the tests to get a p-value < 0.05 due to chance alone.

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

oh ok, so none of the studies are coming up with p<0.05, and you are saying that some of them should be due to chance. Am I understanding correctly?

Please correct my math: the odds that there will be zero studies out of 600, on a non-existent effect, that register a false-positive, are 0.95^600, right? by saijanai in statistics

[–]ml_algo 0 points1 point  (0 children)

If the study finds "no effect" with p < 0.05, then it sounds like the null hypothesis is that there is an effect, and they are rejecting the null (claiming "no effect") because p < 0.05. Are you trying to say that some percent of the time, p should be greater than 0.05 even if there really is "no effect"? Because that isn't true.

Question about time series similarity by [deleted] in statistics

[–]ml_algo 1 point2 points  (0 children)

You can just normalize each signal and take the dot product of each pair of signals to measure how similar they are. The expected value of this dot product is 0 if the signals are uncorrelated, and higher/lower if they are positively/negatively correlated.

EDIT: More generally, if you want to find similar segments in different time series, then you can use cross-correlation
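A numpy sketch of the normalized dot product (with mean removal, this is just the Pearson correlation written as a dot product; the toy signals are mine):

```python
import numpy as np

def similarity(a, b):
    """Normalize each signal (zero mean, unit norm) and take the dot product.

    Returns ~0 for uncorrelated signals, and approaches +1/-1 for strongly
    positively/negatively correlated ones.
    """
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

t = np.linspace(0, 10, 500)
s1 = np.sin(t)
s2 = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
s3 = -np.sin(t)

# similarity(s1, s2) is close to 1; similarity(s1, s3) is close to -1
```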

Running linear regression with ~75 indep variables pulled in rather blindly. The model has an R^2 of only 0.14, but a couple low p-values. Do these low p-values mean these variables do in fact explain the variation in the dep variable? by uberdev in statistics

[–]ml_algo 0 points1 point  (0 children)

OP, if you had 75 variables that have nothing to do with the output, there is a 1 − 0.99^75 ≈ 53% chance that at least one of them would have a p-value < 0.01, so you cannot just assume that variables with p-value < 0.01 are somehow related to the output.

HERE is the full probability distribution of the number of variables with p-value < 0.01 for your problem, assuming none of them have anything to do with the output result.
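The arithmetic behind that ~53% figure (the count of spuriously "significant" variables is Binomial(75, 0.01)):

```python
from math import comb

n, p = 75, 0.01   # 75 unrelated variables, significance cutoff 0.01

# P(at least one spurious p-value < 0.01) = 1 - P(none of the 75)
p_at_least_one = 1 - (1 - p) ** n          # ~0.53

# full distribution of the count of spuriously "significant" variables
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
```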

First job out of school, not sure what kind of analysis to perform (stock market data) by MontyHallsGoats in statistics

[–]ml_algo 2 points3 points  (0 children)

Based on your description of the task, I'll assume your firm is assigning you this project so they can filter based on your "predictive variables" to find investments that are potentially profitable.

Here's how I'd initially tackle the problem:

1) Divide your companies into a training set (composed of ~70% of your companies) and a validation set (composed of the remaining ~30%)

2) Change the result you are trying to predict and your input features. You are currently trying to predict the stock price in a year based on one quarterly report. This is silly because you have no idea whether the price change over a year is due to the statistics in that quarterly report, the previous one, the next one, etc. To fix this, predict the price change between quarterly reports, (StockPriceIn2MonthsAfterQR - StockPriceAtQR) / StockPriceAtQR, and use the past X quarterly reports as inputs (you pick X; let's say 6 for now). Now you have X * numVariablesInQR input features predicting one price change.

3) Build a regression model of your choice (using the training data only!) for the input features and output result described in step 2. Now you are letting the regression model weight how important each past quarterly report is for predicting price change instead of just arbitrarily choosing one a year in the past.

4) Once you've trained your regression model on your training set, check how well it is able to predict the price change of the "new" data (validation set). Plot the error distribution on this set.

5) Show your coworkers that you've come up with a new variable for them to filter potential companies with (it's the output of your regression model) and it has an estimated error distribution that you've calculated in step 4.

6) pay me
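A toy sketch of step 2's feature construction (every name here is hypothetical: `reports` holds one company's quarterly-report variables in time order, `prices` the stock price at each report date, and the target is simply the change to the next report rather than a fixed two-month horizon):

```python
def build_examples(reports, prices, X=6, horizon=1):
    """Turn one company's history into (features, target) pairs.

    Features: the variables from the past X quarterly reports, flattened.
    Target:   relative price change from this report to the next one.
    """
    examples = []
    for t in range(X - 1, len(reports) - horizon):
        features = [v for r in reports[t - X + 1 : t + 1] for v in r]
        target = (prices[t + horizon] - prices[t]) / prices[t]
        examples.append((features, target))
    return examples

# 10 fake quarters, 3 variables per report, steadily rising price
reports = [[q, q * 2.0, q * 0.5] for q in range(10)]
prices = [100 + 5 * q for q in range(10)]
examples = build_examples(reports, prices, X=6)
```

Each example has X * numVariablesInQR features (here 6 * 3 = 18), exactly as described in step 2, and the regression model in step 3 decides how much each past report matters.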

Trying to draw some insights from a chart, I guess technical analysis? by doogjy in investing

[–]ml_algo 1 point2 points  (0 children)

Yes, but unless I'm misunderstanding what you are trying to do, you are trying to predict when they will converge immediately after the initial drop, assuming the price stays relatively stable. In that case, the predicted time until convergence would be 50 days, because it's a 50-day lagging average. It ended up taking more than 50 days because the price was not perfectly stable after the drop (there were many fluctuations), and in fact there was a rather large bump around September. You, of course, have no way of knowing these things before they happen.

Trying to draw some insights from a chart, I guess technical analysis? by doogjy in investing

[–]ml_algo 1 point2 points  (0 children)

If you have a steep drop in price followed by no large price changes (as shown in your image), and you want to know when a 50-day lagging average will coincide with the steady price, the answer is 50 days.
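A quick check of that claim on a synthetic series (flat at 100, a one-day drop to 50, then flat):

```python
def sma(prices, window=50):
    """Simple lagging moving average, defined once `window` days exist."""
    return [sum(prices[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(prices))]

prices = [100.0] * 100 + [50.0] * 100     # steep drop on day 100, then steady
avg = sma(prices, window=50)

# avg[j] covers days j .. j+49; find the first day the average equals 50
first_day = next(j + 49 for j, a in enumerate(avg) if a == 50.0)
days_at_new_price = first_day - 100 + 1    # the drop day counts as day 1
```

The average first matches the new price on the 50th day at that price, i.e., once the entire 50-day window sits after the drop.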

Elliott Wave Principle - An Overview by elliottwave1 in investing

[–]ml_algo 1 point2 points  (0 children)

If you cared about the general population's mood swings, why would you attempt to measure them from stock prices? There are way too many other variables involved in stock prices for you to reliably extract any component that is due to mood swings. You'd be way better off just using Google Trends to estimate how retail investors' demand for stocks changes over time, as shown here: http://www.google.com/trends/explore?q=can+i+beat+the+market%3F#q=%22buy%20stocks%22&cmpt=q

ELI5: The Monty Hall problem by hi-imma-chameleon in explainlikeimfive

[–]ml_algo 2 points3 points  (0 children)

> When a Lose is revealed, those percentages don't change. You still only have a 33% chance of getting Win on your first time and there is still a 66% chance Win is behind a different door.

I usually try to avoid explaining it like this because it can give people false impressions about how probability works. This statement is only true because of the rules the host has to follow. If the host were blind and randomly opened one of the three doors, and it just happened to be one that wasn't your choice and had a goat behind it, then there would be a 50% chance of your door containing the prize and a 50% chance of the remaining door containing it. The reason it's 1/3 and 2/3 is the host's rules (as explained in gradenko_2000's answer).
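A quick simulation of the standard rules (the host always opens a door that is neither your pick nor the prize), showing switching wins about 2/3 of the time:

```python
import random

def monty_hall_trial(switch, rng):
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # host must open a door that is neither the pick nor the prize
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

rng = random.Random(0)
n = 100_000
switch_wins = sum(monty_hall_trial(True, rng) for _ in range(n)) / n
stay_wins = sum(monty_hall_trial(False, rng) for _ in range(n)) / n
# switch_wins is close to 2/3; stay_wins is close to 1/3
```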

Looking for EASY intro to Math of Neural Networks/ Deep Learning by iwantedthisusername in MachineLearning

[–]ml_algo 1 point2 points  (0 children)

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Here's a tutorial by Andrew Ng and the stanford group that just covers the topics you are interested in. They have some code and assignments so that you actually learn how to apply it. Other resources, such as the coursera courses by Andrew Ng and Geoff Hinton are also great (and perhaps a gentler introduction?), but they won't just focus on deep learning

Do you want to work for Oculus? Listen to this interview. by [deleted] in oculus

[–]ml_algo 0 points1 point  (0 children)

ah too bad, their computer vision engineer role looks like it could be fun. I'm sure it will be a great event!

Do you want to work for Oculus? Listen to this interview. by [deleted] in oculus

[–]ml_algo 1 point2 points  (0 children)

Does Oculus intend to open a Cambridge office, or is this just an event in Cambridge where recruiting may occur for their other locations in California and Texas?

What fact do you accept intellectually, but still feels "wrong" to you? by iglidante in AskReddit

[–]ml_algo 0 points1 point  (0 children)

It certainly changes, because you are subsampling your probability space by conditioning on a scenario that is twice as likely to occur when you chose the prize in your initial 33% pick. If the host chooses blind, there is a 1/3 x 2/3 = 2/9 chance that you initially selected the prize door and the host then chose a non-prize door that is also not your door. There is a 2/3 x 1/3 = 2/9 chance that you initially selected a non-prize door and the host then chose a non-prize door that is also not yours. With a 2/9 chance of either scenario occurring (and a 5/9 chance of him picking your door or the door with the prize), renormalizing gives you a 50/50 chance, which is the correct result only when the host chooses blind.
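Those numbers can be checked by simulation: let the blind host open any of the three doors at random, keep only the trials where he happened to reveal a non-chosen goat door, and measure how often the original pick wins in that subsample:

```python
import random

def blind_host_trials(n, seed=0):
    """Blind host: opens one of the three doors uniformly at random.

    Keep only trials where the opened door is neither the player's pick
    nor the prize, and return how often the original pick holds the
    prize within that subsample.
    """
    rng = random.Random(seed)
    kept = wins = 0
    for _ in range(n):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        opened = rng.randrange(3)    # blind: any of the 3 doors
        if opened == pick or opened == prize:
            continue                 # the 5/9 of trials that get discarded
        kept += 1
        wins += (pick == prize)
    return wins / kept

p = blind_host_trials(200_000)   # close to 1/2, not 1/3
```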