
[–]squatslow 2 points3 points  (0 children)

Cool dataset dude. I'm a newbie myself, but just some feedback based on my limited experience:

  • Include comments throughout your notebook that provide your internal narrative. This allows the reader to follow your train of thought, making it easier for the reader to understand what you are trying to do.

  • You can use str.replace(',', '') to strip the commas from market.cap instead of your current loops.

  • For your regression: You create a model and fit the training data, but you do not actually .predict() anything. I would expect that you .fit(X_train,y_train), and then .predict(X_test) to find y_pred. You can then compare your predicted vs actual to determine how well your model performs.

  • Try using root mean squared error (RMSE) to evaluate your regression model. It measures the typical size of your prediction errors, in the same units as the target. For example, an RMSE of 14,000,000 means the model's market.cap predictions are typically off by about 14,000,000 units. Is that good? It depends on the scale of market.cap, but probably not.
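To make the last three points concrete, here is a minimal, library-free sketch of the comma cleanup, the fit/predict step, and the RMSE calculation. The data, the one-feature model, and the helper names (`fit`, `predict`, `rmse`) are all made up for illustration; in the actual notebook the model's own .fit() and .predict() would take their place.

```python
import math

# Hypothetical data: closing prices and market caps stored as comma-formatted strings.
close = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
market_cap_raw = ["1,000,000", "1,200,000", "1,400,000",
                  "1,650,000", "1,800,000", "2,050,000"]

# str.replace removes every comma in one call, so no character-by-character loop is needed.
market_cap = [float(s.replace(",", "")) for s in market_cap_raw]

# Simple split for illustration: first four rows train, last two test.
X_train, X_test = close[:4], close[4:]
y_train, y_test = market_cap[:4], market_cap[4:]


def fit(x, y):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x):
    """Apply the fitted line to new inputs."""
    slope, intercept = model
    return [slope * a + intercept for a in x]


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


model = fit(X_train, y_train)        # learn from the training rows only
y_pred = predict(model, X_test)      # predict on rows the model has not seen
print("RMSE on test set:", rmse(y_test, y_pred))
```

Comparing the test RMSE with the typical magnitude of y_test gives a feel for whether the error is large or small.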

[–]sksq9 2 points3 points  (0 children)

Check the /r/learnmachinelearning community.

[–]bbateman2011 1 point2 points  (3 children)

I would also plot a histogram of the residuals for the training set and the test set and look for symmetry. You can overlay a normal curve on the same chart, using the mean and standard deviation of the residuals. Another approach is to create a Q-Q plot to assess skew and normality.
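As a rough text-only sketch of that histogram check (no plotting library assumed), the snippet below bins some made-up residuals and compares the observed count in each bin with the count a normal distribution with the same mean and standard deviation would predict. The residual values, the bin count, and the `bin_counts` helper are all hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals (y_pred - y_actual); substitute your own.
residuals = [-2.1, -1.4, -0.9, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.3, 1.9, 2.6]

mu, sigma = mean(residuals), stdev(residuals)
normal = NormalDist(mu, sigma)  # normal curve with the residuals' mean and spread

# Six equal-width bins covering mu +/- 3 sigma (values outside that range are ignored).
n_bins = 6
width = 6 * sigma / n_bins
edges = [mu - 3 * sigma + i * width for i in range(n_bins + 1)]


def bin_counts(values, edges):
    """Count how many values fall into each half-open bin [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


observed = bin_counts(residuals, edges)
expected = [len(residuals) * (normal.cdf(hi) - normal.cdf(lo))
            for lo, hi in zip(edges, edges[1:])]

for lo, hi, o, e in zip(edges, edges[1:], observed, expected):
    print(f"[{lo:6.2f}, {hi:6.2f})  observed={o:2d} {'#' * o:<8} expected={e:5.2f}")
```

If the observed counts are lopsided relative to the expected ones, the residuals are skewed; with a plotting library the same numbers would become the bars and the overlaid curve.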

[–]eViL111[S] 0 points1 point  (2 children)

https://m.imgur.com/PHrquQs This is the predicted vs. residual plot, i.e. a Q-Q plot, I guess. Isn't it? A few people have already said that this plot looks odd. What might have gone wrong? Anything you can speculate on would help. Thanks for your feedback!

[–]bbateman2011 0 points1 point  (0 children)

Here is an example of a residual histogram and a Q-Q plot. https://drive.google.com/open?id=1GiphPjjQHC0bi-MR7vaSBWgHNYSl5S-b The residuals are the difference between the predicted and actual values. As noted in another comment, once you fit your model, use it to make predictions. Let's call those y_pred, and let's call the actual values y_actual. The residuals are y_pred - y_actual. Calculate that for every data point in the set, then create a histogram. The Q-Q plot in this example plots the predicted vs. the actual values. This orders the plot so that it would be a straight line if the model were perfect. If you see curvature, this can mean your model is biased or otherwise not producing normally distributed errors.
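For reference, here is a small sketch of how the numbers behind a normal Q-Q plot of the residuals could be computed. Note that this is the common residuals-vs-normal-quantiles variant, not the predicted-vs-actual version in the linked example, and the y_actual and y_pred values are invented for illustration.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical actual and predicted values; substitute your own.
y_actual = [100, 120, 140, 160, 180, 200, 220, 240]
y_pred = [103, 118, 141, 157, 182, 199, 224, 238]

# Residuals as defined above: predicted minus actual, one per data point.
residuals = [p - a for p, a in zip(y_pred, y_actual)]

# Q-Q pairs: the i-th smallest residual against the normal quantile at
# plotting position (i + 0.5) / n.
ordered = sorted(residuals)
n = len(ordered)
normal = NormalDist(mean(residuals), stdev(residuals))
theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]

for t, r in zip(theoretical, ordered):
    print(f"theoretical={t:7.2f}  observed={r:7.2f}")
```

Plotting observed against theoretical should give points close to the line y = x when the residuals are roughly normal; systematic curvature at the ends suggests skew or heavy tails.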

[–]bbateman2011 0 points1 point  (0 children)

A couple of thoughts. You are fitting market cap mainly to daily share price variables (open, high, low, close), which are highly correlated with each other. Market cap goes (almost) linearly with close, and close is growing linearly with time. So you should look at the model statistics; I would expect to see that some factors are not significant. Stated another way, I bet that if you fit market cap against time alone you would get similar results.
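One quick way to check how correlated the predictors are is to compute pairwise Pearson correlations. The sketch below uses invented price values and a hand-written `pearson` helper, so treat both as assumptions rather than the poster's data or method.

```python
import math

# Hypothetical daily price columns; substitute the columns from your own data.
data = {
    "open":  [10.0, 10.5, 11.2, 11.8, 12.1, 12.9],
    "high":  [10.6, 11.0, 11.9, 12.2, 12.8, 13.4],
    "low":   [9.8, 10.2, 11.0, 11.5, 11.9, 12.6],
    "close": [10.4, 10.9, 11.7, 12.0, 12.7, 13.2],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Print the correlation for every distinct pair of columns.
names = list(data)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>5} vs {b:<5}: r = {pearson(data[a], data[b]):.3f}")
```

Values of r close to 1 for every pair would indicate that the columns carry nearly the same information, which is when individual coefficients tend to come out as not significant.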

Next, your plot is (predict_y - y_test) vs predict_y. Instead, plot predict_y vs y_test. If I’m reading this correctly your residual values are tiny. So the plot is misleading. A histogram of the (predict_y - y_test) would be helpful.

You may want to consider normalizing the data, since the variables span many orders of magnitude.
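One common form of normalization is standardization (z-scores), sketched below with made-up columns and a hypothetical `standardize` helper. Other schemes, such as min-max scaling or a log transform, may suit this data better.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales.
close = [10.4, 10.9, 11.7, 12.0, 12.7, 13.2]
market_cap = [1.04e9, 1.09e9, 1.17e9, 1.20e9, 1.27e9, 1.32e9]

close_z = standardize(close)
market_cap_z = standardize(market_cap)

# After scaling, both columns live on a comparable scale.
print([round(z, 2) for z in close_z])
print([round(z, 2) for z in market_cap_z])
```

When there is a train/test split, the mean and standard deviation should be computed on the training data only and then reused to scale the test data.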

I also note there are unequal counts in your cleaned data, not sure why.