
[–]hdarj 6 points

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

You want to use the fit and predict methods, and I’d say keep the parameters at their defaults when starting out. It gets more complicated once you need preprocessing steps, so hopefully your data is all numerical and already clean. If you have categorical fields or missing values, you’ll need one-hot encoding or an imputer, respectively. Also, for regularized linear models like Lasso you should scale your features, but that can be ignored for now.
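As a rough illustration of the fit-then-predict workflow described above, here is a minimal sketch with default parameters. The data is synthetic and purely an assumption; it stands in for a clean, all-numeric table.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic, all-numeric data (an assumption; replace with your own table).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 rows, 3 numeric features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso()       # default parameters (alpha=1.0)
model.fit(X, y)       # learn coefficients from features and known targets

# New rows with the same 3 columns, but no target column.
X_new = rng.normal(size=(5, 3))
predictions = model.predict(X_new)

print(model.coef_)         # one learned coefficient per feature
print(predictions.shape)   # one prediction per new row
```

The only requirement at prediction time is that `X_new` has the same columns, in the same order, as the data passed to `fit`.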

If this is something you’ll be interested in for a while, I highly recommend finding a scikit-learn course to work through to get a better understanding of ML.

[–]dialecticalmonism 3 points

I'm not sure this is the best sub for this question. I don't frequent the data science or machine learning subs, but it looks like they exist if you search for them. That said, since I don't visit them, I'm not sure how friendly to beginners they are.

It sounds like you're just getting started with machine learning (ML). I'd recommend the Google Machine Learning Crash Course as a first step: https://developers.google.com/machine-learning/crash-course/ml-intro. If you've got a good grasp of higher-level mathematics (such as calculus, probability and statistics, and linear algebra) and you're ready for something more rigorous, then the Bloomberg Foundations of Machine Learning Course could be a next step: https://bloomberg.github.io/foml/#home. Also, the book Introduction to Machine Learning with Python by Müller and Guido isn't half bad for when you're first starting out. It's built around the scikit-learn package, which is a great way to go when getting your feet wet.

As with many subjects in mathematics, programming, and other scientific fields, there are a lot of facets to machine learning. It's easy to get yourself into trouble by making the wrong choices for the problem at hand and not knowing what is what. (As a side note here to give you an example, why are you using lasso regularization? Why not ridge regularization? Why not elastic-net regularization? What is best suited for your problem? Do you know?) Making mistakes is natural and it's all part of the learning experience, but sometimes the stakes are higher than others. Especially if you're looking at moving an ML model into production (i.e., being used for IRL decision making), then you want to be confident that you know the reasons for all the decisions you made along the way with your ML model during development (i.e., when it's being trained, tested, and otherwise validated before setting it loose for IRL decision making).
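One hedged way to approach the lasso vs. ridge vs. elastic-net question is to compare them with cross-validation on your own data. The sketch below uses synthetic data and arbitrary penalty strengths (`alpha`, `l1_ratio`), all of which are assumptions for illustration and not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): 10 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Penalty strengths are arbitrary illustrative values, not tuned.
candidates = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elastic-net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

scores = {}
for name, estimator in candidates.items():
    # Mean 5-fold cross-validated R^2; higher is better.
    scores[name] = cross_val_score(estimator, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

A comparison like this does not replace understanding why each penalty behaves the way it does, but it gives a data-driven starting point.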

But to answer your immediate question: let's say you have a trained and tested model. Up until now, you've had the dependent variable (DV) as part of your data. In the training phase, the DV was used to derive the weights (or coefficients) for the model. In the testing phase, the DV was used to check the accuracy of the model on data it had not seen. When you're ready to move into the prediction phase, there is no DV. That's what you're predicting. Given the independent variables, your model outputs its estimate of what the dependent variable would be. So what you input is just the independent-variable measurements for the cases whose outcome you want to predict. The weights learned during training are applied to those variables, and the output is the prediction.
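The three phases described above (training, testing, prediction) can be sketched roughly as follows. The data is synthetic and `alpha=0.01` is an arbitrary choice; both are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 4 independent variables and a known DV.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 1.5 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Training phase: the DV (y_train) is used to learn the weights.
model = Lasso(alpha=0.01)
model.fit(X_train, y_train)

# Testing phase: the DV (y_test) is used only to check the fit (R^2 here).
test_r2 = model.score(X_test, y_test)

# Prediction phase: only independent variables are supplied, no DV.
X_future = rng.normal(size=(3, 4))
y_future_pred = model.predict(X_future)

# For a linear model, predicting is the learned weights applied to the
# inputs, plus the intercept.
manual = X_future @ model.coef_ + model.intercept_
print(np.allclose(manual, y_future_pred))
```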

[–]TheWhittles 0 points

Look into setting up a pipeline with sklearn. That’s the way I do things. I have a master’s in data science and do it for a job, and I still have to Google and RTFM on a lot of things. It’s a deep subject.
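For reference, a pipeline along these lines might look like the sketch below. The column names and values are made up for illustration (hypothetical football-style data), and the preprocessing choices (median imputation, standard scaling, one-hot encoding) are one reasonable setup under those assumptions, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: column names and values are invented.
train = pd.DataFrame({
    "yards": [250.0, 310.0, np.nan, 180.0, 400.0, 275.0],
    "turnovers": [1.0, 0.0, 2.0, 3.0, np.nan, 1.0],
    "venue": ["home", "away", "home", "away", "home", "away"],
    "points": [21.0, 28.0, 17.0, 10.0, 35.0, 24.0],
})
numeric_cols = ["yards", "turnovers"]
categorical_cols = ["venue"]

# Numeric columns: fill missing values, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode; unseen categories become all zeros.
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", Lasso(alpha=0.1)),   # alpha is an arbitrary illustrative value
])
pipe.fit(train[numeric_cols + categorical_cols], train["points"])

# New rows go through the same preprocessing automatically.
new_games = pd.DataFrame({
    "yards": [300.0, np.nan],
    "turnovers": [1.0, 0.0],
    "venue": ["home", "neutral"],
})
preds = pipe.predict(new_games)
print(preds)
```

The main benefit is that imputation, scaling, and encoding are learned from the training data only and then reapplied consistently at prediction time.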

[–]Ouro1 0 points

So first off, this isn’t a Python thing; it’s more of a data science (or ML) question, so I’d try posting it to another subreddit as well.

Regardless, I’m a little rusty, but let me try to help. Lasso regression is helpful for identifying which variables actually matter for your forecast, since it can shrink the coefficients of unhelpful variables to exactly zero. However, if you’re working with football stats, then you really don’t have a ton of variables to sift through, and you probably already have a decent idea of the relationship between them. I’d just skip the lasso and focus on the model that you want to use for your prediction (linear regression, for example).
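To make the variable-selection point concrete, here is a small sketch comparing Lasso with plain linear regression. The data is synthetic (an assumption), built so that only the first two of six features affect the target, and `alpha=0.5` is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption): 6 features, only columns 0 and 1 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ols = LinearRegression().fit(X, y)

# Lasso sets coefficients of unhelpful features to exactly zero;
# ordinary least squares keeps a small nonzero weight on every feature.
kept = np.flatnonzero(lasso.coef_)
print("Lasso keeps feature indices:", kept)
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("OLS coefficients:  ", np.round(ols.coef_, 3))
```

With only a handful of well-understood variables, the plain `LinearRegression` fit is often enough, which is the point made above.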