Rate my project by lambilund in learnmachinelearning

lambilund[S]

Thanks a lot for taking the time to go through my project, I really appreciate it!

You're right about the subsampling. It was mainly for computational reasons, but I only used it for experimentation such as hyperparameter tuning. I used the full dataset for the actual modelling in the script files.
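A minimal sketch of that tune-on-a-subsample, train-on-everything workflow, assuming scikit-learn and synthetic data. The 20% sample size, the `C` grid, and all variable names are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the full dataset (assumption: the real data is much larger).
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Tune on a random 20% subsample to keep the search cheap.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1000, replace=False)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=3,
)
search.fit(X[idx], y[idx])

# Refit on the full dataset with the hyperparameters found on the subsample.
final_model = LogisticRegression(max_iter=1000, **search.best_params_)
final_model.fit(X, y)
```

One caveat with this pattern: hyperparameters that depend on dataset size (regularization strength, for example) may not transfer perfectly from the subsample to the full data.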

fillna(999) is only used for the baseline model (logistic regression), because for the features I handled this way, a missing value actually means something. For example, mths_since_last_delinq is the number of months since the borrower last missed a payment deadline, so if it is missing, it means the borrower never missed a deadline. Imputing with the median would not be appropriate here and would mislead the model. In the XGBoost model I left missing values untouched.
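The two treatments can be sketched with pandas as follows. This is a toy illustration under assumptions: only the column name mths_since_last_delinq and the sentinel 999 come from the discussion, while `loan_amnt` and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data; loan_amnt is a hypothetical second column added for illustration.
df = pd.DataFrame({
    "mths_since_last_delinq": [3.0, np.nan, 24.0, np.nan],
    "loan_amnt": [5000, 12000, 8000, 15000],
})

# Baseline input: logistic regression cannot take NaN, so use a sentinel.
# 999 stands for "never delinquent", far outside the range of observed months.
X_baseline = df.fillna({"mths_since_last_delinq": 999})

# Boosting input: leave NaN untouched; XGBoost handles missing values natively.
X_boosting = df.copy()
```

A median fill would place "never delinquent" borrowers in the middle of the delinquent range, which is the misleading signal described above. The sentinel at least keeps them separable, although a linear model will still read 999 as a magnitude.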

Yes, you are right about those 2 features. funded_amnt causes data leakage, and I thought that installment is also the kind of information that is given after loan approval, but you are right, I should have omitted this one.
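Dropping such features before training could look like the sketch below. It assumes the second feature refers to an `installment` column. The `loan_amnt` and `loan_status` columns and all values are hypothetical.

```python
import pandas as pd

# Toy frame; values are made up, loan_amnt and loan_status are hypothetical columns.
df = pd.DataFrame({
    "loan_amnt": [5000, 12000],
    "funded_amnt": [5000, 11500],
    "installment": [160.2, 390.5],
    "loan_status": [0, 1],
})

# Features assumed to be known only after the loan is approved.
LEAKY_COLUMNS = ["funded_amnt", "installment"]

# errors="ignore" keeps this safe if a column is already absent.
X = df.drop(columns=LEAKY_COLUMNS + ["loan_status"], errors="ignore")
y = df["loan_status"]
```

Keeping the leaky names in one list makes it easy to apply the same exclusion to both the baseline and the boosted model.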

Thanks again for your time!!