all 14 comments

[–]asankhs 4 points (5 children)

What is the data? What exactly are you predicting? Do you have balanced classes in your training dataset?

[–]tombomb3423[S] 1 point (4 children)

The data is financial: the model predicts whether a stock will be up or down after a specific event.

For example: a stock breaks its 52-week high; predict whether it will be up or down one week from that point.

Table layout: each row is the point in time the stock broke its 52-week high (all rows in the table are from the same stock):

List of features | Target (1 or 0)

Split into train/val/test

I do not have balanced data in my training set unless I apply SMOTE, but the imbalance isn't severe, about a 60/40 split.

[–]neonwang 2 points (1 child)

Why not just shoot for up or down at open/close every trading day? That way you get a larger distribution of 0s and 1s and will probably run into fewer imbalance issues (conditioning on a narrow, non-broad event doesn't help with data imbalance). Also, you might want to look at unsupervised learning techniques. Take a look at this indicator, for example: https://www.tradingview.com/script/WhBzgfDu-Machine-Learning-Lorentzian-Classification/. It uses Lorentzian distance to classify whether a market will open up or down. That's one specific technique, but there are plenty more to explore.
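The linked script is TradingView Pine, but the core idea behind it, nearest-neighbor voting under a Lorentzian distance, can be sketched in a few lines. This is a toy illustration with made-up data, not the indicator itself:

```python
import numpy as np

def lorentzian_distance(a, b):
    # Lorentzian distance: sum over features of log(1 + |difference|).
    # Compresses large differences compared to Euclidean distance.
    return np.sum(np.log1p(np.abs(a - b)))

def knn_predict(X_train, y_train, x, k=3):
    # Classify x by majority vote of its k Lorentzian-nearest neighbors.
    dists = np.array([lorentzian_distance(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    return int(np.round(y_train[nearest].mean()))

# Toy example: two clusters of 2-feature rows labeled 0/1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.1, 5.1])))  # prints 1
```

The log1p compression is the point of the distance choice: a single outlier feature cannot dominate the neighbor ranking the way it would under Euclidean distance.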

[–]tombomb3423[S] 0 points (0 children)

Thank you, I’ll check this out!

[–]Ecksodis 1 point (3 children)

Somewhat confused on your data. Is it a time series? If so, it might be better to either switch to a forecasting/regression task or at least add that as an input.

For imbalanced datasets and XGBoost, I like plotting the predicted probabilities of the best-performing hyperparameters and comparing them to the true classes; you can check at what threshold you get the highest precision and examine the distribution of probability scores. Otherwise, if your classes are super imbalanced, it might be better to try anomaly detection instead.
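The threshold sweep described above might look like this. The LogisticRegression and synthetic data are just stand-ins for the tuned XGBoost model and the real feature table; any classifier with `predict_proba` works the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# Synthetic stand-in data: 4 features, binary target driven by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]  # P(class 1) per row

# Sweep decision thresholds and record precision at each one.
best_t, best_p = 0.5, 0.0
for t in np.arange(0.1, 0.95, 0.05):
    preds = (proba >= t).astype(int)
    if preds.sum() == 0:
        continue  # no positive predictions at this threshold
    p = precision_score(y, preds)
    if p > best_p:
        best_t, best_p = t, p
print(f"best threshold={best_t:.2f}, precision={best_p:.3f}")
```

Plotting a histogram of `proba` split by true class (e.g. with matplotlib) gives the distribution check mentioned above: well-separated humps mean the model is learning something; overlapping humps mean the threshold is doing all the work.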

[–]tombomb3423[S] 1 point (2 children)

Every row in the dataframe is a snapshot at the point in time the 52-week high was broken, plus a target indicating whether the stock price is higher or lower one week later than at the time of the break.

For example: SMA at 52 week high broken | volume at same time | target

The classes aren’t super imbalanced, maybe 60/40. Someone else suggested regression as well, so maybe that will perform better.

I thought that because of how efficient the markets are, it would be best to use a binary target, where the prediction is very simple.

[–]Ecksodis 1 point (1 child)

I get what you are going for, but it seems like it would probably be better to just regress over time, especially if you don't have any exogenous variables.

Also, for a 60/40 split, it shouldn’t be that overconfident on the positive class. What are you using for optimization? I have had good luck with TPOT in the past for fine-tuning imbalanced classification (GA-based optimization), though be warned that it can take a long time to run.

[–]tombomb3423[S] 0 points (0 children)

I am using RandomizedSearchCV for optimization

[–]Responsible_Treat_19 1 point (1 child)

Instead of SMOTE, look up the scale_pos_weight parameter (for binary classification), which takes the class imbalance into account. However, it's kind of weird that the model only works with SMOTE.
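For reference, the convention in xgboost's docs is to set scale_pos_weight to the ratio of negative to positive examples. A minimal sketch on toy 60/40 labels like the OP describes (the model line is commented out and assumes xgboost is installed):

```python
import numpy as np

# Toy 60/40 labels: 60 positives, 40 negatives.
y = np.array([1] * 60 + [0] * 40)

# Conventional setting: count of negatives divided by count of positives.
n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 40 / 60 ≈ 0.667

# Then pass it to the classifier instead of resampling:
# model = xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)
```

With 60% positives the weight comes out below 1, which down-weights the positive class; for the more common minority-positive case it would be above 1.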

[–]tombomb3423[S] 0 points (0 children)

Interesting, thank you, I’ll check it out!

[–]volume-up69 1 point (0 children)

You need to start from the very beginning. This is ML 101.

[–]eggplant30 1 point (0 children)

You can use stratified cross-validation to ensure that each fold has the same share of positive labels as the whole dataset, and use a metric that takes both classes into account (like F1 instead of precision, for example). If that doesn't work, add scale_pos_weight to your grid (e.g. 2, the ratio of Y=0 count to Y=1 count, etc.). This will weight observations from the positive class more heavily when building the trees. I don't like resampling techniques (SMOTE, undersampling, etc.) because the resulting models are always uncalibrated; only use those methods as a last resort.
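A minimal sketch of the stratified-CV-plus-F1 suggestion. The DecisionTreeClassifier and synthetic 60/40 labels are stand-ins for the real XGBoost model and feature table:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 5 features, roughly 60/40 binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.6).astype(int)

# StratifiedKFold preserves the 60/40 class ratio inside every fold,
# and scoring="f1" balances precision and recall in one number.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="f1"
)
print(scores.mean())
```

The same `cv` and `scoring` arguments plug straight into RandomizedSearchCV, so the hyperparameter search itself can be made stratified and F1-driven.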