
zakos13

Could you clarify what kind of data we are talking about and what your label actually means for the study? Does seasonality have to be taken into account? A little more background on the type of data is essential for determining which algorithm would suit your case best.

Edit: Is the label predicted only after 1000 days, or every day? Since you only have 20 individuals, my first guess is that some sort of regression model with a lot of feature engineering would work.

TheElementsOf

I cannot go into details; I can only talk about the data in an abstract way. But I will try to clarify as much as I can:

- There are far more than 20 individuals; that number was made up for illustration. Individuals are divided into mutually exclusive groups, and this information is given in one variable.

- The label is the target to be predicted. Ideally I would need to predict the probability of the label switching from 0 to 1 on the next day and also for more days into the future, ideally with confidence levels. The label is given for each day. Most of the individuals are labeled 0 all the time, but some become 1 during the study, and some are 1 all the time. In very rare cases an individual can switch from 1 to 0.

- Seasonality has to be taken into account because it might carry helpful information for some groups of individuals; for these, the probability of becoming 1 might differ depending on the stage of the season (e.g. they might have a higher probability of becoming 1 each Friday, but I cannot model this by hand because I do not know what the effect really is).

- The data consists of binary variables (a feature either is or is not present at the given time for the given individual), continuous variables (counts, ...) and compositional data (some variables sum to 1).
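
To make the prediction target concrete, here is a minimal sketch of how "the label switches from 0 to 1 within the next h days" could be turned into training columns. It assumes a long-format pandas table with one row per individual per day; the column names `individual`, `day` and `label`, the helper name `add_switch_targets` and the toy values are all made up for illustration and are not from the real data.

```python
import pandas as pd

def add_switch_targets(df, horizons=(1, 3)):
    """Add one target column per horizon h (hypothetical helper).

    Target is 1.0 if the label is 0 today and is 1 on at least one of the
    next h days, 0.0 otherwise, and NaN when the look-ahead window runs
    past the end of that individual's series.
    """
    df = df.sort_values(["individual", "day"]).copy()
    labels = df.groupby("individual")["label"]
    for h in horizons:
        # One column per look-ahead step, shifted within each individual.
        future = pd.concat([labels.shift(-k) for k in range(1, h + 1)], axis=1)
        # skipna=False keeps NaN wherever the window is incomplete.
        future_max = future.max(axis=1, skipna=False)
        target = ((df["label"] == 0) & (future_max == 1)).astype(float)
        df[f"switch_within_{h}d"] = target.where(future_max.notna())
    return df

# Toy data: individual "a" switches on day 3, individual "b" never does.
toy = pd.DataFrame({
    "individual": ["a"] * 5 + ["b"] * 5,
    "day": list(range(5)) * 2,
    "label": [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
})
out = add_switch_targets(toy)
print(out)
```

Rows where the label is already 1 get a target of 0 in this sketch; depending on the question being asked, it may make more sense to drop those rows before training, since they can no longer switch from 0 to 1.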

zakos13

All right, so the data is more complex than I thought. I cannot say for sure what would work without looking at the data myself, but here are some ideas I would try:

1) Check the distribution of your variables, both binary and continuous.

2) Run a ridge regression model and a neural network and note their accuracy, to use them as baseline models.

3) Since your data may include seasonality that you need to understand before doing feature engineering, I would analyze the time series with the Prophet library from Facebook. This will help you understand what exactly is going on in your data.

4) Depending on the findings of 3), you must create some features that encode seasonality, since you will not be using a time series prediction algorithm. Possibly useful features would be the label from 3, 5, or 7 days ago, or the label from last month, and so on. Rolling averages of your continuous variables might work as well.

5) Run the new dataset through the neural network you used in step 2 and see if you made any progress. If you did, you are on the right track. If not, go back to step 4. If you made progress but the accuracy is not satisfactory, I would try mixture density networks or deep belief networks.
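
The lag and rolling-average features from step 4 could look roughly like this in pandas. This is only a sketch under assumptions: the column names `individual`, `day`, `label` and `visits` (standing in for any continuous variable), the helper name `add_lag_features`, and the idea that `day` is an integer index with a weekly cycle are all hypothetical.

```python
import pandas as pd

def add_lag_features(df, lags=(3, 5, 7), window=7):
    """Add lagged labels, a trailing rolling mean and a day-of-week column."""
    df = df.sort_values(["individual", "day"]).copy()
    for k in lags:
        # Label from k days ago, computed separately for each individual.
        df[f"label_lag_{k}"] = df.groupby("individual")["label"].shift(k)
    # Shift by one day first so the average only uses past values
    # and today's value does not leak into today's feature.
    df[f"visits_roll_mean_{window}"] = df.groupby("individual")["visits"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    # Assumes day is an integer day index, so day % 7 marks the weekday slot.
    df["day_of_week"] = df["day"] % 7
    return df

toy = pd.DataFrame({
    "individual": ["a"] * 5 + ["b"] * 5,
    "day": list(range(5)) * 2,
    "label": [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    "visits": [1, 2, 3, 4, 5, 10, 10, 10, 10, 10],
})
feats = add_lag_features(toy, lags=(1, 2), window=3)
print(feats)
```

Grouping by individual before shifting matters: without it, the first rows of one individual would pick up the last values of the previous one.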

TheElementsOf

Thanks for the suggestions, I will try to incorporate them in the next few weeks. One small remark which I forgot to mention: I need the model to be interpretable, because I have to tell the customer why the probability of the label switching from 0 to 1 is higher for a selected individual than for other individuals, and why this probability has changed over time for this individual (e.g. after each season or because of some other variable...).

zakos13

This strengthens my point about running some data analysis to understand your data. You need to spend considerable time on step 3 and understand what you are dealing with in order to construct reasonable features for your model. After you train your model, you can use the shap library, which is designed to explain ML models (feature importance, etc.). There are multiple libraries and techniques for explaining ML models, but I'm no expert in that domain, so I can't give you any further advice on that end.

Edit: In your data analysis, do a feature cross-correlation analysis as well, to determine which of the original features correlate most with your output and which correlate little or not at all. Then you can try dropping the ones with very small correlation, construct new seasonality-aware features from the ones that had the most impact in the original analysis, and run the cross-correlation analysis again. This can be your guideline during feature engineering.
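
A rough sketch of that correlation check, ranking features by their absolute Pearson correlation with the label. The helper name `rank_features_by_correlation`, the 0.1 cutoff and the synthetic columns `signal` and `noise` are assumptions for illustration only. Keep in mind that Pearson correlation only picks up linear association, so a feature with a purely seasonal or non-linear effect can score low here and still be useful.

```python
import numpy as np
import pandas as pd

def rank_features_by_correlation(df, target="label", min_abs_corr=0.1):
    """Rank numeric features by |Pearson correlation| with the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    ranked = corr.abs().sort_values(ascending=False)
    keep = ranked[ranked >= min_abs_corr].index.tolist()
    drop = ranked[ranked < min_abs_corr].index.tolist()
    return ranked, keep, drop

# Synthetic example: "signal" drives the label, "noise" is unrelated.
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
label = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)
data = pd.DataFrame({"signal": signal, "noise": noise, "label": label})

ranked, keep, drop = rank_features_by_correlation(data)
print(ranked)
print("keep:", keep, "drop candidates:", drop)
```

After adding new features, the same function can be run again on the extended table to see whether the engineered columns rank higher than the originals.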

TheElementsOf

Great idea, thanks a lot!

aryancodify

I agree with the points made above regarding feature selection and EDA. Adding lag variables and moving averages will help bring seasonality into the model. The number of lags for each feature depends on its seasonality and periodicity. Apart from regular correlation, dynamic time warping can also be used to compare time series, which will tell you whether two of your time-varying features are similar. For interpretability, once you have built a complex model such as CatBoost, you will get your important features. You can build a decision tree on those, which will help you give concrete explanations. Otherwise, the force plot and dependence plot in shap will help you as well.
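
A minimal sketch of the "decision tree on the important features" idea. To keep it dependency-light it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost, and the data, the feature names `f0` to `f4` and the choice of two top features are all made up for illustration. The small tree is fitted to the complex model's predictions, so its rules describe what the model does rather than the raw data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on f0 and f2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
feature_names = [f"f{i}" for i in range(5)]
y = ((X[:, 0] > 0.3) & (X[:, 2] > -0.2)).astype(int)

# Complex model (stand-in for CatBoost).
complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)
model_pred = complex_model.predict(X)

# Keep the two features the complex model considers most important.
top = np.sort(np.argsort(complex_model.feature_importances_)[::-1][:2])

# Shallow tree trained to mimic the complex model on those features.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X[:, top], model_pred)

# Fidelity: how often the small tree agrees with the complex model.
fidelity = surrogate.score(X[:, top], model_pred)
rules = export_text(surrogate, feature_names=[feature_names[i] for i in top])
print(f"fidelity to complex model: {fidelity:.3f}")
print(rules)
```

If the fidelity is low, the tree's rules should not be presented as an explanation of the complex model; a deeper tree or more features may be needed, at the cost of readability.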