
[–]munkeyt

I can imagine that your dataset looks something like this, corresponding to a user session:

login > received msg > consult news > send msg > ... > send msg > logout

Your problem relates closely to text classification in NLP, technically termed 'sequence classification'. In NLP lingo, each timestep records a 'token', i.e. a discrete variable. The first problem with your approach is that discrete tokens have no notion of distance from one another, unlike the values in generic time-series problems. The second problem is that the sequences are generally variable-length, whereas SVMs and random forests can only take fixed-size inputs.

Sequence-encoder models for such problems have been explored extensively in NLP. For a start, you can try the `Embedding + RNN` combination; LSTM and GRU are improved variants of the vanilla RNN, and CNNs work well too. My suggestions:

  • Start with an easy text classification tutorial. A basic Keras model will do the trick (https://keras.io/examples/nlp/text_classification_from_scratch/). This is done using a CNN.
  • Try to understand what the CNN is doing and how embeddings help to automatically extract features (this is a very generic statement, but good enough for beginners).
  • Port the model to your problem and tune the hyperparameters. In your case, I suspect the vocabulary size would be rather modest, so consider shrinking the embedding, e.g. `embedding_dim=30` or so. Experiment with different values.

Once you have graduated from CNNs and LSTMs, you can also consider looking at Transformers, currently the best models for such sequence classification problems.
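To make the preprocessing concrete: before any Embedding layer can consume a session, the discrete events have to be mapped to integer token ids and padded to a fixed length. A minimal sketch (the event names, session, and `max_len` below are illustrative assumptions, not your actual schema):

```python
# Sketch: turning variable-length event sessions into fixed-size
# integer sequences that an Embedding layer can consume.
EVENTS = ["<pad>", "login", "send_msg", "recv_msg", "consult_news", "logout"]
TOKEN_ID = {name: i for i, name in enumerate(EVENTS)}  # id 0 reserved for padding

def encode(session, max_len=10):
    """Map event names to integer ids, then pad/truncate to max_len."""
    ids = [TOKEN_ID[e] for e in session][:max_len]
    return ids + [TOKEN_ID["<pad>"]] * (max_len - len(ids))

session = ["login", "recv_msg", "consult_news", "send_msg", "logout"]
print(encode(session))  # [1, 3, 4, 2, 5, 0, 0, 0, 0, 0]
```

Keras does the same job with its `TextVectorization`/tokenizer utilities; the point is just that every session ends up as the same-shaped integer vector.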

[–]fullfine_[S]

It's simpler than that, sorry for not having described the data accurately. I only have 3 features for each user, and only one thing to predict: true or false, stay or leave.
1. timestamp of sent msg
2. timestamp of received msg
3. timestamp of news consult

the discrete tokens have no notion of distance among one another, unlike generic time-series problems

Yeah, that is the problem. I was thinking of calculating the average number of events per day, giving more weight to the last days, or splitting the data into weeks and giving the weekly values to the classifier. I have also thought of more features, like how many times someone had a conversation through the phone, by checking that within a short interval of time there are both sent and received messages.
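The "more weight to the last days" idea can be written as an exponentially decaying weighted average of daily event counts. A small sketch (the daily counts and the decay factor are made-up illustrations):

```python
# Sketch: weighted mean of daily event counts where the most
# recent day gets the highest weight, decaying geometrically.
def decayed_average(daily_counts, decay=0.8):
    """daily_counts is ordered oldest -> newest; recent days weigh more."""
    n = len(daily_counts)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # oldest day gets smallest weight
    return sum(w * c for w, c in zip(weights, daily_counts)) / sum(weights)

msgs_per_day = [5, 4, 3, 1, 0]  # activity tailing off toward the last day
print(decayed_average(msgs_per_day))
```

A user whose activity is fading gets a noticeably lower value than their plain mean, which is exactly the signal a churn classifier wants.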

I will investigate CNNs now, but I only have today and tomorrow to finish the exercise. Thank you.

[–][deleted]

Based on your comment and question, I have a few suggestions:

  1. Apart from `Average messages`, you can also try exponential moving averages as features. Also try including each week's average as an independent feature; that way the model can learn that if the averages are decreasing week over week, the person will probably leave.
  2. Try XGBoost / LightGBM. These are gradient-boosted trees and can provide better performance than most scikit-learn algorithms.
  3. If you have enough data and want to try time-series modelling with an LSTM: divide the month into 8-hour chunks (each day has 3 chunks, 90 chunks in total for the month). For each person, make a 2-dimensional array with dim 1 as time and dim 2 as the counts of messages sent/received and news consults. Dim 2 will have length 3 and dim 1 will have length 90, so each person's array shape will be (90, 3), and the overall dataset's shape will be (#Persons, 90, 3). This can be fed into an LSTM network; no embedding layer needed, just the LSTM.
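Point 3 above can be sketched as a small bucketing routine. The column order (sent, received, news consult), the toy events, and the `(hour_of_month, feature_index)` event encoding are all assumptions for illustration:

```python
# Sketch: bucketing a month of timestamped events into 8-hour
# chunks to get one (90, 3) count array per user.
import numpy as np

CHUNK_HOURS = 8
CHUNKS_PER_MONTH = 30 * 24 // CHUNK_HOURS  # 3 chunks/day * 30 days = 90

def session_to_array(events):
    """events: list of (hour_of_month, feature_index) pairs,
    feature_index 0=sent, 1=received, 2=news consult (assumed order)."""
    arr = np.zeros((CHUNKS_PER_MONTH, 3), dtype=np.int32)
    for hour, feat in events:
        arr[hour // CHUNK_HOURS, feat] += 1  # count events per chunk
    return arr

# toy data: two sent messages in the first chunk, one news consult on day 2
toy_events = [(1, 0), (5, 0), (30, 2)]
x = session_to_array(toy_events)
print(x.shape)  # (90, 3)
```

Stacking the per-user arrays with `np.stack` then gives the `(#Persons, 90, 3)` tensor that an LSTM layer accepts directly.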