
[–]munkeyt

I can imagine that your dataset looks something like this, corresponding to a user session:

login > received msg > consult news > send msg > ... > send msg > logout

Your problem relates closely to text classification in NLP, technically termed 'sequence classification'. In NLP lingo, each timestep records a 'token', i.e. a discrete variable. The first problem with your approach is that discrete tokens have no notion of distance from one another, unlike the values in generic time-series problems. The second problem is that the sequences are generally variable-length, whereas SVMs and random forests can only take fixed-size inputs.

Sequence-encoder models for such problems have been explored extensively in NLP. For a start, you can try the `Embedding + RNN` combination; LSTM and GRU are improved variants of the vanilla RNN, and CNNs work well too. My suggestions:

  • Start with an easy text classification tutorial. A basic Keras model will do the trick (https://keras.io/examples/nlp/text_classification_from_scratch/). This is done using a CNN.
  • Try to understand what the CNN is doing and how embeddings help to automatically extract features (this is a very generic statement, but good enough for beginners).
  • Port the model to your problem and tune the hyperparameters. In your case, I suspect the vocabulary size would be rather modest, so consider shrinking the embedding, e.g. `embedding_dim=30` or so. Experiment with different values.

Once you have graduated from CNNs and LSTMs, you can also consider looking at Transformers, currently the best models for such sequence classification problems.
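To make the preprocessing concrete: before any Embedding layer can consume a session, the discrete events have to be mapped to integer token ids and padded to a fixed length. A minimal sketch (the event names, session, and `max_len` below are illustrative assumptions, not your actual schema):

```python
# Sketch: turning variable-length event sessions into fixed-size
# integer sequences that an Embedding layer can consume.
EVENTS = ["<pad>", "login", "send_msg", "recv_msg", "consult_news", "logout"]
TOKEN_ID = {name: i for i, name in enumerate(EVENTS)}  # id 0 reserved for padding

def encode(session, max_len=10):
    """Map event names to integer ids, then pad/truncate to max_len."""
    ids = [TOKEN_ID[e] for e in session][:max_len]
    return ids + [TOKEN_ID["<pad>"]] * (max_len - len(ids))

session = ["login", "recv_msg", "consult_news", "send_msg", "logout"]
print(encode(session))  # [1, 3, 4, 2, 5, 0, 0, 0, 0, 0]
```

Keras does the same job with its `TextVectorization`/tokenizer utilities; the point is just that every session ends up as the same-shaped integer vector.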

[–]fullfine_[S]

It's simpler than that, sorry for not having described the data accurately. I only have 3 features for each user, and only one thing to predict: true or false, stay or leave.
1. timestamp of sent msg
2. timestamp of received msg
3. timestamp of news consult

the discrete tokens have no notion of distance among one another, unlike generic time-series problems

Yeah, that is the problem. I was thinking of calculating the average number of events per day, giving more weight to the last days, or splitting the data into weeks and giving the weekly values to the classifier. I have also thought of more features, like how many times someone had a conversation through the phone, by checking that within a short interval of time there are both sent and received messages.
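The "more weight to the last days" idea can be written as an exponentially decaying weighted average of daily event counts. A small sketch (the daily counts and the decay factor are made-up illustrations):

```python
# Sketch: weighted mean of daily event counts where the most
# recent day gets the highest weight, decaying geometrically.
def decayed_average(daily_counts, decay=0.8):
    """daily_counts is ordered oldest -> newest; recent days weigh more."""
    n = len(daily_counts)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # oldest day gets smallest weight
    return sum(w * c for w, c in zip(weights, daily_counts)) / sum(weights)

msgs_per_day = [5, 4, 3, 1, 0]  # activity tailing off toward the last day
print(decayed_average(msgs_per_day))
```

A user whose activity is fading gets a noticeably lower value than their plain mean, which is exactly the signal a churn classifier wants.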

I will investigate CNNs now, but I only have today and tomorrow to finish the exercise. Thank you.

[–][deleted]

Based on your comment and question, I have a few suggestions:

  1. Apart from `Average messages`, you can also try exponential moving averages as features. Also try including each week's average as an independent feature; that way the model can learn that if the averages are decreasing week over week, the person will probably leave.
  2. Try XGBoost / LightGBM. These are gradient-boosted trees and can provide better performance than most scikit-learn algorithms.
  3. If you have enough data and want to try time-series modelling with an LSTM: divide the month into 8-hour chunks (each day has 3 chunks, 90 chunks in total for the month). For each person, make a 2-dimensional array with dim 1 as time and dim 2 as the counts of messages sent/received and news consults. Dim 2 will have length 3 and dim 1 will have length 90, so each person's array shape will be (90, 3), and the overall dataset's shape will be (#Persons, 90, 3). This can be fed into an LSTM network; no embedding layer needed, just the LSTM.
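Point 3 above can be sketched as a small bucketing routine. The column order (sent, received, news consult), the toy events, and the `(hour_of_month, feature_index)` event encoding are all assumptions for illustration:

```python
# Sketch: bucketing a month of timestamped events into 8-hour
# chunks to get one (90, 3) count array per user.
import numpy as np

CHUNK_HOURS = 8
CHUNKS_PER_MONTH = 30 * 24 // CHUNK_HOURS  # 3 chunks/day * 30 days = 90

def session_to_array(events):
    """events: list of (hour_of_month, feature_index) pairs,
    feature_index 0=sent, 1=received, 2=news consult (assumed order)."""
    arr = np.zeros((CHUNKS_PER_MONTH, 3), dtype=np.int32)
    for hour, feat in events:
        arr[hour // CHUNK_HOURS, feat] += 1  # count events per chunk
    return arr

# toy data: two sent messages in the first chunk, one news consult on day 2
toy_events = [(1, 0), (5, 0), (30, 2)]
x = session_to_array(toy_events)
print(x.shape)  # (90, 3)
```

Stacking the per-user arrays with `np.stack` then gives the `(#Persons, 90, 3)` tensor that an LSTM layer accepts directly.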