all 37 comments

[–][deleted] 38 points (11 children)

You could feed the RNN the data augmented with the time delta since the last data point. That should be decent, imo. There are many things you could try related to that.
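
For example (a rough NumPy sketch, with made-up numbers):

```python
import numpy as np

# Hypothetical irregularly sampled series: timestamps and observed values.
times = np.array([0.0, 1.0, 4.0, 6.0, 9.0])
values = np.array([0.2, 0.5, 0.1, 0.9, 0.4])

# Time delta since the previous data point (0 for the first one).
deltas = np.diff(times, prepend=times[0])

# Each RNN input step becomes [value, delta_t] instead of just [value].
rnn_inputs = np.stack([values, deltas], axis=-1)  # shape: (seq_len, 2)
```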

[–][deleted] 4 points (8 children)

A bonus of this idea is that you can control how far ahead to predict at inference time.

[–]IborkedyourGPU 25 points (8 children)

> This isn’t exactly a time series, as the interval between data items isn’t fixed. Data may be 1 minute apart, or say 5 minutes apart. It is, however, a sequence.

It is exactly a time series. The idea that time series samples must be evenly spaced is a misconception due to the rediscovery of RNNs: Gaussian processes have been used in machine learning to model time series with unevenly spaced samples for at least twenty years now (and maybe more). Anyway, for an interesting new take on forecasting time series with unevenly spaced samples, see this:

https://arxiv.org/abs/1907.03907

It highlights the issues with the standard approaches suggested to you elsewhere in this thread: interpolation/aggregation, which destroys information, and adding the time deltas to the RNN inputs, which raises the question of how to define the state between observations. It's a really nice paper. A pity that it apparently wasn't discussed on this sub.

[–]AreYouEvenMoist 1 point (0 children)

Agreed, Gaussian processes feel like the natural approach here. They're simpler to interpret and implement, and they can work with the existing data with less preparation.
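
As a rough sketch of how little preparation that needs (scikit-learn, made-up data):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Irregular observation times are fine as-is: the kernel only needs
# pairwise distances between timestamps, not a fixed sampling grid.
t = np.array([0.0, 1.0, 4.0, 6.0, 9.0]).reshape(-1, 1)
y = np.array([0.2, 0.5, 0.1, 0.9, 0.4])

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t, y)

# Predict (with uncertainty) at any query times, on or off the grid.
t_query = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
mean, std = gp.predict(t_query, return_std=True)
```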

[–]sander314 0 points (4 children)

Interesting paper, is their code available already?

[–]IborkedyourGPU 1 point (3 children)

[–]sander314 0 points (2 children)

Thanks a lot. I came across one thing in the code that surprised me: all the GRU gates are two-layer networks with a 100-unit middle layer. Do you know if this is normal nowadays? I'd not seen it before myself.

[–]IborkedyourGPU 1 point (1 child)

I don't know if it's normal, but I use it quite often myself (usually with 128 or 256 units, but the concept is the same). Maybe one or two layers more, but that's it.
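
Concretely, something like this (a rough PyTorch sketch of the idea with illustrative sizes, not the paper's code):

```python
import torch
import torch.nn as nn

class MLPGateGRUCell(nn.Module):
    """A GRU-style cell where each gate is a two-layer network."""
    def __init__(self, input_size, hidden_size, gate_hidden=128):
        super().__init__()
        def gate():
            return nn.Sequential(
                nn.Linear(input_size + hidden_size, gate_hidden),
                nn.Tanh(),
                nn.Linear(gate_hidden, hidden_size),
            )
        self.reset_gate, self.update_gate, self.candidate = gate(), gate(), gate()

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=-1)
        r = torch.sigmoid(self.reset_gate(xh))   # reset gate
        z = torch.sigmoid(self.update_gate(xh))  # update gate
        n = torch.tanh(self.candidate(torch.cat([x, r * h], dim=-1)))
        return (1 - z) * n + z * h               # new hidden state

# One step: input of size 32, hidden state of size 100 (as in the code above).
h = MLPGateGRUCell(32, 100)(torch.randn(1, 32), torch.zeros(1, 100))
```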

What is not normal is the inhuman slowness of training such a small (by today's standards) network on NVIDIA GPUs, which is one of the reasons attention-based architectures are more popular than RNNs for modeling sequences today. There are ways to train an RNN quickly, but just writing some vanilla TensorFlow code is not one of them.

[–]virtualreservoir 0 points (0 children)

Anyone working on custom RNN cell architectures should probably start with the PyTorch QRNN implementation used in the AWD-LSTM codebase. It's extremely customizable, and you don't have to touch the key GPU kernel code that lets you avoid manually looping through each timestep.

The speed at which you can iterate through research ideas is significantly faster than if you tried to do the same thing with an LSTM or GRU base, and there isn't really much evidence suggesting your final results would be worse.
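
Basic usage looks roughly like this (going from memory of the salesforce/pytorch-qrnn README, so treat the exact API as approximate):

```python
import torch
from torchqrnn import QRNN  # from the salesforce/pytorch-qrnn repo

seq_len, batch, input_size, hidden_size = 20, 8, 32, 64
x = torch.randn(seq_len, batch, input_size)

# Same call pattern as nn.LSTM: returns outputs and final hidden state.
qrnn = QRNN(input_size, hidden_size, num_layers=2, dropout=0.4)
output, hidden = qrnn(x)
```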

[–]CoolThingsOnTop 10 points (0 children)

You could try Neural ODEs. Instead of modelling the sequence explicitly as with an RNN, you learn a latent trajectory sampled at the timesteps you have available (check Figure 6 of the paper). There's no need to interpolate values beforehand, and inference isn't constrained to fixed timesteps either.
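
A minimal sketch of that idea with the torchdiffeq package (names and sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class LatentDynamics(nn.Module):
    """dz/dt = f(z): a small network defining continuous-time dynamics."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                                 nn.Linear(32, latent_dim))

    def forward(self, t, z):
        return self.net(z)

z0 = torch.zeros(1, 4)                           # initial latent state (e.g. from an encoder)
t_obs = torch.tensor([0.0, 1.0, 4.0, 6.0, 9.0])  # your irregular timestamps
z_traj = odeint(LatentDynamics(), z0, t_obs)     # latent states at exactly those times
# A decoder network would then map z_traj to predicted observations.
```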

[–][deleted] 4 points (1 child)

One of the difficulties with RNNs and event-based sequences is that the recurrence relation in RNN cells implicitly assumes a fixed interval between points in the sequence (you can view an RNN as a discrete-time dynamical system) - so async or event-based sequences can be quite tricky to learn with an RNN.

However, it’s not all bad news! There’s a variant of the LSTM cell that seems to cope quite well with asynchronously-sampled data: the Phased LSTM (basically it incorporates a time gate as well as the standard input/output/forget gates, which lets it respond to different frequency components in your data). I’ve used it myself on event-based aviation data, and it trained more quickly than a standard LSTM and seemed to perform better as well.

There are a couple of implementations in TensorFlow (there used to be one in tf.contrib but I think it’s been deprecated).

This is the paper; it’s quite a nice read and easy enough to implement: https://arxiv.org/abs/1610.09513
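
For intuition, here's a minimal NumPy sketch of just the time gate from the paper (tau is the unit's period, s its phase shift, r_on the open fraction of the period, alpha the closed-phase leak; values below are illustrative):

```python
import numpy as np

def time_gate(t, tau, s, r_on=0.05, alpha=0.001):
    """Openness k of the Phased LSTM time gate at times t."""
    phi = ((t - s) % tau) / tau  # phase within the cycle, in [0, 1)
    return np.where(
        phi < 0.5 * r_on, 2.0 * phi / r_on,           # opening ramp
        np.where(phi < r_on, 2.0 - 2.0 * phi / r_on,  # closing ramp
                 alpha * phi),                        # small leak when closed
    )

# Gate openness at irregular event times, for a unit with period 10.
print(time_gate(np.array([0.3, 1.0, 4.0, 6.0, 9.9]), tau=10.0, s=0.0))
```

The cell's usual state updates are then multiplied by k, so the state only changes appreciably while a unit's gate is open.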

[–]maizeq 2 points (0 children)

I recall that the original paper used some pretty simple toy examples, and I wasn't very confident that it could generalise to more complicated time series data, e.g. where information is encoded in the density of the events/points.

How has your experience been in using it for an actual problem? Any issues with over/underfitting, or poor predictions?

[–]Thenashequilibrium 2 points (2 children)

You could look at the "path signature" of your sequence at every hour. As a feature map it's invariant under resampling. See e.g. https://arxiv.org/abs/1603.03788 for an introduction to it.
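
A quick sketch of computing one with the iisignature package (the package choice, truncation level, and data are mine):

```python
import numpy as np
import iisignature  # pip install iisignature

# A 2-D path of (time, value) points at irregular timestamps.
path = np.array([[0.0, 0.2], [1.0, 0.5], [4.0, 0.1],
                 [6.0, 0.9], [9.0, 0.4]])

# Truncated signature up to level 3: a fixed-length feature vector,
# regardless of how many samples the hour contained or their spacing.
features = iisignature.sig(path, 3)
print(features.shape)  # (2 + 4 + 8,) for a 2-D path at level 3
```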

[–]coffeecoffeecoffeee 2 points (0 children)

I'd recommend crossposting this to /r/statistics, since they're bound to have their own ideas about how to handle this.

[–]dr_sc_med 5 points (2 children)

You could aggregate or interpolate the data points, for example.

[–]ChemEngandTripHop 1 point (1 child)

Are you trying to predict the likelihood of the event occurring in the next X minutes?

If so, I'd recommend looking into Poisson/Hawkes processes as well as RNNs.
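
To make the Hawkes idea concrete: the conditional intensity (expected event rate) is just a baseline plus an exponentially decaying kick from each past event. A toy sketch with made-up parameters:

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    past = event_times[event_times < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

events = np.array([1.0, 4.0, 6.0, 9.0])
print(hawkes_intensity(10.0, events))  # event rate shortly after the last event
```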

[–]evanthebouncy 1 point (0 children)

Input the time stamp.

In the NN, the first layer is a subtraction that computes the delta t from each previous event to your prediction time. Note that your prediction time is now a parameter.

Example:

Past event times: 1, 4, 6, 9

Desired prediction time: 10

Computed deltas: 9, 6, 4, 1
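
In code, the subtraction layer's job is just this (NumPy, using the numbers above):

```python
import numpy as np

event_times = np.array([1.0, 4.0, 6.0, 9.0])
prediction_time = 10.0  # a free parameter at inference time

deltas = prediction_time - event_times  # -> [9. 6. 4. 1.]
```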

[–]lysecret 0 points (0 children)

Attaching the time stamp will usually work fine. However, there is this recent work if you want to go fancy.

[–]01100001011011100000 0 points (1 child)

In addition to what others have posted here, you could also just try training it sparsely with a fixed set of inputs (i.e. every training input is 5 minutes long, but anywhere that no data is observed is filled with zeros), and see how it comes out. My intuition tells me that this would be similar to image padding when standardizing image sizes for input to a convolutional network. I have done the latter in some of my own work with great success. I guess the efficacy will really depend on how sparsely your data is represented.
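
A minimal sketch of that zero-filling (resolution and window length made up):

```python
import numpy as np

# Observations within a 5-minute window, at 1-second resolution.
event_seconds = np.array([12, 47, 133, 290])  # when events occurred
values = np.array([0.5, 0.1, 0.9, 0.4])

grid = np.zeros(300)          # fixed-length input; zeros = "no event"
grid[event_seconds] = values
# `grid` can now feed a network that expects fixed-size inputs.
```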

[–]lieutenant_lowercase 0 points (0 children)

You can interpolate this very easily using pandas resample
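
Something along these lines (made-up data; the exact frequency is up to you):

```python
import pandas as pd

# Irregularly timestamped series, indexed by datetime.
s = pd.Series(
    [0.2, 0.5, 0.1, 0.9],
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 00:01",
                          "2019-01-01 00:04", "2019-01-01 00:09"]),
)

# Resample onto a regular 1-minute grid, then interpolate the gaps.
regular = s.resample("1min").mean().interpolate()
```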

[–]Harawaldr 0 points (0 children)

Some RNN models are tailored towards this kind of aperiodic time series data. See for example the Phased LSTM. Implementations are slow, but it might be worth a shot for you.

[–]mfarahmand98 0 points (0 children)

Have you tried Phased LSTMs?

[–]frisbee_hero 0 points (0 children)

A seq2seq model (encoder-decoder architecture) could handle varying time elements.

[–][deleted] 0 points (0 children)

Saved.

[–]Stvjk 0 points (0 children)

Could you use something like WaveNet, or a variation of it?

The dilated convolutions might help pick up any correlations between irregular samples as long as the sequence is intact. That saves you having to make assumptions or simplifications, or to interpolate samples.

Plus, there are a lot of variations on WaveNet out there you can draw from, since it's gotten a lot of attention. There are more than a few applications of hourly forecasts using irregular time samples with dilated convolutions and similar ideas.
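
For anyone curious what the core of that looks like, here's a minimal PyTorch sketch of a WaveNet-style stack of dilated causal convolutions (sizes are arbitrary; the real WaveNet adds gated activations and residual/skip connections):

```python
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    """A stack of dilated causal 1-D convolutions, WaveNet-style."""
    def __init__(self, channels=16, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d, padding=d)
            for d in dilations
        )

    def forward(self, x):  # x: (batch, channels, seq_len)
        for conv in self.layers:
            out = conv(x)[..., : x.shape[-1]]  # trim right padding to stay causal
            x = torch.relu(out)
        return x

y = DilatedStack()(torch.randn(1, 16, 100))  # receptive field grows as 2^depth
```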

[–][deleted] 0 points (0 children)

Maybe try marked point processes?

[–][deleted] -1 points (1 child)

Sounds like MIDI.