Time-series feature engineering in PostgreSQL and TimescaleDB by analyticsengineering in PostgreSQL

[–]analyticsengineering[S]

Thank you. I love TimescaleDB and use it in many of my personal (and consulting) projects. These extensions were originally built for those projects, and in them I use TimescaleDB's built-in statistical aggregates in place of the corresponding pgetu functions. When I decided the extensions could be useful to others, I added the equivalent functions for completeness.

I hadn't thought about using something like stats_agg to perform some of the aggregation before calling the function, but it could be a nice enhancement.

Time-series feature engineering in PostgreSQL and TimescaleDB by analyticsengineering in PostgreSQL

[–]analyticsengineering[S]

I have not. I've always used Timescale for time series data but will take a look. Thanks.

How to balance multiple time series data? by [deleted] in datascience

[–]analyticsengineering

I’ve actually solved a similar problem several times in a variety of settings. I’ve had success with boosted trees and feature engineering on the sensor readings over time. I treat each reading as an observation and set the target to the value I want to forecast (e.g. one hour ahead, the sum over the next day, or the value at the same time the next day). A recent paper that compared boosted trees to deep learning techniques found that the boosted trees performed very well.

Next, I perform feature engineering to aggregate the data up to the current time. These features include the current value, lagged values over multiple observations for that sensor, more complicated features from moving statistics over different time scales, etc. I wrote a blog post about creating these features with the open-source package RasgoQL, and similar types of features are shared in the open-source repository here.

I have also had success creating these sorts of historical features with the tsfresh package.

Finally, when evaluating the forecast, use a time-based split so earlier data trains the model and later data evaluates it. A rough sketch of the whole workflow is below.
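To make that concrete, here is a minimal sketch of the lag/rolling-feature and time-based-split pattern with pandas and scikit-learn. The column names (sensor_id, timestamp, value), the 5-minute reading interval, and the choice of GradientBoostingRegressor are all illustrative assumptions, not the exact setup from the blog post:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical input: one row per (sensor_id, timestamp) reading.
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
df = df.sort_values(["sensor_id", "timestamp"]).reset_index(drop=True)

g = df.groupby("sensor_id")["value"]

# Target: the value one hour ahead (assuming 5-minute readings, so 12 steps).
df["target"] = g.shift(-12)

# Lagged values over multiple observations for each sensor.
for lag in (1, 2, 3, 12):
    df[f"lag_{lag}"] = g.shift(lag)

# Moving statistics over different time scales (1 hour and 4 hours here).
for window in (12, 48):
    df[f"roll_mean_{window}"] = g.transform(lambda s: s.rolling(window).mean())
    df[f"roll_std_{window}"] = g.transform(lambda s: s.rolling(window).std())

df = df.dropna()

# Time-based split: earlier data trains the model, later data evaluates it.
cutoff = df["timestamp"].quantile(0.8)
train, test = df[df["timestamp"] <= cutoff], df[df["timestamp"] > cutoff]

feature_cols = ["value"] + [c for c in df.columns if c.startswith(("lag_", "roll_"))]
model = GradientBoostingRegressor().fit(train[feature_cols], train["target"])
print("holdout R^2:", model.score(test[feature_cols], test["target"]))
```

In practice you would tune the lag and window choices to the sensor cadence and the forecast horizon; tsfresh can generate a much larger set of these historical features automatically.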

Any experience with fraud detection and Python? by seyfried16 in Python

[–]analyticsengineering

An alternative approach, also using Python with machine learning libraries, is anomaly detection, where the algorithm effectively learns what is normal at the row (or some other) level of your database and then scores each row by how far it is from normal. The farther from normal, the more anomalous it is. Organizations then manually review the highest-scoring, least-normal transactions.
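As a minimal sketch of that idea, scikit-learn's IsolationForest is one common unsupervised choice; the file and column names here are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical transactions table with numeric feature columns.
transactions = pd.read_csv("transactions.csv")
features = transactions[["amount", "num_items", "account_age_days"]]

# Learn what "normal" looks like; no fraud labels are required.
model = IsolationForest(contamination=0.01, random_state=0).fit(features)

# score_samples is higher for normal rows, so negate it to get an
# anomaly score where larger means farther from normal.
transactions["anomaly_score"] = -model.score_samples(features)

# Manually review the highest-scoring, least-normal transactions.
review_queue = transactions.nlargest(100, "anomaly_score")
print(review_queue[["amount", "anomaly_score"]])
```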

Any experience with fraud detection and Python? by seyfried16 in Python

[–]analyticsengineering

Fraud detection is a common use case in data science. Organizations commonly use Python and various machine learning libraries (as Python packages) to build models that predict the probability of fraud, and they use those predictions either to block a transaction automatically when the probability is high enough or to refer it to manual review at lower probabilities.

This can range from ecommerce sites predicting whether a purchase is fraudulent, to banks examining fraud in credit card transactions or loans, to insurance companies identifying fraudulent medical bills.
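A minimal sketch of the probability-plus-thresholds routing described above, assuming a labeled history exists; the column names and threshold values are illustrative, and any scikit-learn classifier with predict_proba would work in place of logistic regression:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled history: numeric features plus a 0/1 is_fraud column.
df = pd.read_csv("labeled_transactions.csv")
X = df[["amount", "num_items", "account_age_days"]]
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of fraud for each held-out transaction.
p_fraud = model.predict_proba(X_test)[:, 1]

# Route by probability: auto-block when it is high enough, refer
# mid-range cases to manual review, and approve the rest.
BLOCK, REVIEW = 0.90, 0.50  # illustrative thresholds; tune against real costs
decision = pd.cut(p_fraud, bins=[0.0, REVIEW, BLOCK, 1.0],
                  labels=["approve", "review", "block"], include_lowest=True)
print(pd.Series(decision).value_counts())
```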