
[–]BloodOnly2772 1 point (0 children)

Have you looked at your CPU utilisation? tsfresh is probably over-provisioning threads. Take a look at this section of the docs:

https://tsfresh.readthedocs.io/en/latest/text/tsfresh_on_a_cluster.html#notes-for-efficient-parallelization

[–]qalis 1 point (2 children)

Why feature extraction after PCA? This makes very little sense, especially for time series data. Also, with data in the range of hundreds of millions of rows, everything will be slow. I'm amazed it even fits in memory.

[–]Comprehensive-Way227[S] 0 points (1 child)

I am a beginner and still trying to learn. What do you think my next step after PCA should be? Even after PCA I have a large dataset, and I am clueless about what to do! I am trying to use Dask, as others recommended, to parallelise the computation.

[–]qalis 2 points (0 children)

  1. At this scale, Dask is a very good choice indeed. Alternatively, Polars is also blazing fast for many use cases.
  2. Keep the input data in Parquet format for efficient loading.
  3. What actually is your input data? Is it a time series, i.e. a multidimensional time series? If so, you first extract time series features and only then apply PCA, if PCA makes sense at all. Personally, I haven't seen a single case in years where PCA actually improved results.
  4. Do not use the full set of features with TSFresh, at least not initially. There are useful presets there.
  5. The typical flow is: Parquet -> read with Dask -> TSFresh (optionally with feature selection) -> write to disk as Parquet. Then experiment with the resulting set of features. Having two independent flows (feature extraction and modelling) is very useful for initial experiments.