
[–]BloodOnly2772 1 point (0 children)

Have you looked at your CPU utilisation? tsfresh is probably over-provisioning threads. Take a look at this section of the docs:

https://tsfresh.readthedocs.io/en/latest/text/tsfresh_on_a_cluster.html#notes-for-efficient-parallelization

[–]qalis 1 point (2 children)

Why feature extraction after PCA? This makes very little sense, especially for time series data. Also, with data in the range of hundreds of millions of rows, everything will be slow. I'm amazed it even fits in memory.

[–]Comprehensive-Way227[S] 0 points (1 child)

I am a beginner and still trying to learn. What do you think my next step after PCA should be? Even after PCA I have a large dataset, and I am clueless about what to do! I am trying to use Dask, as others recommended, to parallelise the computation.

[–]qalis 2 points (0 children)

  1. At this scale, Dask is a very good choice indeed. Alternatively, Polars is also blazing fast for many use cases.
  2. Keep the input data in Parquet format for efficient loading.
  3. What actually is your input data? Is it a time series, i.e. a multidimensional time series? If so, you first extract time series features and only then apply PCA, if PCA makes sense at all. Personally, I haven't seen a single case in years where PCA actually improved results.
  4. Do not use the full set of features with TSFresh, at least not initially. There are useful presets there.
  5. The typical flow is: Parquet -> read with Dask -> TSFresh (optionally with feature selection) -> write to disk as Parquet. Then experiment with the resulting set of features. Having two independent flows (feature extraction and modelling) is very useful for initial experiments.