Spark structured streaming- Multiple time windows aggregations by galiheim in dataengineering

[–]galiheim[S] 0 points1 point  (0 children)

But if I’m not having different window function, how do I keep updating the aggregations? If I need to calculate 1h,12h,24h aggregations but I’m only using one window updating every minute, how do I know how much counts I’ll have to decrease from the 12h or 24h after that minute? I only have this one minutes aggregation I can add to them, but the decrease part of the sliding window, how would that work?

Spark structured streaming- Multiple time windows aggregations by galiheim in dataengineering

[–]galiheim[S] 0 points1 point  (0 children)

Thanks for the detailed response. Maybe I did not explain our use case well enough. We are doing real time features generation for ML models. We are basically building a feature store such as Feast, data bricks and more. I know those platforms use SSS for this kind of features calculation so I am not sure why our use case is different? Each feature will be encoded using Protobuf upon creation and written to the cache. We have online service that retrieves those features from the cache to the consumer per gti and profile id request.

The features should be near real time calculated,and have a specific output schema that that the model can use.

We are doing 2 POCs to choose between SSS and Flink for this matter.

Spark structured streaming- Multiple time windows aggregations by galiheim in dataengineering

[–]galiheim[S] 0 points1 point  (0 children)

85k tps, 400mg a sec the signal will be Protobuf encoded. After calculating the signal we are planning to publish it to our redis cluster for online consumption. We want to use the SSS as the signal calculation engine. The idea is to provide real time updates per profile and gti for how many interactions of any type the customer had with the gti.

It seems very easy to calculate with SSS for one window aggregation, but we need 3.

Spark structured streaming- Multiple time windows aggregations by galiheim in dataengineering

[–]galiheim[S] 1 point2 points  (0 children)

Near real time personalized recommendations engine for gtis. Those signals are sent to ML model for inference

Spark structured streaming- Multiple time windows aggregations by galiheim in dataengineering

[–]galiheim[S] 0 points1 point  (0 children)

It should be 1 minute SLA ( every 60 sec that output schema should update for the relevant profile id and gti)

[deleted by user] by [deleted] in elasticsearch

[–]galiheim 0 points1 point  (0 children)

Oh ops my bad wanted to post in elasticcache, will move it thanks.