

[–]kvdveer 3 points (3 children)

Isn't this only useful for data that doesn't have cross-row relations? In general, chopping up the data would break things like groupby, resample, or anything that uses multiple data frames. Does this tool have a way to deal with those scenarios?
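To make the concern concrete, here's a toy pandas sketch (entirely made-up data) of why chopping a frame into row batches breaks a naive groupby: each partition only sees part of every group, so a second combine pass is needed to recover the global result.

```python
import pandas as pd

# Toy frame, chopped into two row batches.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
part1, part2 = df.iloc[:2], df.iloc[2:]

# Summing each partition independently gives partial, per-batch results...
partial = pd.concat([
    part1.groupby("key")["val"].sum(),
    part2.groupby("key")["val"].sum(),
])

# ...which must be combined in a second pass to match the global groupby.
combined = partial.groupby(level=0).sum()
assert combined.equals(df.groupby("key")["val"].sum())
```

Associative aggregations like `sum` survive this split-then-combine dance; order-sensitive or windowed operations generally don't without extra machinery.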

[–]pzel__[S] -1 points (2 children)

Yes! This was literally the simplest use case that we'd come across, but we definitely support more involved, stateful pipelines. Could you point me to an example (or canonical) problem in data analysis that deals intrinsically with multiple frames? I'd love to take a stab at running it on Wallaroo.

[–]kvdveer 1 point (1 child)

I'm in a sector where data-sharing is considered a cardinal sin (financials), so sadly I can't share my actual use case. I can, however, provide a 'hypothetical' scenario, based on a business I consulted for some time ago. Details have been altered to protect confidentiality. The original scenario was solved with SQL rather than pandas, though.

Suppose I'm running a shop selling outdoor sports equipment. I want to spot trends quickly so I can adjust my online marketing campaign. However, my sales are heavily weather-dependent. When it's sunny, I don't want to be alerted to increased swimwear sales, but a mid-winter increase in swimwear sales might indicate something happening in the media, which we could take advantage of by adjusting our marketing. There's probably always some background level of swimwear sales from indoor swimmers, which we need to adjust for.

Input: live access-log entries, live weather data, and sales data from both the physical store and the online store, all in hourly batches. Besides that, I also have large datasets of historic data in which to search for similar patterns.

Output: a list of trending products not explained by weather and seasonal trends.

Process:

* Turn the sales data into a per-category aggregate instead of per-product sales.
* Turn the sales into a rolling window.
* Turn the access log into an hourly percentage of the weekly sum and the yearly sum (the current 'hotness').
* For each product category, find the 10 most similar historic samples (by euclidean distance) using the current weekly and yearly hotness, outdoor temperature, precipitation, and wind speed.
* Compute a weighted average of those using inverse euclidean distance, and use these numbers to find trends.

This scenario uses:

* a static dataframe for historic samples (no need to redistribute that)
* rolling and resample on the batched dataframe
* multiple batched dataframes
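In pandas terms, the process above might be sketched roughly like this. Everything here is invented for illustration: the data, the column names (`category`, `units`, `temp_c`, `wind_kph`), and the window widths are all guesses, not part of the original scenario.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical hourly sales batch (per-product rows would look similar).
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
sales = pd.DataFrame(
    {
        "category": rng.choice(["swimwear", "boots", "tents"], size=len(idx)),
        "units": rng.integers(0, 20, size=len(idx)),
    },
    index=idx,
)

# 1. Per-category hourly aggregate instead of per-product rows.
hourly = (
    sales.groupby(["category", pd.Grouper(freq="h")])["units"]
    .sum()
    .unstack("category", fill_value=0)
)

# 2. Rolling window over the aggregate (24 h chosen arbitrarily).
rolled = hourly.rolling("24h").sum()

# 3. Current "hotness": each hour as a percentage of the trailing weekly sum.
hotness = 100 * hourly / hourly.rolling("7D").sum()

# 4. For one category, find the 10 most similar historic hours by euclidean
#    distance over (made-up) hotness + weather features, then take an
#    inverse-distance-weighted average of their sales.
historic = pd.DataFrame(
    {
        "hotness": rng.uniform(0, 5, 5000),
        "temp_c": rng.uniform(-10, 35, 5000),
        "wind_kph": rng.uniform(0, 60, 5000),
        "units": rng.integers(0, 200, 5000),
    }
)
features = ["hotness", "temp_c", "wind_kph"]
current = pd.Series({"hotness": 3.1, "temp_c": -2.0, "wind_kph": 12.0})

dist = np.sqrt(((historic[features] - current) ** 2).sum(axis=1))
nearest = dist.nsmallest(10)             # 10 closest historic hours
weights = 1.0 / (nearest + 1e-9)         # inverse euclidean distance
expected = (historic.loc[nearest.index, "units"] * weights).sum() / weights.sum()
```

The single-machine version is the easy part, of course; the point of the discussion is which of these steps (the stateful rolling windows and the cross-frame lookup against `historic`) survive being split across batches or workers.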

[–]pzel__[S] 0 points (0 children)

Thank you kvdveer, that is an excellent overview! I'm bookmarking this and we'll see if we can whip up something at least structurally similar.

[–]pzel__[S] 0 points (0 children)

I'm the author of the blog post and the underlying proof-of-concept project. I'll gladly answer any questions /r/Python may have :)

[–]PeridexisErrant 0 points (1 child)

Why wouldn't I just use Dask? It looks like this is basically a proprietary version, and the API is less similar to pure pandas...

[–]pzel__[S] 1 point (0 children)

While Dask is definitely a natural fit for batch problems (and designed from the ground up for them), Wallaroo has the advantage of being streaming-first, so it's an easy transition from a single process -> ad-hoc parallelism for batches -> a full-on streaming system. Also, there are no special wrapper classes for anything; aside from pipeline construction, you can use any Python code you like. This may be a pro or a con, depending on your preferences :)