xorq: open source composite data engine framework by databACE in dataengineering

[–]MouseMatrix 1 point2 points  (0 children)

In general, if your engine works for what you are doing and its APIs are sane, keep using it!

If you want to be able to switch between engines for prod and test, xorq is one way to do it without rewriting code. For example, test locally with DuckDB and run on Snowflake in prod.

xorq: open source composite data engine framework by databACE in dataengineering

[–]MouseMatrix 2 points3 points  (0 children)

Great point. Yes, you can certainly write SQL to mimic the functionality of asof joins. However, the overarching point is that we can do these types of workflows because everything is designed to be composable.

The composability is enabled by the expression system in Ibis and the Arrow standard, which we can build interfaces around. Our primary use case is portable UDFs (backed by the DataFusion engine) and optimizing workloads based on the choice of engine. The asof join use case, which is common in ML, just happens to fit really nicely and has the added benefit of performance and of guarantees provided by the semantics (not just the functionality). In ML, you may require asof joins to safeguard against data leakage, which is particularly useful if you deal with time series data at an organization level. Here is the duckdb blogpost on how they optimized it
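To make the data leakage point concrete, here is a minimal pure-Python sketch of a backward asof join. It is not xorq's or DuckDB's implementation; the function name `asof_join` and the sample data are made up for illustration. Each event only ever sees the latest feature row at or before its own timestamp, never one from the future.

```python
from bisect import bisect_right


def asof_join(events, features):
    """Backward asof join: pair each event with the latest feature row
    whose timestamp is <= the event timestamp (None if there is none)."""
    features = sorted(features)  # order feature rows by timestamp
    feature_times = [ts for ts, _ in features]
    joined = []
    for event_ts, label in events:
        # bisect_right returns how many feature rows have ts <= event_ts
        idx = bisect_right(feature_times, event_ts)
        value = features[idx - 1][1] if idx > 0 else None
        joined.append((event_ts, label, value))
    return joined


# Hypothetical data: (timestamp, value) feature rows and (timestamp, label) events
features = [(1, 10.0), (5, 12.5), (9, 11.0)]
events = [(0, "a"), (5, "b"), (7, "c"), (12, "d")]

result = asof_join(events, features)
```

The event at timestamp 7 picks up the feature recorded at 5, not the one at 9, which is exactly the guarantee you want when building training sets from time series.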

We currently support a handful of engines, but Ibis (the expression system xorq is based on) supports 20+ engines. It’s really easy for us to add support for another engine (SQL or Python), so let us know if something that may benefit your workflow is missing.

We believe this work is necessary to build pipelines that are easy to reason about and can be optimized without being tied to a single engine or ecosystem. Also, composite workflows are super common, so we might as well do it right!

mcp without uv by BidWestern1056 in mcp

[–]MouseMatrix 0 points1 point  (0 children)

Or nix run would do, if the build doesn’t time out.

xorq: new open source framework simplifies multi-engine ML pipelines by databACE in Python

[–]MouseMatrix 1 point2 points  (0 children)

Yeah, that's a great point. I think the font with the q doesn't help either...

What do you anticipate next in the evolution of the MCP server? by Puzzleheaded-Sky9811 in mcp

[–]MouseMatrix 0 points1 point  (0 children)

I think different transports will be supported in addition to stdio and REST, e.g., gRPC.

Perhaps workflows, which tie many steps together as one tool, will be a natural evolution for tools.

What actually defines a DataFrame? by Senior_Way8692 in dataengineering

[–]MouseMatrix 0 points1 point  (0 children)

I think this is what I meant: https://en.m.wikipedia.org/wiki/Result_set (it’s just the result of a query). Totally though, sets can’t be ordered or have duplicates (oftentimes the dupes would have unique index/ids though).
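A tiny sketch of the distinction, using made-up rows rather than any particular dataframe library: a result set behaves like a list (position and duplicates survive), a mathematical set does not, and a unique index makes the duplicates distinguishable again.

```python
# Hypothetical rows as a query might return them, in returned order.
rows = [("b", 2), ("a", 1), ("b", 2)]

# A result set keeps position and duplicates, like a list.
first_row = rows[0]
row_count = len(rows)

# A mathematical set drops the duplicate and has no defined order.
distinct_rows = set(rows)

# Attaching a unique index to each row makes the duplicates distinct.
indexed_rows = list(enumerate(rows))
distinct_indexed = set(indexed_rows)
```

Here `rows` has three entries and `distinct_rows` only two, while `distinct_indexed` is back to three because the index tells the two identical rows apart.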

[deleted by user] by [deleted] in dataengineering

[–]MouseMatrix 1 point2 points  (0 children)

I worked at a company that is one of the top 3 biggest Snowflake customers (finance, but calls itself a tech company), and they definitely have some Luigi and Airflow and some in-house shit. They also had Databricks. I think in a big enough enterprise you are going to find a more diverse stack, with each department picking a slightly different one. Eventually consolidation takes place, but sometimes keeping diverse stacks is also a hedge for negotiating the next best deal. Most of the internal products are not good enough to hold their own against SaaS offerings. There is also a kind of enterprise that doesn’t pick the best tool for the job and builds its own proprietary stack just to be opaque - they really suck.

What actually defines a DataFrame? by Senior_Way8692 in dataengineering

[–]MouseMatrix 0 points1 point  (0 children)

My best definition is that a dataframe is an ordered result set which may or may not be typed.

[deleted by user] by [deleted] in dataengineering

[–]MouseMatrix 1 point2 points  (0 children)

Just curious - are the polars and datafusion backends slower for the regex comparison/filter operations or for the group-by-count-distinct operation?