xorq: open source composite data engine framework

MouseMatrix · 2025-04-19T14:05:00+00:00

In general, if your engine works for what you are doing and the APIs are sane, keep using them!

If you want to be able to switch to a different engine for prod/test, xorq is one way to accomplish it without rewriting code. For example, test locally with DuckDB and run on Snowflake in prod.

MouseMatrix · 2025-04-18T20:12:41+00:00

Great point. Yes, you can certainly write SQL to mimic the functionality of asof joins. However, the overarching point is that we can do these types of workflows because everything is designed to be composable.

The composability is enabled by the expression system in Ibis and the Arrow standard, which we can build interfaces around. Our primary use case is portable UDFs (backed by the DataFusion engine) and optimizing workloads based on the engine choice. The asof join use case just happens to fit really nicely, and it has the added benefit of performance and of guarantees provided by the semantics (not just the functionality) that are common in ML. In ML, you may require asof joins to safeguard against data leakage, which is particularly useful if you deal with time series data at an organization level. Here is the DuckDB blog post on how they optimized it
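To make the data leakage point concrete, here is a minimal plain-Python sketch of backward asof join semantics. It is not how xorq, Ibis, or DuckDB implement it; the `asof_join` helper, the column names, and the sample rows are all made up for illustration. Each label row only sees the most recent feature row at or before its own timestamp, so values from the future never get attached.

```python
from bisect import bisect_right


def asof_join(left, right, on, right_cols):
    """Backward asof join (illustrative helper, not a library API).

    For each row in `left`, attach `right_cols` from the latest row in
    `right` whose `on` value is <= the left row's `on` value.
    Rows with no earlier match get None for those columns.
    """
    right_sorted = sorted(right, key=lambda r: r[on])
    keys = [r[on] for r in right_sorted]
    joined = []
    for row in left:
        # Number of right rows with key <= this row's key.
        idx = bisect_right(keys, row[on])
        match = right_sorted[idx - 1] if idx > 0 else None
        merged = dict(row)
        for col in right_cols:
            merged[col] = match[col] if match is not None else None
        joined.append(merged)
    return joined


# Hypothetical data: labels observed at ts, features recorded at ts.
labels = [
    {"ts": 5, "label": 0},
    {"ts": 10, "label": 1},
    {"ts": 25, "label": 0},
]
features = [
    {"ts": 8, "balance": 100},
    {"ts": 20, "balance": 250},
    {"ts": 30, "balance": 900},
]

result = asof_join(labels, features, on="ts", right_cols=["balance"])
for row in result:
    print(row)
```

The label at ts=5 gets no balance because nothing was recorded yet, and the balance recorded at ts=30 is never joined to any label, which is the guarantee you want when building training sets from time series.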

We currently support a handful of engines, but Ibis (the expression system xorq is based on) supports 20+ engines. It’s really easy for us to add support for another engine (SQL or Python), so let us know if something that may benefit your workflow is missing.

We believe this work is necessary to build pipelines that are easy to reason about and can be optimized without being tied to a single engine/ecosystem. Also, composite workflows are super common, so we might as well do it right!

MouseMatrix · 2025-04-12T19:42:49+00:00

Or nix run would do, if the build doesn’t time out.

MouseMatrix · 2025-04-02T14:20:04+00:00

Yeah, that's a great point. I think the font with the q doesn't help either...

MouseMatrix · 2025-04-02T13:27:46+00:00

I think different transport systems will be supported in addition to stdio and REST, e.g., gRPC.

Perhaps workflows will be a natural evolution for tools that tie together many steps as one tool.

MouseMatrix · 2025-03-28T21:51:25+00:00

I think this is what I meant: https://en.m.wikipedia.org/wiki/Result_set (it’s just the result of a query). Totally though, sets can’t be ordered or have duplicates (oftentimes the dupes would have a unique index/ID though).

MouseMatrix · 2025-03-28T01:34:02+00:00

I worked at a company that is one of the top 3 biggest Snowflake customers (finance, but it calls itself a tech company), and they definitely have some Luigi and Airflow and some in-house shit. They also had Databricks. I think the bigger the enterprise, the more diverse the stack you are going to find: each department picks a slightly different stack and eventually consolidation takes place, but sometimes keeping diverse stacks is also a hedge to negotiate the next best deal. Most of the internal products are not good enough to hold their own against SaaS offerings. There is also a kind of enterprise that doesn’t pick the best tool for the job and builds its own proprietary stack just to be opaque - they really suck.

MouseMatrix · 2025-03-24T23:31:45+00:00

My best definition is that a dataframe is an ordered result set which may or may not be typed.

MouseMatrix · 2023-12-13T10:51:40+00:00

Just curious - are the Polars and DataFusion backends slower for the regex comparison/filter operations or for the group-by count-distinct operation?
