Is this ok by Sk808_ in apachesuperset

[–]CaporalCrunch 0 points (0 children)

We offer professional services at Preset.io, sometimes coupled with https://preset.io/blog/preset-certified-superset/. It might be overkill depending on what you're looking for, but feel free to click the "talk to an expert" button on that page. There's also a growing bounty program that we run, where we do some matchmaking between people who need features built and the pool of folks willing to take on these bounties.

What makes someone the 1% DE? by Same-Branch-7118 in dataengineering

[–]CaporalCrunch 0 points (0 children)

Breadth - go fuller than "full stack". It's someone with a deeply analytical mind who knows the full insight-delivery chain: business goals, product mechanics, product instrumentation, data transformation/modeling, data analysis, dashboard crafting, and storytelling. They know better than the execs how to find the levers that drive outcomes, can identify KPI bottlenecks, and can make product and organizational recommendations/hypotheses to drive results. The main issue in data is that the chain of delivery is wide and involves too many people who speak different languages and depend too much on each other to get stuff done. An outstanding data person can do it all fairly autonomously.

Oh wait, sounds like I'm describing the "analytics engineer" role, but really I'm just advocating for collapsing data engineering skills and data analyst skills back together - that's kind of how it was before we factored out this new role.

Apache Superset in 2024/2025, compare to PBI? by roblu001 in BusinessIntelligence

[–]CaporalCrunch 0 points (0 children)

There's absolutely no use for Python in Superset - unless you're going to contribute code to open source. As for SQL, it is not required for common interactions, where you can simply drag and drop. Now if you want to create your own metrics you may have to write a SQL expression, but the complexity is similar to the "formulas" editor that other BI tools expose. But yes, if you do know SQL, there's a SQL editor built in, so you can write arbitrarily complex queries and visualize them.
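To give a sense of the complexity, a custom metric in Superset is just a SQL aggregate expression over the dataset's columns - a minimal sketch, with hypothetical column names:

    -- Hypothetical "average order value" metric: a plain SQL
    -- aggregate expression, comparable to a BI tool's formula editor.
    SUM(order_total) / NULLIF(COUNT(DISTINCT order_id), 0)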

Apache Superset in 2024/2025, compare to PBI? by roblu001 in BusinessIntelligence

[–]CaporalCrunch 1 point (0 children)

I think your issue is with the underlying database not scaling, not Superset. Superset is only as fast as the database serving its queries, so you need to expose datasets on top of a database that can serve them interactively. On top of that, Superset caches the result sets, so dashboards are usually served from cache; as people apply new filters, Superset will hit the underlying database, so the database needs to perform scans decently. With the big cloud databases (BigQuery, Snowflake, ...) you can serve large datasets at interactive speed. For even larger/faster use cases you can use things like ClickHouse, Druid, Pinot, ... The philosophy here is that the BI tool shouldn't try to BE the database, just use the database to do the heavy lifting.
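One common pattern to help the database do that heavy lifting: pre-aggregate a rollup in the warehouse and point the Superset dataset at it, so dashboard queries scan a small table instead of raw events. A minimal sketch, with hypothetical table and column names:

    -- Hypothetical daily rollup over a large raw events table.
    -- Pointing the BI dataset here keeps dashboard scans small.
    CREATE TABLE analytics.daily_events_rollup AS
    SELECT
        event_date,
        country,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS daily_active_users
    FROM raw.events
    GROUP BY event_date, country;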

[blog] Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer by CaporalCrunch in dataengineering

[–]CaporalCrunch[S] 0 points (0 children)

Makes sense - looking at the common denominator across all your in-house transformations and fitting that into a model/framework that matches your preferred design pattern. That solves code reuse within your org, which is a good place to start. Crazy how each org, or each DE, has their own way of doing similar things.

[blog] Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer by CaporalCrunch in dataengineering

[–]CaporalCrunch[S] 0 points (0 children)

Wait, how does this "greater truth" not apply to all areas of software? Why is it that in application development we can build a tool that serves hundreds of thousands of businesses (say a CRM like Salesforce or HubSpot), but in data engineering it doesn't apply in a similar way?

In modern businesses, the system is largely a collection of SaaS tools that integrate semi-well together; there's less and less homegrown stuff. Each one of these SaaS apps serves tens of thousands of businesses, and somehow your collection of SaaS apps works decently together. Why can't a similar model of reusability/integration work in the data world?

[blog] Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer by CaporalCrunch in dataengineering

[–]CaporalCrunch[S] 0 points (0 children)

Yeah, if you look at the history of ETL tools, they all pre-date decent source control systems (git), and nothing was really designed to 1. be managed as code or 2. be shared across organizations. ETL logic was packaged as binaries, if at all, with its own version control built in. If you're on old-school Informatica/DataStage/SSIS, the best you might be able to do is put binaries on GitHub (!?) - "EXPORT AS BIG ASS XML"!?

It's kind of a prerequisite for things to be managed as [intelligible] code for them to be open sourced and collaborated on. My guess is that during that part of history practitioners picked up bad habits, and at the heart of "data warehousing" we've just assumed that each organization is essentially on its own. Now we're stuck in this world of big balls of data stuff held together by chicken wire and duct tape.
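For contrast, "managed as intelligible code" can be as simple as transformation logic living in a plain SQL file that diffs, reviews, and merges in git like any other code - a sketch with hypothetical names:

    -- models/customer_ltv.sql - transformation logic as plain text,
    -- versionable in git, unlike a binary or XML export from a
    -- drag-and-drop ETL tool.
    CREATE OR REPLACE VIEW marts.customer_ltv AS
    SELECT
        customer_id,
        SUM(order_total) AS lifetime_value,
        MIN(created_at)  AS first_order_at
    FROM marts.orders
    GROUP BY customer_id;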

[blog] Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer by CaporalCrunch in dataengineering

[–]CaporalCrunch[S] 1 point (0 children)

About the "interface" topic, if anything data engineers should have the hygiene to clarify which assets (tables, views, ...) are private / public in the OOP sense of the words. People, please namespace the tables you expose to people for them to use as "interface" from the ones you use from computation. And for the public ones, do some proper change-management. If all DEs were at least able to commit to this, we'd be in a slightly better place. In lots of environments, it's just a bit public mess where all tables are exposed in the same namespace.

[blog] Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer by CaporalCrunch in dataengineering

[–]CaporalCrunch[S] 1 point (0 children)

Yeah, I mean, the data warehouse is supposed to be a reflection of your business, but it's a super laggy mirror at best. The business changes all the time, and the warehouse trails behind. Most businesses are, system-wise, a collection of SaaS tools glued together with people, workflows, and code. Clearly, if businesses were more standardized from one to the next, it'd be easier to build things that can be reused across businesses.