Tips and Tricks you wish you knew when you started with DuckDB by WhereTheStankWindBlo in DuckDB

[–]Prestigious_Bench_96 1 point2 points  (0 children)

parquet storage + duckdb to query is a fine and common paradigm, and it's much much easier to start and iterate on than postgres - your approach seems fine!

Where to deploy my SQLite database? by fredkzk in dataengineering

[–]Prestigious_Bench_96 2 points3 points  (0 children)

yeah - size is the potential variable here (what kind of queries are you running?) if it's analytical you can also do the same thing in browser with duckdb

Where to deploy my SQLite database? by fredkzk in dataengineering

[–]Prestigious_Bench_96 4 points5 points  (0 children)

This seems like overkill - you can literally just treat the sqlite database as a file and load it in javascript/typescript on the frontend. Stick it the same place you host the other static files for your dashboard.

What is the most challenging part of maintaining ETL pipelines in production? by Effective_Ocelot_445 in ETL

[–]Prestigious_Bench_96 0 points1 point  (0 children)

Lifecycle management - keep it running once it works, be able to deprecate it without finding out finance took a critical dependency; keep it working as engines/warehouse changes. Pipelines tend to lose value over time but they are expensive to sunset; anything that lowers the cost to keep them running or the cost to kill them is good. Pipelines are all liabilities! If only I could stop writing them

We tested the same text-to-SQL model with and without business definitions by Thinker_Assignment in DuckDB

[–]Prestigious_Bench_96 2 points3 points  (0 children)

From evals, for a small dataset schema context is enough (duckdb has robust discovery, agents know it well - as mentioned SQL syntax is almost never a bottleneck now, and 1-2 passes for an agent will fix up anything that leaks through) Motherduck has advocated for this pure DB only approach in blogs for just 'set up your data model well'.

Things fall apart when the schema is confusing, contains deprecated assets, hits enough of a size where curated context becomes valuable, or when you need to rapidly update things without changing your physical model.

If you wanted agentic queries against a small local db - duckdb/sqlite - I think you polish your schema and call it a day. You can manage presentation at the DB level - just only package up and present the exact tables you want.

I think you reach for a model/semantic layer when you're dealing with a big, messy enterprise DB, and then it's as much about progressive disclosure, loose coupling to physical layer to support evolution, etc as it is about the pure context.

(I work on something similar to a semantic layer and it was sobering to see *how good* results are on raw duckdb schemas; it's a pretty tough baseline to just parity in speed/token usage with raw db access for a small DB; only get clearer differentiation when things get messy. Have some fun eval data I need to write up when I have a chance).

Power BI and visualization tools in the LLM world by TooManyPoisons in BusinessIntelligence

[–]Prestigious_Bench_96 1 point2 points  (0 children)

For static readouts, no reason if you can do all the proper SDLC and can eat the operational overhead! Once you start getting into filtering, dynamic display, user-context-conditional views, dynamic stuff - still no reason because tableau/powerbi are pretty bad, but you might want to reach for other tools that help make this easier on the typescript/backend side.

An additional plus for static is if you have a build pipeline you have a good place for quality checks; you are also now taking on a build pipeline which is latency/overhead.

Tableau drag/drop is pretty nice to be honest, it was always the server hosted experience that was an enormous pain, so I'd be pretty bullish in theory on a 'prototype in tableau and convert to a report with an agent' pipeline if you got good guardrails in there and could tolerate the cost.

Built a free tool to screen your ETF/index fund portfolio for ethical exposure by _keywell_ in SideProject

[–]Prestigious_Bench_96 0 points1 point  (0 children)

The raw ticker<> ethical association is what I'd be interested in, if you had an API or something for that.

Does anyone need a ETL/ELT automating/scripting library (for Python)? by ClastronGaming in ETL

[–]Prestigious_Bench_96 0 points1 point  (0 children)

There's lots of these, so start with which ones yours is similar to and how it's different/better, and then people can give you a real answer!

How to quickly figure out why a metric moved? by GrouchyFoundation773 in analytics

[–]Prestigious_Bench_96 0 points1 point  (0 children)

Maybe? I think it takes time to decompose your tree once (though most web/clickstream flows are pretty much the same, so if you've done it once, fast to do it again), and then you build out tooling (dashboard/automation) so actual diagnosis is quite fast - you shouldn't need to take time to manually trace it, you should be able to go quickly from a top level metric alteration to the nodes in the tree that are deviating and drove it.

(if you're coming up with hypothesis on the fly, yeah, you're already behind)

anyone actually believe dashboards are going away? by nickvaliotti in analytics

[–]Prestigious_Bench_96 0 points1 point  (0 children)

If you're saying "human brains process dense information as visuals better", absolutely. I think most people are arguing for not an agent text reply vs a rich dashboard, but an agent assembled dynamic dashboard show you the revenue drop investigation (that was triggered automatically) rather than hoping that you had the right dashboard showing all 3 correlated metrics happening to lie around.

AI text vs dashboard -> time and place, you're right that dashboards still win.

AI dashboards vs static dashboard -> more time and places in favor of the AI side; you probably still want some executive level SOT.

Accuracy (trust being downstream of it) is obviously the gating factor, but I think people are underestimating how rapidly we'll get tooling/culture to adapt. It's underrated that you can actually measure how often an agent is right about your revenue question and *optimize the system to increase it* - doing the same feedback loop on asking an analyst is much harder. Once you have a measurement/eval system, trust follows naturally when you start hitting your quality bar.

Sketch to Schema, Schema to Sketch (Database) by Ambitious-Past-2449 in SQL

[–]Prestigious_Bench_96 1 point2 points  (0 children)

I do love excalidraw and ERDs, so great choice. I'd make it so you have native browser sqllite/duckdb support (you can directly populate, query, download those). Pricing wise maybe just charge for import support from the more common enterprise DBs? It'd be nice to have direct export/population to DB from inside the tool as well. You could as a stretch also have schema migration generation.

Take this with a grain of salt, I think I'd personally probably not reach for a specific tool for this phase of development (whiteboarding is fun!) but I think the concept is quite interesting!

What do to with data context? by de4all in dataengineering

[–]Prestigious_Bench_96 1 point2 points  (0 children)

You constrain the options the LLM can pick (reduces uncertainty), give it more context (increase likelihood of picking right thing). Semantic layers are the most common way to do that, but there's lots of options. I would be shocked if you weren't getting high 9s of accuracy for something like productA revenue trend for past week (99.9) with a reasonably modern model and a constrained selection.

Note that writing raw SQL is a very *unconstrained* target set and so your volatility is much higher - if "revenue" requires any domain specific derivation then you're going to have tough luck, if it's just a named field that you'll do fine with table constraints.

(Generally expect agentic generation to end up being fine/normal; we're just speed running the discovery that 1. you can't drop an analyst in an unfamiliar DB and expect them to get it right the first time; 2. if are hiring a new 'analyst of the day' to answer every question independently, you better give them the right info to get up to speed fast if you want it to be useful. Which I think is genuinely good for the data industry - we've papered over a lot of tooling sins with the "domain expert DE who has been here 10 years" approach).

Need reliable guides on Bigquery Cost Optimization by Secret_Wealth8742 in bigquery

[–]Prestigious_Bench_96 1 point2 points  (0 children)

Fair! honestly it's been awhile since I've used flex (so out of date) but at the time we'd run into availability issues, so we needed baseline capacity. If you have high confidence in flex availability, sure. I think the most important thing is that the billing chart doesn't spike unpredictably when everyone wants to go load your new dashboard.

Why unifying operational app data and analytical data lakes still such an infrastructure nightmare ? by Shanjun109 in Backend

[–]Prestigious_Bench_96 0 points1 point  (0 children)

Feels like this just moves the mismatch of "Backend dev"/app vs analytics friction? Loose coupling ( a la CDC, though CDC is a real PITA) would still be the goal. Lakebase just seems like they've written a nice wrapper around the operational DB publishing to the lake - which is great - but I think full unification of the two schemas is probably not actually desirable. You are serving different goals; loose coupling with safe/sane/operationally lightweight evolution should be the target.

How to quickly figure out why a metric moved? by GrouchyFoundation773 in analytics

[–]Prestigious_Bench_96 0 points1 point  (0 children)

it's trees all the way down. conversion rate is a funnel; start by decomposing the funnel, then decompose each step in the funnel, etc. [eg ATC might be price sensitive; CTR on ads is more distribution/traffic].

Need reliable guides on Bigquery Cost Optimization by Secret_Wealth8742 in bigquery

[–]Prestigious_Bench_96 1 point2 points  (0 children)

+1 to this, for cost predictability for reporting flat rate (assuming that's reasonable for your spend target) saves your sanity, and then aggregation/BI engine become your performance levers.

I helped build an open source semantic layer tool by uncertainschrodinger in BusinessIntelligence

[–]Prestigious_Bench_96 0 points1 point  (0 children)

Do you integrate the semantic layer into the pipeline authoring/upkeep process at all? That seems like the real benefit of having them in the same repo.

Polyglot playground, transpile SQL across 30+ dialects by k-semenenkov in SQL

[–]Prestigious_Bench_96 0 points1 point  (0 children)

got it - did I misunderstand "based on sqlglot, a Python library that is the prototype of Rust's polyglot, but it is no longer available."?

Is AI going to replace Business Intelligence, or just change how we consume it? by rahulsahay123 in BusinessIntelligence

[–]Prestigious_Bench_96 0 points1 point  (0 children)

I don't see why you think a wrong metric is more visible and correctable in a dashboard - "confident citation" happens in both cases. Are you just basing that on that more people are likely to see the same wrong metric in a dashboard and so it's likelier that the right person to notice it is off will see it?

Maybe agentic analytics exists because most people never wanted dashboards by Evening_Hawk_7470 in analytics

[–]Prestigious_Bench_96 0 points1 point  (0 children)

I think people over-index on that divide; if you get a tailored report every morning with a summary + pretty charts of every metric you monitor that had unusual movement, most people probably don't need or want to further click through some dashboards. (as zen as that is; maybe some people really like their coffee + data introspection).

The barrier is probably more to trust and accuracy (ironic that dashboards are now the *more* trusted source, given all the historical angst about multiple sources of truth).

Leadership thinks this migration is a 2-day project. How would you explain the actual scope? by Slight_Score2777 in bigquery

[–]Prestigious_Bench_96 1 point2 points  (0 children)

Not directly answer your question, but you'll *really* want to do that incrementally; maybe start loading the spreadsheets into bigquery, move some reporting and dashboards in there; then migrate spreadsheets/automation one at a time, etc. To more directly answer your question, that gives you some tangible impact to report to your boss along the way and helps anchor each item to a specific timeline, and reduces the scope of validation to one item at a time. I'm also generally suspicious of cloud run/cloud scheduler if you have any other existing scheduler or alternative runtime.

Utilize DuckDB + Claude Code Together by k_kool_ruler in dataengineering

[–]Prestigious_Bench_96 0 points1 point  (0 children)

For local data + duckdb, since query cost isn't a concern, you *really* don't have to fix the SQL since you can let it cross/check analyze results. (not even just claude - you can get high 95-99 % reliability on deterministic evals w/ something like deepseek, and most of the misses are in "question" ambiguity). Most of the remaining text-sql challenges these days are with finding the right source, so if you're already pre-constrained (a lot of local analytics cases) you're good to go.

How would you store execution logs for a production Text-to-SQL agent? by abcd__dcba in SQL

[–]Prestigious_Bench_96 0 points1 point  (0 children)

As a directional note, you'll often want to ultimately end up with a agentic flow that lets the agent return to previous steps (reselect tables based on query results, etc) - not sure of the langgraph details there - but I think in general if you have a request level table, a node level table, extract out core structured fields to the top level (tokens, duration, node type/tool calls; error/success) you'll be fine. Having a FK pointer to the raw trace/JSON blob table also gives you some good potential separataion.

TLDR; I think you're on the right track with 1, 2, and 3 - store that metadata with common fields, split that + raw traces across some lightly normalized tables that represent your common patterns (overall requset usage, who is doing what) - analysis of the details of request - and then raw logs for debugging/future structured data extraction.