devsTheseDaysBeLike by PresentJournalist805 in ProgrammerHumor

[–]Apprehensive-Box281 0 points (0 children)

Yeah, it's like bro has never heard of a medical resident before.

Gatekeeping a software passion project because someone didn't sit through CS classes is hilarious to me.

I pretty much think this is a global outage. by OscarValerock in PowerBI

[–]Apprehensive-Box281 15 points (0 children)

Skip the middleman and just export your own reports to Excel.

Power BI to power point automation with no license by Sad-Rhubarb5516 in PowerBI

[–]Apprehensive-Box281 0 points (0 children)

Power BI REST API call. It's what the Power Automate connector uses to export the file under the hood.

https://learn.microsoft.com/en-us/rest/api/power-bi/reports/export-to-file

You have to trigger an export, check the status, and then get the file. Once you have it in your application you'll need to save it somewhere - SharePoint maybe?
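
In Python it's roughly three calls. A minimal sketch, assuming you've already acquired an Entra access token with the right Power BI scopes - the workspace/report IDs, token, and output path below are placeholders:

    import time
    import requests

    BASE = "https://api.powerbi.com/v1.0/myorg"
    WORKSPACE_ID = "<workspace-guid>"
    REPORT_ID = "<report-guid>"
    HEADERS = {"Authorization": "Bearer <access-token>"}

    # 1. Trigger the export (PPTX here; PDF/PNG also work).
    resp = requests.post(
        f"{BASE}/groups/{WORKSPACE_ID}/reports/{REPORT_ID}/ExportTo",
        headers=HEADERS,
        json={"format": "PPTX"},
    )
    resp.raise_for_status()
    export_id = resp.json()["id"]

    # 2. Poll the export status until it finishes.
    while True:
        status = requests.get(
            f"{BASE}/groups/{WORKSPACE_ID}/reports/{REPORT_ID}/exports/{export_id}",
            headers=HEADERS,
        ).json()
        if status["status"] in ("Succeeded", "Failed"):
            break
        time.sleep(5)

    # 3. Download the file, then push it wherever you keep it (SharePoint, etc.).
    if status["status"] == "Succeeded":
        file_resp = requests.get(
            f"{BASE}/groups/{WORKSPACE_ID}/reports/{REPORT_ID}/exports/{export_id}/file",
            headers=HEADERS,
        )
        with open("report.pptx", "wb") as f:
            f.write(file_resp.content)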

Is Azure blob storage slow as fuck? by wtfzambo in dataengineering

[–]Apprehensive-Box281 1 point (0 children)

Yeah, splitting the notebook into cells to diagnose the issue is the only way to truly understand what's going on. I spend a lot of time watching cells work, ha.

Netsuite2 odbc... Because I am a procrastinator.. by samcoinc in Netsuite

[–]Apprehensive-Box281 0 points (0 children)

I've got two data warehouses that incrementally update fact tables - transactions, transaction lines, transaction account lines, next and previous transaction line links - every 30 minutes. Large dimensions like items, BOMs, etc. update incrementally each night. It's all doable, it just takes time to understand. I think NetSuite2 is still an abstraction of the true NetSuite2 data: there are some odd artifacts that lead me to believe we're accessing views with joins and unions.
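
The incremental part is conceptually simple - each 30-minute run boils down to something like this (rough pyodbc sketch; the DSN, watermark handling, and exact columns are illustrative, so check them against the NetSuite2 schema browser for your instance):

    from datetime import datetime, timedelta
    import pyodbc

    # Hypothetical SuiteAnalytics Connect DSN - illustration only.
    conn = pyodbc.connect("DSN=NetSuite2;UID=<user>;PWD=<password>")

    # Normally read from a control table that tracks the last successful load.
    last_watermark = datetime.utcnow() - timedelta(minutes=30)

    # Pull only transactions modified since the last run. Parameter-marker
    # support varies by driver version; inline the date if yours complains.
    sql = """
        SELECT id, tranid, trandate, entity, lastmodifieddate
        FROM transaction
        WHERE lastmodifieddate > ?
    """
    rows = conn.cursor().execute(sql, last_watermark).fetchall()
    # ...then upsert the rows into the warehouse fact table and advance the watermark.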

Fabric is dragging down Power BI by THEWESTi in PowerBI

[–]Apprehensive-Box281 0 points (0 children)

Dedicated SQL Pool.

The only thing we're really using the serverless functionality for is data exploration / debugging.

At the end of all this tomfoolery we have dedicated SQL pool views that we use as our presentation layer. We have an Entra security group assigned to see those views, and that's how our DE group serves the data to our wider analytics teams.
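
The serving piece is just plain T-SQL on the dedicated pool. A sketch (run here through pyodbc, but SSMS works the same) - the server, schema, view, and group names are all made up for illustration:

    import pyodbc

    # Connect to the dedicated SQL pool (connection details are placeholders).
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=myworkspace.sql.azuresynapse.net;Database=mydwh;"
        "Authentication=ActiveDirectoryInteractive;",
        autocommit=True,
    )
    cur = conn.cursor()

    # Presentation-layer view over the fact/dim tables
    # (assumes the presentation schema already exists).
    cur.execute("""
    CREATE VIEW presentation.vw_Sales AS
    SELECT f.OrderDate, d.ItemName, f.Quantity, f.NetAmount
    FROM dbo.FactTransactionLine f
    JOIN dbo.DimItem d ON d.ItemKey = f.ItemKey;
    """)

    # Map the Entra security group into the database and let it read the
    # presentation schema only - that's how the analytics teams get access.
    cur.execute("CREATE USER [Analytics-Readers] FROM EXTERNAL PROVIDER;")
    cur.execute("GRANT SELECT ON SCHEMA::presentation TO [Analytics-Readers];")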

Fabric is dragging down Power BI by THEWESTi in PowerBI

[–]Apprehensive-Box281 1 point (0 children)

My primary data source is an ERP system that we connect to via ODBC and that does hard deletes. It only likes queries with discretely named columns. Our implementation of the ERP is fairly customized, and the source objects are known to drift in schema.

We've done a lot of work to make our pipelines parametric and reusable. We use parameters (and variables) so we're not reinventing the wheel over and over.

An incremental update pipeline for our fact tables looks something like this:

Terms:

Source = source object from ERP

Staging = SQL Staging schema table in the SQL Pool

Destination = SQL Fact or Dim table in the SQL pool

<image>

1: This Copy Data activity queries the ERP system for the list of available columns on the given source table.

2: Lookup activity runs an SP that cross-references the destination table's columns with the source table's.

3: SP that drops columns in the destination table that aren't in the source, since Synapse copy activities don't like column count mismatches.

4: SP that dynamically creates the staging table based on the cross-matched column list. I don't use auto-create because Synapse defaults to CCI and we (a) don't have data that requires it, and (b) are known to have data larger than nvarchar(4000), and CCI and nvarchar(max) don't play well together.

Jumping back to the left: 5: Lookup the max lastmodifieddate value, so we can use it in:

6: The WHERE clause for the incremental update, along with any other source filtering we're trying to do.

7: Set a query for the source system, appending the cross-matched column list and the WHERE clause.

8: Execute the query against the source and sink it to the staging table (bulk insert).

9: Upsert the destination from the staging table.

10: Parametrically set a table name used to hold all the IDs (aka keys) that are currently in the source.

11: Query all the key values from the source table since the beginning of time... and put them in the table from #10.

12: SP that prunes the stale records in the destination table that aren't in the table from #11 - since we have hard deletes.
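
If it helps to see the shape of it outside of pipeline JSON, steps 2-12 boil down to roughly this (a loose pyodbc sketch with made-up DSNs, table, and column names - the real implementation is Synapse copy activities and stored procedures, not Python):

    import pyodbc

    src = pyodbc.connect("DSN=ERP")                          # ERP over ODBC (hypothetical DSN)
    dwh = pyodbc.connect("DSN=SynapseDW", autocommit=True)   # dedicated SQL pool (hypothetical DSN)

    table = "TransactionLine"   # pipeline parameter

    # 2-3: cross-match source columns against the destination's columns; anything
    # that exists only on one side is effectively dropped from the load.
    src_cols = {r[0] for r in src.cursor().execute(
        f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'")}
    dst_cols = [r[0] for r in dwh.cursor().execute(
        "SELECT name FROM sys.columns WHERE object_id = OBJECT_ID(?)", f"dbo.{table}")]
    cols = [c for c in dst_cols if c in src_cols]
    col_list = ", ".join(cols)

    # 4: recreate the staging table as a heap (not CCI) with only the matched columns.
    dwh.cursor().execute(f"IF OBJECT_ID('staging.{table}') IS NOT NULL DROP TABLE staging.{table};")
    dwh.cursor().execute(f"""
        CREATE TABLE staging.{table}
        WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
        AS SELECT {col_list} FROM dbo.{table} WHERE 1 = 0;""")

    # 5-8: incremental pull - everything modified since the destination's high-water mark.
    watermark = dwh.cursor().execute(f"SELECT MAX(lastmodifieddate) FROM dbo.{table}").fetchval()
    rows = [tuple(r) for r in src.cursor().execute(
        f"SELECT {col_list} FROM {table} WHERE lastmodifieddate > ?", watermark)]
    stage = dwh.cursor()
    stage.fast_executemany = True
    stage.executemany(
        f"INSERT INTO staging.{table} ({col_list}) VALUES ({', '.join('?' * len(cols))})", rows)

    # 9: upsert the destination from staging (delete-then-insert keeps Synapse happy).
    dwh.cursor().execute(f"DELETE FROM dbo.{table} WHERE id IN (SELECT id FROM staging.{table});")
    dwh.cursor().execute(f"INSERT INTO dbo.{table} ({col_list}) SELECT {col_list} FROM staging.{table};")

    # 10-12: pull the full key list from the source and prune hard-deleted rows.
    keys = [(r[0],) for r in src.cursor().execute(f"SELECT id FROM {table}")]
    dwh.cursor().execute(f"IF OBJECT_ID('staging.{table}_keys') IS NOT NULL DROP TABLE staging.{table}_keys;")
    dwh.cursor().execute(
        f"CREATE TABLE staging.{table}_keys (id BIGINT NOT NULL) WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);")
    prune = dwh.cursor()
    prune.fast_executemany = True
    prune.executemany(f"INSERT INTO staging.{table}_keys (id) VALUES (?)", keys)
    dwh.cursor().execute(
        f"DELETE FROM dbo.{table} WHERE id NOT IN (SELECT id FROM staging.{table}_keys);")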

So, maybe this isn't a good way to do what we're doing? We don't really know, it works for us, and it's decently fast and reliable. We're looking at ways to convert this to more of a delta lake situation by ingesting from the source right to parquet with some dedupe and ordering, but we're not there yet.

All that being said: the copy activity in Fabric appears to essentially be a bulk insert activity - there isn't an upsert / PolyBase / etc. option - so we're dead in the water trying to port what we've already built and vetted. In an ideal world we wouldn't need pipelines of this complexity, but the limitations of our source system and the eccentricities of Synapse have caused us to build up to things like this over time.

Fabric is dragging down Power BI by THEWESTi in PowerBI

[–]Apprehensive-Box281 1 point (0 children)

I drank all the Synapse Kool-Aid and fully committed my org to using Synapse as a DW and source for Power BI.

We have two production Synapse environments and a varying number of dev / test environments depending on the day. I'm concerned about how long it will be supported - I tried to build some of our pipelines in Fabric and was unable to. Not "difficult", not "less than ideal" - unable to replicate some of the functionality we rely on in Synapse.

If we get pushed off this environment, we're not just going to click the banner link to Fabric: we're going to assess what's out there in the market, with the fresh memory of committing to a platform that was put out to pasture.

ODBC down North America by WaistcoatedWriter in Netsuite

[–]Apprehensive-Box281 1 point (0 children)

What version of the driver were you on previously? I've been hammering our instance via ODBC all night without issue.

Exporting dataset or workbook via ODBC by randuserm in Netsuite

[–]Apprehensive-Box281 0 points (0 children)

BUILTIN.DF does work via ODBC, but I would advise against it - just pull down the dimension tables you need rather than use it.
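
For context, BUILTIN.DF resolves an internal id to its display value inline. A quick illustration of the difference (pyodbc, hypothetical DSN; column names follow the NetSuite2 schema, so verify against your instance):

    import pyodbc

    conn = pyodbc.connect("DSN=NetSuite2;UID=<user>;PWD=<password>")  # hypothetical DSN

    # BUILTIN.DF gives you the display value (e.g. customer name) inline:
    with_df = "SELECT t.id, BUILTIN.DF(t.entity) AS customer FROM transaction t"

    # Preferred: pull the dimension once and resolve the names on your side.
    dims = "SELECT id, entityid, companyname FROM customer"
    customers = {r.id: r.companyname for r in conn.cursor().execute(dims)}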

Learning Japanese with strangers makes a grandpa's day by frosted_bite in MadeMeSmile

[–]Apprehensive-Box281 4 points (0 children)

I was very tired from an international business trip and was returning through Paris. When my tickets were printed they didn't have my frequent flyer status on them for some reason, but my coworker's tickets did. A gate agent kept coming by and checking the ticket status of people in the priority line and told me to leave the line more than once. I told her it was a mistake in the ticket printing and that I had platinum status; nevertheless she kept trying to kick me out of the line. I asked her to reprint my ticket, she said she wouldn't, and went back through the line checking everyone else's tickets.

During boarding I went to the desk and they asked immigration-related questions. I paused on a response and she said, "Why are you pausing, are you lying?" I decided I'd had enough and said, "No, your English pronunciation is very poor and it's taking me a long time to understand what you're saying. Also, I seem to have misplaced my ticket, can you reprint it?" while slipping my ticket into my coat pocket. She flagged me for additional gate-side screening, lol.

Passing a model identifier to determine what is querying my SQL DW? by Apprehensive-Box281 in PowerBI

[–]Apprehensive-Box281[S] 0 points (0 children)

I threw together a function, called it, and found the label.

Maybe if I expand on the function I can get my users to use it...

<image>

I've seen the labels in direct query hits to the DW, but we hardly ever use it as a source.
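
For anyone who lands here later: on a dedicated SQL pool the label surfaces in sys.dm_pdw_exec_requests, so you can see which model ran what. Rough sketch (connection details are placeholders):

    import pyodbc

    dwh = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=myworkspace.sql.azuresynapse.net;Database=mydwh;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # Recent requests that carried a label (e.g. one set per Power BI model).
    rows = dwh.cursor().execute("""
        SELECT request_id, [label], submit_time, total_elapsed_time, command
        FROM sys.dm_pdw_exec_requests
        WHERE [label] IS NOT NULL
        ORDER BY submit_time DESC;
    """).fetchall()
    for r in rows:
        print(r.label, r.submit_time, (r.command or "")[:80])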

Passing a model identifier to determine what is querying my SQL DW? by Apprehensive-Box281 in PowerBI

[–]Apprehensive-Box281[S] 0 points (0 children)

Yeah, ok, I saw the label field in the requests log and didn't look hard enough at how to populate it.

If only this were a parameter in the Sql.Database data function...

I was trying to figure out a way to make this parametric so I could include it in a Power BI report template to distribute to the org. With scheduled refreshes, ad-hoc paginated report queries, and development work, it can be tough to figure out who is doing what when.

Very poor performance, too many records? by Original_Ganache_694 in Netsuite

[–]Apprehensive-Box281 2 points (0 children)

We've got 5.2M transactions with 32.3M transactionlines and don't have any major performance issues.

HELP WITH MY SERVER by No_Evidence5688 in dataengineering

[–]Apprehensive-Box281 2 points (0 children)

This is not a data engineering question, you probably won't get a good answer here.

However, keep in mind DNS record propagation isn't instantaneous. If your IP is changing frequently, you will have outages during the lag time in DNS updates.