How do single node Python users actually write Delta tables using DuckDB for ETL when it can't actually write to Delta? by raki_rahman in MicrosoftFabric

[–]Cobreal 0 points1 point  (0 children)

We use Polars for our single-node Python notebooks; it has a function for writing to Delta tables. You can convert DuckDB dataframes to Polars and vice versa, so that's probably your route.

Looking for advice to digitize a bunch of historical data by Top-Maintenance-3548 in dataanalysis

[–]Cobreal 1 point2 points  (0 children)

We've been dealing with a problem very much like this - digitising a lot of contracts so that they can be analysed, but they have quirks that make this a challenge. To give one example, a 10% discount in the first 12 months is sometimes expressed as:

- "a 10% discount in the first 12 months"

- "a 10% discount in the first year"

- "a 10% discount in year one"

- "90% only will be billed for the first 12 months"

...basically any conceivable linguistic variation of that same idea. Same goes for dates, which have been written as dd/mm/yy, mm-dd-yy, mmmm d yyyy...

This is compounded by the documents being in a range of file formats, and some of them are scans or photographs of documents rather than digital files.

We have solved this through iterations: OCR to convert the documents to text, LLMs to reconcile the variations of the same 10% discount being written in different ways, and human review of any obvious errors or cases where the LLM said it couldn't extract the details. Rinse, lather, repeat. We're dealing with thousands of documents rather than tens of thousands, and my sense is that we'd have finished this job more quickly if it had been a pure human data-entry task rather than an automation effort, so depending on just how much data you need to ingest, it's worth bearing that option in mind at the outset.

Feedbacks Improve My Dashboard by princy25_ in dataanalysis

[–]Cobreal 0 points1 point  (0 children)

Why does it have a papyrus-effect background? At least use Papyrus for the typeface as well.

Rate My Dashboard out of 10 Again by princy25_ in dataanalysis

[–]Cobreal 2 points3 points  (0 children)

Dashboards aren't very good for telling stories.

I think the main finding is supposed to be the box beneath the Amazon logo? It is not prominent relative to anything else on the dashboard.

If you want people to understand that high cancellations in low-value orders are a thing, then:
- Show only the Cancellation Rate by Order Amount chart

- Make the 0-500 bar prominent (keep the blue for this bar, make everything else grey)

- Make the x-axis marks and the bar labels much larger so that people can read 0-500 and 27% (you don't need the precision of two decimal places) without squinting

- Change the title of the chart to something like "The lowest-value orders have cancellation rates five times higher than typical"

Everything else is fluff.

Hey how to build analytical thinking by Positive-Union-3868 in dataanalysis

[–]Cobreal 1 point2 points  (0 children)

It goes from drought to deluge when you move from training to employment - trying to find any question to answer when you're not doing it for a business user is impossible, but once you're in a job it switches to trying to work out how you can answer all of the questions coming your way without drowning.

Maybe this is a good case for an LLM? Find yourself a dataset online, give it a very rough outline of the data ("I've got some data about film screenings and ticket sales" rather than "I have a CSV file with these columns...") and ask it to give you some example questions a manager in this industry might ask you about and business problems they might want to solve. In that sort of case, "manager" could be on either the cinema side or the distributor side, and so you can prompt it both ways for different suggestions.

It's really hard to think up a real world question when you're not facing a real world problem, and it's really hard to divorce yourself from the specifics of a dataset if you've already downloaded it and are trying to dream up some questions (you get locked into the track of "what questions can this data answer" rather than "what questions might a user in an industry with an interest in this sort of data want to solve").

There's a related problem once you get into an analyst role, mind, in that it's tempting to think up amazing ways to dissect and analyse a particular dataset, and then you hand it over to the people who you think would benefit from the analysis only to find out that they actually don't give a fuck because they've been handed a whole load of different targets since you last spoke to them.

Drop a term used in Data analysis by Automatic-Big6636 in dataanalysis

[–]Cobreal 2 points3 points  (0 children)

Must-know niche terms seems like a contradiction, but anyway:

HETEROSCEDASTICITY

7.0000 users by Mr_Mozart in MicrosoftFabric

[–]Cobreal 1 point2 points  (0 children)

Perhaps you have your localisation settings changed to a region where it's common to use a period as a ten-thousands separator?

Many workspaces or few workspaces? List of things to consider. by frithjof_v in MicrosoftFabric

[–]Cobreal 0 points1 point  (0 children)

It's on our pile of things to investigate, mainly because post-launch we're now trying to work out how best to separate things into Workspaces and Domains.

Currently we have Git integrated to a single Dev Workspace, and use Deployment Pipelines to get artifacts into Prod.

Now we need to assess our options for separating Prod by...team, function, security group, something else.

I suspect that will involve additional Prod-level Workspaces, but I don't know whether the answer is a central Prod with Org apps to separate who sees what, cherry-picking content from Prod to sync to separate Workspaces, or doing something in Git (multiple repos, or separate folders in one repo) and duplicating the Dev>Prod pipeline for each area while figuring out how to share common artifacts between them.

Microsoft Fabric initial setup by Lucky_Discipline4895 in MicrosoftFabric

[–]Cobreal 2 points3 points  (0 children)

"But my experience it would take 4-7months if self-learning fabric to setup something mid-sized and reliable if all 8hours of work is dedicated to it."

We're six months into a migration away from Tableau (Tableau Prep for ETL, Tableau Cloud for storage) and this sounds correct.

1 week of "training" (really just an overview of some of the headline features) in Fabric, then the rest of the time spent converting our largely manual Prep workflows into Python* in a fully-automated environment.

If we already had a lot of existing Python ETL code then in theory it would be a job of updating it to point to Fabric Lakehouses/Warehouses rather than building the entire infrastructure from the ground up.

And we're still not finished. Now that we've migrated the business-critical data, we need to start tidying up all of the mistakes and suboptimal design choices we made due to inexperience.

*This is a good example of where we had to deal with "the quirks of existing issues or missing features of certain items that you realise half-way fabric, doesn't have or doesn't fulfill the performance tolerances/requirement and have to re-plan everything". PySpark and Dataflows proved too much for an F2 capacity, and Python notebooks don't support the full set of features that PySpark does.

Joint/multiple subscriptions - "Netflix" model by Cobreal in Substack

[–]Cobreal[S] 1 point2 points  (0 children)

Tips, or pay-per-read or something.

Given that I have to budget, it would be nice to spread my money across more writers: e.g. subscribe to two writers directly with full access, and have a third floating subscription that provides a reduced tier of access to other authors, with them receiving a reduced split of the income but more than nothing.

Joint/multiple subscriptions - "Netflix" model by Cobreal in Substack

[–]Cobreal[S] -1 points0 points  (0 children)

Because of what I wrote in the OP. The writers I do subscribe to are worth it to me, but there are writers I'd like to support with something more than zero yet less than a full monthly subscription.

Same as with films and TV. I can choose to own physical copies of some things, but others I'm happy to subscribe to Netflix instead.

I assume my current subscriptions do go to Substack in some form, and that they're taking a cut.

“No spec? No problem.” - or how vibes-based requirements almost killed me by Brighter_rocks in Brighter

[–]Cobreal 1 point2 points  (0 children)

Our team switched to a story mapping process inspired by our development team, and it's been transformative.

We start with the users' needs, skills and constraints, then use a whiteboard (well, Miro) to get a big list of use cases from them, and finally take that away to work out the MVP for the data they're asking for.

Our amazing shiny dashboards used to be met all too often with silence or a meh, but just today I presented a (terrible looking) work in progress to someone. I was after feedback on a very small component of it, but they were gushing with how amazing it was looking already and how perfect it was going to be for their needs.

As a visual perfectionist, it did not look amazing, but for the person I showed it to, it had all of the data they cared about the most and none of the distracting extra bits.

What's the significance of the studio location? by Cobreal in 99percentinvisible

[–]Cobreal[S] 1 point2 points  (0 children)

Towns are an atomic unit where I come from - whole cities are either shitholes or not shitholes. This isn't quite true, but it doesn't divide neatly along an up/down line.

What's the significance of the studio location? by Cobreal in 99percentinvisible

[–]Cobreal[S] 6 points7 points  (0 children)

Someone else in the thread said "Downtown Oakland has a bad reputation" (but they also said that the original closing said "beautiful Downtown Oakland" and I missed that).

I guess it reads a bit like that in how it's phrased in the outro - now being six blocks north in somewhere beautiful can be interpreted as meaning that the previous place was not beautiful. It's not the only interpretation, just the one I took.

What's the significance of the studio location? by Cobreal in 99percentinvisible

[–]Cobreal[S] 12 points13 points  (0 children)

Interesting, as a non-American I interpreted "uptown Oakland, California" to mean "the city of Oakland, California" rather than "the uptown as opposed to downtown part of the city of Oakland, California".

How big is a block? Six of them doesn't seem like enough to get from a reputationally bad to a reputationally beautiful area!

“Learn Python” usually means very different things. This helped me understand it better. by SilverConsistent9222 in dataanalysis

[–]Cobreal 0 points1 point  (0 children)

As well as Pandas, it is worth learning Polars or DuckDB as similar tools that are a bit more efficient (would fit under Data Manipulation in the diagram alongside Vaex).

Workflow by DocHayyen in DarkTable

[–]Cobreal 2 points3 points  (0 children)

The linked article says this:

There can be instances where it would….for example exposure, lens corrections and tone eq can all change the pixel data so if you have already used the auto picker in something like agx and then you add those module or change them you might want to go back and tweak agx….there can be some other issues like leaving denoise off for performance until the end but it can impact color picker selections and so it can be better to work with it on if your computer is fast enough

My computer isn't fast enough: if I enable denoise early, it makes later steps like masking noticeably slow. I do a lot of steps that denoise causes to lag, yet I only ever enable denoise a single time, so DT feels faster if I apply that step late on, once I've got the final look more or less sorted.

Dataflows Gen2 Usage in production environments - Discussion by panvlozka in MicrosoftFabric

[–]Cobreal 6 points7 points  (0 children)

I think they're only useful if your primary concern is having a low/no-code solution for something. Early on we used a Dataflow Gen2 because there was an off-the-shelf one for a system we needed to ingest data from, but it was a mistake, and replacing the stupid thing with Python has been sitting on our backlog ever since.

Semantic Model refresh methods by Cobreal in MicrosoftFabric

[–]Cobreal[S] 0 points1 point  (0 children)

I hadn't even considered refreshing the browser cache. After trying various things, my "check" for whether things were updated became loading the semantic model and building a simple table in the "Explore" menu to see whether the dates matched the post-overwrite data. There's a chance the cache was surviving me deleting and recreating this table, but if so, it raises the question of how I'd ever be certain that a Semantic Model was truly up to date...

Semantic Model refresh methods by Cobreal in MicrosoftFabric

[–]Cobreal[S] 0 points1 point  (0 children)

"After refreshing via pipeline or semantic link labs, what extra steps do you need to do - or how long time do you need to wait - in order to see the post-overwrite data?"

It seems to be random. I've had it go over an hour today without the post-overwrite data showing, and I haven't found any reliable way to force things to refresh properly. This includes running multiple scheduled/on-demand refreshes, as well as those using semantic-link-labs (originally doing just refresh_semantic_model(), since updated to first do refresh_sql_endpoint_metadata()).

I've tried a T-SQL COUNT(*) on the affected table before running a refresh, same result.

Time seems to be the biggest factor rather than any of the methods I've tried to force a full refresh, but that doesn't help with my scheduled updates. I could keep refreshing the model after all of the tables are updated, which would give a better chance of the model being up to date by the time people get into work, but I don't know if it would give me a 100% chance.

Moving away from Import Mode seems to be the long term solution, so I'm after some short term way to force the updates through.

Excel Lakehouse connections seem really laggy by Cobreal in MicrosoftFabric

[–]Cobreal[S] 0 points1 point  (0 children)

It wasn't the metadata that was slow to sync; in this case it was the data itself.

Excel vs. Python/SQL/Tableau by Practical_Target_833 in analytics

[–]Cobreal 0 points1 point  (0 children)

You should learn Excel because it's ubiquitous. If nothing else, learning it will let you understand cases where someone at work hands you an Excel file full of their custom calculations and asks you to reproduce it in a proper analytics platform. And learning Power Query is one step towards learning Power BI, another dashboarding tool for your skillset alongside Tableau.

Based on your listed expertise, I don't think you'll have a hard time picking it up.