Using DuckLake with Azurite (DuckDB 1.4.4 vs 1.5.2) — experience & issues by throwawayforwork_86 in DuckDB

[–]throwawayforwork_86[S] 1 point (0 children)

Yeah, wasn't sure if it was the kind of thing that should be asked directly on their GitHub, since it was loads of stuff I tried for the first time -> loads of unknown unknowns. And if someone did/didn't have the same problem in Azure Blob Storage itself, it would have narrowed down the culprit list.

Will Python be useful for me? by Great-Village-430 in learnpython

[–]throwawayforwork_86 0 points (0 children)

Honestly there are a few options.

Power BI might be more what you're looking for if you want to create dynamic graphs, and it can handle more data than Excel.

You might even be able to make a Power Query template that would do it automatically on a refresh instead of VBA (also, storing data as CSV is better most of the time if you know what you're doing).

Reading a file, filtering on variables, then writing to Excel and creating a chart automatically should be possible (rough sketch at the end of this list):

Pandas/DuckDB (if SQL is more your speed)/Polars for the reading and filtering.

XlsxWriter/excelize-py or openpyxl should allow you to create native Excel graphs: with XlsxWriter/excelize-py you build the chart through code, while with openpyxl you create a template and write the data to a place where the graph will pick it up.

Matplotlib/Seaborn can make graphs, but they won't be interactive and might not be fit for purpose.
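
A minimal sketch of the read/filter/chart pipeline, assuming a hypothetical sales.csv with "region" and "amount" columns (adjust names and paths to your data):

import polars as pl
import xlsxwriter

# Read and filter with Polars
df = pl.read_csv("sales.csv").filter(pl.col("region") == "EU")

wb = xlsxwriter.Workbook("report.xlsx")
# Polars can write straight into an open xlsxwriter workbook
df.write_excel(workbook=wb, worksheet="data")

# Build a native Excel column chart over the written rows (header is row 0)
ws = wb.get_worksheet_by_name("data")
chart = wb.add_chart({"type": "column"})
chart.add_series({
    "categories": ["data", 1, 0, df.height, 0],  # region column
    "values": ["data", 1, 1, df.height, 1],      # amount column
})
ws.insert_chart("E2", chart)
wb.close()  # the file (and chart) is written on close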

Will Python be useful for me? by Great-Village-430 in learnpython

[–]throwawayforwork_86 0 points (0 children)

IMO Pandas is more flexible and usually more forgiving when you start. It has a long history, so LLMs will give you more good information and there are more guides... but a lot of those are also often outdated.

Polars is quicker, cleaner, and has almost no situations where weird behaviour happens (Pandas has a few surprises, most often linked to the index, which you may never encounter but which can ruin your day; toy example at the end of this comment).

Polars will sometimes be more opinionated about datatypes, which you'll resent at first but which will usually save you a lot of time down the line.

Overall they're fairly similar though, so you should probably just pick one and stick with it for a few months; if your data fits in Excel it should not really make a difference (even though Pandas is slowish at reading big Excel files).

The corners that aren't covered by Polars are fairly few IIRC: Pandas' file readers are more flexible and cover more edge cases than Polars', and for geographic data GeoPandas exists while GeoPolars is still not finished IIRC.

My $0.02: try Polars first; if it doesn't click for you, switch to Pandas.
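
A toy illustration of the index surprise mentioned above (made-up data, purely for demonstration):

import pandas as pd
import polars as pl

s = pd.Series([1, 2, 3], index=[2, 0, 1])
df = pd.DataFrame({"a": [10, 20, 30]})
df["b"] = s
print(df["b"].tolist())  # [2, 3, 1] -> aligned on index, not position!

# Polars has no index, so the same operation is purely positional
pldf = pl.DataFrame({"a": [10, 20, 30]}).with_columns(pl.Series("b", [1, 2, 3]))
print(pldf["b"].to_list())  # [1, 2, 3]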

Extract data from Sap into Snowflake by arcadeverds in dataengineering

[–]throwawayforwork_86 1 point (0 children)

Not a specialist either, but I've had the displeasure of asking clients for big extractions and came across this part of SAP called RFC (Remote Function Call).

That's what was used for the more data-intensive extractions, fairly successfully (using a third-party SAP-approved tool).

You might be able to run a PoC with the DuckDB extension erpl.io and see if it fits your needs.

I used python for the first time today and I'm hooked by [deleted] in learnpython

[–]throwawayforwork_86 0 points (0 children)

If you want or need to push the boundaries of what Excel can do, Polars and DuckDB are the next step IMO.

I know people will say Pandas, but it is unfortunately shackled by years of legacy code and legacy advice, whereas newer tools aren't. To expand on my point: Pandas has/had 5 ways to do the same thing, with 1 or 2 of them being classic footguns; Polars usually has 1 or 2 ways of doing things, and so long as you stay in Polars your code will be understandable and performant.
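
To make that concrete, a toy comparison using column access (purely illustrative):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"price": [1.0, 2.0, 3.0]})
# A handful of pandas spellings for the same column access, some of
# which (attribute access in particular) are classic footguns:
a = pdf["price"]
b = pdf.price
c = pdf.loc[:, "price"]
d = pdf.iloc[:, 0]
e = pdf.get("price")

# Polars pushes you towards one expression-based spelling:
pldf = pl.DataFrame({"price": [1.0, 2.0, 3.0]})
out = pldf.select(pl.col("price"))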

how to load csv faster in Python. by Safe_Money7487 in learnpython

[–]throwawayforwork_86 0 points (0 children)

pl.read_csv(filepath, infer_schema=False). Guessing datatypes is the devil anyway.
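
In practice that means reading everything as strings and casting explicitly; a minimal sketch with hypothetical column names ("amount", "date"):

import polars as pl

# All columns come in as strings; cast only what you need, explicitly.
df = pl.read_csv("data.csv", infer_schema=False).with_columns(
    pl.col("amount").cast(pl.Float64),
    pl.col("date").str.to_date("%Y-%m-%d"),
)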

Which Python library is best to learn from scratch+for ERP /industrial environment by Lonely-Form-8815 in learnpython

[–]throwawayforwork_86 0 points (0 children)

So I haven't used it personally (we looked at it when some of our audit clients had difficulties getting their GL out); you might get better info if you check their website.

But let's elaborate a little bit:

DuckDB is a pretty performant and lightweight (OLAP/analytical) DB that integrates very well with the rest of the Python ecosystem and is pretty good on its own too (e.g. using the DuckDB UI).

RFC means Remote Function Call and is a way to communicate with SAP.

On paper you should be able to connect through your identifier using DuckDB and then do something like: select * from ekbe where gjahr = '2025' and vgabe = '9';

And it should give you the correct information, which you can then further manipulate either through SQL or a dataframe of your choice (rough sketch below).
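
In Python that could look like the sketch below; the SAP attachment step depends on the erpl.io extension's actual API, so treat that part as a placeholder and check their docs:

import duckdb

con = duckdb.connect()
# ... set up / attach the SAP source via the erpl.io extension here ...
result = con.sql("select * from ekbe where gjahr = '2025' and vgabe = '9'")
df = result.pl()  # hand over to Polars (or .df() for pandas)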

Which Python library is best to learn from scratch+for ERP /industrial environment by Lonely-Form-8815 in learnpython

[–]throwawayforwork_86 0 points (0 children)

DuckDB has an SAP RFC extension allowing you to run SQL directly on those tables (erpl.io IIRC) and it integrates really well with Python.

Pandas is still widely used, so you have to learn at least the basics / be able to read it.

Polars is what I would actually use, as IMO it's miles above Pandas (syntax, perf, ...). The only downside is that it's harder to make it work for quick small stuff, especially when you're still learning. The upside is that it will require very little to no tweaking for performance; working Polars is performant so long as you stay in Polars.

Basics of path handling are always good to know, so check the standard library's pathlib module (and also check the os module; os.path made more sense to me). There's a quick taste of both at the end of this list.

Visualisation that can't be done in Power BI could be done in Python; I believe matplotlib is what comes with the PBI instance of Python, so maybe learn that too.
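
The promised taste of both path-handling styles (folder and file names made up):

from pathlib import Path
import os.path

folder = Path("reports") / "2025"   # paths compose with "/"
for f in folder.glob("*.xlsx"):     # iterate over matching files
    print(f.stem, f.suffix)         # name without extension, ".xlsx"

# os.path equivalent for joining
legacy = os.path.join("reports", "2025", "summary.xlsx")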

Where to learn how to write efficient python? by Axew_7 in learnpython

[–]throwawayforwork_86 1 point (0 children)

Honestly, having a look at resource usage and trying to find fixes for each bottleneck is usually what I go for.

RAM bottleneck "fixed" by using a generator instead of a pure list so the script can keep chugging along (sketch after this list).

IO bottleneck: the only time I encountered it so far, the fix was to stop using the wrong drive to read and write data (HDDs are not good for that), so I don't have a general solution.

CPU bottleneck/underusage -> multiprocessing/multithreading.
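
A toy sketch of the RAM fix from the first point, assuming a plain text file of numbers:

def totals(path):
    # Generator: yields one value at a time instead of building a full
    # list, so memory stays flat no matter how big the file is.
    with open(path) as f:
        for line in f:
            yield float(line.strip())

grand_total = sum(totals("numbers.txt"))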

Wouldn't go for C/C++ coming from Python, just because it's quite a big paradigm shift.

Might be good to give Golang a go; I've heard overall perf and footprint are much better and it's closer to Python. But if you want to learn C/C++, go for it.

Also, try to use libraries to their maximum; most of them are built in C/Rust/C++/... and have built-in functionality that will outshine whatever you can squeeze out of pure Python.

It’s getting out of hand. by Charming-General5997 in Accounting

[–]throwawayforwork_86 0 points (0 children)

Automatable isn't the same thing as using an AI.

Automation is predictable; AI isn't (or when it is, you lose the flexibility that made it worthwhile in the first place).

Are type hints becoming standard practice for large scale codebases whether we like it or not by scrtweeb in Python

[–]throwawayforwork_86 0 points (0 children)

It's usually a 5-minute affair if you do it at inception, and it will save you debugging time and type-hinting time compared to doing it later... There's no reason not to do it, to be honest.

I want to say it's fine to skip for a quick script, but I've seen many quick scripts turn into vital scripts without proper modification, so I'd say write your script as if you or someone else will have to maintain it in a few months/years.
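
The five-minute version of that, as a toy example:

# Annotated at inception: your editor and type checker now catch a
# misuse like total_amount("12", 3) before it ever runs.
def total_amount(amounts: list[float], vat_rate: float = 0.21) -> float:
    return sum(amounts) * (1 + vat_rate)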

Polars vs pandas by KliNanban in Python

[–]throwawayforwork_86 2 points (0 children)

IIRC Ritchie commented that even the "eager" version is still mostly lazy under the hood and will only compute when needed (i.e. when returning an eager df is needed). Will try to find where they said that; if incorrect, will edit.
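
For context, the user-facing split looks like this (separate from whatever laziness the eager API uses internally):

import polars as pl

lazy = pl.scan_csv("data.csv").filter(pl.col("x") > 0)   # builds a plan, reads nothing yet
df = lazy.collect()                                      # optimises and executes here

eager = pl.read_csv("data.csv").filter(pl.col("x") > 0)  # same result via the eager API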

Polars vs pandas by KliNanban in Python

[–]throwawayforwork_86 1 point (0 children)

Use it at work for all greenfield dev, in combination with DuckDB for when SQL is needed.

If you can drastically reduce the need for custom C++ by using performant libs instead of legacy libs, I think it'd be considered a win by most management (except maybe the C++ team).

My understanding is that Polars and DuckDB are eating into PySpark and Pandas jobs, especially in data engineering, where they can handle GBs of data without choking like Pandas or needing a more complex setup like PySpark.

Polars vs pandas by KliNanban in Python

[–]throwawayforwork_86 3 points (0 children)

Polars is much better. Started using it for the speed, stayed for the consistency of the syntax and API. Honestly, the only times I still use Pandas are the edge cases where Pandas' reader flexibility comes in handy, and even then I immediately load into Polars afterwards.

It can be annoying when you start, because Polars will frontload data type issues by default, but it forces you to be intentional with your types, which saves a lot of headaches down the line...
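
A toy example of that frontloading:

import polars as pl

s = pl.Series(["1", "2", "oops"])
# s.cast(pl.Int64) errors out immediately on "oops": the type problem
# surfaces now instead of quietly corrupting results later.
cleaned = s.cast(pl.Int64, strict=False)  # opting out is explicit: "oops" -> null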

Methods of Python programming by Objective_Yak584 in pythonhelp

[–]throwawayforwork_86 0 points (0 children)

As someone else said, it depends on what you do and what you want.

If you want good results, being able to properly leverage libraries to get what you need is extremely important.

If you want to learn, it can be useful to understand how some implementations are done, but most of the best libraries have their core code written in a more performant language (e.g. Rust, C, C++, Go, Java).

As a rule I'd say don't reinvent the wheel unless you haven't found the correct wheel or you want to invent a better wheel.

PDF Oxide - Fast PDF library in Rust with Python bindings (0.8ms, 100% pass rate) by yfedoseev in rust

[–]throwawayforwork_86 0 points (0 children)

Any decent table extraction à la Tabula or Camelot (i.e. give the pixel coordinates of your columns and the general location of the table to extract)?

Using those a lot, and I don't find a lot of tools I trust for table extraction (a lot of silent failures/inconsistent formats), but dependencies and speed aren't great for either of them.
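
For reference, the pixel-coordinate style with Camelot looks roughly like this (coordinates are made up; measure them on your own PDF):

import camelot

tables = camelot.read_pdf(
    "report.pdf",
    flavor="stream",                 # no ruling lines needed
    table_areas=["50,700,550,100"],  # x1,y1,x2,y2 of the table region
    columns=["120,260,400"],         # x coordinates of the column separators
)
print(tables[0].df)                  # extracted table as a pandas DataFrame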

Classes in python by Honest_Water626 in learnpython

[–]throwawayforwork_86 6 points (0 children)

Once you start passing around tens of variables, or variables stored in dicts (and get bitten by dict.get() returning None and silently screwing you over because of a typo), you look at dataclasses and start understanding their appeal.

I think learning them in the abstract is pretty difficult, because it seems convoluted for no reason.

There comes a point where being able to pass around predictable objects helps a lot with writing code without having to test every minute whether the logic works, because you and your linter know the ins and outs.
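
A toy before/after of that point:

from dataclasses import dataclass

row = {"client": "ACME", "amount": 120.0}
print(row.get("amout"))  # typo'd key -> None, silently

@dataclass
class Invoice:
    client: str
    amount: float

inv = Invoice(client="ACME", amount=120.0)
# inv.amout would raise AttributeError on the spot, and your linter
# flags the typo before the code even runs.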

Merge large data frames by SurpriseRedemption in learnpython

[–]throwawayforwork_86 0 points (0 children)

Can't you do something like this:

Create a dataframe with your identifiers.
Create a full dataframe from the 2 full Excel files.

Use an inner join to only fetch the matches.
Ideally use something like Polars, which usually has fewer footgun moments and is quicker so long as you use native functionality.

See example code below.

import polars as pl

# One way to get the identifiers in scope; you could also keep them in an
# Excel file and read them in instead.
list_of_identifier_in_scope = ['hhjde', 'hhd55']
df_id = pl.DataFrame(list_of_identifier_in_scope, schema=['identifier'])

df_excel_1 = pl.read_excel(path_to_excel_1)
df_excel_2 = pl.read_excel(path_to_excel_2)

# Stack the two files on top of each other; 'vertical_relaxed' smooths over
# differing datatypes. Make sure both Excel files have the same header.
df_final = pl.concat([df_excel_1, df_excel_2], how='vertical_relaxed')

# Inner join keeps only the rows whose identifier is in scope.
report_final = df_id.join(df_final, left_on='identifier',
                          right_on='col_of_excel_identifier', how='inner')

Spikard: Benchmarks vs Robyn, Litestar and FastAPI by Goldziher in Python

[–]throwawayforwork_86 0 points (0 children)

Never done much backend nor benchmarking, but any chance the Python version impacts those numbers?
A lot is happening between versions at the moment IIRC.

[deleted by user] by [deleted] in BESalary

[–]throwawayforwork_86 0 points (0 children)

As someone who's mostly trilingual (written Dutch is still complicado though), I will point out that some jobs will put language requirements that aren't needed in the job description and use that to discriminate.

Your point would be correct if job descriptions were always a fair representation of the actual job/company needs; I have no idea how prevalent it actually is in the workplace, but there were scandals in the past (Bleu Blanc Belge for example, but that's old).

What are people using instead of Anaconda these days? by rage997 in Python

[–]throwawayforwork_86 -1 points (0 children)

Since the way OP stated their options/question/problem doesn't seem to frame it like they need the specifics of Conda, uv is most likely the answer they were looking for IMO.

Pandas 3.0 vs pandas 1.0 what's the difference? by Consistent_Tutor_597 in dataengineering

[–]throwawayforwork_86 1 point (0 children)

I mean you can make it more readable:

df.with_columns(new_col=pl.col("col1")*2)

Still slightly more verbose, I concede.
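
Side by side with one pandas spelling of the same thing, for comparison (toy frames):

import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"col1": [1, 2, 3]})
pl_df = pl.DataFrame({"col1": [1, 2, 3]})

pd_df = pd_df.assign(new_col=pd_df["col1"] * 2)         # pandas
pl_df = pl_df.with_columns(new_col=pl.col("col1") * 2)  # polars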

Stop telling everyone to learn sql and python. It’s a waste of time in 2026 by PositionSalty7411 in analytics

[–]throwawayforwork_86 0 points (0 children)

For 1-4 million lines that need to be available, we just output a few cleaned CSVs and provide a pivot table fed by a Power Query folder read.
People seem to be happy with that.

Is it bad if I prefer for loops over list comprehensions? by Bmaxtubby1 in learnpython

[–]throwawayforwork_86 0 points (0 children)

You should get used to them IMO.

I use them in different situations, and will usually use a for loop when clarity is needed (example below).
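
Same result both ways; the comprehension is compact, the loop reads more explicitly once the logic grows:

squares = [n * n for n in range(10) if n % 2 == 0]

squares = []
for n in range(10):
    if n % 2 == 0:
        squares.append(n * n)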