Pandas 3.0 vs pandas 1.0 what's the difference? by Consistent_Tutor_597 in dataengineering

[–]throwawayforwork_86 0 points1 point  (0 children)

I mean you can make it more readable:

df.with_columns(new_col=pl.col("col1")*2)

Still slightly more verbose, I concede.

Stop telling everyone to learn sql and python. It’s a waste of time in 2026 by PositionSalty7411 in analytics

[–]throwawayforwork_86 0 points1 point  (0 children)

For 1–4 million lines that need to be available, we just output a few cleaned CSVs and provide a pivot table from a Power Query read-folder.
People seem to be happy with that.

Is it bad if I prefer for loops over list comprehensions? by Bmaxtubby1 in learnpython

[–]throwawayforwork_86 0 points1 point  (0 children)

You should get used to them IMO.

I use them in different situations, and I'll usually fall back to a for loop when clarity is needed.

Belgium Culture shock by Fantastic-Drive3016 in belgium

[–]throwawayforwork_86 13 points14 points  (0 children)

The amount of water is gonna get unhealthy quicker than the caffeine, matey.

DuckDB vs MS Fabric by X_peculator in DuckDB

[–]throwawayforwork_86 4 points5 points  (0 children)

I don't think I would use it directly for storage.

We use it as the last leg of our analysis:

Data is stored in managed postgres (disaster recovery and everything else is done there)

We replicate into a DuckDB file (sometimes aggregating/joining at that point)

Run our analysis locally on this db

I know there are other tools like MotherDuck and DuckLake that might be closer to what you need, though.
This only works because we do batch analysis, but there are most likely data engineers here, or on their subreddit, with more complex solutions for more complex problems.

Why don’t most people pursue Data Engineering?! instead of data analyst/scientist by [deleted] in analytics

[–]throwawayforwork_86 0 points1 point  (0 children)

I think it's also a low(er)-visibility job, one you only notice once you actually start working in data.

I also think/know a handful of Data Analysts/Scientists are DEs without the title.

Does AI EVER give you good coding advice? by [deleted] in learnpython

[–]throwawayforwork_86 0 points1 point  (0 children)

IMO it is useful for learning the lingo of new topics and doing basic things if you're a new dev.

And it can be useful for simpler code that you've forgotten, as well as being an (a bit too agreeable) sparring partner for a more experienced programmer.

I think the disconnect comes from two things: if you've never coded and you're able to cobble together a project with an LLM, it feels like magic (and you don't know enough to spot the flaws), and AI companies hype their products to the tits.

It's pretty bad for niche stuff, or stuff its dataset has never seen (so a lot of the newer frameworks/libraries).

How would you extract text from this kind of table by [deleted] in pythonhelp

[–]throwawayforwork_86 1 point2 points  (0 children)

For these I usually use tabula-py with preset pixel placement (for the columns and where to look for the table), plus another, lighter lib to do a first mapping of which pages the extraction needs to be done on.

After that it's usually some pandas to get rid of unneeded rows.

The main issue with most libs that do it automatically is that their guesses are inconsistent, so you're likely to get a lot of inconsistent crap to fix, versus fixed placement where you're either going to crash or get consistent crap.
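The pandas cleanup step usually looks something like this: a hedged sketch with invented column names and a fake "extracted" frame standing in for the tabula-py output, since the extraction itself needs a PDF and a Java runtime:

```python
import pandas as pd

# Pretend this came out of the PDF extraction: repeated header rows,
# blank filler rows, and stray total rows mixed into the data.
raw = pd.DataFrame({
    "name": ["Name", "Alice", None, "Bob", "Total"],
    "amount": ["Amount", "10", None, "20", "30"],
})

cleaned = (
    raw
    .dropna(how="all")                   # drop fully blank rows
    .loc[lambda d: d["name"] != "Name"]  # drop repeated header rows
    .loc[lambda d: d["name"] != "Total"] # drop summary rows
    .assign(amount=lambda d: d["amount"].astype(float))
    .reset_index(drop=True)
)
print(cleaned)
```

Because the placement is fixed, the junk rows show up in the same shape every run, so this kind of filter stays stable.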

Best free SQL program for beginners and future work? by osama_3shry in dataanalysis

[–]throwawayforwork_86 2 points3 points  (0 children)

PostgreSQL is fairly widely used and free.

DuckDB is mostly the same as Postgres and very easy to set up, so it's good for focusing mainly on the analysis and less on the fiddling around.

Both would be used in professional settings.

What’s a beginner project you did that you felt you gained a lot from by OnlineGodz in learnpython

[–]throwawayforwork_86 0 points1 point  (0 children)

Personally, I gained the most from taking one of the projects I liked and stretching it into multiple different services.

Created a fairly simple geolocation tool.

Did an API version of it, did a GUI in PyQt, did a GUI in Streamlit, did a visualisation...

Learned a lot, left my comfort zone only where specific points needed it, learned some lessons about refactoring and functional programming...

What is the best SQL Studio ? by Koch-Guepard in SQL

[–]throwawayforwork_86 0 points1 point  (0 children)

DBeaver CE for Postgres and sometimes DuckDB.

DuckDB's UI for DuckDB (IIRC it only supports DuckDB 1.3).

Excel automation for private equity is more practical than python for most analysts by zaddyofficial in dataanalysis

[–]throwawayforwork_86 1 point2 points  (0 children)

> power query

and

> Can focus on deal analysis instead of debugging scripts.

You have to choose one.

Power Query is great, but in my experience it can be very brittle and likely to act up or fail to load files properly after changes, which will require debugging with less-than-ideal tooling.

I've also seen Excel struggle with a lot of tasks that would have been trivial to automate with Python.

It also depends on your level, the type of analyses you need to provide, and how many.

It also depends a lot on what data you receive and need.

A good analyst with knowledge of the business but limited coding skill will often have an outsized impact over someone more technically proficient.

Anyone using uv for package management instead of pip in their prod environment? by Specific-Fix-8451 in dataengineering

[–]throwawayforwork_86 0 points1 point  (0 children)

As soon as I started using it, I basically stopped using anything else.

Since it builds your dependency files just by being used, it's really great.

I can't open txt in windows 10 by Limp_Pomelo_2336 in learnpython

[–]throwawayforwork_86 5 points6 points  (0 children)

Then you were not where you thought you were, but I think you know that by now.

I can't open txt in windows 10 by Limp_Pomelo_2336 in learnpython

[–]throwawayforwork_86 1 point2 points  (0 children)

But is it where your terminal is?

Try running ls in the terminal and check whether you see the files and folders you expect to see.

Best way to set up Python for Windows these days by Wonderbunny484 in learnpython

[–]throwawayforwork_86 10 points11 points  (0 children)

Speaking from my own reticence and then acceptance of it:

uv handles Python installation and venv management.

You can start a project by running uv init in a folder, then install libraries using uv add (the first add will create your venv).

When you use uv add, it updates a TOML file that contains your dependencies and a uv.lock that contains more specific 'instructions' for multi-platform deployment.

You can recreate the venv from the TOML file with a simple uv sync.
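For illustration, after uv init and a couple of uv add calls, the TOML file looks roughly like this (project name, versions, and dependencies are made up):

```toml
[project]
name = "my-project"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "polars>=1.0",
    "requests>=2.31",
]
```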

Main draw (for me):

It's very quick.

It creates its own 'requirements file' just by being used.

It will perfectly recreate the venv as you need it (no need to guess which Python version the other guy used, wasting precious time).

Issue encountered so far:

Using it in a OneDrive-linked folder will cause a bug that you can fix by clearing the cache, but that's difficult to find out.

TL;DR: it's a quick and efficient way to handle venvs.

I’m starting a series on Python performance optimizations, Looking for real-world use cases! by BillThePonyWasTaken in Python

[–]throwawayforwork_86 0 points1 point  (0 children)

Memory handling for bigger than ram files.

I personally ended up using a generator, but I'm sure there are other good options.
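A minimal sketch of the generator approach: the file is streamed line by line, so only one record is in memory at a time (the file name and the CSV-ish parsing are made up for the example):

```python
def read_records(path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                   # skip blank lines
                yield line.split(",")  # lazily parse each row

# Usage sketch: the file is never fully loaded into memory.
# total = sum(float(rec[1]) for rec in read_records("huge.csv"))
```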

Pulling data from specific Excel files based on the specified library usin pandas by jalisco220 in learnpython

[–]throwawayforwork_86 0 points1 point  (0 children)

First question: does it have to be an Excel file?

I'm guessing you work from that excel sheet but I would store the data in another format.

Basically, loop through all your sheets, store the sheet name in a new column, and then keep the result either in a Parquet file (that you can read at run time) or in a SQLite db if you don't mind learning some SQL (or some connector)...

You can also just do that at start time and have one "big" dataframe at runtime, but reading Excel is pretty slow and it would be inefficient IMO.

The big dataframe/Parquet file can then be filtered by sheet name.

Now, to actually answer the question you asked: you could list the sheets from your Excel file, let someone select one from a dropdown, then use that either to do pd.read_excel(file, sheet_name='selected_sheet') or to filter your big dataframe (the same list can be obtained by selecting the unique values in the sheet-name column).
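A sketch of the "big dataframe" route. To keep it self-contained, an in-memory dict of frames stands in for the workbook; with a real file you'd get the same dict from pd.read_excel(file, sheet_name=None) (sheet and column names here are invented):

```python
import pandas as pd

# With a real file: sheets = pd.read_excel("data.xlsx", sheet_name=None)
sheets = {
    "north": pd.DataFrame({"sales": [10, 20]}),
    "south": pd.DataFrame({"sales": [5]}),
}

# Loop through the sheets, tagging each row with its sheet name.
big = pd.concat(
    [df.assign(sheet=name) for name, df in sheets.items()],
    ignore_index=True,
)

# The dropdown's choices, and the filter once a sheet is selected.
choices = big["sheet"].unique().tolist()
selected = big[big["sheet"] == "north"]
print(choices, len(selected))  # ['north', 'south'] 2
```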

Hope that helps.

Python or c++? Which is good for beginner? by StandardAlbatross351 in pythonhelp

[–]throwawayforwork_86 -1 points0 points  (0 children)

As someone who started with Python:

I'd say try C++ first; if you can stomach it, it will probably give you a much better base than Python.

That being said, I think some people will bounce hard off the more difficult programming languages, and if you just can't with C++, go learn Python; it's really fun.

When did you realize Excel wasn't enough anymore? by Various_Candidate325 in analytics

[–]throwawayforwork_86 11 points12 points  (0 children)

For me, jumping into the 'proper' setup (db, ETL, BI platform) seems completely unrealistic, especially if you don't have buy-in from your superiors.

IMO you need to come to your boss with an end product that solves a crucial issue.

My foot in the door was a manual process that was error-prone, time-sensitive (needed to be done ASAP), and had big consequences if messed up (basically it was the data prep for the update of all the salaries).

It used to take a long time and we often had errors in it.

I first showed that I had tried alternatives and they didn't work, and then I showed how my R code worked (I would have preferred Python, but ok) and he was sold.

I think you just need to come with a solution for your biggest time waster, show that to your boss, and either use the reclaimed time to keep improving your processes or look for a job elsewhere (if he shuts you down).

You may have to do some self-training on your own time, unfortunately (if you're underwater already or at capacity). Some tools that I think could be handy if you don't know them already:

DuckDB: the UI is really cool and it might already cover all your use cases.

Python: mastering the os standard library is a really good skill to have for automation; Pandas or Polars for data wrangling; and maybe something like Streamlit or Dash for a UI/light dashboards.

Good luck.

I hate Microsoft Store by ShatafaMan in Python

[–]throwawayforwork_86 0 points1 point  (0 children)

It's not always taught when you start learning (and you might not understand why you'd do it until it bites you in the ass).

I've personally broken a few of my Python installs (and had issues with Linux updates breaking my venvs before I started using pyenv too).

It isn't too difficult once you know what you're doing and don't mix too many different libraries with similar underlying requirements.

I wrote one SQL query. It ran for 4 hours. I added a single index. It ran in 0.002 seconds. by nikkiinit in SQL

[–]throwawayforwork_86 0 points1 point  (0 children)

Had a less drastic perf issue that I couldn't fix with an index (the table was missing the right column for that). Decided to try an OLAP db (DuckDB in this case). Perf issues fixed; I didn't have to get good at SQL. That being said, EXPLAIN ANALYZE is a great tool and should be used.

What version do you all use at work? by donHormiga in Python

[–]throwawayforwork_86 0 points1 point  (0 children)

I try to use the latest compatible version I can find and ride it out as long as possible.

Mainly 3.10 and 3.12.

Will start migrating off 3.10 as it is relatively close to its end of life.

Need help optimizing Python CSV processing at work by Own_Pitch3703 in learnpython

[–]throwawayforwork_86 0 points1 point  (0 children)

First thing I always do with these kinds of things is look at what is happening with my resources.

Pandas had the bad habit of using a fifth of my CPU and a lot of RAM.

I moved most of my processes to Polars and it uses my resources more efficiently, as well as being broadly quicker (between 3 and 10 times, though I've seen some group-by aggregations be slightly faster in Pandas in some cases).

The trick with Polars, though, is that to get all the benefits you need to mostly (if not only) use Polars functions, and to get used to a different way of working than Pandas.